Our governments know a lot about us. They hold data on how much we earn, how much tax we pay, our health records, business earnings, even whether we have a fishing license.

As we saw during the worst days of the COVID-19 pandemic, data on the spread of the virus and the pace of the vaccine rollout was vital for keeping us safe and holding our governments to account. Government data is also essential for informed public policy debates, and it’s invaluable for researchers and others who advocate for better public policy.

But a lot of government data in Australia is locked up behind closed doors. And when governments do make data available, it is often published in ways that are difficult to understand and unwieldy for researchers to use.

In this special Grattan Podcast, Grattan data specialist Tyler Reysenbach is joined by Adam Sparks, a Senior Research Scientist with Western Australia’s Department of Primary Industries and Regional Development, and Matt Cowgill, Senior Economist at SEEK, to talk about how our governments could get better at data, and how things could be improved in ways that would improve public policy and, ultimately, the lives of all Australians.

Transcript

Tyler Reysenbach: Our governments know a lot about us. They hold data on how much we earn, how much tax we pay, our health records, business earnings, and even whether we have a fishing license. As we saw during the worst days of the COVID-19 pandemic, data on the spread of the virus and the pace of the vaccine rollout was vital for keeping us safe and holding our governments to account.

Government data is also essential for informed public policy debates. and it’s invaluable for researchers and others who advocate for better public policy. But a lot of government data in Australia is locked up behind closed doors. And when governments do make data available, it is often published in ways that are difficult to understand and unwieldy for researchers to use.

My name is Tyler Reysenbach and I’m a researcher and data analyst here at the Grattan Institute. And today I’m joined on the Grattan Podcast by two of the best in the business, Adam Sparks and Matt Cowgill. Adam is a Senior Research Scientist with Western Australia’s Department of Primary Industries and Regional Development.

Welcome to you, Adam. Thank you. And Matt is a Senior Economist at SEEK and a Grattan alumnus. Hello, Matt. Hello. In this podcast, I’m going to be talking to Adam and Matt about why our governments are no good at data, and how things could be improved in ways that would improve public policy and, ultimately, the lives of all Australians.

So let’s start at the beginning. Why does data availability and use matter? Why is it important for public policy? Adam?

Adam Sparks: Oh, look, as researchers in government, we’re tasked with conducting research that is going to benefit the citizens of Western Australia. And having access to data, not only within our own organization but from other government entities, is absolutely critical for us to be able to understand what’s going on, craft recommendations, and conduct our research.

Tyler Reysenbach: Yeah, absolutely. I think, like, knowing whether a policy works or doesn’t work, you can only answer that question with data, and you need that data to be available and easy to use. Otherwise, you just don’t look at it, right?

Adam Sparks: Yeah, I mean, for the most part our research tends to be focused. My group does agricultural research, so we deal a lot with weather data and land-use change and things like that. These things are often out there, but can be difficult to find for various reasons, and difficult to actually get hold of once you find them.

Which makes our jobs infinitely more difficult.

Tyler Reysenbach: And Matt, I mean, SEEK’s a private organization. Why does government making data available to them matter?

Matt Cowgill: Ideally, businesses are making all sorts of decisions about all sorts of things all the time, and you want those decisions to be well informed. A business might see a drop in revenue in one particular segment of the market, and not know whether that’s specific to their business or something more general. Having high-quality, high-frequency economic and social statistics really helps businesses look through the fog of what they themselves are experiencing to get the bigger picture of what’s going on, as well as, as you say, the general research imperatives Adam alluded to.

I certainly know from my time at Grattan that we couldn’t do what we do at Grattan without high-quality public data.

Tyler Reysenbach: Yeah, no, absolutely. It’s integral to Grattan’s value proposition and how we deliver better policy, right? So what would we say the current state of play of data availability and use is?

Matt Cowgill: I’d start with the positives, which is that I think in many ways things have improved. In recent years, maybe in the past decade in fact, the ABS and a range of federal government agencies have made a real effort to collect, join, and release a range of administrative data. By administrative data, I mean data that’s collected in the administration of government programs like social security and tax and so on. There’s a range of really high-quality data, high-quality insights there, and it was probably being underused before. So joining those things up and making them available, at least to some researchers, has been of huge benefit to the kind of research that you’re able to do in Australia with Australian data.

So that’s at the kind of the good end. But I’d say public data release from sort of public sector agencies generally in Australia is mixed. There are some that I would say are absolutely world’s best practice. And things like the administrative data initiatives I mentioned before are at that end of the spectrum.

And there are others where, frankly, the data is released on an inconsistent schedule, in formats that change over time, with seemingly minor but actually really important problems that make the data really difficult to work with. Like if the names of columns change over time, or the way things are measured and described changes over time in ways that aren’t always obvious to data users.

There are absolutely examples of that, and areas that could be improved.

Tyler Reysenbach: Adam, from your perspective, what is best practice in this area? When you see some data out there in the world and you look at it, what makes you go, ‘yeah, this is really good data-release practice’?

Adam Sparks: The principles of FAIR data: Findable, Accessible, Interoperable, and Reusable.

You know, my research team does a lot of modeling work, and so we want something that is machine readable off the bat. Matt already alluded to that when he talked about getting different formats, or column names changing; if these things aren’t documented well, it becomes very difficult to just go and quickly use the data.

So if you find the data, you then spend a lot of time cleaning it up so you can actually use it to do the research that you need to do.

Tyler Reysenbach: Yeah. And so, I mean, can you run through quickly those four principles you just mentioned, and define them for our audience, who, you know, might not be data wonks like we are?

Adam Sparks: Yeah, sure. So the principles of FAIR data are that it has to be Findable. That means it’s well documented, so that if people are looking for it, they can find it. It needs to be Accessible. This doesn’t mean it has to be open; it just means that the right people, the ones who need to access it or have permission to access it, can access it without jumping through all sorts of hoops. Interoperable is the machine-readable part, where you don’t have to worry about ‘it’s a Python script and I use R’. For the data itself, it means it’s not an Excel spreadsheet with colors and merged cells and those sorts of things; it’s something you can just point your programming language at and say, ‘here, read this in’. And then there’s the Reusable part.

That’s exactly what we’re often using data for. Oftentimes data was collected for one purpose. From my experience: say there was an experiment done on a particular wheat disease, and data was collected about how the disease progresses over time. And then we want to make a model.

That project was maybe looking at how fungicides can control the disease, but we’re doing a project trying to model, more broadly, how farmers make decisions about whether to spray a fungicide to control the disease. We can take that data, if it’s reusable, and use it for our purposes in building our model.

So it then serves two purposes. The original one was met, where the researcher got the information about whether the fungicide works, or whether the timing was acceptable. And then we can use it in developing our model as well. That’s all the Reusable part is: just being able to reuse the data more than one time.
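The ‘point your programming language at it’ ideal Adam describes can be sketched in a few lines of Python. The station names and rainfall figures below are invented purely for illustration; the point is that a plain, documented CSV reads straight in, with no merged cells or colour-coding to untangle.

```python
import csv
import io

# A machine-readable release: plain CSV with one documented header row,
# no merged cells, colours, or footnote rows mixed into the data.
# Station names and values here are made up, purely for illustration.
raw = """station,date,rainfall_mm
Merredin,2023-06-01,4.2
Merredin,2023-06-02,0.0
Northam,2023-06-01,6.8
"""

# Any language can "point at" this and read it straight in.
rows = list(csv.DictReader(io.StringIO(raw)))
total = sum(float(r["rainfall_mm"]) for r in rows)
print(len(rows), round(total, 1))  # prints: 3 11.0
```

The same three lines of parsing work unchanged in R, Julia, or anything else, which is exactly what interoperability buys you.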

Tyler Reysenbach: Yeah, and I think for me, what makes data reusable or not is often the context in which it’s produced. So, making sure that that context never is lost from the data, so that you understand what are the limitations of the data, who is in the data, who isn’t in the data, all that kind of stuff needs to be alongside the data.

A number doesn’t mean anything unless you have the…

Adam Sparks: Metadata to go with the data, yes. Absolutely, and that’s part of the FAIR principles: the data needs to be adequately described.

Tyler Reysenbach: Yeah, exactly. And so, I mean, what holds us back? Like, what holds government back from releasing more data? This sounds pretty good to me.

I mean, when I see really good practice, I love using the data, and I do really good work with it. So what’s stopping us from having more, and better, data?

Matt Cowgill: Yeah, so I’d break this question into two parts. One would be what sort of data gets released either publicly or to sort of authorized users.

And the second is how it gets released, and whether it meets all these criteria we’ve been talking about, like machine readability and so on. On the first, what gets released: I think there is still some degree of risk aversion in government, which means that data that would probably be low risk, and could be released at least to authorized researchers if not the general public, tends to still get kept in house.

That has changed over time. As I mentioned earlier, we are making more and more use of these administrative data sets, but I think there’s more we could do. Sometimes agencies can be sitting on data that they don’t realize the value of. I think that is still the case. An example of a really valuable public sector data set that comes to mind for me is something called the workplace agreements data set that is held by the federal government.

Now, this is information that is just collected in the process of administering the industrial relations system. So employees and employers make collective agreements. Those are lodged with the Fair Work Commission. They go into this database. That’s the purpose of the database is just keeping track of these agreements.

But there’s a range of really valuable information in there about the types of wages and conditions that people are agreeing to over time. We made great use of it at Grattan Institute. Not that many researchers had actually worked with the microdata before we got to it. And that, I think, is an example of really valuable administrative data that had probably been underused to that point.

So that’s on the what-gets-released front. The second is the format, if you like, in which data gets released. I think in some cases it can be a case of organizations and agencies just not speaking enough to the users, or potential users, of their data to find out what would best meet their needs, and maybe making an assumption that an elaborately formatted Excel spreadsheet would be exactly what a data user wants when, as Adam has mentioned, data users, particularly more sophisticated data users, often just want the simplest thing possible. So I think it’s maybe inadequate consultation in some cases.

Tyler Reysenbach: Yeah, and I mean, to feel for them a little bit here, there’s a diverse group of users, right? A minister’s office wants something that’s really quick and easy to digest. A journo wants something that will play well on TV.

A researcher wants as much data as possible, as granular and easy to use as possible. Someone in a strategic policy unit might need something that’s a bit more complicated, but not too complicated, because they’re not doing complex analysis. And so it is hard to know what exactly they should be releasing. But I do think part of the problem here as well is that there’s just a general lack of data capability within the public service.

And so they don’t even know what is good and what isn’t, and what’s useful and what’s not. So I think lifting that as well would be really helpful.

Matt Cowgill: I was just going to say, as well, on this point about data capability in the public service: there are some hugely capable people across the state and federal public sectors, and Adam’s a sterling example of that.

I do think in some cases part of the problem might be that, say at the federal level, people who work in, say, the Treasury or Prime Minister and Cabinet have access to these really rich, high-quality, joined-up administrative data sets, which means that maybe they’re not making so much use of the more public-facing material that external, non-government researchers can use, whether that’s the general public or registered researchers.

And so it means that the data-releasing organizations aren’t necessarily getting feedback from someone in Treasury or the like, because the Treasury people are just diving into the microdata, the really detailed person-level or business-level data. And that’s great. So, yeah, maybe they’re not hearing enough from the external users of data in some cases.

Tyler Reysenbach: Yeah, I think that’s actually a really, really good point. I know for me, I work on what’s called the Multi-Agency Data Integration Project, which is a great piece of admin data that I get access to through the ABS. A lot of the time, when I talk to Treasury about some of the struggles I face, they’re like, ‘I don’t understand what that’s like’, because they just use the Home Affairs data directly.

They don’t have to use MADIP at all. And so, yeah, it’s clear that a huge swathe of really good data users aren’t using the same data sources as everyone else.

Matt Cowgill: Yep. I think as well, a key point is, as you mentioned before, Tyler, there are multiple audiences for any piece of data, or at least multiple potential audiences, right?

Like, at one end of the spectrum you have academic researchers; at the other end of the spectrum you maybe have journalists and the like who, as you say, are looking for relatively pre-packaged insights that they can report on. There’s a lot of us in the middle as well, who don’t necessarily need access to the full underlying raw microdata, but want something more than a pre-packaged chart on a webpage or something like that.

And I think it’s really important for data-releasing organizations to have that in mind. And I am a bit concerned that we might end up with a bit of a missing middle in Australian data, where organizations cater relatively well for those two ends of the spectrum and the stuff in the middle falls by the wayside.

Tyler Reysenbach: Yeah, no, I think that’s absolutely right. I think the current fad in the data world is dashboards, which are very easy for non-data users to use. If you are a data user and you want the underlying data in the dashboard, though, it’s impossible to get out, and so it’s basically useless to you. At the end of the day, you need someone who’s going to communicate what’s in the dashboard anyway.

So have you really achieved anything by making an elaborate, beautiful dashboard app thing that people get to play with? Like, yeah, it does seem like maybe we need to think about who we’re catering for a little bit with our data. Adam, do you have anything you want to add about what holds us back in making data more available and accessible?

Adam Sparks: I think one of the things is just time. As public servants, we’re often flat strapped just getting our jobs done. And the last thing you think about is ‘I need to get this data out and make it so it’s available to other people’, unless it’s actually mandated and part of the job. And that is something that Matt said, you know: over the past decade, he’s seen changes.

And I mean, I’ve been in Australia seven years, and I’ve seen changes in those seven years coming from the funding agencies that fund a lot of the research that I do and have done, both at university and in the public sector with state government. It’s becoming much more commonplace for the agreements to be written so that this data will be available in a certain format, and, as I said, it doesn’t mean it’s open, but you have to know where to find it. It has to be not just sitting on someone’s hard drive at the end of the project sort of thing. So we’re slowly seeing those changes, and there is resistance, because it’s a new thing, and that’s part of the problem: this is new.

People struggle with it because it’s not something they’ve been taught, I think. I mean, yes, we’re data wonks, we understand all of this and the importance of it, but honestly, most of these people are just trying to get their job done, and this is not on their radar as being something that’s important. So I’m not saying that’s bad.

I’m just saying that’s reality. You know, I understand where they’re coming from. There are bits of my job that I don’t view as important that I’m sure somebody else in the organization is going, ‘why isn’t this guy doing this? This is so important.’ So, you know, having that perspective, I think, is important when you look at this and go, ‘why aren’t we doing a better job of this?’

It’s just the reality of it. It’s changing, but slowly.

Matt Cowgill: I think as well, to come back to a point we made initially is that the availability and distribution of high quality data, the benefits of that are not always immediately obvious, right? Like the benefits are diffuse and can occur over time, and they can be subtle, like you improve public policy at the margin, you improve private sector decision making relative to what it would otherwise be.

Those effects and the benefits from more and better data aren’t always obvious, and can’t always be factored into some cost-benefit analysis ahead of the fact. So if you’re a senior public servant who’s making tough decisions about resource allocation, putting a team of people on making data available, or making it more easily available externally, just might not rise to the top of your list.

Tyler Reysenbach: And I think when we tighten that loop between releasing the data and the benefit, that’s when you actually see the most progress happen, right? Like, take COVID, for example: the ABS suddenly started releasing monthly statistics on inflation and the labor market. And they got a lot of positive reinforcement, because it was really valuable for decision makers to know what’s going on, and it’s continued post-COVID.

Similarly, when you look at the health department, they initially didn’t release very much in a machine-readable format. And then after a while they did: they started releasing things in spreadsheets. So it’s about tightening that link between the benefit and the kind of cost and effort to make it usable.

Matt Cowgill: Yeah, that’s right. And at the more academic end of the spectrum, I think of some of the evaluations I’ve seen of COVID-era economic policies like JobKeeper, really high-quality evaluations of the causal impact of public policies, that would not have been possible even five years ago with the data that was available then.

And, you know, those sorts of examples of being able to evaluate, within a pretty short space of time, the impact of really important, really expensive public policies like that: it pays off whatever we invested in making those data sets available, if it can deliver better policy down the track.

Tyler Reysenbach: Yeah, no, absolutely. The benefits are enormous. One thing that we haven’t mentioned, though, and something I’d be interested in your perspectives on: I know a couple of years ago there was the 10 percent Medicare sample that got released, and some white-hat hackers were able to re-identify people through it.

And that was a huge scandal, and I think it really set the cat amongst the pigeons around releasing more data. Do we think privacy settings matter? How much do they matter? How do we get the balance right? How do we protect people’s privacy while still getting benefit and value from the data?

Because these are all really important things when you think about how much data government has about people. Maybe Adam, do you want to jump in on this one?

Adam Sparks: Sure. No, it is hugely important, as you alluded to. I mean, I get to work with actual on-farm data, so I get to handle some farm records, and that’s not the type of stuff that any farmer just wants released.

And as a part of that, I signed a data-sharing agreement: it’s not my data, I can’t share it, you’ve let me use it for this purpose. But that is a lot of the driver behind this, I think. And a lot of the resistance I mentioned earlier isn’t just that it’s new; it’s also that a lot of my colleagues do research on someone’s farm, and therefore there are bits of that information that can’t be released, because that’s very personal data.

And so, being a risk-averse institution, you tend to err totally on the side of caution and say, ‘well, we just don’t release any data at all’, because that’s much easier than trying to sift through and say, ‘well, we could release this, or we could de-identify that and make it available’, sort of thing. To be clear, I’m not upset with that.

I understand the practicalities of it, because the last thing I want to do is upset the people I’m working with and have the project fall apart because I’ve shared their data with someone I didn’t have permission to, and then they tell other people and nobody wants to work with me. So there are good reasons for having that risk aversion, especially where I sit.

And I’m sure Matt has some good examples from his area too.

Matt Cowgill: Yeah, I mean, at SEEK, obviously we have a range of really sensitive data. We know which companies are hiring, what roles they’re hiring for, what salaries they’re proposing to pay people, and how that’s changed over time. We have people’s resumes and personal information, and we treat that with the utmost concern for privacy and security.

Like, there have been some well-publicized data breaches in Australia in recent years, not just the Medicare case that Tyler mentioned, but obviously some other private sector organizations have had some high-profile breaches. Nobody wants to be the next one of those. And so there is a real tight, laser focus on avoiding that outcome, which I think is entirely appropriate for government agencies.

I think you don’t want to be at either end of the spectrum, right? You don’t want to be cavalier when it comes to people’s private information and just put everything out there; that would be a terrible outcome. You also don’t want to go to the other extreme and lock everything down and say, ‘okay, because there are privacy and security risks, we’re going to release nothing’. Obviously, both ends of the spectrum are bad, and what we want is some middle way that resolves the tension, or adequately balances these considerations.

For the most part, I think Australian government agencies are very cognizant of the privacy and security implications. Probably, for my own taste, sometimes things go a bit overboard in terms of the protections, but you understand why they’re there, for the most part. Things like, say, the HILDA data set, for example, which is run by the Melbourne Institute but funded by the federal government: that is reasonably accessible to people, but you have to sign up and be approved as a user, and sign some documents, just as Adam said, that describe what you’ll do with the data and set out a range of conditions about what you can and can’t do with the data. I think that’s entirely appropriate when you’re dealing with data that could identify individuals. But I can think of great cases like, for example, the Australian Tax Office releases a sample every year of, I think it’s 2 percent of, Australian income tax filers.

This information is de-identified, so there are no names or addresses in this data set, but obviously there’s a range of really rich information about individuals’ income and the taxes they pay. That data is hugely valuable. When I was at Grattan Institute, we wrote a paper on income tax policy using a microsimulation model.

So it’s a model that tries to estimate, for particular types of people, what tax they’d pay under different policy arrangements. That would not have been possible to do without the data that the ATO releases. And I think that’s a case where, in going to that effort of de-identifying the data, of removing fields in the data that could make it possible to identify people, and in some cases of kind of fuzzing the data, adding some random noise to make it more difficult to identify people, they’ve generated a resource that is of huge benefit to researchers, and I’d like to commend that as an approach.
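The ‘fuzzing’ Matt describes can be sketched in a few lines of Python. This is a toy illustration of the general noise-addition idea only, not the ATO’s actual confidentialisation method, and the income figures are invented.

```python
import random

# Toy sketch of "fuzzing": perturb a sensitive numeric field with
# random noise before release. Illustration of the general technique
# only; real agencies use more sophisticated confidentialisation.
def fuzz(values, scale=500.0, seed=42):
    rng = random.Random(seed)  # fixed seed so the example is reproducible
    return [round(v + rng.gauss(0, scale), 2) for v in values]

incomes = [52_000, 87_500, 140_250]  # hypothetical taxable incomes
released = fuzz(incomes)

# Individual values shift, making re-identification harder, but
# aggregates (e.g. the mean) stay close to the true figures.
print(released)
print(sum(incomes) / len(incomes), sum(released) / len(released))
```

The design trade-off is exactly the one discussed above: more noise means more privacy but less analytical value, so the scale of the perturbation has to be chosen with the data’s users in mind.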

Tyler Reysenbach: And I think the thing to keep in mind with all of this is that there’s not just one type of data release, right? It’s not either totally open or totally locked down. Like you said, there are ways to make it safer for people: whether that’s vetting researchers, making the data itself more aggregated or a bit fuzzier, making sure that only trusted institutions can use it, or only allowing use within a certain locked-down system, as is true for the Multi-Agency Data Integration Project.

All those things are really, really valuable. So I think there are different ways to kind of safeguard data. We’re almost at time. Time flies when you’re having fun data conversations. Before I sign off, do you guys have any parting thoughts or parting things you would like to say?

Adam Sparks: Look, I do want to commend my employer, DPIRD, for the fact that our weather data is actually available through an API.

All you have to do is request a key, and you can get weather data from our weather station network, which pretty well covers the south-west land division and a few places out in the rangelands, and we’re looking to expand it. As a data person, I’m thrilled that we’re able to support something like that; it makes a lot of people’s jobs much easier to have access to weather data like that.

And also a shout-out to Queensland for their Long Paddock SILO data set. Queensland’s state government maintains the SILO Long Paddock data set, which is available through an API. It’s basically BoM data that has some work done on it: they take the BoM data and do some patching and interpolation, both spatially and temporally. So you can get weather station data that has any missing values patched, through an API, but you can also get a five-kilometre gridded data set as well through their API. For my group’s work, having access to Australian weather data through APIs is absolutely essential. We can’t really do our job without it. And there are a lot of other researchers that rely on weather data for their research as well. So these are exceptional tools, and I’m happy to see state governments stepping up and doing these things.
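The machine-readable access Adam describes can be sketched as building an API request in Python. The endpoint path and parameter names below are assumptions for illustration in the style of the SILO service; check the Long Paddock documentation for the real interface and its key or registration requirements, and note the station number here is hypothetical.

```python
from urllib.parse import urlencode

# Sketch of requesting patched-point weather data from an API in the
# style of SILO / Long Paddock. Endpoint and parameter names are
# illustrative assumptions; consult the official docs before use.
BASE = "https://www.longpaddock.qld.gov.au/cgi-bin/silo/PatchedPointDataset.php"

params = {
    "station": 10111,     # hypothetical BoM station number
    "start": "20230101",  # YYYYMMDD
    "finish": "20231231",
    "format": "csv",      # machine readable: reads straight into R or Python
}
url = BASE + "?" + urlencode(params)
print(url)
# A real call would then fetch it, e.g. urllib.request.urlopen(url).read()
```

Because the response is plain CSV rather than a formatted spreadsheet or a dashboard, the same request slots directly into a modelling pipeline, which is the whole point of the FAIR interoperability principle discussed earlier.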

Matt Cowgill: Yeah, that’s great.

I think there are clearly a lot of examples of best practice when it comes to the Australian public sector, but there are also places where things fall short. For me, I’d like to just come back to this idea of consultation with users. We’ve talked about all the challenges and pitfalls for public sector organizations when it comes to releasing data.

It kind of makes me sad and frustrated when organizations jump through all the hoops that they have to jump through to release some data publicly, and then release it in a way that is not convenient for their users: it’s not machine readable, it’s not easily used. They’re kind of falling at the final hurdle there.

And I think that type of outcome really could be overcome by just, yeah, talking to the users. Almost certainly, if you’re an organization that holds and is releasing data, there is somebody within your organization who is in touch with the various stakeholders who would be using that data, or interested in using that data.

And just talk to them, ask them what they need. That would be the number one thing I would request.

Tyler Reysenbach: Yeah, absolutely. I think that’s very true. And what I would say to all those people who are trying to make data more available, and struggling, is: take heart. Know that there are end users who really value your contribution, and that you are really making a difference, because releasing a data set that leads to better policy outcomes improves the lives of Australians.

And that’s actually really, really important. So keep going, even though it’s hard, is my advice. Well, I think that’s all we have time for today. I want to thank both Matt and Adam for joining me and taking an hour out of their busy schedules to hopefully convince some people out there that data is important, data availability is important, and machine readability is important.

It’s not always the sexiest topic, but it’s really important. And I’d like to also take this chance to spruik our own work. If you want to see how we use public data, you can go to Grattan’s website at grattan.edu.au. If you really like what we do, I also encourage you to donate to Grattan, so we can continue making beautiful charts with all this beautifully available data.

But in the meantime, I’ll say goodbye, and hope you have a good afternoon.
