Govdata revisited

About four years ago, the „govdata.de“ portal was launched, and relaunched, and relaunched. Back then, there was plenty of criticism, and it took the German government plenty of time to acknowledge it and start working on long-term fixes. Valuable time was wasted on all sides arguing about the acceptance of non-openly licensed content into the portal and about the flaws of the German government's attempt to devise its own license rather than use already tested and accepted ones. While some things, including the license, have improved considerably, the current portal is still tainted by non-free content and by content that claims to be free but fails any respectable test for openly licensed content.

With a new year, it is definitely time to revisit the portal and have a look at a small selection of its content.

I will pick 10 datasets at random and will note my observations.

My journey begins at the dataset offering a catalogue of the datasets provided at govdata.de: https://www.govdata.de/web/guest/suchen/-/details/govdata-metadatenkatalog. The metadata suggest the catalogue was first published last March and was last updated in August 2016. There is no indication whether these dates refer to the description of the dataset or to the dataset itself.

As the file contains entries with "metadata_modified": "2017-01-01T00:57:34.759947", which is later than the stated update date, I assume it is the former. No file history is provided.
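The timestamp check can be sketched in a few lines of Python. This is only an illustration: I assume the catalogue is a list of records with a `metadata_modified` field, the first record is made up, and the exact cut-off day in August 2016 is my own guess, since the portal only gives the month.

```python
from datetime import datetime


def newest_modification(records):
    """Return the latest "metadata_modified" timestamp found in the records."""
    stamps = [
        datetime.fromisoformat(r["metadata_modified"])
        for r in records
        if r.get("metadata_modified")
    ]
    return max(stamps) if stamps else None


# Stand-in records; only the second timestamp is taken from the catalogue file.
records = [
    {"metadata_modified": "2016-05-12T10:00:00.000000"},  # made-up entry
    {"metadata_modified": "2017-01-01T00:57:34.759947"},  # quoted in the text
]

# Assumption: "updated in August 2016" is read as the end of that month.
stated_update = datetime(2016, 8, 31, 23, 59, 59)

newest = newest_modification(records)
print(newest > stated_update)  # True: the file is newer than its stated date
```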

Grepping the file gives me 18363 datasets with a title. I use random.org to draw 10 integers from the range 1 to 18363 and come up with the following numbers:

15445	12776	13373	18269	11336
9650	5206	17103	13234	3032
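For anyone who wants to repeat the draw locally, here is a rough sketch. It swaps random.org for Python's own pseudo-random generator, and it assumes the catalogue can be loaded as a JSON list of records with a `title` field. The tiny inline catalogue, the function names and the seed are all made up for illustration.

```python
import json
import random


def count_titled(records):
    """Count records that carry a non-empty "title" field."""
    return sum(1 for r in records if isinstance(r, dict) and r.get("title"))


def draw_sample(total, k=10, seed=None):
    """Draw k distinct integers from 1..total (inclusive)."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(1, total + 1), k))


# Tiny stand-in for the catalogue; a real file would be read with json.load().
catalogue = json.loads(
    '[{"title": "A"}, {"title": "B"}, {"name": "no-title"}, {"title": ""}]'
)
titled = count_titled(catalogue)

# 18363 is the count reported above; the seed is arbitrary.
picks = draw_sample(18363, k=10, seed=2017)
print(titled, picks)
```

The numbers listed above came from random.org, not from this script.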

These numbers correspond to the following datasets:

The first dataset (3032) is a land development plan from the geodata portal of the German state (Land) of Rhineland-Palatinate. The data is marked as CC-nc 3.0 (de), a license unsuitable for open data. The dataset itself is rather poor: it consists of a web link to a WMS service in which you can view a scanned black-and-white bitmap of the development plan, in suitable resolution, as a layer in a GIS. The actual file can be downloaded as 16 separate TIFF bitmap files. As far as I could tell, the data is not outdated and is consistent with the development plan provided by the municipality itself.

Dataset 5206 is also a development plan, this time from the Free and Hanseatic City of Hamburg. It is licensed under Datenlizenz Deutschland Namensnennung 2.0, a license that very likely meets the requirements for open data. The development plan was created in 1966; the dataset is basically a PDF containing a scan of the 1966 plan's map in mediocre quality.

The file also contains a scan of a 2012 change to the plan and a scanned written explanation, also from 1966.

Dataset 9650 is also from Hamburg, also a development plan, also a PDF. The significant difference is that it covers a different district and dates from a different year: 1957 (apparently with some changes in the 1960s and 1970s). There might be some historical value in its requirement to use heating systems whose soot emissions do not inconvenience the neighborhood, and in its references to the Reichsgaragenordnung (RGaO), the law on garages in the German Reich. A quick look at Google Maps indicates that the plan still largely matches the current building layout.

Finally, with 11336 we discover our first dataset that contains actual data in a narrower sense (it is also licensed under Datenlizenz Namensnennung 2.0). It covers the number of fines issued in 2014 under paragraph 18, sections 1 and 2 of the MiArbG, the German law of 1952 on minimum standards in the workplace. The law was repealed in the same year and replaced by the minimum wage law. In essence, the dataset is one line of content in a CSV file.

If you happen to think that the total number of cases in 2014 is 2, you are wrong. If you happen to think that the number of cases with a penalty over 300€ is 4, you are wrong, too. In fact, the value is 0, conveniently recorded as „Keine Eintragungen“ ("no entries") in cell C12. You might find this self-evident or confusing. The reason for this rather strange layout is simple: the dataset is a spin-off of the publication „GZR-Daten zur Schwarzarbeit“, whose 30 tables were distributed into 30 individual datasets. Per year.
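Anyone processing these files by machine has to translate the placeholder text back into a number. A minimal sketch, assuming the count cell holds either digits or the literal text „Keine Eintragungen“ (the function name is my own, and I have not checked the other 122 files for further variants):

```python
def parse_count(cell):
    """Turn a count cell into an int, treating "Keine Eintragungen" as zero."""
    text = cell.strip()
    if text.casefold() == "keine eintragungen":
        return 0
    # Anything else is expected to be a plain integer; int() raises otherwise.
    return int(text)


print(parse_count("Keine Eintragungen"))  # 0
print(parse_count(" 7 "))                 # 7
```

Raising an error on unexpected text is deliberate here: silently mapping unknown strings to 0 would hide exactly the kind of layout surprise described above.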

The PDF at bundesjustizamt.de looks like this:

While the publication series exists from at least 2005 to 2014, only the data from 2013 and 2014 is referenced from govdata.de, split into 123 different datasets, roughly 0.7% of the entire govdata.de catalogue.

The next three datasets also contain actual data and, for reasons that will become clear immediately, I will discuss them together.

As of today, a little under 10% of the datasets on govdata.de follow the title pattern „Messergebnisse zur Radioaktivität in [Gegenstand] [Datum]“, which translates as „Measurements of radioactivity in: [item] [date]“. They are provided by the Institute for Hygiene and Environment of the state of Hamburg and licensed under Datenlizenz Namensnennung 2.0.

Each dataset refers to a single sample that was tested for a couple of dozen radioactive isotopes. The data contains the date of the measurement, the item being tested, the method of testing, the testing unit, the origin of the sample, the isotope tested, and the result along with its unit. The Institute provides a German-language explanatory page about the scope of the testing, but I could not find out whether the 1639 datasets on govdata.de (and on transparenz.hamburg.de) represent the full data available to the Institute.
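The shares quoted in this post follow directly from the counts. The sketch below recomputes them and shows one way to count titles that follow the radioactivity pattern. The two sample titles are invented, and the regular expression is my own reading of the pattern, not something the portal publishes.

```python
import re

# Assumed reading of the title pattern described above.
TITLE_PATTERN = re.compile(r"^Messergebnisse zur Radioaktivität in\b")


def count_matching(titles):
    """Count titles that start with the radioactivity measurement pattern."""
    return sum(1 for t in titles if TITLE_PATTERN.match(t))


def share_percent(part, total):
    """Share of part in total, as a percentage rounded to one decimal."""
    return round(100 * part / total, 1)


sample_titles = [
    "Messergebnisse zur Radioaktivität in Milch 2016-01-01",  # invented title
    "Bebauungsplan Beispiel",                                 # invented title
]
print(count_matching(sample_titles))  # 1

# Counts taken from the text: 18363 titled datasets in the catalogue.
print(share_percent(1639, 18363))  # 8.9 (radioactivity measurements)
print(share_percent(123, 18363))   # 0.7 (GZR tables for 2013 and 2014)
```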

15445 is a dataset on the water supply and sewage of the state of Saxony, licensed under Datenlizenz Namensnennung 2.0. Some links in the dataset lead to error pages; the link to the XLSX file, however, works. The copyright notice in the document does not match the license given in the govdata.de description. The dataset contains information about water usage by businesses in the years 2007, 2010 and 2013. The file appears to be generated at runtime with no apparent versioning (as with all the other pages). The data appears to be consistent with the HTML table provided at statistik.sachsen.de, a feature unknown to govdata.de.

17103 is supposed to be a Datenlizenz Namensnennung 2.0 licensed dataset from the statistical offices of the federal government and the states. However, the links to both the CSV and the XLSX file resulted in an error message, as the resource providing the data was unavailable at the time.

18269 is a dataset (under Datenlizenz Zero 2.0) describing the yearly number of vanity license plates issued via the internet by the district of Kleve from 2009 to 2015.

Summary:

Govdata.de perfectly demonstrates the effects of aggregation without meaningful supervision or any desire to consolidate data. It is quite possible that the heavy fragmentation of data into many datasets is seen as a feature, or at least as welcome behavior, because it allows the portal to report a highly inflated number of datasets in lieu of high-quality data. If assistance is provided to the government entities releasing the data, it was not noticeable in the very small sample I took. While high-quality datasets could theoretically exist next to poorly created ones (such as the GZR), the poor ones could still give other institutions the wrong impression that no meaningful minimum requirements for open data exist, leading to more uncurated and less usable datasets than necessary at govdata.de. All unallocated resources should be directed at improving the data rather than the portal, as the latter can freely be substituted with aggregators on the same or other levels as well as with generic search engines.
