Source: TechSoup Blog

A Journalist's Guide to Finding the Data You Need — Part Two

Editor's Note: This is part two of our two-part series by graduate journalism teacher Amanda Hickman of Factful. She researches ways to make state-of-the-art data processing and storage tools more accessible to investigative reporters. In this second part of the series, she completes her comprehensive roundup of data repositories, research guides, and other online tools that are valuable for reporters, investigators, and now librarians. In part one of the series, she shared tips and tricks from her Where to Find Data workshops, plus a list of newsroom data warehouses and newsroom collaborations. Here is the rest of the story.

This post originally appeared in Source, an OpenNews project designed to amplify the impact of journalism by connecting a network of developers, designers, journalists, and editors to collaborate on open technologies. It was originally written for journalists, but we thought the piece so unique and useful to librarians and library workers that we're reposting it on TechSoup for Libraries. Find the original here.

Data Repositories

In addition to newsroom data warehouses and newsroom collaborations, there are some far-reaching data warehouses, repositories, and data publishing tools that are pretty remarkable, as well as a few that kind of aren't. This is an A-to-Z list.

With Aleph, OCCRP, the Sarajevo-based Organized Crime and Corruption Reporting Project, is building a unified index of data. They have tackled a few important questions, including how to manage access to data that they can't advertise beyond a trusted network of reporters. Aleph is tightly focused on public accountability data and includes quite a few sources obtained through leaks. The data is well organized and includes a lot of accountability and anticorruption data that isn't available elsewhere. Aleph is free and open-source software, so hosting your own instance is also an option.

Awesome Public Datasets is a great big list of public datasets on GitHub, organized into broad topics. Anyone can propose data for addition by submitting a pull request. The project does a good job of continuously checking for and flagging broken links, and it points to canonical sources rather than trying to aggregate and store data. Unfortunately, there's no descriptive information, so users can't skim the list and get a sense of what kind of data is available from a particular source.

Registry of Open Data on AWS is a roundup of publicly available data stored on Amazon Web Services, with great usage examples. The AWS Open Data team vets submissions, so the registry includes a range of actively maintained and clearly documented data. The collection is pretty random, however: Amazon Customer Reviews, IRS 990 forms, soil chemistry, and data from Hubble Space Telescope instruments are all there, tagged but not organized in any particular structure.

CKAN is free and open-source software for data publishers. They maintain a list of almost 200 known instances, including quite a few national and regional governments. (A sketch of querying a CKAN portal's API appears below, after the Datasette entry.)

Data Portals bills itself as a comprehensive worldwide index of data portals, which it is not. At a glance, a lot of smaller cities, like Berkeley and Oakland, California, are not listed. Anyone can propose new portals, but the list definitely isn't comprehensive yet.

Datasette is free and open-source software for publishing data alongside a clean, browsable view of that data. They don't maintain a commons, but if you're looking for a good way to publish data and make it accessible for both skimming and analysis, Datasette might be a good fit.
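If you want to try the publish-and-browse workflow Datasette supports, one minimal path is to load records into SQLite with the companion sqlite-utils library and then point Datasette at the file. The table name and records below are invented for illustration; the libraries and commands are real, but check Datasette's documentation for current options.

```python
from sqlite_utils import Database  # pip install sqlite-utils datasette

# Build (or open) a local SQLite file and load some rows into a table.
# The "inspections" table and its rows are made-up example data.
db = Database("inspections.db")
db["inspections"].insert_all(
    [
        {"facility": "Main St. Cafe", "score": 92, "date": "2019-03-01"},
        {"facility": "Lakeview Deli", "score": 88, "date": "2019-03-04"},
    ]
)

# Then serve a browsable, queryable view of the data locally:
#   datasette inspections.db
# or publish it to a hosting provider, e.g.:
#   datasette publish heroku inspections.db
```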
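Because CKAN portals share a common JSON API, they are also easy to search programmatically. Here is a minimal sketch against CKAN's public demo instance at demo.ckan.org, using the documented package_search action; swap in the base URL of whatever portal you're working with, and treat the search term as a placeholder.

```python
import requests

# Search a CKAN portal's catalog via the standard Action API.
# demo.ckan.org is CKAN's public demo instance; any CKAN portal
# exposes the same /api/3/action/package_search endpoint.
BASE_URL = "https://demo.ckan.org/api/3/action/package_search"

response = requests.get(BASE_URL, params={"q": "budget", "rows": 5}, timeout=30)
response.raise_for_status()
payload = response.json()

# CKAN wraps results as {"success": ..., "result": {"count": ..., "results": [...]}}
print(payload["result"]["count"], "matching datasets")
for dataset in payload["result"]["results"]:
    print(dataset["name"], "-", dataset.get("title", ""))
```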
Data.world is a data collaboration platform. They encourage users to add data, which many have done, but they don't enforce any particular policy about preserving provenance, and the site is cluttered with samples and tests. Data.world did identify a handful of sources and mirror them wholesale (the Uniform Crime Reports and US EPA data, for example), and some newsrooms, including the Associated Press and NJ Advance Media, keep their public data collections on data.world. Unfortunately, there's no hierarchy to the site, or structure of any sort, and because anyone can add data, there's definitely some outright spam. It's an interesting place to search for data ideas, and maybe an interesting place to aggregate data you have worked with, but once you find something interesting, you're going to want to head upstream to make sure you've got current, complete records.

Enigma Public is a relatively comprehensive collection of public and semi-public structured data. Data they consider semi-public includes information obtained via Freedom of Information requests. Enigma has improved its provenance metadata significantly in recent years, and the data it provides is well documented but scattershot. Coverage of major U.S. cities is much more complete than international data: the list of governments includes a handful of countries outside the U.S., but in many cases only one or two datasets are actually available. A search for Oakland 311 turns up no Oakland results but does surface NYC 311 data, last updated eight months ago, as the top result. NYC's actual 311 call data is updated daily, but an Enigma user wouldn't necessarily know that more current data is available. Enigma can be a great resource, but users will want to check upstream manually if they need the most current data. (A sketch of pulling that 311 data directly from the source appears below, after the IRE entry.)

Global Open Data Index, compiled by Open Knowledge International (OKFN), aims to provide a comprehensive snapshot of published government data. The index is tightly organized by nation and topic, so OKFN can show you the state of public access to national legislative or land ownership data around the world, or public data in a handful of key topic areas for any one country. It appears that the index was last updated in 2015, but its sources can still help you connect with current data. The index is particularly useful to English-speaking researchers who need to find non-English-language data and may not be able to skim a foreign-language government site in search of a specific data source.

Google's Dataset Search tool launched in the fall of 2018. Google crawls the web for data sources that include schema.org microdata and incorporates them into search results. The result is that the data being searched isn't necessarily vetted, current, or accurate. Dataset Search results include a lot of data attributed to Kaggle (see that entry, below), which is all user submitted and often detached from its original source, making it difficult to find current data upstream. As more data publishers incorporate schema.org microdata, however, Dataset Search will get more comprehensive. (A sketch of that markup also appears after the IRE entry.)

IRE's Database Library includes a few valuable business and transportation datasets that Investigative Reporters and Editors has compiled and cleaned, some dating back decades.
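Heading upstream, as the Enigma entry advises, is often just a few lines of code. NYC's 311 service requests are published on the city's Socrata portal (see the Socrata entry below), and Socrata portals share a common API that the sodapy library wraps. The dataset identifier erm2-nwe9 and the field names here are what I believe NYC uses for 311 requests, but treat them as assumptions to confirm on the portal itself.

```python
from sodapy import Socrata  # pip install sodapy

# Connect to NYC's Socrata open data portal. Passing None as the app token
# works for light, unauthenticated use (requests may be throttled).
client = Socrata("data.cityofnewyork.us", None)

# Pull the five most recent 311 service requests straight from the source.
# "erm2-nwe9" is assumed to be the 311 dataset ID; verify it on the portal.
rows = client.get("erm2-nwe9", order="created_date DESC", limit=5)

for row in rows:
    print(row.get("created_date"), row.get("complaint_type"))

client.close()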
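For publishers, getting into Dataset Search mostly means embedding schema.org Dataset markup in the page that hosts the data. The sketch below builds a minimal JSON-LD description in Python; the property names come from schema.org's Dataset type, but the dataset itself is invented for illustration, and Google's structured-data documentation is the authority on which fields matter.

```python
import json

# A minimal schema.org Dataset description. Embedded in a web page inside a
# <script type="application/ld+json"> tag, this is the kind of microdata
# Google's Dataset Search crawls for.
dataset = {
    "@context": "https://schema.org",
    "@type": "Dataset",
    "name": "City Restaurant Inspections (example)",   # invented example
    "description": "Health inspection scores for restaurants, updated weekly.",
    "url": "https://example.org/data/inspections",     # placeholder URL
    "license": "https://creativecommons.org/publicdomain/zero/1.0/",
    "distribution": [
        {
            "@type": "DataDownload",
            "encodingFormat": "CSV",
            "contentUrl": "https://example.org/data/inspections.csv",
        }
    ],
}

print(json.dumps(dataset, indent=2))
```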
Kaggle bills itself as a project-based data science site, but it includes a commons of user-contributed data; there were 14,000 datasets when I last looked. Kaggle's commons is an eclectic mashup of whatever users have supplied. They encourage users to supply provenance information and human-readable data dictionaries, but they don't support automatic updates, so their data isn't especially useful as source material. Their metadata includes the date data was added to Kaggle but doesn't indicate whether newer data might be available from the source, which it often is. Google recently acquired Kaggle, and (not surprisingly) Kaggle data shows up a lot in Google's Dataset Search tool.

Open Policing Project at Stanford has aggregated police stop data from 31 U.S. states and organized it to facilitate comparisons across states. They're aiming to collect, clean, collate, and release data from all 50 U.S. states, and they have plans and funding to keep the data up to date.

ProPublica publishes, and sometimes sells, data. Data they obtained through formal public records requests (i.e., FOIA) is generally available free of charge on request; data they've cleaned or reconciled is available for purchase and licensing. Their collection is scattered and reflects their reporting rather than a concerted effort to create a unified index of data, but they have a lot of very interesting data, and they do a very good job of being explicit about provenance and limitations.

Quilt is a Python package, and the business behind it, that facilitates Git-like data packaging: provenance stays intact, and any cleaning or transformation of the data is tracked. Their commons includes any and all public data that users are storing there, so the quality and usefulness vary widely. Quilt is a super interesting option for reporters and newsrooms that want to publish or share cleaned data, so if you're looking for a better-than-GitHub way to publish data you've cleaned or transformed, it's worth checking out.

Socrata, like CKAN, builds software that facilitates sharing public data. Socrata doesn't publish a list of instances, but many city, state, national, and regional governments publish public data through a Socrata portal.

Swirrl, or PublishMyData, is a U.K.-based linked data project with a lot of overlap with Socrata and CKAN. Swirrl primarily powers public data sites, such as the Scottish Government's. Its sites include a cart feature that facilitates cross-comparisons within a given data store. Swirrl doesn't publish a list of instances of its software, but quite a few local and national governments in the U.K. and Europe appear to use it to publish public data.

Vigilant is a business that promises to track and compile public data and make it available to its customers in standardized formats. They don't publish any data publicly.

Still Looking? Try These Research Guides

Ally Jarmanning, a data reporter at WBUR in Boston, maintains a comprehensive guide to obtaining state court data (Google doc). Charles Ornstein at ProPublica spent ten
