There's a serious problem with the current state of shared data - it is almost completely unusable! Here are some ideas for sharing more effectively.
I often have a question I'd like to answer for which I know data are available. Most recently I wanted to look up the incidence (number of new cases) of various infectious diseases over the last decade. This should be easy - CDC publishes the Morbidity and Mortality Weekly Report, which covers just that. Well, the data are indeed available - but only in PDF. Why even bother with computers? They might as well mail around a printout. If I wanted to actually analyze it, I would first need to enter a decade's worth of data by hand. Ain't nobody got time for that.
I don't mean to pick on CDC. County Health Rankings is an awesome website that aggregates public health data from a variety of sources and releases it for download. I'm grateful for that, but the Excel files they release each have multiple sheets, nested headers, merged cells, and extra columns with confidence intervals. It's pretty much impossible to analyze that data in a program other than Excel. To do so, I first have to manually select and reformat the data I want, rename the variables, and then copy/paste it into a new file - which rather defeats the purpose.
There are about eight million other examples that I had to restrain myself from enumerating. The point is that sub-optimal sharing practices make it difficult for researchers (of both the professional and citizen variety) to actually use shared data. The research either a) won't get done because it's too much of a hassle, b) will have errors from manual data entry, or c) will take way longer than it should. Possibly all of the above. With that in mind, I came up with some tips to level up your data sharing.
Learn how to step up your sharing game:
Include as much detail and resolution as possible. County-level data is better than state-level data, which is better than national-level data. Bonus points if you break it down further by age, sex, etc. I understand that this can't always be done for privacy reasons, but it is immensely useful when it is possible.
Use a flexible file format. My preference is .csv, because it can be read by almost any program. I'll tolerate .xls, but I'm not pleased with .xlsx (not everyone uses Excel!). And please, please, please do not use PDF.
If you do use a spreadsheet format, do not use multiple sheets, nested headers, merged cells, strategic cell borders, etc. Make it as plain as possible. Don't worry that you'll end up with too many files if you don't use sheets. Release them in a zipped folder instead.
Use short variable names with no whitespace. Underscores are usually a safe bet, so instead of "Number of new tuberculosis cases" use "incident_tb".
If you have a corresponding column, e.g. confidence intervals in the screenshot shown above, make the variable name relevant. Use "UCI_incident_tb" instead of relying on the column's proximity to "incident_tb" to indicate a pairing.
Include a README that explains the variable names if you're worried they aren't descriptive enough. Actually, include a README no matter what. It can include variable names, units of measurement, notes on data collection/reporting/suppression, or anything else that is relevant.
Tell me whom to cite! I'm so pleased to be able to use your data, and I'd love to give you the credit you deserve. Put your citation on your website, in your README, and everywhere else I might look for it so that I can use it appropriately. Or post it to figshare, where it will automatically be assigned a DOI. It's an easy way to make your data citable, shareable, and version controlled.
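To make the tips above concrete, here is a minimal sketch in Python of what a release following them might look like: one plain .csv with short underscore names, a README, and a zipped folder holding both. Everything specific in it is an assumption for illustration: the function name write_release, the file names, the "LCI_" prefix for the lower bound (only "UCI_incident_tb" and "incident_tb" come from the tips themselves), and the rows, which are made-up placeholder numbers rather than real data.

```python
import csv
import zipfile
from pathlib import Path

# Placeholder rows: these numbers are invented for illustration, not real data.
# One header row, one value per cell, short names with underscores.
ROWS = [
    {"county": "Example_A", "year": 2012, "incident_tb": 14,
     "LCI_incident_tb": 9, "UCI_incident_tb": 21},
    {"county": "Example_B", "year": 2012, "incident_tb": 3,
     "LCI_incident_tb": 1, "UCI_incident_tb": 8},
]

# Hypothetical README text; a real one would also say whom to cite.
README = """\
incident_tb      Number of new tuberculosis cases
LCI_incident_tb  Lower bound of the confidence interval for incident_tb
UCI_incident_tb  Upper bound of the confidence interval for incident_tb
Citation: (put your preferred citation here)
"""


def write_release(out_dir, rows, readme_text):
    """Write a plain CSV and a README, then bundle both into one zip file."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    # Plain CSV: a single header row taken from the keys of the first row.
    csv_path = out_dir / "incident_tb.csv"
    with csv_path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)

    readme_path = out_dir / "README.txt"
    readme_path.write_text(readme_text, encoding="utf-8")

    # Zipped folder instead of multiple sheets in one workbook.
    zip_path = out_dir / "release.zip"
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.write(csv_path, arcname=csv_path.name)
        zf.write(readme_path, arcname=readme_path.name)
    return zip_path
```

Calling write_release("release", ROWS, README) would leave a release.zip that anyone can unpack and read with nearly any tool, no Excel required. With several tables, the same idea extends to one .csv per table inside the same zip.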
Bookmark these guidelines. Next time you reach for the 'export to PDF' button, or begin to use the change-cell-border feature in Excel, pull this out and remind yourself, 'This is not machine-readable. Nobody will use my data if I release it like this.' Then rejoice that you are awesome for sharing your data, and for doing so in a way that is actually useful. And for that, I thank you.