April 6, 2018
Data is a powerful tool for telling stories and communicating arguments, but whether you’re a data consumer or you’re presenting data yourself, it’s important to ensure credibility, prioritize transparency, and avoid major pitfalls. For this month’s data literacy post, we’ll go over a few key areas of good data use, with regard to both consuming and presenting data.
Source and Citations
We’ve talked about this here on the blog before, but making sure that the sources and citations are in order is a key step in evaluating published data. If there are charts, tables, or in-text statistics in a written piece, there should also be some mention of where this data came from. Now, the spirit here can be more important than the letter – footnotes and consistent citation formats are good, and in academic writing and technical reports are necessary, but in less formal contexts, an in-text mention of the data source can be sufficient. But some acknowledgement of the data source needs to be there.
Citing data alone, outside of a paper or an article, can sometimes be tricky; not all sources have a recommended citation format listed, or have all of the information listed that you would ideally like to include in a citation. If there is a recommended citation format, use that. If not, be as thorough as you can with the information you have available. We use the source format below for Census data posted on ccrpc.org:
Source: U.S. Census Bureau; American Community Survey, 2012-2016 American Community Survey 5-Year Estimates, Table DP03; generated by CCRPC staff; using American FactFinder; <http://factfinder2.census.gov>; (8 December 2017).
Start with the agency that produced the data, and include any departments or divisions that were also involved. If the data has an associated publication date, include that after the agency and divisions. Include the actual data program or survey, and detailed information on the dataset such as table numbers and dates, if you have them. Finally, include the link where you found the data and the date that you retrieved it.
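If you assemble source lines like this often, the pattern can be sketched in a few lines of Python. This is only an illustration of the ordering described above, not a tool we use: the function name `format_data_source` and its parameters are hypothetical, and the separators simply mirror our Census example.

```python
from datetime import date

# English month names are spelled out so the output does not depend on locale settings.
MONTHS = (
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December",
)


def format_data_source(agency, program, dataset, table, generated_by, tool, url, retrieved):
    """Assemble a semicolon-separated source line (hypothetical helper)."""
    retrieved_text = f"{retrieved.day} {MONTHS[retrieved.month - 1]} {retrieved.year}"
    parts = [
        agency,                                   # who produced the data
        f"{program}, {dataset}, Table {table}",   # program or survey, then dataset details
        f"generated by {generated_by}",           # who pulled the data
        f"using {tool}",                          # the tool used to pull it
        f"<{url}>",                               # where the data was found
        f"({retrieved_text})",                    # when it was retrieved
    ]
    return "Source: " + "; ".join(parts) + "."


citation = format_data_source(
    agency="U.S. Census Bureau",
    program="American Community Survey",
    dataset="2012-2016 American Community Survey 5-Year Estimates",
    table="DP03",
    generated_by="CCRPC staff",
    tool="American FactFinder",
    url="http://factfinder2.census.gov",
    retrieved=date(2017, 12, 8),
)
print(citation)
```

Keeping each element as a separate argument makes it obvious when a piece of information is missing, which is a useful prompt to go back to the source and look for it.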
If the data was published in an academic paper, news or magazine article, or as part of a blog or news post on a website, just cite the paper, article, or website appropriately according to the citation style of your choice.
The next several elements are less likely to be found detailed in the text of a published piece, unless it’s an academic or technical publication with a section on data and methodology. Journalistic pieces generally focus on the narrative, and space and word count can be limited. So instead of looking for frequency, area, and data quality in the piece, if you want to know the specifics, you may have to look into the data source itself (which is hopefully listed).
How frequently new datasets are published is important to know when looking at any data source, because that will indicate whether the data you’re looking at really reflects current conditions or not. Take the Economic Census, for example. This is a U.S. Census Bureau product that is published every five years, and includes a large amount of detailed data on businesses and industries at various levels of geography. However, as of this writing, the most recently published dataset presents data from 2012.
This isn’t a criticism of the Economic Census – it’s an excellent source for the data it includes, and for some of its data points, it’s the only free and publicly available source. But a lot can change in five years. So if you’re looking at a current piece of writing that builds its arguments on data from several years ago, be aware that some of the conclusions presented may be based on data that no longer holds true.
The area that a dataset pertains to is also important to consider. Some publicly available datasets are only published at the state or county level, and not at the level of cities, towns, or neighborhoods. Data and analysis are most credible when the area being discussed is the same as, or as close as possible to, the area of the dataset being used. This is especially important in large counties, counties with a significant urban/rural divide, and any counties where conditions and circumstances differ considerably from area to area. The same can be true of cities and neighborhoods – data that covers the whole city may miss large disparities between wards, boroughs, or neighborhoods. On the other hand, area and data quality can be a trade-off when estimated data is involved. Smaller areas generally have smaller sample sizes, which lead to larger margins of error. As a reader, all you can do is be aware of both the area and the data type and quality you’re looking at.
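To get a feel for the trade-off between area and data quality, here is a rough sketch using the textbook margin-of-error formula for a proportion from a simple random sample. This is an assumption made for illustration: surveys with complex designs, including the American Community Survey, calculate their margins of error differently, and the rate and sample sizes below are made up. The z-score of 1.645 corresponds to a 90 percent confidence level.

```python
import math

Z_90 = 1.645  # z-score for a 90 percent confidence level


def proportion_moe(p, n, z=Z_90):
    """Textbook margin of error for a proportion p estimated from n sampled people.

    Assumes a simple random sample, which is a simplification.
    """
    if n <= 0:
        raise ValueError("sample size must be positive")
    if not 0 <= p <= 1:
        raise ValueError("proportion must be between 0 and 1")
    return z * math.sqrt(p * (1 - p) / n)


# Hypothetical example: the same 20 percent rate measured with two sample sizes.
county_moe = proportion_moe(0.20, 10_000)    # large area, large sample
neighborhood_moe = proportion_moe(0.20, 100)  # small area, small sample

print(f"County:       20.0% +/- {county_moe * 100:.1f} percentage points")
print(f"Neighborhood: 20.0% +/- {neighborhood_moe * 100:.1f} percentage points")
```

Under this formula the margin of error shrinks with the square root of the sample size, so a sample 100 times smaller produces a margin 10 times larger.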
Specificity and Comprehensiveness
It’s also a good idea to look for the closest possible fit between the population or item being discussed and the dataset being used.
Think about poverty by age. If you’re interested in poverty among older adults in your area, a published dataset or article that only includes the poverty rate of the entire population isn’t going to suit your needs. Sure, the aggregate poverty rate includes older adults – but it also includes children, college students, and adults in the younger and middle age cohorts. Older adults are in there, but the dataset is not specific enough to reflect the conditions and circumstances of your population of interest. Aggregate or total datasets are good context, of course: it’s not as helpful to know a specific age group’s poverty rate unless you can compare it to the whole population’s and see if it’s higher or lower. But the aggregate statistic can’t be assumed to be the same as the cohort statistic.
The flip side of specificity is comprehensiveness. While you can’t accept an aggregate poverty rate as reflective of an age group’s circumstances, it’s equally inappropriate to apply the poverty rate of a single cohort to the entire population. In either case, when drawing conclusions based on data presented in a study or a piece of writing, it’s important to consider whether the data cited has a good fit with the population the piece is describing.
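A quick worked example shows how far apart the aggregate and cohort figures can be. The counts below are invented purely for illustration and do not come from any real dataset; the age groups are hypothetical as well.

```python
# Hypothetical counts per age group:
# (people below the poverty level, people for whom poverty status is determined)
cohorts = {
    "Under 18": (300, 1_500),
    "18 to 64": (900, 6_000),
    "65 and over": (80, 1_000),
}


def poverty_rate(below, total):
    """Share of a group that is below the poverty level."""
    return below / total


# Rate for each cohort on its own.
cohort_rates = {name: poverty_rate(b, t) for name, (b, t) in cohorts.items()}

# Aggregate rate: add up the counts first, then divide.
total_below = sum(b for b, _ in cohorts.values())
total_people = sum(t for _, t in cohorts.values())
aggregate_rate = poverty_rate(total_below, total_people)

for name, rate in cohort_rates.items():
    print(f"{name}: {rate:.1%}")
print(f"Whole population: {aggregate_rate:.1%}")
```

In this made-up community, the aggregate rate sits between the cohort rates and matches none of them, so using it to describe older adults, or using the older adult rate to describe everyone, would misstate conditions in either direction.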
Finally, a note on data quality. Estimates and margins of error have come up over and over again on this blog, so we won’t reiterate everything here (but we will direct you to our August 2016 and December 2016 posts for more information on margins of error and the American Community Survey, the dataset CCRPC cites most often that includes them). To sum up: margins of error are important, and illustrate how certain or uncertain a statistic is. If what you’re reading cites estimates that have associated margins of error but doesn’t report those margins, keep in mind that no estimate is perfect, and that both the data you’re looking at and any conclusions based on it are incomplete.
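As a small illustration of why margins of error matter, the sketch below turns an estimate and its margin into a range, and applies a common approximation for checking whether two estimates really differ: the gap between them should be larger than the square root of the sum of their squared margins. This assumes the two estimates are independent and that both margins use the same confidence level. The function names and the numbers are hypothetical.

```python
import math


def interval(estimate, moe):
    """Range implied by an estimate and its margin of error."""
    return (estimate - moe, estimate + moe)


def differ_significantly(est_a, moe_a, est_b, moe_b):
    """Approximate test of whether two independent estimates differ.

    Both margins of error must be at the same confidence level.
    """
    moe_of_difference = math.sqrt(moe_a ** 2 + moe_b ** 2)
    return abs(est_a - est_b) > moe_of_difference


# Hypothetical poverty rates, in percent, for two areas.
low, high = interval(15.0, 2.5)
print(f"Area A: 15.0% (somewhere between {low}% and {high}%)")

# 15.0 +/- 2.5 versus 17.0 +/- 2.0: the gap is smaller than the combined margin.
print(differ_significantly(15.0, 2.5, 17.0, 2.0))

# 15.0 +/- 2.5 versus 20.0 +/- 2.0: the gap is larger than the combined margin.
print(differ_significantly(15.0, 2.5, 20.0, 2.0))
```

Without the margins of error, a reader would see 15 percent and 17 percent and might conclude that the second area is worse off, when the data can’t actually support that claim.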
All of the above applies to presenting data and analysis as well as consuming it. The difference is that the points we’ve described for sources and citations, frequency, area, specificity and comprehensiveness, and quality are no longer characteristics to look for in what you’re reading, but things to strive for in what you’re writing.
Make sure to list your sources and be thorough about citations. Use the most recent data available, and when the most recent data isn’t actually that recent, acknowledge that. Try to get the best possible fit between your study area and population of interest and the data you’re using to describe them. Include margins of error if you’re working with a dataset that has them. If you spot any other data quality issues or things that might cause significant reader or listener confusion, be up-front about them: note the data quality issue and how it impacts what you’re saying, or add a footnote to clarify what you think might be unclear.
Writing with data is often about telling a story or making a point. Working toward the standards we’ve outlined above supports transparency, which is something that everyone should value, not just us here in the public sector. In addition to that, making sure that your data fits your argument and doesn’t have any major gaps, and that you’re clear about your sources and any shortcomings, adds to your credibility, which can make your point more convincing and more likely to be well received.
So be thorough, thoughtful, and conscientious as a reader and a writer. It’s good practice, good business, and helps build a healthy, happy data culture.