Next up are several overdue reports on data wrangling of county-level election data: working with election officials to get the legacy data needed for election results, and then putting that data to practical use. It’s where we write software to chew up whatever data we get, put it in a backend system, re-arrange it, and spit it out all tidy and clean, in a standard election data format. From there, we use the standard-format data to drive our prototype system, VoteStream.
I’ll report on each of 3 counties and leave it at that, even though since then we’ve forged ahead on pulling in data from other counties as well. These reports from the trenches of VoteStream will be heavy on data-head geekery, so no worries if you want to skip them because that’s not your cup of tea. For better or for worse, however, this is the method of brewing up data standards.
I’ll start with Ramsey County, MN, which was our first go-round. The following is not a short or simple list, but here is what we started with:
- Some good advice from: Joe Mansky, head of elections in Ramsey County, Minnesota; and Mark Ritchie, Secretary of State and head of elections for Minnesota.
- A spreadsheet from Joe, listing Ramsey County’s precincts and some of the districts they are in; plus verbal info about other districts that the whole county is in.
- Geo-data from the Minnesota State Legislative GIS office, with a “shapefile” for each precinct.
- More data from the GIS office, from which we learned that they use a different precinct-naming scheme than Ramsey County.
- Some 2012 election result datasets, also from the GIS office.
- Some 2012 election result datasets from the MN SoS web site.
- Some more good advice from Joe Mansky on how to use the election result data.
- The VIP data format for expressing info about precincts and districts, contests and candidates, and an idea for extending that to include vote counts.
- Some good intentions for doing the minimal modifications to the source data, and creating a VIP standard dataset that defines the election (a JEDI in our parlance, see a previous post for explanation).
- Some more intentions and hopes for being able to do minimal modifications to create the election results data.
Along the way, we got plenty of help and encouragement from all the organizations I listed above.
Next, let me explain some problems we found, what we learned, and what we produced.
- The first problem was that the county data and GIS data didn’t match, but we connected the dots and used the GIS version of precinct IDs, which use the national standard, FIPS.
- County data didn’t include statewide districts, but the election results did. So we again fell back on FIPS, and added standards-based district IDs. (We’ll be submitting that scheme to the standards bodies, when we have a chance to catch our breath.)
- Election results depend on an intermediate object called “office” that links a contest (say, for state senate district 4) to a district (say, the 4th state senate district), via an office (say, the state senate seat for district 4), rather than a direct linkage. Sounds unimportant, but …
- The non-local election results used the “office” to identify the contest, and this worked mostly OK. One issue was that the U.S. Congress offices were all numbered, but without mentioning MN. This is a problem if multiple states report results for “Representative, 1st Congressional District” because all states have a first congressional district. Again, more hacking of the district ID scheme to use FIPS.
- The local election results did not work so well. A literal reading of the data seemed to indicate that each town in Ramsey County in the Nov. 2012 election had a contest for mayor, and for the same mayor’s office at that. Oops! We needed to augment the source data to make plain *which* mayor’s office the contest was for.
- Finally, still not done, we had a handful of similarly ambiguous data for offices other than mayor, that couldn’t be tied to a single town.
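For the data-heads, here is a minimal sketch in Python of the kind of ID scheme and contest-to-office-to-district linkage described in the list above. It is an illustration only, not our actual conversion code: the ID formats, the function names, and the placeholder towns are all assumptions made up for this example. The only real values are the state FIPS codes (27 for Minnesota, 55 for Wisconsin).

```python
from dataclasses import dataclass

MN_FIPS = "27"  # Minnesota's state FIPS code
WI_FIPS = "55"  # Wisconsin's state FIPS code, used only for contrast


def district_id(state_fips: str, kind: str, number: int) -> str:
    """Hypothetical ID scheme: state FIPS + district type + district number."""
    return f"{state_fips}-{kind}-{number:02d}"


def local_district_id(state_fips: str, town: str) -> str:
    """Hypothetical ID for a town-level district, keyed by a slug of the town name."""
    return f"{state_fips}-town-{town.lower().replace(' ', '-')}"


@dataclass(frozen=True)
class Office:
    office_id: str
    name: str
    district_id: str  # the office belongs to exactly one district


@dataclass(frozen=True)
class Contest:
    contest_id: str
    office_id: str  # the contest reaches its district only via the office


def mayor_office(state_fips: str, town: str) -> Office:
    """Make a mayor's office that is tied to one specific town."""
    d_id = local_district_id(state_fips, town)
    return Office(office_id=f"{d_id}-mayor", name=f"Mayor of {town}", district_id=d_id)


# Same office title in two states, but distinct district IDs.
mn_cd1 = district_id(MN_FIPS, "cd", 1)
wi_cd1 = district_id(WI_FIPS, "cd", 1)

# Two placeholder towns (not real data); each gets its own office and contest.
offices = [mayor_office(MN_FIPS, t) for t in ("Town A", "Town B")]
contests = [Contest(contest_id=f"2012-11-{o.office_id}", office_id=o.office_id) for o in offices]

print(mn_cd1, wi_cd1)
for c in contests:
    print(c.contest_id)
```

The point of the sketch is that the state prefix keeps "1st Congressional District" distinct across states, and that tying each mayor's office to a town-level district keeps the mayoral contests from collapsing into one.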
One last problem, for the ultra data-heads. Turns out that some precincts are not a single contiguous geographical region, but a combination of 2 regions that touch only at a point, or (weirder) aren’t directly connected at all. So our first cut at encoding the geo-data into XML (for inclusion in VIP datasets) wasn’t quite right, and the Google Maps view of the data had holes in it.
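One way to encode such a precinct, sketched here in Python, is to wrap its separate pieces in a KML MultiGeometry element instead of a single Polygon. This is an assumption about how one could do it, not a description of our actual code; the function names, the precinct name, and the coordinates are invented for the example.

```python
import xml.etree.ElementTree as ET


def ring_to_coordinates(ring):
    """Format (lon, lat) pairs as KML coordinate text, closing the ring if needed."""
    points = list(ring)
    if points[0] != points[-1]:
        points.append(points[0])  # KML rings must end where they start
    return " ".join(f"{lon},{lat}" for lon, lat in points)


def precinct_placemark(name, parts):
    """Build a Placemark; precincts with several parts get a MultiGeometry wrapper."""
    placemark = ET.Element("Placemark")
    ET.SubElement(placemark, "name").text = name
    parent = placemark
    if len(parts) > 1:
        parent = ET.SubElement(placemark, "MultiGeometry")
    for ring in parts:
        polygon = ET.SubElement(parent, "Polygon")
        outer = ET.SubElement(polygon, "outerBoundaryIs")
        linear_ring = ET.SubElement(outer, "LinearRing")
        ET.SubElement(linear_ring, "coordinates").text = ring_to_coordinates(ring)
    return placemark


# Made-up coordinates for a precinct in two disconnected pieces.
west_part = [(-93.20, 45.00), (-93.19, 45.00), (-93.19, 45.01), (-93.20, 45.01)]
east_part = [(-93.10, 45.00), (-93.09, 45.00), (-93.09, 45.01), (-93.10, 45.01)]

placemark = precinct_placemark("Hypothetical Precinct 1", [west_part, east_part])
print(ET.tostring(placemark, encoding="unicode"))
```

A single-part precinct still comes out as a plain Polygon, so only the odd cases pay for the extra wrapper.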
So, here is what we learned.
- We had to semi-invent some naming conventions for districts, contests, and candidates, to keep separate everything that was actually separate, and to disambiguate things that sounded the same but were actually different. It’s actually not important if you are only reporting results at the level of one town, but if you want to aggregate across towns, counties, states, etc., then you need more. What we have is sufficient for our needs with VoteStream, but there is real room for more standards like FIPS to make a scheme that works nationwide.
- Using VIP was simple at first, but when we added the GIS data, and used the XML standard for it (KML), there was a lot of fine-tuning to get the datasets to be 100% compliant with the existing standards. We actually spent a surprising amount of time testing the data model extensions and validations. It was worth it, though, because we have a draft standard that works, even with those wacky precincts shaped like East and West Prussia.
- Despite that, we were able to finish the data-wrangling fairly quickly and use a similar approach for other counties — once we figured it all out. We did spend quite a bit of time mashing this around and asking other election officials how *their* jurisdictions worked, before we got it all straight.
Lastly, here is what we produced. We now have a set of data conversion software that takes the source data listed above and produces election definition datasets in a repeatable way, making the most effective use of existing standards. We also have a less settled method of data conversion for the actual results (e.g., for precinct 123, for contest X, for candidate Y, there were Z votes, and similarly for all precincts and all contests). That was sufficient for the data available in MN, but is not yet sufficient for additional info that is available in other states but not in MN.
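To make the shape of that results data concrete, here is a small Python sketch of precinct-level result rows and a roll-up to contest totals. The record layout, the field names, and the vote counts are invented for illustration; they are not the VIP extension itself.

```python
from collections import defaultdict
from typing import NamedTuple


class ResultRow(NamedTuple):
    """One vote count: for precinct P, contest X, candidate Y, there were Z votes."""
    precinct_id: str
    contest_id: str
    candidate_id: str
    votes: int


def totals_by_contest(rows):
    """Sum precinct-level rows into per-contest, per-candidate totals."""
    totals = defaultdict(lambda: defaultdict(int))
    for row in rows:
        totals[row.contest_id][row.candidate_id] += row.votes
    # Convert to plain dicts so the result is easy to inspect or serialize.
    return {contest: dict(cands) for contest, cands in totals.items()}


# Made-up rows for two precincts and one contest.
rows = [
    ResultRow("precinct-123", "contest-X", "candidate-Y", 210),
    ResultRow("precinct-123", "contest-X", "candidate-W", 190),
    ResultRow("precinct-124", "contest-X", "candidate-Y", 75),
    ResultRow("precinct-124", "contest-X", "candidate-W", 125),
]

print(totals_by_contest(rows))
```

Keeping the rows at precinct level and aggregating on demand is what lets the same data drive both a precinct map and a countywide total.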
The next steps are: tackle other counties with other source data, and wrangle the data into the same standards-based format for election definitions; extend the data format for more complex results data.
Data wrangling the Nov. 2012 Ramsey County election was very instructive, and we couldn’t have done it without plenty of help, for which we are very grateful!