Election Results: Data-Wrangling Los Angeles County
John Sebes
LA County, CA is the mother of all election complexities, and the data wrangling was intense, even compared to the hardly simple efforts that I reported on previously. There are over 32,000 distinct voting regions, which I think is more than the number of seats, ridings, chairs, and so on, for all the federal and state houses of government in all the parliamentary democracies in the EU.
The LA elections team was marvelously helpful, and upfront about the limits of what they can produce with the aging voting system that they are working hard on replacing. This is what we started with.
- A nicely structured CSV file listing all the districts in LA County: over 20 different types of district, and over 900 individual districts.
- Some legacy GIS data, part of which defined each precinct in terms of which districts it is in.
- The existing legacy GIS data converted into XML standard format (KML), again, kindly created by LA CC-RR IT chief, Kenneth Bennett.
- A flat text file of all the election results for the 2012 election for every precinct in LA County, and various roll-ups.
- A sort of Rosetta Stone that is just the Presidential election results, but in a well-structured CSV file, also very kindly generated for us by Kenneth.
You’ll notice that not included is a definition of the 2012 election itself: the contests, which district each contest is for, other info on the contest, info on candidates, referenda, and so on. So, first problem, we needed to reverse engineer that as best as we could, from the election results. But before we could do that, we had to figure out how to parse the flat text file of results. The “Rosetta Stone” was helpful, but we then realized that we needed information about each precinct that reported results in the flat text file. To get the precinct information, we had to parse the legacy GIS data, and map it to the districts definition.
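To make the reverse-engineering idea concrete, here is a minimal sketch of one way to guess which district a contest belongs to: compare the set of precincts that reported results for the contest with the set of precincts inside each district. This is an illustration only, not the converter we actually wrote; the function name, the precinct and district IDs, and the data shapes are all hypothetical.

```python
from collections import defaultdict


def infer_contest_districts(precinct_districts, contest_precincts):
    """Guess each contest's district by matching precinct sets.

    precinct_districts: precinct ID -> set of district IDs (hypothetical shape,
        standing in for what the GIS data plus districts file would yield).
    contest_precincts: contest name -> set of precinct IDs that reported
        results for it (standing in for the parsed results file).
    Returns contest name -> sorted list of districts with exactly that
    precinct set (an empty list means no district matched).
    """
    # Invert precinct -> districts into district -> set of precincts.
    district_precincts = defaultdict(set)
    for precinct, districts in precinct_districts.items():
        for district in districts:
            district_precincts[district].add(precinct)

    inferred = {}
    for contest, precincts in contest_precincts.items():
        wanted = set(precincts)
        inferred[contest] = sorted(
            d for d, members in district_precincts.items() if members == wanted
        )
    return inferred


# Made-up example data, not real LA County identifiers.
precinct_districts = {
    "P1": {"COUNTY", "CITY-A"},
    "P2": {"COUNTY", "CITY-A"},
    "P3": {"COUNTY"},
}
contest_precincts = {
    "President": {"P1", "P2", "P3"},
    "City A Mayor": {"P1", "P2"},
    "Mystery Measure": {"P3"},
}
inferred = infer_contest_districts(precinct_districts, contest_precincts)
```

A contest that matches no district, or more than one, is exactly the kind of case that needs a human to look at it.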
The second problem was GIS data that wasn’t obvious, but fortunately we had excellent help from Elio Salazar, a member of Ken’s team who specializes in the GIS data. He helped us sort out various intricacies and corner cases. One of the hardest turned out to be the ways in which one district (say, a school district) is a real district used for referenda, but is also sub-divided into smaller districts, each for a council seat. Some cities were subdivided this way into council seats, some not; same for water districts and several other kinds of districts.
Then, as soon as we thought we had clear sailing, it turned out that the districts file had a couple of minor format errors that we had to fix by hand. Plus there were 4 special-case districts that weren’t actually used in the precinct definitions, but were required for the election results. Whew! At that point we thought we had a complete election definition including the geo-data of each precinct in KML. But wait! We had over 32,000 precincts defined, but only just shy of 5,000 that reported election results. I won’t go into the details of sub-precincts and precinct consolidation, and how some data was from the 32,000 viewpoint and other data from the 4,993 viewpoint. Or why 4,782 was not our favorite number for several days.
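For a flavor of the two-viewpoints problem, here is a small sketch of a reconciliation check. It assumes a hypothetical consolidation table that maps each detailed precinct to the consolidated precinct that reports its results; the IDs and the table itself are invented for illustration, and the real LA County data was not this tidy.

```python
def reconcile(consolidation, reported):
    """Compare consolidated precincts we defined with those that reported.

    consolidation: detailed precinct ID -> consolidated (reporting) precinct ID.
    reported: iterable of consolidated precinct IDs found in the results file.
    """
    defined = set(consolidation.values())
    seen = set(reported)
    return {
        # Defined via the precinct data, but absent from the results.
        "defined_not_reported": sorted(defined - seen),
        # Present in the results, but with no precinct definition behind them.
        "reported_not_defined": sorted(seen - defined),
    }


# Made-up IDs: four detailed precincts folding into three reporting precincts.
consolidation = {
    "0001A": "R-100",
    "0001B": "R-100",
    "0002A": "R-200",
    "0003A": "R-300",
}
reported = ["R-100", "R-200", "R-999"]
report = reconcile(consolidation, reported)
```

When the two lists come back non-empty, the counts on each side stop agreeing, which is how a number can become unpopular for several days.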
Then the final lap: actually parsing all the 100,000-plus contest results in the flat text file, normalizing and storing all the data, and then emitting it in VIP XML. We thought we had a pretty good specification (only 800 words long) of the structure implicit in the file. We came up with three major special cases, and I don’t know how many little weird cases that turned out not to be relevant to the actual vote counts. I didn’t have the heart to update the specification, but it was pretty complex, and honestly the data is so huge that we could spend many days writing consistency checks of various kinds, and manually reviewing the input to track down inconsistencies.
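One of the simpler consistency checks of this kind is to sum the per-precinct votes and compare them with the roll-up totals. The sketch below assumes the results have already been parsed into plain tuples and a dictionary; those shapes, the function name, and the sample values are hypothetical and say nothing about the actual layout of the flat text file.

```python
from collections import Counter


def rollup_mismatches(precinct_rows, rollup_totals):
    """Return the (contest, candidate) pairs whose precinct sum differs
    from the roll-up total.

    precinct_rows: iterable of (contest, candidate, precinct, votes).
    rollup_totals: (contest, candidate) -> total votes from the roll-up.
    Returns (contest, candidate) -> (sum of precinct votes, roll-up total).
    """
    sums = Counter()
    for contest, candidate, _precinct, votes in precinct_rows:
        sums[(contest, candidate)] += votes

    mismatches = {}
    # Check every pair that appears on either side.
    for key in sorted(set(sums) | set(rollup_totals)):
        summed = sums.get(key, 0)
        total = rollup_totals.get(key, 0)
        if summed != total:
            mismatches[key] = (summed, total)
    return mismatches


# Made-up numbers for illustration only.
rows = [
    ("President", "Candidate A", "R-100", 120),
    ("President", "Candidate A", "R-200", 80),
    ("President", "Candidate B", "R-100", 90),
    ("President", "Candidate B", "R-200", 60),
]
rollups = {
    ("President", "Candidate A"): 200,
    ("President", "Candidate B"): 155,
}
mismatches = rollup_mismatches(rows, rollups)
```

A non-empty result does not say which side is wrong, only where to start the manual review.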
In the end, I think we got to a pretty close but probably not perfect rendition of election results. A truly re-usable and reliable data converter would need some follow-on work in close collaboration with several folks in Ken’s team — something that I hope we have the opportunity to do in a later phase of work on VoteStream.
But 100% completeness aside, we still had excellent proof of concept that even this most complex use case did in fact match the standard data model and data format we were using. With some further work using the VIP common data format with other counties, the extended VIP format should be nearly fully baked and ready to work with the IEEE standards body on election data.
— EJS