
Blockchains for Elections, in Maine: “Don’t Be Hasty”

Many have noted with interest some draft legislation in Maine that mandates the exploration of how to use blockchain technology to further election transparency. My comment is, to quote one well-known sage, “Don’t Be Hasty”. First, though, let me say that I am very much in favor of any state resolving to study the use of innovative tech in elections, even tech as widely misunderstood as blockchains. This bill is no exception: study is a great idea.

However, there is already a considerable amount of haste elsewhere in the elections world, with many enthusiasts and over a dozen startups thinking that since blockchains have revolutionized anonymous financial transactions (especially via Bitcoin), elections can benefit too. But actually not a lot, at least in terms of voting. As one of my colleagues who is an expert on both elections and advanced cryptography says, “Blockchain voting is just a bad idea – even for people who like online voting.” It will take some time and serious R&D to wrestle to the ground whether and how blockchains can be one of (my count) about half a dozen innovative ingredients that might make online voting worth trying.

However, in the meantime, there are plenty of good near-term uses of blockchain technology for election transparency, including two of my favorites that could be put into place fairly quickly in Maine, if the study finds it worthwhile.

  1. In one case, each transaction is a change to the voter rolls: adding or deleting a voter, or updating a voter’s name or location or eligibility. Publication — with provenance — would provide the transparency needed to find the truth or lack thereof of claims of “voter roll purging” that crop up in every election.
  2. In the other case, each transaction is either that of a voter checking in to vote in person (via a poll book, paper or digital) or that of their absentee ballot being received, counted, or rejected. I hope the transparency value is evident in the public knowing in detail who did and didn’t vote in a given election.

In each case, there is a public interest in knowing the entirety of a set of transactions that have an impact on every election, and in being able to know that a claimed log of transaction records is the legitimate log. Without that assurance of “data provenance” there are real risks of disinformation and confusion, to the detriment of both confidence in elections and transparency. Publication of these types of transaction data, with the use of blockchains, can provide the provenance that’s needed for both confidence and transparency. Figuring out the details will require study (Don’t Be Hasty) but it would be a big step in election transparency. Go Maine!
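To make “provenance” a bit more concrete, here is a minimal sketch of a hash-chained log of voter roll transactions. It is only an illustration under my own assumptions: the field names (`action`, `voter_id`, `town`), the `GENESIS` value, and the helper functions are hypothetical, and a bare hash chain is just one ingredient of a blockchain (there are no signatures, replication, or consensus here). What it does show is why tampering with a published record is detectable: each entry's hash covers the previous entry's hash.

```python
import hashlib
import json

GENESIS = "0" * 64  # hypothetical placeholder hash preceding the first record


def record_hash(prev_hash, transaction):
    """Hash a transaction together with the hash of the record before it."""
    payload = json.dumps({"prev": prev_hash, "tx": transaction}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def append(log, transaction):
    """Append a transaction, chaining it to the current end of the log."""
    prev_hash = log[-1]["hash"] if log else GENESIS
    log.append({
        "prev": prev_hash,
        "tx": transaction,
        "hash": record_hash(prev_hash, transaction),
    })


def verify(log):
    """Return True only if every record still matches its chained hash."""
    prev_hash = GENESIS
    for entry in log:
        if entry["prev"] != prev_hash:
            return False
        if entry["hash"] != record_hash(prev_hash, entry["tx"]):
            return False
        prev_hash = entry["hash"]
    return True


# Illustrative voter roll changes (made-up data).
log = []
append(log, {"action": "add", "voter_id": "V-001", "town": "Augusta"})
append(log, {"action": "update", "voter_id": "V-001", "town": "Portland"})
append(log, {"action": "remove", "voter_id": "V-002", "reason": "moved out of state"})
```

Anyone holding a copy of the log can run `verify(log)`; if someone quietly edits an earlier record, the check fails for that record and the claimed log can be rejected.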


Cancellation of Federal Assistance to US Elections — The Good, The Bad, and The Geeky

Recently I wrote about Congress dismantling the only Federal agency that helps states and their local election officials ensure that the elections that they conduct are verifiable, accurate, and secure — and transparently so, to strengthen public trust in election results. Put that way, it may sound like dismantling the U.S. Election Assistance Commission (EAC) is both a bad idea, and also poorly timed after a highly contentious election in which election security, accuracy, and integrity were disparaged or doubted vocally and vigorously.

As I explained previously, there might be a sensible case for shutdown with a hearty “mission accomplished”, but only with a narrow view of the original mission of the EAC. I also explained that since its creation, EAC’s evolving role has come to include duties that are uniquely imperative at this point in U.S. election history. What I want to explain today is that evolved role, and why it is so important now.

Suppose that you are a county election official in the process of buying a new voting system. How do you know that what you’re buying is a legit system that does everything it should do, and reliably? It’s a bit like a county hospital administrator considering adding new medications to their formulary — how do you know that they are safe and effective? In the case of medications, the FDA runs a regulatory testing program and approves medications as safe and effective for particular purposes.

In the case of voting systems, the EAC (with support from NIST) has an analogous role: defining the requirements for voting systems, accrediting test labs, defining requirements for how labs should test products, reviewing test labs’ work, and certifying those products that pass muster. This function is voluntary for states, who can choose whether and how to build their certification program on the basis of federal certification. The process is not exactly voluntary for vendors, but since they understandably want to have products that can work in every state, they build products to meet the requirements and pass Federal certification. The result is that each locality’s election office has a state-managed approved product list that typically includes only products that are Federally certified.

Thus far the story is pretty geeky. Nobody gets passionate about standards, test labs, and the like. It’s clear that the goals are sound and the intentions are good. But does that mean that eliminating the EAC’s role in certification is bad? Not necessarily, because there is a wide range of opinion on EAC’s effectiveness in running the certification process. However, recent changes have shown that the stakes are much higher, and that the role of requirements, standards, testing, and certification is more important than ever. The details about those changes will be in the next installment, but here is the gist: we are in the middle of a nationwide replacement of aging voting machines and related election tech, and in an escalating threat environment, with global adversaries targeting U.S. elections. More of the same-old-same-old isn’t nearly good enough. But how would election officials gain confidence in new election tech that’s not only safe and effective, but robust against whole new categories of threat?


Accurate Election Results in Michigan and Wisconsin is Not a Partisan Issue


Courtesy, Alex Halderman Medium Article

In the last few days, we’ve been getting several questions that are variations on:

Should there be recounts in Michigan in order to make sure that the election results are accurate?

For the word “accurate” people also use any of:

  • “not hacked”
  • “not subject to voting machine malfunction”
  • “not the result of tampered voting machine”
  • “not poorly operated voting machines” or
  • “not falling apart unreliable voting machines”

The short answer to the question is:

Maybe a recount, but absolutely there should be an audit because audits can do nearly anything a recount can do.

Before explaining that key point, a nod to University of Michigan computer scientists pointing out why we don’t yet have full confidence in the election results in their State’s close presidential election, and possibly other States as well. A good summary is here and an even better explanation is here.

A Basic Democracy Issue, not Partisan

The not-at-all partisan or even political issue is election assurance – giving the public every assurance that the election results are the correct results, despite the fact that bug-prone computers and human error are part of the process. Today, we don’t know what we don’t know, in part because the current voting technology not only fails to meet the three (3) most basic technical security requirements, but really doesn’t support election assurance very well. And we need to solve that! (More on the solution below.)

A recount, however, is a political process and a legal process that’s hard to see as anything other than partisan. A recount can happen when one candidate or party looks for election assurance and does not find it. So it is really up to the legal process to determine whether to do a recount.

While that process plays out let’s focus instead on what’s needed to get the election assurance that we don’t have yet, whether it comes via a recount or from audits — and indeed, what can be done, right now.

Three Basic Steps

Leaving aside a future in which the basic technical security requirements can be met, right now, today, there is a plain pathway to election assurance of the recent election. This path has three basic steps that election officials can take.

  1. Standardized Uniform Election Audit Process
  2. State-Level Review of All Counties’ Audit Records
  3. State Public Release of All Counties Audit Records Once Finalized

The first step is the essential auditing process that should happen in every election in every county. Whether we are talking about the initial count, or a recount, it is essential that humans do the required cross-check of the computers’ work to detect and correct any malfunction, regardless of origin. That cross-check is a ballot-polling audit, where humans manually count a batch of paper ballots that the computers counted, to see if the human results and machine results match. It has to be a truly random sample, and it needs to be statistically significant, but even in a close election, it is far less work than a recount. And it works regardless of how a machine malfunction was caused, whether hacking, manipulation, software bugs, hardware glitches, or anything else.
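For readers who want to see what “truly random” and “statistically significant” can look like in practice, here is a rough sketch. It is not the procedure any particular county uses; it assumes a simple two-candidate contest and swaps in a sequential test in the style of the published BRAVO ballot-polling method, which this post does not name. The function names, the seed, and the "winner"/"loser" labels are all hypothetical.

```python
import random


def draw_sample(ballot_count, sample_size, seed):
    """Pick distinct ballot positions at random, reproducibly from a seed.

    Assumption: a real audit would generate the seed publicly (for example
    with dice) and use a publicly specified random number generator.
    """
    return random.Random(seed).sample(range(ballot_count), sample_size)


def ballot_polling_audit(sampled_ballots, reported_winner_share, risk_limit):
    """Sequential test over hand-read ballots, in sampling order.

    sampled_ballots: iterable of "winner", "loser", or anything else
        (other values, such as undervotes, do not change the statistic).
    reported_winner_share: the machine-reported share for the winner among
        ballots cast for either candidate; must be above 0.5.
    risk_limit: e.g. 0.10 for a 10% risk limit.
    Returns (confirmed, ballots_examined).
    """
    if not 0.5 < reported_winner_share < 1.0:
        raise ValueError("reported winner share must be between 0.5 and 1")
    threshold = 1.0 / risk_limit
    statistic = 1.0
    examined = 0
    for ballot in sampled_ballots:
        examined += 1
        if ballot == "winner":
            statistic *= reported_winner_share / 0.5
        elif ballot == "loser":
            statistic *= (1.0 - reported_winner_share) / 0.5
        if statistic >= threshold:
            return True, examined  # sample strongly supports the reported outcome
    return False, examined  # not enough evidence yet: sample more or hand count
```

The closer the reported margin, the more ballots the humans have to look at before the evidence is strong enough, and if it never gets there, the fallback is more sampling or a full hand count.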

This first step should already have been taken by each county in Michigan, but at this point it is hard to be certain. Though less work than a recount, a routine ballot polling audit is still real work, and made harder by the current voting technology not aiding the process very well. (Did I mention we need to solve that?)

The second step should be a state-level review of all the records of the counties’ audits. The public needs assurance that every county did its audit correctly, and further, documented the process and its findings. If a county can’t produce detailed documentation and findings that pass muster at the State level, then alas the county will need to re-do the audit. The same would apply if the documentation turned up an error in the audit process, or a significant anomaly in a difference between the human count and the machine count.

That second step is not common everywhere, and the third step would be unusual, but it would be very beneficial and a model for the future: when a State is satisfied that all counties’ election results have been properly validated by ballot polling audit, the State elections body could publicly release all the records of all the counties’ audit process. Then anyone could independently come to the same conclusion as the State did, but especially election scientists, data scientists, and election tech experts. I know that Michigan has diligent and hardworking State election officials who are capable of doing all this, and indeed do much of it as part of the process toward the State election certification.

This Needs to Be Solved – and We Are

The fundamental objective for any election is public assurance in the result.  And where the election technology is getting in the way of that happening, it needs to be replaced with something better. That’s what we’re working toward at the OSET Institute and through the TrustTheVote Project.

No one wants the next few years to be dogged by uncertainty about whether the right person is in the Oval Office or the Senate. That will be hard for this election because of the failing voting machines that were not designed for high assurance. But America must say never again, so that two short years from now, and four years from now, we have election infrastructure in place that was designed from the ground up and purpose-built to make it far easier for election officials to deliver election results and election assurance.

There are several matters to address:

  • Meeting the three basic security requirements;
  • Publicly demonstrating the absence of the vulnerabilities in current voting technology;
  • Supporting evidence-based audits that maximize confidence and minimize election officials’ efforts; and
  • Making it easy to publish detailed data in standard formats, that enable anyone to drill down as far as needed to independently assess whether audits really did the job right.

All that and more!

The good news (in a shameless plug for our digital public works project) is that’s what we’re building in ElectOS. It is the first openly public and freely available set of election technology; an “operating system” of sorts for the next generation of voting systems, in the same way that Android is the basis for much of today’s mobile communication and computing.

— John Sebes

NBC News, Voting Machines, and a Grandmother’s PC


I’d like to explain more precisely what I meant by “your grandmother’s PC” in the NBC TV Bay Area’s report on election technology. Several people thought I was referring to voting machines as easily hacked by anyone with physical access, because despite appearances:

Voting machines are like regular old PCs inside, and like any old PC …

  • … it will be happy to run any program you tell it to, where:
  • “You” is anyone that can touch the computer, even briefly, and
  • “Program” is anything at all, including malicious software specially created to compromise the voting machine.

That’s all true, of course, as many of us have seen recently in cute yet fear-mongering little videos about how to “hack an election.” However, I was referring to something different and probably more important: a regular old PC running some pretty basic Windows XP application software, that an election official installed on the PC in the ordinary way, and uses in the same way as anything else.

That’s your “grandmother’s PC,” or in my son’s case, something old and clunky that looks exactly like the PC that his grandfather had a decade plus ago – minus some hardware upgrades and software patches that were great for my father, but for voting systems are illegal.

But why is that PC “super important”? Because the software in question is the brains behind every one of that fleet of voting machines, a one-stop shop to hack all the voting machines, or just fiddle vote totals after all those carefully and securely operated voting machines come home from the polling places. It’s an “election management system” (EMS) that election officials use to create the data that tells the voting machines what to do, and to combine the vote tally data into the actual election results.

That’s super important.

Nothing wrong with the EMS software itself, except for the very poor choice of creating it to run on a PC platform that by law is locked in time as it was a decade or so ago, and has no meaningful self-defenses in today’s threat environment. As I said, it wasn’t a thoughtful choice – nobody said it would be a good idea to run this really important software on something as easily hacked as anyone’s grandparent’s PC. But it was a pragmatic choice at the time, in the rush to the post-hanging-chads Federally funded voting system replacement derby. We are still stuck with the consequences.

It reminds me of that great old radio show, Hitchhiker’s Guide to the Galaxy, where after stealing what seems like the greatest ship in the galaxy, the starship Heart of Gold, our heroes are stuck in space-time with Eddie Your Ship-Board Computer, “ready to get a bundle of kicks from any program you care to run through me.” The problem, of course, is that while designed to do an improbably large number of useful things, it’s not able to do one very important thing: steer the ship after being asked to run a program to learn why tea tastes good.

Election management systems, voting machines, and other parts of a voting system each have one very important job to do, and should not be able to do anything else. It’s not hard to build systems that way, but that’s not what’s available from today’s 3 vendors in the for-profit market for voting systems, and services to operate them to assist election officials. We can fix that, and we are.

But it’s the election officials, many many of them public servants with a heart of gold, that should really be highlighted. They are making do with what they have, with enormous extra effort to protect these vulnerable systems, and run an election that we all can trust. They deserve better, and we all deserve better: election technology that’s built for elections that are Verifiable, Accurate, Secure, and Transparent (VAST as we like to say). The “better” is in the works, here at OSET Institute and elsewhere, but there is one more key point.

Don’t be demoralized by the fear, uncertainty, and doubt about hacking elections. Vote. These hardworking public servants are running the election for each of us, doing their best with what they have. Make it worth something. Vote, and believe what is true, that you are an essential part of the process that makes our democracy truly a democracy.

— John Sebes

Election Standards – What’s New

The annual meeting of the U.S. elections standards board is this week. In addition to standards board members, several observers are here, and will be reporting. The next few blogs are solely my views (John Sebes), but I’ll do my best to write what I think is a consensus.

However, today I’ll start with a closely related topic, election data standards, because I think it will be helpful to refresh readers’ memory about where standards fit in, and how important they are. I’ll do that by explaining 4 benefits that are under discussion today.


One type of standards-enabled interoperability is data exchange. One system needs data to do its job, and the source data is produced by another system; but the two systems don’t speak the same language to express the data. In election technology, a common example is election results. Commercial election management system (EMS) products produce election definitions and election results data in their own format, because until recently there wasn’t a standard. Election reporting systems need to consume that data, but it’s hard to do because different counties (and other electoral jurisdictions) use different formats. For example, in California, a complete collection of results from all counties would involve 5 different proprietary or legacy formats, perhaps more in cases where two counties use the same EMS product but very different versions.

Large news organizations, as well as academics and other research organizations including the TrustTheVote Project, can put a lot of effort into “data-wrangling” and come up with something that’s nearly uniform. It’s time consuming and error prone, and needs to be done several times as election results get updated from election night to final results. But more to the point, election officials don’t have a ready, re-usable technical capability to “just get the data out.”

Well, now we have a standard for U.S. election definitions and election results (more on that in  reporting from the annual conference this week). What does that mean? In the medium to long term, the vendors of all the EMS products could support the new standard, and consumers of the data (elections organizations themselves, election reporting products, in-house tools of big news organizations, and of course open source systems like VoteStream) can re-tool to use standards-compliant data. But in the short to medium term, elections organizations, and their existing technology base, need the ability to translate from existing formats to the standard. (A big part of our just-restarted work on VoteStream is to create a translator/aggregator toolset for election officials, but more on that as VoteStream reporting proceeds.)


Interoperability by itself is great in some cases, if the issue is mainly getting two systems to talk to one another. For example, at the level of an individual county, election reporting is mostly a matter of data transfer from the EMS that the county uses, to an election result publishing system. Some counties have created a basic web publishing system that consumes results from their EMS. However, it’s not so easy for any other county to re-use such a solution unless they use an EMS that speaks exactly the same lingo.

For another example at the local level, a standards-compliant election definition data set can be a bridge between an EMS that defines the information on each ballot, and a separate system that consumes an election definition and offers election officials the ability to design the layout of paper ballots. (In the TrustTheVote Project, we call that our Ballot Design Studio.) The point here is that data standards can enable innovations in election tech, because various different jobs can be delegated to systems that specialize in that job, and these specialized systems can inter-operate with one another.


Component interoperability by itself is not so great if you’re trying to aggregate multiple datasets of the same kind, but from different sources. Taking election result reporting as the example again, here is a problem faced by consumers of election results. Part of one county votes in one Federal congressional district, and part of another county votes in the same district. Each county’s EMS assigns some internal identifier to each district, but it’s derived from whatever the county folks use; this is true even if an election result is represented in the new VSSC Standard. In one county, the district (and by extension the contest for the representative for the district) might be called the 4th Congressional District, while in the other it might be CD-4. If you’re trying to get results for that one contest, you need to be able to tell that those are the same district and that the results for the contest need to include numbers from both counties.

Currently, consumers of this data have processes for overcoming these challenges, but that ability is limited to each consumer org, in some cases private to that org. But what election officials need from standards is the ability to automatically aggregate disparate data sets.  Ahh, more standards!

This exact issue is one of the things we’re discussing this morning at the standards meeting: a need for a standard way to name election items that span jurisdictions or even elections in a single jurisdiction.
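As a small illustration of the aggregation problem, here is a sketch of what a consumer of results has to do today. Everything in it is assumed for the example: the county names, the vote counts, the `DISTRICT_ALIASES` table, and the canonical identifier `us-house-district-4` are made up, and nothing here comes from the standard itself. The alias table is exactly the piece that a shared naming standard would make unnecessary.

```python
from collections import Counter

# Hypothetical hand-maintained table: (county, local district name) -> shared id.
DISTRICT_ALIASES = {
    ("County A", "4th Congressional District"): "us-house-district-4",
    ("County B", "CD-4"): "us-house-district-4",
}


def aggregate(county_results):
    """Sum votes per candidate across counties for each canonical district."""
    totals = {}
    for row in county_results:
        # A KeyError here flags a local name that nobody has mapped yet.
        district_id = DISTRICT_ALIASES[(row["county"], row["district"])]
        totals.setdefault(district_id, Counter())[row["candidate"]] += row["votes"]
    return totals


# Made-up results from two counties that share one congressional district.
rows = [
    {"county": "County A", "district": "4th Congressional District",
     "candidate": "Smith", "votes": 1200},
    {"county": "County A", "district": "4th Congressional District",
     "candidate": "Jones", "votes": 900},
    {"county": "County B", "district": "CD-4", "candidate": "Smith", "votes": 300},
    {"county": "County B", "district": "CD-4", "candidate": "Jones", "votes": 450},
]
totals = aggregate(rows)
```

With standard identifiers in the data, the lookup table goes away and the aggregation can be automatic rather than private know-how inside each consumer org.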


Combination is closely related to aggregation, except that aggregation combines data sets of the same kind, while combination occurs when we have multiple data sets, each containing different but complementary information about some of the same things. That was one of the challenges we had in VoteStream Alpha: election results referred to precincts (vote counts per precinct), GIS data also (the geo-codes representing a precinct), and voter-registration statistics as well (number of registered voters per precinct, actually several related stats). But many precincts had a different name in each data source! That made it challenging, for example, to report election results in the context of registration and turnout numbers, and to use mapping to visualize variations in registration levels and turnout numbers.

We’ll be showing how to automate the response to such challenges, as part of VoteStream Beta, using the data standards, identifiers, and enumerations under discussion right now.
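To show the shape of the combination problem (not how VoteStream actually solves it), here is a sketch that merges three data sources that each name the same precinct differently. The precinct names, the `CROSSWALK` table, the coordinates, and the field names are all hypothetical.

```python
# Hypothetical crosswalk: per data source, local precinct name -> common id.
CROSSWALK = {
    "results": {"PCT 0012A": "precinct-12a"},
    "gis": {"12-A": "precinct-12a"},
    "registration": {"Precinct 12A": "precinct-12a"},
}


def combine(results, gis, registration):
    """Merge per-precinct records from three sources under one common id."""
    combined = {}
    sources = (("results", results), ("gis", gis), ("registration", registration))
    for source_name, records in sources:
        for local_name, data in records.items():
            common_id = CROSSWALK[source_name][local_name]
            combined.setdefault(common_id, {}).update(data)
    # Turnout needs fields from two different sources, hence the merge.
    for record in combined.values():
        if "ballots_cast" in record and record.get("registered_voters"):
            record["turnout"] = record["ballots_cast"] / record["registered_voters"]
    return combined


# Made-up inputs: one precinct, three names.
combined = combine(
    results={"PCT 0012A": {"ballots_cast": 300}},
    gis={"12-A": {"centroid": (44.95, -93.09)}},
    registration={"Precinct 12A": {"registered_voters": 400}},
)
```

Once the three records line up under one identifier, turnout per precinct (and a map of it) is a simple calculation; the hard part is the crosswalk, which is where standard identifiers and enumerations come in.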


That’s the report from the morning session. More later …

— John Sebes



Money Shot: What Does a $40M Bet on Scytl Mean?

…not much we think.

Yesterday’s news of Microsoft co-founder billionaire Paul Allen’s investing $40M in the Spanish election technology company Scytl is validation that elections remain a backwater of innovation in the digital age.

But it is not validation that there is a viable commercial market for voting systems of the size typically attracting venture capitalists; the market is dysfunctional and small and governments continue to be without budget.

And the challenge of building a user-friendly, secure online voting system that simultaneously protects the anonymity of the ballot is an interesting problem that only an investor of the stature of Mr. Allen can tackle.

We think this illuminates a larger question:

To what extent should the core technology of the most vital aspect of our Democracy be proprietary and black box, rather than publicly owned and transparent?

To us, that is a threshold public policy question, commercial investment viability issues notwithstanding.

To be sure, it is encouraging to see Vulcan Capital and a visionary like Paul Allen invest in voting technology. The challenges facing a successful elections ecosystem are complex and evolving and we will need the collective genius of the tech industry’s brightest to deliver fundamental innovation.

We at the TrustTheVote Project believe voting is a vital component of our nation’s democracy infrastructure and that American voters expect and deserve a voting experience that’s verifiable, accurate, secure and transparent.  Will Scytl be the way to do so?

The Main Thing

The one thing that stood out to us in the various articles on the investment was Scytl’s comments and assertions about their security, with international patents on cryptographic protocols. We’ve been around the space of INFOSEC for a long time and know a lot of really smart people in the crypto field. So, we’re curious to learn more about their IP innovations. And yet that assertion is actually a red herring to us.

Here’s the main thing: transacting ballots over the public packet switched network is not simply about security. It’s also about privacy; that is, the secrecy of the ballot. Here is an immutable maxim about the digital world of security and privacy: there is an inverse relationship, which holds that as security is increased, privacy must be decreased, and vice versa. Just consider any airport security experience. If you want maximum security then you must surrender a bunch of privacy. This is the main challenge of transacting ballots across the Internet, and why that transaction is so very different from banking online or looking at your medical record.

And then there is the entire issue of infrastructure. We continue to harp on this, and still wait for a good answer. If by their own admissions, the Department of Defense, Google, Target, and dozens of others have challenges securing their own data centers, how exactly can we be certain that a vendor on a cloud-based service model or an in-house data center of a county or State has any better chance of doing so? Security is an arms race. Consider the news today about Heartbleed alone.

Oh, and please for the sake of credibility can the marketing machinery stop using the phrase “military grade security?” There is no such thing. And it has nothing to do with quoting ever-larger key sizes. A 128-bit key for a symmetric cipher such as AES is fine, and there is nothing military about it (other than that the military uses it). RSA key sizes are a separate matter, since RSA keys need to be far longer than symmetric keys for comparable strength. Here is an interesting article from some years ago on the sufficiency of current crypto and the related marketing arms race. Saying “military grade” is meaningless hype. Besides, the security issues run far beyond the transit of data between machines.

In short, there is much the public should demand to understand from anyone’s security assertions, international patents notwithstanding.  And that goes for us too.

The Bottom Line

While we laud Mr. Allen’s investment in what surely is an interesting problem, no one should think for a moment that this signals some sort of commercial viability or tremendous growth market opportunity.  Nor should anyone assume that throwing money at a problem will necessarily fix it (or deliver us from the backwaters of Government elections I.T.).  Nor should we assume that this somehow validates Scytl’s “model” for “security.”

Perhaps more importantly, while we need lots of attention, research, development and experimentation, the bottom line to us is whether the outcome should be a commercial proprietary black-box result or an open transparent publicly owned result… where the “result” as used here refers to the core technology of casting and counting ballots, and not the viable and necessary commercial business of delivering, deploying and servicing that technology.

The “VoteStream Files” A Summary

The TrustTheVote Project Core Team has been hard at work on the Alpha version of VoteStream, our election results reporting technology. They recently wrapped up a prototype phase funded by the Knight Foundation, and then forged ahead a bit, to incorporate data from additional counties, provided by participating state or local election officials after the official wrap-up.

Along the way, there has been a series of postings here that together tell a story about the VoteStream prototype project. They start with a basic description of the project in Towards Standardized Election Results Data Reporting and Election Results Reload: the Time is Right. Then there was a series of posts about the project’s assumptions about data, about software (part one and part two), and about standards and converters (part one and part two).

Of course, the information wouldn’t be complete without a description of the open-source software prototype itself, provided in Not Just Election Night: VoteStream.

Actually the project was as much about data, standards, and tools, as software. On the data front, there is a general introduction to a major part of the project’s work in “data wrangling” in VoteStream: Data-Wrangling of Election Results Data. After that were more posts on data wrangling, quite deep in the data-head shed, but still important, because each one is about the work required to take real election data and real election result data from disparate counties across the country, and fit it into a common data format and common online user experience. The deep data-heads can find quite a bit of detail in three postings about data wrangling, in Ramsey County MN, in Travis County TX, and in Los Angeles County CA.

Today, there is a VoteStream project web site with VoteStream itself and the latest set of multi-county election results, but also with some additional explanatory material, including the election results data for each of these counties.  Of course, you can get that from the VoteStream API or data feed, but there may be some interest in the actual source data.  For more on those developments, stay tuned!

Election Results: Data-Wrangling Los Angeles County

LA County CA is the mother of all election complexities, and the data wrangling was intense, even compared to the hardly simple efforts that I reported on previously. There are over 32,000 distinct voting regions, which I think is more than the number of seats, ridings, chairs, and so on, for all the federal and state houses of government in all the parliamentary democracies in the EU.

The LA elections team was marvelously helpful, and upfront about the limits of what they can produce with the aging voting system that they are working hard on replacing. This is what we started with.

  • A nicely structured CSV file listing all the districts in LA county: over 20 different types of district, and over 900 individual districts.
  • Some legacy GIS data, part of which defined each precinct in terms of which districts it is in.
  • The existing legacy GIS data converted into XML standard format (KML), again, kindly created by LA CC-RR IT chief, Kenneth Bennett.
  • A flat text file of all the election results for the 2012 election for every precinct in LA County, and various roll-ups.
  • A sort of Rosetta Stone that is just the Presidential election results, but in a well-structured CSV file, also very kindly generated for us by Kenneth.

You’ll notice that not included is a definition of the 2012 election itself – the contests, which district each contest is for, other info on the contest, info on candidates, referenda, and so on. So, first problem, we needed to reverse engineer that as best as we could, from the election results. But before we could do that, we had to figure out how to parse the flat text file of results. The “Rosetta Stone” was helpful, but we then realized that we needed information about each precinct that reported results in the flat text file. To get the precinct information, we had to parse the legacy GIS data, and map it to the districts definition.

The second problem was that the GIS data wasn’t obvious, but fortunately we had excellent help from Elio Salazar, a member of Ken’s team who specializes in the GIS data. He helped us sort out various intricacies and corner cases. One of the hardest turned out to be the ways in which one district (say, a school district) is a real district used for referenda, but is also sub-divided into smaller districts each being for a council seat. Some cities were subdivided this way into council seats, some not; same for water districts and several other kinds of districts.

Then, as soon as we thought we had clear sailing, it turned out that the districts file had a couple of minor format errors that we had to fix by hand. Plus there were 4 special case districts that weren’t actually used in the precinct definitions, but were required for the election results. Whew! At that point we thought we had a complete election definition including the geo-data of each precinct in KML. But wait! We had over 32,000 precincts defined, but only just shy of 5,000 that reported election results. I won’t go into the details of sub-precincts and precinct consolidation, and how some data was from the 32,000 viewpoint and other data from the 4,993 viewpoint. Or why 4,782 was not our favorite number for several days.

Then the final lap, actually parsing all the 100,000 plus contest results in the flat text file, normalizing and storing all the data, and then emitting it in VIP XML. We thought we had a pretty good specification (only 800 words long) of the structure implicit in the file. We came up with three major special cases, and I don’t know how many little weird cases that turned out not to be relevant to the actual vote counts. I didn’t have the heart to update the specification, but it was pretty complex, and honestly the data is so huge that we could spend many days writing consistency checks of various kinds, and manual review of the input to track down inconsistencies.

In the end, I think we got to a pretty close but probably not perfect rendition of election results. A truly re-usable and reliable data converter would need some follow-on work in close collaboration with several folks in Ken’s team — something that I hope we have the opportunity to do in a later phase of work on VoteStream.

But 100% completeness aside, we still had excellent proof of concept that even this most complex use case did in fact match the standard data model and data format we were using. With some further work using the VIP common data format with other counties, the extended VIP format should be nearly fully baked and ready to work with the IEEE standards body on election data.


Election Results: Data-Wrangling Travis County

Congratulations if you are reading this post, after having even glanced at the predecessor about Ramsey County data wrangling — one of the longer and geekier posts in recent times at TrustTheVote. There is a similar but shorter story about our work with Travis County Texas. As with Ramsey, we started with a bunch of stuff that Travis Elections folks gave us, but rather than do the chapter and verse, I can summarize a bit.

In fact, I’ll cut to the end, and then go back. We were able to fairly quickly develop data converters from the Travis Nov 2012 data to the same standards-based data format we developed for Ramsey. The exception is the GIS data, which we will circle back to later. This was a really good validation of our data conversion approach. If it extends to other counties as well, we’ll be super pleased.

The full story is that Travis elections folks have been working on election result reporting for some time, as have we at TrustTheVote Project, and we’ve learned a lot from their efforts. Because of those efforts, Travis has worked extensively on how to use the data export capabilities of their voting system product’s election management system. They have enough experience with their Hart Intercivic EMS that they know exactly the right set of export routines to use to dump exactly the right set of files. We then developed data converters to chew up the files and spit out VIP XML for the election definitions, and also a form of VIP XML for the vote tallies.

The structure of the export data roughly corresponds to the VIP schema: one flat TXT file that presents a list of each of the 7 kinds of basic item (precinct, contest, etc.) that we represent as VIP objects, and 4 files that express relations between types of objects, e.g. precincts and districts, or contests and districts. As with Ramsey, the district definitions were a bit sticky. The Travis folks provided a spreadsheet of districts that was a sort of extension of the exports file about districts. We had to extend the extensions a bit, for reasons similar to those outlined in the previous account of Ramsey data-wrangling. The rest of the files were a bit crufty, with nothing to suggest the meaning of the column entries other than the name of the file. But with the raw data and some collegial help from Travis elections folks, it mapped pretty simply to the standard data format.

There was one area though, where we learned a lot more from Travis. In Travis with their Hart system, they are able to separately track vote tallies for each candidate (of course, that’s the minimum) as well as: write-ins, non-votes that result from a ballot with no choice on it (under-votes), and non-votes that result from a ballot with too many choices (over-votes). That really helped extend the data format for election results, beyond what we had from Ramsey. And again, this larger set of results data fit well into our use of the VIP format.

That sort of information helps total up the tallies from each individual precinct, to double check that every ballot was counted. But there is also supplementary data that helps even more, noting whether an under-vote or over-vote was from early voting, absentee voting, in person voting, etc. With further information about rejected ballots (e.g. unsigned provisional ballot affidavits, late absentee ballots), one can account for every ballot cast (whether counted or rejected), every ballot counted, every ballot in every precinct, every vote or non-vote from individual ballots, and so on, to get a complete picture down to the ground in cases where there are razor thin margins in an election.
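
The double check described here is simple arithmetic, and a small Python sketch may make it concrete. It assumes a vote-for-one contest, where each ballot cast in a precinct lands in exactly one bucket (a candidate, a write-in, an under-vote, or an over-vote); the function and parameter names are hypothetical and not part of the VIP format or the Hart export.

```python
def unaccounted_ballots(ballots_cast, candidate_votes, write_ins, under_votes, over_votes):
    """Return ballots cast minus every recorded outcome for one contest in one precinct.

    Assumes a vote-for-one contest, so each ballot contributes exactly one outcome.
    Zero means every ballot is accounted for; anything else needs a closer look.
    """
    accounted = sum(candidate_votes.values()) + write_ins + under_votes + over_votes
    return ballots_cast - accounted
```

For example, 100 ballots with candidate tallies of 60 and 30, plus 2 write-ins, 5 under-votes and 3 over-votes, leaves nothing unaccounted for. Contests that allow more than one choice would need a different rule.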

We’re still digesting all of that, and will likely continue for some time as we continue our election-result work beyond the VoteStream prototype effort. But even at this point, we think that we have the vote-tallies part of the data standard worked out fairly well, with some additional areas for on-going work.


Election Results: Data-Wrangling Ramsey County

Next up are several overdue reports on data wrangling of county level election data, that is, working with election officials to get legacy data needed for election results; and then putting the data into practical use. It’s where we write software to chew up whatever data we get, put it in a backend system, re-arrange it, and spit it out all tidy and clean, in a standard election data format. From there, we use the standard-format data to drive our prototype system, VoteStream.

I’ll report on each of 3 and leave it at that, even though since then we’ve forged ahead on pulling in data from other counties as well. These reports from the trenches of VoteStream will be heavy on data-head geekery, so no worries if you want to skip them because that’s not your cup of tea. For better or for worse, however, this is the method of brewing up data standards.

I’ll start with Ramsey County, MN, which was our first go-round. The following is not a short or simple list, but here is what we started with:

  • Some good advice from: Joe Mansky, head of elections in Ramsey County, Minnesota; and Mark Ritchie, Secretary of State and head of elections for Minnesota.
  • A spreadsheet from Joe, listing Ramsey County’s precincts and some of the districts they are in; plus verbal info about other districts that the whole county is in.
  • Geo-data from the Minnesota State Legislative GIS office, with a “shapefile” for each precinct.
  • More data from the GIS office, from which we learned that they use a different precinct-naming scheme than Ramsey County.
  • Some 2012 election result datasets, also from the GIS office.
  • Some 2012 election result datasets from the MN SoS web site.
  • Some more good advice from Joe Mansky on how to use the election result data.
  • The VIP data format for expressing info about precincts and districts, contests and candidates, and an idea for extending that to include vote counts.
  • Some good intentions for doing the minimal modifications to the source data, and creating a VIP standard dataset that defines the election (a JEDI in our parlance, see a previous post for explanation).
  • Some more intentions and hopes for being able to do minimal modifications to create the election results data.

Along the way, we got plenty of help and encouragement from all the organizations I listed above.

Next, let me explain some problems we found, what we learned, and what we produced.

  • The first problem was that the county data and GIS data didn’t match, but we connected the dots, and used the GIS version of precinct IDs, which use the national standard, FIPS.
  • County data didn’t include statewide districts, but the election results did. So we again fell back on FIPS, and added standards-based district IDs. (We’ll be submitting that scheme to the standards bodies, when we have a chance to catch our breath.)
  • Election results depend on an intermediate object called “office” that links a contest (say, for state senate district 4) to a district (say, the 4th state senate district), via an office (say, the state senate seat for district 4), rather than a direct linkage. Sounds unimportant, but …
  • The non-local election results used the “office” to identify the contest, and this worked mostly OK. One issue was that the U.S. congress offices were all numbered, but without mentioning MN. This is a problem if multiple states report results for “Representative, 1st Congressional District” because all states have a first congressional district. Again, more hacking the district ID scheme to use FIPS.
  • The local election results did not work so well. A literal reading of the data seemed to indicate that each town in Ramsey County in the Nov. 2012 election had a contest for mayor — the same mayor’s office. Ooops! We needed to augment the source data to make plain *which* mayor’s office the contest was for.
  • Finally, still not done, we had a handful of similarly ambiguous data for offices other than mayor, that couldn’t be tied to a single town.
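
To illustrate the "1st Congressional District" problem, here is a minimal Python sketch of a FIPS-qualified district ID. The ID layout (state FIPS code, district type, zero-padded number) is made up for this example and is not the scheme we actually used; the only real-world facts in it are the state FIPS codes, 27 for Minnesota and 55 for Wisconsin.

```python
def district_id(state_fips, district_type, number):
    """Build an ID that stays unique when results from several states are combined.

    state_fips: two-digit state FIPS code as a string, e.g. "27" for Minnesota.
    district_type: short hypothetical label such as "cd" for congressional district.
    number: the district number within that state.
    """
    return f"{state_fips}-{district_type}-{number:02d}"


# Both states have a 1st congressional district, but the IDs no longer collide.
mn_first = district_id("27", "cd", 1)
wi_first = district_id("55", "cd", 1)
```

The same idea extends to local offices such as mayor, where the qualifier would be the town rather than the state.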

One last problem, for the ultra data-heads. Turns out that some precincts are not a single contiguous geographical region, but a combination of 2 that touch only at a point, or (weirder) aren’t directly connected. So our first cut at encoding the geo-data into XML (for inclusion in VIP datasets) wasn’t quite right, and the Google maps view of the data had holes in it.
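
One way KML can express a precinct made of separate pieces is a MultiGeometry element holding one Polygon per piece. The Python sketch below builds such a Placemark with the standard library; it shows the general technique only, is not necessarily how our encoding ended up, and leaves out the KML namespace and the enclosing Document for brevity. The function name is hypothetical.

```python
import xml.etree.ElementTree as ET


def precinct_placemark(name, rings):
    """Build a KML Placemark with one Polygon per disconnected piece.

    rings: list of rings, each a list of (lon, lat) tuples whose first point
    is repeated at the end, as a KML LinearRing requires.
    """
    placemark = ET.Element("Placemark")
    ET.SubElement(placemark, "name").text = name
    multi = ET.SubElement(placemark, "MultiGeometry")
    for ring in rings:
        polygon = ET.SubElement(multi, "Polygon")
        boundary = ET.SubElement(polygon, "outerBoundaryIs")
        linear_ring = ET.SubElement(boundary, "LinearRing")
        coords = ET.SubElement(linear_ring, "coordinates")
        # KML coordinates are "lon,lat" tuples separated by whitespace.
        coords.text = " ".join(f"{lon},{lat}" for lon, lat in ring)
    return placemark
```

Calling ET.tostring on the result gives the XML fragment for that one precinct.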

So, here is what we learned.

  • We had to semi-invent some naming conventions for districts, contests, and candidates, to keep separate everything that was actually separate, and to disambiguate things that sounded the same but were actually different. It’s actually not important if you are only reporting results at the level of one town, but if you want to aggregate across towns, counties, states, etc., then you need more. What we have is sufficient for our needs with VoteStream, but there is real room for more standards like FIPS to make a scheme that works nationwide.
  • Using VIP was simple at first, but when we added the GIS data, and used the XML standard for it (KML), there was a lot of fine-tuning to get the datasets to be 100% compliant with the existing standards. We actually spent a surprising amount of time testing the data model extensions and validations. It was worth it, though, because we have a draft standard that works, even with those wacky precincts shaped like east and west Prussia.
  • Despite that, we were able to finish the data-wrangling fairly quickly and use a similar approach for other counties — once we figured it all out. We did spend quite a bit of time mashing this around and asking other election officials how *their* jurisdictions worked, before we got it all straight.

Lastly, here is what we produced. We now have a set of data conversion software that we can use to take the starting data listed above and produce election definition datasets in a repeatable way, making the most effective use of existing standards. We also had a less settled method of data conversion for the actual results (e.g., for precinct 123, for contest X, for candidate Y, there were Z votes), similar for all precincts and all contests. That was sufficient for the data available in MN, but not yet sufficient for additional info available in other states but not in MN.
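
The results records described above are easy to picture as tuples, and the Python sketch below rolls them up from precinct level to contest level. The tuple layout and the function name are assumptions for illustration, not the VIP XML representation itself.

```python
from collections import Counter


def roll_up(records):
    """Sum precinct-level counts into totals per (contest, candidate).

    records: iterable of (precinct, contest, candidate, votes) tuples.
    """
    totals = Counter()
    for _precinct, contest, candidate, votes in records:
        totals[(contest, candidate)] += votes
    return totals
```

A roll-up like this only works if contest and candidate names are unambiguous across precincts, which is exactly why the naming conventions mattered.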

The next steps are: tackle other counties with other source data, and wrangle the data into the same standards-based format for election definitions; extend the data format for more complex results data.

Data wrangling the Nov 2012 Ramsey County election was very instructive, and we couldn’t have done it without plenty of help, for which we are very grateful!