Tagged ballots

Accurate Election Results in Michigan and Wisconsin is Not a Partisan Issue


Courtesy, Alex Halderman Medium Article

In the last few days, we’ve been getting several questions that are variations on:

Should there be recounts in Michigan in order to make sure that the election results are accurate?

For the word “accurate” people also use any of:

  • “not hacked”
  • “not subject to voting machine malfunction”
  • “not the result of tampered voting machines”
  • “not poorly operated voting machines” or
  • “not falling apart unreliable voting machines”

The short answer to the question is:

Maybe a recount, but absolutely there should be an audit because audits can do nearly anything a recount can do.

Before explaining that key point, a nod to University of Michigan computer scientists pointing out why we don’t yet have full confidence in the election results in their State’s close presidential election, and possibly other States as well. A good summary is here and an even better explanation is here.

A Basic Democracy Issue, not Partisan

The not-at-all partisan or even political issue is election assurance – giving the public every assurance that the election results are the correct results, despite the fact that bug-prone computers and human error are part of the process. Today, we don’t know what we don’t know, in part because the current voting technology not only fails to meet the three (3) most basic technical security requirements, but really doesn’t support election assurance very well. And we need to solve that! (More on the solution below.)

A recount, however, is a political process and a legal process that’s hard to see as anything other than partisan. A recount can happen when one candidate or party looks for election assurance and does not find it. So it is really up to the legal process to determine whether to do a recount.

While that process plays out let’s focus instead on what’s needed to get the election assurance that we don’t have yet, whether it comes via a recount or from audits — and indeed, what can be done, right now.

Three Basic Steps

Leaving aside a future in which the basic technical security requirements can be met, right now, today, there is a plain pathway to election assurance of the recent election. This path has three basic steps that election officials can take.

  1. Standardized Uniform Election Audit Process
  2. State-Level Review of All Counties’ Audit Records
  3. State Public Release of All Counties Audit Records Once Finalized

The first step is the essential auditing process that should happen in every election in every county. Whether we are talking about the initial count, or a recount, it is essential that humans do the required cross-check of the computers’ work to detect and correct any malfunction, regardless of origin. That cross-check is a ballot-polling audit, where humans manually count a batch of paper ballots that the computers counted, to see if the human results and machine results match. It has to be a truly random sample, and it needs to be statistically significant, but even in a close election, it is far less work than a recount. And it works regardless of how a machine malfunction was caused, whether hacking, manipulation, software bugs, hardware glitches, or anything else.
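As a rough illustration of that cross-check, here is a minimal Python sketch. It assumes ballots are stored in identifiable batches and that per-batch tallies are available as plain dictionaries; the function names, the batch IDs, and the seed are all hypothetical, not taken from any real election system. A real audit would also size the sample statistically and use a stronger, publicly verifiable random generator than the one used here.

```python
import random

def select_audit_batches(batch_ids, sample_size, seed):
    """Pick batches to hand count, reproducibly, from a publicly announced seed."""
    rng = random.Random(seed)  # assumption: fine for a sketch, too weak for a real audit
    return sorted(rng.sample(sorted(batch_ids), sample_size))

def find_discrepancies(machine_counts, hand_counts):
    """Return {batch: {candidate: (machine, hand)}} wherever the two tallies differ."""
    problems = {}
    for batch, hand in hand_counts.items():
        machine = machine_counts[batch]
        diffs = {
            cand: (machine.get(cand, 0), hand.get(cand, 0))
            for cand in set(machine) | set(hand)
            if machine.get(cand, 0) != hand.get(cand, 0)
        }
        if diffs:
            problems[batch] = diffs
    return problems
```

An empty result from `find_discrepancies` means the hand count and the machine count agree for every sampled batch; anything else is the kind of anomaly that the state-level review would need to look at.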

This first step should already have been taken by each county in Michigan, but at this point it is hard to be certain. Though less work than a recount, a routine ballot polling audit is still real work, and made harder by the current voting technology not aiding the process very well. (Did I mention we need to solve that?)

The second step should be a state-level review of all the records of the counties’ audits. The public needs assurance that every county did its audit correctly, and further, documented the process and its findings. If a county can’t produce detailed documentation and findings that pass muster at the State level, then alas the county will need to re-do the audit. The same would apply if the documentation turned up an error in the audit process, or a significant anomaly in a difference between the human count and the machine count.

That second step is not common everywhere, and the third step would be unusual, but very beneficial and a model for the future: when a State is satisfied that all counties’ election results have been properly validated by ballot polling audit, the State elections body could publicly release all the records of all the counties’ audit process. Then anyone could independently come to the same conclusion as the State did, but especially election scientists, data scientists, and election tech experts. I know that Michigan has diligent and hardworking State election officials who are capable of doing all this, and indeed do much of it as part of the process toward the State election certification.

This Needs to Be Solved – and We Are

The fundamental objective for any election is public assurance in the result.  And where the election technology is getting in the way of that happening, it needs to be replaced with something better. That’s what we’re working toward at the OSET Institute and through the TrustTheVote Project.

No one wants the next few years to be dogged by uncertainty about whether the right person is in the Oval Office or the Senate. That will be hard for this election because of the failing voting machines that were not designed for high assurance. But America must say never again, so that in two short years and four years from now, we have election infrastructure in place that was designed from the ground up and purpose-built to make it far easier for election officials to deliver election results and election assurance.

There are several matters to address:

  • Meeting the three basic security requirements;
  • Publicly demonstrating the absence of the vulnerabilities in current voting technology;
  • Supporting evidence-based audits that maximize confidence and minimize election officials’ efforts; and
  • Making it easy to publish detailed data in standard formats, that enable anyone to drill down as far as needed to independently assess whether audits really did the job right.

All that and more!

The good news (in a shameless plug for our digital public works project) is that’s what we’re building in ElectOS. It is the first openly public and freely available set of election technology; an “operating system” of sorts for the next generation of voting systems, in the same way that Android is the basis for much of today’s mobile communication and computing.

— John Sebes

Vote-Flipping in Pennsylvania is Not the Problem, But Recounts?

The reports of “vote flipping” on voting machines in PA are certainly alarming to the voters using the machines, but it’s unfortunate that there are calls to treat it as a law enforcement issue. It’s a known issue with the decade-or-older flakey touch screens, and one that local election officials deal with in most elections. In some cases it may be user error; in others, a result of poor screen calibration. Sometimes the appearances are even more problematic, as with a mis-recorded straight-party vote, which affects every contest on the ballot.

Though voters and poll workers may disagree on what actually happened in these cases, what’s not controversial is the small scale — about 24 out of 24,000 machines statewide; only one voter affected per machine; and in at least some of these cases, the voter admitted that after some work, they got their votes recorded properly.

So concerns about “rigging” of individual machines are misplaced. Even leaving aside the technical fact that these are electro-mechanical issues (not riggable software), it’s a poor choice for rigging to choose a method that’s apparent to the voters, and in such small numbers.

But suppose that the resolution of the PA election depends in part on refuting claims of rigging? These machines do have real problems: with no paper trail, there is no way to re-check the voters’ choices. A recount is, in one sense, an exercise in re-doing or rerunning the addition of the vote tallies from each machine. But it’s more complicated than that.

In each county with these paperless touch-screen machines, for each machine, the election officials have to maintain records of custody of the machines and their removable data cartridges, with record-keeping procedures sufficient to withstand substantial challenges. It’s not impossible to refute claims of rigging in these circumstances, but it is grindingly detailed work, and with a lot of grist for the mill of legal challenges.

— John Sebes

Old School, New Tech: What’s Really Behind Today’s Elections

Many thanks to coverage by Bloomberg’s Michaela Ross, on election tech and cyber-security.

Given so much at stake for this election with its credibility rocked by claims of rigging, and so much more at stake as we move ahead to replace and improve our election infrastructure, I’m rarely enthused about reading more about how some people think Internet voting is great, and others think it is impossible.  However, Ms. Ross did a great job of following that discussion about how “Old School May Be Better” with supporting remarks from many long time friends and colleagues in election administration and technology worlds.

Where I’d like to respond is to re-frame the “old” part of “old school” and to reject one remark from a source that Ross quoted: “They’re pretending what we do today is secure … There’s not a mission critical process in the world that uses 150-year-old technology.” Three main points here:

  1. There is plenty of new technology in the so-called old school;
  2. No credible election expert pretends that our ballots are 100% secure, not even close; and
  3. That’s why we have several new and old protections on the election process, including some of that new technology.

Let me address that next in three parts, mostly about what’s old and what’s new, then circle back to the truth about security, and lastly a comment on iVoting that I’ll mostly defer to a later re-up on the iVoting scene.

Old and New

Here is what’s old: paper ballots. We use them because we recognize the terrible omission in voting machines from the late 19th century mechanical lever machines (can be hacked with toothpicks, tampered with screwdrivers, and retain no record of any voter’s intent other than numbers on odometer dials) and many of today’s paperless touchscreens: “hack-able” and “tamper-able” even more readily, and likewise with no actual ballot other than bits on a disk. We use paper ballots (or paper-added touchscreens as a stop-gap) because no machine can be trusted to accurately record every voter’s intent. We need paper ballots not just for disputes and recounts, but fundamentally as a way to cross check the work of the machines.

Here’s what’s new: recently defined scientific statistical methods to conduct a routine ballot audit for every election, to cross check the machines’ work, with far less effort and cost than today’s “5% manual count and compare” and variant methods used in some states. It’s never been easier to use machines for rapid counts and quick unofficial results, and then (before final results) to detect and correct instances of machine inaccuracies whether from bugs, tampering, physical failure, or other issues. It’s called Risk Limiting Audit or RLA.
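To give a feel for how such an audit can stop early, here is a minimal sketch of a sequential ballot-polling test in the style of the published BRAVO method, simplified to a two-candidate contest. The function name, its parameters, and the "W"/"L" encoding of hand-read ballots are assumptions made for illustration; this is not a description of any particular state’s procedure.

```python
import random

def ballot_polling_audit(ballots, reported_winner_share, risk_limit=0.05,
                         max_draws=1000, seed=0):
    """Draw ballots at random (with replacement) until the evidence is strong enough.

    ballots: hand-read paper ballots, "W" for the reported winner, "L" for the loser.
    reported_winner_share: the winner's share of the two-candidate vote per the machines.
    Returns (confirmed, draws). confirmed=False means: escalate, e.g. to a full hand count.
    """
    if not 0.5 < reported_winner_share < 1:
        raise ValueError("reported winner share must be between 0.5 and 1")
    rng = random.Random(seed)  # assumption: a real audit would use a public, verifiable seed
    ratio = 1.0                # likelihood ratio: reported result vs. an exact tie
    for draws in range(1, max_draws + 1):
        ballot = ballots[rng.randrange(len(ballots))]
        if ballot == "W":
            ratio *= reported_winner_share / 0.5
        else:
            ratio *= (1 - reported_winner_share) / 0.5
        if ratio >= 1 / risk_limit:
            return True, draws
    return False, max_draws
```

The wider the reported margin, the fewer ballots need to be examined before the test stops, which is why this kind of audit is usually far cheaper than a fixed-percentage manual count.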

Here’s what’s new-ish: the new standard approach is for paper ballots to be rapidly machine counted using optical scanners and digital image processing software. There are a lot of old clunky and expensive (to buy, maintain, and store) op-scanners still in use, but this isn’t “150 years old,” any more than our modern ballots are like the old 19th-century party-machine-politics balloting that was rife with fraud that led to the desire for the old lever machines. However, these older machines have low to no support for RLA.

Here’s what’s newer: many people have mobile computers in their pocket that can run optical-capture and digital image processing. It’s no longer a complicated job to make a small, inexpensive device that can read some paper, record what’s on it, and retain records that humans can cross check. There’s no reason why the op-scan method needs to be old and clunky. And with new systems, it is easy to keep the type of records (technically, a “cast vote record” for each ballot) needed for easy support for RLA.

And finally, here’s the really good part: innovation is happening to make the process easier and stronger, both here at the OSET Institute and elsewhere ranging from local to state election officials, Federal organizations like EAC and NIST, universities, and other engines of tech innovation. The future looks more like this:

  • Polling place voting machines called “ballot marking devices” that use a familiar inexpensive tablet to collect a voter’s ballot choices, and print them onto a simple “here’s all and only what you chose” ballot to be easily and independently verified by the voter, and cast for optical scanning.
  • Devices and ballots with professionally designed and scientifically tested usability and accessibility for the full range of voters’ needs.
  • Simple inexpensive ballot scanners for these modern ballots.
  • Digital sample ballots using the voter’s choice of computer, tablet, or phone, to enable the voter to take their own time navigating the ballot, and creating a “selections worksheet” that can be scanned into a ballot marking device to confirm, correct if needed, and create the ballot cast in a polling place, or to be used in a vote-by-mail process, without the need to wait for an official blank ballot to arrive in the mail.
  • And below that tip of the iceberg for the critical ballot-related operations, there is a range of other innovations to streamline voter registration, voter check-in, absentee ballot processing, voter services and apps to navigate the whole process and avoid procedural hurdles or long lines, interactive election results exploration and analytics, and more, all with the ability for election officials to provide open public data on the outcome of the whole election process, and every voter’s success in participation or lack thereof.

That’s a lot of new tech that’s in the pipeline or in use already, but still in the old school.

Finally, two last points to loop back to Michaela’s article.

Election Protection in the Real World

First, everyone engaged in elections knows that no method of casting and counting ballots is secure.

  • Vote by mail ballots go to election officials by mail passing through many hands, not all of which may seem trustworthy to the voters.
  • Email ballots and other digital ballots go to election officials via the Internet — again via many “virtual hands” that are definitely not trustworthy — and to computers that election officials may not fully control.
  • Polling place ballots in ballot boxes are transported by mere mortals who can make mistakes, encounter mishaps, and as in a very few recent historical cases, may be dishonest insiders.
  • Voting machines are easily tampered with by those with physical access, including temp workers and contractors in warehouses, transportation services, and pre-election preparations.
  • The “central brains” behind the voting machines is often an ordinary antique PC with no real protection in today’s daunting threat environment.
  • The beat goes on with voter records systems, electronic poll books, and more.

That’s why today’s election officials work so hard on the people and processes to contain these risks, and retain control over these vital assets throughout a complex process that — honestly, going forward — could be a lot simpler and easier with innovations designed to reduce the level of effort and complexity of these same type of protections.

The Truth About iVoting Today

Secondly, lastly, and mostly for another time: Internet voting. It’s desirable, it will likely happen someday, and it will require a solid R&D program to invent the tech that can do the job with all the protections (whether against fraud, coercion, manipulation, or accidental or intentional disenfranchisement) that we have today in our state-managed, locally-operated, and (delightfully but often frustratingly) hodgepodge process of voting in 9,000+ jurisdictions across the US.  I repeat, all, no compromises; no waving the magic fairy wands of trust-me-it-works-because-it-is-cool or blockchains or so-called “military grade” encryption or whatever the latest cool geek cred item is.

In the meantime, short-term, we have to shore up the current creaky systems and process, especially to address the issues of “rigging,” and the crazy amount of work election professionals have to do to get the job done and maintain order and trust.

And then we have to replace the current systems in the existing process with innovations that also serve to increase trust and transparency. If we don’t fix the election process that we have now, and soon, we risk the hasty addition of i-voting systems that are just as creaky and flawed, hastily adopted, and poorly understood, the same as the paperless voting machines that were adopted more than a decade ago.

We can do better, in the short-term and long, and we will.  A large and growing set of election and technology folks, in organizations of many kinds, are dedicated to making these improvements happen, especially as this election cycle has shown us all how vitally important it is.

— John Sebes

Election Standards, Day 2

The second day of the annual meeting of the Voting System Standards Committee was Friday Feb. 6. I’m concluding my reporting on the meeting with a round up of existing and proposed standards activity that we discussed on day 2. Each item below is about an existing working group, a group in formation, or proposed groups.

Election Process Modeling

This working group isn’t making a standard, but rather a guideline: a semi-formal model for the typical or common processes for U.S. election administration and election operations. The intent here is to document, in a structured manner, the various use cases where there needs to be data transfer from one system, component, or process to another. That will make it much easier to identify and prioritize such cases where data interoperability standards may be needed; from there, folks may choose to form a working group to address some of these needs.

This group is well along under the leadership of LA’s Kenneth Bennett, but still it’s a work in progress. I’ll be reporting more as we go along.

Digital Poll Books

This newly formed working group, led by Ohio’s John Dziurlaj, will develop a standard data format for digital poll books. The starting point is to define a format that can accommodate data interchange between a voter registration system and a digital pollbook: for example, a list of voters, each one with a name, address, and voter status (e.g. in person vs. absentee voter). The reverse flow (all that plus a note of whether each voter checked in to vote, when, etc.) is included as well of course, but there are some other subtler issues. For example, it’s not enough to simply provide the data in that reverse flow; you need to also include data that ensures that the check-in records are from a legitimate source, and not modified. Without that, systems would be vulnerable to tampering that causes some ballots to be counted that shouldn’t, and vice versa. Also, not every pollbook does its job based on purely local pollbook records. Some rely on a callback to a central system that co-ordinates information flow among lots of digital pollbooks, and there are several hybrid models as well.
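One common way to show that a record came from a legitimate source and was not modified is to attach a keyed message authentication code. The sketch below is only an illustration of that idea, not anything the working group has decided: the record fields, the shared key, and the function names are hypothetical, and a real format might well use digital signatures instead of a shared secret.

```python
import hashlib
import hmac
import json

def sign_checkin(record: dict, key: bytes) -> str:
    """Return a hex HMAC-SHA256 tag over a canonical JSON encoding of the record."""
    payload = json.dumps(record, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hmac.new(key, payload, hashlib.sha256).hexdigest()

def verify_checkin(record: dict, tag: str, key: bytes) -> bool:
    """True only if the record is unchanged and was tagged with the same key."""
    return hmac.compare_digest(sign_checkin(record, key), tag)
```

The pollbook would send each check-in record together with its tag; the receiving system recomputes the tag and rejects any record where the two differ.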

Also, there are privacy issues. In the paper world, every pollbook record was legally a public document, without including what we would now call “personal identifying information” (PII). More recently, with strong voter ID requirements, a voter check-in needs to include a comparison of a presented ID number (such as a driver’s license number) with the ID number that’s part of a voter’s registration record. Today, such ID numbers are often included in e-pollbook data, but that’s not ideal because each e-pollbook becomes a trove of PII at risk. In the upcoming data standards work, we may be able to include some optional privacy guards, like a way to store PII in cryptographically hashed form, to protect privacy but still enable a valid equivalence check, just the same way that stored-password systems do.
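Here is a minimal sketch of that stored-password style of check, using a salted, deliberately slow hash from Python’s standard library. The function names, the normalization rule, and the iteration count are assumptions for illustration. One caveat worth stating: ID numbers are much easier to guess than good passwords, so hashing reduces the exposure of the data but does not remove it.

```python
import hashlib
import hmac
import os
from typing import Optional, Tuple

ITERATIONS = 200_000  # assumed work factor, not taken from any standard

def protect_id(id_number: str, salt: Optional[bytes] = None) -> Tuple[bytes, bytes]:
    """Return (salt, digest) to store in the pollbook instead of the raw ID number."""
    if salt is None:
        salt = os.urandom(16)
    # Assumed normalization: drop whitespace and ignore letter case.
    normalized = "".join(id_number.split()).upper().encode("utf-8")
    digest = hashlib.pbkdf2_hmac("sha256", normalized, salt, ITERATIONS)
    return salt, digest

def id_matches(presented: str, salt: bytes, stored_digest: bytes) -> bool:
    """Equivalence check at voter check-in, without ever storing the ID itself."""
    _, candidate = protect_id(presented, salt)
    return hmac.compare_digest(candidate, stored_digest)
```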

Voting Methods Models

This newer group, led by Laura Massa-Lochridge, is also creating not a standard but a guideline to be used as a “standard” reference for other work. In this group, the focus is on the various individual approaches to voting on a ballot item and counting the votes. A familiar one is “vote for one” where the candidate with the most votes is the winner. Also familiar and well understood is “vote for N out of M”. Further, each of these has different semantics; for example, some vote-for-one contests have no winner when no candidate reaches a threshold, thus triggering a run-off. Familiar to some, but not so well understood, is “instant run-off”. In fact there are different flavors of IRV, and in some cases it is not actually obvious which one is wanted or used in a particular jurisdiction. From there we get into heavy-duty election geek-dom with ranked choice voting and single transferable vote.

The goal of this working group is to develop a formal mathematical model specifying precisely what’s meant for each variation of each voting method, with consensus from all who choose to participate, with process, oversight, and validation from an official international standards body. The result should be a great reference for elections officials and legislators to refer to, instead of (as is common now) simply referring to voting method by name, or by writing a counting algorithm into law.
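To see why a name alone does not pin down a voting method, here is a minimal sketch of one possible flavor of instant run-off. It bakes in three choices that other flavors make differently, and all three are assumptions of this sketch: a majority is measured against ballots that still rank a continuing candidate, all candidates tied for last place are eliminated together, and a complete tie yields no winner.

```python
from collections import Counter

def instant_runoff(ballots):
    """ballots: list of rankings, each a list of candidate names, most preferred first.

    Returns the winner, or None if there are no usable ballots or a complete tie.
    """
    remaining = {cand for ranking in ballots for cand in ranking}
    while remaining:
        # Each ballot counts for its highest-ranked candidate still in the race.
        tally = Counter()
        for ranking in ballots:
            for choice in ranking:
                if choice in remaining:
                    tally[choice] += 1
                    break
        active = sum(tally.values())
        if active == 0:
            return None
        leader, votes = tally.most_common(1)[0]
        if votes * 2 > active:  # strict majority of continuing ballots
            return leader
        fewest = min(tally.get(cand, 0) for cand in remaining)
        losers = {cand for cand in remaining if tally.get(cand, 0) == fewest}
        if losers == remaining:  # everyone is tied: this flavor declares no winner
            return None
        remaining -= losers
    return None
```

Changing any one of those three choices gives a different, equally plausible “instant run-off”, which is the kind of ambiguity a precise model is meant to remove.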

Voting Machine Event Logs

This standard is nearly done, thanks to the leadership of NIST’s John Wack. It deserves more than a bit of explanation because it is a great example of both how the standards work, and the value of standard open data.

Every voting system has components for casting or counting ballots, and U.S. requirements for them include the requirement to do some logging of events that would provide researchers with data to analyze in order to assess how well the components operate, how effectively voters are able to use them, or so forth. Every product does some kind of logging, but each one’s log data format is in a different, proprietary format. So, VSSC has a data standard for log data, to enable vendors to provide logs in a common format that enables log analysis tools to combine and collate data from various systems.

So far, not the most thrilling part of standards work, but necessary to ensure that techies can understand what’s going wrong (or right) with real systems in operation. As many are aware, the current crop of voting system products do seem to misbehave during elections, and it’s important for tech assessment to learn whether there really were any faults (as opposed to operator error) and if so what. However, the curious part of this standard is not that it provides a standard format for data common to pretty much every system (that’s why we call them common data formats!) like date/time, event code, event description, etc. Rather, the curious part is that it doesn’t try to provide a complete enumeration of all common events. Sure, most systems have an event that means “Voter cast the ballot” or “Completed scanning a ballot” but one vendor may call this “event 37” and another “event 29”.

Why not enumerate these in the standard? Well, for one thing, it is hard to get a complete list, and as systems add more logging capabilities over time, the list grows. We want to issue the standard now, and don’t want to bake into it an incomplete list. (Once a standard is issued, it is some work to update it, and typically a standards group would prefer to use their efforts to standardize new stuff rather than revise old standards.) So the approach taken is different. It’s typical of many of the standards we’re working on, which is why I want to explain it for this standard. The approach is to have a particular part of the data format that’s expected to be filled by an event identifier that could be one of a canonical list defined elsewhere. It’s like the standard is saying “this ID field is just a string, but systems can choose to fill it with a string that’s from some canonical list that’s beyond the scope of this standard.” Also, the data format allows for a sort of glossary to be part of a dataset, to enable a dataset to essentially say “you’re going to see a bunch of event 37’s and in my lingo that means voter cast a ballot.”

The intent of course is that systems that conform to the standard will also choose to use this canonical list, which can grow over time, without requiring modifications to the standard. That’s nice but it begs the question: who maintains this list, and how does the maintainer allow people to submit additions to it? Good question. No answer yet, but that’s not a barrier to using the standard, and the early adopters will in essence start the list and figure out who is going to manage it.
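The approach described above (an ID field that is just a string, an optional glossary inside the dataset, and an optional canonical list maintained elsewhere) can be sketched roughly as follows. The class and field names are hypothetical and are not the actual VSSC format; the point is only to show how a log analysis tool could collate events from two vendors that use different local event codes.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Dict, List, Optional

@dataclass
class LogEvent:
    timestamp: str         # date/time of the event, e.g. an ISO 8601 string
    event_id: str          # just a string, as far as the format is concerned
    description: str = ""

@dataclass
class EventLog:
    device: str
    glossary: Dict[str, str] = field(default_factory=dict)  # local event ID -> meaning
    events: List[LogEvent] = field(default_factory=list)

    def meaning(self, event: LogEvent, canonical: Optional[Dict[str, str]] = None) -> str:
        """Resolve an ID: the dataset's glossary first, then an external list, else the raw ID."""
        if event.event_id in self.glossary:
            return self.glossary[event.event_id]
        if canonical and event.event_id in canonical:
            return canonical[event.event_id]
        return event.event_id

def collate(logs: List[EventLog], canonical: Optional[Dict[str, str]] = None) -> Counter:
    """Count events by resolved meaning across logs from different systems."""
    counts: Counter = Counter()
    for log in logs:
        for event in log.events:
            counts[log.meaning(event, canonical)] += 1
    return counts
```

With one log whose glossary maps “37” to a ballot-cast meaning and another whose glossary maps “29” to the same meaning, `collate` reports a single combined count, even though neither vendor changed its own event codes.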

Event Logs for Voter Records

This is a topic for a to-be-formed working group, focused on issues very similar to those of the Event Log group described above, but for events of a voter records system. The types of events we’re talking about here are things like: voter registration request rejected (including why); voter’s address change accepted; voter’s absentee ballot accepted for counting; voter’s provisional ballot rejected (and why); voter checked in to vote in person; and so on. The format will likely be pretty similar to the other event log format, and much of the discussion will be similar to above groups: whether there is a complete enumeration of actions or objects; whether to rely on external canonical lists; how to not expose PII, but allow a record to uniquely identify the voter in question (so that we can recognize when multiple events were about the same voter).

What types of interoperability would this support? Automated reporting, and data mining in general (again, a larger issue), but one example is that it would support automated reporting that compares military voters to other voters in terms of voting outcomes: numbers and percentages of voters who voted absentee vs. in person, absentee voters whose ballots were counted vs. rejected and if so why …

This type of reporting is already required of localities and states to the Federal government, and it is currently very burdensome for many election officials to create. As a result, one of the enthusiastic supporters of this nascent effort is a recently appointed EAC Commissioner who until recently was a state election official who was grumpy over the burden of this type of reporting, but is now on the Federal commission requiring the reporting. So you can see that although not the most thrilling human endeavor, standards work can have its elements of irony. 🙂
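As a small illustration of the automated reporting mentioned above, the following sketch tallies ballot outcomes per voter group from a list of voter-record events. The event names, the `group` field, and the dictionary layout are all hypothetical, since the working group has not yet been formed and no format exists.

```python
from collections import Counter, defaultdict

# Assumed event names; a real format would define or reference its own list.
REPORTED_EVENTS = {"absentee_accepted", "absentee_rejected", "checked_in"}

def outcomes_by_group(events):
    """events: iterable of dicts with 'voter_key', 'group', and 'event' entries."""
    counts = defaultdict(Counter)
    for entry in events:
        if entry["event"] in REPORTED_EVENTS:
            counts[entry["group"]][entry["event"]] += 1
    return counts

def absentee_rejection_rate(counts, group):
    """Share of a group's absentee ballots that were rejected, or None if it had none."""
    tallies = counts.get(group, Counter())
    total = tallies["absentee_accepted"] + tallies["absentee_rejected"]
    return tallies["absentee_rejected"] / total if total else None
```

Comparing `absentee_rejection_rate(counts, "military")` with the rate for other voters is the sort of figure that today has to be assembled by hand.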

Cast Vote Records 

Another topic for a to-be-formed working group, this is about how to extend the existing .2 standard (election result reporting) to describe just the votes recorded from a single ballot, together with other data (for example an image of a paper ballot from which the votes were recorded) that would be needed to support ballot audits. The whole larger issue of ballot audits is … larger; but you can read more about it in past posts here (search for “audit”) or elsewhere by a web search on “risk limiting audits elections.”

Ballot Specifications

Another topic for a to-be-formed working group, this is about how to extend the existing .2 standard’s description of ballot items (contests, candidates, referenda, questions, etc.) that is currently limited to what’s needed for results reporting.  The extensions could be limited to extensions needed to display online sample ballots, but could extend much further. Some of us have a particular interest in supporting “interactive sample ballots” which is, again, a larger issue, but more on that as the work unfolds.

Common Identifiers and OCD

Lastly, we also discussed more of the common identifier issue that I reported on earlier in day one. It turns out that this is another instance, though slightly more complicated, of the issue facing a number of standards that I described above: semantic interoperability. In the .2 standard, we don’t want to bake into it an incomplete list of every possible district, office, precinct, etc., even though we need common identifiers for these if two datasets are to be interpreted as referring to the same things.

So, again, we have the issue of a separate canonical list. However, in this case, the space is huge, and the names (unlike event types) wouldn’t be self-identifying; and the things named could have multiple valid names. So there will no doubt be large directories of information about these political units, using common naming schemes. But to avoid these becoming a large muddle, we do have a smaller problem of smaller canonical lists, for example, a list of the names of all the types of district used in each state. With that, we could use existing naming schemes in a canonical way.

The most promising (by consensus of those working on standards anyway) naming scheme is that of the Open Civic Data project, including IDs of exactly this sort. The scope for OCD-IDs is broad: defining a handle for pretty much any government entity in any country, so that various organizations that have data on those entities can publish that data using a common identifier, enabling others to aggregate the data about those entities. It’s much broader than U.S. electoral districts, but it’s already in use, including for U.S. electoral districts. However, as I described above, the fly in the ointment is that plethora of types of electoral district; for a common unique name, you need to include the type of district, for example, the fire control district in CA’s San Mateo County that’s known as Fire District #3.

OK, so what, who, how will this registry, or directory, or curated list — whatever you might call it — get created and managed? Still a good question, but at least we have some clarity on what needs to be done, and maybe a bit of the how, as well. Stay tuned.

If we had this missing link (a canonical scheme for names of U.S. electoral districts) then we could use OCD-IDs (or extensions of FIPS geo codes for that matter) as an optional but commonly used and standards-based approach for constructing unique identifiers for electoral districts. Organizations that choose to use the naming scheme could issue VSSC.2 datasets that could be aggregated with others that also use the scheme. And then, people could have a much easier time aggregating those election result datasets to get large scale election results. At the risk of fore-shadowing, that’s actually a big deal to data-heads, public interest groups, and news organizations alike, as eloquently explained by a speaker at the next annual conference, which was this week in DC.
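To make the aggregation benefit concrete, here is a minimal sketch that merges results from separately published datasets, keyed on a shared district identifier. The row layout is an assumption (it is not the VSSC.2 format), and the identifier strings below are illustrative examples written in the OCD-ID style rather than entries checked against any registry.

```python
from collections import defaultdict

def aggregate_results(*datasets):
    """Sum vote counts across datasets that use the same district identifiers."""
    totals = defaultdict(int)
    for dataset in datasets:
        for row in dataset:
            key = (row["district_id"], row["contest"], row["candidate"])
            totals[key] += row["votes"]
    return dict(totals)

# Two hypothetical county datasets reporting on the same multi-county district.
DISTRICT = "ocd-division/country:us/state:ca/cd:14"  # illustrative identifier
county_a = [
    {"district_id": DISTRICT, "contest": "US House", "candidate": "Smith", "votes": 1200},
    {"district_id": DISTRICT, "contest": "US House", "candidate": "Jones", "votes": 900},
]
county_b = [
    {"district_id": DISTRICT, "contest": "US House", "candidate": "Smith", "votes": 300},
]
combined = aggregate_results(county_a, county_b)
```

The merge only works because both datasets spell the district the same way; without a common naming scheme, each aggregator has to maintain its own mapping between every publisher’s local names.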

Coming Soon

That conference (the NIST/EAC Future of Voting Systems Symposium II) will be the topic of my next few reports!

— EJS

 

The “VoteStream Files” A Summary

The TrustTheVote Project Core Team has been hard at work on the Alpha version of VoteStream, our election results reporting technology. They recently wrapped up a prototype phase funded by the Knight Foundation, and then forged ahead a bit, to incorporate data from additional counties, provided by participating state or local election officials after the official wrap-up.

Along the way, there have been a series of postings here that together tell a story about the VoteStream prototype project. They start with a basic description of the project in Towards Standardized Election Results Data Reporting and Election Results Reload: the Time is Right. Then there was a series of posts about the project’s assumptions about data, about software (part one and part two), and about standards and converters (part one and part two).

Of course, the information wouldn’t be complete without a description of the open-source software prototype itself, provided in Not Just Election Night: VoteStream.

Actually the project was as much about data, standards, and tools, as software. On the data front, there is a general introduction to a major part of the project’s work in “data wrangling” in VoteStream: Data-Wrangling of Election Results Data. After that were more posts on data wrangling, quite deep in the data-head shed, but still important, because each one is about the work required to take real election data and real election result data from disparate counties across the country, and fit it into a common data format and common online user experience. The deep data-heads can find quite a bit of detail in three postings about data wrangling, in Ramsey County MN, in Travis County TX, and in Los Angeles County CA.

Today, there is a VoteStream project web site with VoteStream itself and the latest set of multi-county election results, but also with some additional explanatory material, including the election results data for each of these counties.  Of course, you can get that from the VoteStream API or data feed, but there may be some interest in the actual source data.  For more on those developments, stay tuned!

Election Results: Data-Wrangling Los Angeles County

LA County CA is the mother of all election complexities, and the data wrangling was intense, even compared to the hardly simple efforts that I reported on previously. There are over 32,000 distinct voting regions, which I think is more than the number of seats, ridings, chairs, and so on, for all the federal and state houses of government in all the parliamentary democracies in the EU.

The LA elections team was marvelously helpful, and upfront about the limits of what they can produce with the aging voting system that they are working hard on replacing. This is what we started with.

  • A nicely structured CSV file listing all the districts in LA county: over 20 different types of district, and over 900 individual districts.
  • Some legacy GIS data, part of which defined each precinct in terms of which districts it is in.
  • The existing legacy GIS data converted into XML standard format (KML), again, kindly created by LA CC-RR IT chief, Kenneth Bennett.
  • A flat text file of all the election results for the 2012 election for every precinct in LA County, and various roll-ups.
  • A sort of Rosetta Stone that is just the Presidential election results, but in a well-structured CSV file, also very kindly generated for us by Kenneth.

You’ll notice that not included is a definition of the 2012 election itself: the contests, which district each contest is for, other info on the contest, info on candidates, referenda, and so on. So, first problem, we needed to reverse engineer that as best as we could, from the election results. But before we could do that, we had to figure out how to parse the flat text file of results. The “Rosetta Stone” was helpful, but we then realized that we needed information about each precinct that reported results in the flat text file. To get the precinct information, we had to parse the legacy GIS data, and map it to the districts definition.

The second problem was GIS data that wasn’t obvious, but fortunately we had excellent help from Elio Salazar, a member of Ken’s team who specializes in the GIS data. He helped us sort out various intricacies and corner cases. One of the hardest turned out to be the ways in which one district (say, a school district) is a real district used for referenda, but is also sub-divided into smaller districts, each being for a council seat. Some cities were subdivided this way into council seats, some not; same for water districts and several other kinds of districts.

Then, as soon as we thought we had clear sailing, it turned out that the districts file had a couple of minor format errors that we had to fix by hand. Plus there were 4 special case districts that weren’t actually used in the precinct definitions, but were required for the election results. Whew! At that point we thought we had a complete election definition including the geo-data of each precinct in KML. But wait! We had over 32,000 precincts defined, but only just shy of 5,000 that reported election results. I won’t go into the details of sub-precincts and precinct consolidation, and how some data was from the 32,000 viewpoint and other data from the 4,993 viewpoint. Or why 4,782 was not our favorite number for several days.

Then the final lap, actually parsing all the 100,000 plus contest results in the flat text file, normalizing and storing all the data, and then emitting it in VIP XML. We thought we had a pretty good specification (only 800 words long) of the structure implicit in the file. We came up with three major special cases, and I don’t know how many little weird cases that turned out not to be relevant to the actual vote counts. I didn’t have the heart to update the specification, but it was pretty complex, and honestly the data is so huge that we could spend many days writing consistency checks of various kinds and manually reviewing the input to track down inconsistencies.

In the end, I think we got to a pretty close but probably not perfect rendition of election results. A truly re-usable and reliable data converter would need some follow-on work in close collaboration with several folks in Ken’s team — something that I hope we have the opportunity to do in a later phase of work on VoteStream.

But 100% completeness aside, we still had excellent proof of concept that even this most complex use case did in fact match the standard data model and data format we were using. With some further work using the VIP common data format with other counties, the extended VIP format should be nearly fully baked and ready for work with the IEEE standards body on election data.

— EJS

Election Results: Data-Wrangling Travis County

Congratulations if you are reading this post, after having even glanced at the predecessor about Ramsey County data wrangling — one of the longer and geekier posts in recent times at TrustTheVote. There is a similar but shorter story about our work with Travis County Texas. As with Ramsey, we started with a bunch of stuff that Travis Elections folks gave us, but rather than do the chapter and verse, I can summarize a bit.

In fact, I’ll cut to the end, and then go back. We were able to fairly quickly develop data converters from the Travis Nov 2012 data to the same standards-based data format we developed for Ramsey. The exception is the GIS data, which we will circle back to later. This was a really good validation of our data conversion approach. If it extends to other counties as well, we’ll be super pleased.

The full story is that Travis elections folks have been working on election result reporting for some time, as have we at TrustTheVote Project, and we’ve learned a lot from their efforts. Because of those efforts, Travis has worked extensively on how to use the data export capabilities of their voting system product’s election management system. They have enough experience with their Hart InterCivic EMS that they know exactly the right set of export routines to use to dump exactly the right set of files. We then developed data converters to chew up the files and spit out VIP XML for the election definitions, and also a form of VIP XML for the vote tallies.

The structure of the export data roughly corresponds to the VIP schema; one flat TXT file that presents a list of each of the 7 kinds of basic item (precinct, contest, etc.) that we represent as VIP objects; and 4 files that express relations between types of objects, e.g. precincts and districts, or contests and districts. As with Ramsey, the district definitions were a bit sticky. The Travis folks provided a spreadsheet of districts that was a sort of extension of the exports file about districts. We had to extend the extensions a bit, for reasons similar to those outlined in the previous account of Ramsey data-wrangling. The rest of the files were a bit crufty, with nothing to suggest the meaning of the column entries other than the name of the file. But with the raw data and some collegial help from Travis elections folks, it mapped pretty simply to the standard data format.
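For the data-heads, here is a rough sketch of how two such relation files can be combined to work out which contests belong to which precinct. The IDs and the tuple layout are invented for illustration; the real export files have their own columns and quirks.

```python
from collections import defaultdict

# Hypothetical rows, as if already read from two relation files.
precinct_district = [("P1", "D-city"), ("P1", "D-school"), ("P2", "D-city")]
contest_district = [("C-mayor", "D-city"), ("C-board", "D-school")]


def contests_by_precinct(precinct_district, contest_district):
    """Join precinct->district and contest->district relations on the district ID."""
    contests_in = defaultdict(set)
    for contest, district in contest_district:
        contests_in[district].add(contest)

    result = defaultdict(set)
    for precinct, district in precinct_district:
        result[precinct] |= contests_in.get(district, set())
    return {precinct: sorted(contests) for precinct, contests in result.items()}


ballot_content = contests_by_precinct(precinct_district, contest_district)
print(ballot_content)  # {'P1': ['C-board', 'C-mayor'], 'P2': ['C-mayor']}
```

The district is the hinge in both relations, which is one reason sticky district definitions cause trouble everywhere else.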

There was one area though, where we learned a lot more from Travis. In Travis with their Hart system, they are able to separately track vote tallies for each candidate (of course, that’s the minimum) as well as: write-ins, non-votes that result from a ballot with no choice on it (under-votes), and non-votes that result from a ballot with too many choices (over-votes). That really helped extend the data format for election results, beyond what we had from Ramsey. And again, this larger set of results data fit well into our use of the VIP format.

That sort of information helps total up the tallies from each individual precinct, to double check that every ballot was counted. But there is also supplementary data that helps even more, noting whether an under-vote or over-vote was from early voting, absentee voting, in-person voting, etc. With further information about rejected ballots (e.g. unsigned provisional ballot affidavits, late absentee ballots), one can account for every ballot cast (whether counted or rejected), every ballot counted, every ballot in every precinct, every vote or non-vote from individual ballots, and so on, to get a complete picture down to the ground in cases where there are razor-thin margins in an election.
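As a small illustration of that double check, here is a sketch that compares the ballots cast in a precinct against everything recorded for one contest. It assumes a vote-for-one contest, so each ballot lands in exactly one bucket (a candidate, a write-in, an under-vote, or an over-vote). The class, the field names, and the numbers are all made up for the example.

```python
from dataclasses import dataclass


@dataclass
class ContestTally:
    precinct: str
    contest: str
    candidate_votes: dict[str, int]
    write_ins: int
    under_votes: int
    over_votes: int

    def accounted_ballots(self) -> int:
        """Ballots explained by this contest's tallies (vote-for-one assumed)."""
        return (sum(self.candidate_votes.values()) + self.write_ins
                + self.under_votes + self.over_votes)


def unaccounted(tally: ContestTally, ballots_cast: int) -> int:
    """Positive: ballots with no recorded outcome. Negative: more outcomes than ballots."""
    return ballots_cast - tally.accounted_ballots()


tally = ContestTally("P-101", "Mayor", {"Alice": 412, "Bob": 377},
                     write_ins=6, under_votes=14, over_votes=2)
print(unaccounted(tally, ballots_cast=811))  # 0
```

A nonzero result does not say what went wrong, only that this precinct and contest deserve a closer look.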

We’re still digesting all of that, and will likely continue for some time as we continue our election-result work beyond the VoteStream prototype effort. But even at this point, we think that we have the vote-tallies part of the data standard worked out fairly well, with some additional areas for on-going work.

— EJS

Election Results: Data-Wrangling Ramsey County

Next up are several overdue reports on data wrangling of county level election data, that is, working with election officials to get legacy data needed for election results; and then putting the data into practical use. It’s where we write software to chew up whatever data we get, put it in a backend system, re-arrange it, and spit it out all tidy and clean, in a standard election data format. From there, we use the standard-format data to drive our prototype system, VoteStream.

I’ll report on each of 3 and leave it at that, even though since then we’ve forged ahead on pulling in data from other counties as well. These reports from the trenches of VoteStream will be heavy on data-head geekery, so no worries if you want to skip them if that’s not your cup of tea. For better or for worse, however, this is the method of brewing up data standards.

I’ll start with Ramsey County, MN, which was our first go-round. The following is not a short or simple list, but here is what we started with:

  • Some good advice from: Joe Mansky, head of elections in Ramsey County, Minnesota; and Mark Ritchie, Secretary of State and head of elections for Minnesota.
  • A spreadsheet from Joe, listing Ramsey County’s precincts and some of the districts they are in; plus verbal info about other districts that the whole county is in.
  • Geo-data from the Minnesota State Legislative GIS office, with a “shapefile” for each precinct.
  • More data from the GIS office, from which we learned that they use a different precinct-naming scheme than Ramsey County.
  • Some 2012 election result datasets, also from the GIS office.
  • Some 2012 election result datasets from the MN SoS web site.
  • Some more good advice from Joe Mansky on how to use the election result data.
  • The VIP data format for expressing info about precincts and districts, contests and candidates, and an idea for extending that to include vote counts.
  • Some good intentions for doing the minimal modifications to the source data, and creating a VIP standard dataset that defines the election (a JEDI in our parlance, see a previous post for explanation).
  • Some more intentions and hopes for being able to do minimal modifications to create the election results data.

Along the way, we got plenty of help and encouragement from all the organizations I listed above.

Next, let me explain some problems we found, what we learned, and what we produced.

  • The first problem was that the county data and GIS data didn’t match, but we connected the dots, and used the GIS version of precinct IDs, which use the national standard, FIPS.
  • County data didn’t include statewide districts, but the election results did. So we again fell back on FIPS, and added standards-based district IDs. (We’ll be submitting that scheme to the standards bodies, when we have a chance to catch our breath.)
  • Election results depend on an intermediate object called “office” that links a contest (say, for state senate district 4) to a district (say, the 4th state senate district), via an office (say, the state senate seat for district 4), rather than a direct linkage. Sounds unimportant, but …
  • The non-local election results used the “office” to identify the contest, and this worked mostly OK. One issue was that the U.S. congress offices were all numbered, but without mentioning MN. This is a problem if multiple states report results for “Representative, 1st Congressional District” because all states have a first congressional district. Again, more hacking the district ID scheme to use FIPS.
  • The local election results did not work so well. A literal reading of the data seemed to indicate that each town in Ramsey County in the Nov. 2012 election had a contest for mayor — the same mayor’s office. Ooops! We needed to augment the source data to make plain *which* mayor’s office the contest was for.
  • Finally, still not done, we had a handful of similarly ambiguous data for offices other than mayor, that couldn’t be tied to a single town.
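To show the kind of hacking involved, here is a minimal sketch of a district ID that carries the state’s FIPS code, so that two states’ “1st Congressional District” no longer collide. The ID layout is invented for this example and is not the scheme we actually used; the two-digit state FIPS codes themselves (27 for Minnesota, 19 for Iowa) are real.

```python
# Two-digit state FIPS codes; only the states needed for the example.
STATE_FIPS = {"MN": "27", "IA": "19"}


def congressional_district_id(state: str, district: int) -> str:
    """Hypothetical ID layout: <state FIPS>-CD-<two-digit district number>."""
    return f"{STATE_FIPS[state]}-CD-{district:02d}"


print(congressional_district_id("MN", 1))  # 27-CD-01
print(congressional_district_id("IA", 1))  # 19-CD-01
```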

One last problem, for the ultra data-heads. Turns out that some precincts are not a single contiguous geographical region, but a combination of 2 that touch only at a point, or (weirder) aren’t directly connected. So our first cut at encoding the geo-data into XML (for inclusion in VIP datasets) wasn’t quite right, and the Google maps view of the data had holes in it.
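For the curious, KML can express a split precinct by wrapping several Polygon elements in one MultiGeometry. The sketch below builds such a Placemark with Python’s standard library. The coordinates are invented, the helper names are mine, and a complete KML file would also need the enclosing kml element and namespace, which I have left out to keep this short.

```python
import xml.etree.ElementTree as ET


def polygon_element(ring):
    """Build a KML Polygon from a list of (lon, lat) points, closing the ring if needed."""
    points = list(ring)
    if points[0] != points[-1]:
        points.append(points[0])  # KML rings end on their starting point
    polygon = ET.Element("Polygon")
    outer = ET.SubElement(polygon, "outerBoundaryIs")
    linear = ET.SubElement(outer, "LinearRing")
    # KML coordinate order is longitude,latitude
    ET.SubElement(linear, "coordinates").text = " ".join(
        f"{lon},{lat}" for lon, lat in points)
    return polygon


def precinct_placemark(name, rings):
    """One Placemark per precinct; disjoint pieces go inside a MultiGeometry."""
    placemark = ET.Element("Placemark")
    ET.SubElement(placemark, "name").text = name
    if len(rings) == 1:
        placemark.append(polygon_element(rings[0]))
    else:
        multi = ET.SubElement(placemark, "MultiGeometry")
        for ring in rings:
            multi.append(polygon_element(ring))
    return placemark


west = [(-93.10, 44.95), (-93.09, 44.95), (-93.09, 44.96), (-93.10, 44.96)]
east = [(-93.07, 44.95), (-93.06, 44.95), (-93.06, 44.96), (-93.07, 44.96)]
single = precinct_placemark("Precinct A", [west])
split = precinct_placemark("Precinct B", [west, east])
print(ET.tostring(split, encoding="unicode"))
```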

So, here is what we learned.

  • We had to semi-invent some naming conventions for districts, contests, and candidates, to keep separate everything that was actually separate, and to disambiguate things that sounded the same but were actually different. It’s actually not important if you are only reporting results at the level of one town, but if you want to aggregate across towns, counties, states, etc., then you need more. What we have is sufficient for our needs with VoteStream, but there is real room for more standards like FIPS to make a scheme that works nationwide.
  • Using VIP was simple at first, but when we added the GIS data, and used the XML standard for it (KML), there was a lot of fine-tuning to get the datasets to be 100% compliant with the existing standards. We actually spent a surprising amount of time testing the data model extensions and validations. It was worth it, though, because we have a draft standard that works, even with those wacky precincts shaped like east and west Prussia.
  • Despite that, we were able to finish the data-wrangling fairly quickly and use a similar approach for other counties — once we figured it all out. We did spend quite a bit of time mashing this around and asking other election officials how *their* jurisdictions worked, before we got it all straight.

Lastly, here is what we produced. We now have a set of data conversion software that we can use to take the starting data listed above and produce election definition datasets in a repeatable way, making the most effective use of existing standards. We also had a less settled method of data conversion for the actual results (e.g., for precinct 123, for contest X, for candidate Y, there were Z votes), similar for all precincts and all contests. That was sufficient for the data available in MN, but not yet sufficient for additional info available in other states but not in MN.
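In code terms, that basic shape of results data is just a flat list of (precinct, contest, candidate, votes) rows. Here is a minimal sketch with invented numbers, plus the obvious roll-up across precincts; it illustrates the shape of the data, not our actual converter.

```python
from collections import Counter
from typing import NamedTuple


class ResultRow(NamedTuple):
    precinct: str
    contest: str
    candidate: str
    votes: int


# Invented sample rows for two precincts and one contest.
rows = [
    ResultRow("123", "X", "Y", 250),
    ResultRow("123", "X", "W", 190),
    ResultRow("124", "X", "Y", 80),
    ResultRow("124", "X", "W", 120),
]


def contest_totals(rows, contest):
    """Sum each candidate's votes for one contest across all precincts."""
    totals = Counter()
    for row in rows:
        if row.contest == contest:
            totals[row.candidate] += row.votes
    return dict(totals)


print(contest_totals(rows, "X"))  # {'Y': 330, 'W': 310}
```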

The next steps are: tackle other counties with other source data, and wrangle the data into the same standards-based format for election definitions; extend the data format for more complex results data.

Data wrangling the Nov 2012 Ramsey County election was very instructive, and we couldn’t have done it without plenty of help, for which we are very grateful!

— EJS

VoteStream: Data-Wrangling of Election Results Data

If you’ve read some of the ongoing thread about our VoteStream effort, it’s been a lot about data and standards. Today is more of the same, but first with a nod that the software development is going fine, as well. We’ve come up with a preliminary data model, gotten real results data from Ramsey County, Minnesota, and developed most of the key features in the VoteStream prototype, using the TrustTheVote Project’s Election Results Reporting Platform.

I’ll have plenty to say about the data-wrangling as we move through several different counties’ data. But today I want to focus on a key structuring principle that works both for data and for the work that real local election officials (LEOs) do, before an election, during election night, and thereafter.

Put simply, the basic structuring principle is that the election definition comes first, and the election results come later and refer to the election definition. This principle matches the work that LEOs do, using their election management system to define each contest in an upcoming election, define each candidate, and so on. The result of that work is a data set that both serves as an election definition, and also provides the context for the election by defining the jurisdiction in which the election will be held. The jurisdiction is typically a set of electoral districts (e.g. a congressional district, or a city council seat), and a county divided into precincts, each of which votes on a specific set of contests in the election.
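Here is a minimal sketch of that principle: results are only accepted if they refer to precincts, contests, and candidates that the definition already contains. The class and field names are invented for illustration, and a real definition holds far more (districts, offices, geo-data).

```python
from dataclasses import dataclass, field


@dataclass
class ElectionDefinition:
    precincts: set[str]
    contests: dict[str, set[str]]  # contest ID -> candidate IDs


@dataclass
class ElectionResults:
    definition: ElectionDefinition
    tallies: dict[tuple[str, str, str], int] = field(default_factory=dict)

    def check_reference(self, precinct, contest, candidate):
        """Return a list of problems; an empty list means the row refers to defined items."""
        problems = []
        if precinct not in self.definition.precincts:
            problems.append(f"unknown precinct: {precinct}")
        if contest not in self.definition.contests:
            problems.append(f"unknown contest: {contest}")
        elif candidate not in self.definition.contests[contest]:
            problems.append(f"unknown candidate in {contest}: {candidate}")
        return problems

    def record(self, precinct, contest, candidate, votes):
        problems = self.check_reference(precinct, contest, candidate)
        if problems:
            raise ValueError("; ".join(problems))
        self.tallies[(precinct, contest, candidate)] = votes


definition = ElectionDefinition({"P1", "P2"}, {"mayor": {"alice", "bob"}})
results = ElectionResults(definition)
results.record("P1", "mayor", "alice", 10)
print(results.check_reference("P9", "mayor", "carol"))
```

Rejecting a row that points at nothing in the definition is what keeps a typo in a precinct ID from quietly becoming a new precinct.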

Our shorthand term for this dataset is JEDI (jurisdiction election data interchange), which is all the data about an election that an independent system would need to know. Most current voting system products have an Election Management System (EMS) product that can produce a JEDI in a proprietary format, for use in reporting, or ballot counting devices. Several states and localities have already adopted the VIP standard for publishing a similar set of information.

We’ve adopted the VIP format as the standard that we’ll be using on the TrustTheVote Project. And we’re developing a few modest extensions to it that are needed to represent a full JEDI that meets the needs of VoteStream, or really any system that consumes and displays election results. All extensions are optional and backwards compatible, and we’ll be submitting them as suggestions when we think we have a full set. So far, it’s pretty basic: the inclusion of geographic data that describes a precinct’s boundaries; a use of existing meta-data to note whether a district is a federal, state, or local district.

So far, this is working well, and we expect to be able to construct a VIP-standard JEDI for each county in our VoteStream project, based on the extant source data that we have. The next step, which may be a bit more hairy, is a similar standard for election results with the detailed information that we want to present via VoteStream.

— EJS

PS: If you want to look at a small artificial JEDI, it’s right here: Arden County, a fictional county that has just 3 precincts, about a dozen districts, and a Nov. 2012 election. It’s short enough that you can page through it and get a feel for what kinds of data are required.

 

Election Results Reporting – Assumptions About Standards and Converters (concluded)

Last time, I explained how our VoteStream work depends on the 3rd of 3 assumptions: loosely, that there might be a good way to get election results data (and other related data) out of their current hiding places, and into some useful software, connected by an election data standard that encompasses results data. But what are we actually doing about it?

Answer: we are building prototypes of that connection, and the lynchpin is an election data standard that can express everything about the information that VoteStream needs. We’ve found that the VIP format is an existing, widely adopted standard that provides a good starting point. More details on that later, but for now the key words are “converters” and “connectors”. We’re developing technology that proves the concept that anyone with basic data modeling and software development skills can create a connector, or data converter, that transforms election data (including but most certainly not limited to vote counts) from one of a variety of existing formats, to the format of the election data standard.

And this is the central concept to prove, because as we’ve been saying in various ways for some time, the data exists but is locked up in a variety of legacy and/or proprietary formats. These existing formats differ from one another quite a bit, and contain varying amounts of information beyond basic vote counts. There is good reason to be skeptical, to suppose that it is a hard problem to take these different shapes and sizes of square data pegs (and pentagonal, octahedral, and many other shaped pegs!) and put them in a single round hole.

But what we’re learning (and the jury is still out, promising as our experience is so far) is that all these existing data sets have basically similar elements that correspond to a single standard, and that it’s not hard to develop prototype software that uses those correspondences to convert to a single format. We’ll get a better understanding of the tricky bits as we go along making 3 or 4 prototype converters.
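To give a feel for what a converter amounts to, here is a toy sketch: a made-up legacy CSV export, a mapping from its column names to common field names, and a function that emits records in the common shape. The legacy format, the column names, and the output fields are all assumptions for illustration; real exports are messier, and the real target is VIP XML rather than Python dictionaries.

```python
import csv
import io

# Hypothetical legacy export; the column names differ from the common format.
LEGACY_CSV = """PCT,RACE,CAND,TOTAL
0123,MAYOR,SMITH,250
0123,MAYOR,JONES,190
"""

# The per-format part of a converter: legacy column name -> common field name.
COLUMN_MAP = {"PCT": "precinct", "RACE": "contest", "CAND": "candidate", "TOTAL": "votes"}


def convert(legacy_text, column_map):
    """Read a legacy CSV export and return records using the common field names."""
    reader = csv.DictReader(io.StringIO(legacy_text))
    records = []
    for row in reader:
        record = {column_map[name]: value for name, value in row.items()}
        record["votes"] = int(record["votes"])  # vote counts are numbers, not text
        records.append(record)
    return records


converted = convert(LEGACY_CSV, COLUMN_MAP)
print(converted[0])
```

The hope is that each new legacy format needs only its own mapping and parsing quirks, while everything downstream sees a single format.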

Much of this feasibility rests on a structuring principle that we’ve adopted, which runs parallel to the existing data standard that we’ve adopted. Much more on that principle, the standard, its evolution, and so on … yet to come. As we get more experience with data-wrangling and converter-creation, there will certainly be a lot more to say.

— EJS