Vioxx, the drug once prescribed for arthritis, was sold for over five years before its manufacturer, Merck, withdrew it from the market in 2004. Though small-scale studies found a correlation between Vioxx and increased risk of heart attack, the FDA did not have convincing evidence until it analyzed data on 1.4 million HMO members. By the time Vioxx was pulled, it had caused between 88,000 and 139,000 unnecessary heart attacks, and 27,000-55,000 avoidable deaths. When the U.S. mortality rate dropped sharply in 2007, one plausible explanation for the happy news was the disappearance of Vioxx.
The Vioxx debacle is a haunting illustration of the importance of large-scale data research. Dr. Platt, an FDA drug risk researcher, described possible "what if" scenarios in 2007 testimony. If researchers had had access to 7 million patient records, the relationship between Vioxx and heart attack would have been clear in under three years. With access to 100 million records, it would have been discovered in just three months.
These are the consequences of the overcautious privacy rules of HIPAA, the federal health privacy statute. HIPAA allows doctors and insurers to release patient data for research use only if they first strip out much of the information that may be critical to research. These privacy rules were designed to avoid the risk that data could be used to identify a patient, but that identification is not as easy as commonly believed.
Latanya Sweeney's re-identification of William Weld, then-Governor of Massachusetts, using a pre-HIPAA database of hospital records is the quintessential re-identification attack. Paul Ohm describes the famous attack nicely:
At the time [the research data was released], William Weld, then Governor of Massachusetts, assured the public that GIC had protected patient privacy by deleting identifiers. In response, then-graduate student Sweeney started hunting for the Governor's hospital records in the GIC data. She knew that Governor Weld resided in Cambridge, Massachusetts, a city of 54,000 residents and seven ZIP codes. For twenty dollars, she purchased the complete voter rolls from the city of Cambridge, a database containing, among other things, the name, address, ZIP code, birth date, and sex of every voter. By combining this data with the GIC records, Sweeney found Governor Weld with ease. Only six people in Cambridge shared his birth date, only three of them men, and of them, only he lived in his ZIP code. In a theatrical flourish, Dr. Sweeney sent the Governor's health records (which included diagnoses and prescriptions) to his office.
However, this description of Sweeney's attack is far too pat. Public records are never complete. For example, a significant portion of the population is not registered to vote. How was Sweeney so sure there was not another man with Weld's birth date and zip code who was not registered to vote?
Daniel Barth-Jones has a fascinating new article that revisits Weld's re-identification. To start with, Sweeney's estimate of Cambridge's population is way off. There were nearly 100,000 people living in Cambridge then. This should have been the first hint that Sweeney's methodology was simplistic. She reported a population of 54,000 because that many Cambridge residents were registered to vote. Sweeney used these records as if they described the entire population.
By comparing Sweeney's count of Cambridge voter registrants with Census records, Barth-Jones confirmed that many Cambridge voting-age adults (35 percent) were not registered to vote. These non-registrants are obviously immune from the record-matching attack that Sweeney performed, but they also provide unwitting protection to people who are registered voters. (Non-voters DO perform a civic function!) In Weld's case, Census data show that approximately 174 men in his zip code were Weld's age. We don't know their precise birth dates, but the chance that another man living in Weld's zip code shared his birth date was about 35 percent.
Even if Sweeney confirmed that no other registered voter shared Weld's gender, zip code, and birth date, she could not be sure about the roughly 50 Cambridge residents who were Weld's age and were not registered to vote. At best, Weld's chance of having a unique birth date, zip code, and gender combination is 87 percent. This means the chance that Sweeney's matching attack would have been wrong using these three variables alone was 13 percent, much worse than the 5 percent error rate conventionally tolerated in scientific research.
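The 87 percent figure can be roughly reproduced with a back-of-the-envelope calculation. The sketch below is not Barth-Jones's actual method. It simply assumes that birth dates are independent and spread evenly over a 365-day year, and it takes the "roughly 50" unregistered residents of Weld's age from the paragraph above. The function name and parameters are illustrative.

```python
def prob_no_shared_birth_date(n_others: int, days_in_year: int = 365) -> float:
    """Chance that none of n_others people shares one fixed birth date.

    Assumption (not from the source): birth dates are independent and
    uniformly distributed over days_in_year.
    """
    return (1 - 1 / days_in_year) ** n_others


# Roughly 50 residents of Weld's age were not on the voter rolls.
unregistered_same_age = 50

p_unique = prob_no_shared_birth_date(unregistered_same_age)
p_wrong = 1 - p_unique

print(f"{p_unique:.0%} unique, {p_wrong:.0%} chance of a false match")
# prints "87% unique, 13% chance of a false match"
```

Under these simplifying assumptions, the arithmetic lands on the same 87/13 split: even a perfect match against the voter rolls leaves about a one-in-eight chance that an unlisted resident fits the same profile.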
Barth-Jones' study nicely illustrates why a matching attack using voter registration records would not be sufficient to re-identify William Weld. The voter registry also wasn't necessary. Weld collapsed publicly while giving a commencement speech, and local television and newspapers covered the dates and details of his treatment at Deaconess-Waltham Hospital. These news reports made Weld identifiable with certainty without matching anything to voter records. Thus, the attack may say something about the vulnerability of celebrities and of compulsive live-bloggers, but it never demonstrated the proposition for which it has come to be known. Individuals cannot be re-identified "with astonishing ease," as Ohm claims.
A 1997 MIT study showed that, because of the public availability of the Cambridge, Massachusetts voting list, 97 percent of the individuals in Cambridge whose data appeared in a data base which contained only their nine digit ZIP code and birth date could be identified with certainty.
This statement cannot be true. Some Cambridge residents were not of voting age, and of those who were, about one-third were not registered to vote.
Barth-Jones concludes that despite the misinformation, HHS developed a de-identification rule that appropriately strikes the balance between privacy and utility in research databases. But the Vioxx "What-If" study should give us pause. We labor under the inertia of significant status quo bias when we continue to accept existing HIPAA regulations. Re-identification risk is speculative. Attacks do not happen in practice. Meanwhile, the opportunity costs of HIPAA's research regulations include a body count.