The SysBioCube: The Government’s Efforts in Merging Privacy and Technology

Bryan Helwig | 8/22/16

Previously, I wrote about the issue of genetic information privacy that I encountered while directing a Department of Defense research laboratory.[i] Here I extend that discussion to the privacy issues associated with military-wide efforts to advance Warfighter safety through the implementation of a large integrated research database.

From 2010 to 2012 I served as the Deputy Director of the SysBioCube. The SysBioCube is the data warehousing and storage arm of the Department of Defense’s Systems Biology Enterprise. The goal of the Systems Biology Enterprise is to comprehend the etiology of, and find solutions for, military-relevant health issues.[ii] The Systems Biology Enterprise invites scientific researchers to submit integrative research data to help identify biomarkers of disease, investigate traumatic brain injury, better understand physiological responses, and detect environmental toxins.

As the data-warehousing component of the Systems Biology Enterprise, the SysBioCube provides an integrated data warehouse and analysis platform that facilitates systems biology studies of diseases of military relevance.[iii] At a high level, the SysBioCube works as follows. First, an investigator conducts a study, analyzes her results, and submits the work for publication. Second, after the investigator has exhausted the data for her needs and published all articles of interest, she submits her data to the SysBioCube. Third, the data is accepted into the SysBioCube and made available for members to mine further. This cycle is repeated by hundreds of investigators.

Investigators submit large datasets to the SysBioCube that, when combined, create a dataset significantly larger than any single investigator or institution could collect. The advantage lies in investigators using powerful computing methods to mine the data and identify unknown relationships between factors that couldn’t be measured in a single study. For instance, investigator A may upload genetic data while investigator B may upload heart rate, blood pressure and other physiological measures. Because all this data is now in a centralized repository, investigator C can mine both sets of data for relationships that neither study alone could identify. Importantly, a basic principle of statistics is that, all else being equal, the larger the sample size, the more closely the data represents the entire population. The SysBioCube contains millions of pieces of data, providing a treasure trove of information to scientists working to better understand conditions that plague Warfighters.[iv]
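The investigator A / investigator B example above can be sketched in a few lines of Python. This is only an illustration of pooling two datasets, not a description of how the SysBioCube itself stores or joins data. It assumes, hypothetically, that both studies key their records by a shared study subject ID; the IDs, field names and values below are all made up.

```python
# Hypothetical, made-up records keyed by an assumed shared study subject ID.
genetic = {  # uploaded by "investigator A"
    "S001": {"variant_x": 1},
    "S002": {"variant_x": 0},
    "S003": {"variant_x": 1},
}
physiology = {  # uploaded by "investigator B"
    "S001": {"heart_rate": 72},
    "S003": {"heart_rate": 88},
    "S004": {"heart_rate": 65},
}


def merge_by_subject(*datasets):
    """Inner-join datasets on subject ID, keeping only subjects found in all of them."""
    shared_ids = set(datasets[0])
    for dataset in datasets[1:]:
        shared_ids &= set(dataset)

    merged = {}
    for subject_id in sorted(shared_ids):
        row = {}
        for dataset in datasets:
            row.update(dataset[subject_id])
        merged[subject_id] = row
    return merged


# "Investigator C" can now look at genetic and physiological fields side by side.
combined = merge_by_subject(genetic, physiology)
```

Only subjects S001 and S003 appear in both uploads, so only those two end up in the combined view.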

When working with data from human subjects, privacy is always a top concern. I worked on the creation and implementation of the SysBioCube just prior to when the repository went live. We faced a two-fold challenge: first, how to ensure that uploaded data adequately protected study subject privacy, and second, how to ensure the database couldn’t easily be hacked. Researchers realize that the days of leaving patient charts on the subway have given way to the loss of a single memory stick or the hacking of a health database, either of which can expose an entire health network’s patient data with devastating results. On a monetary basis, data breaches cost U.S. companies approximately $6.53 million per incident (2014 figures), with half of all data breaches due to human or computer error.[v] From an image standpoint, recovering consumer trust after a data breach is an arduous task. But isn’t all patient and research study data de-identified and protected by HIPAA?

The short answer is yes, and the overwhelming majority of the data in the SysBioCube comes from studies where patient name and social security number have been replaced with a study subject identification number. Such blinding techniques work reasonably well when low-level data such as health history, medication profiles, etc. are collected. Further, these methods comply with HIPAA guidelines that require “reasonable and appropriate administrative, technical, and physical safeguards” in the handling of health data.[vi] HIPAA does not generally restrict the use of de-identified health data in research,[vii] but this is a minor concern because almost all Institutional Human Use Review Boards restrict how de-identified data can be used.
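The blinding step described above, replacing name and social security number with a study subject identification number, can be sketched as follows. This is a minimal illustration and not the procedure any particular study or the SysBioCube used. The function name `pseudonymize`, the `SUBJ-` ID format, and the sample records are all hypothetical.

```python
import secrets


def pseudonymize(records, direct_identifiers=("name", "ssn")):
    """Replace direct identifiers with a random study subject ID.

    Returns (deidentified_records, key_map). The key map links each subject ID
    back to the removed identifiers; in practice it would be stored separately
    from the research data, under stricter access control.
    """
    deidentified, key_map = [], {}
    for record in records:
        subject_id = "SUBJ-" + secrets.token_hex(4)
        while subject_id in key_map:  # regenerate on the unlikely collision
            subject_id = "SUBJ-" + secrets.token_hex(4)

        # Keep the removed identifiers only in the separate key map.
        key_map[subject_id] = {
            field: record[field] for field in direct_identifiers if field in record
        }
        # Copy everything else into the research record.
        cleaned = {k: v for k, v in record.items() if k not in direct_identifiers}
        cleaned["subject_id"] = subject_id
        deidentified.append(cleaned)
    return deidentified, key_map


# Made-up example records.
records = [
    {"name": "Jane Doe", "ssn": "000-00-0000", "medication": "ibuprofen"},
    {"name": "John Roe", "ssn": "000-00-0001", "medication": "none"},
]
deidentified, key_map = pseudonymize(records)
```

The research records keep the medication field and gain a subject ID, while names and social security numbers live only in `key_map`. Stripping two fields like this illustrates the idea only; fields left behind, such as dates or zip codes, can still matter for privacy.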

The bigger concern is the volume of data collected from a single subject. In my work I collected genetic profile data, protein data and demographic data (e.g. exercise, disease, smoking habits, height, weight, sex). I also obtained permission to use blood and DNA samples in future studies. The patient names were always stripped and I rarely interacted with the research subject. Thus, a barrier existed that prevented me from ever being able to re-connect research data with a subject.

However, in 2003 Latanya Sweeney, then at Carnegie Mellon, published algorithms and methods for using data from scientific studies to re-identify research subjects.[viii] Then in 2013 Professor Sweeney, now the director of the Data Privacy Lab at Harvard, re-identified over 40% of the subjects in the Personal Genome Project at Harvard.[ix] Re-identification is not novel; a modified form is used by Homeland Defense to identify suspected terrorists. Its application to science, however, is often dismissed, when in reality the consequences for privacy can be devastating.

So, how do researchers prevent re-identification? At the Department of Defense we used a variety of indirect methods: we limited access to the database, and we limited the type and format of data that could be uploaded to it. First, Professor Sweeney’s work relied heavily on combining demographic data such as zip code, date of birth and gender with medical and genomic information, including medications and procedures. Often, one or more of these pieces of data were missing from uploads to the SysBioCube. Second, database access was tightly controlled and provided only to internal government collaborators and SysBioCube-funded labs. Finally, because data had to be collected in the same way and saved in a uniform file format to be used in future analysis, some data, by virtue of format alone, was not accessible to users.
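To see why a missing demographic field matters, consider a simple count of how many records are unique on a given combination of fields. The sketch below is not Professor Sweeney’s algorithm and not a SysBioCube tool; it is a generic uniqueness check on assumed field names (`zip`, `dob`, `gender`) with made-up records. A record that is the only one with its combination of values is the easiest to link to an outside source such as a public roll.

```python
from collections import Counter


def unique_on(records, quasi_identifiers):
    """Return the records whose combination of quasi-identifier values is unique."""
    def key(record):
        return tuple(record.get(field) for field in quasi_identifiers)

    counts = Counter(key(record) for record in records)
    return [record for record in records if counts[key(record)] == 1]


# Made-up, de-identified records that still carry demographic fields.
records = [
    {"subject_id": "A", "zip": "60601", "dob": "1980-01-01", "gender": "F"},
    {"subject_id": "B", "zip": "60601", "dob": "1980-01-01", "gender": "F"},
    {"subject_id": "C", "zip": "60601", "dob": "1975-05-05", "gender": "F"},
    {"subject_id": "D", "zip": "60601", "dob": "1990-09-09", "gender": "M"},
    {"subject_id": "E", "zip": "60601", "dob": "1985-03-03", "gender": "M"},
]

# With zip code, date of birth and gender all present, three subjects stand alone.
unique_full = unique_on(records, ("zip", "dob", "gender"))

# If date of birth is absent from the upload, no subject is unique any more.
unique_reduced = unique_on(records, ("zip", "gender"))
```

In this toy data, subjects C, D and E are unique when all three fields are present, and none are unique once date of birth is removed. Real datasets are far larger, but the same effect is why withholding even one of these fields makes linking harder.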

Housing the database on a fully established, government-owned server at the National Cancer Institute mitigated our second challenge: preventing attacks on the database. An established server was advantageous because it eliminated the need to build a secure data repository from the ground up. Piggybacking provided substantial advantages in terms of security and database support.

In summary, re-identification is an issue that the science community cannot ignore. The current reliance on HIPAA for data security needs reconsideration, especially because large data sets are now the norm in research. HIPAA was never intended to protect large volumes of research data. When developing the SysBioCube, no single approach completely eliminated the threat of re-identification. However, a team with experts across technologies, limitations on data entry, tight control of users, and housing the repository on an already established, highly secure server together significantly aided the SysBioCube.

The SysBioCube is up and running; check it out at https://sysbiocube-abcc.ncifcrf.gov for more information.

 

[i] The Thin Red Line of Predictive Genetic Testing in the Military, On the Edges of Science and Law, http://blogs.kentlaw.iit.edu/islat/2015/03/16/the-thin-red-line/ (last visited August 19, 2016).

[ii] USAMRMC Strategic Information Paper. http://mrmc.amedd.army.mil/assets/docs/media/USACEHR_StratComm.pdf (last accessed May 10, 2016).

[iii] Id.

[iv] Chowbina et al. SysBioCube: A Data Warehouse and Integrative Data Analysis Platform Facilitating Systems Biology Studies of Disorders of Military Relevance. AMIA Jt Summits Transl Sci Proc. 2013 Mar 18;2013:34-8. eCollection 2013.

[v] Ponemon Institute Research Report. 2015 Cost of Data Breach Study: United States. May 2015. http://www.jmco.com/media/Ponemon-Data-Beach-2015-Report.pdf (last accessed May 12, 2016).

[vi] HIPAA Security Rule. http://www.hhs.gov/hipaa/for-professionals/security/lawsregulations/ (last accessed May 10, 2016).

[vii] Guidance Regarding Methods for De-identification of Protected Health Information in Accordance with the Health Insurance Portability and Accountability Act (HIPAA) Privacy Rule. http://www.hhs.gov/hipaa/for-professionals/privacy/special-topics/deidentification/index.html#howlong (last accessed May 1, 2016).

[viii] B. Malin, L. Sweeney, and E. Newton. Trail re-identification: learning who you are from where you have been. LIDAP-WP12. Carnegie Mellon University, Laboratory for International Data Privacy, Pittsburgh, PA: March 2003.

[ix] L. Sweeney, A. Abu, and J. Winn, “Identifying Participants in the Personal Genome Project by Name,” Data Privacy Lab, IQSS, Harvard University. 2013. (Accessible at http://privacytools.seas.harvard.edu/publications/identifying-participants-personalgenome-project-name).