The pace of discovery is accelerating across scientific fields. The biggest reason: an abundance of data. With better tools to collect, track and analyze information, researchers are making enormous strides in areas as diverse as climate modeling, disease research, molecular biology, education reform and more.
Modern big data technologies break down traditional data silos and make it possible to collect all manner of information in one place — structured and unstructured data, DNA snippets, images, documents and much more. Pooled together, this data can now be visualized, analyzed and even used for predictive analytics with powerful data science tools like SAS, R and many others.
But as the ocean of new information gets larger every day, and the possibilities for new insights grow, researchers are bumping up against a pressing problem: how do you make all of this data available to everyone who could take advantage of it, while keeping it properly secured and private?
Conventional de-identification methods create a serious bottleneck. They are often too time-consuming and rigid to work at scale. Or, in the effort to make data more broadly consumable, they degrade the data’s quality and value.
This is not a small problem. Indeed, it’s becoming a major roadblock to scientific research. But there’s a way to get around it. Using data-layer strategies that borrow proven privacy concepts from the world of networking, organizations can enforce strict data governance at scale, in real time. It’s a concept called Zero Trust Data.
The growing data dilemma
To understand how data privacy requirements can hinder the pace of scientific investigation, consider the following scenario:
A large public health agency is collecting data on a population from thousands of data sources. By pooling all that information together and putting it in the hands of researchers, investigators can uncover major new insights into disease, identify previously unseen trends, and help shape better public policy. With big data technologies and modern data science tools, the pieces are all in place. But to take advantage of them, researchers need quality data. Ideally, that means continually updated information and fast, ongoing access to new data.
But today, agencies mounting such projects in the United States, Canada and elsewhere can take as long as 18 months to release new data to investigators. Why? Because a public health agency — or any organization collecting sensitive data — is legally and ethically obligated to ensure that personally identifiable information (PII) is seen only by authorized parties for legitimate purposes. And the typical mechanisms for enforcing that are less than ideal.
One option is to de-identify information as it’s fed into the system. But in doing that, you lose provenance, so you can’t easily track data back to the source or link information across different data stores. The data set is no longer extensible vertically (adding more up-to-date clinical results for individuals, for example) or horizontally (integrating a new genomics or medication database that links individuals in the pre-existing data set).
The second option is to de-identify information on request for each consumer. But this represents an intensive manual process — a human being going through the data to assess the risk of re-identification, deciding what to make available to each requestor, and then publishing a custom data mart. The organization then has a new data sprawl problem, managing dozens or even hundreds of data marts. The extensibility of the data is still limited. And time to data access is even longer.
Some large public health agencies, for example, have made huge strides in data collection, and can now release updated information to researchers every few months. But what if they could publish weekly — or even make data available continuously on demand? What if their data systems could take on most of the effort to de-identify information and build data sets autonomously, instead of requiring continuous, resource-intensive de-identification projects?
Compromising data privacy is not an option. But Zero Trust Data approaches can automate privacy enforcement and radically reduce the time and costs involved.
Zero Trust Data
Zero Trust Data is based on concepts from the world of networking. In a Zero Trust network, no network segment is “trusted.” Every single access request is interrogated for appropriate authorization, every time.
Zero Trust Data extends these principles a layer deeper, into the data itself. Rather than de-identifying information as it’s fed into the system, a Zero Trust Data system catalogs all data and encodes it with a metadata “wrapper” as it’s collected. The metadata describes not only what the data contains, but where it resides within the organization’s privacy and governance framework. So, a given data asset (and even information within a single asset) can be immediately designated as PII or protected health information (PHI) that must be handled according to institutional policy.
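To make the idea concrete, here is a minimal sketch of what such a metadata wrapper could look like in Python. The field names, classification labels and policy reference are illustrative assumptions for this article, not an actual product schema.

    # A minimal sketch of a metadata "wrapper" attached to each ingested asset.
    # Field names and classification labels are illustrative, not a real schema.
    from dataclasses import dataclass, field
    from datetime import datetime, timezone

    @dataclass
    class DataAssetMetadata:
        asset_id: str          # catalog identifier
        source: str            # provenance: where the data came from
        ingested_at: datetime  # when it entered the system
        classification: str    # e.g. "PII", "PHI", "aggregate", "public"
        governing_policy: str  # pointer into the privacy and governance framework
        field_tags: dict = field(default_factory=dict)  # per-field sensitivity labels

    record_meta = DataAssetMetadata(
        asset_id="lab-results-000421",
        source="regional-lab-feed",
        ingested_at=datetime.now(timezone.utc),
        classification="PHI",
        governing_policy="institutional-privacy-policy/v3",
        field_tags={"patient_name": "PII", "date_of_birth": "PII", "hba1c": "PHI"},
    )

Because the wrapper travels with the data from the moment of collection, the catalog always knows what a given asset contains and which policy governs it, without a separate de-identification pass.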
The data system then uses an approach based on attribute-based access control (ABAC) to enforce that policy. All data is encrypted by default. Unless the requestor has the proper attributes — which can include role, location, device type, time of day and more — the request yields a null response.
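A highly simplified illustration of that ABAC decision follows; the attribute names and the example policy are assumptions made for the sketch, not a prescribed rule set.

    # Simplified ABAC check: grant access only if the requestor's attributes
    # satisfy the policy attached to the asset; otherwise return nothing (null).
    def abac_decision(request_attrs: dict, required_attrs: dict) -> bool:
        return all(request_attrs.get(k) == v for k, v in required_attrs.items())

    policy = {"role": "clinician", "location": "on-site", "device": "managed"}

    request = {"role": "researcher", "location": "on-site", "device": "managed"}
    data = {"patient_name": "Jane Doe", "hba1c": 6.1}

    # Data stays encrypted by default; it is only decrypted and returned on a permit.
    response = data if abac_decision(request, policy) else None
    print(response)  # None: the researcher's attributes do not satisfy this policy

In a real deployment the attribute set would be richer (time of day, purpose of use, consent flags), but the principle is the same: no attributes, no data.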
Finally, a Zero Trust Data system provides on-the-fly data processing. Based on institutional privacy and governance policy, and the requesting user’s attributes, the system can deliver exactly the information the requestor is authorized to see, and no more.
A physician participating in a precision medicine program, for example, might be able to view the complete data set for her patient. A researcher in the program might see clinical and genomic data, but no PII. And a bioinformatician at an affiliated public health agency might view aggregated data for a given population, but no individual records. All of this happens automatically, at scale, on demand.
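One way to picture this on-the-fly processing is as a policy-driven projection of the same underlying record. The sketch below mirrors the roles above; the field groupings are assumptions for illustration only.

    # Policy-driven views of a single record, keyed by the requestor's role.
    # The roles and field groupings are illustrative only.
    FULL_RECORD = {
        "patient_name": "Jane Doe",             # PII
        "date_of_birth": "1961-04-02",          # PII
        "diagnosis": "type 2 diabetes",         # clinical
        "genomic_variant": "TCF7L2 rs7903146",  # genomic
    }

    VIEWS = {
        "treating_physician": ["patient_name", "date_of_birth", "diagnosis", "genomic_variant"],
        "program_researcher": ["diagnosis", "genomic_variant"],  # clinical + genomic, no PII
    }

    def materialize_view(record: dict, role: str) -> dict:
        allowed = VIEWS.get(role, [])
        return {k: v for k, v in record.items() if k in allowed}

    print(materialize_view(FULL_RECORD, "program_researcher"))
    # {'diagnosis': 'type 2 diabetes', 'genomic_variant': 'TCF7L2 rs7903146'}

The bioinformatician's aggregate view would be produced the same way, except the system would return population-level summaries rather than any per-record projection.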
Solving the data dilemma
By implementing a Zero Trust Data model, organizations gain the freedom to share information among a much wider pool of stakeholders without compromising data privacy and governance. At the same time, they eliminate the delays and overhead associated with manually creating de-identified data marts.
Researchers gain much more flexible and extensible data, and the ability to access it much faster. Organizations greatly reduce the data sprawl (data replication) problem. And when Zero Trust Data is implemented correctly, it delivers these advantages without impacting the performance or latency of the system. All of a sudden, the biggest roadblocks between collecting data and taking advantage of it disappear. And the pace of scientific research in virtually any field can start to move much faster.
Adam Lorant is cofounder and vice president of products and solutions at PHEMI.