By Mike Connell, COO & Chief Digital Transformation Officer at Enthought
Data is at the foundation of digital strategy
Digital transformation is fundamentally changing the way we live, work and relate to each other, both personally and professionally. New entrants are disrupting established industries, seemingly overnight, with novel, digitally enabled operating and business models, and virtually every company in every industry needs a digital strategy to remain competitive in the fight for both talent and customers. Leaders with foresight will recognize that, beyond the ability to keep pace, digital strategies and technologies can provide their organizations with data that will allow them to leap ahead of the competition. A smaller group of visionary leaders goes even further to achieve the holy grail – the ability not only to move ahead through a one-shot digital overhaul of the current business, but to stay ahead in the market over time through digitally enabled continuous innovation and adaptation of both operations and the overall business model.
The R&D segment of a business is a great place to start because, while data is at the foundation of every digital transformation, scientific data often contains significant untapped value. Streamlining data collection, management and access creates immediate opportunities to generate significant business value, while at the same time laying a foundation for continuous future innovation.
R&D data requires special handling
Data is indeed at the core of every digital strategy. But scientific data is different from other kinds of data, and unlocking its value presents unique challenges that require special expertise, tools and handling. Consider the following examples:
- Because scientific data is often unstructured (images, graphs, videos, distributions, spectra, chemical structures, genetic sequences and so on), a search becomes a more complex task. Standard database tools, like a table lookup from a SQL database, are woefully insufficient. The process often requires a computational step to extract features from the data – such as the width of a line in an image or the location of a peak in a spectrum – that are the actual target of the search. Further complicating the issue, these queries are best handled by the scientists or engineers with domain expertise, not database administrators or data scientists.
- Scientific data is also often stored in binary files rather than text files. These binary formats are often proprietary, which makes it difficult to extract the data for exploration and analysis, let alone to centralize or combine data sets. Computing efficiently with a binary file also requires a very different approach from CSV and other text-based processing, because it depends on how the data is laid out in the file and how readily it can be accessed.
- The lack of structured data in R&D often results in significant wasted resources, a dramatic slowdown in key processes and increased operational risk due to key-person dependencies. Without the ability to search accurately and efficiently, scientists are left fumbling with ineffective keyword searches or walking around the lab asking more senior colleagues for suggestions. If a senior person leaves, their knowledge goes with them and must be rediscovered.
- Another complication with scientific data is that a data file often contains only part of the important information. The metadata surrounding the file can be just as important to extracting the real value of the data. For example, what was the temperature in the room when the experiment was performed and the output was taken from the instrument? What is the ID of the sample of material on which a series of measurements was taken? Who was the operator who ran the experiment? Without sufficient metadata, the data becomes useless after the primary task is complete. Capturing and storing comprehensive metadata is therefore key to secondary analysis, data mining and even experiment replication.
- Finally, given the exploratory nature of R&D, there is a need for flexibility: data models must evolve easily so that they can be used to rapidly test new ideas and hypotheses. In marketing or finance, for example, the data fields and data sets are often structured, externally defined and immutable to the analyst. The scientific data model, by contrast, needs to evolve with the needs of scientists as they add measurements to the data set and extract new features from their unstructured data to test an idea.
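To make the points above concrete, here is a minimal Python sketch of a feature-based search over measurements that carry their own metadata. Everything in it is an assumption made for illustration: the binary layout (a count followed by 64-bit floats), the `Measurement` record, the field names and the "highest point" stand-in for real peak fitting are all hypothetical, not a description of any particular instrument format or product.

```python
from __future__ import annotations

import struct
from dataclasses import dataclass, field


@dataclass
class Measurement:
    """One instrument output plus the context needed to reuse it later."""
    intensities: list[float]
    metadata: dict = field(default_factory=dict)  # e.g. sample ID, operator, room temperature
    features: dict = field(default_factory=dict)  # values computed from the raw data


def decode_spectrum(blob: bytes) -> list[float]:
    """Decode a hypothetical layout: little-endian uint32 count, then float64 values."""
    (count,) = struct.unpack_from("<I", blob, 0)
    return list(struct.unpack_from(f"<{count}d", blob, 4))


def peak_index(intensities: list[float]) -> int:
    """Index of the highest point; a simple stand-in for real peak fitting."""
    return max(range(len(intensities)), key=intensities.__getitem__)


def find_by_peak(measurements: list[Measurement], lo: int, hi: int) -> list[Measurement]:
    """Search on an extracted feature rather than on keywords or file names."""
    return [
        m for m in measurements
        if "peak_index" in m.features and lo <= m.features["peak_index"] <= hi
    ]


# Simulate a binary file from an instrument, then decode it and attach metadata.
raw = struct.pack("<I5d", 5, 0.1, 0.4, 2.5, 0.3, 0.1)
measurement = Measurement(
    intensities=decode_spectrum(raw),
    metadata={"sample_id": "S-001", "operator": "A. Lee", "room_temp_c": 21.5},
)

# The computational step: extract the feature that is the actual target of the search.
measurement.features["peak_index"] = peak_index(measurement.intensities)

matches = find_by_peak([measurement], lo=1, hi=3)
```

The search here runs against a computed feature, and the metadata travels with the measurement, so a later analysis can still tell which sample and operator produced a given peak.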
While the value of R&D data is clear, finding a way to empower scientists to work with it efficiently and effectively can be daunting given the special handling required to extract its value. In fact, 75% of surveyed R&D executives believe advanced analytics techniques should play a pivotal role in their future R&D activities, but only 25% state that their R&D organizations are actually using these analytics today.
Using Dynamic Data Models to achieve innovation in R&D
The bottom line is that R&D data is special in two ways: in the substantial untapped value it often contains, and in the special handling required to extract that value. Because scientific discovery is an iterative process of exploration and hypothesis testing, it must be done by expert scientists with tools that are both structured (for data analysis) and flexible (for easily and rapidly accommodating new hypotheses and ideas).
As the figure at left shows, commonly used data management tools are insufficient for R&D. Data lakes, for example, are flexible but lack the structure needed to support ready analysis. Data warehouses provide structured data, but lack the flexibility to accommodate rapid, iterative exploration of new ideas and hypotheses by scientists. Excel lacks both structure and what we might call “managed flexibility.” That is, people prize Excel for its extreme flexibility at the level of the individual doing a single task, but it is unmanaged – the flexibility is gained by dispensing with a data model altogether. That kind of unmanaged flexibility comes at a great cost in that it often renders the individual data unusable by anyone else or for any purpose other than the original task. It’s not possible to search data stored in a corpus of ad hoc Excel files generated by many different researchers, for example.
The ideal solution should be able to provide structured data through an API organized around the scientific domain (chemistry or biology, for instance) that is usable directly by scientists, while supporting the flexibility necessary for discovery through Dynamic Data Models. These tools, purpose-built for science, dramatically accelerate today’s scientific workflows, while also laying a foundation for tomorrow’s innovations.
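As a rough illustration of what "structured but flexible" could mean in practice, the following Python sketch shows a data model whose fields can be extended at run time while existing records remain valid and searchable. The `DynamicModel` class, its methods and the coating example are hypothetical names invented for this sketch; it is one possible reading of the idea, not the design of any specific product or API.

```python
from dataclasses import dataclass, field


@dataclass
class DynamicModel:
    """A schema that can grow: named, typed fields that are extended at run time."""
    name: str
    fields: dict = field(default_factory=dict)  # field name -> expected Python type

    def add_field(self, field_name, field_type):
        """Extend the model without touching records that already exist."""
        self.fields[field_name] = field_type

    def validate(self, record):
        """Check the fields a record has; fields added later may be absent."""
        for key, value in record.items():
            if key not in self.fields:
                raise KeyError(f"{key!r} is not part of model {self.name!r}")
            if not isinstance(value, self.fields[key]):
                raise TypeError(f"{key!r} should be {self.fields[key].__name__}")
        return True

    def query(self, records, field_name, predicate):
        """Structured search: only records that carry the field are candidates."""
        return [r for r in records if field_name in r and predicate(r[field_name])]


# Start with the measurements the team records today.
coating = DynamicModel("coating_trial", {"sample_id": str, "thickness_um": float})
records = [{"sample_id": "C-01", "thickness_um": 12.5}]
coating.validate(records[0])

# A new hypothesis needs a new measurement: extend the model, keep old records valid.
coating.add_field("gloss_units", float)
records.append({"sample_id": "C-02", "thickness_um": 11.0, "gloss_units": 84.0})

all_valid = all(coating.validate(r) for r in records)
glossy = coating.query(records, "gloss_units", lambda g: g > 80.0)
```

The structure comes from the typed field registry, which rejects unknown or mistyped fields; the flexibility comes from `add_field`, which lets the scientist widen the model without a schema migration. A spreadsheet offers the second property without the first.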
R&D data in the hands of scientists enables two levels of innovation: sustaining and disruptive
To have a real and transformative impact, R&D organizations must rethink their approach to data to optimize its utility for science and its value to the business. General-purpose data solutions designed for other sectors or other areas of the business are ill-equipped to confront the challenges of scientific data. With a data approach designed especially for scientific use cases, labs can hide the complexity of scientific data from their end users (scientists and engineers), freeing them to focus on science and innovation.
The immediate and most obvious benefits of providing analysis-ready R&D data to scientists typically include dramatic improvements in current R&D operations, including greater throughput, better reliability, reduced business risk, cost savings, reduced waste and shortened time to value. More profound but often less visible benefits come from having all the data organized in one place and ready for analysis. These include powerful new capabilities such as optimizing the design of experiments; instantly finding better starting points for a formulation (and therefore better end results, faster); calibrating and validating simulations that can reduce experimental iterations; freeing expert scientists from routine decision-making steps in a manual process, such as visual inspection of an image for defects or other features; and feeding data to advanced modeling and analytics, including AI and machine learning. These are examples of sustaining innovations that enhance the value that businesses deliver today.
These same capabilities for delivering sustaining innovations that enhance today’s business also provide the foundation for generating the breakthrough ideas – disruptive innovations – that may become the major drivers of the business tomorrow. When scientists have all their data at their fingertips and the tools to explore ideas easily, rapidly and inexpensively, they can use that capability to do their current work better and faster, but they can just as easily explore next-generation product features or product platforms in materials science or new therapeutic modalities in biotech using these same affordances.