If an R&D organization isn’t careful, data can cost money rather than enhance productivity or competitiveness. A new wave of informatics tools can bring it back under control.
In recent years, collecting data has not been a problem for research laboratories. Inexpensive computing resources, mobile instrumentation, and automated test procedures have rapidly multiplied the rate at which research and development organizations accumulate it.
The speed of acquisition has not been matched by the speed of product development. In some cases, time-to-market for high-technology products has increased. Many factors can play into the slowdown, including stricter regulations and high research costs. But many studies have pointed to data analytics as the main challenge, slowing innovation and hindering the movement of products to market.
Informatics may hold a solution for companies struggling to find better ways to use the data that they have, whether it’s at the enterprise level, or in the day-to-day research processes in the laboratory.
Figures presented at the Gartner Master Data Management Summit, held in Los Angeles earlier this year, predict that the master data management (MDM) market, a primary segment of informatics, will reach a market value of $1.9 billion in 2012 and surpass $3.2 billion in 2015. Informatics is already big business, but these numbers indicate something dramatic is afoot in how businesses handle their data. According to Chris Molloy, vice president of corporate development at IDBS, Guildford, U.K., there are currently more than 500 electronic laboratory notebook (ELN) companies.
“What we’re seeing in R&D is that most organizations are looking for fewer and more experienced vendors. Right now, there are far too many systems and data silos in play,” says Molloy. “They are trying to be more productive and process efficient with better production systems.”
The key trend, says Molloy, is interoperability. If a vendor can bring a company’s disparate systems together, it delivers success for the customer. Departmental or disciplinary “silos” are making it increasingly difficult to use information throughout an enterprise.
“As it stands now, a huge amount of data and intellectual property is being lost by companies that don’t have a good level of interoperability. Such an organization will spend 75% of its time collecting data and 25% of its time and sponsors’ money collapsing that data into a single report,” Molloy says.
As a result, interoperability requirements are being forced onto businesses. Worse, he says, research-oriented organizations are watching time devoted to research decrease while time devoted to development increases, because of limitations on data management.
Finding a solution to fit innovation
According to Michael Doyle, director of product marketing and principal scientist at Accelrys, San Diego, Calif., the increasing sophistication of R&D has produced significant challenges for information technology (IT) departments. Some organizations are beginning to struggle as high-throughput instrumentation that analyzes chemical structures or electronics is increasingly able to operate in an automated or inline fashion. Unlike the structured data that is commonly processed through product lifecycle management (PLM) and enterprise resource planning (ERP) systems, says Doyle, R&D information is exponentially more varied and complex.
The notion that data management can assist an organization is not new, but until recently companies offering all-encompassing data handling solutions, like SAP, DataFlux, and others, haven’t necessarily offered the kind of solution that would benefit a research-oriented organization as much as it could benefit, say, a financial firm. The beneficiaries have been in the manufacturing fields. But this is changing.
From Doyle’s point of view, resources like ERP and PLM have matured further than similar tools specifically designed to aid R&D, and should be adapted to the research environment. The adaptation is not easy, however, because unlike industrial processes, innovation is not linear.
“PLM systems have been instrumental in squeezing time, cost, and waste out of manufacturing and supply chain activities. Taking their cue from the ‘assembly line’ school of thought, they are designed to facilitate highly structured, stage-gate processes that move information through the product manufacturing and distribution pipeline as quickly, accurately, and efficiently as possible,” says Doyle.
The innovation process, however, calls for workflows that are ad hoc or out of sequence. Information can often loop back rather than move forward, and a PLM solution can get in the way of this process. On the other hand, Doyle also sees the need to automate the movement of R&D data between systems and applications. What’s needed, then, is a system that supports the R&D process without hindering it.
IDBS’ ActivityBase is one example of a product that can handle data that might otherwise be compromised by any of a number of bottlenecks, such as inflexibility in a company’s general data handling software, complexity in the data being collected, or the novelty of data collected as part of a new technology.
LS9, based in San Francisco, is a developer of biofuels that can be used in existing combustion engines. Unlike fuels based on natural oils, or alcohols and gases produced by microbial activity, LS9’s fuels rely on engineered bacteria and recombinant DNA technology. The company’s methods draw on polymerase chain reaction (PCR) techniques to generate protein sequences that can serve as the basis for new varieties of fuel. An original gene sequence is subjected to PCR to duplicate the genes of interest under reaction conditions that deliberately introduce random errors into the sequence, yielding a pool of mutated DNA strands. Part of the difficulty with this error-prone PCR method is that specific information about the protein each mutant expresses is unknown.
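For readers unfamiliar with the technique, the short Python sketch below simulates the effect of error-prone PCR: it copies a template sequence while introducing random substitutions at an assumed per-base error rate. The template, error rate, and cycle count are arbitrary illustrations, not LS9 parameters.

import random

def error_prone_pcr(sequence: str, error_rate: float = 0.01, cycles: int = 10) -> str:
    """Return a mutated copy of a DNA sequence, introducing random
    substitutions at the given per-base error rate on each cycle."""
    bases = "ACGT"
    seq = list(sequence)
    for _ in range(cycles):
        for i, base in enumerate(seq):
            if random.random() < error_rate:
                # substitute with one of the three other bases
                seq[i] = random.choice([b for b in bases if b != base])
    return "".join(seq)

# Example: mutate a short, made-up gene fragment
template = "ATGGCTAGCAAGGAGGTTCTG"
mutant = error_prone_pcr(template, error_rate=0.02)
print(template)
print(mutant)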
IDBS’ ActivityBase software was brought in to create database records for the products of the error-prone PCR. Records are stored either in the familiar Microsoft Excel format or in IDBS’ ActivityBase formats. Visualizations produced in ActivityBase give users verification of results, and a statistics engine that produces curve-fitting models gives users a better picture of erroneous data.
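IDBS has not published the internals of its statistics engine, but the curve fitting it describes is conceptually similar to fitting a standard four-parameter logistic model to assay readings and flagging points that stray from the fit. The sketch below, using SciPy and made-up concentration and response values, is a minimal illustration of that idea, not IDBS's implementation.

import numpy as np
from scipy.optimize import curve_fit

def four_param_logistic(x, bottom, top, ic50, hill):
    """Standard four-parameter logistic dose-response model."""
    return bottom + (top - bottom) / (1.0 + (x / ic50) ** hill)

# Hypothetical concentration/response readings from a screening plate
conc = np.array([0.01, 0.03, 0.1, 0.3, 1.0, 3.0, 10.0])      # uM
resp = np.array([98.0, 95.0, 85.0, 60.0, 30.0, 12.0, 5.0])   # % activity

params, _ = curve_fit(four_param_logistic, conc, resp,
                      p0=[0.0, 100.0, 0.5, 1.0])
bottom, top, ic50, hill = params
print(f"Fitted IC50 ~ {ic50:.2f} uM")

# Flag readings that deviate strongly from the fitted curve
residuals = resp - four_param_logistic(conc, *params)
outliers = np.abs(residuals) > 2 * residuals.std()
print("Suspect points:", conc[outliers])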
A continued drive to go “paperless”
“The idea of a paperless environment is not new,” says Kim Shah, director of marketing for Thermo Fisher Scientific’s (Waltham, Mass.) informatics division. “It’s come back now for a variety of reasons. The big recession of the 2000s caused a lot of companies to refrain from investing in new things. They started asking questions about whether they were reaping the benefits from the technologies they had already bought.”
In addition, a significant amount of workflow migrated offshore. Companies and laboratories were forced to begin looking at how to automate more processes, simply to facilitate the exporting of work.
The third reason for the shift, says Shah, was the arrival of new vertical markets that required both data acquisition and data analysis.
According to Susan Najjer, marketing director for informatics at Thermo Fisher Scientific, inline monitoring in the field has rapidly grown in capability and complexity.
“It’s paradoxical, but more complexity and more mobile technology have made a variety of tasks simpler. We can collect data automatically with the use of mobile devices,” she says, rendering traditional paper or unconnected electronic logging devices obsolete. Even now, makers of tablet computers are struggling with specific designs that fulfill the needs of industrial workers and field researchers.
“A few years back, you’d be hard-pressed to find an instrument with an IP address,” says Shah. The laboratory environment is changing fundamentally, he says, and part of that change has to do with the fact that researchers are seeing new ways of collecting and managing data and thinking about how to apply those methods to processes they used to think were immutable.
Ten years ago, companies and laboratories wanting a centralized repository for their data could opt for a laboratory information management system, or LIMS. One stumbling block for a customer in the market for a LIMS was weighing the features of one solution against another. The other was productivity and efficiency: a LIMS could help keep track of experiment data, but rarely could it be used to design a better experiment or laboratory process. Pen-and-paper processes persisted.
“Informatics is not just about LIMS anymore,” says Shah. “Today, a company using LIMS needs to contend with a lot of different software solutions from a lot of different companies.”
Thermo Scientific CONNECTS was launched by Shah’s company two years ago, primarily to assist the existing LIMS customer base that wanted a more comprehensive laboratory management solution. Thermo Fisher’s goal was to develop middleware technology that was able to connect with any system, anywhere, at any time. Constant connectivity, says Shah, would let the company’s customers pass data back and forth at will, no matter the source.
In Thermo Fisher Scientific’s case, its corporate acquisitions have allowed it to survey the field to determine the best format. CONNECTS converts all of the data to the common XML format, which can be utilized across many different platforms and systems.
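Thermo Fisher has not published the CONNECTS schema, so the sketch below only illustrates the general pattern: a flat instrument export (here a hypothetical CSV file) is converted into XML with Python's standard library so that downstream systems can parse it uniformly.

import csv
import xml.etree.ElementTree as ET

def rows_to_xml(csv_path: str, root_tag: str = "results") -> str:
    """Convert a flat CSV export from an instrument into a simple XML
    document that downstream systems can parse uniformly."""
    root = ET.Element(root_tag)
    with open(csv_path, newline="") as fh:
        for row in csv.DictReader(fh):
            sample = ET.SubElement(root, "sample", id=row.get("sample_id", ""))
            for field, value in row.items():
                if field != "sample_id":
                    ET.SubElement(sample, field).text = value
    return ET.tostring(root, encoding="unicode")

# "plate_reader_export.csv" is a made-up file name for illustration
print(rows_to_xml("plate_reader_export.csv"))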
Another model that Thermo Fisher Scientific has adopted is to work with smaller software specialists that have built tools for certain industries. Switzerland-based Vialis is one such partner.
“They have something called multi-moment analysis, where they track, literally down to the minute, what someone is doing. Then they determine what the benefit would be in introducing some form of automation,” says Shah.
Semantic Web and the enterprise
The arrival of Web-based computing resources capable of serving the needs of the enterprise has also changed the data management landscape. At one time, organizations interested in all-encompassing enterprise software wanted a solution that kept their resources behind well-guarded barricades and their intellectual property sealed off from prying eyes.
Now, however, an increasing number of small- to mid-sized businesses are finding that they can save time and resources by adopting cloud-based data solutions. The tools are fast and becoming faster, and, for most of these businesses, the cloud offers adequate data integrity. Outweighing any security concerns is the ability to share and access data anywhere.
Better yet, the cloud has allowed the development of practical uses for Semantic Web theories initiated by the World Wide Web Consortium (W3C). At its most basic, the Semantic Web describes the ability of semantic, or relational, content to be embedded in Web pages, allowing unstructured documentation to be organized as a web of searchable, interconnected data.
In 2006, Tim Berners-Lee, one of the original developers of the World Wide Web, pointed to SPARQL, a flexible and powerful query language, as one of the key enabling technologies of the Semantic Web. It’s also one of the key foundational technologies for Revelytix, a Cockeysville, Md.-based company that has built its technology on the emerging field of ontology. Originally, ontology was a branch of philosophy dealing with questions about the existence of entities and how they can be organized and subdivided in a hierarchy. In computer science, the term has been redefined as a formal description of the concepts and relationships that can exist for an agent or a community of agents.
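As a small illustration of what a SPARQL query looks like in practice, the sketch below uses the open-source rdflib Python library to load an RDF file and ask for every compound and its melting point. The file name, namespace, and properties are invented for the example and have nothing to do with Revelytix's products.

from rdflib import Graph

# Load a small RDF dataset (the file name is hypothetical)
g = Graph()
g.parse("materials.ttl", format="turtle")

# Ask for every compound and its measured melting point
query = """
PREFIX ex: <http://example.org/lab#>
SELECT ?compound ?meltingPoint
WHERE {
    ?compound a ex:Compound ;
              ex:meltingPoint ?meltingPoint .
}
ORDER BY ?meltingPoint
"""

for compound, melting_point in g.query(query):
    print(compound, melting_point)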
David Schaengold, director of business solutions at Revelytix, has been using the Rule Interchange Format (RIF) to validate distributed enterprise data. RIF, a W3C standard from the same Semantic Web family as SPARQL, starts from the premise that many rule languages exist and defines what is needed to exchange rules between them. To account for these differing languages, RIF provides a variety of “dialects” that allow rules to be rationalized and exchanged successfully.
The Revelytix enterprise data integration software treats operational or legacy systems as SPARQL “endpoints”. While each existing data store retains its function as the authoritative and persistent source, Revelytix tools can interpret these data sources as a single queryable dataset.
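A federated SPARQL 1.1 query gives a feel for how separate endpoints can be treated as one dataset. In the sketch below, written with the SPARQLWrapper Python library, a LIMS and an ELN are exposed as distinct endpoints and joined in a single query. Every URL and predicate is invented, and the query is submitted to an assumed federating endpoint rather than to Revelytix's actual software.

from SPARQLWrapper import SPARQLWrapper, JSON

# A federated SPARQL 1.1 query: the LIMS and the ELN are exposed as
# separate endpoints, but the query joins them as if they were one dataset.
# All endpoint URLs and predicates here are hypothetical.
query = """
PREFIX ex: <http://example.org/rd#>
SELECT ?sample ?assayResult ?notebookEntry
WHERE {
  SERVICE <http://lims.example.org/sparql> {
    ?sample ex:assayResult ?assayResult .
  }
  SERVICE <http://eln.example.org/sparql> {
    ?sample ex:describedIn ?notebookEntry .
  }
}
"""

endpoint = SPARQLWrapper("http://federator.example.org/sparql")
endpoint.setQuery(query)
endpoint.setReturnFormat(JSON)
for row in endpoint.query().convert()["results"]["bindings"]:
    print(row["sample"]["value"], row["assayResult"]["value"])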
Schaengold has most recently found that RIF is a powerful tool for ensuring the consistency and correctness of data. Whenever multiple data sources are integrated at run time for enterprise applications, a distinct set of validation rules can be applied at the abstraction layer, operating on top of any data store-level validations already in place. A so-called “forward-chaining” rules engine built with RIF has been a helpful addition for data validation use cases, validation being especially important to any organization performing QA/QC tasks.
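RIF itself is an XML interchange format and Revelytix's engine is proprietary, but the forward-chaining idea is easy to sketch: rules fire on known facts, newly derived facts can trigger further rules, and validation violations accumulate as facts of their own. The toy Python example below, with invented predicates, shows only the control loop.

# A minimal forward-chaining sketch: rules fire on facts, derived facts can
# trigger further rules, and validation violations accumulate as facts too.
# This illustrates the idea only; it is not RIF or Revelytix's engine.

facts = {
    ("sample:42", "hasUnit", "mg/L"),
    ("sample:42", "hasValue", "-3.1"),
}

def rule_negative_concentration(facts):
    """Derive a violation for any measurement with a negative value."""
    new = set()
    for subj, pred, obj in facts:
        if pred == "hasValue" and float(obj) < 0:
            new.add((subj, "violates", "non-negative-concentration"))
    return new

rules = [rule_negative_concentration]

# Forward chaining: keep applying rules until no new facts are derived
changed = True
while changed:
    changed = False
    for rule in rules:
        derived = rule(facts) - facts
        if derived:
            facts |= derived
            changed = True

print([f for f in facts if f[1] == "violates"])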
Revelytix isn’t the only company pursuing ontological solutions. Global IDs Inc., a New York-based company founded by Arka Mukherjee, offers semantic solutions designed not just to save cost and time, but also to offer better data traceability, which can help customers reduce risk.
The company has four major products, including its Metadata Governance Suite, based on a single code base and metadata repository. Its tools can scan both unstructured and structured data, looking for semantic content. A database is scanned and its contents grouped into business areas. From these groups, a map is built that connects a single semantic object to all of the different customer “tables”; conceivably, 3,000 semantic domains can coalesce into 30 core objects.
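Global IDs' scanners are far more sophisticated, but the mapping Mukherjee describes can be illustrated with a toy example: column names harvested from several hypothetical databases are assigned to a handful of semantic domains by simple name-matching rules, and each domain then points back to every physical table and column it covers.

# Hypothetical column names harvested from several operational databases
columns = [
    "CRM.CUST_NAME", "BILLING.CUSTOMER_NM", "SUPPORT.client_name",
    "CRM.CUST_EMAIL", "BILLING.EMAIL_ADDR",
    "ORDERS.SHIP_ADDR_1", "BILLING.STREET_ADDRESS",
]

# Very simple pattern rules that assign each column to a semantic domain;
# a production scanner would profile the data values, not just the names.
domain_rules = {
    "CustomerName":  ("name", "nm"),
    "EmailAddress":  ("email",),
    "PostalAddress": ("addr", "address"),
}

mapping = {}
for col in columns:
    lowered = col.lower()
    for domain, hints in domain_rules.items():
        if any(h in lowered for h in hints):
            mapping.setdefault(domain, []).append(col)
            break

for domain, cols in mapping.items():
    print(domain, "->", cols)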
Semantics and ontologies seem esoteric, says Mukherjee, but they spring from the reality that current ways of looking at data can be crippling to a business.
“We know, of course, that relational databases are very inadequate ways of representing information. Instead, technology has forced us to take these real-world models and force them together. Ontologies offer much better relationships that our software is able to address,” he says.
Driven by a need to be competitive
According to a survey conducted in 2011 by DataFlux Corp., a data management company based in Cary, N.C., few companies have been able to build a “single” picture of their customer base, because silos prevent comprehensive analysis and use of the data resources at hand.
Clearly, however, the pressure is on to know the customer. Of the companies surveyed, nearly a third reported they are actively building a single view of their customers. Nearly 30% said they are also developing a comprehensive data management strategy, and more than 20% reported that their company already has such a strategy in place. These numbers jump higher for large companies with more than 10,000 employees.
Not every company is on board with master data management, however. The DataFlux survey showed that while a majority of the companies surveyed want better data quality and integration (59% and 55%, respectively), they don’t necessarily have all-encompassing MDM plans in place; only 31% do.
Why the relative lag in master data projects?
Some organizations, says Paul Planje, director of sales for Vialis, don’t believe they are a good fit for such strategies. Often these are companies that, despite working in a scientific realm, still don’t operate as a paperless business. Or they may be small enough, he says, with just a dozen or so employees, that they feel they can afford the time lost to manual recording of information.
Another explanation could be that comparatively few data management issues are addressed by a given company outside the information technology department. This is true even at a laboratory. According to DataFlux, just 17% of data management problems are handled by executives. Most fall to IT or operations.