Managing the Flood of Proteomic Data
PRISM performs data management and automated analysis for high-throughput proteomics research
Gordon A. Anderson
Conducting proteomics research at any significant level of throughput using mass spectrometry requires automated information management, because the volume of data is too large and the required processing rates are too rapid to be managed manually. Researchers in the Environmental Molecular Sciences Laboratory at the U.S. Department of Energy’s Pacific Northwest National Laboratory (PNNL) have developed an information management system, the Proteomics Research Information System and Management (PRISM), that converts data produced by multiple mass spectrometers of differing types into information about the proteins present in biological samples. The system tracks the history of biological samples and provides automated analyses of proteomic data. Designed with the flexibility and scalability to readily accommodate future needs and expanded functionality, PRISM has grown and evolved to meet increasingly sophisticated research needs.
Making data useful
Proteomic analysis of biological samples produces large volumes of data, permitting the identification of peptides and proteins as well as measurement of their abundances. This research often requires hundreds or thousands of separate mass spectrometric (MS) experiments, each performed using a liquid chromatographic (LC) separation step coupled to both MS and tandem MS analyses. Data analysis tools are then used to perform database searches that identify peptides from tandem MS datasets. Other tools interpret and extract detected masses from MS datasets and assign peptide identifications to those masses using additional information from LC elution times.
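The last step, assigning peptide identifications to detected masses with the help of LC elution times, can be illustrated with a minimal Python sketch. It is not PRISM code: the record layout, the function names, the 5 ppm mass tolerance, the 0.02 elution-time window and the tag values are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class AmtTag:
    """A previously identified peptide (hypothetical record layout)."""
    peptide: str
    mass: float          # monoisotopic mass, Da
    elution_time: float  # normalized LC elution time, 0 to 1


def ppm_error(observed: float, reference: float) -> float:
    """Signed mass error in parts per million."""
    return (observed - reference) / reference * 1e6


def match_feature(mass: float, elution_time: float, tags: List[AmtTag],
                  ppm_tol: float = 5.0, time_tol: float = 0.02) -> Optional[AmtTag]:
    """Return the tag closest in mass among those within both tolerances."""
    candidates = [
        t for t in tags
        if abs(ppm_error(mass, t.mass)) <= ppm_tol
        and abs(elution_time - t.elution_time) <= time_tol
    ]
    if not candidates:
        return None
    return min(candidates, key=lambda t: abs(ppm_error(mass, t.mass)))


# Placeholder tags: two peptides about 1 ppm apart in mass but eluting far apart.
tags = [
    AmtTag("peptide_A", 1000.5000, 0.30),
    AmtTag("peptide_B", 1000.5010, 0.62),
]
hit = match_feature(1000.5006, 0.61, tags)
print(hit.peptide if hit else "no match")
```

In this example both placeholder tags fall within the mass tolerance of the detected feature, so mass alone cannot separate them; the elution time is what selects peptide_B.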
Such complex multistage analyses require tracking of experimental conditions and sample pedigree. In addition, quality control analyses performed at several stages of the process are necessary to ensure the quality of instrument performance and sample preparation. The PRISM technology performs these data management tasks and automates the data processing pipeline.
Funded primarily by the DOE Office of Biological and Environmental Research, PRISM has been in operation for nearly five years, managing information on thousands of proteomics experiments. The system is tightly integrated into the laboratory’s processes and instrumentation and is highly automated to support high-throughput proteomics investigation while minimizing time demands on the scientific and technical staff.
How PRISM works
PRISM is composed of distributed software components that operate cooperatively on commercially available computer systems, communicating by means of standard network connections. It collects data files directly from all 14 PNNL mass spectrometers, manages file storage and tracking, and automates data processing into both intermediate results and final products. The system also collects and maintains information about the biological samples used in research experiments as well as the laboratory protocols and procedures used to prepare them. A Web interface gives users access to the information contained within the system. As shown in Figure 1, PRISM consists of two major subsystems, the Data Management System (DMS) and the Mass and Time Tag System (MTS).
Data Management System
DMS is the core of PRISM operations, coordinating automated analyses, tracking researcher data and providing an interface to that data for the researcher’s use. PRISM arranges information using a simple hierarchy to represent the relationships between data elements. The system uses these hierarchical entities to store metadata about the raw spectra and the analysis results files, preserving contextual information on the origin of the data. Types of metadata can include:
• information regarding the preparation of the biological material, such as growth and stress conditions;
• sample digestion and processing conditions;
• experimental factors, such as chromatographic separation and MS parameters and conditions;
• details regarding data post-processing, such as the peptide identification parameters and results filtering criteria used.
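A hierarchy of tracking entities carrying metadata of these kinds might be modeled as in the following Python sketch. The article does not name the levels of the hierarchy, so the entity kinds used here (experiment, dataset, analysis job), the class and method names, and the metadata values are assumptions made for illustration only.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass(eq=False)
class Entity:
    """One node in a tracking hierarchy (hypothetical model, not the DMS schema)."""
    kind: str
    name: str
    metadata: Dict[str, str] = field(default_factory=dict)
    parent: Optional["Entity"] = field(default=None, repr=False)
    children: List["Entity"] = field(default_factory=list, repr=False)

    def add_child(self, kind: str, name: str, **metadata: str) -> "Entity":
        """Create a child entity and link it in both directions."""
        child = Entity(kind, name, dict(metadata), parent=self)
        self.children.append(child)
        return child

    def context(self) -> Dict[str, str]:
        """Merge metadata from the root down to this entity.

        Values set closer to this entity override values set higher up.
        """
        chain = []
        node: Optional[Entity] = self
        while node is not None:
            chain.append(node)
            node = node.parent
        merged: Dict[str, str] = {}
        for ancestor in reversed(chain):
            merged.update(ancestor.metadata)
        return merged


# Made-up example values for each level of the assumed hierarchy.
experiment = Entity("experiment", "exp_001", {"growth_condition": "stress"})
dataset = experiment.add_child("dataset", "ds_001", separation="LC")
job = dataset.add_child("analysis_job", "job_001", tool="database_search")
print(job.context())
```

Because every entity keeps a link to its parent, the full origin of an analysis result can be recovered by walking upward, which is the contextual information the metadata is meant to preserve.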
Data file storage is the simplest aspect of the DMS, consisting of a large file space that is accessible to the tracking database and the various “manager” programs. However, the sheer size and volume of the data to be tracked have made managing it one of the most challenging facets of DMS design. To this end, an active space manager program watches over the disk storage arrays, moving the least-analyzed data to near-online archive storage. If interest in these archived datasets is renewed at a later time, the system automatically recalls them in a delicate balancing act of user demand and available local storage.
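The balancing act can be sketched in Python as a simple in-memory model. This is an illustration under stated assumptions, not the actual space manager: the class names, the capacity figure and the rule of archiving the datasets with the fewest recorded analyses first are hypothetical, and no files are really moved.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class StoredDataset:
    name: str
    size_gb: float
    analysis_count: int = 0
    archived: bool = False


class SpaceManager:
    """Keeps local disk usage under a limit by archiving little-analyzed data."""

    def __init__(self, capacity_gb: float):
        self.capacity_gb = capacity_gb
        self.datasets: Dict[str, StoredDataset] = {}

    def local_usage_gb(self) -> float:
        return sum(d.size_gb for d in self.datasets.values() if not d.archived)

    def add(self, name: str, size_gb: float) -> None:
        self.datasets[name] = StoredDataset(name, size_gb)
        self._make_room(keep=name)

    def request(self, name: str) -> StoredDataset:
        """Return a dataset, recalling it from the archive if necessary."""
        ds = self.datasets[name]
        if ds.archived:
            ds.archived = False
            self._make_room(keep=name)
        return ds

    def record_analysis(self, name: str) -> None:
        self.request(name).analysis_count += 1

    def _make_room(self, keep: str) -> None:
        """Archive the least-analyzed local datasets until usage fits."""
        local = sorted(
            (d for d in self.datasets.values() if not d.archived and d.name != keep),
            key=lambda d: d.analysis_count,
        )
        for ds in local:
            if self.local_usage_gb() <= self.capacity_gb:
                break
            ds.archived = True


manager = SpaceManager(capacity_gb=10)
manager.add("ds_a", 4)
manager.add("ds_b", 4)
manager.record_analysis("ds_a")
manager.record_analysis("ds_a")
manager.record_analysis("ds_b")
manager.add("ds_c", 4)     # over capacity: ds_b has the fewest analyses
manager.request("ds_b")    # recalling ds_b pushes out ds_c instead
```

The dataset being added or recalled is never chosen for archiving in the same step, so a request always leaves the requested data on local storage.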
Intermediate data processing
One or more analysis software tools process the raw data produced by the mass spectrometers. DMS automates this process with several analysis manager programs, each of which operates an instance of an analysis tool. This process is governed by an analysis job that specifies a dataset entity to be analyzed, the tool to be used and any parameters or settings required by that tool to perform the analysis.
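An analysis job and the manager that services it might look like the following Python sketch. The field names, the job states and the stand-in tool function are assumptions for illustration; real analysis tools and the real job queue are not shown.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class AnalysisJob:
    """What to analyze, with which tool, and with what settings."""
    dataset: str
    tool: str
    parameters: Dict[str, str] = field(default_factory=dict)
    state: str = "new"            # assumed states: new, in_progress, complete
    result: Optional[str] = None


class AnalysisManager:
    """Operates one instance of one analysis tool."""

    def __init__(self, tool: str, run_tool: Callable[[str, Dict[str, str]], str]):
        self.tool = tool
        self.run_tool = run_tool

    def process_next(self, queue: List[AnalysisJob]) -> Optional[AnalysisJob]:
        """Run the first waiting job for this manager's tool, if any."""
        for job in queue:
            if job.state == "new" and job.tool == self.tool:
                job.state = "in_progress"
                job.result = self.run_tool(job.dataset, job.parameters)
                job.state = "complete"
                return job
        return None


def stand_in_search(dataset: str, parameters: Dict[str, str]) -> str:
    """Placeholder for a real analysis tool; returns a made-up result name."""
    return f"{dataset}.search_results"


queue = [
    AnalysisJob("ds_001", "database_search", {"parameter_file": "default"}),
    AnalysisJob("ds_002", "mass_extraction"),
]
search_manager = AnalysisManager("database_search", stand_in_search)
finished = search_manager.process_next(queue)
```

Each manager only picks up jobs for the tool it operates, so jobs for other tools stay in the queue until a suitable manager asks for them.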
It is relatively easy to scale the processing throughput capability of DMS by adding more auxiliary processors with copies of the analysis managers and their associated analysis tools installed. It is also straightforward to re-task any auxiliary processor to relieve a backlog condition when a particular type of analysis job accumulates.
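Re-tasking processors to relieve a backlog could be expressed as in the Python sketch below. The policy shown (move any processor whose tool has no waiting jobs to the tool with the most waiting jobs) is an assumption for illustration, as are the processor and tool names.

```python
from collections import Counter
from typing import Dict, List


def retask(assignments: Dict[str, str], pending_tools: List[str]) -> Dict[str, str]:
    """Return new processor-to-tool assignments.

    `assignments` maps each processor to the tool it currently runs.
    `pending_tools` lists the tool required by each waiting job.
    Processors whose tool has no waiting jobs are moved to the tool
    with the largest backlog; all others keep their assignment.
    """
    backlog = Counter(pending_tools)
    if not backlog:
        return dict(assignments)
    busiest_tool, _ = backlog.most_common(1)[0]
    return {
        processor: (tool if backlog[tool] > 0 else busiest_tool)
        for processor, tool in assignments.items()
    }


current = {
    "node1": "database_search",
    "node2": "mass_extraction",
    "node3": "mass_extraction",
}
waiting = ["database_search"] * 5   # only search jobs are queued
print(retask(current, waiting))
```

With only search jobs waiting, both mass-extraction processors are reassigned to database searching until the backlog clears.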
Mass and Time Tag System
Using the intermediate analysis results and tracking information from DMS, the Mass and Time Tag System (MTS) produces and maintains compilations of Accurate Mass and Time (AMT) tags developed in the course of biological studies. Each compilation is contained in a separate mass tag database in the MTS, the contents of which are determined by selection criteria defined in terms of the tracking information that DMS associates with each analysis job.
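The idea of populating a mass tag database from selection criteria can be sketched in Python as follows. The record fields, the criteria format (attribute name mapped to a required value) and the sample records are assumptions for illustration and do not reflect the actual MTS schema.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple


@dataclass(frozen=True)
class JobRecord:
    """Tracking information for one analysis job (hypothetical fields)."""
    job_id: int
    organism: str
    tool: str
    peptides: Tuple[str, ...]


def build_mass_tag_db(jobs: Iterable[JobRecord],
                      criteria: Dict[str, str]) -> Dict[str, List[int]]:
    """Map each peptide to the jobs that observed it.

    Only jobs whose tracking information matches every criterion
    contribute to the result.
    """
    db: Dict[str, List[int]] = {}
    for job in jobs:
        if all(getattr(job, attr) == wanted for attr, wanted in criteria.items()):
            for peptide in job.peptides:
                db.setdefault(peptide, []).append(job.job_id)
    return db


# Made-up records for illustration.
jobs = [
    JobRecord(1, "organism_A", "database_search", ("pep1", "pep2")),
    JobRecord(2, "organism_B", "database_search", ("pep2", "pep3")),
    JobRecord(3, "organism_A", "database_search", ("pep2",)),
]
organism_a_db = build_mass_tag_db(jobs, {"organism": "organism_A"})
print(organism_a_db)
```

Changing the criteria and rebuilding yields a different compilation from the same intermediate results, which is why a separate database per set of criteria is a natural fit.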
Accessing the data
PRISM makes information available to the user via a Web interface that allows browsing of the metadata and links to the locations of the underlying data. This interface provides a common pattern of Web pages for each available DMS tracking entity, displayed as several different types of consolidating report pages. The Web pages also provide hyperlinks that follow the associations among tracking entities, allowing users to quickly move from a given entity to its child or parent.
Similarly, the Web user interface for the MTS provides a way for the researcher to select a mass tag database and view the information that it contains. It is highly hyperlinked to allow drill-down from summary views of the information into more detailed views of selected data items.
Because the tracking information is completely linked between MTS and DMS, it is possible to begin with a protein sequence or AMT tag and to easily drill back to the raw mass spectra that produced it. The researcher may also specify one or more sets of selection criteria based upon DMS tracking information and observe, for example, the effects upon mass tag statistics.
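Drilling back from an AMT tag to raw spectra amounts to following a chain of links, as in this minimal Python sketch. The three lookup tables, their contents and the file names are hypothetical stand-ins for the linked tracking information, not actual PRISM tables or paths.

```python
from typing import Dict, List

# Hypothetical link tables: tag -> analysis jobs -> dataset -> raw data file.
tag_to_jobs: Dict[str, List[int]] = {"pep2": [1, 3]}
job_to_dataset: Dict[int, str] = {1: "ds_001", 3: "ds_007"}
dataset_to_raw_file: Dict[str, str] = {
    "ds_001": "ds_001.raw",
    "ds_007": "ds_007.raw",
}


def raw_files_for_tag(tag: str) -> List[str]:
    """Follow the links from an AMT tag back to the raw files behind it."""
    files: List[str] = []
    for job_id in tag_to_jobs.get(tag, []):
        dataset = job_to_dataset[job_id]
        raw_file = dataset_to_raw_file[dataset]
        if raw_file not in files:   # a dataset may back several jobs
            files.append(raw_file)
    return files


print(raw_files_for_tag("pep2"))
```

An unknown tag simply returns an empty list here; how a real interface would report that case is not described in the article.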
A record of success
The DMS has been in operation since March 2000 with negligible unplanned downtime and no loss or corruption of data under its control. The tracking information makes locating and accessing both files and metadata straightforward and efficient, and users have found the Web interface convenient and comfortable. Many modifications, large and small, have been made to DMS, both as part of a planned evolution in capability and in response to requirements that changed as the research supported by the system grew more sophisticated. The DMS architecture has proven robust in the face of these modifications because it is modular and its components are loosely coupled, so that an operational change can be confined to a single component.
The MTS has been in operation since March 2002 and also has operated reliably. It has databases of AMT tags and their associated intermediate data for dozens of different organisms. In all, this body of knowledge includes several million peptide identifications from the tandem MS analyses and high mass measurement accuracy Fourier Transform Ion Cyclotron Resonance MS studies. The MTS also has rapid turn-around for the creation of new AMT tag databases. If the intermediate analysis results already exist in DMS, a new AMT tag database can be created quickly.
Recent achievements in proteomics research at PNNL demonstrate the power of the PRISM system. For example, PRISM was and remains essential to a project that revealed the most complete human blood-plasma proteome map to date. Identification of 4,000 distinctive proteins in human blood plasma is a critical step toward cataloging biological markers for early diagnosis of cancer and other diseases.
Gordon Anderson is chief engineer and director of the instrument development laboratory at DOE’s William R. Wiley Environmental Molecular Sciences Laboratory, an international scientific user facility located at PNNL and home to the National Proteomics Research Resource Center. He may be contacted at [email protected].