To support the widespread use of next-generation sequencing (NGS) instruments in clinical research for mutation detection, it is advantageous to simplify the bioinformatics and experimental requirements so that front-line researchers who are not experts in bioinformatics can perform experiments independently. Targeted sequencing, paired with desktop sequencing instruments and graphical interfaces for robust open-source bioinformatics algorithms, presents the opportunity for low-cost, low-infrastructure expansion of NGS to the larger research community, which currently still depends on Sanger sequencing.
In this example, a HaloPlex target enrichment protocol from Agilent Technologies Inc. (Santa Clara, Calif.) is used to capture custom-targeted genomic regions of interest for analysis using an Illumina Inc. (San Diego, Calif.) or Life Technologies Corp. (Carlsbad, Calif.) Ion Torrent sequencer. The wizard-guided SureCall software, which runs on Windows or Mac, handles every step from aligning raw data to mutation categorization and visualization. Crucially, the software also produces an audit trail of all steps.
Three custom panels to enrich the coding sequences of genes of interest were designed using the SureDesign application: one for cardiac disease, one for Noonan spectrum disorder and one for collagen tissue disease.
The first step is to enter target genes, select the options for a region of interest (ROI), for example exons plus flanking bases or UTRs, and specify which genomic databases to use. In this case, the ROI for the designs included all exons of the target genes plus 10 bp of flanking intronic sequence. The exact genomic regions can be reviewed in a design summary and viewed in a genome browser before starting probe design.
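The ROI definition used here (exons plus 10 bp of flanking intronic sequence) can be sketched in a few lines of Python. This is only an illustration of the interval arithmetic, not SureDesign's implementation; the function name `pad_and_merge`, the BED-style 0-based half-open coordinates and the exon coordinates are all assumptions made for the example.

```python
from typing import List, Tuple

# (chromosome, start, end) in 0-based, half-open coordinates (an assumption)
Interval = Tuple[str, int, int]


def pad_and_merge(exons: List[Interval], flank: int = 10) -> List[Interval]:
    """Extend each exon by `flank` bases on both sides, then merge overlaps."""
    padded = sorted(
        (chrom, max(0, start - flank), end + flank)
        for chrom, start, end in exons
    )
    merged: List[Interval] = []
    for chrom, start, end in padded:
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            # Overlaps the previous region on the same chromosome: extend it
            prev = merged[-1]
            merged[-1] = (chrom, prev[1], max(prev[2], end))
        else:
            merged.append((chrom, start, end))
    return merged


# Hypothetical exon coordinates, for illustration only
exons = [("chr3", 100, 200), ("chr3", 215, 300), ("chr3", 1000, 1100)]
roi = pad_and_merge(exons)
roi_size = sum(end - start for _, start, end in roi)
print(roi, roi_size)
```

Merging after padding matters because two exons separated by a short intron can produce overlapping padded regions, which would otherwise be counted twice in the ROI size.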
In the next step, HaloPlex probes are selected by SureDesign to achieve three- to four-fold redundant coverage of the ROI. The probes and a PDF report are generated within a few minutes. The report contains summary information on the design and the coverage for each target region. For all panels, the coverage was over 98%.
HaloPlex and sequencing
Genomic DNA (gDNA) was extracted from eight samples: two cardiac disease, three Noonan spectrum disorder and three collagen tissue disease. The HaloPlex target enrichment system was used to enrich for the genomic ROI. The HaloPlex system uses single-tube target amplification and removes the need for library preparation, reducing total sample processing time to only 8 hours without dedicated instrumentation. In this case, the samples were pooled (five samples) and run on an Illumina MiSeq personal sequencer, although use with an Ion Torrent sequencer is also supported.
Data were analyzed using Agilent SureCall software, which is provided at no cost to HaloPlex users and handles quality control (QC) metric calculation, assembly, visualization and contextualization of sequencing results. More than 20 reads covered 98.9% ± 0.4% of the analyzable target regions for the cardiac disease panel, 98.8% ± 0.3% for the Noonan spectrum disorder panel and 97.1% ± 0.4% for the collagen tissue disease panel.
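The coverage metric quoted above, the share of target bases covered by more than 20 reads, can be illustrated with a minimal Python sketch. The per-base depths and sample names below are invented, and summarizing across samples as mean ± sample standard deviation is an assumption; the article does not state how the ± values were derived or how SureCall computes its QC metrics.

```python
from statistics import mean, stdev
from typing import Sequence


def fraction_covered(depths: Sequence[int], min_reads: int = 20) -> float:
    """Fraction of target bases with read depth strictly above `min_reads`."""
    if not depths:
        raise ValueError("no target bases supplied")
    return sum(d > min_reads for d in depths) / len(depths)


# Hypothetical per-base read depths across a tiny target, one list per sample
samples = {
    "sample_a": [35, 50, 21, 20, 80, 64, 41, 33, 27, 90],
    "sample_b": [30, 45, 60, 25, 22, 51, 38, 70, 44, 29],
    "sample_c": [10, 19, 40, 55, 61, 23, 36, 48, 77, 31],
}

fractions = [fraction_covered(depths) for depths in samples.values()]
avg = mean(fractions)
sd = stdev(fractions)  # sample standard deviation (an assumption)
print(f"{avg:.1%} ± {sd:.1%} of target bases covered by >20 reads")
```

Note that a base with exactly 20 reads does not count as covered under the ">20 reads" criterion, which is why the comparison is strict.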
SureCall is built on the most widely accepted open-source libraries and algorithms. Analysis begins with raw reads from HiSeq, MiSeq or Ion Torrent sequencing of genomic DNA enriched with HaloPlex. After removal of the adaptor sequences, the reads are aligned to the genome using the Burrows-Wheeler Aligner (BWA) or Torrent Mapping Alignment Program (TMAP) software packages. SAMtools (Sequence Alignment/Map tools) utilities are used to recalibrate the base call quality scores, perform local realignment and index the reads for improved performance. SAMtools is also used to identify mutations from the local read pileup at each location and to assess the significance of the mutations. SureCall also contains a variant caller to detect low-frequency variants in cancer samples, which was not used in this case.
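To make the idea of calling a mutation from the local read pileup concrete, here is a deliberately simplified Python sketch. It is not the statistical model used by SAMtools or SureCall: it only counts bases at one position and applies two cut-offs. The function name `call_from_pileup` and the thresholds `min_depth` and `min_alt_fraction` are hypothetical values chosen for illustration.

```python
from collections import Counter
from typing import Iterable, Optional, Tuple

VALID_BASES = set("ACGT")


def call_from_pileup(
    ref_base: str,
    pileup_bases: Iterable[str],
    min_depth: int = 20,            # hypothetical depth cut-off
    min_alt_fraction: float = 0.2,  # hypothetical allele-fraction cut-off
) -> Optional[Tuple[str, float]]:
    """Return (alt_base, alt_fraction) if a non-reference base passes both
    thresholds at this position, otherwise None."""
    ref = ref_base.upper()
    bases = [b.upper() for b in pileup_bases if b.upper() in VALID_BASES]
    depth = len(bases)
    if depth < min_depth:
        return None  # too few reads to make a call
    alt_counts = Counter(b for b in bases if b != ref)
    if not alt_counts:
        return None  # every read matches the reference
    alt, count = alt_counts.most_common(1)[0]
    fraction = count / depth
    return (alt, fraction) if fraction >= min_alt_fraction else None


# Hypothetical pileups at a single position with reference base C
heterozygous = call_from_pileup("C", "C" * 14 + "T" * 16)
likely_error = call_from_pileup("C", "C" * 29 + "T")
low_depth = call_from_pileup("C", "CT" * 5)
print(heterozygous, likely_error, low_depth)
```

A real caller additionally weighs base and mapping qualities and reports a significance measure, which is the role the article assigns to SAMtools.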
Each mutation is evaluated based on its location, amino acid change, predicted effect on protein function (SIFT) and impact on structure and function of the protein using the Polymorphism Phenotyping v2 (PolyPhen-2) tool. Further information regarding the mutation is then aggregated from various public sources, including the National Center for Biotechnology Information (NCBI), Catalogue of Somatic Mutations in Cancer (COSMIC), PubMed and locus-specific databases. After collecting the various inputs for classification, the proprietary mutation classifier evaluates the significance of the mutation following default or customized guidelines.
Each mutation is then categorized, with the user triaging each mutation and reviewing supporting evidence in the built-in viewer, including raw data and confidence measures, as well as links to external databases such as Online Mendelian Inheritance in Man (OMIM) and the NCBI’s Database of Genomic Structural Variation (dbVar). SureCall also identifies additional information, including the chromosomal location of the mutation with Human Genome Variation Society (HGVS) nomenclature. In addition, researchers can include notes associated with the mutation or the sample. Figure 1 shows the Mutation Report for one of the collagen tissue disease samples. The most important variant found in this sample was a category II mutation in the TGFBR2 gene at position 30,732,945. The other two variants are present in the Single Nucleotide Polymorphism Database (dbSNP), and their rsID numbers are included in the Mutation Report.
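For readers unfamiliar with HGVS nomenclature, the following Python sketch formats a single-nucleotide substitution at the genomic level in the form `reference:g.positionREF>ALT`. It covers substitutions only and is not how SureCall generates its annotations. The helper name, the reference accession, the position and the alleles in the example are placeholders and do not describe the TGFBR2 variant reported above.

```python
VALID_BASES = set("ACGT")


def hgvs_genomic_substitution(
    reference: str, position: int, ref_base: str, alt_base: str
) -> str:
    """Format a single-base substitution as an HGVS genomic (g.) description.

    `position` is 1-based, as HGVS genomic coordinates are.
    """
    ref_base, alt_base = ref_base.upper(), alt_base.upper()
    if position < 1:
        raise ValueError("HGVS positions are 1-based")
    if ref_base not in VALID_BASES or alt_base not in VALID_BASES:
        raise ValueError("bases must be one of A, C, G, T")
    if ref_base == alt_base:
        raise ValueError("reference and alternate base are identical")
    return f"{reference}:g.{position}{ref_base}>{alt_base}"


# Placeholder reference accession, position and alleles (illustration only)
description = hgvs_genomic_substitution("NC_000003.11", 12345, "c", "T")
print(description)
```

Insertions, deletions and coding-level (c.) or protein-level (p.) descriptions follow additional HGVS rules that this sketch does not attempt to cover.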
The results described here demonstrate the high performance of the HaloPlex assay in capturing an ROI. Data produced by the HaloPlex assay can be easily analyzed in SureCall software without the need for an extensive bioinformatics infrastructure.
Anniek De Witte is the Sr. Product Manager for CGH and NGS software at Agilent Technologies. She was previously an R&D scientist for Agilent’s microarray-based CGH platform. Prior to Agilent, Anniek worked at Roche and at the Stanford Human Genome Center, where she contributed to the Human Genome Project. Anniek holds a master of science degree in molecular biology from the University of Brussels.