FastID and TachysSTR together won a 2018 R&D 100 Award. All of the R&D 100 Awardees were announced at the R&D 100 Awards Gala held in Orlando, Florida, on Nov. 16, 2018.
In the United States, most forensic DNA analysis is done by comparing short tandem repeat (STR) patterns—regions of DNA where a small string of nucleotides is repeated multiple times—at specific locations (loci) on chromosomes. On Jan. 1, 2017, the FBI increased the number of core loci used for STR comparisons in the Combined DNA Index System (CODIS) from 13 to 20, to strengthen the certainty of any potential matches.
Between December 2016 and December 2018, over 1.9 million profiles were added to CODIS’s national database, bringing the total number from just over 16 million to nearly 18 million in two years. With two alleles at each locus to be compared to the alleles of each sample in the index, the number of these comparisons necessary to test one sample is immense.
This nationwide forensic DNA system may one day incorporate even more precise comparisons based on single nucleotide polymorphisms (SNPs), which are variations in individual nucleotides at specific genome positions. Because forensic SNP analysis involves tens of thousands of loci, investigators will need more time and computer power to process this data and put criminals behind bars, or give names back to the unidentified deceased.
One researcher, from MIT Lincoln Laboratory, a Department of Defense Federally Funded Research and Development Center (FFRDC), is seeking to simplify and speed up forensic DNA comparisons bit by bit—literally. Darrell Ricke, Ph.D., is leading research at the lab that would reduce the time and computer power needed for DNA analysis, by encoding the millions of loci and alleles into units of computer data that can then be compared using basic logical commands. Two algorithms developed through Ricke’s research to make these comparisons—FastID and TachysSTR—earned Lincoln Lab one of its 10 R&D 100 Awards this past November.
This computer-logic-based approach to DNA comparisons could also be used to help analyze complex DNA mixtures of as many as a dozen contributors, Ricke told R&D Magazine in an interview. In tests so far, FastID has managed to run comparisons at a rate of 5.08 seconds per 2,500 SNP loci against a database of 20 million profiles, on a 64-bit laptop with a 4 GHz Intel Core i7 processor. Additionally, TachysSTR has made accurate identifications from 12-person DNA mixtures in lab tests, with calculations taking fractions of a second to complete, according to Ricke.
Scanning DNA like a ‘barcode’
By encoding DNA profiles into binary code—simple bits represented by 0s and 1s—analysts can create something similar to a barcode containing the necessary information to make comparisons, Ricke explained.
“If I’m doing 1,000 SNPs, I still have two alleles (per locus), and then just to compare with two people, I have to do that whole two-way comparison matchup 1,000 times,” Ricke said. “But, if I take and I encode all that data for both people as a barcode … Imagine just taking two barcodes and seeing if they match or not.”
This type of matching requires fewer hardware instructions to complete for each sample, and using FastID, one can compare 63 SNPs in just one hardware instruction on a common 64-bit computer, Ricke added. As explained in the paper “FastID: Extremely Fast Forensic DNA Comparisons,” alleles are first translated into 0s and 1s based on the combinations of major and minor alleles: two major alleles (MM), the most common combination, are encoded as 0, while a combination of a major and minor allele (Mm), or two minor alleles (mm), is encoded as 1.
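The encoding described above can be illustrated with a minimal Python sketch. It is an illustration only, not the FastID implementation: the function name `encode_profile` and the genotype labels "MM", "Mm" and "mm" are assumptions made for this example, and a Python integer stands in for the compact bit array.

```python
def encode_profile(genotypes):
    """Pack genotype labels into one integer used as a bit array.

    Hypothetical helper for illustration. Bit i is 0 when locus i is
    "MM" (two major alleles) and 1 when it is "Mm" or "mm" (at least
    one minor allele), following the scheme described in the article.
    """
    bits = 0
    for position, genotype in enumerate(genotypes):
        if genotype not in ("MM", "Mm", "mm"):
            raise ValueError(f"unknown genotype label: {genotype!r}")
        if genotype != "MM":
            bits |= 1 << position  # set the bit for this locus
    return bits


# Four loci; only the second and third carry a minor allele.
profile = encode_profile(["MM", "Mm", "mm", "MM"])
print(format(profile, "04b"))
```

Python integers have no fixed width, so this sketch does not reproduce the 64-bit word layout the article mentions; it only shows how one locus maps to one bit.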
The resulting binary code that represents a SNP locus is then mapped into a compact bit array—the barcode in Ricke’s analogy—and compared to the bit arrays of other samples’ SNPs using the logical “exclusive or” (XOR) operation. The output of this operation will reveal if any of the bits (allele pairs) of the SNPs don’t match up, and a “population count” operation will tally up how many differences there are between the loci. A likelihood ratio of whether the two samples came from the same person can then be calculated.
“If they’re identical, I get a (population count) result of zero. And anywhere they’re different, I’ll get a different marker between these two,” said Ricke. “What that allows me to do is take these barcodes and see if they’re identical or different, but instead of doing all this checking back and forth and keeping track of stuff, I can do it all at once.”
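The XOR and population count steps can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the FastID code: the function name `count_differences` is hypothetical, profiles are plain integers with one bit per SNP locus, and the likelihood ratio calculation is left out.

```python
def count_differences(profile_a, profile_b):
    """Count loci whose encoded bits differ between two profiles.

    XOR leaves a 1 wherever the two bit arrays disagree; counting the
    1s is the "population count" step.
    """
    differing_bits = profile_a ^ profile_b
    return bin(differing_bits).count("1")


# Identical "barcodes" give zero; each mismatched locus adds one.
print(count_differences(0b0110, 0b0110))
print(count_differences(0b0110, 0b0011))
```

On hardware, both steps map to single instructions over a whole machine word, which is where the speedup described in the article comes from.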
Familial, mixture analysis simplified
In addition to DNA SNP comparisons (with FastID) and STR comparisons (with TachysSTR), Ricke’s algorithms are also designed to aid in DNA mixture analysis, and to potentially determine kinship. TachysSTR, which takes its name from the Greek word “tachys,” meaning rapid, can determine first-degree familial relationships between parents, children and siblings, and FastID can detect more distant genetic relationships, for example, between grandparents, grandchildren and cousins.
Familial searching and forensic genealogy have been involved in many investigative breakthroughs in recent months, including the apprehension of the suspected Golden State Killer Joseph DeAngelo in April 2018, and the identification of murder victim Marcia King, who was known only as the “Buckskin Girl” for over 35 years. Ricke mentioned another potential application of rapid kinship comparisons: human trafficking prevention at U.S. borders.
He described being told of a case in which a child gave border agents the name of a relative, but when agents recognized the name and looked back through their paperwork, they realized the name had been given before, prompting further investigation.
“It was a case of human trafficking,” Ricke said. “The kids were being told (…) go say that this is your relative. So when we get an unaccompanied minor, if the kid says ‘this is my relative,’ they will put the kid with that relative. It’s an easy way to have the child be trafficked, if you don’t actually have a way of verifying that they’re related. You can do extended kinship, so you can actually verify the relationship before you place the child.”
The third major function of the FastID and TachysSTR algorithms is mixture comparison, which is completed by adding one more logical operation to the process. By first running an XOR operation between the reference sample and the mixture, and then a logical AND operation between the XOR output and the reference sample, one can then use a population count to determine the number of minor alleles that are present in the reference sample but not in the mixture. A low number of these unique alleles represents a strong likelihood that the person is present in the mixture, while a high number shows they are much less likely to be a part of the mixture.
“We can do that mixture analysis at the speeds I’m talking about. Identity comparisons and the mixture analysis—they both operate at this high speed,” Ricke said. “And I can do it for complex mixtures of large numbers of people in the mixture.”
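The XOR-then-AND sequence for mixtures can be sketched as follows. This is an illustrative example only: the function name `unexplained_minor_alleles` is hypothetical, and it assumes the mixture is encoded like a single profile, with a bit set wherever a minor allele was observed at that locus.

```python
def unexplained_minor_alleles(reference, mixture):
    """Count minor alleles in the reference that the mixture lacks.

    Step 1: XOR marks every locus where reference and mixture differ.
    Step 2: AND with the reference keeps only the differences where
            the reference has the minor-allele bit set.
    Step 3: the population count tallies those loci.
    """
    differing = reference ^ mixture
    only_in_reference = differing & reference
    return bin(only_in_reference).count("1")


# Every minor allele of this reference also appears in the mixture.
print(unexplained_minor_alleles(0b0110, 0b1110))
# Two of this reference's minor alleles are missing from the mixture.
print(unexplained_minor_alleles(0b0111, 0b1100))
```

A result near zero is consistent with the person being a contributor, while a large count points the other way. The statistical step that turns the count into a probability is not shown here.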
Another paper authored by Ricke and other Lincoln Lab scientists presents the results of a test in which FastID was used to accurately identify nine individuals out of a 10-person mixture. Ricke also told R&D Magazine that in two tests involving 12-person mixtures, TachysSTR successfully identified 11 individuals in one test and 12 in the other.
While the much larger number of loci involved in SNP comparison, compared with STR comparison, still poses a challenge given the limits of computer power, Ricke and his colleagues have also worked on developing more efficient formulas for calculating statistics in SNP comparison situations. The researchers report that their proposed Fast P(RMNE) formula, a version of the “random man not excluded,” or P(RMNE), calculation currently used for these comparisons, has so far been shown to perform faster than other methods tested.
“It’s 1/1000 of a second to compute the statistics for the analysis results: Is this person in the mixture or not?” Ricke said. “The fundamentals were all based on (a previous paper by L. Voskoboinik and A. Darvasi) … We’ve extended that to allow for instrument errors and other types of things that can go wrong in real life.”
Ricke hopes his research could one day help solve cold cases, including those involving complex mixtures that technology has not yet been able to untangle.
“There’s a lot of them that have never been processed,” he said. “I think there’s a lot that can be (done) in these archival cases, these cold cases that haven’t been solved.”