The Exploitable Genomics of Cancer: Earlier Cancer Detection Part II
Today, novel approaches from biochemistry, sequencing hardware, and artificial intelligence (AI) are converging and transforming oncology. Soon, with a simple non-invasive blood draw, clinicians may be able to detect multiple forms of cancer in their early stages. This article highlights how cancer surfaces and evolves in the human body and how different technologies are detecting the signals of cancer. Like our last article in the ‘Earlier Detection of Cancer’ series, we aim to open-source our models and incorporate community feedback into a more comprehensive whitepaper.
Cancer Is a Disease of the Genome
The precipitous cost decline of next-generation DNA sequencing (NGS) has supercharged our understanding of tumor biology. Solid tumors progress through a process known as somatic evolution, in which genetic mutations accumulate in cells over time. Unlike hereditary mutations that reside in virtually every cell of the body, somatic mutations occur spontaneously and can give rise to tumors. Dangerous somatic mutations, while rare, occur because of errors during cell division, mistakes during DNA repair, or exposure to carcinogens.
Solid tumors are tightly packed populations of genetically altered cells. All populations, from dogs to cats to cancer cells, are subject to the laws of natural selection. Certain somatic mutations, called oncogenic mutations, can give cancer cells an evolutionary advantage that causes them to outcompete healthy cells. Once mutated, genes responsible for regulating cell division can cause rapid, uncontrolled growth in a population of cancer cells.
Fortunately, innovations involving NGS, bioinformatics, synthetic biology, and data processing have broadened our collective understanding of cancer. Newer genomic domains like single-cell sequencing, long-read sequencing, optical mapping, and digital spatial profiling also are enhancing our understanding of tumor biology. ARK believes that the foundation of knowledge built during the past two decades will enable clinicians to detect cancer earlier and treat it more successfully.
Cell-Free DNA Is Cancer’s Achilles’ Heel, But Presents a Challenge
The strange biology that makes cancer dangerous also gives scientists the opportunity to detect it earlier. As cancer cells grow, divide, and die, they release fragments of genetic material, called circulating tumor DNA (ctDNA), into the bloodstream. Some ctDNA fragments contain the genetic changes responsible for or associated with the tumor. Unfortunately, healthy cells also release DNA, called cell-free DNA (cfDNA), into the bloodstream. Detecting the rare mutant ctDNA in a sea of healthy cfDNA from a blood test is the central technical challenge of early cancer detection: finding the signal through the noise.
The ratio of cancer-related ctDNA to background cfDNA is called the variant allele fraction (VAF) or tumor fraction, and it quantifies the ‘needle-in-a-haystack’ problem associated with early cancer detection. VAF generally correlates with the size of a tumor in the body, so small, early-stage tumors tend to shed the faintest signals. For some early-stage tumors, the VAF can be as low as 1:10,000 (0.01%), as illustrated below.
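To make the ratio concrete, here is a minimal Python sketch of the arithmetic. The function names and the figure of 5,000 fragments per locus are hypothetical, chosen only for illustration, and the sketch assumes VAF is computed against all cfDNA fragments covering a locus.

```python
def variant_allele_fraction(mutant_fragments: int, total_fragments: int) -> float:
    """Share of cfDNA fragments at a locus that carry the tumor variant."""
    if total_fragments <= 0:
        raise ValueError("total_fragments must be positive")
    if not 0 <= mutant_fragments <= total_fragments:
        raise ValueError("mutant_fragments must lie between 0 and total_fragments")
    return mutant_fragments / total_fragments


def expected_mutant_fragments(fragments_at_locus: int, vaf: float) -> float:
    """Average number of mutant fragments expected among those sampled."""
    return fragments_at_locus * vaf


# The 1:10,000 case from the text.
vaf = variant_allele_fraction(1, 10_000)
print(f"VAF = {vaf:.4%}")

# Hypothetical assumption: a blood draw yields 5,000 fragments covering the locus.
print(expected_mutant_fragments(5_000, vaf))
```

Under that assumed yield, the sample would contain on average fewer than one mutant fragment at the locus, which is why a single site can easily miss an early tumor.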
In the next section, we describe how liquid biopsies seem to be overcoming the central VAF challenge. Importantly, exponential improvements in technology platforms like NGS, artificial intelligence (AI), and synthetic biology are enabling the high performance of liquid biopsies and their commercialization for cancer screening.
I. Somatic DNA Mutations
Somatic DNA mutations in oncogenes—genes that have the potential to cause cancer when mutated—seem to provide the most confident signals that a tumor is forming. While the list is growing, only around 1-2% of human genes seem to have oncogenic potential. Still, detecting somatic mutations at such low VAFs has been both error-prone and cost-prohibitive.
During the sample preparation process, lab technicians often use a technique called a polymerase chain reaction (PCR) to make multiple copies of oncogenes for analysis. While PCR amplifies the signals of somatic mutations, it is not perfect. Each cycle can introduce mistakes that look like actual mutations. Uncontrolled, these PCR artifacts can result in false positives. Companies such as Guardant Health (GH), Invitae (NVTA), Natera (NTRA), Exact Sciences (EXAS), and many others are overcoming this challenge with synthetic biology and AI.
Synthetic biology (syn-bio) is an emerging field of biological manufacturing in which costs are declining and units are scaling rapidly. Syn-bio enables companies focused on liquid biopsy to correct PCR errors using carefully designed molecular barcodes, as illustrated below. These barcodes attach to and follow the original DNA molecules throughout the sequencing process. By comparing fragments with the same barcode, bioinformaticians can distinguish true mutations from spontaneous PCR artifacts. Unfortunately, this style of error correction requires substantial redundant sequencing, which can increase the sequencing workload of liquid biopsies dramatically.
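The barcode comparison described above can be sketched as a per-position majority vote within each barcode family. This is a simplified illustration, not any company's production pipeline: the function name, thresholds, and toy reads are hypothetical, and reads sharing a barcode are assumed to be aligned and of equal length.

```python
from collections import Counter, defaultdict


def consensus_by_barcode(reads, min_family_size=3, min_agreement=0.6):
    """Collapse reads that share a barcode into one consensus sequence.

    reads: iterable of (barcode, sequence) pairs.
    Positions where no base reaches min_agreement are masked as 'N'.
    """
    families = defaultdict(list)
    for barcode, sequence in reads:
        families[barcode].append(sequence)

    consensus = {}
    for barcode, sequences in families.items():
        if len(sequences) < min_family_size:
            continue  # too few copies to separate artifacts from real variants
        bases = []
        for column in zip(*sequences):
            base, count = Counter(column).most_common(1)[0]
            bases.append(base if count / len(sequences) >= min_agreement else "N")
        consensus[barcode] = "".join(bases)
    return consensus


reads = [
    ("AAGT", "ACGTAC"),
    ("AAGT", "ACGTAC"),
    ("AAGT", "ACTTAC"),  # one copy carries a PCR artifact (G>T)
    ("CCTA", "ACATAC"),  # every copy carries the variant (G>A)
    ("CCTA", "ACATAC"),
    ("CCTA", "ACATAC"),
]
result = consensus_by_barcode(reads)
print(result)
```

The artifact appears in only one of three copies and is voted out, while the variant present in every copy of the second family survives. The sketch also shows the cost: six reads collapse into two usable molecules.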
Detecting rare variants with confidence requires extremely deep sequencing. We measure the depth of sequencing using coverage, which in rough terms is the number of times a sequencer reads a given DNA base. For reference, researchers typically sequence whole human genomes at 30X average coverage. The cost of a sequencing run scales with coverage. Importantly, the relationship between VAF and required coverage is not linear: as VAF falls, the coverage needed to observe a mutation reliably climbs steeply. Error correction can compound already high sequencing costs by a factor of five to ten. Thanks to the combination of rapid cost declines in NGS and scalable, highly efficient syn-bio solutions, however, error detection and correction on ultra-low-VAF mutations have become much more economical.
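One way to see why low VAFs demand deep sequencing is a simple sampling model. The sketch below assumes each read at a position independently carries the mutation with probability equal to the VAF (a binomial model) and ignores sequencing error. The threshold of five supporting reads and the 95% target are hypothetical choices for illustration.

```python
from math import comb


def detection_probability(depth: int, vaf: float, min_mutant_reads: int = 5) -> float:
    """P(at least min_mutant_reads mutant reads) under Binomial(depth, vaf)."""
    p_fewer = sum(
        comb(depth, i) * vaf**i * (1 - vaf) ** (depth - i)
        for i in range(min_mutant_reads)
    )
    return 1.0 - p_fewer


def required_depth(vaf: float, min_mutant_reads: int = 5,
                   target_probability: float = 0.95) -> int:
    """Smallest depth at which the detection probability reaches the target."""
    if not 0 < vaf < 1 or not 0 < target_probability < 1:
        raise ValueError("vaf and target_probability must be strictly between 0 and 1")
    lo, hi = min_mutant_reads - 1, min_mutant_reads  # depth lo can never succeed
    while detection_probability(hi, vaf, min_mutant_reads) < target_probability:
        lo, hi = hi, hi * 2
    while hi - lo > 1:  # binary search between a failing and a passing depth
        mid = (lo + hi) // 2
        if detection_probability(mid, vaf, min_mutant_reads) >= target_probability:
            hi = mid
        else:
            lo = mid
    return hi


for vaf in (0.01, 0.001, 0.0001):
    print(f"VAF {vaf:.2%}: depth needed = {required_depth(vaf):,}")
```

In this idealized model, each tenfold drop in VAF raises the required depth roughly tenfold, and at 30X coverage a 0.01% variant is essentially never seen five times. Real assays also contend with sequencing and PCR errors, which this sketch leaves out.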
II. Methylation and Machine Learning (ML)
While somatic DNA mutations often are regarded as the most specific indicators of cancer, the large-scale application of AI on patient blood samples can provide alternative signals. Recent studies by GRAIL and others, for example, have used machine learning to surface DNA methylation and detect cancer earlier than DNA mutations alone.
DNA methylation refers to chemical modifications on the outside of DNA, as illustrated below. If a cell’s genome were a book, somatic mutations would be akin to spelling errors while methylation would present words with the correct spelling but in bold or italics. Methylation can change how a cell reads and reacts to or expresses its DNA instructions. In the human genome, methylation impacts only one of the four DNA bases, cytosine (C), and aggregates in roughly 28 million CpG sites, the locations where methylation predominantly is found.
GRAIL has deployed unbiased machine learning (ML) on thousands of patient blood samples to refine a set of 100,000 CpG sites associated with various cancers. Because we don’t understand exactly how methylation causes or accelerates cancer, its relationship to tumors is more correlative than causative. As a result, the predictive power of a single methylated site is weaker than that of a carcinogenic somatic DNA mutation. Together, however, the aggregate methylation signature spread across thousands of sites is a powerful predictor of the presence of cancer.
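The idea that many weak sites add up to a strong signal can be illustrated with a toy log-likelihood-ratio sum. This is not GRAIL's model: it is a naive sketch that assumes sites are independent and that every site is methylated with a hypothetical probability of 0.6 in tumor-derived DNA versus 0.5 in healthy DNA.

```python
from math import log


def site_log_likelihood_ratio(methylated: bool, p_tumor: float, p_healthy: float) -> float:
    """Evidence for 'tumor' over 'healthy' contributed by a single CpG site."""
    if methylated:
        return log(p_tumor / p_healthy)
    return log((1 - p_tumor) / (1 - p_healthy))


def aggregate_score(observations, p_tumor: float, p_healthy: float) -> float:
    """Sum of per-site evidence, valid only if sites are treated as independent."""
    return sum(site_log_likelihood_ratio(m, p_tumor, p_healthy) for m in observations)


# Hypothetical sample: 600 of 1,000 informative sites are methylated.
observations = [True] * 600 + [False] * 400

single_site = site_log_likelihood_ratio(True, p_tumor=0.6, p_healthy=0.5)
all_sites = aggregate_score(observations, p_tumor=0.6, p_healthy=0.5)
print(f"one methylated site: {single_site:.2f}")
print(f"1,000 sites combined: {all_sites:.2f}")
```

A single site barely moves the score, while the combined score is more than a hundred times larger. Real classifiers must also handle correlated sites and mixed tumor and healthy fragments, which this sketch ignores.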
Our research suggests that machine learning has lowered the cost of methylation-based liquid biopsies dramatically, pushing them that much closer to commercialization. Thanks to the plethora of methylation sites, extremely high coverage is not necessary to detect abnormally methylated tumor DNA. Because neural networks improve with more data, bigger models, and more computation, the accuracy and cost-effectiveness of liquid biopsies that include methylation are likely to increase over time.
While a boon to earlier cancer detection, DNA methylation has been difficult to detect from blood samples in the absence of synthetic biology. Unlike standard DNA sequencing, methylation sequencing involves additional sample preparation. Lab technicians often use bisulfite treatment or enzymes to convert unmethylated cytosine (C) bases to uracil (U), which sequencers read as thymine (T), while leaving methylated cytosine (C) bases intact. Then, downstream algorithms can identify the cytosines (Cs) that were methylated in the original sample. This process adds complexity and can prevent sequencers from focusing their reads on the correct areas of the genome, as illustrated below. So-called ‘off-target’ sequencing contributes nothing to test results, resulting only in higher operating costs and lost time.
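The conversion and the downstream call can be mimicked in a few lines of Python. This is an in-silico sketch only: it assumes complete conversion, a single strand, and a perfectly aligned error-free read, and the function names and toy sequence are hypothetical.

```python
def bisulfite_convert(sequence: str, methylated_positions: set) -> str:
    """Simulate the read-out: unmethylated C appears as T, methylated C stays C."""
    return "".join(
        "T" if base == "C" and i not in methylated_positions else base
        for i, base in enumerate(sequence)
    )


def call_methylation(reference: str, converted_read: str) -> dict:
    """For each reference cytosine, report whether the read retained it."""
    calls = {}
    for i, (ref_base, read_base) in enumerate(zip(reference, converted_read)):
        if ref_base != "C":
            continue
        if read_base == "C":
            calls[i] = True   # protected from conversion, so methylated
        elif read_base == "T":
            calls[i] = False  # converted, so unmethylated
    return calls


reference = "ACGTCGAC"
read = bisulfite_convert(reference, methylated_positions={1, 4})
calls = call_methylation(reference, read)
print(read)
print(calls)
```

Because most cytosines convert to T, converted reads contain fewer distinct bases than the reference, which is one reason aligning and targeting them is harder than with standard sequencing.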
A leader in synthetic biology, Twist Bioscience (TWST) has leveraged its manufacturing and data science expertise to tackle this challenge. Twist engineered methylation-specific capture probes, small molecules that target and bind to specific genomic regions, to help focus sequencing on the right areas, as illustrated above. Twist’s highly uniform capture probes have enabled GRAIL’s early detection technology. Because its products are low-cost, highly accurate, and customizable, we believe Twist has the potential to democratize access to methylation research, potentially catalyzing the use of liquid biopsies for cancer screening. Additionally, Base Genomics, a private firm recently acquired by Exact Sciences (EXAS), has developed a less aggressive sample prep approach that preserves epigenomic information and detects biomarkers like methylation accurately and cost-effectively.
III. Machine Learning Enables Multi-Omic Models
The list of cancer signals does not stop with somatic DNA mutations and methylation. Multi-omics means combining multiple ‘omics’ datasets such as fragmentomics (how cfDNA breaks apart), proteomics (what proteins circulate in the bloodstream), and transcriptomics (how gene expression changes), as outlined below. Several research groups also have nominated new ‘flavors’ of methylation like 5hmC or short snippets of non-natural DNA called neomers as markers of cancer.
Companies like Freenome are pioneering multi-analyte or multi-omic machine learning models that incorporate several of these signals. Importantly, many ‘omics’ signals are orthogonal, or additive, creating cancer signals that dwarf biological noise. We believe a critical enabler of multi-omics has been the sharp cost declines in AI model training.
Conclusion
We believe that thanks to the convergence of once disparate technology platforms like next-generation DNA sequencing (NGS), artificial intelligence (AI), and synthetic biology (syn-bio), test-makers are beginning to detect the presence of early-stage cancer with non-invasive blood tests. The lines between and among these technologies have dissolved, encouraging if not demanding innovation and collaboration across these disciplines. Accordingly, incumbents who do not invest aggressively in R&D are likely to lose their footing in the new world.
The exponential cost declines associated with NGS, AI, and syn-bio are enabling liquid biopsies for the detection of cancer. Wright’s Law, a relative of Moore’s Law that is a function of cumulative units produced instead of time, is the template for these cost declines: for every cumulative doubling of units, costs fall by a consistent percentage. According to Wright’s Law, unit scaling should continue to lower costs, boosting the viability and accuracy of liquid biopsies.
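For readers who want to experiment with Wright’s Law, here is a minimal Python sketch of its standard form, in which cost falls by a fixed percentage with every cumulative doubling of units. The starting cost of 100 and the 20% learning rate are hypothetical placeholders, not estimates for sequencing or any other technology.

```python
from math import log2


def wrights_law_cost(first_unit_cost: float, cumulative_units: float,
                     learning_rate: float) -> float:
    """Unit cost after cumulative_units have been produced.

    learning_rate is the fractional cost decline per cumulative doubling,
    e.g. 0.2 means each doubling cuts cost by 20%.
    """
    if cumulative_units < 1:
        raise ValueError("cumulative_units must be at least 1")
    if not 0 <= learning_rate < 1:
        raise ValueError("learning_rate must be in [0, 1)")
    exponent = log2(1 - learning_rate)  # negative whenever learning_rate > 0
    return first_unit_cost * cumulative_units ** exponent


# Hypothetical inputs: first unit costs 100, 20% decline per doubling.
for units in (1, 2, 4, 8):
    print(units, round(wrights_law_cost(100.0, units, 0.2), 2))
```

Each doubling multiplies cost by 0.8 under these assumptions, so the decline depends on how quickly cumulative volume grows rather than on how much time passes.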