The HiFi difference – Getting the right answer
The combination of accuracy and length in PacBio HiFi reads is used by an ever-increasing number of researchers, with numerous new publications and preprints appearing every week. Several months ago, Illumina announced an attempt to imitate HiFi reads by synthetically creating longer sequence information from short reads, which they named Infinity reads. This is a response to the broad recognition of the value of HiFi long-read sequencing in human genomics. In the past week alone, studies using HiFi reads have described the improved resolution of complex genomic regions for LD studies, greater sensitivity to disease-causing variants in a kidney disease, a more informed genetic determination of reference ranges for a disease biomarker, improved detection of structural variation in breast cancer samples, the characterization of inversions in the human genome, and the single-molecule architecture of human telomeric DNA and chromatin. We are grateful to the scientific community for using native long-read PacBio HiFi sequencing for an increasing number of human genomes, now also allowing the simultaneous decoding of the epigenome.
Earlier this week, Dr. Gary Schroth from Illumina presented a webinar with several examples that compared Illumina Infinity reads, standard Illumina reads, and PacBio HiFi reads. In fact, it was a powerful demonstration of the superiority of HiFi sequencing data quality, and of the many significant errors and artifacts introduced by Infinity synthetic long reads.
Dr. Schroth presented several regions of the human genome in the form of IGV screenshots, comparing regular Illumina sequencing with Infinity synthetic long read data, and using PacBio HiFi sequencing data as the on-market gold standard and ground-truth control.
Let’s take a closer look at two of the examples presented. The first comprises 23 kb on chromosome 15 containing the STRC gene (as a side note, all the presented examples strikingly illustrate the poor performance of standard Illumina short-read sequencing in these regions):
Zooming in on a section near the center and comparing Infinity synthetic long reads to PacBio native long HiFi reads shows several errors in the Infinity data:
1. A false positive SNV
2. A homozygous SNV that is erroneously called as heterozygous
3. Incorrect phasing of three heterozygous variants. Of the 14 Infinity read rows depicted, 10 are uninformative because the Infinity reads are not long enough to span the region. Of the four remaining rows with reads that span this region, three give the wrong answer. In contrast, all 11 PacBio HiFi reads are sufficiently long to be informative, and all 11 provide the correct phasing information of the two alleles, clearly showing the first two variants in cis and the third in trans.
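The read counts in the third point can be tallied in a few lines. The sketch below is only an illustration of that arithmetic: the function name `phasing_summary` and its field names are hypothetical, and the count of one correct Infinity row is inferred from "three of four give the wrong answer".

```python
from fractions import Fraction


def phasing_summary(total_rows, spanning_rows, correct_rows):
    """Summarize how many read rows are informative and correct for phasing.

    Hypothetical helper for illustration; the counts come from the text above.
    """
    if total_rows <= 0 or not 0 <= correct_rows <= spanning_rows <= total_rows:
        raise ValueError("expected 0 <= correct <= spanning <= total, total > 0")
    return {
        # Rows long enough to span all three heterozygous variants.
        "informative": Fraction(spanning_rows, total_rows),
        # Share of spanning rows that show the correct phase.
        "correct_of_informative": (
            Fraction(correct_rows, spanning_rows) if spanning_rows else None
        ),
        # Share of all depicted rows that show the correct phase.
        "correct_of_all": Fraction(correct_rows, total_rows),
    }


# Infinity: 14 rows, 4 spanning, 3 of those wrong, so 1 correct (inferred).
infinity = phasing_summary(total_rows=14, spanning_rows=4, correct_rows=1)
# HiFi: all 11 reads span the region and all 11 are correct.
hifi = phasing_summary(total_rows=11, spanning_rows=11, correct_rows=11)

for name, summary in (("Infinity", infinity), ("HiFi", hifi)):
    print(name, {key: float(value) for key, value in summary.items()})
```

Under these counts, only 1 of the 14 Infinity rows both spans the region and phases it correctly, compared with 11 of 11 HiFi reads.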
In a second example, Dr. Schroth presented a 145 kb region on chromosome 15, containing the CHRNA7 gene. In the region depicted, two significant drops in coverage can be seen in the Infinity data, precluding confident variant calling and phasing in these regions. The two regions are actually covered in standard Illumina sequencing, indicating that new gaps in coverage appear to be introduced through synthetic long reads. In the region highlighted in the presentation on the right in orange, a coverage “cliff” can be seen in the Infinity data, indicating data quality issues. In contrast, as for all the examples presented, PacBio HiFi coverage is even throughout, allowing confident variant calling and phasing.
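Coverage drops of the kind described here can also be flagged programmatically rather than by eye. The following is a minimal sketch, not the method used in the webinar or by any particular tool: it assumes a per-position depth list is already available, and the function name, the 25% of median cutoff, and the minimum run length are arbitrary illustrative choices. The depth values are made up.

```python
from statistics import median


def find_coverage_drops(depths, min_fraction=0.25, min_length=3):
    """Return half-open (start, end) intervals of unusually low depth.

    A position counts as low when its depth is below min_fraction times the
    median depth of the whole track; runs shorter than min_length are ignored.
    Both thresholds are illustrative assumptions.
    """
    if not depths:
        return []
    cutoff = min_fraction * median(depths)
    drops = []
    start = None  # start index of the current low-depth run, if any
    for i, depth in enumerate(depths):
        if depth < cutoff:
            if start is None:
                start = i
        else:
            if start is not None and i - start >= min_length:
                drops.append((start, i))
            start = None
    # Close a run that extends to the end of the track.
    if start is not None and len(depths) - start >= min_length:
        drops.append((start, len(depths)))
    return drops


# Toy depth tracks (invented numbers, not taken from the presentation).
uneven_track = [30] * 10 + [2] * 5 + [31] * 10 + [0] * 4 + [29] * 6
even_track = [30] * 35

uneven_drops = find_coverage_drops(uneven_track)
even_drops = find_coverage_drops(even_track)
print("uneven:", uneven_drops)
print("even:", even_drops)
```

On the toy data, the uneven track yields two flagged intervals and the even track yields none, mirroring the qualitative contrast described above.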
The story was similar for the other examples presented. Notably, the same screenshot of the NCF1 gene that had been presented in January was shown, still containing the numerous previously highlighted errors in this region.
There are multiple reasons why native PacBio HiFi reads are of so much higher quality than synthetic long reads:
1. Molecular integrity: because native DNA molecules that are extracted from cells are directly submitted to sequencing, the DNA fragment sizes can be much longer compared to synthetic approaches. As a case in point, for the STRC IGV screenshot above, the average length of Infinity reads is 4.5 kb (median 3.8 kb), at least 4-fold shorter than typical PacBio HiFi libraries for human WGS (18-20 kb). This greater contiguity of accurate sequence reads improves the results in the calling of all variant types, haplotype phasing, coverage uniformity, and now also 5mC calling and epigenetic phasing, as well as in many other areas, including de novo assemblies of human, plant, animal, bacterial and viral genomes and metagenome communities, to name just a few.
2. Simpler sample preparation steps, and absence of amplification: the sequencing of native DNA molecules, by definition, does not require any DNA amplification or other sequence-altering molecular biology procedures, each of which is prone to biases, introduction of errors, fragmentation and other artifacts. Synthetic long reads, on the other hand, require these confounding steps. PacBio HiFi sequencing of unaltered DNA molecules allows for a view of the whole genome without these biases, including regions which are considered “difficult-to-sequence” with NGS due to limitations of the short-read sequencer itself, and which therefore cannot be overcome with synthetic long-read constructs. Further, amplification results in the complete loss of methylation information in synthetic long reads; in contrast, 5mC sequencing is now an integral part of PacBio HiFi sequencing without any additional effort.
3. Simpler bioinformatics: the sequencing of long native molecules also simplifies the bioinformatics at the back end. In contrast to synthetic long reads, PacBio native long HiFi reads do not require bioinformatically assigning which small fragments originate from the same original long molecule, complicated read assembly steps, correction of errors introduced during sample prep, and/or subtractions from control samples. Even if these steps are hidden on the short-read sequencer, they all carry the risk of introducing errors.
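The read-length comparison in the first point is simple arithmetic, sketched below. The lengths are the ones quoted above. The constant and function names are illustrative, and comparing the reads in a single screenshot against a typical library size range is a rough comparison, not a benchmark.

```python
def fold_difference(longer_bp, shorter_bp):
    """Return how many times longer longer_bp is than shorter_bp."""
    if shorter_bp <= 0:
        raise ValueError("shorter_bp must be positive")
    return longer_bp / shorter_bp


# Values quoted in the text, converted from kb to bp.
INFINITY_MEAN_BP = 4_500          # average Infinity read length in the STRC view
INFINITY_MEDIAN_BP = 3_800        # median Infinity read length in the STRC view
HIFI_RANGE_BP = (18_000, 20_000)  # typical HiFi library size for human WGS

mean_folds = tuple(fold_difference(h, INFINITY_MEAN_BP) for h in HIFI_RANGE_BP)
median_folds = tuple(fold_difference(h, INFINITY_MEDIAN_BP) for h in HIFI_RANGE_BP)

print("vs. mean:   %.1f- to %.1f-fold" % mean_folds)
print("vs. median: %.1f- to %.1f-fold" % median_folds)
```

Measured against the 4.5 kb mean, the HiFi range works out to roughly 4.0- to 4.4-fold longer; against the 3.8 kb median, the ratio is somewhat higher.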
For these reasons, long and accurate native PacBio HiFi reads provide the most accurate, contiguous, and complete genomic and epigenomic information, with ever-increasing applications and with performance that synthetic long reads fall far short of. We have seen this over the past decade, as numerous attempts at synthetic long-read approaches were all eventually abandoned. The webinar also highlighted how incomplete the standard offering of short-read sequencing is. As noted in the presentation, PacBio HiFi sequencing is on-market now and available today, providing a full view of genomes with best-in-class sequence quality, and giving researchers the highest confidence in their data and resulting research findings.