The imperfect generation of single-cell data
Noise is present in many kinds of genomic data, where it can corrupt the underlying biological signal and hinder downstream analyses. Here we discuss the types of noise and bias known to affect various bulk and single-cell sequencing technologies.
Despite improvements in single-cell RNA sequencing (scRNA-seq) techniques, the quality of the resulting data still varies because of multiple technical factors, including amplification bias, cell cycle effects, library size differences, and the RNA capture rate.
For example, modern droplet-based scRNA-seq techniques can profile up to millions of cells in a single experiment, but they are particularly prone to variability because low RNA capture leads to relatively shallow sequencing. A challenge specific to scRNA-seq is the occurrence of “dropouts”: an expressed gene goes undetected, producing a false zero observation.
Dropouts not only mask genes that may be differentially expressed and therefore useful for cell type detection, but also make it harder for computational methods to capture useful signal because of the substantial heterogeneity they introduce. Traditional methods for imputing missing values may not be suitable for scRNA-seq data, because a true zero count is not trivially distinguishable from a false zero caused by dropout.
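The link between low RNA capture and false zeros can be illustrated with a small simulation. The sketch below is not a model taken from any particular scRNA-seq protocol: it assumes that each true transcript is captured independently with a fixed probability (binomial thinning), and the function name, capture rates, and matrix sizes are all hypothetical choices made for illustration.

```python
import numpy as np

def simulate_capture(true_counts, capture_rate, rng):
    """Binomially thin true molecule counts to mimic incomplete RNA capture.

    Assumption: each molecule is captured independently with probability
    `capture_rate`. Real protocols have additional sources of noise.
    """
    return rng.binomial(true_counts, capture_rate)

rng = np.random.default_rng(0)
n_cells, n_genes = 500, 200

# Hypothetical "true" expression: every gene is expressed in every cell
# (the +1 guarantees a count of at least 1), so every observed zero is false.
true_counts = rng.poisson(lam=5.0, size=(n_cells, n_genes)) + 1

observed_high = simulate_capture(true_counts, capture_rate=0.5, rng=rng)
observed_low = simulate_capture(true_counts, capture_rate=0.05, rng=rng)

# Fraction of entries that read zero although the gene is expressed.
zero_frac_high = float((observed_high == 0).mean())
zero_frac_low = float((observed_low == 0).mean())

print(f"false-zero fraction at 50% capture: {zero_frac_high:.3f}")
print(f"false-zero fraction at  5% capture: {zero_frac_low:.3f}")
```

Because every gene in this toy matrix is truly expressed, every observed zero is a dropout. In real data the true zeros and the false zeros are mixed together, which is why generic missing-value imputation is hard to apply.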
Epigenomic sequencing techniques suffer from sample- and assay-specific limitations. Sample-to-sample variation, which is the main source of sample-specific bias in sequencing data, affects epigenomic datasets more than genomic ones. DNA content varies little between the cells of an individual, whereas the epigenome determines how different cell types emerge from the same pluripotent stem cell.
When analyzing genetic alterations, we would therefore expect less variation to arise from contamination by different cell types in the same tissue. In epigenomic analyses, however, such contamination can produce false-positive results. Likewise, each epigenomic sequencing technology has its own limitations that must be carefully considered during data processing and interpretation.
For example, DNase-seq and ATAC-seq are affected by the sequence preferences of the DNase I enzyme and the Tn5 transposase, respectively, and ChIP-seq suffers from antibody cross-reactivity and from the pull-down of spatially proximal genomic regions as a side effect of cross-linking.
The hydroxymethyl group protects cytosine from the chemical conversion just as the methyl group does, so hydroxymethylcytosine and methylcytosine are indistinguishable in conventional bisulfite sequencing experiments. Most of these limitations therefore produce false-positive results.
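The bisulfite ambiguity can be made concrete with a toy read-out function. This is a deliberately simplified sketch, assuming complete conversion of unmodified cytosine and complete protection of both modified forms; the base labels `5mC` and `5hmC` and the function name are illustrative conventions, not the output format of any real pipeline.

```python
def bisulfite_read(bases):
    """Return the sequence read after idealized bisulfite treatment.

    Assumptions: unmodified cytosine is fully converted and read as T,
    while 5mC and 5hmC are both fully protected and read as C.
    """
    readout = {"C": "T", "5mC": "C", "5hmC": "C"}
    return "".join(readout.get(base, base) for base in bases)

# Three hypothetical templates that differ only at the second position.
unmodified = ["A", "C", "G", "C", "T"]
methylated = ["A", "5mC", "G", "C", "T"]
hydroxymethylated = ["A", "5hmC", "G", "C", "T"]

read_unmodified = bisulfite_read(unmodified)
read_methylated = bisulfite_read(methylated)
read_hydroxymethylated = bisulfite_read(hydroxymethylated)

print(read_unmodified, read_methylated, read_hydroxymethylated)
```

The methylated and hydroxymethylated templates yield the same read, so a site called as methylated in a conventional experiment may in fact carry either modification.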
The complementarity of different data sources
Many biobanks focus on multi-omics analysis, covering the genome, proteome, transcriptome, epigenome, and microbiome. Each type of data has its own strengths and weaknesses, and integrating data from multiple sources improves the interpretation of each.
For example, single-cell ATAC sequencing (scATAC-seq) can uniquely reveal gene promoter regions and the regulatory landscape of the genome, but it may not currently be as effective as transcriptomics at detecting previously unobserved cell types.
Genome-wide association studies (GWAS) have identified thousands of genetic variants associated with complex diseases and traits, but they are not well suited to revealing complex interactions among those variants.
Furthermore, because of the cost of collecting data at a reasonably large sample size, many human health datasets contain only one major omics data type. Machine learning (ML) algorithms designed to integrate multiple disjoint datasets are therefore in great demand.
The issues of generalizability of machine learning
Data from different modalities have different structures and noise distributions, which poses unique challenges for the machine learning algorithms used to combine them.
For example, scRNA-seq data produced from the same sample with different sequencing technologies usually contain large batch effects, where the expression of genes in one batch differs systematically from that in another. Such differences can mask the underlying biological signal or introduce spurious structure into the data.
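A batch effect of this kind can be sketched with simulated data. The example below assumes a purely additive, gene-wise offset in one batch and removes it by centering each gene within each batch. This is a simple illustrative correction, not one of the dedicated integration methods used in practice, and all names, sizes, and effect magnitudes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
n_cells, n_genes = 300, 50          # cells per batch, genes

# Shared biology: two hypothetical cell types that differ in the first 10 genes.
cell_type = rng.integers(0, 2, size=2 * n_cells)
signal = np.zeros((2 * n_cells, n_genes))
signal[:, :10] = 2.0 * cell_type[:, None]

# Technical effect: batch 1 carries an additive gene-wise offset (assumption).
batch = np.repeat([0, 1], n_cells)
batch_shift = rng.normal(0.0, 3.0, size=n_genes)
noise = rng.normal(0.0, 1.0, size=signal.shape)
expr = signal + noise + batch[:, None] * batch_shift

def batch_gap(x, batch_labels):
    """Mean absolute difference between per-gene means of batch 0 and batch 1."""
    mean0 = x[batch_labels == 0].mean(axis=0)
    mean1 = x[batch_labels == 1].mean(axis=0)
    return float(np.abs(mean0 - mean1).mean())

def center_per_batch(x, batch_labels):
    """Subtract each batch's per-gene mean (simple additive correction)."""
    out = x.astype(float).copy()
    for b in np.unique(batch_labels):
        mask = batch_labels == b
        out[mask] -= out[mask].mean(axis=0)
    return out

gap_before = batch_gap(expr, batch)
corrected = center_per_batch(expr, batch)
gap_after = batch_gap(corrected, batch)

print(f"batch gap before centering: {gap_before:.3f}")
print(f"batch gap after centering:  {gap_after:.3e}")
```

Per-batch centering is only reasonable when the batches share a similar cell-type composition. If they do not, the same operation would remove genuine biological differences along with the technical offset.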
In the genomic context, machine learning often relies on models derived from a set of assumptions that do not always match the data being used.
Violating these assumptions can substantially reduce the generalizability of a model. It is not rare for machine learning models trained on one type of genomic data to provide little insight into another, even though both types may reflect the same biological process.