Bioinformatic analysis · Bioinformatics software/pipeline development · Long-read RNA-seq
*These authors contributed equally to this work
Abstract The Encyclopedia of DNA elements (ENCODE) project is a collaborative effort to create a comprehensive catalog of functional elements in the human genome. The current database comprises more than 19000 functional genomics experiments across more than 1000 cell lines and tissues using a wide array of experimental techniques to study the chromatin structure, regulatory and transcriptional landscape of the Homo sapiens and Mus musculus genomes. All experimental data, metadata, and associated computational analyses created by the ENCODE consortium are submitted to the Data Coordination Center (DCC) for validation, tracking, storage, and distribution to community resources and the scientific community. The ENCODE project has engineered and distributed uniform processing pipelines in order to promote data provenance and reproducibility as well as allow interoperability between genomic resources and other consortia. All data files, reference genome versions, software versions, and parameters used by the pipelines are captured and available via the ENCODE Portal. The pipeline code, developed using Docker and Workflow Description Language (WDL; https://openwdl.org/ ) is publicly available in GitHub, with images available on Dockerhub ( https://hub.docker.com ), enabling access to a diverse range of biomedical researchers. ENCODE pipelines maintained and used by the DCC can be installed to run on personal computers, local HPC clusters, or in cloud computing environments via Cromwell. Access to the pipelines and data via the cloud allows small labs the ability to use the data or software without access to institutional compute clusters. Standardization of the computational methodologies for analysis and quality control leads to comparable results from different ENCODE collections - a prerequisite for successful integrative analyses. Database URL: https://www.encodeproject.org/
Biological systems are immensely complex, organized into a multi-scale hierarchy of functional units based on tightly-regulated interactions between distinct molecules, cells, organs, and organisms. While experimental methods enable transcriptome-wide measurements across millions of cells, the most ubiquitous bioinformatic tools do not support systems-level analysis. Here we present hdWGCNA, a comprehensive framework for analyzing co-expression networks in high dimensional transcriptomics data such as single-cell and spatial RNA-seq. hdWGCNA provides built-in functions for network inference, gene module identification, functional gene enrichment analysis, statistical tests for network reproducibility, and data visualization. In addition to conventional single-cell RNA-seq, hdWGCNA is capable of performing isoform-level network analysis using long-read single-cell data. We showcase hdWGCNA using publicly available single-cell datasets from Autism spectrum disorder and Alzheimer’s disease brain samples, identifying disease-relevant co-expression network modules in specific cell populations. hdWGCNA is directly compatible with Seurat, a widely-used R package for single-cell and spatial transcriptomics analysis, and we demonstrate the scalability of hdWGCNA by analyzing a dataset containing nearly one million cells.
Pathogenic loss-of-function SCN1A variants cause a spectrum of seizure disorders. We previously identified variants in individuals with SCN1A-related epilepsy that fall in or near a poison exon (PE) in SCN1A intron 20 (20N). We hypothesized these variants lead to increased PE inclusion, which introduces a premature stop codon, and, therefore, reduced abundance of the full-length SCN1A transcript and Nav1.1 protein. We used a splicing reporter assay to interrogate PE inclusion in HEK293T cells. In addition, we used patient-specific induced pluripotent stem cells (iPSCs) differentiated into neurons to quantify 20N inclusion by long and short-read sequencing and Nav1.1 abundance by western blot. We performed RNA-antisense purification with mass spectrometry to identify RNA-binding proteins (RBPs) that could account for the aberrant PE splicing. We demonstrate that variants in/near 20N lead to increased 20N inclusion by long-read sequencing or splicing reporter assay and decreased Nav1.1 abundance. We also identified 28 RBPs that differentially interact with variant constructs compared to wild-type, including SRSF1 and HNRNPL. We propose a model whereby 20N variants disrupt RBP binding to splicing enhancers (SRSF1) and suppressors (HNRNPL), to favor PE inclusion. Overall, we demonstrate that SCN1A 20N variants cause haploinsufficiency and SCN1A-related epilepsies. This work provides insights into the complex control of RBP-mediated PE alternative splicing, with broader implications for PE discovery and identification of pathogenic PE variants in other genetic conditions
Using third-generation sequencers, long-read RNA-seq is increasingly applied in transcriptomic studies given its major advantage in characterizing full-length transcripts. A number of methods have been developed to analyze this new type of data for transcript isoforms and their abundance. Another application, which is significantly under-explored, is to identify and analyze single nucleotide variants (SNVs) in the RNA. Identification of SNVs, such as genetic mutations or RNA editing sites, is fundamental to many biomedical questions. In long-read RNA-seq, SNV analysis presents significant challenges, due to the well-known relatively high error rates of the third-generation sequencers. Here, we present the first study to detect and analyze RNA editing sites in long-read RNA-seq. Our new method, L-GIREMI, effectively handles sequencing errors and biases in the reads, and uses a model-based approach to score RNA editing sites. Applied to PacBio long-read RNA-seq data, L-GIREMI affords a high accuracy in RNA editing identification. In addition, the unique advantage of long reads allowed us to uncover novel insights about RNA editing occurrences in single molecules and double-stranded RNA (dsRNA) structures. L-GIREMI provides a valuable means to study RNA nucleotide variants in long-read RNA-seq.
Motivation: Weighted gene co-expression network analysis (WGCNA) is frequently used to identify modules of genes that are co-expressed across many RNA-seq samples. However, the current R implementation is slow, not designed to compare modules between multiple WGCNA networks, and results are hard to interpret and visualize. We introduce the PyWGCNA Python package designed to identify and compare co-expression modules from RNA-seq data. Results: We apply PyWGCNA to two distinct datasets of brain bulk RNA-seq from MODEL-AD to identify modules associated with the genotypes. We compare the resulting modules to each other to find modules with significant overlap across the datasets. Availability: The PyWGCNA library for Python 3 is available on PyPi at https://pypi.org/project/PyWGCNA and on GitHub at https://github.com/mortazavilab/PyWGCNA.
Alternative RNA isoforms are defined by promoter choice, alternative splicing, and polyA site selection. Although differential isoform expression is known to play a large regulatory role in eukaryotes, it has proved challenging to study with standard short-read RNA-seq because of the uncertainties it leaves about the full-length structure and precise termini of transcripts. The rise in throughput and quality of long-read sequencing now makes it possible, in principle, to unambiguously identify most transcript isoforms from beginning to end. However, its application to single-cell RNA-seq has been limited by throughput and expense. Here, we develop and characterize long-read Split-seq (LR-Split-seq), which uses a combinatorial barcoding-based method for sequencing single cells and nuclei with long reads. We show that LR-Split-seq can associate isoforms with cell types with relative economy and design flexibility. We characterize LR-Split-seq for whole cells and nuclei by using the well-studied mouse C2C12 system in which mononucleated myoblast cells differentiate and fuse into multinucleated myotubes. We show that the overall results are reproducible when comparing long- and short-read data from the same cell or nucleus. We find substantial evidence of differential isoform expression during differentiation including alternative transcription start site (TSS) usage. We integrate the resulting isoform expression dynamics with snATAC-seq chromatin accessibility to validate TSS-driven isoform choices. LR-Split-seq provides an affordable method for identifying cluster-specific isoforms in single cells that can be further quantified with companion deep short-read scRNA-seq from the same cell populations.
Accurate transcription start site (TSS) annotations are essential for understanding transcriptional regulation and its role in human disease. Gene collections such as GENCODE contain annotations for tens of thousands of TSSs, but not all of these annotations are experimentally validated, nor do they contain information on cell type-specific usage. Therefore, we sought to generate a collection of experimentally validated TSSs by integrating RNA Annotation and Mapping of Promoters for the Analysis of Gene Expression (RAMPAGE) data from 115 cell and tissue types, which resulted in a collection of approximately 50 thousand representative RAMPAGE peaks. These peaks were primarily proximal to GENCODE-annotated TSSs and were concordant with other transcription assays. Because RAMPAGE uses paired-end reads, we were then able to connect peaks to transcripts by analyzing the genomic positions of the 3’ ends of read mates. Using this paired-end information, we classified the vast majority (37 thousand) of our RAMPAGE peaks as verified TSSs, updating TSS annotations for 20% of GENCODE genes. We also found that these updated TSS annotations were supported by epigenomic and other transcriptomic datasets. To demonstrate the utility of this RAMPAGE rPeak collection, we intersected it with the NHGRI/EBI genome-wide association studies (GWAS) catalog and identified new candidate GWAS genes. Overall, our work demonstrates the importance of integrating experimental data to further refine TSS annotations and provides a valuable resource for the biological community.
Acoustic deterrents can reduce marine mammal interactions with fisheries and aquacultures, but they contribute to an increasing level of underwater noise. In Southern California, commercially produced explosive deterrents, commonly known as “seal bombs,” are used to protect fishing gear and catch from pinniped predation, which can cause extensive economic losses for the fishing community. Passive acoustic monitoring data collected between 2005 and 2016 at multiple sites within the Southern California Bight and near Monterey Bay revealed high numbers of these small-charge underwater explosions, long-term, spatio-temporal patterns in their occurrence, and their relation to different commercial purse-seine fishing sectors. The vast majority of explosions occurred at nighttime and at many nearshore sites high explosion counts were detected, up to 2,800/day. Received sound exposure levels of up to 189 dB re 1 μPa2-s indicate the potential for negative effects on marine life, especially in combination with the persistence of recurring explosions during periods of peak occurrence. Due to the highly significant correlation and similar spatio-temporal patterns of market squid landings and explosion occurrence at many sites, we conclude that the majority of the recorded explosions come from seal bombs being used by the California market squid purse-seine fishery. Additionally, seal bomb use declined over the years of the study, potentially due to a combination of reduced availability of market squid driven by warm water events in California and regulation enforcement. This study is the first to provide results on the distribution and origin of underwater explosions off Southern California, but there is a substantial need for further research on seal bomb use in more recent years and their effects on marine life, as well as for establishing environmental regulations on their use as a deterrent.
With increased usage of long-read sequencing technologies to perform transcriptome analyses, there becomes a greater need to evaluate different methodologies including library preparation, sequencing platform, and computational analysis tools. Here, we report the study design of a community effort called the Long-read RNA-Seq Genome Annotation Assessment Project (LRGASP) Consortium, whose goals are characterizing the strengths and remaining challenges in using long-read approaches to identify and quantify the transcriptomes of both model and non-model organisms. The LRGASP organizers have generated cDNA and direct RNA datasets in human, mouse, and manatee samples using different protocols followed by sequencing on Illumina, Pacific Biosciences, and Oxford Nanopore Technologies platforms. Participants will use the provided data to submit predictions for three challenges: transcript isoform detection with a high-quality genome, transcript isoform quantification, and de novo transcript isoform identification. Evaluators from different institutions will determine which pipelines have the highest accuracy for a variety of metrics using benchmarks that include spike-in synthetic transcripts, simulated data, and a set of undisclosed, manually curated transcripts by GENCODE. We also describe plans for experimental validation of predictions that are platform-specific and computational tool-specific. We believe that a community effort to evaluate long-read RNA-seq methods will help move the field toward a better consensus on the best approaches to use for transcriptome analyses.
Motivation Long-read RNA-sequencing technologies such as PacBio and Oxford Nanopore have discovered an explosion of new transcript isoforms that are difficult to visually analyze using currently available tools. We introduce the Swan Python library, which is designed to analyze and visualize transcript models. Results Swan finds 4909 differentially expressed transcripts between cell lines HepG2 and HFFc6, including 279 that are differentially expressed even though the parent gene is not. Additionally, Swan discovers 285 reproducible exon skipping and 47 intron retention events not recorded in the GENCODE v29 annotation. Availability and implementation The Swan library for Python 3 is available on PyPi at https://pypi.org/project/swan-vis/ and on GitHub at https://github.com/mortazavilab/swan_vis.
Alternative splicing is widely acknowledged to be a crucial regulator of gene expression and is a key contributor to both normal developmental processes and disease states. While cost-effective and accurate for quantification, short-read RNA-seq lacks the ability to resolve full-length transcript isoforms despite increasingly sophisticated computational methods. Long-read sequencing platforms such as Pacific Biosciences (PacBio) and Oxford Nanopore (ONT) bypass the transcript reconstruction challenges of short reads. Here we introduce TALON, the ENCODE4 pipeline for platform-independent analysis of long-read transcriptomes. We apply TALON to the GM12878 cell line and show that while both PacBio and ONT technologies perform well at full-transcript discovery and quantification, each displayed distinct technical artifacts. We further apply TALON to mouse hippocampus and cortex transcriptomes and find that 422 genes found in these regions have more reads associated with novel isoforms than with annotated ones. We demonstrate that TALON is a capable of tracking both known and novel transcript models as well as their expression levels across datasets for both simple studies and in larger projects. These properties will enable TALON users to move beyond the limitations of short-read data to perform isoform discovery and quantification in a uniform manner on existing and future long-read platforms.
Pre-mRNA splicing is regulated through multiple trans-acting splicing factors. These regulators interact with the pre-mRNA at intronic and exonic positions. Given that most exons are protein coding, the evolution of exons must be modulated by a combination of selective coding and splicing pressures. It has previously been demonstrated that selective splicing pressures are more easily deconvoluted when phylogenetic comparisons are made for exons of identical size, suggesting that exon size-filtered sequence alignments may improve identification of nucleotides evolved to mediate efficient exon ligation. To test this hypothesis, an exon size database was created, filtering 76 vertebrate sequence alignments based on exon size conservation. In addition to other genomic parameters, such as splice site strength, gene position or flanking intron length, this database permits identification of exons that are size and/or sequence conserved. Highly size-conserved exons are always sequence conserved. However, sequence conservation does not necessitate exon size conservation. Our analysis identified evolutionarily young exons and demonstrated that length conservation is a strong predictor of alternative splicing. A published dataset of 5000 exonic SNPs associated with disease was analyzed to test the hypothesis that exon size-filtered sequence comparisons increase detection of splice-altering nucleotides. Improved splice predictions could be achieved when mutations occur at the third codon position, especially when a mutation decreases exon inclusion efficiency. The results demonstrate that coding pressures dominate nucleotide composition at invariable codon positions and that exon-size filtered sequence alignments permit identification of splice-altering nucleotides at wobble positions.