A Review of Transcriptome Analysis

Interpretation of Transcriptome Research Review Articles

Today I will introduce an article about RNA-seq analysis that I read recently. It was published in Genome Biology under the title "A survey of best practices for RNA-seq data analysis". Because the article is long and dry, I have marked the information I think is important in bold red, so you can see the key points directly. Don't ask me why I'm so good. Please call me Lei Feng.


Summary


At present, RNA-seq data is widely used, but no single pipeline can solve every problem. We focus on the important steps in RNA-seq analysis: experimental design, quality control, read alignment, expression quantification, visualization, differential expression, detection of alternative splicing, functional annotation, gene fusion detection, eQTL mapping, etc.

The article will discuss the key points and problems in each step of analysis, and finally explain how RNA seq is combined with other data for analysis.


Background


The core use of transcriptome data is to identify transcripts and quantify their expression. Because of this, RNA-seq can be run as a standalone sequencing project, independent of other omics data, and it has become hugely popular. Since then, many standards and analysis guidelines have appeared, which means new users have to know and understand every experimental step in order to run the experiment well.

The current situation is that there is no fixed pipeline; the whole analysis changes with the species and the purpose of the study. This article only covers conventional RNA-seq analysis, that is, the main steps listed in the summary.

The article also points out that quality checkpoints should be added throughout the process in order to get good results.

1. Experimental Design


To get interesting biological answers, the experimental design must be sound. First, we need to choose the library type, the sequencing depth and the number of biological replicates. Second, the sequencing run itself should be planned well, so that it produces as little unusable or biased data as possible.

There are two ways to prepare RNA for transcriptome sequencing: poly(A) selection and ribosomal RNA depletion. For eukaryotes the first method is usually used; bacterial mRNA has no poly(A) tail, so the second method should be used.


The article points out that longer reads improve mapping efficiency and transcript identification. Which kind of data to use depends on the purpose of the analysis. If the species is well annotated and you only want to study expression levels, cheap, short single-end (SE) reads are enough. If the annotation is poor, paired-end (PE) and longer reads are a better choice.

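The sequencing depth depends on the complexity of the transcriptome. Neither too shallow nor too deep is good: with too little depth, lowly expressed transcripts are missed, while very deep sequencing can pick up transcriptional noise.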

As for replicates, two sources of variability have to be considered. Technical variability, introduced by the procedure itself, is hard to correct afterwards; you can only be careful and try to minimize it during the experiment. Variability among the biological replicates you deliberately include is handled with statistical tools.


In the experimental design, if there are too many samples to process at once, they should be split into batches with the experimental groups in mind, so that each batch contains samples from every group. This reduces errors.
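As a rough illustration of splitting samples into batches by group, here is a minimal sketch that spreads each group evenly across batches. The function name `assign_batches`, the sample ids and the group labels are all made up for this example; the article does not prescribe a particular procedure.

```python
import random
from collections import defaultdict

def assign_batches(samples, n_batches, seed=0):
    """Spread each group's samples across batches in round-robin order.

    samples: list of (sample_id, group) tuples (hypothetical ids and labels).
    Returns a dict mapping batch index -> list of sample ids.
    """
    rng = random.Random(seed)
    by_group = defaultdict(list)
    for sample_id, group in samples:
        by_group[group].append(sample_id)

    batches = defaultdict(list)
    slot = 0
    for group in sorted(by_group):
        members = by_group[group]
        rng.shuffle(members)  # randomize the order within a group
        for sample_id in members:
            batches[slot % n_batches].append(sample_id)
            slot += 1
    return dict(batches)

samples = [("c1", "control"), ("c2", "control"), ("c3", "control"), ("c4", "control"),
           ("t1", "treated"), ("t2", "treated"), ("t3", "treated"), ("t4", "treated")]
batches = assign_batches(samples, n_batches=2)
```

With four control and four treated samples and two batches, each batch ends up with two samples from each group, so group and batch are not confounded.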


2. RNA seq analysis


The RNA-seq library preparation process includes RNA fragmentation, cDNA synthesis, adapter ligation, PCR amplification, bar-coding and lane loading. Here we should pay attention to data quality control and to the normalization of library size, and reduce sequence bias, for example by using adapters with random nucleotides at the extremities or by using chemical-based fragmentation instead of RNase III-based fragmentation.

If there are too many samples and they have to be sequenced in separate runs or on different lanes, the batch effect must be dealt with, so that factors other than the biology do not affect the experiment.


(1) Quality control checkpoints


<1> Raw data

This includes GC content, sequence quality, the presence of adapters, the proportion of duplicated reads, etc. These metrics should be consistent across sequencing samples from the same species. Samples that differ from the others by more than 30% should be discarded.

The recommended tools for these checks are FastQC and NGSQC. In addition, low-quality bases at the ends of a read should be trimmed off. The recommended tools are the FASTX-Toolkit and Trimmomatic.
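To make the idea of end trimming concrete, here is a simplified sketch. It assumes Phred+33 quality encoding and a cutoff of Q20, both of which are my assumptions, and it is not the algorithm used by the FASTX-Toolkit or Trimmomatic.

```python
def trim_low_quality_ends(seq, qual, min_q=20, offset=33):
    """Drop bases from both ends until a base with quality >= min_q is reached.

    seq:  read sequence
    qual: quality string of the same length (Phred+33 assumed)
    """
    scores = [ord(ch) - offset for ch in qual]
    start, end = 0, len(seq)
    while start < end and scores[start] < min_q:
        start += 1
    while end > start and scores[end - 1] < min_q:
        end -= 1
    return seq[start:end], qual[start:end]

# '#' encodes Q2 and 'I' encodes Q40 under Phred+33
trimmed_seq, trimmed_qual = trim_low_quality_ends("ACGTACGT", "#IIIII##")
```

Real trimmers usually work with sliding windows or running sums rather than single bases, so treat this only as a picture of what "cutting off the low-quality ends" means.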

<2> Read alignment

One measure is the percentage of mapped reads.

According to the article, 70-90% of reads are expected to map to the human genome.

The other measure is the uniformity of read coverage on exons and on the mapped strand. If reads are enriched at the 3' end, this may indicate low quality.

The GC content of the mapped reads is also used to assess sequence bias. Recommended software: RSeQC, Qualimap.

<3> Expression quantification

Check for GC content and gene length biases, so that normalization can be applied properly. Recommended software: NOISeq and EDASeq.

<4> Reproducibility of biological replicates

Here we evaluate the correlation between samples; replicates should reach a Spearman R2 > 0.9. At the same time, the batch effect must be evaluated and removed. PCA is the main tool for this (see the previous article for details).
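For readers who want to see what the Spearman check looks like, here is a minimal pure-Python sketch. The two replicate vectors are made-up numbers and the helper names are mine; in practice you would normally rely on a statistics library.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # 1-based average rank of the tie block
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

def spearman(x, y):
    """Spearman correlation = Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical expression values for the same five genes in two replicates
rep1 = [10, 200, 35, 0, 870]
rep2 = [12, 180, 40, 1, 900]
rho = spearman(rep1, rep2)
passes = rho ** 2 > 0.9
```

Because Spearman only looks at ranks, two replicates that order the genes the same way get a perfect score even when the raw values differ.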

<5> Transcript identification

If there is a reference, aligning to it directly is fine. If there is no reference, we need to assemble first and then quantify expression. The data used for assembly and for quantification should be kept consistent.


(2) Alignment


(3-1) Transcript identification


With a reference, the software used to identify transcripts depends on the situation and includes GRIT, Cufflinks, StringTie, Augustus (which assists with gene prediction), etc.

In practice it is difficult to obtain full-length transcripts from short reads, and the predicted start and end positions are not accurate.


(3-2) De novo assembly


If there is no reference, or the reference is poor, we need to assemble from scratch. Main software: SOAPdenovo-Trans [30], Oases [31], Trans-ABySS [32] or Trinity [33]. Lowly expressed regions have too little coverage to be assembled, while regions with very high read coverage may be assembled incorrectly. One suggestion: if there are multiple samples, it is recommended to pool them and assemble them together.



(4) Transcript quantification


Quantification is usually done through read alignment, and can also be done with k-mers. The raw counts of mapped reads can be used, but they do not take gene length and other factors into account. RPKM is a within-sample normalized measure that removes the influence of gene length and library size. Similar measures include FPKM, RPKs, TPM, etc. Main software: Cufflinks, RSEM (RNA-Seq by Expectation Maximization), eXpress, Sailfish and kallisto.
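To show how these measures remove the effect of gene length and library size, here is a small sketch using the standard RPKM and TPM definitions. The counts and gene lengths are made-up numbers, and real tools work with effective lengths and ambiguous reads, which this sketch ignores.

```python
def rpkm(counts, lengths_bp):
    """Reads per kilobase of transcript per million mapped reads."""
    total = sum(counts)
    return [c * 1e9 / (total * length) for c, length in zip(counts, lengths_bp)]

def tpm(counts, lengths_bp):
    """Transcripts per million: length-normalize first, then scale to 1e6."""
    rates = [c / length for c, length in zip(counts, lengths_bp)]
    scale = sum(rates)
    return [r / scale * 1e6 for r in rates]

# Hypothetical genes whose raw counts differ only because of their length
counts = [100, 200, 700]
lengths_bp = [1000, 2000, 7000]

rpkm_values = rpkm(counts, lengths_bp)
tpm_values = tpm(counts, lengths_bp)
```

The three genes have very different raw counts, but after normalization they come out equal, which is exactly the point of correcting for gene length. TPM values always sum to one million within a sample, which makes them a little easier to compare between samples than RPKM.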

(5) Differential expression analysis


Many tools are in common use, and attention should be paid to the data distribution that each tool assumes.

It is also important to evaluate and remove the batch effect (for example with ComBat). At present, few tools perform well across different datasets, so for important results it is recommended to use several tools and combine their results.
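One simple way to combine several tools is to keep only the genes that enough of them agree on. The sketch below assumes you already have a list of significant genes from each tool; the tool labels and gene names are placeholders, and the voting rule is my own illustration rather than something the article specifies.

```python
from collections import Counter

def consensus_genes(calls_by_tool, min_tools=2):
    """Return genes called differentially expressed by at least min_tools tools."""
    votes = Counter()
    for genes in calls_by_tool.values():
        votes.update(set(genes))  # each tool votes once per gene
    return sorted(gene for gene, n in votes.items() if n >= min_tools)

# Placeholder results from three hypothetical tools
calls_by_tool = {
    "tool_a": ["geneA", "geneB", "geneC"],
    "tool_b": ["geneB", "geneC", "geneD"],
    "tool_c": ["geneC", "geneE"],
}
agreed_by_two = consensus_genes(calls_by_tool, min_tools=2)
agreed_by_all = consensus_genes(calls_by_tool, min_tools=3)
```

Requiring agreement trades sensitivity for confidence: the stricter the vote, the shorter and more reliable the list.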

(6) Alternative splicing analysis


Method 1: based on isoform expression and total gene expression. For example, rSeqDiff uses a hierarchical likelihood ratio test to detect differential gene expression without splicing change and differential isoform expression simultaneously.

Method 2: the exon-based approach detects signals of alternative splicing by comparing the distributions of reads on exons and junctions of the genes between the compared samples.
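As a toy picture of the exon-based idea, the sketch below compares how a gene's reads are distributed over its exons in two samples. The exon counts are invented, and this only computes a shift in proportions; actual tools apply proper statistical tests and also use junction reads.

```python
def exon_fractions(exon_counts):
    """Fraction of a gene's reads that fall on each exon."""
    total = sum(exon_counts)
    return [c / total for c in exon_counts]

def usage_shift(counts_a, counts_b):
    """Per-exon change in read fraction from sample A to sample B."""
    fractions_a = exon_fractions(counts_a)
    fractions_b = exon_fractions(counts_b)
    return [b - a for a, b in zip(fractions_a, fractions_b)]

# Hypothetical read counts on the four exons of one gene
sample_a = [100, 100, 100, 100]
sample_b = [200, 200, 40, 200]  # the gene is up overall, but exon 3 is depleted

shift = usage_shift(sample_a, sample_b)
most_changed = max(range(len(shift)), key=lambda i: abs(shift[i]))
```

Working with fractions rather than raw counts separates a change in splicing from a change in overall gene expression: here the third exon stands out even though the gene as a whole is more highly expressed in sample B.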


(7) Visualization


Users need to visually see the changes of read coverage on genes to evaluate the robustness of the results.

Recommended software: UCSC browser, Integrative Genomics Viewer (IGV), Genome Maps, Savant, RNAseqViewer, etc.


In addition, the article also covers gene fusion detection, small RNAs and functional annotation.


The article then explores how RNA-seq can be combined with other data types for analysis, including genome data, DNA methylation data, chromatin features, microRNAs, proteomics and metabolomics.

Finally, the article explains the impact of single cell sequencing technology and third-generation sequencing on transcriptome sequencing:

Single cell studies are meaningful only when a set of individual cell libraries are compared with the cell population, with the aim of identifying subgroups of multiple cells with distinct combinations of expressed genes.

Long-read sequencing provides amplification-free, single-molecule sequencing of cDNAs that enables recovery of full-length transcripts without the need for an assembly step.

 Code farmers without code
SXR