DiBiG
ICBR Bioinformatics	Powered by Actor, v1.0

RNAseq - Alignment and differential expression analysis

Title: GE7334
Project: (none)
Started on: 4/1/2024 11:48:24
Hostname: login7.ufhpc
Run directory: /blue/licht/runs/Evans-MDS/GE7334/GE7334
Configuration: GE7334.conf

Table of contents:

1. Input data
2. Trimming and quality control
3. Alignment to transcriptome
4. Genome coverage
5. Expression analysis - quantification
6. Differential expression - protein-coding genes
7. Differential expression - all genes
8. Differential expression - isoform level
9. Differential expression - combined files
10. MultiQC report
11. UCSC hub
12. Methods summary

1. Input data
The following table summarizes the samples, conditions, and contrasts in this analysis. A readset is either a single fastq file or a pair of fastq files (for paired-end sequencing).

Category	Data
*Summary of input data*
Reference genome:	mm10
Experimental conditions:	Parental, KMT2C_KO, DNMT3A_KO, KMT2C_DNMT3A_KO
Contrasts:	KMT2C_KO vs. Parental, DNMT3A_KO vs. Parental, KMT2C_DNMT3A_KO vs. Parental, KMT2C_DNMT3A_KO vs. KMT2C_KO, KMT2C_DNMT3A_KO vs. DNMT3A_KO
Number of samples	12
*Sequencing data*
Total number of reads:	518,208,135
Average reads per sample:	43,184,011

Table 1. Summary of input data

Condition	Sample	Number of reads	% Reads
Parental	P2	38,386,946	7.41%
	P3	38,850,672	7.50%
	P4	41,102,912	7.93%
KMT2C_KO	K3	46,438,409	8.96%
	K4	48,825,766	9.42%
	K5	49,734,763	9.60%
DNMT3A_KO	S47	43,393,125	8.37%
	S53	47,441,282	9.15%
	S65	39,344,300	7.59%
KMT2C_DNMT3A_KO	S1	41,391,509	7.99%
	S11	41,224,494	7.96%
	S66	42,073,957	8.12%

Table 2. Number of reads in each sample.
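The percentages in Table 2 are each sample's share of the total read count. The following is a minimal Python sketch of that arithmetic, using the counts from Table 2; the variable names are illustrative and are not part of the pipeline.

```python
# Read counts per sample, copied from Table 2.
reads = {
    "P2": 38_386_946, "P3": 38_850_672, "P4": 41_102_912,
    "K3": 46_438_409, "K4": 48_825_766, "K5": 49_734_763,
    "S47": 43_393_125, "S53": 47_441_282, "S65": 39_344_300,
    "S1": 41_391_509, "S11": 41_224_494, "S66": 42_073_957,
}

total_reads = sum(reads.values())
average_reads = total_reads / len(reads)

# Share of the total contributed by each sample, as a percentage.
percent_reads = {name: 100 * n / total_reads for name, n in reads.items()}

print(f"Total: {total_reads:,}  Average: {average_reads:,.0f}")
for name, pct in percent_reads.items():
    print(f"{name}\t{reads[name]:,}\t{pct:.2f}%")
```

The total and average computed this way should match the values reported in Table 1.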

2. Trimming and quality control
The input sequences were trimmed using fastp (version 0.23.4). The following table provides links to the quality control reports after trimming, as well as the number of reads in the trimmed files.

Sample	Readset	Reads before trim	Reads after trim	QC after trim	% Retained
P2	P2_r1	38,386,946	32,414,933	P2_r1	84.44%
P3	P3_r1	38,850,672	32,713,174	P3_r1	84.20%
P4	P4_r1	41,102,912	34,661,387	P4_r1	84.33%
K3	K3_r1	46,438,409	39,375,777	K3_r1	84.79%
K4	K4_r1	48,825,766	41,304,349	K4_r1	84.60%
K5	K5_r1	49,734,763	42,098,734	K5_r1	84.65%
S47	S47_r1	43,393,125	36,750,558	S47_r1	84.69%
S53	S53_r1	47,441,282	40,275,450	S53_r1	84.90%
S65	S65_r1	39,344,300	33,494,534	S65_r1	85.13%
S1	S1_r1	41,391,509	34,922,393	S1_r1	84.37%
S11	S11_r1	41,224,494	34,734,268	S11_r1	84.26%
S66	S66_r1	42,073,957	35,933,262	S66_r1	85.40%

Table 3. Number of reads in input files and links to QC reports.
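The % Retained column appears to be the number of reads after trimming divided by the number before trimming. A small Python sketch of that calculation follows; the function name is hypothetical and only the P2_r1 values from Table 3 are used.

```python
def percent_retained(before: int, after: int) -> float:
    """Percentage of reads that survive trimming."""
    if before <= 0:
        raise ValueError("read count before trimming must be positive")
    return 100 * after / before


# Values for readset P2_r1, copied from Table 3.
p2_retained = percent_retained(38_386_946, 32_414_933)
print(f"P2_r1: {p2_retained:.2f}% retained")
```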

3. Alignment to transcriptome
The input sequences were aligned to the mm10 transcriptome using STAR version 2.7.11b. The following table reports the number of alignments to the genome and the transcriptome for each sample. Please note that the number of alignments will in general be higher than the number of reads because the same read may align to multiple isoforms of the same gene. The WIG files can be uploaded to the UCSC Genome Browser as custom tracks.

Sample	Input reads	Genome alignments	Genome alignment rate (alignments per read)	Transcriptome alignments	Transcriptome alignment rate	Alignment report
P2	32,414,933	63,581,583	1.96	24,063,737	74.24%	P2.star/Log.final.out
P3	32,713,174	65,479,078	2.00	23,821,285	72.82%	P3.star/Log.final.out
P4	34,661,387	69,171,966	2.00	25,420,511	73.34%	P4.star/Log.final.out
K3	39,375,777	80,177,808	2.04	29,563,811	75.08%	K3.star/Log.final.out
K4	41,304,349	82,986,468	2.01	31,384,829	75.98%	K4.star/Log.final.out
K5	42,098,734	83,489,482	1.98	32,009,373	76.03%	K5.star/Log.final.out
S47	36,750,558	70,114,685	1.91	27,230,995	74.10%	S47.star/Log.final.out
S53	40,275,450	79,433,102	1.97	30,396,055	75.47%	S53.star/Log.final.out
S65	33,494,534	63,695,888	1.90	24,024,522	71.73%	S65.star/Log.final.out
S1	34,922,393	69,677,752	2.00	25,231,521	72.25%	S1.star/Log.final.out
S11	34,734,268	68,503,141	1.97	25,759,005	74.16%	S11.star/Log.final.out
S66	35,933,262	72,960,547	2.03	26,917,408	74.91%	S66.star/Log.final.out

Table 4. Number of alignments to genome and transcriptome.
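The values in Table 4 are consistent with the genome alignment rate being a plain ratio (alignments per input read, which can exceed 1) and the transcriptome alignment rate being a percentage of input reads. The Python sketch below reproduces both figures for sample P2 under that assumption; the function name is hypothetical.

```python
def alignment_rates(input_reads, genome_alignments, transcriptome_alignments):
    """Return (genome alignments per read, transcriptome alignment rate in %)."""
    if input_reads <= 0:
        raise ValueError("input_reads must be positive")
    genome_rate = genome_alignments / input_reads
    transcriptome_rate = 100 * transcriptome_alignments / input_reads
    return genome_rate, transcriptome_rate


# Values for sample P2, copied from Table 4.
p2_genome_rate, p2_transcriptome_rate = alignment_rates(
    32_414_933, 63_581_583, 24_063_737
)
print(f"P2: {p2_genome_rate:.2f} alignments per read, "
      f"{p2_transcriptome_rate:.2f}% transcriptome alignment rate")
```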

4. Genome coverage
The following table reports the overall and effective genome coverage in each sample. The Total nt column reports the total number of nucleotides sequenced, i.e. the number of aligned reads times the length of each read. Coverage is this number divided by the size of the genome. Effective bp reports the number of bases in the genome having coverage greater than 5, and the Effective Perc column shows what percentage this is of the genome size. Note that, especially in the case of RNA-seq, the effective genome size may be much smaller than the full size. Eff Coverage is the average coverage over the effectively covered fraction of the genome.

Name	Total nt	Coverage	Effective bp	Effective Perc	Eff Coverage
P2	42,068,204,700	15.46	415,993,891	15.30%	101.13
P3	45,685,614,712	16.79	430,722,891	15.80%	106.07
P4	50,386,959,604	18.52	432,109,209	15.90%	116.61
K3	58,706,137,150	21.57	452,606,301	16.60%	129.71
K4	57,789,012,696	21.23	443,156,488	16.30%	130.40
K5	59,880,140,992	22.00	447,318,413	16.40%	133.86
S47	47,466,613,290	17.45	422,294,235	15.50%	112.40
S53	56,097,156,087	20.61	434,528,658	16.00%	129.10
S65	38,704,866,283	14.23	414,110,886	15.20%	93.46
S1	39,831,336,928	14.64	411,031,463	15.10%	96.91
S11	50,613,812,686	18.59	435,039,650	16.00%	116.34
S66	50,003,002,095	18.38	436,380,353	16.00%	114.59

Table 5. Genome coverage by sample.
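The columns of Table 5 can be related to each other with a few divisions. The Python sketch below assumes that Eff Coverage is Total nt divided by Effective bp, which agrees with the tabulated values. The genome size is also an assumption: it is back-calculated from Total nt / Coverage in Table 5, because the exact value used by the pipeline is not stated in this report.

```python
# ASSUMPTION: genome size inferred from Table 5 (Total nt / Coverage is
# roughly 2.721e9); the pipeline's actual value is not given in the report.
GENOME_SIZE = 2_721_000_000


def coverage_stats(total_nt, effective_bp, genome_size=GENOME_SIZE):
    """Derive the coverage columns from Total nt and Effective bp."""
    return {
        # Average depth over the whole genome.
        "coverage": total_nt / genome_size,
        # Percentage of the genome counted as effectively covered.
        "effective_perc": 100 * effective_bp / genome_size,
        # Average depth over the effectively covered bases only.
        "eff_coverage": total_nt / effective_bp if effective_bp else 0.0,
    }


# Values for sample P2, copied from Table 5.
p2_cov = coverage_stats(42_068_204_700, 415_993_891)
print({key: round(value, 2) for key, value in p2_cov.items()})
```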

The following table reports the overall and effective genome coverage in each condition.

Name	Total nt	Coverage	Effective bp	Effective Perc	Eff Coverage
Parental	133,977,263,444	49.21	494,669,152	18.20%	270.84
KMT2C_KO	0	0.00	0	0.00%	0.00
DNMT3A_KO	138,164,129,210	50.75	493,208,575	18.10%	280.13
KMT2C_DNMT3A_KO	136,135,946,426	49.97	507,187,828	18.60%	268.41

Table 6. Genome coverage by condition

File: GE7334.sample.cov.xlsx
Size: 43.31 kB
Description: Per-chromosome coverage data, by sample.

File: GE7334.cond.cov.xlsx
Size: 16.07 kB
Description: Per-chromosome coverage data, by condition.

5. Expression analysis - quantification
Gene and transcript expression values were quantified using RSEM v1.3.1. The following files contain the raw FPKM values for all genes/transcripts in all samples. NOTE: these values are not yet normalized. Please apply the appropriate normalization before using them in analysis.

File: genes.rawmatrix.csv
Size: 3.93 MB
Description: Matrix of FPKM values for all genes in all samples.

File: transcripts.rawmatrix.csv
Size: 10.38 MB
Description: Matrix of FPKM values for all transcripts in all samples.

File: genes.xpra.txt
Size: 3.17 MB
Description: Counts table suitable for ExpressAnalyst.
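Which normalization is appropriate depends on the downstream analysis. As one common option, FPKM values can be rescaled to TPM so that every sample sums to one million. The Python sketch below shows that rescaling on made-up values for a single sample; it does not read the files listed above, whose column layout is not described in this report.

```python
def fpkm_to_tpm(fpkm_values):
    """Rescale one sample's FPKM values so that they sum to 1,000,000 (TPM)."""
    total = sum(fpkm_values)
    if total == 0:
        return [0.0 for _ in fpkm_values]
    return [1_000_000 * value / total for value in fpkm_values]


# Made-up FPKM values for three hypothetical genes in one sample.
toy_fpkm = [10.0, 30.0, 60.0]
toy_tpm = fpkm_to_tpm(toy_fpkm)
print(toy_tpm)
```

The rescaling is applied to each sample (column) separately, so that values become comparable across samples.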

The following scatterplots show the level of similarity between replicates of the same condition.

Principal Component Analysis on raw (un-normalized) expression data.

(png format, 83.93 kB)

Multi-Dimensional Scaling (MDS) plot for the raw (un-normalized) expression data.

6. Differential expression - protein-coding genes
Differential gene expression was analyzed using DESeq2. The following table reports the number of differentially expressed genes in each contrast with abs(log2(FC)) >= 1.0 and FDR-corrected P-value <= 0.05. The files under the Table heading contain the log2(FC) and P-value of all significant genes, while the files under the Expressions heading contain normalized expression values for the significant genes in all replicates of the two conditions being compared. The lists of differentially expressed genes for all contrasts can also be downloaded as a single Excel file using the link below.

Test	Control	Total	Overexpressed	Underexpressed	Table	Expressions
KMT2C_KO	Parental	14	1	13	KMT2C_KO.vs.Parental.codinggeneDiff.csv	KMT2C_KO.vs.Parental.gmatrix.csv
DNMT3A_KO	Parental	3	0	3	DNMT3A_KO.vs.Parental.codinggeneDiff.csv	DNMT3A_KO.vs.Parental.gmatrix.csv
KMT2C_DNMT3A_KO	Parental	1,030	301	729	KMT2C_DNMT3A_KO.vs.Parental.codinggeneDiff.csv	KMT2C_DNMT3A_KO.vs.Parental.gmatrix.csv
KMT2C_DNMT3A_KO	KMT2C_KO	987	326	661	KMT2C_DNMT3A_KO.vs.KMT2C_KO.codinggeneDiff.csv	KMT2C_DNMT3A_KO.vs.KMT2C_KO.gmatrix.csv
KMT2C_DNMT3A_KO	DNMT3A_KO	964	308	656	KMT2C_DNMT3A_KO.vs.DNMT3A_KO.codinggeneDiff.csv	KMT2C_DNMT3A_KO.vs.DNMT3A_KO.gmatrix.csv

Table 7. Results of gene-level differential expression analysis.
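The counts in Table 7 come from applying two thresholds to each contrast: abs(log2(FC)) >= 1.0 (at least a 2-fold change) and FDR-corrected P-value <= 0.05. The Python sketch below applies the same thresholds to a small made-up table. The column names log2FoldChange and padj follow the usual DESeq2 result names and are an assumption here; the columns in the .csv files above may be named differently.

```python
import csv
import io

LOG2FC_MIN = 1.0  # abs(log2(FC)) >= 1.0, i.e. at least a 2-fold change
FDR_MAX = 0.05    # FDR-corrected P-value <= 0.05


def count_significant(rows, lfc_col="log2FoldChange", fdr_col="padj"):
    """Count over- and underexpressed genes that pass both thresholds."""
    over = under = 0
    for row in rows:
        try:
            lfc = float(row[lfc_col])
            fdr = float(row[fdr_col])
        except (TypeError, ValueError):
            continue  # skip rows with missing or non-numeric values such as "NA"
        if fdr <= FDR_MAX and abs(lfc) >= LOG2FC_MIN:
            if lfc > 0:
                over += 1
            else:
                under += 1
    return {"total": over + under, "overexpressed": over, "underexpressed": under}


# Made-up rows for illustration only; these are not results from this analysis.
toy_csv = """gene,log2FoldChange,padj
geneA,2.5,0.001
geneB,-1.2,0.04
geneC,0.4,0.001
geneD,-3.0,0.20
geneE,1.0,NA
"""
toy_counts = count_significant(csv.DictReader(io.StringIO(toy_csv)))
print(toy_counts)
```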

File: GE7334-codingdiff.xlsx
Size: 222.64 kB
Description: Excel file containing differentially expressed genes for all contrasts (one sheet per contrast). Only includes protein-coding genes.

File: GE7334-allcodingdiff.xlsx
Size: 3.65 MB
Description: Excel file containing differential expression values for all tested genes in all contrasts (one sheet per contrast). Only includes protein-coding genes. Note: genes with very low average expression in all conditions were removed.

File: GE7334.g.deseq2norm.xlsx
Size: 2.18 MB
Description: Excel file containing normalized (DESeq2) expression values for all protein-coding genes in all conditions. Note: genes with very low average expression in all conditions were removed.

Principal Component Analysis on normalized expression data.

(png format, 85.59 kB)

The following image displays the Multi-Dimensional Scaling (MDS) plot for the normalized expression data. In this plot, relative distances between samples reflect the similarity of their gene expression profiles. Ideally, replicates of the same condition should be close together, and well separated from other conditions.

Volcano plots for all contrasts.

7. Differential expression - all genes
The following table reports results from the same differential analysis as above, but includes all biotypes instead of coding genes only.

Test	Control	Total	Overexpressed	Underexpressed	Table	Expressions
KMT2C_KO	Parental	14	1	13	KMT2C_KO.vs.Parental.geneDiff.csv	KMT2C_KO.vs.Parental.gmatrix.csv
DNMT3A_KO	Parental	3	0	3	DNMT3A_KO.vs.Parental.geneDiff.csv	DNMT3A_KO.vs.Parental.gmatrix.csv
KMT2C_DNMT3A_KO	Parental	1,159	360	799	KMT2C_DNMT3A_KO.vs.Parental.geneDiff.csv	KMT2C_DNMT3A_KO.vs.Parental.gmatrix.csv
KMT2C_DNMT3A_KO	KMT2C_KO	1,102	375	727	KMT2C_DNMT3A_KO.vs.KMT2C_KO.geneDiff.csv	KMT2C_DNMT3A_KO.vs.KMT2C_KO.gmatrix.csv
KMT2C_DNMT3A_KO	DNMT3A_KO	1,057	360	697	KMT2C_DNMT3A_KO.vs.DNMT3A_KO.geneDiff.csv	KMT2C_DNMT3A_KO.vs.DNMT3A_KO.gmatrix.csv

Table 8. Results of gene-level differential expression analysis (all biotypes).

File: GE7334-genediff.xlsx
Size: 250.23 kB
Description: Excel file containing differentially expressed genes for all contrasts (one sheet per contrast). Includes all genes and pseudo-genes.

File: GE7334-allgenediff.xlsx
Size: 4.24 MB
Description: Excel file containing differential expression values for all genes in all contrasts (one sheet per contrast). Includes all genes and pseudo-genes.

File: GE7334-allExpressions.xlsx
Size: 2.54 MB
Description: Excel file containing normalized (RSEM) expression values for all genes in all conditions.

8. Differential expression - isoform level
The following table reports the number of differentially expressed isoforms in each contrast with abs(log2(FC)) >= 1.0 and FDR-corrected P-value <= 0.05. The lists of differentially expressed isoforms for all contrasts can also be downloaded as a single Excel file using the link below.

Test	Control	Tot isoforms	Overexpressed	Underexpressed	Table	Expressions
KMT2C_KO	Parental	67	33	34	KMT2C_KO.vs.Parental.isoDiff.csv	KMT2C_KO.vs.Parental.imatrix.csv
DNMT3A_KO	Parental	79	33	46	DNMT3A_KO.vs.Parental.isoDiff.csv	DNMT3A_KO.vs.Parental.imatrix.csv
KMT2C_DNMT3A_KO	Parental	1,786	609	1,177	KMT2C_DNMT3A_KO.vs.Parental.isoDiff.csv	KMT2C_DNMT3A_KO.vs.Parental.imatrix.csv
KMT2C_DNMT3A_KO	KMT2C_KO	1,681	602	1,079	KMT2C_DNMT3A_KO.vs.KMT2C_KO.isoDiff.csv	KMT2C_DNMT3A_KO.vs.KMT2C_KO.imatrix.csv
KMT2C_DNMT3A_KO	DNMT3A_KO	1,607	563	1,044	KMT2C_DNMT3A_KO.vs.DNMT3A_KO.isoDiff.csv	KMT2C_DNMT3A_KO.vs.DNMT3A_KO.imatrix.csv

Table 9. Results of isoform-level differential expression analysis.

File: GE7334-isodiff.xlsx
Size: 421.92 kB
Description: Excel file containing differentially expressed isoforms for all contrasts (one sheet per contrast).

File: GE7334-allisodiff.xlsx
Size: 10.32 MB
Description: Excel file containing differential expression values for all isoforms in all contrasts (one sheet per contrast).

9. Differential expression - combined files
The following file contains merged differential expression data. The first sheet contains fold changes for all genes that were found to be differentially expressed in at least one contrast. The second and third sheets contain the same information for coding genes only and for all transcripts, respectively.

File: GE7334-merged.allDiff.xlsx
Size: 460.87 kB
Description: Merged fold changes for all differentially expressed genes, coding genes, and transcripts respectively.
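Conceptually, a merged sheet of this kind is the union of the per-contrast lists, with one row per gene and one column per contrast. The Python sketch below illustrates that idea on made-up data. The contrast and gene names are hypothetical, and filling non-significant contrasts with None is an assumption; this report does not say how the merged file handles those cells.

```python
def merge_fold_changes(significant_by_contrast):
    """Combine per-contrast {gene: log2FC} maps into one row per gene.

    A gene is included if it appears in at least one contrast; contrasts
    in which it does not appear are filled with None.
    """
    contrasts = list(significant_by_contrast)
    genes = sorted(
        {gene for table in significant_by_contrast.values() for gene in table}
    )
    return {
        gene: [significant_by_contrast[c].get(gene) for c in contrasts]
        for gene in genes
    }


# Made-up fold changes for two hypothetical contrasts.
toy_contrasts = {
    "A.vs.B": {"gene1": 1.5, "gene2": -2.0},
    "C.vs.B": {"gene2": -1.1, "gene3": 3.2},
}
toy_merged = merge_fold_changes(toy_contrasts)
for gene, fold_changes in toy_merged.items():
    print(gene, fold_changes)
```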

10. MultiQC report
MultiQC is a general Quality Control tool for a large number of bioinformatics pipelines. The report on this analysis (generated using MultiQC version 1.12) is available here:

MultiQC report

11. UCSC hub

UCSC Genome Browser: use the previous link to display the data tracks automatically, or copy the URL https://bw:bw@data.rc.ufl.edu/secure/icbr/GE7334//GE7334/hub/hub.txt and paste it into the "My Hubs" form on this page.

WashU EpiGenome Browser: use the previous link to display the data tracks automatically, or copy the following URL into the "Datahub by URL Link" field: https://bw:bw@data.rc.ufl.edu/secure/icbr/GE7334//GE7334/hub/hub.json.

12. Methods summary

Trimming and QC on short reads were performed by fastp (v 0.23.4) [1].

The reads were aligned to the transcriptome using STAR version 2.7.11b [2].

Transcript abundance was quantified using RSEM (v1.3.1) [3].

Differential expression analysis was performed using DESeq2 [4], with an FDR-corrected P-value threshold of 0.05. The output files were further filtered to extract transcripts showing at least a 2.0-fold change in either direction (abs(log2(FC)) >= 1.0). Results were reported for protein-coding genes only, and for all transcript types.

References

[1] Chen S, Zhou Y, Chen Y, Gu J (2018). fastp: an ultra-fast all-in-one FASTQ preprocessor. Bioinformatics 34(17):i884-i890 | doi: 10.1093/bioinformatics/bty560
[2] Dobin A, Davis CA, Schlesinger F, Drenkow J, Zaleski C, Jha S, Batut P, Chaisson M, Gingeras TR (2013). STAR: ultrafast universal RNA-seq aligner. Bioinformatics 29(1):15-21 | doi: 10.1093/bioinformatics/bts635
[3] Li B, Dewey CN (2011). RSEM: accurate transcript quantification from RNA-Seq data with or without a reference genome. BMC Bioinformatics 12:323 | doi: 10.1186/1471-2105-12-323
[4] Love MI, Huber W, Anders S (2014). Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biology 15:550 | doi: 10.1186/s13059-014-0550-8

Completed: 4-1-2024@11:50