This project aims to discover biomarkers and predict colorectal cancer (CRC) by studying the gene expression of 59 CRC patients and 62 healthy controls (HCs) from the Gene Expression Omnibus (GEO) database, accession GSE164191. Two data objects are essential: a gene expression matrix (54,675 \(\times\) 121) and a phenotype table (121 \(\times\) 4). Duplicated Probe IDs are collapsed by keeping the first occurrence, and duplicated Entrez IDs are collapsed by keeping the gene with the lowest adjusted p-value. Four types of analysis are conducted, as follows:
Differential Expression (DE) Analysis: I identify the genes that are significantly differentially expressed (DEGs) between the CRC and HC groups, with age and gender as covariates, using a 0.05 adjusted p-value threshold. As MAS5 normalization was already performed by the original paper ("Immune-related gene expression signatures in colorectal cancer"), preprocessing only includes ID conversion and filtering of low-expression counts. A DGEList object is created from the expression data, and a linear model is then fitted with the limma package.
Gene Ontology Enrichment Analysis (GOEA): I identify the top enriched Gene Ontology terms and plot the GO hierarchy. A topGOdata object is created to perform the GO enrichment test.
Gene Set Enrichment Analysis (GSEA): I identify pathways that are significantly enriched against the KEGG pathway database with a 0.05 p-value threshold. The gseKEGG function from clusterProfiler is used to perform GSEA, and the pathview library is used to plot two enriched pathways: hsa03010 (Ribosome) and hsa03040 (Spliceosome).
CRC Prediction: I build and tune models that classify CRC patients versus HCs based on the expression of the top 100 DEGs, using the caret library with cross-validation. The models include linear discriminant analysis (LDA), generalized linear models (GLM), support vector machines (SVM), neural networks (NN), decision trees (DT), and gradient boosting machines (GBM).
Through the above analyses, significant differences in gene expression are found between CRC patients and HCs, and these genes could serve as biomarkers. The prediction models show high accuracy on this small dataset, but their real-world performance still needs to be validated on larger datasets.
Data preprocessing is performed before the analyses. Key steps include converting Probe IDs of the Affymetrix HG-U133_Plus_2 platform into Entrez IDs and filtering out NA and low-expression data. As MAS5 normalization was already performed by the original paper, the counts are directly fitted with a linear model using limma. A gene expression mean-variance (dispersion) plot, a p-value histogram, and an MA plot are produced. The objective is to identify genes that are significantly differentially expressed between CRC patients and controls.
library(magrittr)
library(dplyr)
library(data.table)
library(ggplot2)
library(Biobase)
library(org.Hs.eg.db)
library(topGO)
library(clusterProfiler)
library(enrichplot)
library(limma)
library(edgeR)
library(DOSE)
library(pathview)
library(GEOquery)
Data Preparation:
The series matrix text file contains gene expression profiles of 59 colorectal cancer (CRC) patients and 62 healthy controls (HCs).
gse = getGEO("GSE164191")
# length(gse)
meta = pData(phenoData(gse[[1]]))
names(meta)
## [1] "title" "geo_accession"
## [3] "status" "submission_date"
## [5] "last_update_date" "type"
## [7] "channel_count" "source_name_ch1"
## [9] "organism_ch1" "characteristics_ch1"
## [11] "characteristics_ch1.1" "characteristics_ch1.2"
## [13] "characteristics_ch1.3" "treatment_protocol_ch1"
## [15] "growth_protocol_ch1" "molecule_ch1"
## [17] "extract_protocol_ch1" "label_ch1"
## [19] "label_protocol_ch1" "taxid_ch1"
## [21] "hyb_protocol" "scan_protocol"
## [23] "description" "data_processing"
## [25] "platform_id" "contact_name"
## [27] "contact_email" "contact_institute"
## [29] "contact_address" "contact_city"
## [31] "contact_zip/postal_code" "contact_country"
## [33] "supplementary_file" "supplementary_file.1"
## [35] "data_row_count" "age:ch1"
## [37] "disease status:ch1" "gender:ch1"
## [39] "tissue:ch1"
#### reading count matrix
library(data.table)
ori_data = fread("GSE164191_series_matrix.txt", fill = TRUE)
# ori_data %>% head
counts = ori_data[64:54738,]
samples = ori_data[63, -1]
countsNames <- counts$V1
rownames(counts) <- countsNames
counts[,1] = NULL
colnames(counts) <- as.character(samples)
# counts %>% head
#### reading phenotypic data
pheno = subset(meta, select = c("title", "age:ch1", "disease status:ch1", "gender:ch1"))
colnames(pheno) <- c("title", "age", "status", "gender")
# pheno %>% head
#### Checking and mapping counts and phenotypic data
phenoNames = rownames(pheno)
missingNames <- setdiff(colnames(counts), phenoNames) # elements of x not in y
# missingNames
counts1 <- counts[, !c(..missingNames)]
rownames(counts1) <- countsNames
pheno1 <- pheno
# all(colnames(counts1) == rownames(pheno1))
The original identifiers in the GSE164191 dataset are Probe IDs of the Affymetrix HG-U133_Plus_2 platform. Probe IDs are converted to Entrez IDs. Duplicated Probe IDs are removed first, keeping the first mapped Entrez ID:
library(hgu133plus2.db)
# find the mapping table between Probe IDs and Entrez IDs
id_mapping = AnnotationDbi::select(hgu133plus2.db, countsNames, "ENTREZID")
# filter out the NA rows
id_mapping = na.omit(id_mapping)
# remove the duplicated probe IDs
id_mapping = id_mapping[!duplicated(id_mapping$PROBEID),]
dim(id_mapping)
## [1] 44662 2
44,662 probes with a unique Probe ID and a matched Entrez ID remain.
Convert all values in counts to a numeric matrix, add a column of Probe IDs, and then find the matching Entrez IDs by merging the counts and ID mapping tables:
counts_matrix <- matrix(as.numeric(unlist(counts1)), ncol = length(counts1), byrow = FALSE)
PROBEID <- countsNames
counts_matrix <- cbind(counts_matrix, PROBEID)
counts_matrix <- merge(counts_matrix, id_mapping, by = "PROBEID")
dim(counts_matrix)
## [1] 44662 123
Separate the Probe IDs, Entrez IDs, and count matrix into separate variables:
probe_ids = counts_matrix[,1]
entrez_ids = counts_matrix[,dim(counts_matrix)[2]]
counts_matrix = counts_matrix[,-c(1,dim(counts_matrix)[2])]
counts_matrix <- matrix(as.numeric(unlist(counts_matrix)), ncol = length(counts1), byrow = FALSE)
Filter out genes with very low counts across all libraries. The threshold is set as 10 here.
keep <- rowSums(counts_matrix) >= 10
counts_matrix <- counts_matrix[keep,]
probe_ids <- probe_ids[keep]
entrez_ids <- entrez_ids[keep]
dim(counts_matrix)
## [1] 44662 121
No gene is found below the threshold.
Create DGEList object with group labels (normal or cancer):
group = pheno1$status
dge = DGEList(counts_matrix,group = group)
Because the data are already MAS5-normalized, no additional normalization is needed; calcNormFactors() is still called so that the DGEList carries normalization factors:
d <- calcNormFactors(dge)
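As a quick sanity check (a minimal sketch; the exact values depend on the data), the normalization factors stored in the DGEList can be inspected. For data that are already normalized they should sit close to 1:
# Inspect the TMM normalization factors computed by calcNormFactors();
# values near 1 indicate the libraries are already comparable.
summary(d$samples$norm.factors)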
Create the design matrix with covariates (age and gender); group (i.e., cancer status) is the main variable of interest, and the intercept is kept:
design <- model.matrix(~ age + gender + group, data=pheno1)
Plot the voom mean-variance trend of gene expression (with normalize.method = "none", since the data are already normalized):
v = voom(counts = d, design = design, plot = T, normalize.method = "none")
Fit a linear model:
fit <- lmFit(v, design)
fit <- eBayes(fit, trend = TRUE)
Plot the p-value histogram:
res = data.frame(fit)
thehist = hist(res$p.value.groupnormal, breaks = 50, plot=FALSE)
plot(thehist)
MA plot (log-fold changes versus average log-expression):
# Mean-difference plot
limma::plotMA(fit)
Sort genes by adjusted p-value in ascending order, then remove duplicated Entrez IDs, keeping the gene with the lowest adjusted p-value. The table contains 21,367 genes after this filtering.
top <- topTable(fit, coef="groupnormal", n=Inf, adjust="BH")
top <- top[order(top$adj.P.Val), ]
# for rows with duplicated Entrez IDs, keep the first row, i.e., the one with the lowest adjusted p-value
top$ENTREZID = entrez_ids[as.numeric(rownames(top))]
top <- top[!duplicated(top$ENTREZID),]
dim(top)
## [1] 21367 7
List the top 25 significant DEGs as follows:
top_25 <- head(top,25)
top_100 <- head(top,100)
top_25
## logFC AveExpr t P.Value adj.P.Val B
## 16233 -0.7215613 3.133655 -7.864248 9.397586e-12 4.197150e-07 16.307344
## 11216 -1.1283695 5.670882 -6.972724 5.763052e-10 1.286947e-05 12.484455
## 7814 0.6771076 5.668244 6.874449 9.012231e-10 1.341681e-05 12.055612
## 20520 -0.8947702 5.863532 -6.483536 5.245020e-09 5.719367e-05 10.373200
## 31088 0.9440105 5.560535 6.425712 6.788493e-09 5.719367e-05 10.127117
## 3050 0.6838653 4.037530 6.363209 8.964123e-09 5.719367e-05 9.857513
## 33070 -0.5475620 2.553364 -6.368487 8.756426e-09 5.719367e-05 9.811460
## 29861 0.5688781 3.527407 6.277456 1.310795e-08 6.017266e-05 9.487427
## 38071 -0.5615244 6.258030 -6.272107 1.342161e-08 6.017266e-05 9.474808
## 15405 1.1959499 5.674375 6.263510 1.394127e-08 6.017266e-05 9.440533
## 42910 0.6874490 3.876167 6.232676 1.597408e-08 6.017266e-05 9.308093
## 31397 0.7300965 4.488013 6.229948 1.616748e-08 6.017266e-05 9.300147
## 6971 -0.8858330 1.075623 -6.151671 2.281616e-08 7.838579e-05 8.550084
## 27792 0.5435575 4.906145 6.128744 2.523109e-08 8.049078e-05 8.876473
## 41499 0.9717034 2.546625 6.096458 2.906481e-08 8.384360e-05 8.637724
## 15465 0.8040916 5.250491 6.062001 3.379125e-08 8.384360e-05 8.597869
## 25965 1.2712551 2.465411 6.071861 3.236634e-08 8.384360e-05 8.442564
## 35173 0.7964616 1.476944 6.081404 3.104368e-08 8.384360e-05 8.231666
## 20452 0.6205344 4.432372 6.015568 4.137835e-08 8.519548e-05 8.407555
## 18476 0.7273415 4.821288 6.009019 4.257555e-08 8.519548e-05 8.379265
## 9092 0.6632160 4.581803 5.999442 4.438815e-08 8.519548e-05 8.340334
## 20074 0.8013458 5.669264 5.985363 4.719171e-08 8.519548e-05 8.277918
## 27479 0.5915544 4.540757 5.975709 4.921425e-08 8.519548e-05 8.242097
## 19363 0.6312993 4.568853 5.973063 4.978343e-08 8.519548e-05 8.231433
## 28106 0.6279386 5.925042 5.965240 5.150414e-08 8.519548e-05 8.193485
## ENTREZID
## 16233 1174
## 11216 5534
## 7814 9774
## 20520 2280
## 31088 29068
## 3050 55692
## 33070 400236
## 29861 1783
## 38071 54815
## 15405 23015
## 42910 51747
## 31397 54800
## 6971 286135
## 27792 58508
## 41499 7562
## 15465 546
## 25965 55671
## 35173 9321
## 20452 8930
## 18476 58517
## 9092 4820
## 20074 6432
## 27479 4666
## 19363 5978
## 28106 79101
Top 25 genes’ ENTREZ IDs:
top25_entrez_ids = entrez_ids[as.numeric(rownames(top_25))]
top25_entrez_ids
## [1] "1174" "5534" "9774" "2280" "29068" "55692" "400236" "1783"
## [9] "54815" "23015" "51747" "54800" "286135" "58508" "7562" "546"
## [17] "55671" "9321" "8930" "58517" "4820" "6432" "4666" "5978"
## [25] "79101"
Top 100 genes’ ENTREZ IDs:
top100_entrez_ids = entrez_ids[as.numeric(rownames(top_100))]
top100_entrez_ids
## [1] "1174" "5534" "9774" "2280" "29068" "55692" "400236" "1783"
## [9] "54815" "23015" "51747" "54800" "286135" "58508" "7562" "546"
## [17] "55671" "9321" "8930" "58517" "4820" "6432" "4666" "5978"
## [25] "79101" "54799" "57513" "130872" "23137" "3181" "283337" "83941"
## [33] "3709" "26039" "3835" "6526" "80169" "351" "26508" "55339"
## [41] "7456" "4603" "124411" "10142" "8899" "84124" "7919" "55904"
## [49] "9698" "90624" "27327" "9252" "285527" "57223" "25831" "54906"
## [57] "196441" "28951" "55102" "157680" "9101" "3005" "9736" "730092"
## [65] "51107" "10209" "79646" "56339" "10181" "1654" "22889" "11198"
## [73] "8473" "163" "1727" "257415" "2186" "139322" "10147" "54737"
## [81] "7072" "1763" "200576" "6651" "5775" "256471" "6134" "201627"
## [89] "5478" "6642" "667" "85028" "55670" "54780" "284273" "220988"
## [97] "23215" "6774" "714" "9987"
Set an adjusted p-value threshold of 0.05 and take the subset of genes whose adjusted p-values fall below it.
subset_genes = top[top$adj.P.Val<0.05,]
head(subset_genes)
## logFC AveExpr t P.Value adj.P.Val B
## 16233 -0.7215613 3.133655 -7.864248 9.397586e-12 4.197150e-07 16.307344
## 11216 -1.1283695 5.670882 -6.972724 5.763052e-10 1.286947e-05 12.484455
## 7814 0.6771076 5.668244 6.874449 9.012231e-10 1.341681e-05 12.055612
## 20520 -0.8947702 5.863532 -6.483536 5.245020e-09 5.719367e-05 10.373200
## 31088 0.9440105 5.560535 6.425712 6.788493e-09 5.719367e-05 10.127117
## 3050 0.6838653 4.037530 6.363209 8.964123e-09 5.719367e-05 9.857513
## ENTREZID
## 16233 1174
## 11216 5534
## 7814 9774
## 20520 2280
## 31088 29068
## 3050 55692
deg_results = top
# Take negative log10 of adjusted p-values
deg_results$neg_log10_pval <- -log10(deg_results$adj.P.Val)
# Generate volcano plot
plot(deg_results$logFC, deg_results$neg_log10_pval,
     pch=20, main="Volcano Plot", xlab="log2 Fold Change", ylab="-log10 adjusted p-value")
# Highlight up-regulated genes in red and down-regulated genes in blue
up_genes <- deg_results$logFC > 0 & deg_results$adj.P.Val < 0.05
down_genes <- deg_results$logFC < 0 & deg_results$adj.P.Val < 0.05
points(deg_results$logFC[up_genes], deg_results$neg_log10_pval[up_genes],
pch=20, col="red")
points(deg_results$logFC[down_genes], deg_results$neg_log10_pval[down_genes],
pch=20, col="blue")
GOEA aims to find the most enriched Gene Ontology terms. The human gene universe is loaded using OrganismDbi, and the genes of interest are those in the significant DEG subset from the previous analysis. Fisher's exact test is then run on the GOdata object to identify the top GO terms by score.
Load human gene data from database “org.Hs.eg.db”:
library(org.Hs.eg.db)
library(OrganismDbi)
columns(org.Hs.eg.db)
## [1] "ACCNUM" "ALIAS" "ENSEMBL" "ENSEMBLPROT" "ENSEMBLTRANS"
## [6] "ENTREZID" "ENZYME" "EVIDENCE" "EVIDENCEALL" "GENENAME"
## [11] "GENETYPE" "GO" "GOALL" "IPI" "MAP"
## [16] "OMIM" "ONTOLOGY" "ONTOLOGYALL" "PATH" "PFAM"
## [21] "PMID" "PROSITE" "REFSEQ" "SYMBOL" "UCSCKG"
## [26] "UNIPROT"
Number of genes in the significant DEG subset:
length(subset_genes[[1]])
## [1] 6816
Load human gene universe then choose genes from that:
library(topGO)
humanGeneUniverse <- as.character(unique(OrganismDbi::select(org.Hs.eg.db, keys=keys(org.Hs.eg.db), column="ENTREZID")$ENTREZID))
head(humanGeneUniverse)
## [1] "1" "2" "3" "9" "10" "11"
myGeneList = entrez_ids[as.numeric(rownames(subset_genes))]
head(myGeneList)
## [1] "1174" "5534" "9774" "2280" "29068" "55692"
geneList <- factor(as.integer(humanGeneUniverse %in% myGeneList))
names(geneList) <- humanGeneUniverse
library(magrittr)
geneList %>% head
## 1 2 3 9 10 11
## 0 0 0 1 0 0
## Levels: 0 1
Create a topGOdata class object required by topGO package in order to perform GO enrichment test:
GOdata <- new("topGOdata",
ontology="BP",
allGenes=geneList,
nodeSize=5, # the minimum number of genes for each GO term.
annotationFun=annFUN.org,
mapping="org.Hs.eg.db",
ID="entrez")
Run Fisher's exact test (classic algorithm) over the GO terms annotated to the gene list, then list the top 50 terms:
GOresult <- runTest(GOdata, algorithm = "classic", statistic = "fisher")
##
## -- Classic Algorithm --
##
## the algorithm is scoring 9280 nontrivial nodes
## parameters:
## test statistic: fisher
resultTable <- GenTable(GOdata, GOresult, topNodes = 50, numChar=200)
resultTable[1:10,]
## GO.ID
## 1 GO:0031323
## 2 GO:0051171
## 3 GO:0080090
## 4 GO:0019219
## 5 GO:0046907
## 6 GO:0045935
## 7 GO:0044260
## 8 GO:0044237
## 9 GO:0071840
## 10 GO:0051173
## Term
## 1 regulation of cellular metabolic process
## 2 regulation of nitrogen compound metabolic process
## 3 regulation of primary metabolic process
## 4 regulation of nucleobase-containing compound metabolic process
## 5 intracellular transport
## 6 positive regulation of nucleobase-containing compound metabolic process
## 7 cellular macromolecule metabolic process
## 8 cellular metabolic process
## 9 cellular component organization or biogenesis
## 10 positive regulation of nitrogen compound metabolic process
## Annotated Significant Expected result1
## 1 5737 2169 1744.80 < 1e-30
## 2 5792 2161 1761.53 < 1e-30
## 3 5954 2213 1810.80 < 1e-30
## 4 4076 1598 1239.64 < 1e-30
## 5 1587 719 482.66 < 1e-30
## 6 2052 888 624.08 < 1e-30
## 7 3266 1307 993.29 < 1e-30
## 8 10271 3525 3123.74 < 1e-30
## 9 6601 2389 2007.57 < 1e-30
## 10 3175 1268 965.62 < 1e-30
Plot GO hierarchy formed by top 50 GO terms:
showSigOfNodes(GOdata, score(GOresult), firstSigNodes = 50, useInfo = 'all')
## $dag
## A graphNEL graph with directed edges
## Number of Nodes = 79
## Number of Edges = 171
##
## $complete.dag
## [1] "A graph with 79 nodes."
GSEA identifies KEGG pathways (organism code hsa, i.e., human) that are significantly enriched based on the ranked gene list from the DE analysis. The gseKEGG function is used to perform the KEGG pathway enrichment analysis.
Prepare the input for GSEA from the DE analysis results: a log fold-change vector named by Entrez IDs.
df = topTable(fit, coef="groupnormal", n=Inf, adjust="BH")
kegg_gene_list <- df$logFC
names(kegg_gene_list) <- entrez_ids[as.numeric(rownames(df))]
# omit any NA values
kegg_gene_list<-na.omit(kegg_gene_list)
# sort the list in decreasing order (required for clusterProfiler)
kegg_gene_list = sort(kegg_gene_list, decreasing = TRUE)
Create gseKEGG object:
library(R.utils)
library(clusterProfiler)
R.utils::setOption("clusterProfiler.download.method",'wininet')
kegg_organism = "hsa"
kk2 <- gseKEGG(geneList = kegg_gene_list,
organism = kegg_organism,
nPerm = 10000,
minGSSize = 3,
maxGSSize = 800,
pvalueCutoff = 0.05,
pAdjustMethod = "BH",
keyType = "ncbi-geneid")
Check top 5 pathways:
head(kk2, 5)
## ID Description setSize enrichmentScore
## hsa03010 hsa03010 Ribosome 129 0.6300956
## hsa03040 hsa03040 Spliceosome 128 0.6200938
## hsa03013 hsa03013 Nucleocytoplasmic transport 104 0.6330755
## hsa03008 hsa03008 Ribosome biogenesis in eukaryotes 76 0.6365703
## hsa04120 hsa04120 Ubiquitin mediated proteolysis 141 0.5519044
## NES pvalue p.adjust qvalue rank
## hsa03010 3.157399 0.0003918495 0.001913969 0.0008241973 12248
## hsa03040 3.099624 0.0003954132 0.001913969 0.0008241973 10413
## hsa03013 3.058391 0.0003658983 0.001913969 0.0008241973 14097
## hsa03008 2.912703 0.0003252033 0.001913969 0.0008241973 11821
## hsa04120 2.806603 0.0004105090 0.001913969 0.0008241973 12538
## leading_edge
## hsa03010 tags=56%, list=27%, signal=41%
## hsa03040 tags=49%, list=23%, signal=38%
## hsa03013 tags=57%, list=32%, signal=39%
## hsa03008 tags=52%, list=26%, signal=39%
## hsa04120 tags=42%, list=28%, signal=30%
## core_enrichment
## hsa03010 6201/6164/51187/51187/6139/6139/6201/6139/6168/6189/9349/6205/6229/6133/6160/6134/6125/6223/9045/6229/65008/200916/6169/65003/51318/11224/6189/55052/6173/6218/6218/6157/51065/6189/64983/6233/6218/51318/51318/6227/6165/6165/6129/6129/9801/6227/51121/6135/6125/51023/6122/6222/28998/9045/6168/6167/6128/6125/54460/51023/55173/6155/11222/23521/6167/6160/6160/25873/6167/9801/6202/6138/63931/6124/6146/64981/64968/23521/6138/6175/6217/6138/64960/6210/6170/64965/6187/6152/6160/6183/51263/6194/6235/6124/6232/6129/6124/6125/6224/4736/6122/6204/6137/6223/6210/64969/6138/3921/6208/64960/6160/6165/51065/9045/6232/9045/6204/6234/29074/6204/6204/6217/6169/51263/6137/6165/6159/64960/6194/51263/29093/6152/6193/6122/6146/6228/51263/6130/6146/23521/6194/6132/6183/6146/6141/6203/6122/51081/6223/6230/51264/6130/51263/6233/6175/6181/6146
## hsa03040 9939/220988/9984/23350/6428/4809/6432/51362/6427/57461/7919/10772/3192/6426/10772/6432/27316/23658/58517/10772/23020/11325/3192/10915/6428/4686/8559/57187/5356/10772/84950/58517/55660/6427/6635/6430/6426/6430/9879/51645/55119/29896/51691/1655/23451/23350/1665/6427/27258/6430/58517/22916/57187/6434/58517/3183/51691/10929/2521/6428/6434/3178/51691/23451/23451/55110/6426/6426/10907/4116/6632/4670/6434/1665/220988/23451/10915/6432/55660/6635/6635/6426/22916/10569/3306/9939/3178/6434/6428/29896/6627/6636/10772/3178/55660/6637/29896/55660/3192/23350/6429/3183/10569/3183/1655/23020/2521/84991/84950/58517/23350/220988/6627/10929/3190/3183/11338/9879/84321/6431/51729/10285/4686/988/4670/10286/57187/8559/84950/84991/51690/6429/58517/3192/6633/51340/3183/3312/2521/6100/3178/6625/3192/10929/51639/10915/27316/8449/10084/11325/23658/55660/58517/23451/22938/10713/6627/5356/3312/9129/55660
## hsa03013 9939/3841/3841/51068/9984/5411/7175/65110/7919/56001/51068/3841/7175/5411/4686/84305/57187/55916/5901/5411/81929/3842/65110/5411/5903/23279/9818/22916/7175/57187/25909/65110/10527/51068/10762/5903/51808/5903/10921/10527/65109/55110/65110/348995/11260/56001/129401/4116/1434/55161/55916/55705/348995/7175/22916/9972/5903/57122/9939/7341/3842/3843/10762/59343/5901/51808/55746/3839/10527/9631/9818/10527/7341/55705/11097/3843/3839/53371/3843/55705/23165/84321/51194/26019/4686/3836/7329/57187/55308/25909/3836/11260/23279/10762/3838/10284/79023/5905/1915/7514/64328/10762/3838/81929/30000/387082/3836/79902/1434/57510/5903/8021/3843/10284/8480/1434/55706/55161/23225/3840/1434/10526/4928/55706/23636/10482/64901/7341/23636/23511/59343/10189/55161/1915/79023/7329/23039/4927/10189/3842/22985/55746/9939/10250/23279/6613/9688/9939/2733/23225/9939/7175/23511/57510/5905/23225/10762/23633/57510/10526
## hsa03008 1736/84135/51068/4931/4809/56001/51068/54552/55916/5901/55781/51119/102157402/25996/10556/54464/10556/29889/51068/55341/1736/84128/56001/134430/55341/55916/26354/55131/166378/54464/5901/10940/84135/51602/10556/10813/9724/134430/55272/10171/23195/29102/23195/55813/55127/51367/55127/51096/10199/25996/23560/7514/23160/55127/10528/51077/83732/1457/51077/81691/23560/54464/2091/84916/23560/51096/55781/28987/1736/27341/55127/4809/55127/54464/1459/4809/55341/10482/10885
## hsa04120 868/51366/330/26091/8450/55236/8925/8065/64750/8453/10277/4214/9063/55236/7337/55284/25898/55284/64750/11059/868/1161/92912/55958/6477/7328/7337/51366/51433/8451/26091/57448/29945/8924/6233/8450/29945/4193/8065/7321/6921/25898/25898/8454/8554/51366/57448/8925/119504/6502/55294/11059/55958/7337/26091/10054/7321/26272/6477/6477/9039/4591/4193/7324/83737/8925/7337/9690/26259/5071/330/10277/55236/51434/4193/7328/10393/331/9063/9616/51433/55236/9690/6502/64750/8450/51465/8697/7323/89910/8065/64682/9690/7328/9063/8452/119504/25898/7327/7334/55294/51465/55585/26091/7329/4193/8452/10055/63893/6921/868/26091/10075/7334/7334/11060/329/55585/10393/4214/10075/51529/8945/55284/8454/9021/9063/55120/119504/8881/9354/3093/83737/64326/996/55284/83737/29882/51433/8881/23221/4193/26232/7323/9063/10055/64326/9063/8450/7326/64682/26272/8451/7322/7327/55294/9320/9616/8697/6502/10277/8945/8451/6233/672/7322/23759/51434/7329/7319/10401/4193/22954/6500/7320/64682/64750/4591
Dotplot:
enrichplot::dotplot(kk2, showCategory = 10, title = "Enriched Pathways" , split=".sign") + facet_grid(.~.sign)
Enrichment map plot:
emapplot(pairwise_termsim(kk2))
GSEA Plot for the first pathway (Ribosome):
gseaplot(kk2, by = "all", title = kk2$Description[1], geneSetID = 1)
From the enrichment results, the first two pathways, hsa03010 and hsa03040, are taken and plotted as native KEGG graphs with pathview:
library(pathview)
# Produce the native KEGG plot (PNG)
dme <- pathview(gene.data=kegg_gene_list, pathway.id="hsa03010", species = kegg_organism)
knitr::include_graphics("hsa03010.pathview.png")
# Produce the native KEGG plot (PNG)
dme <- pathview(gene.data=kegg_gene_list, pathway.id="hsa03040", species = kegg_organism)
knitr::include_graphics("hsa03040.pathview.png")
The expression values of the top 100 significantly differentially expressed genes are gathered into a small dataset, which is used to build classifiers for predicting CRC patients. The following models are built, tuned, and tested; cross-validation and confusion matrices are used to evaluate model performance.
Linear Discriminant Analysis (LDA): A statistical method that finds a linear combination of features separating two or more classes. (Package MASS, lda(), via caret)
Generalized Linear Models (GLM): A flexible class of models for a wide range of data types, specified by a suitable probability distribution and a linear predictor. (Package stats, glm(), via caret)
Support Vector Machines (SVM): A machine learning algorithm for classification and regression that constructs a hyperplane, or a set of hyperplanes, in a high-dimensional space to separate classes or predict a continuous output. (Package kernlab, ksvm(), via caret's "svmRadial" method)
Neural Networks (NN): A set of algorithms loosely modeled on the structure of the brain, used for supervised and unsupervised learning, pattern recognition, and clustering. (Package nnet, nnet(), via caret)
Decision Trees (DT): A tree-like model that makes decisions by recursively splitting the data on feature values until a decision can be made. (Package rpart, rpart(), via caret's "rpart2" method)
Gradient Boosting Machines (GBM): A machine learning technique that builds an ensemble of weak learners, such as decision trees, and improves performance iteratively by fitting each new learner to the errors of the current ensemble. (Package gbm, gbm(), via caret)
The package behind each of these caret method strings can be confirmed directly from caret's model registry, as sketched below.
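A minimal sketch of such a lookup (illustrative only; output not shown):
library(caret)
# Which package backs a given caret method string?
getModelInfo("svmRadial", regex = FALSE)[[1]]$library  # "kernlab"
getModelInfo("rpart2", regex = FALSE)[[1]]$library     # "rpart"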
Create a small dataset consisting of only the expression values of the top 100 genes; it is used by the classifiers below. As is usual for machine learning models, the input features are normalized.
top_100_expr = counts_matrix[as.numeric(rownames(top_100)),]
# transpose the matrix
top_100_expr = t(top_100_expr)
# normalizing features
top_100_expr = scale(top_100_expr)
dim(top_100_expr)
## [1] 121 100
Convert the matrix to a data frame:
top_100_expr = as.data.frame(top_100_expr)
Add the ground-truth classification label ("status") to the top-100 expression data frame as a factor column named group:
top_100_expr$group <- factor(pheno1$status)
Linear Discriminant Analysis (LDA):
library(caret)
library(MASS)
levels(top_100_expr$group) <- make.names(levels(top_100_expr$group))
levels(top_100_expr$group)
## [1] "colorectal.cancer" "normal"
Set up the training control for 10-fold cross-validation repeated three times:
set.seed(1)
# Define the training control object for 10-fold cross-validation
train.control <- trainControl(
method = "repeatedcv",
number = 10, ## 10-fold CV
repeats = 3,## repeated three times
classProbs = TRUE
)
# Fit the LDA model using the train() function
lda_fit <- train(group ~ ., data = top_100_expr, method = "lda", trControl = train.control)
lda_fit
## Linear Discriminant Analysis
##
## 121 samples
## 100 predictors
## 2 classes: 'colorectal.cancer', 'normal'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 110, 109, 109, 109, 109, 109, ...
## Resampling results:
##
## Accuracy Kappa
## 0.6953768 0.3910275
The cross-validated accuracy of the LDA model is 0.695.
Because cross-validation is used, caret already estimates performance on held-out folds. The confusion table below is shown only for illustration: it is computed on the same data used for training, which is why it reports 100% accuracy.
Print the confusion table:
ghat = predict(lda_fit)
table(ghat, top_100_expr$group)
##
## ghat colorectal.cancer normal
## colorectal.cancer 59 0
## normal 0 62
Mean training error:
mean(ghat != top_100_expr$group)
## [1] 0
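A less optimistic alternative is to tabulate the held-out (resampled) predictions rather than re-predicting on the training data. A minimal sketch, assuming the hold-out predictions are saved via savePredictions in trainControl:
# Refit the LDA model while saving the hold-out predictions of each fold,
# then pool them across folds/repeats into a cross-validated confusion matrix.
cv_control <- trainControl(method = "repeatedcv", number = 10, repeats = 3,
                           classProbs = TRUE, savePredictions = "final")
lda_cv <- train(group ~ ., data = top_100_expr, method = "lda", trControl = cv_control)
confusionMatrix(lda_cv$pred$pred, lda_cv$pred$obs)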
Generalized Linear Models (GLM):
set.seed(1)
glm_fit <- train(group ~ .,
data = top_100_expr,
method = "glm",
trControl = train.control,
metric = "Accuracy")
glm_fit
## Generalized Linear Model
##
## 121 samples
## 100 predictors
## 2 classes: 'colorectal.cancer', 'normal'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 110, 109, 109, 109, 109, 109, ...
## Resampling results:
##
## Accuracy Kappa
## 0.6691142 0.3402693
The accuracy of the GLM model is 0.669.
Support Vector Machines (SVM):
set.seed(1)
svm_fit <- train(group ~ ., data = top_100_expr, method = "svmRadial", trControl = train.control, metric = "Accuracy", tuneLength = 10)
svm_fit
## Support Vector Machines with Radial Basis Function Kernel
##
## 121 samples
## 100 predictors
## 2 classes: 'colorectal.cancer', 'normal'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 110, 109, 109, 109, 109, 109, ...
## Resampling results across tuning parameters:
##
## C Accuracy Kappa
## 0.25 0.9229992 0.8451671
## 0.50 0.9257770 0.8507226
## 1.00 0.9257770 0.8507226
## 2.00 0.9311189 0.8614976
## 4.00 0.9334693 0.8661381
## 8.00 0.9392774 0.8781227
## 16.00 0.9392774 0.8781227
## 32.00 0.9392774 0.8781227
## 64.00 0.9364996 0.8725671
## 128.00 0.9392774 0.8781227
##
## Tuning parameter 'sigma' was held constant at a value of 0.006925265
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were sigma = 0.006925265 and C = 8.
The accuracy of the SVM model is 0.939, with sigma = 0.006925265 and C = 8.
Fine-tuning model parameters:
plot(svm_fit)
Neural Networks (NN):
library("caret")
set.seed(1)
garbage <- capture.output( # capture any garbage output
nnfit <-
train(
group ~ .,
data = top_100_expr,
method = "nnet",
tuneLength = 5,
trControl = train.control,
metric = "Accuracy"))
nnfit
## Neural Network
##
## 121 samples
## 100 predictors
## 2 classes: 'colorectal.cancer', 'normal'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 110, 109, 109, 109, 109, 109, ...
## Resampling results across tuning parameters:
##
## size decay Accuracy Kappa
## 1 0e+00 0.9200078 0.8390297
## 1 1e-04 0.9369270 0.8734836
## 1 1e-03 0.9306915 0.8610006
## 1 1e-02 0.9503885 0.9003449
## 1 1e-01 0.9503885 0.9003449
## 3 0e+00 0.9255245 0.8503915
## 3 1e-04 0.9329643 0.8656454
## 3 1e-03 0.9448329 0.8892338
## 3 1e-02 0.9503885 0.9003449
## 3 1e-01 0.9503885 0.9003449
## 5 0e+00 0.9334305 0.8661508
## 5 1e-04 0.9447941 0.8891131
## 5 1e-03 0.9476107 0.8947893
## 5 1e-02 0.9503885 0.9003449
## 5 1e-01 0.9503885 0.9003449
## 7 0e+00 0.9252331 0.8494441
## 7 1e-04 0.9394911 0.8785802
## 7 1e-03 0.9503885 0.9003449
## 7 1e-02 0.9503885 0.9003449
## 7 1e-01 0.9503885 0.9003449
## 9 0e+00 0.9278749 0.8556510
## 9 1e-04 0.9394911 0.8784573
## 9 1e-03 0.9503885 0.9003449
## 9 1e-02 0.9503885 0.9003449
## 9 1e-01 0.9503885 0.9003449
##
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were size = 1 and decay = 0.1.
The best accuracy is 0.950, obtained with size = 1 and decay = 0.1.
Fine-tuning model parameters:
plot(nnfit)
Decision Trees (DT):
library("rpart")
library("rpart.plot")
set.seed(1)
tuning_grid <- expand.grid(
maxdepth = c(1,2,3,4,5,6,7,8,9,10)
)
tree_fit <- train(group~., data=top_100_expr,
method = "rpart2",
tuneLength = 6,
trControl = train.control,
metric = "Accuracy",
tuneGrid = tuning_grid
)
tree_fit
## CART
##
## 121 samples
## 100 predictors
## 2 classes: 'colorectal.cancer', 'normal'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 110, 109, 109, 109, 109, 109, ...
## Resampling results across tuning parameters:
##
## maxdepth Accuracy Kappa
## 1 0.7494755 0.4968216
## 2 0.8125486 0.6262569
## 3 0.8355284 0.6714442
## 4 0.8355284 0.6714442
## 5 0.8355284 0.6714442
## 6 0.8355284 0.6714442
## 7 0.8355284 0.6714442
## 8 0.8355284 0.6714442
## 9 0.8355284 0.6714442
## 10 0.8355284 0.6714442
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was maxdepth = 3.
The best accuracy is 0.836 with depth 3.
Fine-tuning model parameters:
plot(tree_fit)
library(rattle)
fancyRpartPlot(tree_fit$finalModel)
Gradient Boosting Machine (GBM):
set.seed(1)
garbage <- capture.output(
gbm_fit <- train(group~., data=top_100_expr,
method = "gbm",
tuneLength = 5,
trControl = train.control,
metric = "Accuracy"
))
gbm_fit
## Stochastic Gradient Boosting
##
## 121 samples
## 100 predictors
## 2 classes: 'colorectal.cancer', 'normal'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 110, 109, 109, 109, 109, 109, ...
## Resampling results across tuning parameters:
##
## interaction.depth n.trees Accuracy Kappa
## 1 50 0.9170163 0.8333538
## 1 100 0.9223582 0.8441302
## 1 150 0.9251360 0.8496857
## 1 200 0.9223582 0.8441302
## 1 250 0.9251360 0.8496857
## 2 50 0.9200466 0.8395684
## 2 100 0.9253885 0.8503449
## 2 150 0.9251360 0.8496857
## 2 200 0.9256022 0.8508024
## 2 250 0.9337218 0.8670115
## 3 50 0.9198329 0.8392338
## 3 100 0.9281663 0.8559004
## 3 150 0.9283800 0.8563580
## 3 200 0.9253497 0.8499290
## 3 250 0.9279138 0.8550270
## 4 50 0.9141997 0.8273837
## 4 100 0.9170163 0.8334766
## 4 150 0.9253885 0.8503449
## 4 200 0.9253885 0.8503449
## 4 250 0.9258159 0.8512614
## 5 50 0.9086830 0.8168099
## 5 100 0.9253885 0.8503449
## 5 150 0.9281663 0.8559004
## 5 200 0.9309441 0.8614560
## 5 250 0.9306915 0.8607969
##
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
##
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were n.trees = 250, interaction.depth =
## 2, shrinkage = 0.1 and n.minobsinnode = 10.
The best accuracy is 0.934 with parameters: n.trees = 250, interaction.depth = 2, shrinkage = 0.1, and n.minobsinnode = 10.
Fine-tuning model parameters:
plot(gbm_fit)
The way duplicated Probe IDs and Entrez IDs are handled affects the final dataset and the genes' p-values. I experimented with different approaches and observed very different DEG subsets, so the ambiguity of the ID mapping is the first limitation of this study.
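To illustrate how much the collapse rule matters, here is a minimal sketch of one alternative (not the rule used above): for duplicated Entrez IDs, keep the most highly expressed probe instead of the one with the lowest adjusted p-value.
# Rebuild the full table (before collapsing duplicates), then apply the alternative rule.
top_all <- topTable(fit, coef = "groupnormal", n = Inf, adjust = "BH")
top_all$ENTREZID <- entrez_ids[as.numeric(rownames(top_all))]
# Keep the probe with the highest average expression per Entrez ID.
top_alt <- top_all[order(top_all$AveExpr, decreasing = TRUE), ]
top_alt <- top_alt[!duplicated(top_alt$ENTREZID), ]
# Overlap of the resulting top-100 gene list with the p-value-based collapse used above.
top_alt <- top_alt[order(top_alt$adj.P.Val), ]
length(intersect(head(top_alt$ENTREZID, 100), top_100$ENTREZID))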
After preprocessing the expression counts, DEGs are identified with a 0.05 adjusted p-value threshold, and up- and down-regulated DEGs between CRC and HC are highlighted by fold change in the volcano plot. The DEGs are significant in that their adjusted p-values are very small; however, it is not certain that the same result would hold in a larger dataset. The small sample size is therefore the second limitation of this study.
Enriched GO terms are found based on the top DEGs with the lowest p-values. The top two biological processes are regulation of cellular metabolic process and regulation of nitrogen compound metabolic process.
Enriched KEGG pathways are also found for CRC; the top two are Ribosome and Spliceosome.
For CRC prediction on this small dataset with 121 samples and 100 predictors, the neural network achieves the best accuracy with only a few hidden units, and SVM and GBM both exceed 90% accuracy. This suggests that even a simple neural network can model considerable complexity. The fine-tuning plots above show that parameter tuning strongly affects model performance, and cross-validation makes the most of a small dataset. Collinearity is present in the dataset: some genes are highly correlated, which can hurt the machine learning models.
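One way to quantify the collinearity (a minimal sketch using caret's findCorrelation; the 0.9 cutoff is an arbitrary choice):
# Flag predictors whose pairwise absolute correlation exceeds 0.9.
expr_only <- top_100_expr[, setdiff(names(top_100_expr), "group")]
high_cor <- caret::findCorrelation(cor(expr_only), cutoff = 0.9)
length(high_cor)  # number of predictors that could be dropped to reduce collinearity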
comparison <- data.frame(
Model = c("LDA", "GLM", "SVM", "NN", "DT", "GBM"),
Accuracy = c(0.695, 0.669, 0.939, 0.950, 0.836, 0.934)
)
t(comparison)
## [,1] [,2] [,3] [,4] [,5] [,6]
## Model "LDA" "GLM" "SVM" "NN" "DT" "GBM"
## Accuracy "0.695" "0.669" "0.939" "0.950" "0.836" "0.934"
Again, results from such a small dataset may not be reliable. For this sample size, the number of predictors is quite large, so overfitting may be present. The models should be validated and re-trained on a larger dataset with enough samples.
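If more samples become available, a simple external validation could complement cross-validation. A minimal sketch with a hold-out split on the current data frame (for illustration only, given the small sample size):
# Hold out ~30% of samples, train on the rest, and evaluate on the held-out set.
set.seed(1)
idx <- caret::createDataPartition(top_100_expr$group, p = 0.7, list = FALSE)
train_set <- top_100_expr[idx, ]
test_set  <- top_100_expr[-idx, ]
holdout_fit <- train(group ~ ., data = train_set, method = "lda", trControl = train.control)
confusionMatrix(predict(holdout_fit, newdata = test_set), test_set$group)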