1 Introduction

This project aims to discover biomarkers and predict colorectal cancer (CRC) through studying the gene expressions of 59 CRC patients and 62 healthy controls (HCs) from Gene Expression Ominibus (GEO) database with id GSE164191. Two data objects are essential: a gene expression matrix (54,675 \(\times\) 121) and a phenotype table (121 \(\times\) 4). Duplicated Probe IDs are removed with the first gene left and duplicated Entrez IDs are removed with the lowest adjusted p-value gene left. To achieve the goal, 4 types of analysis have been conducted as followings:

  1. Differential Expression Analysis (DEGs): I identify the significant differently expressed genes between CRC and HC groups with age and gender co-variants using a 0.05 p-value threshold. As the MAS5 normalization has been performed by the paper (“Immune-related gene expression signatures in colorectal cancer”), the preprocessing only includes ID converstion and filtering the low expression counts. A DGEList object is created by expression data, then a linear model is fitted using Limma package.

  2. Gene Ontology Enrichment Analysis (GOEA): I identify the top enriched gene ontology terms and plot the GO hierarchy. A topGO object is created to perform GO enrichment test.

  3. Gene Set Enrichment Analysis (GSEA): I identify the significantly enriched pathways against KEGG pathways with a 0.05 p-value threshold. The function gseKEGG from clusterProfiler is used to perform GSEA. The pathview library is used to plot two enriched pathways: hsa03010 and hsa03040.

  4. CRC Prediction: I build and tune the models to predict CRC patients and HCs based on the expressions of top 100 DEGs using caret library using cross validation method. Prediction models include linear discriminant analysis (LDA), generalized linear models (GLM), support vector machines (SVM), neural networks (NN), decision trees (DT), and gradient boost machines (GBM).

Through the above analysis, significant differences in gene expression are found between CRC patients and HCs, which could be effective biomarkers. Prediction models showed a high accuracy in this tiny datasets. Their real performance requires to be validated in larger datasets.

2 Differential Expression Analysis (DEGs)

2.1 Method Description

Data preprocessing is performed before conducting analysis. Key steps include: converting Probe IDs of Affymetrix HG-U133_Plus_2 platform into Entrez IDs, filtering out NA and low expression data. As the MAS5 normalization has been performed by the paper for the data, counts are then directly fitted with a linear model using Limma. Gene expression dispersion, p-value histogram and MA plot are made. The objective is to identify the significant genes that differently expressed in CRC patients and controls.

2.2 Experiment and Results

library(magrittr)
library(dplyr)
library(data.table)
library(ggplot2)
library(Biobase)
library(org.Hs.eg.db)
library(topGO)
library(clusterProfiler)
library(enrichplot)
library(limma)
library(edgeR)
require(DOSE)
library(pathview)
library(GEOquery)
library(Biobase)

Data Preparation:

The txt file contains gene expression profiling of 59 colorectal cancer (CRC) patients and 62 healthy controls (HCs).

gse = getGEO("GSE164191")
# length(gse)
meta = pData(phenoData(gse[[1]]))
names(meta)
##  [1] "title"                   "geo_accession"          
##  [3] "status"                  "submission_date"        
##  [5] "last_update_date"        "type"                   
##  [7] "channel_count"           "source_name_ch1"        
##  [9] "organism_ch1"            "characteristics_ch1"    
## [11] "characteristics_ch1.1"   "characteristics_ch1.2"  
## [13] "characteristics_ch1.3"   "treatment_protocol_ch1" 
## [15] "growth_protocol_ch1"     "molecule_ch1"           
## [17] "extract_protocol_ch1"    "label_ch1"              
## [19] "label_protocol_ch1"      "taxid_ch1"              
## [21] "hyb_protocol"            "scan_protocol"          
## [23] "description"             "data_processing"        
## [25] "platform_id"             "contact_name"           
## [27] "contact_email"           "contact_institute"      
## [29] "contact_address"         "contact_city"           
## [31] "contact_zip/postal_code" "contact_country"        
## [33] "supplementary_file"      "supplementary_file.1"   
## [35] "data_row_count"          "age:ch1"                
## [37] "disease status:ch1"      "gender:ch1"             
## [39] "tissue:ch1"
#### reading count matrix
library(data.table)
ori_data = fread("GSE164191_series_matrix.txt", fill = TRUE)
# ori_data %>% head

counts = ori_data[64:54738,]
samples = ori_data[63, -1]
countsNames <-  counts$V1
rownames(counts) <- countsNames
counts[,1] = NULL
colnames(counts) <- as.character(samples)
# counts %>% head

#### reading phenotypic data
pheno = subset(meta, select = c("title", "age:ch1", "disease status:ch1", "gender:ch1"))
colnames(pheno) <- c("title", "age", "status", "gender")
# pheno %>% head

#### Checking and mapping counts and phenotypic data
phenoNames = rownames(pheno)

missingNames <- setdiff(colnames(counts), phenoNames) # elements of x not in y
# missingNames

counts1 <- counts[, !c(..missingNames)]
rownames(counts1) <- countsNames

pheno1 <- pheno
# all(colnames(counts1) == rownames(pheno1))

The original identifiers provided in the GEO164191 dataset are Probe IDs of Affymetrix HG-U133_Plus_2 platform. Prob IDs are converted to ENTREZ IDs. The duplicated Probe IDs are removed first with the first gene left:

library(hgu133plus2.db)
# find the mapping table between Probe IDs and Entrez IDs
id_mapping = AnnotationDbi::select(hgu133plus2.db, countsNames, "ENTREZID")
# filer out the na rows
id_mapping = na.omit(id_mapping)
# remove the duplicated probe IDs
id_mapping = id_mapping[!duplicated(id_mapping$PROBEID),]
dim(id_mapping)
## [1] 44662     2

44662 genes are left with unique Probe IDs.

Convert all characters in counts to numbers as a matrix, add a column for PROBE IDs, then find the matching ENTREZ IDs by merging counts and id mapping tables:

counts_matrix <- matrix(as.numeric(unlist(counts1)), ncol = length(counts1), byrow = FALSE)
PROBEID <- countsNames
counts_matrix <- cbind(counts_matrix, PROBEID)
counts_matrix <- merge(counts_matrix, id_mapping, by = "PROBEID")
dim(counts_matrix)
## [1] 44662   123

Separate Probe IDs, Entrez IDs, and count matrix to different variables:

probe_ids = counts_matrix[,1]
entrez_ids = counts_matrix[,dim(counts_matrix)[2]]
counts_matrix = counts_matrix[,-c(1,dim(counts_matrix)[2])]
counts_matrix <- matrix(as.numeric(unlist(counts_matrix)), ncol = length(counts1), byrow = FALSE)

Filter out genes with very low counts across all libraries. The threshold is set as 10 here.

keep <- rowSums(counts_matrix) >= 10
counts_matrix <- counts_matrix[keep,]
probe_ids <- probe_ids[keep]
entrez_ids <- entrez_ids[keep]

dim(counts_matrix)
## [1] 44662   121

No gene is found below the threshold.

Create DGEList object with group labels (normal or cancer):

group = pheno1$status
dge = DGEList(counts_matrix,group = group)

Normalization is not needed as the data is already normalized.

d <- calcNormFactors(dge)

Create design matrix with co-variants (age and gender), group (i.e., cancer status) is kept as the main attribute, intercept is kept:

design <- model.matrix(~ age + gender + group, data=pheno1)

Plot dispersions of gene expressions against normalized counts using voom without normalization method:

v = voom(counts = d, design = design, plot = T, normalize.method = "none")

Fit a linear model:

fit <- lmFit(v, design)
fit <- eBayes(fit, trend = TRUE)

Plot the p-value histogram:

res = data.frame(fit)
thehist = hist(res$p.value.groupnormal, breaks = 50, plot=FALSE)
plot(thehist)

MA plot (log-fold-changes vs normalized counts):

# Mean-difference plot
limma::plotMA(fit)

Sort genes by adjusted p-values in ascending order. Filter out the duplicated Entrez IDs by keeping the gene with the lowest p-value. Dimension after filtering is 21367 .

top <- topTable(fit, coef="groupnormal", n=Inf, adjust="BH")
top <- top[order(top$adj.P.Val), ]

# for rows with duplicated entrez IDs, choose the first row with the highest adjusted p value
top$ENTREZID = entrez_ids[as.numeric(rownames(top))]
top <- top[!duplicated(top$ENTREZID),]

dim(top)
## [1] 21367     7

List the top 25 significant DEGs as followings:

top_25 <- head(top,25)
top_100 <- head(top,100)

top_25
##            logFC  AveExpr         t      P.Value    adj.P.Val         B
## 16233 -0.7215613 3.133655 -7.864248 9.397586e-12 4.197150e-07 16.307344
## 11216 -1.1283695 5.670882 -6.972724 5.763052e-10 1.286947e-05 12.484455
## 7814   0.6771076 5.668244  6.874449 9.012231e-10 1.341681e-05 12.055612
## 20520 -0.8947702 5.863532 -6.483536 5.245020e-09 5.719367e-05 10.373200
## 31088  0.9440105 5.560535  6.425712 6.788493e-09 5.719367e-05 10.127117
## 3050   0.6838653 4.037530  6.363209 8.964123e-09 5.719367e-05  9.857513
## 33070 -0.5475620 2.553364 -6.368487 8.756426e-09 5.719367e-05  9.811460
## 29861  0.5688781 3.527407  6.277456 1.310795e-08 6.017266e-05  9.487427
## 38071 -0.5615244 6.258030 -6.272107 1.342161e-08 6.017266e-05  9.474808
## 15405  1.1959499 5.674375  6.263510 1.394127e-08 6.017266e-05  9.440533
## 42910  0.6874490 3.876167  6.232676 1.597408e-08 6.017266e-05  9.308093
## 31397  0.7300965 4.488013  6.229948 1.616748e-08 6.017266e-05  9.300147
## 6971  -0.8858330 1.075623 -6.151671 2.281616e-08 7.838579e-05  8.550084
## 27792  0.5435575 4.906145  6.128744 2.523109e-08 8.049078e-05  8.876473
## 41499  0.9717034 2.546625  6.096458 2.906481e-08 8.384360e-05  8.637724
## 15465  0.8040916 5.250491  6.062001 3.379125e-08 8.384360e-05  8.597869
## 25965  1.2712551 2.465411  6.071861 3.236634e-08 8.384360e-05  8.442564
## 35173  0.7964616 1.476944  6.081404 3.104368e-08 8.384360e-05  8.231666
## 20452  0.6205344 4.432372  6.015568 4.137835e-08 8.519548e-05  8.407555
## 18476  0.7273415 4.821288  6.009019 4.257555e-08 8.519548e-05  8.379265
## 9092   0.6632160 4.581803  5.999442 4.438815e-08 8.519548e-05  8.340334
## 20074  0.8013458 5.669264  5.985363 4.719171e-08 8.519548e-05  8.277918
## 27479  0.5915544 4.540757  5.975709 4.921425e-08 8.519548e-05  8.242097
## 19363  0.6312993 4.568853  5.973063 4.978343e-08 8.519548e-05  8.231433
## 28106  0.6279386 5.925042  5.965240 5.150414e-08 8.519548e-05  8.193485
##       ENTREZID
## 16233     1174
## 11216     5534
## 7814      9774
## 20520     2280
## 31088    29068
## 3050     55692
## 33070   400236
## 29861     1783
## 38071    54815
## 15405    23015
## 42910    51747
## 31397    54800
## 6971    286135
## 27792    58508
## 41499     7562
## 15465      546
## 25965    55671
## 35173     9321
## 20452     8930
## 18476    58517
## 9092      4820
## 20074     6432
## 27479     4666
## 19363     5978
## 28106    79101

Top 25 genes’ ENTREZ IDs:

top25_entrez_ids = entrez_ids[as.numeric(rownames(top_25))]
top25_entrez_ids
##  [1] "1174"   "5534"   "9774"   "2280"   "29068"  "55692"  "400236" "1783"  
##  [9] "54815"  "23015"  "51747"  "54800"  "286135" "58508"  "7562"   "546"   
## [17] "55671"  "9321"   "8930"   "58517"  "4820"   "6432"   "4666"   "5978"  
## [25] "79101"

Top 100 genes’ ENTREZ IDs:

top100_entrez_ids = entrez_ids[as.numeric(rownames(top_100))]
top100_entrez_ids
##   [1] "1174"   "5534"   "9774"   "2280"   "29068"  "55692"  "400236" "1783"  
##   [9] "54815"  "23015"  "51747"  "54800"  "286135" "58508"  "7562"   "546"   
##  [17] "55671"  "9321"   "8930"   "58517"  "4820"   "6432"   "4666"   "5978"  
##  [25] "79101"  "54799"  "57513"  "130872" "23137"  "3181"   "283337" "83941" 
##  [33] "3709"   "26039"  "3835"   "6526"   "80169"  "351"    "26508"  "55339" 
##  [41] "7456"   "4603"   "124411" "10142"  "8899"   "84124"  "7919"   "55904" 
##  [49] "9698"   "90624"  "27327"  "9252"   "285527" "57223"  "25831"  "54906" 
##  [57] "196441" "28951"  "55102"  "157680" "9101"   "3005"   "9736"   "730092"
##  [65] "51107"  "10209"  "79646"  "56339"  "10181"  "1654"   "22889"  "11198" 
##  [73] "8473"   "163"    "1727"   "257415" "2186"   "139322" "10147"  "54737" 
##  [81] "7072"   "1763"   "200576" "6651"   "5775"   "256471" "6134"   "201627"
##  [89] "5478"   "6642"   "667"    "85028"  "55670"  "54780"  "284273" "220988"
##  [97] "23215"  "6774"   "714"    "9987"

Set a adjusted p-value threshold as 0.05, make a subset of genes with adjusted p-value smaller than the threshold.

subset_genes = top[top$adj.P.Val<0.05,]
head(subset_genes)
##            logFC  AveExpr         t      P.Value    adj.P.Val         B
## 16233 -0.7215613 3.133655 -7.864248 9.397586e-12 4.197150e-07 16.307344
## 11216 -1.1283695 5.670882 -6.972724 5.763052e-10 1.286947e-05 12.484455
## 7814   0.6771076 5.668244  6.874449 9.012231e-10 1.341681e-05 12.055612
## 20520 -0.8947702 5.863532 -6.483536 5.245020e-09 5.719367e-05 10.373200
## 31088  0.9440105 5.560535  6.425712 6.788493e-09 5.719367e-05 10.127117
## 3050   0.6838653 4.037530  6.363209 8.964123e-09 5.719367e-05  9.857513
##       ENTREZID
## 16233     1174
## 11216     5534
## 7814      9774
## 20520     2280
## 31088    29068
## 3050     55692
deg_results = top

# Take negative log10 of adjusted p-values
deg_results$neg_log10_pval <- -log10(deg_results$adj.P.Val)

# Generate volcano plot
plot(deg_results$logFC, deg_results$neg_log10_pval, 
     pch=20, main="Volcano Plot", xlab="log2 Fold Change", ylab="-log10 p-value")

# Highlight up-regulated genes in red and down-regulated genes in blue
up_genes <- deg_results$logFC > 0 & deg_results$adj.P.Val < 0.05
down_genes <- deg_results$logFC < 0 & deg_results$adj.P.Val < 0.05
points(deg_results$logFC[up_genes], deg_results$neg_log10_pval[up_genes], 
       pch=20, col="red")
points(deg_results$logFC[down_genes], deg_results$neg_log10_pval[down_genes], 
       pch=20, col="blue")

3 Gene Ontology Enrichment Analysis (GOEA)

3.1 Method Description

GOEA aims to find the most enriched gene ontology. The human gene universe is loaded by using OrganismDbi. GO terms are found by the list of human genes, which are in the significant DEGs from the previous analysis. The fisher test is run with the GOdata to identify the top GO terms based on their scores.

3.2 Experiment and Results

Load human gene data from database “org.Hs.eg.db”:

library(org.Hs.eg.db)
library(OrganismDbi)

columns(org.Hs.eg.db)
##  [1] "ACCNUM"       "ALIAS"        "ENSEMBL"      "ENSEMBLPROT"  "ENSEMBLTRANS"
##  [6] "ENTREZID"     "ENZYME"       "EVIDENCE"     "EVIDENCEALL"  "GENENAME"    
## [11] "GENETYPE"     "GO"           "GOALL"        "IPI"          "MAP"         
## [16] "OMIM"         "ONTOLOGY"     "ONTOLOGYALL"  "PATH"         "PFAM"        
## [21] "PMID"         "PROSITE"      "REFSEQ"       "SYMBOL"       "UCSCKG"      
## [26] "UNIPROT"

Number of genes left in the significant DEGs subset.

length(subset_genes[[1]])
## [1] 6816

Load human gene universe then choose genes from that:

library(topGO)
humanGeneUniverse <- as.character(unique(OrganismDbi::select(org.Hs.eg.db, keys=keys(org.Hs.eg.db), column="ENTREZID")$ENTREZID))
head(humanGeneUniverse)
## [1] "1"  "2"  "3"  "9"  "10" "11"
myGeneList = entrez_ids[as.numeric(rownames(subset_genes))]
head(myGeneList)
## [1] "1174"  "5534"  "9774"  "2280"  "29068" "55692"
geneList <- factor(as.integer(humanGeneUniverse %in% myGeneList))
names(geneList) <- humanGeneUniverse

library(magrittr)
geneList %>% head
##  1  2  3  9 10 11 
##  0  0  0  1  0  0 
## Levels: 0 1

Create a topGOdata class object required by topGO package in order to perform GO enrichment test:

GOdata <- new("topGOdata", 
              ontology="BP", 
              allGenes=geneList, 
              nodeSize=5, # the minimum number of genes for each GO term.
              annotationFun=annFUN.org, 
              mapping="org.Hs.eg.db", 
              ID="entrez")

Find all the GO terms mapped by the geneList:

GOresult <- runTest(GOdata, algorithm = "classic", statistic = "fisher")
## 
##           -- Classic Algorithm -- 
## 
##       the algorithm is scoring 9280 nontrivial nodes
##       parameters: 
##           test statistic: fisher
resultTable <- GenTable(GOdata, GOresult, topNodes = 50, numChar=200)
resultTable[1:10,]
##         GO.ID
## 1  GO:0031323
## 2  GO:0051171
## 3  GO:0080090
## 4  GO:0019219
## 5  GO:0046907
## 6  GO:0045935
## 7  GO:0044260
## 8  GO:0044237
## 9  GO:0071840
## 10 GO:0051173
##                                                                       Term
## 1                                 regulation of cellular metabolic process
## 2                        regulation of nitrogen compound metabolic process
## 3                                  regulation of primary metabolic process
## 4           regulation of nucleobase-containing compound metabolic process
## 5                                                  intracellular transport
## 6  positive regulation of nucleobase-containing compound metabolic process
## 7                                 cellular macromolecule metabolic process
## 8                                               cellular metabolic process
## 9                            cellular component organization or biogenesis
## 10              positive regulation of nitrogen compound metabolic process
##    Annotated Significant Expected result1
## 1       5737        2169  1744.80 < 1e-30
## 2       5792        2161  1761.53 < 1e-30
## 3       5954        2213  1810.80 < 1e-30
## 4       4076        1598  1239.64 < 1e-30
## 5       1587         719   482.66 < 1e-30
## 6       2052         888   624.08 < 1e-30
## 7       3266        1307   993.29 < 1e-30
## 8      10271        3525  3123.74 < 1e-30
## 9       6601        2389  2007.57 < 1e-30
## 10      3175        1268   965.62 < 1e-30

Plot GO hierarchy formed by top 50 GO terms:

showSigOfNodes(GOdata, score(GOresult), firstSigNodes = 50, useInfo = 'all')

## $dag
## A graphNEL graph with directed edges
## Number of Nodes = 79 
## Number of Edges = 171 
## 
## $complete.dag
## [1] "A graph with 79 nodes."

4 Gene Set Enrichment Analysis (GSEA)

4.1 Method Description

GSEA identifies significantly enriched pathways in CRC patients by comparing with normal KEGG pathways (i.e., hsa human organism). I used gseKEGG function to perform KEGG pathway enrichment analysis.

4.2 Experiment and Results

Preparation of the inputs received from DE analysis:

df = topTable(fit, coef="groupnormal", n=Inf, adjust="BH")

kegg_gene_list <- df$logFC
names(kegg_gene_list) <- entrez_ids[as.numeric(rownames(df))]
  
# omit any NA values 
kegg_gene_list<-na.omit(kegg_gene_list)

# sort the list in decreasing order (required for clusterProfiler)
kegg_gene_list = sort(kegg_gene_list, decreasing = TRUE)

Create gseKEGG object:

library(R.utils)
library(clusterProfiler)
R.utils::setOption("clusterProfiler.download.method",'wininet')

kegg_organism = "hsa"
kk2 <- gseKEGG(geneList     = kegg_gene_list,
               organism     = kegg_organism,
               nPerm        = 10000,
               minGSSize    = 3,
               maxGSSize    = 800,
               pvalueCutoff = 0.05,
               pAdjustMethod = "BH",
               keyType       = "ncbi-geneid")

Check top 5 pathways:

head(kk2, 5)
##                ID                       Description setSize enrichmentScore
## hsa03010 hsa03010                          Ribosome     129       0.6300956
## hsa03040 hsa03040                       Spliceosome     128       0.6200938
## hsa03013 hsa03013       Nucleocytoplasmic transport     104       0.6330755
## hsa03008 hsa03008 Ribosome biogenesis in eukaryotes      76       0.6365703
## hsa04120 hsa04120    Ubiquitin mediated proteolysis     141       0.5519044
##               NES       pvalue    p.adjust       qvalue  rank
## hsa03010 3.157399 0.0003918495 0.001913969 0.0008241973 12248
## hsa03040 3.099624 0.0003954132 0.001913969 0.0008241973 10413
## hsa03013 3.058391 0.0003658983 0.001913969 0.0008241973 14097
## hsa03008 2.912703 0.0003252033 0.001913969 0.0008241973 11821
## hsa04120 2.806603 0.0004105090 0.001913969 0.0008241973 12538
##                            leading_edge
## hsa03010 tags=56%, list=27%, signal=41%
## hsa03040 tags=49%, list=23%, signal=38%
## hsa03013 tags=57%, list=32%, signal=39%
## hsa03008 tags=52%, list=26%, signal=39%
## hsa04120 tags=42%, list=28%, signal=30%
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              core_enrichment
## hsa03010                                                                                                                                          6201/6164/51187/51187/6139/6139/6201/6139/6168/6189/9349/6205/6229/6133/6160/6134/6125/6223/9045/6229/65008/200916/6169/65003/51318/11224/6189/55052/6173/6218/6218/6157/51065/6189/64983/6233/6218/51318/51318/6227/6165/6165/6129/6129/9801/6227/51121/6135/6125/51023/6122/6222/28998/9045/6168/6167/6128/6125/54460/51023/55173/6155/11222/23521/6167/6160/6160/25873/6167/9801/6202/6138/63931/6124/6146/64981/64968/23521/6138/6175/6217/6138/64960/6210/6170/64965/6187/6152/6160/6183/51263/6194/6235/6124/6232/6129/6124/6125/6224/4736/6122/6204/6137/6223/6210/64969/6138/3921/6208/64960/6160/6165/51065/9045/6232/9045/6204/6234/29074/6204/6204/6217/6169/51263/6137/6165/6159/64960/6194/51263/29093/6152/6193/6122/6146/6228/51263/6130/6146/23521/6194/6132/6183/6146/6141/6203/6122/51081/6223/6230/51264/6130/51263/6233/6175/6181/6146
## hsa03040                                                                              9939/220988/9984/23350/6428/4809/6432/51362/6427/57461/7919/10772/3192/6426/10772/6432/27316/23658/58517/10772/23020/11325/3192/10915/6428/4686/8559/57187/5356/10772/84950/58517/55660/6427/6635/6430/6426/6430/9879/51645/55119/29896/51691/1655/23451/23350/1665/6427/27258/6430/58517/22916/57187/6434/58517/3183/51691/10929/2521/6428/6434/3178/51691/23451/23451/55110/6426/6426/10907/4116/6632/4670/6434/1665/220988/23451/10915/6432/55660/6635/6635/6426/22916/10569/3306/9939/3178/6434/6428/29896/6627/6636/10772/3178/55660/6637/29896/55660/3192/23350/6429/3183/10569/3183/1655/23020/2521/84991/84950/58517/23350/220988/6627/10929/3190/3183/11338/9879/84321/6431/51729/10285/4686/988/4670/10286/57187/8559/84950/84991/51690/6429/58517/3192/6633/51340/3183/3312/2521/6100/3178/6625/3192/10929/51639/10915/27316/8449/10084/11325/23658/55660/58517/23451/22938/10713/6627/5356/3312/9129/55660
## hsa03013                                                                      9939/3841/3841/51068/9984/5411/7175/65110/7919/56001/51068/3841/7175/5411/4686/84305/57187/55916/5901/5411/81929/3842/65110/5411/5903/23279/9818/22916/7175/57187/25909/65110/10527/51068/10762/5903/51808/5903/10921/10527/65109/55110/65110/348995/11260/56001/129401/4116/1434/55161/55916/55705/348995/7175/22916/9972/5903/57122/9939/7341/3842/3843/10762/59343/5901/51808/55746/3839/10527/9631/9818/10527/7341/55705/11097/3843/3839/53371/3843/55705/23165/84321/51194/26019/4686/3836/7329/57187/55308/25909/3836/11260/23279/10762/3838/10284/79023/5905/1915/7514/64328/10762/3838/81929/30000/387082/3836/79902/1434/57510/5903/8021/3843/10284/8480/1434/55706/55161/23225/3840/1434/10526/4928/55706/23636/10482/64901/7341/23636/23511/59343/10189/55161/1915/79023/7329/23039/4927/10189/3842/22985/55746/9939/10250/23279/6613/9688/9939/2733/23225/9939/7175/23511/57510/5905/23225/10762/23633/57510/10526
## hsa03008                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  1736/84135/51068/4931/4809/56001/51068/54552/55916/5901/55781/51119/102157402/25996/10556/54464/10556/29889/51068/55341/1736/84128/56001/134430/55341/55916/26354/55131/166378/54464/5901/10940/84135/51602/10556/10813/9724/134430/55272/10171/23195/29102/23195/55813/55127/51367/55127/51096/10199/25996/23560/7514/23160/55127/10528/51077/83732/1457/51077/81691/23560/54464/2091/84916/23560/51096/55781/28987/1736/27341/55127/4809/55127/54464/1459/4809/55341/10482/10885
## hsa04120 868/51366/330/26091/8450/55236/8925/8065/64750/8453/10277/4214/9063/55236/7337/55284/25898/55284/64750/11059/868/1161/92912/55958/6477/7328/7337/51366/51433/8451/26091/57448/29945/8924/6233/8450/29945/4193/8065/7321/6921/25898/25898/8454/8554/51366/57448/8925/119504/6502/55294/11059/55958/7337/26091/10054/7321/26272/6477/6477/9039/4591/4193/7324/83737/8925/7337/9690/26259/5071/330/10277/55236/51434/4193/7328/10393/331/9063/9616/51433/55236/9690/6502/64750/8450/51465/8697/7323/89910/8065/64682/9690/7328/9063/8452/119504/25898/7327/7334/55294/51465/55585/26091/7329/4193/8452/10055/63893/6921/868/26091/10075/7334/7334/11060/329/55585/10393/4214/10075/51529/8945/55284/8454/9021/9063/55120/119504/8881/9354/3093/83737/64326/996/55284/83737/29882/51433/8881/23221/4193/26232/7323/9063/10055/64326/9063/8450/7326/64682/26272/8451/7322/7327/55294/9320/9616/8697/6502/10277/8945/8451/6233/672/7322/23759/51434/7329/7319/10401/4193/22954/6500/7320/64682/64750/4591

Dotplot:

enrichplot::dotplot(kk2, showCategory = 10, title = "Enriched Pathways" , split=".sign") + facet_grid(.~.sign)

Encrichment plot map:

emapplot(pairwise_termsim(kk2))

GSEA Plot for the first pathway (Ribosome):

gseaplot(kk2, by = "all", title = kk2$Description[1], geneSetID = 1)

In the enrichment analysis, we take the first two pathways: hsa03010 and hsa03040. Pathview is used to plot enriched KEGG pathway:

library(pathview)

# Produce the native KEGG plot (PNG)
dme <- pathview(gene.data=kegg_gene_list, pathway.id="hsa03010", species = kegg_organism)


knitr::include_graphics("hsa03010.pathview.png")

# Produce the native KEGG plot (PNG)
dme <- pathview(gene.data=kegg_gene_list, pathway.id="hsa03040", species = kegg_organism)

knitr::include_graphics("hsa03040.pathview.png")

5 CRC Prediction

5.1 Method Description

The expressions of top 100 genes that are siginificantly DEG are gathered as a tiny dataset. This dataset is used to build classifiers for predicting CRC patients. Following models are built, tuned and tested. The cross validation and confusion matrices are used to evaluate the model performances.

  1. Linear Discriminant Analysis (LDA): A statistical method used to find a linear combination of features that separates two or more classes. (Package MASS-lda())

  2. Generalized Linear Models (GLM): A flexible class of models that can be used to model a wide range of data types by specifying a suitable probability distribution and a linear predictor. (Package stats-glm())

  3. Support Vector Machines (SVM): A machine learning algorithm used for classification and regression analysis that creates a hyperplane or set of hyperplanes in a high-dimensional space to separate data into classes or predict a continuous output. (Package e1071-svm())

  4. Neural Networks (NN): A set of algorithms modeled after the structure of the human brain, used for supervised and unsupervised learning, pattern recognition, and data clustering. (Package caret-nnet())

  5. Decision Trees (DT): A tree-like model used to make decisions by recursively splitting data based on the value of certain features until a decision can be made. (Package rpart-rpart()`)

  6. Gradient Boost Machines (GBM): A machine learning technique that builds an ensemble of weak prediction models, such as decision trees, and gradually improves their performance by boosting their weights based on their individual errors. (Package gbm-nnetgbm()`)

5.2 Experiment and Results

Create a tiny dataset consisting of only the expressions of top 100 genes, which is used for classifiers stated below. To use machine learning models, usually input features are normalized.

top_100_expr = counts_matrix[as.numeric(rownames(top_100)),]
# transpose the matrix
top_100_expr = t(top_100_expr)
# normalizing features
top_100_expr = scale(top_100_expr)
dim(top_100_expr)
## [1] 121 100

Convert the status column to factor, and matrix to data frame:

top_100_expr = as.data.frame(top_100_expr)

Add the ground truth classification label “status” in the top 100 gene expressions matrix:

top_100_expr$group <- factor(pheno1$status)

Linear Discriminant Analysis (LDA):

library(caret)
library(MASS)

levels(top_100_expr$group) <- make.names(levels(top_100_expr$group))
levels(top_100_expr$group)
## [1] "colorectal.cancer" "normal"

Set the training control environment with 10-fold cross-validation:

set.seed(1)
# Define the training control object for 10-fold cross-validation
train.control <- trainControl(
                           method = "repeatedcv",
                           number = 10, ## 10-fold CV
                           repeats = 3,## repeated three times
                           classProbs = TRUE
                           )

# Fit the LDA model using the train() function
lda_fit <- train(group ~ ., data = top_100_expr, method = "lda", trControl = train.control)

lda_fit
## Linear Discriminant Analysis 
## 
## 121 samples
## 100 predictors
##   2 classes: 'colorectal.cancer', 'normal' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 110, 109, 109, 109, 109, 109, ... 
## Resampling results:
## 
##   Accuracy   Kappa    
##   0.6953768  0.3910275

The accuracy of LDA model is 0.695.

I have used the cross validation method, so the confusion table is automatically calculated by the caret library. Here I just make a “fake” confusion table for illustration. The 100% accuracy below is due to all data have been used in the training.

Print the confusion table:

ghat = predict(lda_fit)
table(ghat, top_100_expr$group)
##                    
## ghat                colorectal.cancer normal
##   colorectal.cancer                59      0
##   normal                            0     62

Mean error:

mean(ghat != top_100_expr$group)
## [1] 0

Generalized Linear Models (GLM):

set.seed(1)
glm_fit <- train(group ~ ., 
                 data = top_100_expr, 
                 method = "glm", 
                 trControl = train.control, 
                 metric = "Accuracy")

glm_fit
## Generalized Linear Model 
## 
## 121 samples
## 100 predictors
##   2 classes: 'colorectal.cancer', 'normal' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 110, 109, 109, 109, 109, 109, ... 
## Resampling results:
## 
##   Accuracy   Kappa    
##   0.6691142  0.3402693

The accuracy of GLM model is 0.669.

Support Vector Machines (SVM):

set.seed(1)
svm_fit <- train(group ~ ., data = top_100_expr, method = "svmRadial", trControl = train.control, metric = "Accuracy", tuneLength = 10)

svm_fit
## Support Vector Machines with Radial Basis Function Kernel 
## 
## 121 samples
## 100 predictors
##   2 classes: 'colorectal.cancer', 'normal' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 110, 109, 109, 109, 109, 109, ... 
## Resampling results across tuning parameters:
## 
##   C       Accuracy   Kappa    
##     0.25  0.9229992  0.8451671
##     0.50  0.9257770  0.8507226
##     1.00  0.9257770  0.8507226
##     2.00  0.9311189  0.8614976
##     4.00  0.9334693  0.8661381
##     8.00  0.9392774  0.8781227
##    16.00  0.9392774  0.8781227
##    32.00  0.9392774  0.8781227
##    64.00  0.9364996  0.8725671
##   128.00  0.9392774  0.8781227
## 
## Tuning parameter 'sigma' was held constant at a value of 0.006925265
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were sigma = 0.006925265 and C = 8.

The accuracy of LDA model is 0.939, where sigma = 0.006925265 and C = 8.

Fine-tuning model parameters:

plot(svm_fit)

Neural Networks (NN):

library("caret")
set.seed(1)

garbage <- capture.output( # capture any garbage output
nnfit <-
   train(
   group ~ .,
   data = top_100_expr,
   method = "nnet",
   tuneLength = 5,
   trControl = train.control,
   metric = "Accuracy"))

nnfit
## Neural Network 
## 
## 121 samples
## 100 predictors
##   2 classes: 'colorectal.cancer', 'normal' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 110, 109, 109, 109, 109, 109, ... 
## Resampling results across tuning parameters:
## 
##   size  decay  Accuracy   Kappa    
##   1     0e+00  0.9200078  0.8390297
##   1     1e-04  0.9369270  0.8734836
##   1     1e-03  0.9306915  0.8610006
##   1     1e-02  0.9503885  0.9003449
##   1     1e-01  0.9503885  0.9003449
##   3     0e+00  0.9255245  0.8503915
##   3     1e-04  0.9329643  0.8656454
##   3     1e-03  0.9448329  0.8892338
##   3     1e-02  0.9503885  0.9003449
##   3     1e-01  0.9503885  0.9003449
##   5     0e+00  0.9334305  0.8661508
##   5     1e-04  0.9447941  0.8891131
##   5     1e-03  0.9476107  0.8947893
##   5     1e-02  0.9503885  0.9003449
##   5     1e-01  0.9503885  0.9003449
##   7     0e+00  0.9252331  0.8494441
##   7     1e-04  0.9394911  0.8785802
##   7     1e-03  0.9503885  0.9003449
##   7     1e-02  0.9503885  0.9003449
##   7     1e-01  0.9503885  0.9003449
##   9     0e+00  0.9278749  0.8556510
##   9     1e-04  0.9394911  0.8784573
##   9     1e-03  0.9503885  0.9003449
##   9     1e-02  0.9503885  0.9003449
##   9     1e-01  0.9503885  0.9003449
## 
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were size = 1 and decay = 0.1.

The best accuracy is 0.95 for the model were size = 1 and decay = 0.1.

Fine-tuning model parameters:

plot(nnfit)

Decision Trees (DT):

library("rpart")
library("rpart.plot")

set.seed(1)
tuning_grid <- expand.grid(
  maxdepth = c(1,2,3,4,5,6,7,8,9,10)
) 

tree_fit <- train(group~., data=top_100_expr, 
                 method = "rpart2", 
                 tuneLength = 6,
                 trControl = train.control,
                 metric = "Accuracy",
                 tuneGrid = tuning_grid
               )

tree_fit
## CART 
## 
## 121 samples
## 100 predictors
##   2 classes: 'colorectal.cancer', 'normal' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 110, 109, 109, 109, 109, 109, ... 
## Resampling results across tuning parameters:
## 
##   maxdepth  Accuracy   Kappa    
##    1        0.7494755  0.4968216
##    2        0.8125486  0.6262569
##    3        0.8355284  0.6714442
##    4        0.8355284  0.6714442
##    5        0.8355284  0.6714442
##    6        0.8355284  0.6714442
##    7        0.8355284  0.6714442
##    8        0.8355284  0.6714442
##    9        0.8355284  0.6714442
##   10        0.8355284  0.6714442
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was maxdepth = 3.

The best accuracy is 0.836 with depth 3.

Fine-tuning model parameters:

plot(tree_fit)

library(rattle)
fancyRpartPlot(tree_fit$finalModel)  

Gradient Boost Machine (GBM):

set.seed(1)

garbage <- capture.output(
gbm_fit <- train(group~., data=top_100_expr, 
                 method = "gbm", 
                 tuneLength = 5,
                 trControl = train.control,
                 metric = "Accuracy"
               ))

gbm_fit
## Stochastic Gradient Boosting 
## 
## 121 samples
## 100 predictors
##   2 classes: 'colorectal.cancer', 'normal' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 110, 109, 109, 109, 109, 109, ... 
## Resampling results across tuning parameters:
## 
##   interaction.depth  n.trees  Accuracy   Kappa    
##   1                   50      0.9170163  0.8333538
##   1                  100      0.9223582  0.8441302
##   1                  150      0.9251360  0.8496857
##   1                  200      0.9223582  0.8441302
##   1                  250      0.9251360  0.8496857
##   2                   50      0.9200466  0.8395684
##   2                  100      0.9253885  0.8503449
##   2                  150      0.9251360  0.8496857
##   2                  200      0.9256022  0.8508024
##   2                  250      0.9337218  0.8670115
##   3                   50      0.9198329  0.8392338
##   3                  100      0.9281663  0.8559004
##   3                  150      0.9283800  0.8563580
##   3                  200      0.9253497  0.8499290
##   3                  250      0.9279138  0.8550270
##   4                   50      0.9141997  0.8273837
##   4                  100      0.9170163  0.8334766
##   4                  150      0.9253885  0.8503449
##   4                  200      0.9253885  0.8503449
##   4                  250      0.9258159  0.8512614
##   5                   50      0.9086830  0.8168099
##   5                  100      0.9253885  0.8503449
##   5                  150      0.9281663  0.8559004
##   5                  200      0.9309441  0.8614560
##   5                  250      0.9306915  0.8607969
## 
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
## 
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were n.trees = 250, interaction.depth =
##  2, shrinkage = 0.1 and n.minobsinnode = 10.

The best accuracy is 0.934 with parameters: n.trees = 250, interaction.depth = 2, shrinkage = 0.1, and n.minobsinnode = 10.

Fine-tuning model parameters:

plot(gbm_fit)

6 Conclusions

The way of handling duplicated Probe IDs and Entrez IDs impacts the final dataset for analysis and genes’ p-values. I have experimented on different ways and observe very different DEG subset. The unclear ID mapping can be the first limitation of this study.
After preprocessing expression counts, DEGs are found with 0.05 adjusted p-value threshold. And up- and down-regulated genes have been plotted among DEGs between CRC and HC by fold change value. DEGs are significant as their p values are very small. However, it is not certain if the same result will happen in a larger dataset. Thus, the second limitation of this study is the small sample size.

Enriched GO terns are found according to the top DEGs with the lowest p-values. Top two biological processes are regulation of cellular metabolic process and regulation of nitrogen compound metabolic process.

Enriched pathways have been found for CRC. Top two pathways are ribosome and spliceosome.

As for CRC prediction with this tiny dataset with 121 samples and 100 predictors, the neural network has the best accuracy with a few neurons. Besides, SVM and GBM both achieve higher than 90% accuracy. It implies the high complexity a simple neural network can model. Parameter tuning impacts the model performance significantly from above fine-tuning plots. Cross validation maximizes the utilization of the dataset especially for the small dataset. Colinearity is found in the dataset, which shows some genes are highly correlated and it is harmful for the machine learning models.

comparison <- data.frame(
  Model = c("LDA", "GLM", "SVM", "NN", "DT", "GBM"),
  Accuracy = c(0.695, 0.669, 0.939, 0.950, 0.836, 0.934)
)

t(comparison)
##          [,1]    [,2]    [,3]    [,4]    [,5]    [,6]   
## Model    "LDA"   "GLM"   "SVM"   "NN"    "DT"    "GBM"  
## Accuracy "0.695" "0.669" "0.939" "0.950" "0.836" "0.934"

Again, the result of such small dataset may not be reliable. For such sample size, the number of predictors is considered quite large. Overfiting can exist. It should be validated and re-trained with a large dataset with enough samples.