Vanderbilt University




What Drives Cancer?

By: Carol A. Rouzer, VICB Communications
Published: July 28, 2014

Proteomics analysis of 95 colorectal cancers provides insight into the biochemical alterations that drive the malignant phenotype in this disease.

The realization that cancer is primarily a disease caused by multiple gene abnormalities led to the founding of The Cancer Genome Atlas (TCGA), a massive NIH-funded project that aims to identify the specific genetic damage that drives malignant behavior in over 25 different kinds of cancer. TCGA has already generated huge datasets that provide important insights into many cancers. However, the results have raised as many questions as they have answered. We now can fully appreciate the staggering amount of genetic damage that is present in most forms of cancer and the heterogeneity that exists between tumors from different patients. Yet, in many cases, identification of the precise genetic changes that are necessary to produce malignancy in each kind of cancer remains a challenge. Key to meeting this challenge will be to learn how genetic damage in a cancer cell leads to changes in expression of the proteins encoded by the damaged genes. This is the mission of the Clinical Proteomic Tumor Analysis Consortium (CPTAC), which is working to acquire proteomic data to match the genomic data acquired by TCGA. Now, VICB members Dan Liebler, Bing Zhang, and Dave Tabb, along with their CPTAC collaborators, report the first complete proteomics analysis of 95 TCGA colorectal cancer (CRC) samples [B. Zhang et al., (2014) Nature, published online July 20, DOI:10.1038/nature13438].

The CPTAC team first established a highly reproducible shotgun proteomics protocol for the analysis of large numbers of tumor samples (Figure 1). The investigators digest the samples with trypsin to obtain peptides that are then fractionated by basic reverse-phase liquid chromatography (bRPLC) before analysis by liquid chromatography-tandem mass spectrometry (LC-MS/MS). Application of multiple computer programs to the large LC-MS/MS datasets yields peptide identifications, which are then assembled into proteins. The total number of MS/MS spectra acquired for each protein provides an estimate of the relative quantity of that protein through an approach referred to as “spectral counting”.

Figure 1.  Outline of the method used for proteomics analysis of TCGA CRC samples. The tumor sample is homogenized, and the proteins are digested into peptides using trypsin. An initial purification by basic reverse-phase liquid chromatography (bRPLC) separates the peptides into 15 fractions, and each fraction is then analyzed by LC-MS/MS. The large number of spectra resulting from this analysis is processed by a series of computer programs that determine the sequence of each peptide and identify the protein from which the peptide is derived based on the sequence. The proteins are then assembled into a database for further analysis. Image reproduced by permission from Macmillan Publishers Ltd, from B. Zhang et al., (2014) Nature, published online July 20, DOI:10.1038/nature13438, copyright 2014.
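The spectral counting idea described above can be illustrated with a minimal Python sketch. It assumes that each identified spectrum has already been assigned to exactly one protein; the protein and peptide names are placeholders, and the helper functions are hypothetical rather than part of the CPTAC software. Real pipelines must also handle peptides shared between proteins and apply normalization, which this sketch omits.

```python
from collections import Counter

# Hypothetical peptide-spectrum matches: one (spectrum_id, peptide, protein)
# tuple per identified MS/MS spectrum. All identifiers are placeholders.
psms = [
    ("spec1", "PEPTIDE_A1", "PROT_A"),
    ("spec2", "PEPTIDE_A1", "PROT_A"),
    ("spec3", "PEPTIDE_A2", "PROT_A"),
    ("spec4", "PEPTIDE_B1", "PROT_B"),
]

def spectral_counts(matches):
    """Count the identified spectra assigned to each protein."""
    return Counter(protein for _, _, protein in matches)

def relative_abundance(counts):
    """Express each protein's spectral count as a fraction of all spectra."""
    total = sum(counts.values())
    return {protein: n / total for protein, n in counts.items()}

counts = spectral_counts(psms)
fractions = relative_abundance(counts)
for protein in sorted(counts):
    print(protein, counts[protein], round(fractions[protein], 2))
```

In this toy input, the protein with three matched spectra is estimated to be three times as abundant as the protein with one, which is the sense in which spectral counts serve as a relative measure.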

Analysis of their 95 TCGA CRC samples yielded 6,299,756 spectra from which the team identified 124,823 peptides representing 7,526 proteins that corresponded to 7,211 genes. Among the identified proteins, the investigators found 796 single amino acid variants (SAAVs). Of these, 64 had been previously identified as somatic variants by TCGA, and 101 had been reported in the COSMIC (Catalogue of Somatic Mutations in Cancer) Database. Thus, good evidence suggested that these variants were tumor-related. In contrast, 526 of the variants were listed in the single nucleotide polymorphism database but not in the other two cancer-related databases. These were likely germline variants that were present in the patient’s DNA from birth. The remaining 162 variants had not been reported previously (Figure 2). The total of 108 SAAVs that were listed either by the TCGA or the COSMIC database mapped to 105 genes, many of which were known cancer genes, such as KRAS, CTNNB1, SF3B1, ALDH2, and FH. Fourteen of these genes, such as PARP1, TST, GAK, and HSD17B4, are the targets of FDA-approved anti-cancer agents, or agents currently in clinical trials.


Figure 2.  Classification of single amino acid variants (SAAVs) identified in CRC proteomes. Of the 796 variants found, 64 were identified as somatic variants by TCGA (right circle), 101 were listed in the COSMIC database (center circle), and 526 were found in a database of single nucleotide polymorphisms (left circle). The Venn diagram illustrates overlap between the three groups. The remaining 162 variants were classified as new. Image reproduced by permission from Macmillan Publishers Ltd, from B. Zhang et al., (2014) Nature, published online July 20, DOI:10.1038/nature13438, copyright 2014.
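The variant counts reported above can be cross-checked with simple set arithmetic. The Python sketch below first verifies that the published totals are mutually consistent, assuming (as the text states) that the 526 polymorphism-database variants appear in neither cancer database, and then shows one way such a classification could be expressed. The classify_saav function, its category labels, and the toy variant identifiers are hypothetical and are not taken from the study.

```python
# Counts reported in the article.
n_tcga, n_cosmic = 64, 101        # somatic in TCGA; listed in COSMIC
n_tcga_or_cosmic = 108            # listed in at least one cancer database
n_snp_only, n_new, n_total = 526, 162, 796

# Inclusion-exclusion gives the number listed in both TCGA and COSMIC.
n_both = n_tcga + n_cosmic - n_tcga_or_cosmic
print("In both TCGA and COSMIC:", n_both)

# The three disjoint groups should account for every variant.
assert n_tcga_or_cosmic + n_snp_only + n_new == n_total

def classify_saav(variant, tcga_somatic, cosmic, snp_db):
    """Assign a variant to one category (hypothetical precedence rule)."""
    if variant in tcga_somatic or variant in cosmic:
        return "cancer-related"
    if variant in snp_db:
        return "likely germline"
    return "new"

# Toy reference sets with placeholder variant identifiers.
tcga_somatic = {"GENE1:A10V", "GENE2:G12D"}
cosmic = {"GENE2:G12D", "GENE3:R50Q"}
snp_db = {"GENE3:R50Q", "GENE4:T7M"}

for v in ["GENE1:A10V", "GENE3:R50Q", "GENE4:T7M", "GENE5:L3P"]:
    print(v, classify_saav(v, tcga_somatic, cosmic, snp_db))
```

The consistency check implies that 57 variants appear in both TCGA and COSMIC, which is why the 64 and 101 variants combine to 108 rather than 165.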

For 87 of the tumors, TCGA provided mRNA-seq data from which levels of gene transcription could be ascertained. The CPTAC team combined these data with their proteomics data to evaluate the correlation between mRNA and protein levels. For each gene within a given tumor, they found a strong positive correlation between mRNA and protein expression levels (coefficient of 0.47). However, when the investigators evaluated the relationship between mRNA and protein levels for each gene across the 87 tumors, 89% of the genes gave a positive correlation, but only 32% of the correlations were statistically significant. The use of KEGG (Kyoto Encyclopedia of Genes and Genomes) analysis to separate the genes according to their biological function revealed that those involved in some metabolic processes, such as amino acid or fatty acid metabolism, showed a stronger mRNA-protein correlation, while those involved in other processes, such as RNA splicing and oxidative phosphorylation, showed a much weaker correlation. Furthermore, genes and proteins that were stably expressed were more likely to exhibit a positive correlation. These results suggest that mRNA levels are a poor indicator of protein expression for many genes, and that gene expression patterns and protein function influence the relationship between the two parameters.
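The difference between the two comparisons described above (across genes within one tumor versus across tumors for one gene) can be made concrete with a short Python sketch. It uses a rank-based (Spearman) correlation, which is an assumption because the article does not name the statistic, and the expression values are invented toy numbers. Significance testing is omitted.

```python
from statistics import mean

def ranks(values):
    """Return 1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result

def pearson(x, y):
    """Pearson correlation; undefined (division by zero) for constant input."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

def spearman(x, y):
    """Spearman correlation computed as Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Toy data: gene -> values across four hypothetical tumors.
mrna = {"geneA": [1.0, 2.0, 3.0, 4.0],
        "geneB": [5.0, 5.5, 6.0, 6.5],
        "geneC": [9.0, 8.0, 9.5, 8.5]}
protein = {"geneA": [10, 20, 30, 40],
           "geneB": [52, 50, 51, 49],
           "geneC": [90, 80, 95, 85]}
genes = sorted(mrna)

# Across genes, within each tumor.
within_tumor = [spearman([mrna[g][t] for g in genes],
                         [protein[g][t] for g in genes]) for t in range(4)]

# Across tumors, for each gene.
per_gene = {g: spearman(mrna[g], protein[g]) for g in genes}

print("within tumor:", within_tumor)
print("per gene:", per_gene)
```

In this toy data every tumor shows a perfect within-tumor correlation because abundant genes stay abundant at both levels, yet geneB is negatively correlated across tumors. The example is only meant to show why the two comparisons can give different answers, not to reproduce the published coefficients.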

TCGA had identified numerous copy number variations (CNVs) in its genomic analysis of the CRC samples, with 17 regions of significant focal amplification and 28 regions of significant focal deletion. The CPTAC team found a strong positive correlation between the regions of focal amplification and levels of the mRNAs encoded by those regions. This correlation was weaker for regions of non-focal amplification, and no correlation was observed for regions of focal deletion. Furthermore, the investigators discovered that correlations between regions of gene amplification and levels of the encoded proteins were much weaker than those observed for mRNA. Of particular interest was the finding that some CNV hotspots correlated with altered expression of mRNA not directly encoded by the altered DNA. The five strongest CNV hotspots that exhibited this behavior were on chromosomes 20q, 18, 16, 13, and 7. However, most of these effects did not translate to the level of protein expression. Of all of the hotspots, amplification of 20q had the greatest global impact on both mRNA and protein expression. Among the affected proteins were HNF4α, a transcription factor known to play a role in both normal gastrointestinal development and CRC, Tom34, which is known to promote the growth of CRC cells, and Src, the non-receptor tyrosine kinase that plays a role in many forms of cancer, including CRC. These findings suggest a previously unappreciated role for 20q amplification in CRC.

The CPTAC investigators used their proteomics data to search for patterns of abnormal protein expression among the tumors that they evaluated. The results led to classification of the tumors into five subtypes. These subtypes exhibited some similarities to previously published classifications of CRC based on gene expression data; however, there were important differences. Of particular interest was the finding that subtypes B and C both exhibited characteristics consistent with microsatellite instability, normally an indicator of a good prognosis. On the other hand, subtype C also expressed “stem-like” qualities, usually associated with a poor prognosis. The researchers investigated the characteristics of the subtypes further as they searched for protein expression signatures that distinguish each of the subtypes (Figure 3). The results of these efforts demonstrated that subtype C tumors exhibit reduced expression of proteins related to the E-cadherin, β-catenin, and α-catenin complex and increased expression of collagens and extracellular matrix glycoproteins (Figure 4). Both of these observations are characteristic of tumors undergoing epithelial-mesenchymal transition, an indicator of poor prognosis. Thus, the investigators could conclude that some tumors that exhibit microsatellite instability likely have a much poorer prognosis than would be predicted from that characteristic alone.

Figure 3.  Identification of the signature proteins of each CRC subtype. In each case, the pink circle represents the number of proteins exhibiting higher (top) or lower (bottom) levels of expression than all other subtypes. The green circle provides the same data for each subtype relative to normal colon tissue. The intersection of the two circles represents signature proteins for each of the subtypes. Image reproduced by permission from Macmillan Publishers Ltd, from B. Zhang et al., (2014) Nature, published online July 20, DOI:10.1038/nature13438, copyright 2014.
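The intersection logic in Figure 3 can be sketched in a few lines of Python. The protein names and the signature_proteins helper are hypothetical, and the sketch assumes that the two input lists (proteins differing from all other subtypes, and proteins differing from normal colon tissue) have already been produced by a separate statistical comparison.

```python
def signature_proteins(vs_other_subtypes, vs_normal):
    """Proteins altered relative to both the other subtypes and normal tissue."""
    return set(vs_other_subtypes) & set(vs_normal)

# Toy example for one subtype and one direction (higher expression).
higher_than_other_subtypes = {"PROT1", "PROT2", "PROT3"}
higher_than_normal = {"PROT2", "PROT3", "PROT4"}

signature = signature_proteins(higher_than_other_subtypes, higher_than_normal)
print(sorted(signature))
```

The same function would be applied separately to the lower-expression lists and repeated for each subtype.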

Figure 4.  (a) Network diagrams of some of the signature proteins discovered in CRC subtype C. Shown are the interrelationships between proteins associated with E-cadherin (CDH1), β-catenin (CTNNB1), and α-catenin (CTNNA1) (left), which are down-regulated in subtype C, and the interrelationships between collagens (COL4A2, COL14A1, COL4A3, COL18A1, COL3A1, COL5A2, COL5A1, COL2A1, COL1A1) and matrix glycoproteins (FBLN2, FBLN5, PCOLCE, BGN, FN1, HRG, MFAP2), which are up-regulated in subtype C. Image reproduced by permission from Macmillan Publishers Ltd, from B. Zhang et al., (2014) Nature, published online July 20, DOI:10.1038/nature13438, copyright 2014.

Together, the results demonstrate that genomics, transcriptomics, and proteomics must all be brought to bear in the quest to understand cancer. The proteomics data generated by CPTAC reinforce and confirm many conclusions drawn from TCGA data. However, the proteomics data also provide new insights and demonstrate that genomic and transcriptomic data alone cannot be used to predict changes in protein levels. CPTAC has set a new standard for the approach to understanding cancer biology.












The Vanderbilt Institute of Chemical Biology 896 Preston Building, Nashville, TN 37232-6304 866.303 VICB (8422) fax 615 936 3884
Vanderbilt University is committed to principles of equal opportunity and affirmative action. Copyright © 2013 by Vanderbilt University Medical Center