Gene expression

A feasible roadmap for unsupervised deconvolution of two-source mixed gene expressions

Niya Wang 1, Eric P. Hoffman 2, Robert Clarke 3, Zhen Zhang 4, David M. Herrington 5, Ie-Ming Shih 4, Douglas A. Levine 6, Guoqiang Yu 1, Jianhua Xuan 1 and Yue Wang 1

1 Department of Electrical and Computer Engineering, Virginia Tech, Arlington, VA, USA; 2 Research Center for Genetic Medicine, Children's National Medical Center, Washington, DC, USA; 3 Lombardi Comprehensive Cancer Center, Georgetown University, Washington, DC 20057, USA; 4 Department of Pathology, Johns Hopkins University, Baltimore, MD, USA; 5 Department of Internal Medicine, Wake Forest University, Winston-Salem, NC 27157, USA; 6 Department of Surgery, Memorial Sloan-Kettering Cancer Center, New York, NY, USA

Received on XXXXX; revised on XXXXX; accepted on XXXXX
Associate Editor: XXXXXXX
Contact: yuewang@vt.edu
Supplementary information: Supplementary data are available at Bioinformatics online.

1 INTRODUCTION

Tissue heterogeneity is a major confounding factor in studying individual populations that cannot be resolved directly by global profiling (Hoffman, et al., 2004). Experimental solutions to mitigate tissue heterogeneity are expensive, time consuming, inapplicable to existing data, and may alter the original gene expression patterns (Kuhn, et al., 2011; Shen-Orr, et al., 2010). Alternatively, various in silico methods perform essentially supervised deconvolution based on either externally-obtained constituent proportions (Shen-Orr, et al., 2010; Stuart, et al., 2004) or previously-acquired cell-specific signatures (Kuhn, et al., 2011; Lu, et al., 2003).

In earlier issues of this journal, a few articles have reported semi-supervised methods that were specifically focused on dissecting two-source mixed gene expressions. Gosink et al. used (known) expression data from a single cell type to determine the proportion (and subsequently the expression profile) of each cell type in a heterogeneous sample (Gosink, et al., 2007). This method detects the minimum of the proportion, which provides a good estimate in noiseless or simulation data. Built upon this work, Clarke et al.
developed a geometry-based method that provides a more accurate estimate of this minimum in noisy real data and can be applied in situations where one or multiple heterologous samples are available (Clarke, et al., 2010). These methods assume a linear mixture of log-transformed expression levels that has recently been shown to be invalid (Zhong and Liu, 2012). Ahn et al. proposed a statistical approach for deconvolving linearly mixed cancer transcriptomes in individual samples under various raw measured data scenarios (Ahn, et al., 2013). This practical (most likely) solution again requires prior knowledge of gene expression of one tissue type from multiple similar samples and applies various heuristics.

Here we ask whether it is possible to deconvolute two-source mixed expressions (estimating both proportions and cell-specific profiles) from two or more heterogeneous samples without requiring any aforementioned prior knowledge (Wang, 2004). Supported by a well-grounded mathematical framework, we argue that both constituent proportions and cell-specific expressions can be estimated in a completely unsupervised mode when cell-specific marker genes exist, which do not have to be known a priori, for each of the constituent cell types. Fundamental to the success of our approach is the geometric exploitation of cell/condition-specific marker genes and expression non-negativity. Specifically, we show that (1) the scatter plot of mixed expressions is a compressed version of cell-specific expressions; (2) the resident genes on the two radii of the scatter sector are the cell-specific marker genes; (3) the radius vectors defined by the cell-specific marker genes are the column vectors of the mixing matrix; and (4) the rank of between-tissue differentially expressed genes is mixing-invariant. We demonstrate the performance of unsupervised deconvolution on both simulation and real gene expression data, together with perspective discussions.

2 THEORY AND METHOD
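2.1 Roadmap of unsupervised deconvolution

We adopt the linear latent variable model of raw measured expression data (Zhong and Liu, 2012), given by (bold font indicates column vectors)

$$\begin{bmatrix} x_{\mathrm{sample}1}(i) \\ x_{\mathrm{sample}2}(i) \end{bmatrix} = \begin{bmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{bmatrix} \begin{bmatrix} s_{\mathrm{tissue}1}(i) \\ s_{\mathrm{tissue}2}(i) \end{bmatrix}, \quad \text{or} \quad \mathbf{x}(i) = \mathbf{a}_1 s_{\mathrm{tissue}1}(i) + \mathbf{a}_2 s_{\mathrm{tissue}2}(i), \qquad (1)$$

where $s_{\mathrm{tissue}1}(i)$ and $s_{\mathrm{tissue}2}(i)$ are the gene expression values for pure tissues 1-2, and $x_{\mathrm{sample}1}(i)$ and $x_{\mathrm{sample}2}(i)$ are the gene expression values for heterogeneous samples 1-2, for genes $i = 1, \dots, n$, respectively; $\mathbf{a}_1 = [a_{11}\ a_{21}]^T$ and $\mathbf{a}_2 = [a_{12}\ a_{22}]^T$ are the column vectors of the mixing matrix; and $a_{jk}$ are the mixing proportions with $a_{11} + a_{12} = a_{21} + a_{22} = 1$ (after signal normalization).

We further adopt the concept of cell-specific marker genes (MG) (Gosink, et al., 2007; Kuhn, et al., 2011; Wang, 2004), i.e., genes whose expression is highly and exclusively enriched in a particular cell population in the context of interest, or mathematically $\mathbf{s}(i_{MG1}) = [\alpha_i\ 0]^T$ and $\mathbf{s}(i_{MG2}) = [0\ \beta_i]^T$. Since raw measured gene expression values $\mathbf{s}$ are non-negative, when cell-specific marker genes exist for each cell type, the linear latent variable model (1) is identifiable using two or more mixed expressions, as we will elaborate via the following theorems and their formal proofs (see Fig. 1 for a geometric illustration).

Theorem 1 (Scatter compression). Suppose that pure tissue expressions are non-negative and $\mathbf{x}(i) = \mathbf{a}_1 s_{\mathrm{tissue}1}(i) + \mathbf{a}_2 s_{\mathrm{tissue}2}(i)$ where $\mathbf{a}_1$ and $\mathbf{a}_2$ are linearly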
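independent, then the scatter plot of mixed expressions is compressed into a scatter sector whose two radii coincide with $\mathbf{a}_1$ and $\mathbf{a}_2$.

Proof of Theorem 1. Since $\mathbf{a}_1$ and $\mathbf{a}_2$ are linearly independent, without loss of generality, we assume that $a_{11}/a_{21} > a_{12}/a_{22}$, i.e., $a_{11}a_{22} > a_{12}a_{21}$. Multiplying both sides by $s_{\mathrm{tissue}2}(i)$ and adding $a_{11}a_{21}s_{\mathrm{tissue}1}(i)$ to both sides, since raw measured expressions are non-negative, we have

$$a_{11}a_{21}s_{\mathrm{tissue}1}(i) + a_{11}a_{22}s_{\mathrm{tissue}2}(i) \ge a_{11}a_{21}s_{\mathrm{tissue}1}(i) + a_{12}a_{21}s_{\mathrm{tissue}2}(i).$$

Simple mathematical reorganizations lead to

$$a_{11}\left(a_{21}s_{\mathrm{tissue}1}(i) + a_{22}s_{\mathrm{tissue}2}(i)\right) \ge a_{21}\left(a_{11}s_{\mathrm{tissue}1}(i) + a_{12}s_{\mathrm{tissue}2}(i)\right), \qquad \frac{a_{11}}{a_{21}} \ge \frac{a_{11}s_{\mathrm{tissue}1}(i) + a_{12}s_{\mathrm{tissue}2}(i)}{a_{21}s_{\mathrm{tissue}1}(i) + a_{22}s_{\mathrm{tissue}2}(i)}.$$

Since $\mathbf{x}(i) = \mathbf{a}_1 s_{\mathrm{tissue}1}(i) + \mathbf{a}_2 s_{\mathrm{tissue}2}(i)$, we have

$$\frac{a_{11}}{a_{21}} \ge \frac{x_{\mathrm{sample}1}(i)}{x_{\mathrm{sample}2}(i)}.$$

Using a similar strategy, we can show

$$\frac{x_{\mathrm{sample}1}(i)}{x_{\mathrm{sample}2}(i)} \ge \frac{a_{12}}{a_{22}},$$

which readily completes the proof.

Theorem 2 (Unsupervised identifiability). Suppose that pure tissue expressions are non-negative and cell-specific marker genes exist for each constituting tissue type, and $\mathbf{x}(i) = \mathbf{a}_1 s_{\mathrm{tissue}1}(i) + \mathbf{a}_2 s_{\mathrm{tissue}2}(i)$ where $\mathbf{a}_1$ and $\mathbf{a}_2$ are linearly independent, then the two radii of the scatter sector of mixed expressions coincide with $\mathbf{a}_1$ and $\mathbf{a}_2$, which can be readily estimated from marker gene expression values with appropriate rescaling.

Proof of Theorem 2. Based on the definition of cell-specific marker genes, i.e., $\mathbf{s}(i_{MG1}) = [\alpha_i\ 0]^T$ and $\mathbf{s}(i_{MG2}) = [0\ \beta_i]^T$, and the existence of cell-specific marker genes for all constituting tissue types, we have

$$\mathbf{x}(i_{MG1}) = \alpha_i \mathbf{a}_1, \qquad \mathbf{x}(i_{MG2}) = \beta_i \mathbf{a}_2.$$

By the conclusion of Theorem 1, we complete the proof.

Figure 1. Geometric and mathematical description of the mixing process. (Scatter plots of all genes and marker genes; axes $s_{\mathrm{tissue}1}$, $s_{\mathrm{tissue}2}$ for pure tissues and $x_{\mathrm{sample}1}$, $x_{\mathrm{sample}2}$ for mixed samples.)

Corollary 1 (Invariance of differential expression). Suppose that pure tissue expressions are non-negative and $\mathbf{x}(i) = \mathbf{a}_1 s_{\mathrm{tissue}1}(i) + \mathbf{a}_2 s_{\mathrm{tissue}2}(i)$ where $\mathbf{a}_1$ and $\mathbf{a}_2$ are linearly independent, then the rank of between-tissue differentially expressed genes is mixing-invariant.

Proof of Corollary 1. Without loss of generality, we assume that $a_{11}/a_{21} > a_{12}/a_{22}$, i.e., $a_{11}a_{22} - a_{12}a_{21} > 0$, and

$$\frac{s_{\mathrm{tissue}1}(i)}{s_{\mathrm{tissue}2}(i)} > \frac{s_{\mathrm{tissue}1}(j)}{s_{\mathrm{tissue}2}(j)}, \quad \text{i.e.,} \quad s_{\mathrm{tissue}1}(i)s_{\mathrm{tissue}2}(j) > s_{\mathrm{tissue}1}(j)s_{\mathrm{tissue}2}(i).$$

Since $\mathbf{a}_1$ and $\mathbf{a}_2$ are linearly independent and $a_{11}a_{22} - a_{12}a_{21} > 0$, multiplying both sides by $(a_{11}a_{22} - a_{12}a_{21})$ and adding $a_{11}a_{21}s_{\mathrm{tissue}1}(i)s_{\mathrm{tissue}1}(j)$ and $a_{12}a_{22}s_{\mathrm{tissue}2}(i)s_{\mathrm{tissue}2}(j)$ to both sides, we have, after simple mathematical reorganizations,

$$\left(a_{11}s_{\mathrm{tissue}1}(i) + a_{12}s_{\mathrm{tissue}2}(i)\right)\left(a_{21}s_{\mathrm{tissue}1}(j) + a_{22}s_{\mathrm{tissue}2}(j)\right) > \left(a_{11}s_{\mathrm{tissue}1}(j) + a_{12}s_{\mathrm{tissue}2}(j)\right)\left(a_{21}s_{\mathrm{tissue}1}(i) + a_{22}s_{\mathrm{tissue}2}(i)\right),$$

$$\frac{a_{11}s_{\mathrm{tissue}1}(i) + a_{12}s_{\mathrm{tissue}2}(i)}{a_{21}s_{\mathrm{tissue}1}(i) + a_{22}s_{\mathrm{tissue}2}(i)} > \frac{a_{11}s_{\mathrm{tissue}1}(j) + a_{12}s_{\mathrm{tissue}2}(j)}{a_{21}s_{\mathrm{tissue}1}(j) + a_{22}s_{\mathrm{tissue}2}(j)}.$$

Since $\mathbf{x}(i) = \mathbf{a}_1 s_{\mathrm{tissue}1}(i) + \mathbf{a}_2 s_{\mathrm{tissue}2}(i)$, we complete the proof with

$$\frac{s_{\mathrm{tissue}1}(i)}{s_{\mathrm{tissue}2}(i)} > \frac{s_{\mathrm{tissue}1}(j)}{s_{\mathrm{tissue}2}(j)} \;\Rightarrow\; \frac{x_{\mathrm{sample}1}(i)}{x_{\mathrm{sample}2}(i)} > \frac{x_{\mathrm{sample}1}(j)}{x_{\mathrm{sample}2}(j)}.$$

From Theorem 2, there exists a mathematical solution uniquely identifying the linear latent variable model (1) based on two or more mixed expressions: under a noise-free scenario, we can (in principle) directly estimate $\mathbf{a}_1$ and $\mathbf{a}_2$ by locating the two radii that most tightly enclose the scatter sector of mixed expressions. Moreover, Corollary 1 allows for between-tissue differential analysis from mixed expressions without requiring deconvolution.

2.2 Algorithm and evaluation criteria

So far, we have described the theoretical roadmap for unsupervised deconvolution of two-source mixed expressions. We now complete the description of our algorithm by considering the identification of marker genes or the mixing matrix, and its application to data deconvolution. Although the mixing matrix can be estimated using only one marker gene per tissue type, a more accurate solution, with practical applicability, is to estimate $\mathbf{a}_1$ and $\mathbf{a}_2$ using multiple marker genes.

Our unsupervised deconvolution begins by detecting the cell-specific marker genes directly from mixed expressions, in which the differential analysis of gene expressions is performed on all genes. Mathematically, MG is defined as an index set

$$MG = \left\{ i \;:\; k_{\max} - \frac{x_{\mathrm{sample}1}(i)}{x_{\mathrm{sample}2}(i)} \le \varepsilon \right\} \cup \left\{ i \;:\; \frac{x_{\mathrm{sample}1}(i)}{x_{\mathrm{sample}2}(i)} - k_{\min} \le \varepsilon \right\}, \qquad (2)$$

where $k_{\max}$ and $k_{\min}$ are the maximum and minimum ratios of $x_{\mathrm{sample}1}(i)$ over $x_{\mathrm{sample}2}(i)$ across all $i$, respectively; and $\varepsilon$ is a pre-fixed positive small real number.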
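As an illustration of the geometry described above, the following minimal Python sketch builds a noise-free synthetic two-source mixture and checks numerically that every gene's sample ratio falls inside the sector bounded by the mixing-matrix columns (Theorem 1), that the ordering of between-tissue ratios survives mixing (Corollary 1), and that genes within ε of the extreme ratios are the planted marker genes. The data, the mixing matrix `A`, the tolerance `eps` and the helper name `detect_markers` are all hypothetical choices made for this example, not values or code from the study.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical non-negative pure-tissue profiles for n genes.
n = 2000
s1 = rng.gamma(shape=2.0, scale=50.0, size=n)
s2 = rng.gamma(shape=2.0, scale=50.0, size=n)
s2[:50] = 0.0      # genes 0..49: markers of tissue 1
s1[50:100] = 0.0   # genes 50..99: markers of tissue 2

# Assumed mixing matrix (rows = samples, each row sums to 1),
# chosen so that a11/a21 > a12/a22.
A = np.array([[0.8, 0.2],
              [0.3, 0.7]])
X = A @ np.vstack([s1, s2])   # 2 x n mixed expressions, model (1)

# Theorem 1: a12/a22 <= x_sample1/x_sample2 <= a11/a21 for every gene.
ratio = X[0] / X[1]
upper = A[0, 0] / A[1, 0]
lower = A[0, 1] / A[1, 1]
inside_sector = bool(np.all((ratio <= upper + 1e-9) & (ratio >= lower - 1e-9)))

# Corollary 1: sorting genes by s1/s2 also sorts them by x_sample1/x_sample2.
nonmarker = np.arange(100, n)
order = np.argsort(s1[nonmarker] / s2[nonmarker])
rank_preserved = bool(np.all(np.diff(ratio[nonmarker][order]) >= -1e-12))

def detect_markers(X, eps):
    """Indices whose sample ratio lies within eps of the largest / smallest ratio."""
    r = X[0] / X[1]
    mg_high = np.flatnonzero(r.max() - r <= eps)   # radius along a1 here
    mg_low = np.flatnonzero(r - r.min() <= eps)    # radius along a2 here
    return mg_high, mg_low

mg1, mg2 = detect_markers(X, eps=1e-6)
print(inside_sector, rank_preserved, len(mg1), len(mg2))
```

With real, noisy data the extreme ratios are sensitive to outliers, which is presumably why the pre-processing and a non-trivial ε matter in practice; the tiny `eps` here only works because the example is noise-free.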
To obtain a reliable set of marker gene indices, some pre-processing steps are required, including mode/mean-based normalization and removal of minimally-expressed and outlier genes. On the basis of the expression values of detected cell-specific marker genes, the mixing matrix is estimated using standardized sample averages,

$$\hat{\mathbf{a}}_1 = \frac{1}{n_{MG1}} \sum_{i \in MG1} \frac{\mathbf{x}(i)}{\|\mathbf{x}(i)\|}, \qquad \hat{\mathbf{a}}_2 = \frac{1}{n_{MG2}} \sum_{i \in MG2} \frac{\mathbf{x}(i)}{\|\mathbf{x}(i)\|}, \qquad (3)$$

where MG1 and MG2 are the index sets of marker genes for tissue types 1 and 2, respectively; $n_{MG1}$ and $n_{MG2}$ are the numbers of marker genes for tissue types 1 and 2, respectively; and $\|\cdot\|$ denotes the vector norm (L1 or L2). The resulting $\hat{\mathbf{a}}_1$ and $\hat{\mathbf{a}}_2$ are then used to deconvolute the mixed expressions into cell-specific profiles via matrix inversion techniques.

Unsupervised deconvolution algorithm:
1) Normalize gene expression profile using global mean/mode;
2) Remove minimally-expressed genes whose norm is less than a pre-fixed positive small real number δ, and outlier genes whose norm is bigger than a pre-fixed positive large real number γ;
3) Detect the indices of cell-specific marker genes, for each of the constituting tissue types, according to (2);
4) Estimate mixing matrix according to (3), normalized to proportions;
5) Estimate cell-specific expression profiles using mixed expressions and matrix inversion technique(s).

We use four complementary evaluation criteria and known ground truth to assess the performance of the proposed unsupervised deconvolution method. To assess the accuracy of tissue proportion estimates, in addition to the classic correlation coefficient, we adopt the E criterion given by

$$E = \sum_{i} \left( \sum_{j} \frac{|p_{ij}|}{\max_k |p_{ik}|} - 1 \right) + \sum_{j} \left( \sum_{i} \frac{|p_{ij}|}{\max_k |p_{kj}|} - 1 \right), \qquad (4)$$

where $p_{ij}$ is the $ij$th element of the matrix $[\hat{\mathbf{a}}_1\ \hat{\mathbf{a}}_2]^{-1}[\mathbf{a}_1\ \mathbf{a}_2]$ with $\hat{\mathbf{a}}_1$ and $\hat{\mathbf{a}}_2$ being the estimated column vectors of the mixing matrix. Note that E is invariant to permutation or scaling and E = 0 when the estimation is perfect. To assess the accuracy of estimated cell-specific expression patterns, we calculate the correlation coefficient between the estimated expression profile and ground truth over marker genes and all genes respectively. Moreover, to assess the membership (and rank) match (and mismatch) between the marker genes detected from pure versus mixed expressions, we utilize Venn diagrams, together with Spearman's rank correlation coefficient. More details on the algorithm, parameter settings, and alternative schemes are included in the supplementary information.

3 EXPERIMENTAL RESULTS

3.1 Validation on cell line expression data

We first considered numerical mixtures of two human cell line expressions, a situation in which all factors are known and a linear mixture model is ideal. We reconstituted mixed expressions by multiplying the measured cell line expressions by the proportion of the tissue subset in a given heterogeneous sample (Fig. 1). We detected the cell-specific marker genes solely based on the reconstituted expression mixtures and accordingly obtained a highly accurate estimate of the mixing matrix with E = .95 (and correlation coefficient of .99) (Table 1).
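A numerical-mixture experiment of this kind can be mimicked on synthetic data. The sketch below is a minimal, noise-free illustration of algorithm steps 3) to 5) together with the E criterion; it is not the authors' implementation. It skips the normalization and filtering of steps 1) and 2), and the function names (`estimate_mixing`, `e_criterion`), the synthetic profiles, the mixing matrix `A_true` and the tolerance `eps` are hypothetical. One further assumption: "normalized to proportions" is realized here by rescaling the two estimated column directions so that every row of the mixing matrix sums to one, which matches the constraint stated for model (1).

```python
import numpy as np

def estimate_mixing(X, mg1, mg2, norm_ord=1):
    """Average unit-norm marker-gene vectors (in the spirit of Eq. 3),
    then rescale the two columns so that each row sums to 1."""
    def mean_direction(idx):
        cols = X[:, idx]
        return (cols / np.linalg.norm(cols, ord=norm_ord, axis=0)).mean(axis=1)
    A_dir = np.column_stack([mean_direction(mg1), mean_direction(mg2)])
    # Column scales d with A_dir @ d = 1 make the rows sum to one (assumed rescaling).
    scale = np.linalg.solve(A_dir, np.ones(A_dir.shape[0]))
    return A_dir * scale

def e_criterion(A_hat, A_true):
    """Eq. (4): zero when A_hat equals A_true up to permutation and scaling."""
    P = np.abs(np.linalg.inv(A_hat) @ A_true)
    row_term = (P / P.max(axis=1, keepdims=True)).sum(axis=1) - 1.0
    col_term = (P / P.max(axis=0, keepdims=True)).sum(axis=0) - 1.0
    return float(row_term.sum() + col_term.sum())

# Hypothetical pure profiles with planted marker genes, then a numerical mixture.
rng = np.random.default_rng(1)
n = 1000
S = rng.gamma(2.0, 50.0, size=(2, n))
S[1, :30] = 0.0      # markers of tissue 1
S[0, 30:60] = 0.0    # markers of tissue 2
A_true = np.array([[0.75, 0.25],
                   [0.25, 0.75]])
X = A_true @ S

# Step 3: marker genes are those within eps of the extreme sample ratios.
ratio = X[0] / X[1]
eps = 1e-6
mg1 = np.flatnonzero(ratio.max() - ratio <= eps)
mg2 = np.flatnonzero(ratio - ratio.min() <= eps)

# Steps 4 and 5: mixing matrix, then profiles by solving the 2 x 2 linear system.
A_hat = estimate_mixing(X, mg1, mg2)
S_hat = np.linalg.solve(A_hat, X)

E = e_criterion(A_hat, A_true)
corr = [float(np.corrcoef(S_hat[j], S[j])[0, 1]) for j in range(2)]
print(A_hat, E, corr)
```

In this idealized setting the estimate should match `A_true` to numerical precision; with measurement noise the marker sets, and hence both criteria, would depend on the choice of ε and on the pre-processing.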
For each cell line, comparison of the estimated expression profile of each type to the measured expression pattern in the pure cell line showed an almost perfect correlation with an average correlation coefficient of .99, indicating that we could accurately deconvolute the mixed expressions into constituent expression patterns in a completely unsupervised way.

Table 1. Identification of linear mixture model using numerical mixtures of two breast cancer cell line expressions.

Sample/Tissue    MCF7 (breast cancer)     HS27 (fibroblasts)
                 (assigned/estimated)     (assigned/estimated)
Sample 1         .8/.77                   86/9
Sample 2         .8/.86                   68/.66

Next, we tested our method on biologically mixed expressions from two breast cancer cell lines. The mRNA extracted from the individual cell lines was mixed in pre-specified proportions before subsequent procedures including amplification and microarray experiment (Table 2). Such mixtures mimic actual biological samples with varying relative abundances of the constituent subsets from one another (Fig. 2) (Kuhn, et al., 2011; Shen-Orr, et al., 2010).

Table 2. Identification of linear mixture model using biological mixtures of two breast cancer cell line expressions.

Sample/Tissue    MCF7 (breast cancer)     HS27 (fibroblasts)
                 (assigned/estimated)     (assigned/estimated)
Sample 1         .75/.76                  /.78
Sample 2         /.                       .75/.7576

The proposed method again accurately estimated the mixing proportions with E = .778 (and correlation coefficient of .99) (Table 2 and Fig. 2), and cell-specific expression patterns with an average correlation coefficient of .99 between the estimated expression profile of each type and the measured expression pattern in the pure cell line (Fig. 3). The high correlation that we achieved between the estimated proportions/tissue-expressions and ground truth suggests that unsupervised deconvolution of tissue-specific expression profiles from two-source heterogeneous samples using a linear model should yield accurate expression estimates for most genes.

Figure 2. Scatter plot of biological mixing and blind model identification.

Figure 3. Highly correlated scatter plots between the estimated and measured pure cell expression profiles (over marker and all genes).

The theoretical roadmap also enables the extended detection of differentially-expressed genes beyond marker genes, aimed at maximizing the information obtainable from mixed expressions. To assess the specificity and sensitivity of detecting differentially expressed genes without deconvolution, we compared the ranked index subsets of differentially expressed genes between samples to a gold standard set of differentially expressed genes identified from the pure cell line measurements, on both numerical and biological mixtures of two breast cancer cell line expressions. In addition to the Venn diagram and Spearman's rank correlation coefficient (r_rank = .9), receiver operating characteristics curve analysis showed the detection of differentially expressed genes based on mixed expressions (Corollary 1) to be both highly specific and sensitive with an area under the curve of .85 (supplementary information).

3.2 Analysis of benchmark expression data

As an example for the purpose of comparison, we also analyzed the same public benchmark gene expression dataset (AFFY) used by
Ahn et al. (Ahn, et al., 2013). This dataset consists of biologically mixed heterogeneous samples with varying proportions of human brain and heart tissues. We selected samples with brain/heart proportion ratios of 0%/100%, 25%/75%, 75%/25% and 100%/0%, each with replicate samples. In contrast to the semi-supervised methods that all require prior knowledge of gene expression of one tissue type (Ahn, et al., 2013; Clarke, et al., 2010; Gosink, et al., 2007), the 6 pure tissue samples were not used in any step of our proposed algorithm but simply served as the truth for assessment. For each proportion ratio, consistent with routine practice, we take the average of the replicates (with the same proportion ratio) as the mixed/observed expression profile to be analyzed by the algorithm. As aforementioned, in addition to accurate signal normalization (Wang, et al., 2002), to maintain data quality and computational efficiency, we selected a subset of genes for the subsequent analyses by excluding minimally-expressed and outlier genes (supplementary information) (Ahn, et al., 2013). This provided us with a reduced collection of probe sets across all samples.

Table 3. Unsupervised estimation of unknown tissue proportions on AFFY brain-heart mixture dataset.

Sample/Tissue    Brain                    Heart
                 (assigned/estimated)     (assigned/estimated)
Sample 1         /.                       .75/.7658
Sample 2         .75/.76                  /.78

With pre-processed raw measured data, we first evaluated how well the proposed method estimated tissue proportions in this dataset (Table 3). Without using any knowledge of either tissue-specific expression or constituent proportions, which other methods require (Ahn, et al., 2013; Clarke, et al., 2010; Gosink, et al., 2007; Kuhn, et al., 2011), our algorithm accurately estimated the unknown tissue proportions with a correlation coefficient of .99 (E = .7), as compared with a correlation coefficient of .98 produced by the semi-supervised method on the same dataset (Fig. 4) (Ahn, et al., 2013).

Figure 4. Scatter plot of brain-heart mixtures and proportion estimates.

Next, we examined how well the proposed method estimated tissue-specific expression patterns in this dataset. As shown in Fig.
5, the proposed method accurately and blindly estimated the gene expressions of pure brain and pure heart tissues, with a correlation coefficient of .96-.99 between the estimated mean tissue expression levels and measured mean pure tissue expression levels, as compared to a correlation coefficient of .88-.95 produced by the semi-supervised method on the same dataset. These results suggest that this unsupervised deconvolution method is able to accurately deconvolute two-source mixed expressions (estimating both proportions and cell-specific profiles) from two or more heterogeneous samples. Detailed information on additional experimental results (tables, figures, datasets) is included in the supplementary information.

4 DISCUSSION

In this letter, we presented a feasible roadmap for unsupervised deconvolution of two-source mixed expressions, supported by the newly proved theorems under realistic conditions and experimental tests on real gene expression data. One important advantage of this unsupervised deconvolution approach lies in its unique and proven ability to detect cell-specific marker genes and estimate constituent proportions directly from mixed expressions when the relevant prior knowledge is either unreliable or unavailable. This is significant, in relation to semi-supervised methods, since it is well-known that (1) cell-specific marker genes (membership and expression) are condition-specific and (2) the total amount of mRNA from the same volume of cancer cells is much higher than that of normal cells (due to unknown tumor ploidy) (Clarke, et al., 2010).

Figure 5. Estimation of tissue-specific gene expressions from AFFY: scatter plots comparing deconvolved mean brain/heart tissue expression values with measured mean pure brain/heart tissue expression values.

We foresee a variety of extensions to the concepts and strategies in the proposed method. For example, with further development, intratumor heterogeneity can be revealed in terms of hidden subclonal marker genes and subclonal repopulation dynamics.
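There is also a possible way to estimate the marker expression profiles for individual samples (Ahn, et al., 2013). Rewrite (1) as

$$x_{\mathrm{sample}1}(i) = a_{11}\left[s_{\mathrm{tissue}1}(i) + \Delta s_{\mathrm{tissue}1,\mathrm{sample}1}(i)\right] + a_{12}\left[s_{\mathrm{tissue}2}(i) + \Delta s_{\mathrm{tissue}2,\mathrm{sample}1}(i)\right],$$
$$x_{\mathrm{sample}2}(i) = a_{21}\left[s_{\mathrm{tissue}1}(i) + \Delta s_{\mathrm{tissue}1,\mathrm{sample}2}(i)\right] + a_{22}\left[s_{\mathrm{tissue}2}(i) + \Delta s_{\mathrm{tissue}2,\mathrm{sample}2}(i)\right], \qquad (5)$$

where $\Delta s_{\mathrm{tissue}j,\mathrm{sample}1}(i)$ and $\Delta s_{\mathrm{tissue}j,\mathrm{sample}2}(i)$ are the sample-specific variations in pure tissues. Then, for marker genes, we have

$$x_{\mathrm{sample}1}(i_{MGj}) = a_{1j}\left(s_{\mathrm{tissue}j}(i_{MGj}) + \Delta s_{\mathrm{tissue}j,\mathrm{sample}1}(i_{MGj})\right),$$
$$x_{\mathrm{sample}2}(i_{MGj}) = a_{2j}\left(s_{\mathrm{tissue}j}(i_{MGj}) + \Delta s_{\mathrm{tissue}j,\mathrm{sample}2}(i_{MGj})\right), \qquad (6)$$

where $j$ is the tissue type index. According to (3), $\hat{a}_{1j}$ and $\hat{a}_{2j}$ are obtained via some form of averaging over tissue-specific marker genes, where for each sample we may reasonably assume

$$\frac{1}{n_{MGj}} \sum_{i \in MGj} \Delta s_{\mathrm{tissue}j,\mathrm{sample}}(i) \approx 0. \qquad (7)$$

Denoting $s_{\mathrm{tissue}j,\mathrm{sample}k}(i_{MGj}) = s_{\mathrm{tissue}j}(i_{MGj}) + \Delta s_{\mathrm{tissue}j,\mathrm{sample}k}(i_{MGj})$, we have

$$s_{\mathrm{tissue}j,\mathrm{sample}k}(i_{MGj}) = \frac{x_{\mathrm{sample}k}(i_{MGj})}{\hat{a}_{kj}}, \qquad (8)$$

for each of $k$ and $j$, where $k$ is the sample index.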
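The sample-specific estimate for marker genes amounts to dividing each sample's mixed marker-gene expression by that sample's estimated proportion of the corresponding tissue. A minimal Python sketch, assuming the mixing matrix and marker index sets have already been estimated, follows; the function name `sample_specific_marker_profiles` and all numbers are hypothetical and chosen only so that the recovered values can be checked by hand.

```python
import numpy as np

def sample_specific_marker_profiles(X, A_hat, marker_sets):
    """For tissue j and sample k, return x_samplek(i) / a_kj over marker genes i of tissue j."""
    profiles = {}
    for j, idx in enumerate(marker_sets):
        for k in range(X.shape[0]):
            profiles[(k, j)] = X[k, idx] / A_hat[k, j]
    return profiles

# Assumed (already estimated) mixing proportions: rows = samples, columns = tissues.
A_hat = np.array([[0.75, 0.25],
                  [0.25, 0.75]])

# Hypothetical tissue-1 marker genes: shared profile plus sample-specific variation
# that averages to zero over the marker set in each sample.
base = np.array([100.0, 200.0, 150.0])
delta = np.array([[5.0, -10.0, 5.0],
                  [-5.0, 10.0, -5.0]])
S1 = base + delta                                   # rows = samples

# Hypothetical tissue-2 marker genes with their own sample-specific values.
S2 = np.array([[80.0, 120.0],
               [90.0, 110.0]])

# Marker genes receive a contribution from their own tissue only.
X = np.zeros((2, 5))
X[:, :3] = A_hat[:, [0]] * S1
X[:, 3:] = A_hat[:, [1]] * S2

profiles = sample_specific_marker_profiles(
    X, A_hat, [np.arange(3), np.arange(3, 5)]
)
print(profiles[(0, 0)], profiles[(1, 1)])
```

The division is only meaningful for marker genes, where the other tissue contributes nothing; for non-marker genes the two tissue contributions cannot be separated from a single sample this way.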
Funding: National Institutes of Health, under Grants NS955, CA97, CA66, HL6, in part.

Conflict of Interest: none declared.

REFERENCES

Ahn, J., et al. (2013) DeMix: deconvolution for mixed cancer transcriptomes using raw measured data, Bioinformatics, 29, 1865-1871.
Clarke, J., Seo, P. and Clarke, B. (2010) Statistical expression deconvolution from mixed tissue samples, Bioinformatics, 26, 1043-1049.
Gosink, M.M., Petrie, H.T. and Tsinoremas, N.F. (2007) Electronically subtracting expression patterns from a mixed cell population, Bioinformatics, 23, 3328-3334.
Hoffman, E.P., et al. (2004) Expression profiling-best practices for data generation and interpretation in clinical trials, Nat. Rev. Genet., 5, 229-237.
Kuhn, A., et al. (2011) Population-specific expression analysis (PSEA) reveals molecular changes in diseased brain, Nat Methods, 8, 945-947.
Lu, P., Nakorchevskiy, A. and Marcotte, E.M. (2003) Expression deconvolution: a reinterpretation of DNA microarray data reveals dynamic changes in cell populations, Proceedings of the National Academy of Sciences of the United States of America, 100, 10370-10375.
Shen-Orr, S.S., et al. (2010) Cell type-specific gene expression differences in complex tissues, Nat Methods, 7, 287-289.
Stuart, R.O., et al. (2004) In silico dissection of cell-type-associated patterns of gene expression in prostate cancer, Proc. Natl. Acad. Sci., 101, 615-620.
Wang, Y. (2004) Independent Component Imaging. US Patent 6,728,396.
Wang, Y., et al. (2002) Iterative normalization of cDNA microarray data, IEEE Trans Info. Tech. Biomed, 6, 29-37.
Zhong, Y. and Liu, Z. (2012) Gene expression deconvolution in linear space, Nat Methods, 9, 8-9; author reply 9.