Marginalized kernels for biological sequences

Koji Tsuda, Taishin Kin and Kiyoshi Asai
AIST, 2-41-6 Aomi Koto-ku, Tokyo, Japan
Presented by Shihai Zhao, May 8, 2014

Overview
1. Introduction: kernel functions, hidden Markov model
2. Methods: new kernel, connections to the Fisher kernel
3. Results and Conclusion

kernel functions
In kernel methods such as SVMs, a kernel function has to be chosen a priori.
Supervised learning: the objective function is clear, and kernels are designed to optimize it.
Unsupervised learning: the choice of kernel is subjective; it is chosen to reflect the user's notion of similarity.

kernel functions for sequences
Texts: count features, which represent the number of occurrences of each symbol in a sequence.
Biological sequences: count features do not work out of the box, primarily because of frequent context changes.
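As a point of reference, a plain count kernel can be sketched in a few lines. This is a minimal illustration, assuming counts are normalized by sequence length and a DNA alphabet; the function names `count_features` and `count_kernel` are hypothetical and not taken from the talk.

```python
from collections import Counter

def count_features(seq, alphabet):
    """Return the length-normalized count of each alphabet symbol in seq."""
    counts = Counter(seq)
    n = len(seq)
    return [counts[s] / n for s in alphabet]

def count_kernel(x, y, alphabet="ACGT"):
    """Inner product of the two count-feature vectors (assumed normalization)."""
    fx = count_features(x, alphabet)
    fy = count_features(y, alphabet)
    return sum(a * b for a, b in zip(fx, fy))

k = count_kernel("AACG", "ACGT")
```

Because the features ignore position, any two sequences with the same symbol composition are indistinguishable under this kernel, which is why context changes along a biological sequence are a problem for it.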

A DNA sequence with hidden context information.
Suppose the hidden variable $h$ indicates coding/noncoding regions.

New way to design a kernel
[Figure: diagram with the labels Visible, Hidden, HMM, Joint & Marginalized]

HMM
A hidden Markov model (HMM) is a statistical Markov model in which the system being modeled is assumed to be a Markov process with unobserved states.
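For the kernels discussed in this talk, the useful quantity from an HMM is the posterior over hidden states given an observed sequence. The following is a minimal sketch, assuming a discrete HMM described by hypothetical `init`, `trans` and `emit` tables (the toy numbers are illustrative, not from the paper). It uses the standard forward-backward recursions without scaling, so it is only suitable for short, non-empty sequences.

```python
def state_posteriors(obs, init, trans, emit):
    """Return p(h_i = s | obs) for every position i via forward-backward.

    init[s], trans[s][t] and emit[s][symbol] are HMM probabilities.
    """
    n_states = len(init)
    n = len(obs)
    # Forward pass: alpha[i][s] = p(obs[0..i], h_i = s)
    alpha = [[0.0] * n_states for _ in range(n)]
    for s in range(n_states):
        alpha[0][s] = init[s] * emit[s][obs[0]]
    for i in range(1, n):
        for t in range(n_states):
            prev = sum(alpha[i - 1][s] * trans[s][t] for s in range(n_states))
            alpha[i][t] = emit[t][obs[i]] * prev
    # Backward pass: beta[i][s] = p(obs[i+1..] | h_i = s)
    beta = [[1.0] * n_states for _ in range(n)]
    for i in range(n - 2, -1, -1):
        for s in range(n_states):
            beta[i][s] = sum(
                trans[s][t] * emit[t][obs[i + 1]] * beta[i + 1][t]
                for t in range(n_states)
            )
    likelihood = sum(alpha[n - 1])
    return [
        [alpha[i][s] * beta[i][s] / likelihood for s in range(n_states)]
        for i in range(n)
    ]

# Toy two-state model: state 0 prefers A/T, state 1 prefers C/G (assumed values).
init = [0.5, 0.5]
trans = [[0.9, 0.1], [0.1, 0.9]]
emit = [
    {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
    {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
]
post = state_posteriors("GCGC", init, trans, emit)
```

Each row of `post` is a distribution over the hidden states at one position.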

Example of HMM
Only a limited number of sequences with known structures are available.
We want to train four HMMs, one per secondary structure, to make predictions: Helix, Sheet, Turn, Other.

block diagram

HMMs of secondary structures
Combined HMM for prediction

marginalized kernel
Let $x, x' \in \mathcal{X}$ and $h, h' \in \mathcal{H}$, where $\mathcal{H}$ is a finite set, and write $z = (x, h)$, $z' = (x', h')$.
\[ K(x, x') = \sum_{h \in \mathcal{H}} \sum_{h' \in \mathcal{H}} p(h \mid x)\, p(h' \mid x')\, K_z(z, z') \]
The posterior $p(h \mid x)$ has to be estimated from the data.
When the cardinality of $\mathcal{H}$ is too large, the calculation can be intractable.
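The double sum over hidden variables can be written directly when the hidden set is small. This is a brute-force sketch under stated assumptions: the posterior and the joint kernel are passed in as functions, and the lookup table `POST`, the helper `toy_posterior` and the indicator `toy_joint_kernel` are hypothetical toy choices, not the models used in the paper.

```python
def marginalized_kernel(x, y, hidden_states, posterior, joint_kernel):
    """K(x, y) = sum_h sum_h' p(h|x) p(h'|y) K_z((x, h), (y, h'))."""
    total = 0.0
    for h in hidden_states:
        for h2 in hidden_states:
            total += posterior(h, x) * posterior(h2, y) * joint_kernel((x, h), (y, h2))
    return total

# Toy posteriors over two contexts, "c" (coding) and "n" (noncoding).
POST = {"x1": {"c": 0.9, "n": 0.1}, "x2": {"c": 0.2, "n": 0.8}}

def toy_posterior(h, x):
    return POST[x][h]

def toy_joint_kernel(z, z2):
    """Joint kernel that only checks whether the hidden contexts agree."""
    return 1.0 if z[1] == z2[1] else 0.0

k12 = marginalized_kernel("x1", "x2", ["c", "n"], toy_posterior, toy_joint_kernel)
k11 = marginalized_kernel("x1", "x1", ["c", "n"], toy_posterior, toy_joint_kernel)
```

The loop costs on the order of the square of the number of hidden configurations, which illustrates why the computation becomes intractable when the hidden set is large and a structured joint kernel is needed.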

marginalized kernel from Gaussian mixture
\[ K(x, x') = \sum_{h \in \mathcal{H}} p(h \mid x)\, p(h \mid x')\, x^{\top} A_h x' \]
where $A_h$ is the inverse of the covariance matrix.
Distance in feature space:
\[ D(x, x') = \sqrt{K(x, x) + K(x', x') - 2K(x, x')} \]
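A small numerical sketch of the Gaussian-mixture case, assuming the component posteriors have already been estimated and are passed in as lists, and that each `A[h]` is the inverse covariance of component `h`. The distance is computed as the square root of K(x, x) + K(x', x') - 2K(x, x'), the usual norm induced by a kernel. All names and numbers are hypothetical.

```python
import math

def quad_form(x, A, y):
    """Compute x^T A y for list-based vectors and a nested-list matrix."""
    return sum(x[i] * A[i][j] * y[j] for i in range(len(x)) for j in range(len(y)))

def gm_marginalized_kernel(x, y, post_x, post_y, A):
    """K(x, y) = sum_h p(h|x) p(h|y) x^T A_h y."""
    return sum(post_x[h] * post_y[h] * quad_form(x, A[h], y) for h in range(len(A)))

def feature_distance(x, y, post_x, post_y, A):
    """Distance between x and y in the feature space induced by the kernel."""
    kxx = gm_marginalized_kernel(x, x, post_x, post_x, A)
    kyy = gm_marginalized_kernel(y, y, post_y, post_y, A)
    kxy = gm_marginalized_kernel(x, y, post_x, post_y, A)
    # Clamp at zero to guard against tiny negative values from rounding.
    return math.sqrt(max(kxx + kyy - 2.0 * kxy, 0.0))

# Two components with assumed inverse covariances I and 2I.
A = [[[1.0, 0.0], [0.0, 1.0]], [[2.0, 0.0], [0.0, 2.0]]]
d_same = feature_distance([1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 0.0], A)
d_diff = feature_distance([1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0], A)
```

In the second call the two points are assigned to different components, so the cross term vanishes and the distance depends only on each point's own component.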

marginalized count kernel
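One way to compute a marginalized count kernel is sketched below. It assumes the kernel counts (symbol, hidden state) pairs, weights each position by the posterior probability of its hidden state, and normalizes by sequence length. The per-position posteriors are taken as given (for an HMM they would come from forward-backward), and the function names are hypothetical.

```python
def marginalized_count_features(seq, posteriors, alphabet, n_states):
    """gamma[k][l] = (1/len(seq)) * sum_i p(h_i = l | x) * [x_i == alphabet[k]]."""
    index = {s: k for k, s in enumerate(alphabet)}
    gamma = [[0.0] * n_states for _ in alphabet]
    for symbol, post in zip(seq, posteriors):
        for l in range(n_states):
            gamma[index[symbol]][l] += post[l] / len(seq)
    return gamma

def marginalized_count_kernel(x, post_x, y, post_y, alphabet="ACGT", n_states=2):
    """Inner product of the posterior-weighted (symbol, state) count features."""
    gx = marginalized_count_features(x, post_x, alphabet, n_states)
    gy = marginalized_count_features(y, post_y, alphabet, n_states)
    return sum(
        gx[k][l] * gy[k][l] for k in range(len(alphabet)) for l in range(n_states)
    )

# Same symbols, different hidden contexts for the second position.
k_ctx = marginalized_count_kernel(
    "AC", [[1.0, 0.0], [0.0, 1.0]], "AC", [[1.0, 0.0], [1.0, 0.0]]
)
# With every position in state 0, this reduces to a plain normalized count kernel.
k_flat = marginalized_count_kernel("AACG", [[1.0, 0.0]] * 4, "ACGT", [[1.0, 0.0]] * 4)
```

In the first example the two sequences are identical as strings, yet the kernel value is lower than it would be with matching contexts, because the symbol C is attributed to different hidden states.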

second-order marginalized count kernel

Definition of the Fisher kernel
Assume a probabilistic model $p(x \mid \theta)$ is defined on $\mathcal{X}$, where $\theta$ is a parameter vector. Let $\hat{\theta}$ denote the parameter values obtained by some learning algorithm. Then the Fisher kernel between two objects is defined as
\[ K_f(x, x') = s(x, \hat{\theta})^{\top} Z^{-1}(\hat{\theta})\, s(x', \hat{\theta}) \]
where $s$ is the Fisher score
\[ s(x, \hat{\theta}) := \nabla_{\theta} \log p(x \mid \hat{\theta}) \]
and $Z$ is the Fisher information matrix
\[ Z(\hat{\theta}) = \sum_{x \in \mathcal{X}} p(x \mid \hat{\theta})\, s(x, \hat{\theta})\, s(x, \hat{\theta})^{\top} \]
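The definition can be worked through on the simplest possible model. The sketch below assumes a one-parameter Bernoulli model with $x \in \{0, 1\}$ and $p(x = 1 \mid \theta) = \theta$, chosen here only for illustration, so the score and the Fisher information are scalars.

```python
def fisher_score(x, theta):
    """d/dtheta log p(x | theta) for a Bernoulli model with x in {0, 1}."""
    return x / theta - (1 - x) / (1.0 - theta)

def fisher_information(theta):
    """Z(theta) = sum over x in {0, 1} of p(x | theta) * s(x, theta)^2."""
    probs = {0: 1.0 - theta, 1: theta}
    return sum(probs[x] * fisher_score(x, theta) ** 2 for x in (0, 1))

def fisher_kernel(x, y, theta):
    """K_f(x, y) = s(x) * Z^{-1} * s(y) in the scalar-parameter case."""
    return fisher_score(x, theta) * fisher_score(y, theta) / fisher_information(theta)

theta_hat = 0.25
k11 = fisher_kernel(1, 1, theta_hat)
k10 = fisher_kernel(1, 0, theta_hat)
k00 = fisher_kernel(0, 0, theta_hat)
```

For this model the Fisher information has the closed form $1 / (\theta(1 - \theta))$, which the sum in `fisher_information` reproduces.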

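Fisher kernel from latent variable models
The Fisher score is described as
\[ \nabla_{\theta} \log p(x \mid \hat{\theta}) = \sum_{h \in \mathcal{H}} \frac{\nabla_{\theta}\, p(x, h \mid \hat{\theta})}{p(x \mid \hat{\theta})} = \sum_{h \in \mathcal{H}} \frac{p(x, h \mid \hat{\theta})}{p(x \mid \hat{\theta})} \cdot \frac{\nabla_{\theta}\, p(x, h \mid \hat{\theta})}{p(x, h \mid \hat{\theta})} = \sum_{h \in \mathcal{H}} p(h \mid x, \hat{\theta})\, \nabla_{\theta} \log p(x, h \mid \hat{\theta}) \]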
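The identity, that the marginal Fisher score equals the posterior-weighted average of the joint scores $\nabla_{\theta} \log p(x, h \mid \hat{\theta})$, can be checked numerically. The sketch assumes a hypothetical toy model with binary $h$ and $x$, a single learned parameter $\theta = p(h = 1)$, and fixed emission probabilities in `EMIT`; none of these values come from the paper.

```python
import math

EMIT = {0: 0.3, 1: 0.8}  # assumed p(x = 1 | h); only theta = p(h = 1) is learned

def joint(x, h, theta):
    """p(x, h | theta) for the toy latent variable model."""
    prior = theta if h == 1 else 1.0 - theta
    px = EMIT[h] if x == 1 else 1.0 - EMIT[h]
    return prior * px

def marginal(x, theta):
    """p(x | theta) = sum_h p(x, h | theta)."""
    return joint(x, 0, theta) + joint(x, 1, theta)

def score_direct(x, theta, eps=1e-6):
    """Central finite difference of log p(x | theta)."""
    up = math.log(marginal(x, theta + eps))
    down = math.log(marginal(x, theta - eps))
    return (up - down) / (2.0 * eps)

def score_marginalized(x, theta):
    """sum_h p(h | x, theta) * d/dtheta log p(x, h | theta)."""
    joint_score = {1: 1.0 / theta, 0: -1.0 / (1.0 - theta)}
    return sum(
        joint(x, h, theta) / marginal(x, theta) * joint_score[h] for h in (0, 1)
    )

theta_hat = 0.4
gap = max(
    abs(score_direct(x, theta_hat) - score_marginalized(x, theta_hat)) for x in (0, 1)
)
```

The two routes agree up to finite-difference error, so `gap` should be close to zero.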

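The Fisher kernel is described as a marginalized kernel
\[ K_f(x, x') = \nabla_{\theta} \log p(x \mid \hat{\theta})^{\top} Z^{-1}(\hat{\theta})\, \nabla_{\theta} \log p(x' \mid \hat{\theta}) = \sum_{h \in \mathcal{H}} \sum_{h' \in \mathcal{H}} p(h \mid x, \hat{\theta})\, p(h' \mid x', \hat{\theta})\, K_z(z, z') \]
where the joint kernel is
\[ K_z(z, z') = \nabla_{\theta} \log p(x, h \mid \hat{\theta})^{\top} Z^{-1}(\hat{\theta})\, \nabla_{\theta} \log p(x', h' \mid \hat{\theta}) \]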

experiment settings
84 amino acid sequences from 5 genera in Actinobacteria.
The numbers of sequences per genus are 9, 32, 15, 14 and 14.
Pairwise identity is 62%-99%.
BLAST scores cannot be converted directly to kernels.

Two kinds of experiments, clustering and supervised classification, are performed on the following kernels:
CK1: Count kernel
CK2: Second-order count kernel
FK: Fisher kernel
MCK1: Marginalized count kernel
MCK2: Second-order marginalized count kernel

clustering result

classification result
Genera 1 and 2 are not used because they can be separated easily by all kernels.
One-vs-one classification is performed on the remaining three genera.

effect of HMM states

conclusion
The Fisher kernel is a special case of marginalized kernels.
Second-order kernels perform better than first-order kernels.
The number of HMM states has an effect on the results.