Multimodal Sparse Representation Learning and Applications
Miriam Cha (Graduate Student, Harvard University), Youngjune L. Gwon (Graduate Student, Harvard University), H. T. Kung (Professor of Computer Science and Electrical Engineering, Harvard University)
Sparse coding has been applied successfully to single-modality scenarios. We consider a sparse coding framework for multimodal representation learning. Our framework aims to capture semantic correlation between different data types via joint sparse coding. Such joint optimization induces a unified representation that is sparse and shared across modalities. In particular, we compute joint, cross-modal, and stacked cross-modal sparse codes. We find that these representations are robust to noise and provide greater flexibility in modeling features for multimodal input. A good multimodal framework should be able to fill in a missing modality given the other and improve representational efficiency. We demonstrate the missing-modality case through image denoising and show the effectiveness of cross-modal sparse codes in uncovering the relationship between clean and corrupted image pairs. Furthermore, we experiment with multi-layer sparse coding to learn highly nonlinear relationships. The effectiveness of our approach is also demonstrated on multimedia event detection and retrieval on the TRECVID dataset (audio-video), category classification on the Wikipedia dataset (image-text), and sentiment classification on PhotoTweet (image-text).
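To make the joint sparse coding idea concrete, the following is a minimal sketch (not the authors' implementation) using scikit-learn. It assumes two synthetic feature matrices, x_audio and x_video, concatenates them, learns a shared dictionary so each sample gets a unified sparse code, and then uses the per-modality sub-dictionaries for a rough cross-modal inference of one modality from the other. All names, dimensions, and hyperparameters here are illustrative assumptions.

```python
import numpy as np
from sklearn.decomposition import DictionaryLearning, SparseCoder

# Synthetic stand-ins for per-modality feature matrices (n_samples x dim).
rng = np.random.RandomState(0)
n_samples, d_audio, d_video = 200, 40, 60
x_audio = rng.randn(n_samples, d_audio)
x_video = rng.randn(n_samples, d_video)

# Joint sparse coding: concatenate modalities and learn one shared dictionary,
# so each sample's sparse code is common to both modalities.
x_joint = np.hstack([x_audio, x_video])
dict_learner = DictionaryLearning(n_components=128,
                                  transform_algorithm="omp",
                                  transform_n_nonzero_coefs=10,
                                  random_state=0)
z_joint = dict_learner.fit_transform(x_joint)  # unified sparse representation

# The learned dictionary splits into per-modality blocks, which permits a
# cross-modal use: encode with the audio block, reconstruct the video block
# (a rough analogue of filling in a missing modality).
D = dict_learner.components_                 # shape: (128, d_audio + d_video)
D_audio, D_video = D[:, :d_audio], D[:, d_audio:]
coder = SparseCoder(dictionary=D_audio,
                    transform_algorithm="omp",
                    transform_n_nonzero_coefs=10)
z_from_audio = coder.transform(x_audio)      # cross-modal sparse code
x_video_hat = z_from_audio @ D_video         # inferred video features
```

In practice the sub-dictionary atoms would typically be renormalized before cross-modal encoding, and the sparse codes (z_joint or z_from_audio) would feed a downstream classifier; this sketch only illustrates the shared-dictionary structure the abstract describes.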
Key words: Multimodal learning, multimedia, visual-text, audio-video, sparse coding