Nandeska? Say What? Learning, Visualizing, and Understanding Multilingual Word Embeddings

What is the equivalent of the English phrase "say what?" in Japanese? In this talk, we present an intuitive approach to learning such distributed representations of phrases from multilingual data in a novel way, using autoencoders and generative neural networks. Distributed representations of language are a natural way to encode relationships between words and phrases.

Such representations map discrete linguistic units to continuous vectors, and frequently capture useful semantics of the underlying language corpus, making them ubiquitous in NLP tasks. However, most machine translation approaches require large parallel corpora to learn semantic relationships between phrase pairs, which is problematic for language pairs with little aligned data.
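As a rough illustration of what "mapping discrete units to continuous vectors" means in practice, here is a minimal sketch (not from the talk; the vocabulary and embedding values are invented for illustration, whereas real embeddings are learned from a corpus): a lookup table maps each token to a vector, and similar words end up with similar vectors.

```python
import numpy as np

# Hypothetical 4-dimensional embeddings for a toy vocabulary
# (values invented for illustration; real embeddings are learned).
vocab = {"say": 0, "what": 1, "hello": 2}
E = np.array([
    [0.9, 0.1, 0.0, 0.3],   # "say"
    [0.8, 0.2, 0.1, 0.4],   # "what"
    [0.1, 0.9, 0.7, 0.0],   # "hello"
])

def embed(word):
    # Map a discrete token to its continuous vector.
    return E[vocab[word]]

def cosine(u, v):
    # Cosine similarity: the standard closeness measure for embeddings.
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Distributionally similar words get similar vectors, so "say" sits
# closer to "what" than to "hello" in this toy space.
print(cosine(embed("say"), embed("what")) > cosine(embed("say"), embed("hello")))
```

The same lookup-plus-similarity pattern underlies nearest-neighbor queries such as "which Japanese phrase vector lies closest to the vector for 'say what?'".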

This talk examines distributed representations learned with neural embeddings, with particular focus on the use of generative models and autoencoders for learning shared word and phrase representations across languages. We show how to speed up learning of shared latent representations using Spark, and discuss techniques for optimizing phrase alignment with active learning.
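To give a flavor of the shared-representation idea, the following is a minimal numpy sketch, not the talk's actual model: one linear autoencoder per language, with an alignment penalty that pulls the latent codes of paired phrases together so both languages share one latent space. The data here is synthetic, and all dimensions, names, and the plain gradient-descent training loop are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy parallel data: paired phrase embeddings for two "languages"
# (synthetic here; in practice these would come from aligned phrase pairs).
d_src, d_tgt, d_lat, n = 8, 8, 4, 64
X = rng.normal(size=(n, d_src))                           # source phrase vectors
Y = X @ rng.normal(size=(d_src, d_tgt)) / np.sqrt(d_src)  # paired target vectors

# Encoder/decoder weights per language; both encoders map into the SAME
# d_lat-dimensional latent space.
W_es = rng.normal(scale=0.1, size=(d_src, d_lat))
W_et = rng.normal(scale=0.1, size=(d_tgt, d_lat))
W_ds = rng.normal(scale=0.1, size=(d_lat, d_src))
W_dt = rng.normal(scale=0.1, size=(d_lat, d_tgt))

def loss():
    # Reconstruction error for each language plus an alignment term that
    # penalizes distance between the latent codes of paired phrases.
    Zs, Zt = X @ W_es, Y @ W_et
    return 0.5 * (np.sum((Zs @ W_ds - X) ** 2)
                  + np.sum((Zt @ W_dt - Y) ** 2)
                  + np.sum((Zs - Zt) ** 2)) / n

init_loss, lr = loss(), 0.005
for _ in range(2000):
    Zs, Zt = X @ W_es, Y @ W_et
    Rs, Rt, A = Zs @ W_ds - X, Zt @ W_dt - Y, Zs - Zt
    # Gradient-descent step on the combined reconstruction + alignment objective.
    W_es -= lr * (X.T @ (Rs @ W_ds.T + A)) / n
    W_et -= lr * (Y.T @ (Rt @ W_dt.T - A)) / n
    W_ds -= lr * (Zs.T @ Rs) / n
    W_dt -= lr * (Zt.T @ Rt) / n

final_loss = loss()
print(final_loss < init_loss)  # the shared-representation objective decreases
```

In this sketch the weight updates over `n` phrase pairs are embarrassingly parallel across rows of `X` and `Y`, which is the kind of structure that makes a Spark-based speedup natural; the active-learning angle would then choose which phrase pairs to align next.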

Session hashtag: #AISAIS16

About Ali Zaidi

Ali is a data scientist on the language understanding team at Microsoft AI Research. He spends his days building tools that help researchers and engineers analyze large quantities of language data efficiently in the cloud and on clusters. Ali studied statistics and machine learning at the University of Toronto and Stanford University.