
Multilingual Semantic Models

In this post I’ll discuss a model for learning word embeddings such that words from different languages end up in the same vector space. This means we can measure the similarity between English and German words, or even compare the meaning of two sentences in different languages. It is a summary and analysis of the paper by Karl Moritz Hermann and Phil Blunsom, titled “Multilingual Models for Compositional Distributed Semantics”, published at ACL 2014.

The Task

The goal of this work is to extend the distributional hypothesis to multilingual data and joint-space embeddings. This would give us the ability to compare words and sentences in different languages, and also make use of labelled training data from languages other than the target language. For example, below is an illustration of English words and their Estonian translations in the same semantic space.



This actually turns out to be a very difficult task, because the distributional hypothesis stops working across different languages. While “fish” is an important feature of “cat”, because they occur together often, “kass” never occurs with “fish”, because they are in different languages and therefore used in separate sets of documents.

In order to learn these representations in the same space, the authors construct a neural network that learns from parallel sentences (pairs of the same sentence in different languages). The model is then evaluated on the task of topic classification, training on one language and testing on the other.

A bit of a spoiler, but here is a visualisation of some words from the final model, mapped into 2 dimensions.



The words from English, German and French are successfully mapped into clusters based on meaning. The colours indicate gender (blue=male, red=female, green=neutral).

The Multilingual Model

The main idea is as follows: We have sentence \(a\) in one language, and we have a function \(f(a)\) which maps that sentence into a vector representation (we’ll come back to that function). We then have sentence \(b\), which is the same sentence just in a different language, and function \(g(b)\) for mapping it into a vector representation. Our goal is to have \(f(a)\) and \(g(b)\) be identical, because both of these sentences have the same meaning. So during training, we show the model a series of parallel sentences \(a\) and \(b\), and each time we adjust the functions \(f(a)\) and \(g(b)\) so that they would produce more similar vectors.

Here is a graphical representation of the model:


\(b1\), \(b2\) and \(b3\) are words in sentence \(b\); \(a1\), \(a2\), \(a3\) and \(a4\) are words in sentence \(a\). The red vectors in the middle are the sentence representations that we want to be similar.

Next, let’s talk about the functions \(f(a)\) and \(g(b)\) that map a sentence to a vector. As you can see from the image above, each word is represented as a vector as well. The simplest option of going from words to sentences is to just add the individual word vectors together (the ADD model):

\(f_{ADD}(a) = \sum_{i=1}^{n} a_i\)

Here, \(a_i\) is the vector for word number \(i\) in sentence \(a\). This addition is similar to a basic bag-of-words model, because it doesn’t preserve any information about the order of the words. Therefore, the authors have also proposed a bigram version of this function (the BI model):

\(f_{BI}(a) = \sum_{i=1}^{n} \tanh(a_{i-1} + a_i)\)

In this function, we step through the sentence, add together the vectors for two consecutive words, and pass the result through a nonlinearity (tanh). These bigram vectors are then summed into a sentence vector. This is essentially a multi-layer compositional network, where word vectors are first combined into bigram vectors, and then bigram vectors are combined into sentence vectors.
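
To make the two composition functions concrete, here is a minimal sketch in plain Python. The toy vectors are made up, and padding the sentence with a zero vector for \(a_0\) is my own assumption (the formula needs some value for the position before the first word); this is not the authors' implementation.

```python
import math

def f_add(word_vectors):
    """ADD model: element-wise sum of the word vectors in a sentence."""
    dim = len(word_vectors[0])
    return [sum(vec[k] for vec in word_vectors) for k in range(dim)]

def f_bi(word_vectors):
    """BI model: sum of tanh(a_{i-1} + a_i) over consecutive word pairs.

    Assumption: the vector before the first word (a_0) is all zeros.
    """
    dim = len(word_vectors[0])
    padded = [[0.0] * dim] + list(word_vectors)
    sentence = [0.0] * dim
    for prev, curr in zip(padded, padded[1:]):
        for k in range(dim):
            sentence[k] += math.tanh(prev[k] + curr[k])
    return sentence

# Two "sentences" with the same two words in opposite order (toy 2-d vectors).
cat_sat = [[1.0, 0.0], [0.0, 2.0]]
sat_cat = [[0.0, 2.0], [1.0, 0.0]]

print(f_add(cat_sat) == f_add(sat_cat))  # True: ADD ignores word order
print(f_bi(cat_sat) == f_bi(sat_cat))    # False: BI is sensitive to word order
```

The comparison at the end shows the point made above: the ADD model behaves like a bag of words, while the bigram version picks up some ordering information.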

One more component to make this model work – the optimization function. The authors define an energy function given two sentences:

\(E(a,b) = \| f(a) - g(b) \|^2\)

This means we find the Euclidean distance between the two vector representations and take the square of it. This value will be big when the vectors are different, and small when they are similar.

But we can’t directly use this for optimization, because functions \(f(a)\) and \(g(b)\) that always returned zero vectors would be the most optimal solution. We want the model to give similar vectors for similar sentences, but different vectors for semantically different sentences. Here’s a function for that:

\(E_{nc}(a,b,c) = [m + E(a,b) - E(a,c)]_{+}\)

We’ve introduced a randomly selected sentence \(c\) that probably has nothing to do with \(a\) or \(b\). Our objective is to minimize the \(E_{nc}(a,b,c)\) function, which means we want \(E(a,b)\) (for related sentences) to be small, and \(E(a,c)\) (for unrelated sentences) to be large. This form of training, teaching the model to distinguish between correct and incorrect samples, is a form of noise-contrastive training. The formula also includes \(m\), which is the margin we want to have between the values of \(E(a,b)\) and \(E(a,c)\). The whole thing is passed through the function \([x]_{+} = \max(x,0)\), which means that if \(E(a,c)\) is greater than \(E(a,b)\) by at least the margin \(m\), then we’re already optimal and don’t need to adjust the model further.
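
The two energy functions are easy to write down directly. The sketch below uses made-up sentence vectors and an illustrative margin of 1.0 (both are my assumptions, not values from the paper), and only computes the loss; it does not do any training.

```python
def energy(u, v):
    """E(a, b): squared Euclidean distance between two sentence vectors."""
    return sum((x - y) ** 2 for x, y in zip(u, v))

def hinge_loss(f_a, g_b, g_c, margin=1.0):
    """E_nc(a, b, c) = max(0, m + E(a, b) - E(a, c))."""
    return max(0.0, margin + energy(f_a, g_b) - energy(f_a, g_c))

f_a = [1.0, 0.0]       # sentence a
g_b = [0.9, 0.1]       # its translation b, already close to a
g_far = [-1.0, 2.0]    # a noise sentence that is far away from a
g_near = [1.0, 0.5]    # a noise sentence that is too close to a
zero = [0.0, 0.0]

print(hinge_loss(f_a, g_b, g_far))         # 0.0: margin satisfied, nothing to update
print(hinge_loss(f_a, g_b, g_near) > 0.0)  # True: the noise sentence must be pushed away
print(hinge_loss(zero, zero, zero))        # 1.0: mapping everything to zero is penalised
```

The last line is the reason the plain distance \(E(a,b)\) is not enough: with the margin in place, collapsing every sentence to the same vector no longer gives a loss of zero.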

The authors also experiment with a document-level variation of the model (the DOC model), where individual sentence vectors are combined into document vectors and these are also optimized to be similar, in addition to the sentence vectors.


The authors evaluate the system on the task of topic classification. The classifier is trained on one language (e.g. English) and the test results are reported on another language (e.g. German) for which no labelled training data was used. They run two main experiments:

  1. The cross-lingual document classification (CLDC) task, described by Klementiev et al. (2012). The system is trained on the parallel Europarl corpus, and tested on Reuters RCV1/RCV2. The language pairs used were English-German and English-French.
  2. The authors built a new corpus from parallel subtitles of TED talks (not yet online at the time of writing this), based on a previous TED corpus for IWSLT. Each talk also has topic tags assigned to it, and the task is to assign a correct tag to every talk, using the document-level vector.



First, results on the CLDC task:

I-Matrix is the previous state-of-the-art system, and all the models described here manage to outperform it. The +-variations of the model (ADD+ and BI+) use the French data as an extra training resource, thereby improving performance. This is an interesting result, as the classifier is trained on English and tested on German, and French seems completely unrelated to the task. But by adding it into the model, the English word representations are improved by having more information available, which in turn propagates on to having better German representations.

Next, experiments on the TED corpus:


The authors have performed a much larger number of experiments, and I’ve only chosen a few examples to show here.

The MT System is a machine translation baseline, where the test data is translated into the source language using a machine translation system. The most interesting scenarios for application are where the source language is English, and this is where the MT baseline often still wins. So if we want to do topic classification in Dutch, but we only have English labelled data, it’s best to just automatically translate the Dutch text into English before classification.

Experiments in the other direction (where the target language is English) show different results, and the multilingual neural models manage to win on most languages. I’m curious about this difference – perhaps the MT systems are better tuned to translate into English, but not as good when translating from English into other languages? In any case, I think with some additional developments the neural network model will be able to beat the baseline in both directions.

In most cases, adding the document-level training signal (ADD/DOC) helped accuracy quite a bit. The bigram models (BI) however were outperformed by the basic ADD models on this task, and the authors suspect this is due to sparsity issues caused by less training data.

Finally, the ADD/DOC/joint model was trained on all languages simultaneously, taking advantage of parallel data in all the languages, and mapping all vectors into the same space. The results of this experiment seem to be mixed, leading to an improvement on some languages and decrease on others.

In conclusion, this is definitely a very interesting model, and it bridges the gap between vector representations of different languages, using only sentence-aligned plain text. Combining sentence-level and document-level training signals seems to give a fairly consistent improvement in classification accuracy. Unfortunately, in the most interesting scenario, mapping from English to other resource-poor languages, this system does not yet beat the MT baseline. But hopefully this is only the first step, and future research will further improve the results.

References


Hermann, K. M., & Blunsom, P. (2014). Multilingual Models for Compositional Distributed Semantics. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (pp. 58–68).

Klementiev, A., Titov, I., & Bhattarai, B. (2012). Inducing crosslingual distributed representations of words. In COLING 2012.

Political ideology detection

Neural networks have a range of interesting applications, and here I will discuss one of them: recursive neural networks for the detection of political ideology. This post is a summary and analysis of a recent publication by Mohit Iyyer, Peter Enns, Jordan Boyd-Graber and Philip Resnik: “Political Ideology Detection Using Recursive Neural Networks”.

The Task

Given a sentence, we want the model to detect the political ideology expressed in that sentence. In this research, the authors deal with US politics, so the possible options are liberal (Democrats) or conservative (Republicans). As a practical application we might consider a system that processes a large number of news articles or public speeches to detect and measure explicit or hidden political bias of the authors.


A traditional approach to this problem is a simple bag-of-words model, where each word is treated as a separate feature, but this ignores any syntactic structure and even word order. As shown below, political ideology can be compositionally complicated: while certain sections of the sentence are locally conservative, the way they are used in context makes the overall sentence liberal.

Figure 1: Sample sentence from Iyyer et al. (2014). Blue nodes are liberal, red nodes are conservative, grey nodes are neutral.

Recursive Neural Network

The recursive neural network in this work is based on Socher et al. (2011), where it was used for sentiment detection. The idea of semantic composition is that the meaning of a sentence or text is composed of its smaller subparts, and recursive neural networks aim to model this by recursively combining vector representations of words and phrases.

We represent each word as a vector of length d. We can then construct a network that takes two word vectors, concatenated into a vector of length 2d, as input and outputs a new vector of length d. We can then stack these networks together and use the output of one network as one of the inputs to another network. This way we can combine multiple words to phrases and phrases to sentences. We don’t need to have a separate network for each of these levels, and can just reuse the same network each time – which is why these are called recursive neural networks.

Mathematically, the composition happens as follows:

\(x_c = f(W_L \times x_a + W_R \times x_b + b)\)

\(x_a\) and \(x_b\) are the vectors for the two input words/phrases; \(x_c\) is the output vector; \(W_L\) and \(W_R\) are the weight matrices for the left and right input; \(b\) is the bias vector. If we think of the input as a concatenation of the vectors, this formula actually becomes the traditional single-layer neural network:

\(x_c = f(W \times x_{ab} + b)\)
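
Here is a small sketch of the composition step, mainly to show that the two formulas compute the same thing when \(W\) is just \(W_L\) and \(W_R\) placed side by side. The weights, the word vectors, the choice of tanh for \(f\) and the tree shape ((the cat) sat) are all toy assumptions of mine, not values from the paper.

```python
import math

def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def compose(W_L, W_R, b, x_a, x_b):
    """x_c = f(W_L x_a + W_R x_b + b), with f = tanh (assumed)."""
    left, right = matvec(W_L, x_a), matvec(W_R, x_b)
    return [math.tanh(l + r + bias) for l, r, bias in zip(left, right, b)]

def compose_concat(W, b, x_a, x_b):
    """Same computation, written as one layer over the concatenated input."""
    x_ab = x_a + x_b  # list concatenation gives a vector of length 2d
    return [math.tanh(z + bias) for z, bias in zip(matvec(W, x_ab), b)]

W_L = [[0.5, -0.2], [0.1, 0.3]]
W_R = [[0.4, 0.0], [-0.3, 0.2]]
b = [0.0, 0.1]
W = [l_row + r_row for l_row, r_row in zip(W_L, W_R)]  # d x 2d matrix [W_L W_R]

the, cat, sat = [0.1, 0.2], [0.7, -0.4], [-0.5, 0.9]

# The same weights are reused at every level of the tree.
the_cat = compose(W_L, W_R, b, the, cat)
sentence = compose(W_L, W_R, b, the_cat, sat)
print(len(sentence))  # 2: a phrase vector has the same length d as a word vector
```

Because the output has the same dimensionality as each input, the result for one phrase can be fed straight back in as an input for the next level up.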

An example of this recursive neural network in action is shown below:


During training, an additional softmax layer is added on top, which calculates the probability for both classes (liberal or conservative). The model is then trained to minimize the negative log-probability of the correct class, with L2-regularization. The error is backpropagated through the whole tree, so the information on the sentence level will be used to modify the individual word vectors, along with all the weights used in the composition.
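
As a rough illustration of this training objective, the sketch below computes the softmax probabilities and the regularised negative log-probability for a single root vector. All numbers, the function names and the regularisation strength are hypothetical, and the backpropagation step itself is left out.

```python
import math

def softmax(scores):
    """Turn raw class scores into probabilities that sum to 1."""
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def sentence_loss(x_root, W_cat, b_cat, gold, weights=(), l2=0.0):
    """Negative log-probability of the gold class plus an L2 penalty."""
    scores = [sum(w * x for w, x in zip(row, x_root)) + bias
              for row, bias in zip(W_cat, b_cat)]
    probs = softmax(scores)
    penalty = l2 * sum(w * w for w in weights)
    return -math.log(probs[gold]) + penalty

# Toy values: a 2-dimensional root vector and a 2-class softmax layer.
x_root = [0.2, -0.1]
W_cat = [[1.0, 0.0],   # row 0 scores "liberal"
         [0.0, 1.0]]   # row 1 scores "conservative"
b_cat = [0.0, 0.0]
flat_weights = [w for row in W_cat for w in row]

loss_if_liberal = sentence_loss(x_root, W_cat, b_cat, 0, flat_weights, l2=0.01)
loss_if_conservative = sentence_loss(x_root, W_cat, b_cat, 1, flat_weights, l2=0.01)
print(loss_if_liberal < loss_if_conservative)  # True: class 0 has the higher score
```

The loss is lower when the gold label agrees with the class the network already scores highest, which is the signal that gets pushed back down the tree.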

Also, if I understand the paper correctly, the final prediction at test time is actually generated by averaging the vectors of all the nodes in the sentence tree, concatenating this average with the root vector, and training a separate logistic regression model over the result. This seems a bit odd, because the recursive network is quite capable of predicting the class label at the root node, and has been optimised to do so. I wonder what the performance difference would be if the root-level predictions from the RNN were used directly for evaluation.


The authors use two datasets for evaluation:

  1. The Convote dataset (Thomas et al., 2006) contains transcripts of spoken text from the US Congressional floor debate. The experiments here use 7,816 sentences from the corpus.
  2. The Ideological Books Corpus (IBC) (Gross et al., 2013) is a collection of books and articles by authors whose political bias is well-known. Iyyer et al. (2014) extend the dataset by providing sentence-level and partial phrase-level annotation for 4,062 sentences, crowd-sourced through Crowdflower.

Both of these datasets are originally annotated on the level of the author and their political views, but this information can be used to also label each sentence with its most likely political class.

On both datasets, the authors apply heuristic filters to keep only sentences containing explicit bias in either direction, and remove neutral sentences, which make up the majority of the corpora. However, as this filtering is based on surface forms (essentially bag-of-words or bag-of-phrases), it is unknown how this affects the final dataset. It would be interesting to also see results where the test set contains all of the original sentences.


The authors compare a number of different models.

  • Baselines
    • Random: The label (conservative/liberal) is chosen randomly.
    • LR1: Basic logistic regression using bag-of-words (BoW) features.
    • LR2: As previous, but adding phrase annotations as more training data.
    • LR3: Logistic regression with BoW and dependency-based features.
    • LR-w2v: Logistic regression over averaged word vectors from word2vec.
  • RNN Models
    • RNN1: The basic RNN model, trained on sentence-level annotations, using random initialization.
    • RNN1-w2v: As previous, but initialized with vectors from word2vec.
    • RNN2-w2v: The RNN model, trained on sentence-level and phrase-level annotations, initialized with vectors from word2vec.

The results (accuracy) can be seen in the graph.


Some conclusions we can draw from the results:

  • All the models outperform the random baseline by quite a margin.
  • All the recursive neural network models outperform all the logistic regression models.
  • Adding the phrase-level annotations as extra training data for the logistic regression decreases performance (LR2), but adding them to the RNN model improves performance (RNN2-w2v).
  • Logistic regression trained on averaged word2vec vectors (LR-w2v) outperforms logistic regression with BoW features (LR1), and even dependency-based features (LR3, on the IBC dataset).
  • Initializing the RNN with word2vec vectors gives a little boost to the overall accuracy.

References


Gross, J., Acree, B., Sim, Y., & Smith, N. A. (2013). Testing the Etch-a-Sketch Hypothesis: Measuring Ideological Signaling via Candidates’ Use of Key Phrases. In APSA 2013.

Iyyer, M., Enns, P., Boyd-Graber, J., & Resnik, P. (2014). Political Ideology Detection Using Recursive Neural Networks. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (pp. 1113–1122).

Socher, R., Pennington, J., Huang, E. H., Ng, A. Y., & Manning, C. D. (2011). Semi-supervised recursive autoencoders for predicting sentiment distributions. Proceedings of EMNLP.

Thomas, M., Pang, B., & Lee, L. (2006). Get out the vote: Determining support or opposition from Congressional floor-debate transcripts. In Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing – EMNLP ’06 (pp. 327–335).

Don’t count, predict

In the past couple of years, neural networks have nearly taken over the field of NLP, as they are being used in recent state-of-the-art systems for many tasks. One interesting application is distributional semantics, where they can be used to learn dense vector representations for words. Marco Baroni, Georgiana Dinu and German Kruszewski presented a paper at ACL 2014 called “Don’t count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors”, where they compare these new neural-network models with more traditional context vectors on a range of different tasks. Here I will try to give an overview and a summary of their work.

Distributional hypothesis

The goal is to find how similar two words are to each other semantically. The distributional hypothesis states:

Words which are similar in meaning occur in similar contexts
(Rubenstein & Goodenough, 1965).

Therefore, if we want to find a word similar to “magazine”, we can look for words that occur in similar contexts, such as “newspaper”.

I was reading a magazine today  |  I was reading a newspaper today
The magazine published an article  |  The newspaper published an article
He buys this magazine every day  |  He buys this newspaper every day


Also, if we want to find how similar “magazine” and “newspaper” are, we can compare the contexts in which they appear. For example, we can represent the contexts of each word as a feature vector and calculate the cosine similarity between the two vectors.
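
Cosine similarity itself takes only a few lines. In the sketch below the context vectors are invented counts over four hypothetical context words, chosen just to show that words sharing contexts get a high score.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0

# Toy context counts over the features: reading, published, buys, barked.
magazine = [2, 1, 1, 0]
newspaper = [1, 1, 2, 0]
dog = [0, 0, 0, 3]

print(cosine(magazine, newspaper) > cosine(magazine, dog))  # True
```

Here “magazine” and “newspaper” share most of their contexts and get a similarity of 5/6, while “magazine” and “dog” share none and get 0.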

The counting model

In a traditional counting model, we count how many times a certain context word appears near our main word. We have to manually choose a window size – the number of words we count on either side of the main word.

For example, given sentences “He buys this newspaper every day” and “I read a newspaper every day”, we can look at window size 2 (on either side of the main word) and create the following count vector:

buys  this  every  day  read  a
1     1     2      2    1     1

As you can see, “every” and “day” get a count of 2, because they appear in both sentences, whereas “he” doesn’t get included because it doesn’t fit into the window of 2 words.
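
The count vector above can be reproduced with a short script. This is only a sketch: lowercasing and whitespace tokenisation are simplifying assumptions of mine, and `context_counts` is a hypothetical helper name.

```python
from collections import Counter

def context_counts(sentences, target, window=2):
    """Count the words appearing within `window` positions of `target`."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, token in enumerate(tokens):
            if token != target:
                continue
            start = max(0, i - window)
            end = min(len(tokens), i + window + 1)
            for j in range(start, end):
                if j != i:  # skip the main word itself
                    counts[tokens[j]] += 1
    return counts

sentences = ["He buys this newspaper every day",
             "I read a newspaper every day"]
counts = context_counts(sentences, "newspaper", window=2)
print(dict(counts))
# {'buys': 1, 'this': 1, 'every': 2, 'day': 2, 'read': 1, 'a': 1}
```

Increasing `window` to 3 would bring “he” and “i” into the vector, which shows how much the representation depends on this manually chosen parameter.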

In order to get the best performance, it is important to use a weighting function which turns these counts into more informative values. A good weighting scheme would downweight frequent words such as “this” and “a”, and upweight more informative words such as “read”. The authors experimented with two weighting schemes: Pointwise Mutual Information and Local Mutual Information. An overview of various weighting methods can be found in the thesis of Curran (2003).
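
To see what such a weighting does, here is a sketch of PMI and LMI computed from raw counts. I am using the standard textbook definitions (LMI as the co-occurrence count times PMI); the counts are invented, and the exact variants and smoothing used in the paper may differ.

```python
import math

def pmi(n_wc, n_w, n_c, total):
    """Pointwise mutual information: log of P(w, c) / (P(w) * P(c))."""
    return math.log((n_wc * total) / (n_w * n_c))

def lmi(n_wc, n_w, n_c, total):
    """Local mutual information: the co-occurrence count times PMI."""
    return n_wc * pmi(n_wc, n_w, n_c, total)

# Invented counts: 1000 word-context pairs in total, "newspaper" seen 20 times.
total, n_newspaper = 1000, 20
read = {"n_wc": 5, "n_c": 10}     # "read" is rare but mostly seen with "newspaper"
a = {"n_wc": 10, "n_c": 400}      # "a" co-occurs more often but appears everywhere

pmi_read = pmi(read["n_wc"], n_newspaper, read["n_c"], total)
pmi_a = pmi(a["n_wc"], n_newspaper, a["n_c"], total)
print(pmi_read > pmi_a)  # True, even though the raw count of "a" is higher
```

With these numbers the raw counts would rank “a” above “read”, but after weighting “read” becomes the far more important feature, which is the behaviour we want.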

These vectors are also often compressed. The motivation is that feature words such as “buy” and “purchase” are very similar and compression techniques should be able to represent them as a single semantic class. In addition, this gives smaller dense vectors, which can be easier to operate on. The authors here experimented with SVD and Non-negative Matrix Factorization.

The predicting model

As the neural network (predicting) model, this work uses the word2vec implementation of Continuous Bag-of-Words (CBOW). Word2vec is a toolkit for efficiently learning word vectors from very large text corpora (Mikolov et al., 2013). The CBOW architecture is shown below:


The vectors of context words are given as input to the network. They are summed and then used to predict the main word. During training, the error is backpropagated and the context vectors are updated so that they would predict the correct word. Since similar words appear in similar contexts, the vectors of similar words will also be updated towards similar directions.
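
The forward pass described here can be sketched as follows. The vectors and the two-word vocabulary are made up, `cbow_predict` is a hypothetical name, and this version uses a full softmax with no training step, so it illustrates the idea rather than the word2vec code.

```python
import math

def cbow_predict(context_vectors, output_vectors):
    """Sum the context vectors, score every vocabulary word, apply softmax."""
    dim = len(context_vectors[0])
    hidden = [sum(vec[k] for vec in context_vectors) for k in range(dim)]
    scores = {word: sum(h * o for h, o in zip(hidden, vec))
              for word, vec in output_vectors.items()}
    top = max(scores.values())  # subtract the max for numerical stability
    exps = {word: math.exp(s - top) for word, s in scores.items()}
    total = sum(exps.values())
    return {word: e / total for word, e in exps.items()}

context = [[0.5, 0.1], [0.4, 0.3]]  # toy vectors for two context words
output_vectors = {"newspaper": [1.0, 0.5],
                  "banana": [-1.0, 0.2]}

probs = cbow_predict(context, output_vectors)
print(max(probs, key=probs.get))  # newspaper
```

Note that the softmax loops over every word in the vocabulary, which is harmless for two words but becomes the bottleneck with hundreds of thousands of them.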

A normal network would require finding the probabilities of all possible words in our vocabulary (using softmax), which can be computationally very demanding. Word2vec implements two alternatives to speed this up:

  1. Hierarchical softmax, where the words are arranged in a tree (similar to a decision tree), making the complexity logarithmic.
  2. Negative sampling, where the system learns to distinguish the correct answer from a sample of a few incorrect answers.
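
As an illustration of the second option, the sketch below computes a negative sampling loss for one training example, using the usual formulation with a sigmoid over dot products. The vectors are toy values, the function names are hypothetical, and drawing the noise words from a frequency distribution is left out.

```python
import math

def log_sigmoid(x):
    """Numerically stable log(1 / (1 + exp(-x)))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def negative_sampling_loss(hidden, target_vec, negative_vecs):
    """Reward a high score for the true word and low scores for the noise words."""
    loss = -log_sigmoid(dot(hidden, target_vec))
    for neg in negative_vecs:
        loss -= log_sigmoid(-dot(hidden, neg))
    return loss

hidden = [0.9, 0.4]                   # summed context vectors (toy values)
newspaper = [1.0, 0.5]                # output vector of the correct word
noise = [[-1.0, 0.2], [-0.5, -0.5]]   # output vectors of two sampled noise words

good = negative_sampling_loss(hidden, newspaper, noise)
bad = negative_sampling_loss(hidden, noise[0], [newspaper, noise[1]])
print(good < bad)  # True: the loss is lower when the true word scores highest
```

Each update only touches the correct word and a handful of noise words, so the cost no longer depends on the size of the vocabulary.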

In addition, word2vec can downsample very frequent words (for example, function words such as “a” and “the”) which are not very informative.
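
To give a feel for how this downsampling behaves, here is a sketch of the rule as the word2vec authors describe it in their follow-up paper: each occurrence of a word is discarded with probability \(1 - \sqrt{t / f(w)}\), where \(f(w)\) is the relative frequency of the word and \(t\) is a small threshold. The threshold value and the example frequencies below are illustrative assumptions, and as far as I know the released code uses a slightly different expression.

```python
import math

def discard_probability(relative_freq, threshold=1e-5):
    """Probability of dropping one occurrence of a word during training."""
    if relative_freq <= threshold:
        return 0.0  # rare words are always kept
    return 1.0 - math.sqrt(threshold / relative_freq)

print(discard_probability(1e-6))        # 0.0: a rare word is never dropped
print(discard_probability(0.05) > 0.9)  # True: a word like "the" is mostly dropped
```

Only words that are more frequent than the threshold are affected, and the more frequent the word, the larger the share of its occurrences that gets skipped.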

From personal experience I have found skip-gram models (also implemented in word2vec) to perform better than CBOW, although they are slightly slower. Skip-grams were not compared in this work, so it is still an open question which of the two models gives better vectors, given the same training time.


The models were trained on a corpus of 2.8 billion tokens, combining ukWac, the English Wikipedia and the British National Corpus. They used the 300,000 most frequent words as both the context and target words.

The authors performed a number of experiments on different semantic similarity tasks and datasets.

  1. Semantic relatedness: In this task, humans were asked to rate the relatedness of two words, and the correlation between the system’s scores and these ratings is measured.
    • rg: 65 noun pairs by Rubenstein & Goodenough (1965)
    • ws: Wordsim353, 353 word pairs by Finkelstein et al. (2002)
    • wss: Subset of Wordsim353 focused on similarity (Agirre et al., 2009)
    • wsr: Subset of Wordsim353 focused on relatedness (Agirre et al., 2009)
    • men: 1000 word pairs by Bruni et al. (2014)
  2. Synonym detection: The system needs to find the correct synonym for a word.
    • toefl: 80 multiple-choice questions with 4 synonym candidates (Landauer & Dumais, 1997).
  3. Concept categorization: Given a set of concepts, the system needs to group them into semantic categories.
    • ap: 402 concepts in 21 categories (Almuhareb, 2006)
    • esslli: 44 concepts in 6 categories (Baroni et al., 2008)
    • battig: 83 concepts in 10 categories (Baroni et al., 2010)
  4. Selectional preferences: Using some examples, the system needs to decide how likely a noun is to be a subject or object for a specific verb.
    • up: 221 word pairs (Pado, 2007)
    • mcrae: 100 noun-verb pairs (McRae et al., 1998)
  5. Analogy: Using examples, the system needs to find a suitable word to solve analogy questions, such as: “X is to ‘uncle’ as ‘sister’ is to ‘brother’”
    • an: ~19,500 analogy questions (Mikolov et al., 2013b)
    • ansyn: Subset of the analogy questions, focused on syntactic analogies
    • ansem: Subset of the analogy questions, focused on semantic analogies
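
For the analogy task, one common way of answering such questions with word vectors is the vector offset method: compute `sister - brother + uncle` and return the nearest remaining word by cosine similarity. The sketch below does this with hand-made 3-dimensional vectors, so it only illustrates the mechanics and says nothing about how real embeddings behave.

```python
import math

def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0

def solve_analogy(vectors, a, b, c):
    """Return the word X such that a is to b as c is to X (vector offset)."""
    query = [vb - va + vc
             for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = (w for w in vectors if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(vectors[w], query))

# Hand-made vectors: dimension 0 encodes gender, the others the kind of relative.
vectors = {
    "brother": [1.0, 1.0, 0.0],
    "sister": [-1.0, 1.0, 0.0],
    "uncle": [1.0, 0.0, 1.0],
    "aunt": [-1.0, 0.0, 1.0],
    "table": [0.0, -1.0, -1.0],
}

print(solve_analogy(vectors, "brother", "sister", "uncle"))  # aunt
```

The three query words are excluded from the candidates, otherwise the nearest neighbour of the offset vector is often one of the input words themselves.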


The graph below illustrates the performance of different models on all the tasks. The blue bars represent the counting models, and the red bars are for neural network models. The “individual” models are the best models on that specific task, whereas the “best-overall” is the single best model across all the tasks.

In conclusion, the neural networks win by a large margin. The neural models give a large improvement on the tasks of semantic relatedness, synonym detection and analogy detection. The performance is equal or slightly better on categorization and selectional preferences.

The best parameter choices for counting models are as follows:

  • window size 2 (bigger is not always better)
  • weighted with PMI, not LMI
  • no dimensionality reduction (not using SVD or NNMF)

The best parameters for the neural network model were:

  • window size 5
  • negative sampling (not hierarchical softmax)
  • subsampling of frequent words
  • dimensionality 400

The paper also finds that the neural models are much less sensitive to parameter choices, as even the worst neural models perform relatively well, whereas counting models can thoroughly fail with unsuitable values.

It seems appropriate to quote the final conclusion of the authors:

Our secret wish was to discover that it is all hype, and count vectors are far superior to their predictive counterparts. A more realistic expectation was that a complex picture would emerge, with predict and count vectors beating each other on different tasks. Instead, we found that the predict models are so good that, while triumphalist overtones still sound excessive, there are very good reasons to switch to the new architecture.

The results certainly support that vectors created by neural models are more suitable for distributional similarity tasks. We have also performed a similar comparison on the task of hyponym generation (Rei & Briscoe, 2014). Our findings match those here – the neural models outperform simple bag-of-words models. However, the neural models are in turn outperformed by a dependency-based model. It remains to be seen whether this finding also applies on a wider range of distributional similarity models.

References


Agirre, E., Alfonseca, E., Hall, K., Kravalova, J., Paşca, M., & Soroa, A. (2009). A study on similarity and relatedness using distributional and WordNet-based approaches. Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics on – NAACL ’09, (June), 19. doi:10.3115/1620754.1620758

Almuhareb, A. (2006) Attributes in Lexical Acquisition. Phd thesis, University of Essex.

Baroni, M., Barbu, E., Murphy, B., & Poesio, M. (2010). Strudel: A distributional semantic model based on properties and types. Cognitive Science, 34.

Baroni, M., Dinu, G., & Kruszewski, G. (2014). Don’t count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (pp. 238–247).

Baroni, M., Evert, S., & Lenci, A., editors. (2008). Bridging the gap between Semantic Theory and Computational Simulations: Proceedings of the ESSLLI Workshop on Distributional Lexical Semantics.

Bruni, E., Tran, N. K., & Baroni, M. (2014). Multimodal Distributional Semantics. J. Artif. Intell. Res.(JAIR), 49, 1–47.

Curran, J. R. (2003). From distributional to semantic similarity. PhD thesis, University of Edinburgh.

Finkelstein, L., Gabrilovich, E., Matias, Y., Rivlin, E., Solan, Z., Wolfman, G., & Ruppin, E. (2002). Placing search in context: the concept revisited. In ACM Transactions on Information Systems (Vol. 20, pp. 116–131). ACM. doi:10.1145/503104.503110

McRae, K., Spivey-Knowlton, M., & Tanenhaus, M. (1998). Modeling the influence of thematic fit (and other constraints) in on-line sentence comprehension. Journal of Memory and Language, 38, 283–312.

Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient Estimation of Word Representations in Vector Space. ICLR Workshop, 1–12.

Mikolov, T., Yih, W., & Zweig, G. (2013b). Linguistic Regularities in Continuous Space Word Representations, (June), 746–751.

Pado, U. (2007). The Integration of Syntax and Semantic Plausibility in a Wide-Coverage Model of Sentence Processing. Dissertation, Saarland University, Saarbrücken.

Rei, M., & Briscoe, T. (2014). Looking for Hyponyms in Vector Space. In CoNLL-2014 (pp. 68–77).

Rubenstein, H., & Goodenough, J. (1965). Contextual correlates of synonymy. Communications of the ACM, 8(10), 627–633.