Linguistic Regularities in Word Representations

In 2013, Mikolov et al. (2013) published a paper showing that complicated semantic analogy problems could be solved simply by adding and subtracting vectors learned with a neural network. Since then, there has been some more investigation into what is actually behind this method, and also some suggested improvements. This post is a summary/discussion of the paper “Linguistic Regularities in Sparse and Explicit Word Representations“, by Omer Levy and Yoav Goldberg, published at ACL 2014.

The Task

The task under consideration is analogy recovery. These are questions in the form:

a is to b as c is to d

In a usual setting, the system is given words a, b, c, and it needs to find d. For example:

‘apple’ is to ‘apples’ as ‘car’ is to ?

where the correct answer is ‘cars’. Or the well-known example:

‘man’ is to ‘woman’ as ‘king’ is to ?

where the desired answer is ‘queen’.

While methods such as relation extraction would also be completely reasonable approaches to this problem, the research is mainly focused on solving it by using vector similarity methods. This means we create vector representations for each of the words, and then use their positions in the high-dimensional feature space to determine what the missing word should be.

Vector Spaces

As the first step, we need to create feature vectors for each word in our vocabulary. The two main ways of doing this, which are also considered by this paper, are:

  1. BOW: The bag-of words approach. We count all the words that a certain word appears with, treat the counts as features in a vector, and weight them using something like positive pointwise mutual information (PPMI).
  2. word2vec: Vectors created using the word2vec toolkit. Low-dimensional dense vector representations are learned for each word using a neural network.

If you want to learn more details about these models, take a look at an earlier post about comparing both of these models.

Analogy Artihmetic

We have the vectors, but how can we find the correct words in the analogy task?

The idea of Mikolov et al. (2013) was that the vector offset of vectors should reflect their relation:

\(a – b \approx c – d\)
\(man -woman \approx king -queen\)

We can rearrange this to find d:

\(d \approx c – a +b\)
\(queen \approx king -man +woman\)

Therefore, we can calculate the vector for \(c – a + b\), compare it to the vectors of all words in our vocabulary, and return the highest-scoring word as the answer.

\(d_w = arg max_{d_w’ \in V}(cos(d’, c – a +b))\)

I’m using \(d_w\) here to denote the actual word, whereas the \(d\) stands for the vector representation of \(d_w\). \(V\) is our whole vocabulary. \(cos()\) is the cosine similarity between two vectors. Let’s refer to this as the addition method.regularities_equation

An alternative to this is to consider the directions of the offset directly. We find the word that has the most similar offset (in terms of direction), compared to the offset of the sample words.

\(d_w = arg max_{d’_w \in V}(cos(a – b, c – d’))\)

It only looks at the direction of the relational step, ignoring the actual magnitude of this step. We’ll see later that sometimes this is good, sometimes not so much. We’ll refer to this method as direction. Apparently this method was actually used by Mikolov in some experiments, although it wasn’t included in the paper.

Now onto the modification proposed by Levy & Goldberg (2014). The addition method basically looks for a word that is similar to c and b, and not similar to a, with each contributing equally. This can lead to cases where one large term dominates the equation, especially if the magnitudes of the similarities are very different. Therefore, the authors have proposed that instead of adding these similarities, they could be multiplied:

\(d_w = arg max_{d’_w \in V}\frac{cos(d’, c) cos(d’, b)}{cos(d’, a) + \epsilon}\)

If the vectors are normalised (unit length), then this formula actually becomes a multiplication of their scalar products:

\(d_w = arg max_{d’_w \in V}\frac{(d’ \cdot c) \times (d’ \cdot b)}{(d’ \cdot a) + \epsilon}\)

\(\epsilon\) is set to 0.001 to avoid division by zero. Also, the formula doesn’t handle negative similarity scores, whereas cosine (and the scalar product) can be negative. Therefore, the values are transformed into a positive range by doing \(x_{new} = (x + 1)/2\). We’ll refer to this as the multiplication method.


The alternative methods are evaluated on the task of analogy recovery, using three different datasets:

  1. The MSR dataset containing 8,000 analogy questions.
  2. The GOOGLE dataset with 19,544 analogy questions.
  3. The SEMEVAL dataset, covering 79 distinct relation types.

In the MSR and GOOGLE datasets, the system needs to find the correct answer from an open vocabulary. In contrast, the SEMEVAL task provides candidates for each relation that need to be reranked. Accuracy is used as the evaluation metric on all datasets.

The word vectors were trained on 1.5 billion words of English Wikipedia. The vocabulary was restricted to contain only words that occurred at least 100 times in the corpus, resulting in 189,533 words.




Some observations based on the results:

  • The multiplication method outperforms the additon method on two tasks, and gives comparable results on the third one.
  • The direction method performs horribly on the MSR and GOOGLE tasks, but gives the best results on SEMEVAL. This is because the method only looks at the direction of the relation, but doesn’t care if the result is actually positioned very far in the semantic space. The SEMEVAL task restricts the candidates to a small list, and the method is able to pick the best option, but doesn’t perform as well on open vocabulary.
  • The BOW model is generally worse than word2vec when using the addition method, but is comparable to the best results when using multiplication. This indicates that the property of linguistic regularities applies to both classical BOW and neural network representations.


Based on my experience with applying different vector operations to discover linguistic regularities, I’d like to point out two observations.

First, while the ability to recover semantic relations in such a way is remarkable, it is not quite as impressive as it seems at first glance. Let’s take the well-known example of king – man + woman = queen. Depending on the parameters I used, several vector spaces I built actually gave queen as the most similar word to king, with no vector operations needed. This makes sense, because king and queen are semantically very related, and probably occur in very similar contexts. But it also means that (woman – man) only needs to be a zero vector or some small random noise in order to give the correct answer. I’m not saying this is a major factor in these experiments, but I would encourage future work to also report baselines without using any vector operations.

Second, I observed that the quality of these regularities highly depends on the examples that are provided, and how they relate to the query word. For my work (Rei & Briscoe, 2014) I looked at hyponym relations, and below are some examples when trying to find hyponyms for bird.


The greyed out terms are the query terms, and it is common practice to ignore these in the results; the bold results are correct answers. As you can see, just the plain vector similarity returns one bird species correctly in the top results. When using (salmon,fish) as the example pair, the system is quite good and returns a whole lot more. But as I start to use hyponym examples from other domains, the quality degrades fast, and (bird-treatment+therapy) actually returns worse results than the basic vector similarity.

I believe there is still plenty of space to further improve out understanding of how we can operate on various types of features, in order to model different semantic relations and compositional semantics.


Levy, O., & Goldberg, Y. (2014). Linguistic Regularities in Sparse and Explicit Word Representations. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (pp. 171–180).

Mikolov, T., Yih, W., & Zweig, G. (2013). Linguistic Regularities in Continuous Space Word Representations, (June), 746–751.

Rei, M., & Briscoe, T. (2014). Looking for Hyponyms in Vector Space. In CoNLL-2014 (pp. 68–77).

Multilingual Semantic Models

In this post I’ll discuss a model for learning word embeddings, such that they end up in the same space in different languages. This means we can find the similarity between some English and German words, or even compare the meaning of two sentences in different languages. It is a summary and analysis of the paper by Karl Moritz Hermann and Phil Blunsom, titled “Multilingual Models for Compositional Distributional Semantics“, published at ACL 2014.

The Task

The goal of this work is to extend the distributional hypothesis to multilingual data and joint-space embeddings. This would give us the ability to compare words and sentences in different languages, and also make use of labelled training data from languages other than the target language. For example, below is an illustration of English words and their Estonian translations in the same semantic space.



This actually turns out to be a very difficult task, because the distributional hypothesis stops working across different languages. While “fish” is an important feature of “cat”, because they occur together often, “kass” never occurs with “fish”, because they are in different languages and therefore used in separate sets of documents.

In order to learn these representations in the same space, the authors construct a neural network that learns from parallel sentences (pairs of the same sentence in different languages). The model is then evaluated on the task of topic classification, training on one language and testing on the other.

A bit of a spoiler, but here is a visualisation of some words from the final model, mapped into 2 dimensions.



The words from English, German and French are successfully mapped into clusters based on meaning. The colours indicate gender (blue=male, red=female, green=neutral).

The Multilingual Model

The main idea is as follows: We have sentence \(a\) in one language, and we have a function \(f(a)\) which maps that sentence into a vector representation (we’ll come back to that function). We then have sentence \(b\), which is the same sentence just in a different language, and function \(g(b)\) for mapping it into a vector representation. Our goal is to have \(f(a)\) and \(g(b)\) be identical, because both of these sentences have the same meaning. So during training, we show the model a series of parallel sentences \(a\) and \(b\), and each time we adjust the functions \(f(a)\) and \(g(b)\) so that they would produce more similar vectors.

Here is a graphical representation of the model:


\(b1\), \(b2\) and \(b3\) are words in sentence \(b\); \(a1\), \(a2\),\(a3\) and \(a4\) are words in sentence \(a\). The red vectors in the middle are the sentence representations that we want to be similar.

Next, let’s talk about the functions \(f(a)\) and \(g(b)\) that map a sentence to a vector. As you can see from the image above, each word is represented as a vector as well. The simplest option of going from words to sentences is to just add the individual word vectors together (the ADD model):

\(f_{ADD}(a) = \sum_{i=1}^{n} a_i\)

Here, \(a_i\) is the vector for word number \(i\) in sentence \(a\). This addition is similar to a basic bag-of-words model, because it doesn’t preserve any information about the order of the words. Therefore, the authors have also proposed a bigram version of this function (the BI model):

\(f_{BI}(a) = \sum_{i=1}^{n} tanh(a_{i-1} + a_i)\)

In this function, we step though the sentence, add together vectors for two consecutive words, and pass them through a nonlinearity (tanh). The result is then summed together into a sentence vector. This is essentially a multi-layer compositional network, where word vectors are first combined to bigram vectors, and then bigram vectors are combined to sentence vectors.

One more component to make this model work – the optimization function. The authors define an energy function given two sentences:

\(E(a,b) = || f(a) – g(b) ||^ 2\)

This means we find the Euclidean distance between the two vector representations and take the square of it. This value will be big when the vectors are different, and small when they are similar.

But we can’t directly use this for optimization, because functions \(f(a)\) and \(g(b)\) that always returned zero vectors would be the most optimal solution. We want the model to give similar vectors for similar sentences, but different vectors for semantically different sentences. Here’s a function for that:

\(E_{nc}(a,b,c) = [m + E(a,b) – E(a,c)]_{+}\)

We’ve introduced a randomly selected sentence \(c\) that probably has nothing to do with \(a\) or \(b\). Our objective is to minimze the \(E_{nc}(a,b,c)\) function, which means we want \(E(a,b)\) (for related sentences) to be small, and \(E(a,c)\) (for unrelated sentences) to be large. This form of training – teaching the model to distinguish between correct and incorrect samples – is called noise contrastive estimation. The formula also includes \(m\), which is the margin we want to have between the values of \(E(a,b)\) and \(E(a,c)\). The whole thing is passed through the function \([x]_{+} = max(x,0)\), which means that if \(E(a,c)\) is greater than \(E(a,b)\) by margin \(m\), then we’re already optimal and don’t need to adjust the model further.

The authors also experiment with a document-level variation of the model (the DOC model), where individual sentence vectors are combined into document vectors and these are also optimized to be similar, in addition to the sentence vectors.


The authors evaluate the system on the task of topic classification. The classifier is trained on one language (eg English) and the test results are reported on another language (eg German) for which no labelled training data was used. They run two main experiments:

  1. The cross-lingual document classification (CLDC) task, described by Klementiev et al. (2012). The system is trained on the parallel Europarl corpus, and tested on Reuters RCV1/RCV2. The language pairs used were English-German and English-French.
  2. The authors built a new corpus from parallel subtitles of TED talks (not yet online at the time of writing this), based on a previous TED corpus for IWSLT. Each talk also has topic tags assigned to them, and the task is to assign a correct tag to every talk, using the document-level vector.



First, results on the CLDC task:

I-Matrix is the previous state-of-the-art system, and all the models described here manage to outperform it. The +-variations of the model (ADD+ and BI+) use the French data as an extra training resource, thereby improving performance. This is an interesting result, as the classifier is trained on English and tested on German, and French seems completely unrelated to the task. But by adding it into the model, the English word representations are improved by having more information available, which in turn propagates on to having better German representations.

Next, experiments on the TED corpus:


The authors have performed a much larger number of experiments, and I’ve only chosen a few examples to show here.

The MT System is a machine translation baseline, where the test data is translated into the source language using a machine translation system. The most interesting scenarios for application are where the source language is English, and this is where the MT baseline often still wins. So if we want to topic classification in Dutch, but we only have English labelled data, it’s best to just automatically translate the Dutch text into English before classification.

Experiments in the other direction (where the target language is English) show different results, and the multilingual neural models manage to win on most languages. I’m curious about this difference – perhaps the MT systems are better tuned to translate into English, but not as good when translating from English into other languages? In any case, I think with some additional developments the neural network model will be able to beat the baseline in both directions.

In most cases, adding the document-level training signal (ADD/DOC) helped accuracy quite a bit. The bigram models (BI) however were outperformed by the basic ADD models on this task, and the authors suspect this is due to sparsity issues caused by less training data.

Finally, the ADD/DOC/joint model was trained on all languages simultaneously, taking advantage of parallel data in all the languages, and mapping all vectors into the same space. The results of this experiment seem to be mixed, leading to an improvement on some languages and decrease on others.

In conclusion, this is definitely a very interesting model, and it bridges the gap between vector representations of different languages, using only sentence-aligned plain text. Combining sentence-level and document-level training signals seems to give a fairly consistent improvement in classification accuracy. Unfortunately, in the most interesting scenario, mapping from English to other resource-poor languages, this system does not yet beat the MT baseline. But hopefully this is only the first step, and future research will further improve the results.


Hermann, K. M., & Blunsom, P. (2014). Multilingual Models for Compositional Distributed Semantics. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (pp. 58–68).

Klementiev, A., Titov, I., & Bhattarai, B. (2012). Inducing crosslingual distributed representations of words. In COLING 2012.

Political ideology detection

Neural networks have a range of interesting applications, and here I will discuss on one them: recursive neural networks and the detection of political ideology. This post is a summary and analysis of a recent publication by Mohit Iyyer, Peter Anns, Jordan Boyd-Graber and Philip Resnik: “Political Ideology Detection Using Recursive Neural Networks“.

The Task

Given a sentence, we want the model to detect the political ideology expressed in that sentence. In this research, the authors deal with US politics, so the possible options are liberal (democrats) or conservative (republicans). As a practical application we might consider a system that processes a large amount of news articles or public speeches to detect and measure explicit or hidden political bias of the authors.


A traditional approach to this problem is a simple bag-of-words model, where each word is treated as a separate feature, but this ignores any syntactic structure and even word order. As shown below, political ideology can be compositionally complicated – while certain sections of the sentence are locally conservative, the way they are used in context makes the overall sentence liberal.political_sample_1

Figure 1: Sample sentence from Iyyer et al. (2014). Blue nodes are liberal, red nodes are conservative, grey nodes are neutral.

Recursive Neural Network

The recursive neural network in this work is based on Socher et al. (2011), where it was used for sentiment detection. The idea of semantic composition is that the meaning of a sentence or text is composed of its smaller subparts, and recursive neural networks aim to model this by recursively combining vector representations of words and phrases.

We represent each word as a vector of length d. We can then construct a network that takes two word vectors, concatenated into a vector of length 2d, as input and outputs a new vector of length d. We can then stack these networks together and use the output of one network as one of the inputs to another network. This way we can combine multiple words to phrases and phrases to sentences. We don’t need to have a separate network for each of these levels, and can just reuse the same network each time – which is why these are called recursive neural networks.

Mathematically, the composition happens as follows:

x_c = f(W_L\times x_a + W_R\times x_b + b)

\(x_a\) and \(x_b\) are the vectors for the two input words/phrases; \(x_c\) is the output vector; \(W_L\) and \(W_R\) are the weight matrices for the left and right input; \(b\) is the bias vector. If we think of the input as a concatenation of the vectors, this formula actually becomes the traditional single-layer neural network:

x_c = f(W\times x_{ab} + b)

An example of this recurrent neural network in action is shown below:


During training, an additional softmax layer is added on top, which calculates the probability for both classes (liberal or conservative). The model is then trained to minimize the negative log-probability of the correct class, with L2-regularization.  The error is backpropagated through the whole tree, so the information on the sentence level will be used to modify the individual word vectors, along with all the weights used in the composition.

Also, if I understand the paper correctly, the final prediction from the model at test time is actually generated by averaging vectors for all the nodes in the sentence tree, concatenating this with the root vector, and training a separate logistic regression model over that. Which seems a bit odd, because the recurrent network is quite capable of predicting the class label on the root node, and has been optimised to do so. I wonder what the performance difference would be, if the root-level predictions from the RNN were used directly for evaluation.


The authors use two datasets for evaluation:

  1. The Convote dataset (Thomas et al., 2006) contains transcripts of spoken text from the US Congressional floor debate. The experiments here use 7,816 sentences from the corpus.
  2. The Ideological Books Corpus (IBC) (Gross et al., 2013) is a collection of books and articles by authors whose political bias is well-known. Iyyer et al. (2014) extend the dataset by providing sentence-level and partial phrase-level annotation for 4,062 sentences, crowd-sourced through Crowdflower.

Both of these datasets are originally annotated on the level of the author and their political views, but this information can be used to also label each sentence with its most likely political class.

On both datasets, the authors apply heuristic filters to only keep sentences containing explicit bias in either direction, and remove neutral sentences which take up majority of the corpora. However, as this filtering is based on surface forms (essentially bag-of-words or bag-of phrases), it is unknown how this affects the final dataset. It would be interesting to also see results where the test set contains all of the original sentences.


The authors compare a number of different models.

  • Baselines
    • Random: The label (conservative/liberal) is chosen randomly.
    • LR1: Basic logistic regression using bag-of-words (BoW) features.
    • LR2: As previous, but adding phrase annotations as more training data.
    • LR3: Logistic regression with BoW and dependency-based features.
    • LR-w2v: Logistic regression over averaged word vectors from word2vec.
  • RNN Models
    • RNN1: The basic RNN model, trained on sentence-level annotations, using random initialization.
    • RNN1-w2v: As previous, but initialized with vectors from word2vec.
    • RNN2-w2v: The RNN model, trained on sentence-level and phrase-level annotations, initialized with vectors from word2vec.

The results (accuracy) can be seen in the graph.


Some conclusions we can draw from the results:

  • All the models outperform the random baseline by quite a margin.
  • All the recurrent neural network models outperform all the logistic regression models.
  • Adding the phrase-level annotations as extra training data for the logistic regression decreases performance (LR2), but adding them to the RNN model improves performance (RNN2-w2v).
  • Logistic regression trained on averaged word2vec vectors (LR-w2v) outperforms logistic regression with BoW features (LR1), and even dependency-based features (LR3, on the IBC dataset).
  • Initializing the RNN with word2vec vectors gives a little boost to the overall accuracy.


Gross, J., Acree, B., Sim, Y., & Smith, N. A. (2013). Testing the Etch-a-Sketch Hypothesis : Measuring Ideological Signaling via Candidates’ Use of Key Phrases. In APSA 2013.

Iyyer, M., Enns, P., Boyd-Graber, J., & Resnik, P. (2014). Political Ideology Detection Using Recursive Neural Networks. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (pp. 1113–1122).

Socher, R., Pennington, J., Huang, E. H., Ng, A. Y., & Manning, C. D. (2011). Semi-supervised recursive autoencoders for predicting sentiment distributions. Proceedings of EMNLP. Retrieved from

Thomas, M., Pang, B., & Lee, L. (2006). Get out the vote: Determining support or opposition from Congressional floor-debate transcripts. In Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing – EMNLP ’06 (pp. 327–335).

Don’t count, predict

In the past couple of years, neural networks have nearly taken over the field of NLP, as they are being used in recent state-of-the-art systems for many tasks. One interesting application is distributional semantics, as they can be used to learn intelligent dense vector representations for words. Marco Baroni, Georgiana Dinu and German Kruszewski presented a paper in ACL 2014 called “Don’t count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors“, where they compare these new neural-network models with more traditional context vectors on a range of different tasks. Here I will try to give an overview and a summary of their work.

Distributional hypothesis

The goal is to find how similar two words are to each other semantically. The distributional hypothesis states:

Words which are similar in meaning occur in similar contexts
(Rubenstein & Goodenough, 1965).

Therefore, if we want to find a word similar to “magazine”, we can look for words that occur in similar contexts, such as “newspaper”.

I was reading a magazine today I was reading a newspaper today
The magazine published an article The newspaper published an article
He buys this magazine every day He buys this newspaper every day


Also, if we want to find how similar “magazine” and “newspaper” are, we can compare how similar are all the contexts in which they appear. For example, to find the similarity between two words, we can represent the contexts as feature vectors and calculate the cosine similarity between their corresponding vectors.

The counting model

In a traditional counting model, we count how many times a certain context word appears near our main word. We have to manually choose a window size – the number of words we count on either side of the main word.

For example, given sentences “He buys this newspaper every day” and “I read a newspaper every day”, we can look at window size 2 (on either side of the main word) and create the following count vector:

buys this every day read a
1 1 2 2 1 1

As you can see, “every” and “day” get a count of 2, because they appear in both sentences, whereas “he” doesn’t get included because it doesn’t fit into the window of 2 words.

In order to get the best performance, it is important to use a weighting function which turns these counts into more informative values. A good weighting scheme would downweight frequent words such as “this” and “a”, and upweight more informative words such as “read”. The authors experimented with two weighting schemes: Pointwise Mutual Information  and Local Mutual Information. An overview of various weighting methods can be found in the thesis of Curran (2003).

These vectors are also often compressed. The motivation is that feature words such as “buy” and “purchase” are very similar and compression techniques should be able to represent them as a single semantic class. In addition, this gives smaller dense vectors, which can be easier to operate on. The authors here experimented with SVD and Non-negative Matrix Factorization.

The predicting model

A the neural network (predicting) model, this work uses the word2vec implementation of Continuous Bag-of-Words (CBOW). Word2vec is a toolkit for efficiently learning word vectors from very large text corpora (Mikolov et al., 2013). The CBOW architecture is shown below:


The vectors of context words are given as input to the network. They are summed and then used to predict the main word. During training, the error is backpropagated and the context vectors are updated so that they would predict the correct word. Since similar words appear in similar contexts, the vectors of similar words will also be updated towards similar directions.

A normal network would require finding the probabilities of all possible words in our vocabulary (using softmax), which can be computationally very demanding. Word2vec implements two alternatives to speed this up:

  1. Hierarchical softmax, where the words are arranged in a tree (similar to a decision tree), making the complexity logarithmic.
  2. Negative sampling, Where the system learns to distinguish the correct answer from a sample of a few incorrect answers.

In addition, word2vec can downsample very frequent words (for example, function words such as “a” and “the”) which are not very informative.

From personal experience I have found skip-gram models (also implemented in word2vec) to perform better than CBOW, although they are slightly slower. Skip-grams were not compared in this work, so it is still an open question which of the two models gives better vectors, given the same training time.


The models were trained on a corpus of 2.8 billion tokens, combining ukWac, the English Wikipedia and the British National Corpus. They used 300,000 most frequent words as both the context and target words.

The authors performed on a number of experiments on different semantic similarity tasks and datasets.

  1. Semantic relatedness: In this task, humans were asked to rate the relatedness of two words, and the system correlation with these scores is measured.
    • rg: 65 noun pairs by Rubenstein & Goodenough (1965)
    • ws: Wordsim353, 353 word pairs by Finkelstein et al. (2002)
    • wss: Subset of Wordsim353 focused on similarity (Agirre et al., 2009)
    • wsr: Subset of Wordsim353 focused on relatedness (Agirre et al., 2009)
    • men: 1000 word pairs by Bruni et al. (2014)
  2. Synonym detection: The system needs to find the correct synonym for a word.
    • toefl: 80 multiple-choice questions with 4 synonym candidates (Landauer & Dumais, 1997).
  3. Concept categorization: Given a set of concepts, the system needs to group them into semantic categories.
    • ap: 402 concepts in 21 categories (Almuhareb, 2006)
    • esslli: 44 concepts in 6 categories (Baroni et al., 2008)
    • battig: 83 concepts in 10 categories (Baroni et al., 2010)
  4. Selectional preferences: Using some examples, the system needs to decide how likely a noun is to be a subject or object for a specific verb.
    • up: 221 word pairs (Pado, 2007)
    • mcrae: 100 noun-verb pairs (McRae et al., 1998)
  5. Analogy: Using examples, the system needs to find a suitable word to solve analogy questions, such as: “X is to ‘uncle’ as ‘sister’ is to ‘brother'”
    • an: ~19,500 analogy questions (Mikolov et al., 2013b)
    • ansyn: Subset of the analogy questions, focused on syntactic analogies
    • ansem: Subset of the analogy questions, focused on semantic analogies


The graph below illustrates the performance of different models on all the tasks. The blue bars represent the counting models, and the red bars are for neural network models. The “individual” models are the best models on that specific task, whereas the “best-overall” is the single best model across all the tasks.

In conclusion, the neural networks win with a large margin. The neural models have given a large improvement on the tasks of semantic relatedness, synonym detection and analogy detection. The performance is equal or slightly better on categorization and selectional preferences.

The best parameter choices for counting models are as follows:

  • window size 2 (bigger is not always better)
  • weighted with PMI, not LMI
  • no dimensionality reduction (not using SVD or NNMF)

The best parameters for the neural network model were:

  • window size 5
  • negative sampling (not hierarchical softmax)
  • subsampling of frequent words
  • dimensionality 400

The paper also finds that the neural models are much less sensitive to parameter choices, as even the worst neural models perform relatively well, whereas counting models can thoroughly fail with unsuitable values.

It seems appropriate to quote the final conclusion of the authors:

Our secret wish was to discover that it is all hype, and count vectors are far superior to their predictive counterparts. A more realistic expectation was that a complex picture would emerge, with predict and count vectors beating each other on different tasks. Instead, we found that the predict models are so good that, while triumphalist overtones still sound excessive, there are very good reasons to switch to the new architecture.

The results certainly support that vectors created by neural models are more suitable for distributional similarity tasks. We have also performed a similar comparison on the task of hyponym generation (Rei & Briscoe, 2014). Our findings match those here – the neural models outperform simple bag-of-words models. However, the neural models are in turn outperformed by a dependency-based model. It remains to be seen whether this finding also applies on a wider range of distributional similarity models.


Agirre, E., Alfonseca, E., Hall, K., Kravalova, J., Paşca, M., & Soroa, A. (2009). A study on similarity and relatedness using distributional and WordNet-based approaches. Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics on – NAACL ’09, (June), 19. doi:10.3115/1620754.1620758

Almuhareb, A. (2006) Attributes in Lexical Acquisition. Phd thesis, University of Essex.

Baroni, M., Barbu, E., Murphy, B., & Poesio, M. (2010). Strudel: A distributional semantic model based on properties and types. Cognitive Science, 34.

Baroni, M., Dinu, G., & Kruszewski, G. (2014). Don’t count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (pp. 238–247).

Baroni, M., Evert, S., & Lenci, A., editors. (2008). Bridging the gap between Semantic
Theory and Computational Simulations: Proceedings of the ESSLLI Workshop on Distributional Lexical Semantics.

Bruni, E., Tran, N. K., & Baroni, M. (2014). Multimodal Distributional Semantics. J. Artif. Intell. Res.(JAIR), 49, 1–47.

Curran, J. R. (2003). From distributional to semantic similarity. University of Edinburgh. University of Edinburgh.

Finkelstein, L., Gabrilovich, E., Matias, Y., Rivlin, E., Solan, Z., Wolfman, G., & Ruppin, E. (2002). Placing search in context: the concept revisited. In ACM Transactions on Information Systems (Vol. 20, pp. 116–131). ACM. doi:10.1145/503104.503110

McRae, K., Spivey-Knowlton, M., & Tanenhaus, M. (1998). Modeling the influence of the- matic fit (and other constraints) in on-line sentence comprehension. Journal of Memory and Language, 38:283–312.

Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient Estimation of Word Representations in Vector Space. ICLR Workshop, 1–12.

Mikolov, T., Yih, W., & Zweig, G. (2013b). Linguistic Regularities in Continuous Space Word Representations, (June), 746–751.

Pado, U. (2007). The Integration of Syntax and Semantic Plausibility in a Wide-Coverage Model of Sentence Processing. Dissertation, Saarland University, Saarbrücken.

Rei, M., & Briscoe, T. (2014). Looking for Hyponyms in Vector Space. In CoNLL-2014 (pp. 68–77).

Rubenstein, H., & Goodenough, J. (1965). Contextual correlates of synonymy.
Communications of the ACM, 8 (10), 627–633.


Neural Networks, Part 3: The Network

We have learned about individual neurons in the previous section, now it’s time to put them together to form an actual neural network.

The idea is quite simple – we line multiple neurons up to form a layer, and connect the output of the first layer to the input of the next layer. Here is an illustration:

Figure 1: Neural network with two hidden layers.

Each red circle in the diagram represents a neuron, and  the blue circles represent fixed values. From left to right, there are four columns: the input layer, two hidden layers, and an output layer. The output from neurons in the previous layer is directed into the input of each of the neurons in the next layer.

We have 3 features (vector space dimensions) in the input layer that we use for learning: \(x_1\), \(x_2\) and \(x_3\). The first hidden layer has 3 neurons, the second one has 2 neurons, and the output layer has 2 output values. The size of these layers is up to you – on complex real-world problems we would use hundreds or thousands of neurons in each layer.

The number of neurons in the output layer depends on the task. For example, if we have a binary classification task (something is true or false), we would only have one neuron. But if we have a large number of possible classes to choose from, our network can have a separate output neuron for each class.

The network in Figure 1 is a deep neural network, meaning that it has two or more hidden layers, allowing the network to learn more complicated patterns. Each neuron in the first hidden layer receives the input signals and learns some pattern or regularity. The second hidden layer, in turn, receives input from these patterns from the first layer, allowing it to learn “patterns of patterns” and higher-level regularities. However, the cost of adding more layers is increased complexity and possibly lower generalisation capability, so finding the right network structure is important.


I have implemented a very simple neural network for demonstration. You can find the code here:

The first important method is initialiseNetwork(), which sets up the necessary structures:

	public void initialiseNetwork(){
		input = new double[1 + M]; // 1 is for the bias
		hidden = new double[1 + H];
		weights1 = new double[1 + M][H];
		weights2 = new double[1 + H];

		input[0] = 1.0; // Setting the bias
		hidden[0] = 1.0;

\(M\) is the number of features in the feature vectors, \(H\) is the number of neurons in the hidden layer. We add 1 to these, since we also use the bias constants.

We represent the input and hidden layer as arrays of doubles. For example, hidden[i] stores the current output value of the i-th neuron in the hidden layer.

The first set of weights, between the input and hidden layer, are stored as a matrix. Each of the \((1 + M)\) neurons in the input layer connects to \(H\) neurons in the hidden layer, leading to a total of \((1 + M) \times H\) weights. We only have one output neuron, so the second set of weights between hidden and output layers is technically a \((1 + H) \times 1\) matrix, but we can just represent that as a vector.

The second important function is forwardPass(), which takes an input vector and performs the computation to reach an output value.

	public void forwardPass(){
		for(int j = 1; j < hidden.length; j++){
			hidden[j] = 0.0;
			for(int i = 0; i < input.length; i++){
				hidden[j] += input[i] * weights1[i][j-1];
			hidden[j] = sigmoid(hidden[j]);

		output = 0.0;
		for(int i = 0; i < hidden.length; i++){
			output += hidden[i] * weights2[i];
		output = sigmoid(output);

The first for-loop calculates the values in the hidden layer, by multiplying the input vector with the weight vector and applying the sigmoid function. The last part calculates the output value by multiplying the hidden values with the second set of weights, and also applying the sigmoid.


To test out this network, I have created a sample dataset using the database at This dataset contains sociodemographic statistics for 141 countries:

  • Population density (per suqare km)
  • Population growth rate (%)
  • Urban population (%)
  • Life expectancy at birth (years)
  • Fertility rate (births per woman)
  • Infant mortality (deaths per 1000 births)
  • Enrolment in tertiary education (%)
  • Unemployment (%)
  • Estimated control of corruption (score)
  • Estimated government effectiveness (score)
  • Internet users (per 100)

Based on this information, we want to train a neural network that can predict whether the GDP per capita is more than average for that country (label 1 if it is, 0 if it’s not).

I’ve separated the dataset for training (121 countries) and testing (40 countries). The values have been normalised, by subtracting the mean and dividing by the standard deviation, using a script from a previous article. I’ve also pre-trained a model that we can load into this network and evaluate. You can download these from here: original data, training data, test data, pretrained model.

You can then execute the neural network (remember to compile and link the binaries):

java neuralnet.SimpleNeuralNetwork data/model.txt data/countries-classify-gdp-normalised.test.txt

The output should be something like this:

Label: 0	Prediction: 0.01
Label: 0	Prediction: 0.00
Label: 1	Prediction: 0.99
Label: 0	Prediction: 0.00
Label: 0	Prediction: 0.20
Label: 0	Prediction: 0.01
Label: 1	Prediction: 0.99
Label: 0	Prediction: 0.00
Accuracy: 0.9

The network is in verbose mode, so it prints out the labels and predictions for each test item. At the end, it also prints out the overall accuracy. The test data contains 14 positive and 26 negative examples; a random system would have had accuracy 50%, whereas a biased system would have accuracy 65%. Our network managed 90%, which means it has learned some useful patterns in the data.

In this case we simply loaded a pre-trained model. In the next section, I will describe how to learn this model from some training data.

How to normalise feature vectors

I was trying to create a sample file for training a neural network and ran into a common problem: the feature values are all over the place. In this example I’m working with demographical real-world values for countries. For example, a feature for GDP per person in a country ranges from 551.27 to 88286.0, whereas estimates for corruption range between -1.56 to 2.42. This can be very confusing for machine learning algorithms, as they can end up treating bigger values as more important signals.

To handle this issue, we want to scale all the feature values into roughly the same range. We can do this by taking each feature value, subtracting its mean (thereby shifting the mean to 0), and dividing by the standard deviation (normalising the distribution). This is a piece of code I’ve implemented a number of times for various projects, so it’s time to write a nice reusable script. Hopefully it can be helpful for others as well. I chose to do this in python, as it’s easies to run compared to C++ and Java (doesn’t need to be compiled), but has better support for real-valued numbers compared to bash scripting.

Each line in the input file is assumed to be a feature vector, with values separated by whitespace. The first element is an integer class label that will be left untouched. This is followed by a number of floating point feature values which will be normalised. For example:

1 0.563 13498174.2 -21.3
0 0.114 42234434.3 15.67

We’re assuming dense vectors, meaning that each line has an equal number of features.

To execute it, simply use

python < in.txt > out.txt

The complete script that will normalise feature vectors is here:

import sys;
import fileinput;
import numpy;

data = []
linecount = 0
for line in fileinput.input():
  if line.strip():
    index = 0
    for value in line.split():
      if linecount == 0:
      if index == 0:

for row in range(0, linecount):
  for col in range(0, index):
    if col == 0:
      val = (data[col][row] - numpy.mean(data[col]))/numpy.std(data[col])
      sys.stdout.write("\t" + str(val))


Neural Networks, Part 2: The Neuron

A neuron is a very basic classifier. It takes a number of input signals (a feature vector) and outputs a single value (a prediction). A neuron is also a basic building block of neural networks, and by combining together many neurons we can build systems that are capable of learning very complicated patterns. This is part 2 of an introductory series on neural networks. If you haven’t done so yet, you might want to start by learning about the background to neural networks in part 1.

Neurons in artificial neural networks are inspired by biological neurons in nervous systems (shown below). A biological neuron has three main parts: the main body (also known as the soma), dendrites and an axon. There are often many dendrites attached to a neuron body, but only one axon, which can be up to a meter long. In most cases (although there are exceptions), the neuron receives input signals from dendrites, and then outputs its own signals through the axon. Axons in turn connect to the dendrites of other neurons, using special connections called synapses, forming complex neural networks.

Figure 1: Biological neuron in a nervous system

Below is an illustration of an artificial neuron, where the input is passed in from the left and the prediction comes out from the right. Each input position has a specific weight in the neuron, and they determine what output to give, given a specific input vector. For example, a neuron could be trained to detect cities. We can then take the vector for London from the previous section, give it as input to our neuron, and it will tell us it’s a city by outputting value 1. If we do the same for the word Tuesday, it will give a 0 instead, meaning that it’s not a city.

Figure 2: Artificial neuron

You might notice that there’s a constant value of \(+1\) as one of the input signals, and it has a separate weight \(w_0\). This is called a bias, and it allows the network to shift the activation function up or down. Biases are not strictly required for building neurons or neural networks, but they can be very important to the performance, depending on the types of feature values you are using.

Let’s say we have an input vector \([1, x_1, x_2]\) and a weight vector \([w_0, w_1, w_2]\). Internally, we first multiply the corresponding input values with their weights, and add them together:

z = (w_0 \times 1) + (w_1 \times x_1) + (w_2 \times x_2)

Then, we pass the sum through an activation function. In this case we will use the sigmoid function (also known as the logistic curve) as our activation function.

y = f(z) = f((w_0 \times 1) + (w_1 \times x_1) + (w_2 \times x_2))


f(t) = \frac{1}{1 + e^{-t}}

The sigmoid function takes any real value and maps it to a range between 0 and 1. When plotted, a sigmoid function looks like this:

Sigmoid function
Figure 3: Sigmoid function

Using a sigmoid as our activation function has some benefits:

  • Regardless of input, it will map everything to a range between 0 and 1. We don’t have to worry about output values exploding for unexpected input vectors.
  • The function is non-linear, which allows us to learn more complicated non-linear relationships in our neural networks.
  • It is differentiable, which comes in handy when we try to perform backpropagation to train our neural networks.

However, it is not required to use an activation function, and there exist many successful network architectures that don’t use one. We could also use a different activation, such as a hyperbolic tangent or a rectifier.

We have now looked at all the components of a neuron, but we’re not going to stop here. Richard Feynman once said “What I cannot create, I do not understand”, so let us create a working example of a neuron. Here is the complete code required to test a neuron with pre-trained weights:

public class SimpleNeuron {

  private double[] weights;

  public SimpleNeuron(double[] weights){
    this.weights = weights;

  public double classify(double[] input){
    double value = 0.0;

    // Adding the bias
    value += weights[0] * 1.0;

    // Adding the rest of the weights
    for(int i = 0; i < input.length; i++)
      value += weights[i + 1] * input[i];

    // Passing the value through the sigmoid activation function
    value = 1.0 / (1.0 + Math.exp(-1.0 * value));

    return value;

  public static void main(String[] args) {
    // Creating data structures to hold the data
    String[] names = new String[5];
    double[][] vectors = new double[5][2];

    // Inserting the data
    names[0] = "London";
    vectors[0][0] = 0.86;
    vectors[0][1] = 0.09;
    names[1] = "Paris";
    vectors[1][0] = 0.74;
    vectors[1][1] = 0.11;
    names[2] = "Tuesday";
    vectors[2][0] = 0.15;
    vectors[2][1] = 0.77;
    names[3] = "Friday";
    vectors[3][0] = 0.05;
    vectors[3][1] = 0.82;
    names[4] = "???";
    vectors[4][0] = 0.59;
    vectors[4][1] = 0.19;

    // Initialising the weights
    double[] weights = {0.0, 100.0, -100.0};
    SimpleNeuron neuron = new SimpleNeuron(weights);

    // Classifying each of the data points
    for(int i = 0; i < names.length; i++){
      double prediction = neuron.classify(vectors[i]);
      System.out.println(names[i] + " : " + (int)prediction);

The classify() function is the interesting part in this code. It takes a feature vector as an argument and returns the prediction value y. For testing, we use the examples from the previous section and try to classify each of them. The output of running this code will be as follows:

London : 1
Paris : 1
Tuesday : 0
Friday : 0
??? : 1

As you can see, the neuron has successfully separated cities from days. It has also provided a label for the previously-unknown example – apparently the last data point should belong with cities as well.

For this code example, I manually chose and hard-coded weight values, so that it would provide a good classification. In later sections we will see how to have the system learn these values automatically, using some training data.

Now that we know about individual neurons, in the next section we’ll look at how to connect them together and form neural networks.


Neural Networks, Part 1: Background

Artificial neural networks (NN for short) are practical, elegant, and mathematically fascinating models for machine learning. They are inspired by the central nervous systems of humans and animals – smaller processing units (neurons) are connected together to form a complex network that is capable of learning and adapting.

The idea of such neural networks is not new. McCulloch-Pitts (1943) described binary threshold neurons already back in 1940’s. Rosenblatt (1958) popularised the use of perceptrons, a specific type of neurons, as very flexible tools for performing a variety of tasks. The rise of neural networks was halted after Minsky and Papert (1969) published a book about the capabilities of perceptrons, and mathematically proved that they can’t really do very much. This result was quickly generalised to all neural networks, whereas it actually applied only to a specific type of perceptrons, leading to neural networks being disregarded as a viable machine learning method.

In recent years, however, the neural network has made an impressive comeback. Research in the area has become much more active, and neural networks have been found to be more than capable learners, breaking state-of-the-art results on a wide variety of tasks. This has been substantially helped by developments in computing hardware, allowing us to train very large complex networks in reasonable time. In order to learn more about neural networks, we must first understand the concept of vector space, and this is where we’ll start.

A vector space is a space where we can represent the position of a specific point or object as a vector (a sequence of numbers). You’re probably familiar with 2 or 3-dimensional coordinate systems, but we can easily extend this to much higher dimensions (think hundreds or thousands). However, it’s quite difficult to imagine a 1000-dimensional space, so we’ll stick to 2-dimensional illustations.

In the graph below, we have placed 4 objects in a 2-dimensional space, and each of them has a 2-dimensional vector that represents their position in this space. For machine learning and classification we make the assumption that similar objects have similar coordinates and are therefore positioned close to each other. This is true in our example, as cities are positioned in the upper left corner, and days-of-the-week are positioned a bit further in the lower right corner.


Let’s say we now get a new object (see image below) and all we know are its coordinates. What do you think, is this object a city or a day-of-the-week? It’s probably a city, because it is positioned much closer to other existing cities we already know about.


This is the kind of reasoning that machine learning tries to perform. Our example was very simple, but this problem gets more difficult when dealing with thousands of dimensions and millions of noisy datapoints.

In a traditional machine learning context these vectors are given as input to the classifier, both at training and testing time. However, there exist methods of representation learning where these vectors are learned automatically, together with the model.

Now that we know about vector spaces, it’s time to look at how the neuron works.


  1. McCulloch, Warren S., and Walter Pitts. “A logical calculus of the ideas immanent in nervous activity.” The Bulletin of Mathematical Biophysics 5.4 (1943): 115-133.
  2. Minsky, Marvin, and Papert Seymour. “Perceptrons.” (1969).
  3. Rosenblatt, Frank. “The perceptron: a probabilistic model for information storage and organization in the brain.” Psychological review 65.6 (1958): 386.