Category: Uncategorized

Attending to characters in neural sequence labeling models

Word embeddings are great. They allow us to represent words as distributed vectors, such that semantically and functionally similar words have similar representations. Having similar vectors means these words also behave similarly in the model, which is what we want for good generalisation properties.

However, word embeddings have a couple of weaknesses:

  1. If a word doesn’t exist in the training data, we can’t have an embedding for it. Therefore, the best we can do is clump all unseen words together under a single OOV (out-of-vocabulary) token.
  2. If a word only occurs a couple of times, the word embedding likely has very poor quality. We simply don’t have enough information to learn how these words behave in different contexts.
  3. We can’t properly take advantage of character-level patterns. For example, there is no way to learn that all words ending with -ing are likely to be verbs. The best we can do is learn this for each word separately, but that doesn’t help when faced with new or rare words.

In this post I will look at different ways of extending word embeddings with character-level information, in the context of neural sequence labeling models.  You can find more information in the Coling 2016 paper “Attending to characters in neural sequence labeling models“.

Sequence labeling

We’ll investigate word representations in order to improve on the task of sequence labeling. In a sequence labeling setting, a system gets a series of tokens as input and it needs to assign a label to every token. The correct label typically depends on both the context and the token itself. Quite a large number of NLP tasks can be formulated as sequence labeling, for example:

DT  NN    VBD      NNS    IN      DT   DT  NN     CC  DT  NN   .
The pound extended losses against both the dollar and the euro .

Error detection
+ +    +  x       +   +      +   +    +    x      +
I like to playing the guitar and sing very louder .

Named entity recognition
PER _      _   _      _  ORG  ORG   _  TIME _
Jim bought 300 shares of Acme Corp. in 2006 .

Service on   the  line is   expected to   resume by   noon today .

In each of these cases, the model needs to understand how a word is being used in a specific context, and could also take advantage of character-level patterns and morphology.

Basic neural sequence labeling

Our baseline model for sequence labeling is as follows. Each word is represented as a 300-dimensional word embedding. This is passed through a bidirectional LSTM with hidden layers of size 200. The representations from both directions are concatenated, in order to get a word representation that is conditioned on the whole sentence. Next, we pass it through a 50-dimensional hidden layer and then an output layer, which can be a softmax or a CRF.

This configuration is based on a combination of my previous work on error detection (Rei and Yannakoudakis, 2016), and the models from Irsoy and Cardie (2014) and Lample et al. (2016).

Concatenating character-based word representations

Now let’s look at an architecture that builds word representations from individual characters. We process each word separately and map characters to character embeddings. Next, these are passed through a bidirectional LSTM and the last states from either direction are concatenated. The resulting vector is passed through another feedforward layer, in order to map it to a suitable space and change the vector size as needed. We then have a word representation m, built from individual characters.


We still have a normal word embedding x for each word, and in order to get the best of both worlds, we can combine these two representations. Following Lample et al. (2016), one method is simply concatenating the character-based representation with the word embedding.

\widetilde{x} = [x; m]

The resulting vector can then be used in the word-level sequence labeling model, instead of the regular word embedding. The whole network is connected together, so that the character-based component is also optimised during training.

Attending to character-based representations

Concatenating the two representations works, but we can do even better. We start off the same – character embeddings are passed through a bidirectional LSTM to build a word representation m. However, instead of concatenating this vector with the word embedding, we now combine them using dynamically predicted weights.


A vector of weights z is predicted by the model, as a function of x and m. In this case, we use a two-layer feedforward component, with tanh activation in the first layer and sigmoid on the second layer.

z = \sigma(W^{(3)}_z tanh(W^{(1)}_{z} x + W^{(2)}_{z} m))

Then, we combine x and m as a weighted sum, using z as the weights:

\widetilde{x} = z\cdot x + (1-z) \cdot m

This operation essentially looks at both word representations and decides, for each feature, whether it wants to take the value from the word embedding or from the character-based representation. Values close to 1 in z indicate higher weight for the word embedding, and values close to 0 assign more importance to the character-based vector.

This combination requires that the two vectors are aligned – each feature position in the character-based representation needs to capture the same properties as that position in the word embedding. In order to encourage this property, we actively optimise for these vectors to be similar, by maximising their cosine similarity:

\widetilde{E} = E + \sum_{t=1}^{T} g_t (1 – cos(m^{(t)}, x_t)) \hspace{3em}
g_t =
0, & \text{if}\ w_t = OOV \\
1, & \text{otherwise}

E is the main sequence labeling loss function that we minimise during training, T is the length of the sequence or sentence. Many OOV words share the same representation, and we do not want to optimise for this, therefore we use a variable g that limits optimisation only to non-OOV words.

In this setting, the model essentially learns two alternative representations for each word – a regular word embedding and a character-based representation. The word embedding itself is kind of a universal memory – we assign 300 elements to a word, and the model is free to save any information into it, including approximations of character-level information. For frequent words, there is really no reason to think that the character-based representation can offer much additional benefit. However, using word embeddings to save information is very inefficient – each feature needs to be learned and saved for every word separately. Therefore, we hope to get two benefits from including characters into the model:

  1. Previously unseen (OOV) words and infrequent words with low-quality embeddings can get extra information from character features and morphemes.
  2. The character-based component can act as a highly-generalised model of typical character-level patterns, allowing the word embeddings to act as a memory for storing exceptions to these patterns for each specific word.

While we optimise for the cosine similarity of m and x to be high, we are essentially teaching the model to predict distributional properties based only on character-level patterns and morphology. However, while m is optimised to be similar to x, we implement it in such a way that x is not optimised to be similar to m (using disconnected_grad in Theano). Because word embeddings are more flexible, we want them to store exceptions as opposed to learning more general patterns.

The resulting combined word representation is again plugged directly into the sequence labeling model. All the components, including the attention component for dynamically calculating z, are optimised at the same time.


We evaluated the alternative architectures on 8 different datasets, covering 4 different tasks: NER, POS-tagging, error detection and chunking. See the paper for more detailed results, but here is a summary:

Dataset Task #labels Measure Word-based Char concat Char attn
CoNLL00 chunking 22 F1 91.23 92.35 92.67
CoNLL03 NER 8 F1 79.86 83.37 84.09
PTB-POS POS-tagging 48 accuracy 96.42 97.22 97.27
FCEPUBLIC error detection 2 F0.5 41.24 41.27 41.88
BC2GM NER 3 F1* 84.21 87.75 87.99
CHEMDNER NER 3 F1 79.74 83.56 84.53
JNLPBA NER 11 F1 70.75 72.24 72.70
GENIA-POS POS-tagging 42 accuracy 97.39 98.49 98.60

As can be seen, including a character-based component into the sequence labeling model helps on every task. In addition, using the attention-based architecture outperforms character concatenation on all benchmarks.

We also compared the number of parameters in each model. While both character-based models require more parameters compared to a basic architecture using only word embeddings, the attention-based architecture is actually more efficient compared to concatenation. If the vectors are simply concatenated, this increases the size of all the weight matrices in the proceeding LSTMs, whereas the attention framework combines them without increasing length.

Miyamoto and Cho (2016) have independently also proposed a similar architecture, with some differences: 1) They focus on the task of language modeling, 2) they predict a scalar weight for combining the representations as opposed to making the decision separately for each element, 3) they do not condition the weights on the character-based representation, and 4) they do not have the component that optimises character-based representations to be similar to the word embeddings.

Since this model is aimed at learning morphological patterns, you might ask why are we not giving actual morphemes as input to the model, instead of starting from individual characters. The reason is that the definition of an informative morpheme is likely to change between tasks and datasets, and this allows the model to learn exactly what it finds most useful. The model for POS tagging can learn to detect specific suffixes, and the model for NER can focus more on capitalisation patterns.


Combining word embeddings with character-based representations makes neural models more powerful and allows us to have better representations for infrequent or unseen words. One option is to concatenate the two representations, treating them as separate sets of useful features. Alternatively, we can optimise them to be similar and combine them using a gating mechanism, essentially allowing the model to choose whether it wants to take each feature from the word embedding of from the character-based representation. We evaluated on 8 different sequence labeling datasets and found that the latter option performed consistently better, even with a fewer number of parameters.

See the paper for more details:

I have made the code for running these experiments publicly available on github:

Also, the dataset for performing error detection as a sequence labeling task is now available online:

NLP and ML Publications – Looking Back at 2016

After my last post on analysing publication patterns I received quite a lot of feedback and many feature requests, so I decided to create an update once 2016 is over. It is now quite a bit bigger than before, and includes 11 different conferences and journals: ACL, EACL, NAACL, EMNLP, COLING, CL, TACL, CoNLL, *Sem+SemEval, NIPS, and ICML.

The information used in these graphs was collected through crawling the web. ACL Anthology was very useful, listing papers in a consistent format. However, information such as the organisation names in each paper still needed to be extracted directly from the pdfs, which means there are likely to be some errors. I’ve tried to create exceptions to catch different spelling variations and other anomalies, but if you notice mistakes in the graphs, do let me know.

This analysis shouldn’t be taken too seriously – after all, quality of research matters much more than quantity, and that is considerably more difficult to measure. However, my motivation is to provide a high-level overview of what is happening in the field, where the big players are publishing, and perhaps supply a bit of inspiration and motivation for the new year.

Let’s start by looking at publications from 2016 for the 25 most active organisations:

Carnegie Mellon managed to beat Google by just 1 paper. Microsoft and Stanford also managed to publish more than 80 papers in 2016. IBM, Cambridge, Washington and MIT all reached the 50 publication barrier. Google, Stanford, MIT and Princeton are distinctly focused on the ML aspect, publishing mostly in NIPS and ICML. In fact, Google papers counted for nearly 10% of all NIPS papers. IBM, Peking, Edinburgh and Darmstadt however are distinctly focused on the NLP applications.

Next, let’s look at individual authors:

Chris Dyer continued his impressive publication record, by managing 24 papers in 2016. I’m curious as to why Chris doesn’t publish in NIPS or ICML, but he did have a paper in every single NLP conference (barring EACL, which didn’t take place in 2016). Following him are Yue Zhang (18), Hinrich Schütze (15), Timothy Baldwin (14), and Trevor Cohn (14). Ting Liu from the Harbin Institute of Technology stands out with 10 papers in COLING. Anders Søgaard and Yang Liu both managed to have 6 papers in ACL.

Now let’s look at the most prolific first authors from 2016:

Three researchers managed to publish 6 first-author papers: Ellie Pavlick (University of Pennsylvania), Gustavo Paetzold (University of Sheffield), Zeyuan Allen-Zhu (Princeton University and Institute for Advanced Study). Alan Akbik (IBM) published 5 papers, and 7 more researchers had 4 publications.
In addition, there were 42 people with 3 first-author papers, and 231 with 2 first-author papers.

Let’s also look at some time series. First, the total number of papers published at different conferences:

NIPS has been a very big conference for several years, but this year it seems to have exploded. Also, COLING was bigger than expected this year, even beating out ACL. This was the first year since 2012 when NAACL and COLING coincided.

Next, the number of papers per organisation in each year:

CMU is leading the race, having overtaken Microsoft in 2015. However, Google has almost caught up as well, after accelerating at a breakneck speed. Stanford also has a very respectable track record, followed by IBM and Cambridge.

Finally, let’s look at individual authors:

Chris Dyer gets a very distinct upward line on this graph. Other researchers who have managed to keep increasing their output over the past 5 years are: Preslav Nakov, Alessandro Moschitti, Yoshua Bengio, and Anders Søgaard.

Finally, I also decided to do some topic modeling on the publications. First, I extracted all the plain text from the papers, tokenised and lowercased it, and removed stopwords. Next, I passed it through LDA in order to discover 10 latent topics. I then used t-SNE to visualise top authors and organisations in a 2-dimensional graph, based on their latent topic similarity. Finally, I manually labelled each of the clusters with one word from the highest-ranked terms found by LDA. Here is the visualisation for top 50 authors:


I created the same graph for organisations, but didn’t attempt to label them with single words, since major universities tend to publish in many different subfields. I leave it to you to interpret the clusters:


That’s all for now. If you have any corrections or suggestions, let me know.
Have a great 2017 and I hope to see you in this list next year!

Edit: Fixed a bug with the autor-year graph.
Edit: Fixed a bug with mapping different forms of university names

Analysing NLP publication patterns

Recently, I got curious about finding out how much different institutions publish in my area. Does Google publish more than Microsoft? Which university has the strongest publication record in NLP? And are there any interesting trends that can be seen in the recent years? Quantity does not necessarily equal quality, but the number of publications is still a reasonable indicator of general activity in the field, how big the research group is, and how outward-facing are the research projects.

My approach was to crawl papers from the 6 biggest conferences that are relevant to my research: ACL, EACL, NAACL, EMNLP, NIPS, ICML. The first 4 focus on NLP applications regardless of methods, and the latter 2 on machine learning algorithms regardless of tasks. The time window was restricted to 2012-2016, as I’m more interested in current publications.

Luckily, all these conferences have nice webpages listing all the papers published there. ACL Anthology contains records for ACL, EACL, NAACL and EMNLP, NIPS has a separate webpage for papers, and ICML proceedings are on the JMLR website (except for ICML12 which are on the conference website). I wrote python scripts that crawled all the papers from these conferences, extracting author names and organisations. While authors can be crawled directly from the websites, in order to find the organisation names I had to parse the pdfs into text and extract anything that looked like a university or company name in the first 30 lines of on the paper. I wrote a bunch of manual patterns to map names to canonical versions (“UCL” to “University College London” and “Google Inc” to “Google”), although it is likely that I still missed some edge cases.

Below is the graph of top 25 organisations and the conferences where they publish.

CMU comes out as the most prolific publisher with 305 papers. A close second is Microsoft with 302 publications, also leading in the industry category. I was somewhat surprised to find that Microsoft publishes so much, almost twice as many papers compared to Google, especially as Google seems to get much more publicity with their research. Stanford is also among the top 3 organisations that publish substantially more than others. Edinburgh and Cambridge represent the UK camp with 121 and 117 papers respectively.

When we look at the distribution of conferences, Princeton and UCL stand out as having very little NLP-specific research, with nearly all of their papers in ICML and NIPS. Stanford, Berkeley and MIT also seem to focus more on machine learning algorithms. In contrast, Edinburgh, Johns Hopkins and University of Maryland have most of their publications on NLP-related conferences. CMU, Microsoft and Columbia are the most balanced among the top publishers, with roughly 50:50 division between NLP and ML.

We can also plot the number of publications per year, focusing on the top 15 institutions.

Carnegie Mellon has a very good track record, but has only just recently overtaken Microsoft as the top publisher. Google, MIT, Berkeley, Cambridge and Princeton have also stepped up their publishing game, showing upward trends in the recent years. The sudden drop for 2016 is due to incomplete data – at the time of writing, ACL, EMNLP and NIPS papers for this year are not available yet.

Now let’s look at the same graphs but for individual authors.

Chris Dyer comes out on top with 50 papers. This result is even more impressive given that he started with just 2 papers in 2012, then rocketing to the top by quite a margin in 2015. Almost all of his papers are in NLP conferences, with only 1 paper each for NIPS and ICML. Noah Smith, Chris Manning and Dan Klein rank 2nd-4th, with more stable publishing records, but also focusing mainly on NLP conferences. In contrast, Zoubin Ghahramani, Yoshua Bengio and Lawrence Carin are focused mostly on machine learning algorithms.

There seems to be a clear separation between the two research communities, with researchers specialising to publishing either in NLP or ML. This seems somewhat unexpected, especially considering the widespread trend of publishing novel neural network architectures for NLP tasks. Both fields would probably benefit from slightly tighter integration in the future.

I hope this little analysis was interesting to fellow researchers. I’m happy to post an update some time in the future, to see how things have changed. In the meantime, let me know if you find any bugs in the statistics.

Update: As requested, I’ve also added the statistics for first authors with highest publication counts. Jiwei Li from Stanford towers above others with 14 publications. William Yang Wang (CMU), Young-Bum Kim (Microsoft), Manaal Faruqui (CMU), Elad Hazan (Princeton), and Eunho Yang (IBM) have all managed an impressive 9 first-author publications.

Update 2: Added a fix for Jordan Boyd-Graber who publishes under Jordan L. Boyd-Graber in NIPS.

Update 3: Added a fix for Hal Daumé III, mapping together different spellings.

Update 4: By showing top N authors on the graphs, some authors with equal numbers of publications were being excluded. I’ve adjusted the value N for each graph so this doesn’t happen.

Update 5: Added a fix for Pradeep K. Ravikumar who also publishes under Pradeep Ravikumar.

Update 6: Added fixes to capture name variations for INRIA.

Theano Tutorial

This is an introductory tutorial on using Theano, the Python library. I’m going to start from scratch and assume no previous knowledge of Theano. However, understanding how neural networks work will be useful when getting to the code examples towards the end.

The plan for the tutorial is as follows:

  1. Give a basic introduction to Theano and explain the important concepts.
  2. Go over the main operations that we have available in Theano.
  3. Look at working code examples.

I recently gave this tutorial as a talk in University of Cambridge and it turned out to be way more popular than expected. In order to give more people access to the material, I’m now writing it up as a blog post.

I do not claim to know everything about Theano, and I constantly learn new things myself. If you find any errors or have suggestions on how to improve this tutorial, do let me know.

The code examples can be found in the Github repository:

1. What is Theano?


Theano is a Python library for efficiently handling mathematical expressions involving multi-dimensional arrays (also known as tensors). It is a common choice for implementing neural network models. Theano has been developed in University of Montreal, in a group led by Yoshua Bengio, since 2008.

Some of the features include:

  • automatic differentiation – you only have to implement the forward (prediction) part of the model, and Theano will automatically figure out how to calculate the gradients at various points, allowing you to perform gradient descent for model training.
  • transparent use of a GPU – you can write the same code and run it either on CPU or GPU. More specifically, Theano will figure out which parts of the computation should be moved to the GPU.
  • speed and stability optimisations – Theano will internally reorganise and optimise your computations, in order to make them run faster and be more numerically stable. It will also try to compile some operations into C code, in order to speed up the computation.

Technically, Theano isn’t actually a machine learning library, as it doesn’t provide you with pre-built models that you can train on your dataset. Instead, it is a mathematical library that provides you with tools to build your own machine learning models. But if you are looking for machine learning toolkits, there are several good ones implemented on top of Theano:

2. Python refresher

Theano is a Python library, so let’s go over some important points in Python.

  • Python is an interpreted language, which makes it more platform independent but generally slower than C, for example.
  • Python uses dynamic typing. While each variable does have a specific type during execution, these are not explicitly stated in the code.
  • Python uses indentation for block delimiting. So where C or Java would use curly brackets to separate a block, Python uses whitespace. Here we define a function f to take parameter x and return 2*x:
    def f(x):
        return 2*x
  • We define a list in Python with square brackets:
    a = [1,2,3,4,5]
    a[1] == 2
  • We define a dictionary (key-value mapping) with curly brackets:
    b = {'key1': 1, 'key2':2}
    b['key2'] == 2
  • List comprehension is a neat shorthand in Python for constructing lists. Here we loop for 5 steps (values 0-4), and each time add i+1 to the list:
    c = [i+1 for i in range(5)]
    c[1] == 2

3. Using Theano

In order to use Theano, you will need to install the dependencies and install Theano itself. If you’re using Ubuntu (tested for 14.04), you might get away with just running these two commands:

sudo apt-get install python-numpy python-scipy python-dev python-pip python-nose g++ libopenblas-dev git
sudo pip install Theano

If that doesn’t work for you, take a look at the original Theano homepage, which contains instructions for various platforms:

To use Theano in your Python script, include it using:

import theano

4. Minimal Working Example

Here is the smallest example I could come up with, which uses Theano and actually does something:

import theano
import numpy

x = theano.tensor.fvector('x')
W = theano.shared(numpy.asarray([0.2, 0.7]), 'W')
y = (x * W).sum()

f = theano.function([x], y)

output = f([1.0, 1.0])
print output

So what’s happening here?

We first define a Theano variable x to be a vector of 32-bit floats, and give it name ‘x’:

x = theano.tensor.fvector('x')

Next, we create a Theano variable W, assign its value to be vector [0.2, 0.7], and name it ‘W’:

W = theano.shared(numpy.asarray([0.2, 0.7]), 'W')

We define y to be the sum of all elements in the element-wise multiplication of x and W:

y = (x * W).sum()

We define a Theano function f, which takes as input x and outputs y:

f = theano.function([x], y)

Then call this function, giving as the argument vector [1.0, 1.0], essentially setting the value of variable x:

output = f([1.0, 1.0])

The script prints out the summed product of [0.2, 0.7] and [1.0, 1.0], which is:

0.2*1.0 + 0.7*1.0 = 0.9

Don’t worry if the code doesn’t fully make sense. We’ll go over the important parts in more detail.

5. Symbolic graphs in Theano (!)

I’d say this section contains the most crucial part to understanding Theano.

When we are creating a model with Theano, we first define a symbolic graph of all variables and operations that need to be performed. And then we can apply this graph on specific inputs to get outputs.

For example, what do you think happens when this line of Theano code is executed in our script?

y = (x * W).sum()

The system takes x and W, multiplies them together and sums the values. Right?


Instead, we create a Theano object y that knows its values can be calculated as the dot-product of x and W. But the required mathematical operations are not performed here. In fact, when this line was executed in our example code above, x didn’t even have a value yet.

By chaining up various operations, we are creating a graph of all the variables and functions that need to be used to reach the output values. This symbolic graph is also the reason why we can only use Theano-specific operations when defining our models. If we tried to integrate functions from some random Python library into our network, they would attempt to perform the calculations immediately, instead of returning a Theano variable as needed. Exceptions do exist – Theano overrides some basic Python operators to act as expected, and NumPy is quite well integrated with Theano.

6. Variables

We can define variables which don’t have any values yet. Normally, these would be used for inputs to our network.

The variables have to be of a specific type though. For example, here we define variable x to be a vector of 32-bit floats, and give it name ‘x’:

x = theano.tensor.fvector('x')

The names are generally useful for debugging and informative error messages. Theano won’t have access to your Python variable names, so you have to assign explicit Theano names for each variable if you want them to be referred to as something more useful than just “a tensor”.

There are a number of different variable types available, just have a look at the list here. Some of the more popular ones include:

Constructor  dtype  ndim
fvector float32 1
ivector int32 1
fscalar float32 0
fmatrix float32 2
ftensor3 float32 3
dtensor3 float64 3

You can also define a generic vector (or tensor) and set the type with an argument:

x = theano.tensor.vector('x', dtype=float32)

If you don’t set the dtype, you will create vectors of type config.floatX. This will become relevant in the section about GPUs.

7. Shared variables

We can also define shared variables, which are shared between different functions and different function calls. Normally, these would be used for weights in our neural network. Theano will automatically try to move shared variables to the GPU, provided one is available, in order to speed up computation.

Here we define a shared variable and set its value to [0.2, 0.7].

W = theano.shared(numpy.asarray([0.2, 0.7]), 'W')

The values in shared variables can be accessed and modified outside of our Theano functions using these commands:

W.set_value([0.1, 0.9])

8. Functions

Theano functions are basically hooks for interacting with the symbolic graph. Commonly, we use them for passing input into our network and collecting the resulting output.

Here we define a Theano function f that takes x as input and returns y as output:

f = theano.function([x], y)

The first parameter is the list of input variables, and the second parameter is the list of output variables. Although if there’s only one output variable (like now) we don’t need to make it into a list.

When we construct a function, Theano takes over and performs some of its own magic. It builds the computational graph and optimises it as much as possible. It restructures mathematical operations to make them faster and more stable, compiles some parts to C, moves some tensors to the GPU, etc.

Theano compilation can be controlled by setting the value of mode in the environement variable THEANO_FLAGS:

  • FAST_COMPILE – Fast to compile, slow to run. Python implementations only, minimal graph optimisation.
  • FAST_RUN – Slow to compile, fast to run. C implementations where available, full range of optimisations

9. Minimal Training Example

Here’s a minimal script for actually training something in Theano. We will be training the weights in W using gradient descent, so that the result from the model would be 20 instead of the original 0.9.

import theano
import numpy

x = theano.tensor.fvector('x')
target = theano.tensor.fscalar('target')

W = theano.shared(numpy.asarray([0.2, 0.7]), 'W')
y = (x * W).sum()

cost = theano.tensor.sqr(target - y)
gradients = theano.tensor.grad(cost, [W])
W_updated = W - (0.1 * gradients[0])
updates = [(W, W_updated)]

f = theano.function([x, target], y, updates=updates)

for i in xrange(10):
    output = f([1.0, 1.0], 20.0)
    print output

We create a second input variable called target, which will act as the target value we use for training:

target = theano.tensor.fscalar('target')

In order to train the model, we need a cost function. Here we use a simple squared distance from the target:

cost = theano.tensor.sqr(target - y)

Next, we want to calculate the partial gradients for the parameters that will be updated, with respect to the cost function. Luckily, Theano will do that for us. We simply call the grad function, pass in the real-valued cost and a list of all the variables we want gradients for, and it will return a list of those gradients:

gradients = theano.tensor.grad(cost, [W])

Now let’s define a symbolic variable for what the updated version of the parameters will look like. Using gradient descent, the update rule is to subtract the gradient, multiplied by the learning rate:

W_updated = W - (0.1 * gradients[0])

And next we create a list of updates. More specifically, a list of tuples where the first element is the variable we want to update, and the second element is a variable containing the values that we want the first variable to contain after the update. This is just a syntax that Theano requires.

updates = [(W, W_updated)]

Have to define a Theano function again, with a couple of changes:

f = theano.function([x, target], y, updates=updates)

It now takes two input arguments – one for the input vector, and another for the target value used for training. And the list of updates also gets attached to the function as well. Every time this function is called, we pass in values for x and target, get back the value for y as output, and Theano performs all the updates in the update list.

In order to train the parameters, we repeatedly call this function (10 times in this case). Normally, we’d pass in different examples from our training data, but for this example we use the same x=[1.0, 1.0] and target=20 each time:

for i in xrange(10):
    output = f([1.0, 1.0], 20.0)
    print output

When the script is executed, the output looks like this:


The first time the function is called, the output value is still 0.9 (like in the previous example), because the updates have not been applied yet. But with each consecutive step, the output value becomes closer and closer to the desired target 20.

10. Useful operations

This covers the basic logic behind building models with Theano. The example was very simple, but we are free to define increasingly complicated networks, as long as we use Theano-specific functions. Now let’s look at some of these building blocks that we have available.

Evaluate the value of a Theano variable

The eval() function forces the Theano variable to calculate and return its actual (numerical) value. If we try to just print the variable a, we only print its name. But if we use eval(), we get the actual square matrix that it is initialised to.

> a = theano.shared(numpy.asarray([[1.0,2.0],[3.0,4.0]]), 'a')
> a
> a.eval()
array([[1., 2.],
       [3., 4.]])

This eval() function isn’t really used for building models, but it can be useful for debugging and learning how Theano works. In the examples below, I will be using the matrix a and the eval() function to print the value of each variable and demonstrate how different operations work.

Basic element-wise operations: + – * /

c = ((a + a) / 4.0)

array([[ 0.5, 1. ],
       [ 1.5, 2. ]])

Dot product

c =, a)

array([[ 7., 10.],
       [15., 22.]])

Activation functions

c = theano.tensor.nnet.sigmoid(a)
c = theano.tensor.tanh(a)

array([[ 0.76159416,  0.96402758],
       [ 0.99505475,  0.9993293 ]])

Softmax (row-wise)

c = theano.tensor.nnet.softmax(a)

array([[ 0.26894142,  0.73105858],
       [ 0.26894142,  0.73105858]])


c = a.sum()
c = a.sum(axis=1)

array([ 3.,  7.])


c = a.max()
c = a.max(axis=1)

array([ 2.,  4.])


c = theano.tensor.argmax(a)
c = theano.tensor.argmax(a, axis=1)

array([1, 1])


We sometimes need to change the dimensions of a tensor and reshape() allows us to do that. It takes as input a tuple containing the new shape and returns a new tensor with that shape. In the first example below, we shape a square matrix into a 1×4 matrix. In the second example, we use -1 which means “as big as the dimension needs to be”.

a = theano.shared(numpy.asarray([[1,2],[3,4]]), 'a')
c = a.reshape((1,4))
array([[1, 2, 3, 4]])

c = a.reshape((-1,))
array([1, 2, 3, 4])

Zeros-like, ones-like

These functions create new tensors with the same shape but all values set to zero or one.

c = theano.tensor.zeros_like(a)
array([[0, 0],
       [0, 0]])

Reorder the tensor dimensions

Sometimes we need to reorder the dimensions in a tensor. In the examples below, the dimensions in a two-dimensional matrix are first swapped. Then, ‘x’ is used to create a brand new dimension.

array([[1, 2],
       [3, 4]])

c = a.dimshuffle((1,0))
array([[1, 3],
       [2, 4]])

c = a.dimshuffle(('x',0,1))
array([[[1, 2],
        [3, 4]]])


Using Python indexing tricks can make life so much easier. In the example below, we make a separate list b containing line numbers, and use it to construct a new matrix which contains exactly the lines we want from the original matrix. This can be useful when dealing with word embeddings – we can put word ids into a list and use this to retrieve exactly the correct sequence of embeddings from the whole embedding matrix.

a = theano.shared(numpy.asarray([[1.0,2.0],[3.0,4.0]]), 'a')
array([[1., 2.],
       [3., 4.]])

b = [1,1,0]
c = a[b]
array([[ 3.,  4.],
       [ 3.,  4.],
       [ 1.,  2.]])

For assignment, we can’t do this:

a[0] = [0.0, 0.0]

But instead, we can use set_subtensor(), which takes as arguments the selection of the original matrix that we want to reassign, and the value we want to assign it to. It returns a new tensor that has the corresponding values modified.

c = theano.tensor.set_subtensor(a[0],[0.0, 0.0])
array([[ 0.,  0.],
       [ 3.,  4.]])

11. Classifier Code Example

At this point, it’s time to move on to some more realistic examples.

Take a look at the code for a very basic classifier, which tries to train a small network on a tiny (but real) dataset. I won’t walk you through it line-by-line any more; you’ve learned all the necessary parts by now and there are comments in the code as well.

The task is to predict whether the GDP per capita for a country is more than the average GDP, based on the following features:

  • Population density (per suqare km)
  • Population growth rate (%)
  • Urban population (%)
  • Life expectancy at birth (years)
  • Fertility rate (births per woman)
  • Infant mortality (deaths per 1000 births)
  • Enrolment in tertiary education (%)
  • Unemployment (%)
  • Estimated control of corruption (score)
  • Estimated government effectiveness (score)
  • Internet users (per 100 people)

The data/ directory contains the files for training (121 countries) and testing (40 countries). Each row represents one country, the first column is the label, followed by the features. The feature values have been normalised, by subtracting the mean and dividing by the standard deviation. The label is 1 if the GDP is more than average, and 0 otherwise.

Once you clone the github repository (or just download the data files), you can run the script with:

python data/countries-classify-gdp-normalised.train.txt data/countries-classify-gdp-normalised.test.txt

The script will print information about 10 training epochs and the result on the test set:

Epoch: 0, Training_cost: 28.4304042768, Training_accuracy: 0.578512396694
Epoch: 1, Training_cost: 24.5186290354, Training_accuracy: 0.619834710744
Epoch: 2, Training_cost: 22.1283727037, Training_accuracy: 0.619834710744
Epoch: 3, Training_cost: 20.7941253329, Training_accuracy: 0.619834710744
Epoch: 4, Training_cost: 19.9641569475, Training_accuracy: 0.619834710744
Epoch: 5, Training_cost: 19.3749411377, Training_accuracy: 0.619834710744
Epoch: 6, Training_cost: 18.8899216914, Training_accuracy: 0.619834710744
Epoch: 7, Training_cost: 18.4006371608, Training_accuracy: 0.677685950413
Epoch: 8, Training_cost: 17.7210185975, Training_accuracy: 0.793388429752
Epoch: 9, Training_cost: 16.315597037, Training_accuracy: 0.876033057851
Test_cost: 5.01800578051, Test_accuracy: 0.925

12. Recurrent functions with scan

One more important operation to cover is scan, which can be used to create various recurrent functions: RNN, GRU, LSTM, etc.

Here is sample code for using scan to define a simple RNN over word vectors in the input_vectors matrix:

def rnn_step(x, previous_hidden_vector, W_input, W_recurrent):
    hidden_vector =, W_input) + 
          , W_recurrent)
    hidden_vector = theano.tensor.nnet.sigmoid(hidden_vector)

W_input = self.create_parameter_matrix('W_input', (word_embedding_size, recurrent_size))
W_recurrent = self.create_parameter_matrix('W_recurrent', (recurrent_size, recurrent_size))
initial_hidden_vector = theano.tensor.alloc(numpy.array(0, dtype=floatX), recurrent_size)

hidden_vector, _ = theano.scan(
    sequences = input_vectors,
    outputs_info = initial_hidden_vector,
    non_sequences = [W_input, W_recurrent]

hidden_vector = hidden_vector[-1]

The scan function is called on line 10 and it takes 4 important arguments:

  • fn: The function that is called at every step of the iteration.
  • sequences: The variables that we want to iterate over. If this is a matrix, we will be iterating over each row of that matrix.
  • outputs_info: The values that we use as the previous recurrent values for the very first step. Usually these are just set to 0.
  • non_sequences: Any additional variables that we want to pass into the function (fn) but don’t want to iterate over.

We’ve defined the helper function rnn_step on line 1, which gets called on each row of our input matrix. The scan function will be calling this rnn_step function internally, so we need to accept any arguments in the same order as Theano passes them. This is just something you need to know when dealing with scan. The order is as follows:

  1. First, the current items from the variables that we are iterating over. If we are iterating over a matrix, the current row is passed to the function.
  2. Next, anything that was output from the function at the previous time step. This is what we use to build recursive and recurrent representations. At the very first time step, the values will be those from outputs_info instead.
  3. Finally, anything we specified in non_sequences.

What comes out from the scan function contains the hidden states (eg the rnn_step outputs) at each step. Not just the last step, but all of them. So if you only want the last step, you need to explicitly retrieve it by indexing from -1 (the last element). Theano is actually smart enough to figure out that you’re only using the last result, and will optimise to discard all the intermediate ones.

In order to construct the weight matrices, I’m using a helper function (self.create_parameter_matrix, definition not shown here) which takes as input the variable name and the shape. This means I don’t need to define the weight initialisation part again each time.

13. RNN Classifier Code Example

Time to look at some more code, this time using recurrent functions and scan. The script is available at the Github repository. In this example, I’m using Gated Recurrent Units (GRU) from “Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation” (Cho et al, 2014), which are essentially a simpler versions of LSTMs.

The task is to classify sentences into 5 classes, based on their fine-grained sentiment (very negative, slightly negative, neutral, slightly positive, very positive). We use the dataset published in “Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank” (Socher et al., 2013).

Start by downloading the dataset from (the main zip file) and unpack it somewhere. Then, create training and test splits in the format that is more suitable for us, using the provided script in the repository:

python 1 full /path/to/sentiment/dataset/ > data/sentiment.train.txt
python 2 full /path/to/sentiment/dataset/ > data/sentiment.test.txt

Now we can run the classifier with:

python data/sentiment.train.txt data/sentiment.test.txt

The script will train for 3 passes over the training data, and will then print performance on the test data.

Epoch: 0 Cost: 25937.7372292 Accuracy: 0.285814606742
Epoch: 1 Cost: 21656.820174 Accuracy: 0.350655430712
Epoch: 2 Cost: 18020.619533 Accuracy: 0.429073033708
Test_cost: 4784.25137484 Test_accuracy: 0.388235294118

The accuracy on the test set is about 38%, which isn’t a great result. But it is quite a difficult task – the current state-of-the-art system (Tai ei al., 2015) achieves 50.9% accuracy, using a large amount of additional phrase-level annotations, and a much bigger network based on LSTMs and parse trees. As there are 5 classes to choose from, a random system would get 20% accuracy.

14. Running on a GPU

Theano is smart enough to move some parts of the processing to the GPU, as long as CUDA is installed and a graphics card is made available. To install CUDA, follow instructions on one of these links:

Then, when running your Python script, you need to point Theano to the CUDA installation. I do this by setting the environment variables in the command line:

LD_LIBRARY_PATH=/usr/lib:/usr/local/cuda-7.5/lib64 THEANO_FLAGS='cuda.root=/usr/local/cuda-7.5,device=gpu,floatX=float32' python

This command is for CUDA-7.5 in my system. You’ll need to make sure that the paths match the CUDA installation paths in your machine. If it works and Theano is using a GPU, the first line that gets printed will explicitly say so. Something like this:

Using gpu device 0: GeForce GTX 780

If you don’t get something similar, it probably means Theano is not properly hooked up to use the GPU.

At the time of writing, Theano only supports 32-bit variables on the GPU, and this is where the floatX=float32 setting comes it. It just allows you to set the data type during the execution of the script, without writing it into your code. For example, you can define your vectors like this:

x = theano.tensor.vector('x', dtype=config.floatX)

And now you can set floatX to be float32 when running the script on a GPU and float64 when running on your CPU.

Finally, if your machine has multiple GPUs, you can control which one is used for the script by setting device=gpu0, device=gpu1, etc. Based on personal experience, running multiple Theano jobs on the same GPU does not give any advantage, so it’s best to send them to different ones when possible.

15. Drawing the computation graph

Theano provides a command for printing a variable or a function, along with all the required computation, as an image:

f = theano.function([x], y)
theano.printing.pydotprint(f, outfile="f.png", var_with_name_simple=True)

When dealing with very simple models, this can give a nice graphical representation. For example, here is a model from our minimal working example:

Printed Theano function. Figure for the Theano tutorial.

However, when the models get more and more complicated, the images also tend to get less informative:

Printed graph of a much larger function. Figure for the Theano tutorial.


16. Profiling

Finally, Theano also provides a useful tool for analysing bottlenecks in your code. Just set profile=True in THEANO_FLAGS, and it will print information about how much time is spent on different operations in your code.

THEANO_FLAGS='profile=True' python

Example of profiling output from Theano. Figure for the Theano tutorial.

17. References

This concludes the Theano tutorial. If you haven’t yet had enough, take a look at the following links that I used for inspiration:
Official Theano homepage and documentation
Official Theano tutorial
A Simple Tutorial on Theano by Jiang Guo
Code samples for learning Theano by Alec Radford


Online Representation Learning in Recurrent Neural Language Models

In a basic neural language model, we optimise a fixed set of parameters based on a training corpus, and predictions on an unseen test set are a direct function of these parameters. What if instead of a static model we constantly measured the types of errors the model is making and adjust the parameters accordingly? It would potentially be more closer to how humans operate, constantly making small adjustments in their decisions based on feedback.

The necessary information is already available – language models use the previous word in the sequence as context, which means they know the correct answer for the previous time step (or at least need to assume they know). We can use this to calculate error derivatives at each time step and update parameters even during testing. This sounds like it would require loads of extra computation at test time, but by updating only a small part of the model we can actually get better results with faster execution and fewer parameter.

This post is a summary of my EMNLP 2015 paper “Online Representation Learning in Recurrent Neural Language Models“.


First a short description of the RNN language model that I use as a baseline. It follows the implementation by Mikolov et al. (2011) in the RNNLM Toolkit.


The previous word goes into the network as a 1-hot vector which is then multiplied with a weight matrix, giving us a corresponding word embedding. This, together with the previous hidden state, act as input to the current hidden state of the network:

\(hidden_t = \sigma(E \cdot input_t + W_h \cdot hidden_{t-1})\)

The hidden state is connected to the output layer, which predicts the next word in the sequence. In order to avoid performing a softmax operation over the whole vocabulary, all words are divided between classes and the probability of the next word is factored into the probability of the class and the probability of the next word given the class:

\(P(w_{t+1} | w_{1}^{t}) \approx classes_c \cdot output_{w_{t+1}}\)

\(classes = softmax(W_c \cdot hidden_t)\)

\(output = softmax(W_o^{(c)} \cdot hidden_t)\)

The words are divided into classes by frequency-based bucketing (following Mikolov et al., 2011), and the learning rate is divided by 2 if the improvement is not sufficient. The RNNLM Toolkit treats the training data as a continuous stream of tokens and performs backpropagation through time for a fixed number of steps – the text is essentially split into fixed-sized chunks for optimisation. Instead, we perform sentence splitting and backpropagate errors from the end of each sentence to the beginning.

RNNLM with online learning

Let’s introduce a special vector into the model, which will represent the current unit of text being processed (a sentence, a paragraph, or a document). We can then update it after each prediction, based on the errors the model has made on that document.


The output probabilities over classes and words are then conditioned on this new document vector:

\(classes = softmax(W_c \cdot hidden_t + W_{dc} \cdot doc)\)
\(output = softmax(W_o^{(c)} \cdot hidden_t + W_{do}^{(c)} \cdot doc)\)

Notice that there is no input going into the document vector. Instead of constructing it iteratively, like the values in a hidden layer, we treat it as a vector of parameters and optimise them both during training and testing. After predicting each word, we calculate the error in the output layer, backpropagate this into the document vector, and adjust the values. While the main language model is a smoothed static representation of the training data, the document vector will contain information about how a specific sentence/document differs from this main language model.

The document vector is connected directly to the output layers of the RNNLM, in parallel to the hidden layer. This allows us to update the document vector after every step, instead of waiting until the end of the sentence to perform backpropagation through time.

Le and Mikolov (2014) used a related approach for learning vector representations of sentences and achieved good results on the sentiment detection task. They added a vector for a sentence into a feedforward language model, stepped through the sentence, and used the values at the last step as a representation of that sentence. While they connected the vector as part of the input layer, we have connected it directly to the output layer – in an RNNLM the input layer only gets updated at the end of the sentence (during backpropagation-through-time), whereas we want to update the document vector after each time step.


We constructed a dataset from English Wikipedia to evaluate language modelling performance of the two models. The text was tokenised, sentence split and lowercased. The sentences were shuffled, in order to minimise any transfer effects between consecutive sentences, and then split into training, development and test sets. The final sentences were sampled randomly, in order to obtain reasonable training times for the experiments. Dataset sizes are as follows:

 Train Dev Test
Words  9,990,782  237,037  4,208,847
Sentences  419,278  10,000  176,564

The regular RNNLM with a 100-dimensional hidden layer (M = 100) and no document vector (D=0) is the baseline. In the experiments we increase the capacity of the model using different methods and measure how that affects the perplexity on the datasets.

 Train PPL  Dev PPL  Test PPL
Baseline M=100  92.65  103.56  102.51
M=120  88.60  98.78  97.79
M=100, D=20  87.28   95.36  94.39
M=135  85.17  96.33  95.71
M=100, D=35  80.11  91.05  90.29

Increasing the hidden layer size M does improve the model performance and perplexity decreases from 102.51 to 95.71. However, adding the same number of neurons into the actively-updated document vector gives an even lower perplexity of 90.29.

 Experiments with semantic similarity

The resulting document vector can also be used for calculating semantic similarity between texts. We sampled random sentences from the development data, processed them with the language model, and used the resulting document vector to find 3 most similar sentences in the development set. Below are some examples.

Input: Both Hufnagel and Marston also joined the long-standing technical death metal band Gorguts.

  • The band eventually went on to become the post-hardcore band Adair.
  • The band members originally came from different death metal bands, bonding over a common interest in d-beat.
  • The proceeds went towards a home studio, which enabled him to concentrate on his solo output and songs that were to become his debut mini-album “Feeding The Wolves”.

Input: The Chiefs reclaimed the title on September 29, 2014 in a Monday Night Football game against the New England Patriots, hitting 142.2 decibels.

  • He played in twenty-four regular season games for the Colts, all off the bench.
  • In May 2009 the Warriors announced they had re-signed him until the end of the 2011 season.
  • The team played inconsistently throughout the campaign from the outset, losing the opening two matches before winning four consecutive games during September 1927.

Input: He was educated at Llandovery College and Jesus College, Oxford, where he obtained an M.A. degree.

  • He studied at the Orthodox High School, then at the Faculty of Mathematics.
  • Kaigama studied for the priesthood at St. Augustine’s Seminary in Jos with further study in theology in Rome.
  • Under his stewardship, Zahira College became one of the leading schools in the country.


There has been a lot of work on developing static models for machine learning – we train the model parameters on the training data and then apply them on the test data. However, there is a lot of potential for dynamical models, which take advantage of immediate feedback signals and are able to continuously adjust the model parameters. Our experiment showed that, at least for language modelling, such a model is indeed a viable option.

26 Things I Learned in the Deep Learning Summer School

In the beginning of August I got the chance to attend the Deep Learning Summer School in Montreal. It consisted of 10 days of talks from some of the most well-known neural network researchers. During this time I learned a lot, way more than I could ever fit into a blog post. Instead of trying to pass on 60 hours worth of neural network knowledge, I have made a list of small interesting nuggets of information that I was able to summarise in a paragraph.

At the moment of writing, the summer school website is still online, along with all the presentation slides. All of the information and most of the illustrations come from these slides and are the work of their original authors. The talks in the summer school were filmed as well, hopefully they will also find their way to the web.

Update: the Deep Learning Summer School videos are now online.

Alright, let’s get started.

1. The need for distributed representations

During his first talk, Yoshua Bengio said “This is my most important slide”. You can see that slide below:dlss-3aug2015

Let’s say you have a classifier that needs to detect people that are male/female, have glasses or don’t have glasses, and are tall/short. With non-distributed representations, you are dealing with 2*2*2=8 different classes of people. In order to train an accurate classifier, you need to have enough training data for each of these 8 classes. However, with distributed representations, each of these properties could be captured by a different dimension. This means that even if your classifier has never encountered tall men with glasses, it would be able to detect them, because it has learned to detect gender, glasses and height independently from all the other examples.

2. Local minima are not a problem in high dimensions

The team of Yoshua Bengio have experimentally found that when optimising the parameters of high-dimensional neural nets, there effectively are no local minima. Instead, there are saddle points which are local minima in some dimensions but not all. This means that training can slow down quite a lot in these points, until the network figures out how to escape, but as long as we’re willing to wait long enough then it will find a way.

Below is a graph demonstrating a network during training, oscillating between two states: approaching a saddle point and then escaping it.


Given one specific dimension, there is some small probability \(p\) with which a point is a local minimum, but not a global minimum, in that dimension. Now, the probability of a point in a 1000-dimensional space being an incorrect local minimum in all of these would be \(p^{1000}\), which is just astronomically small. However, the probability of it being a local minimum in some of these dimensions is actually quite high. And when we get these minima in many dimensions at once, then training can appear to be stuck until it finds the right direction.

In addition, this probability \(p\) will increase as the loss function gets closer to the global minimum.  This means that if we do ever end up at a genuine local minimum, then for all intents and purposes it will be close enough to the global minimum that it will not matter.

3. Derivatives derivatives derivatives

Leon Bottou had some useful tables with activation functions, loss functions, and their corresponding derivatives. I’ll keep these here for later.




Update: As pointed out by commenters, the min and max functions in the ramp formula should be switched.

4. Weight initialisation strategy

The current recommended strategy for initialising weights in a neural network is to sample values \(W_{i,j}^{(k)}\) uniformly from \([-b,b]\), where

\(b = \sqrt{\frac{6}{H_k + H_{k+1}}}\)

\(H_k\) and \(H_{k+1}\) are the sizes of hidden layers before and after the weight matrix.

Recommended by Hugo Larochelle, published by Glorot & Bengio (2010).

5. Neural net training tricks

A few practical suggestions from Hugo Larochelle:

  • Normalise real-valued data. Subtract the mean and divide by standard deviation.
  • Decrease the learning rate during training.
  • Can update using mini-batches – the gradient is more stable.
  • Can use momentum, to get through plateaus.

6. Gradient checking

If you implemented your backprop by hand and it’s not working, then there’s roughly 99% chance that the gradient calculation has a bug. Use gradient checking to identify the issue. The idea is to use the definition of a gradient: how much will the model error change, if we increase a specific weight by a small amount.

\(\frac{\partial f(x)}{\partial x} \approx \frac{f(x+\epsilon) – f(x-\epsilon)}{2\epsilon}\)

A more in-depth explanation is available here: Gradient checking and advanced optimization

7. Motion tracking

Human motion tracking can be done with impressive accuracy. Below are examples from the paper Dynamical Binary Latent Variable Models for 3D Human Pose Tracking by Graham Taylor et al. (2010). The method uses conditional restricted Boltzmann machines.


8. Syntax or no syntax? (aka, “is syntax a thing?”)

Chris Manning and Richard Socher have put a lot of effort into developing compositional models that combine neural embeddings with more traditional parsing approaches. This culminated with a Recursive Neural Tensor Network (Socher et al., 2013), which uses both additive and multiplicative interactions to combine word meanings along a parse tree.

And then, the model was beaten (by quite a margin) by the Paragraph Vector (Le & Mikolov, 2014), which knows absolutely nothing about the sentence structure or syntax. Chris Manning referred to this result as “a defeat for creating ‘good’ compositional vectors”.

However, more recent work using parse trees has again surpassed this result. Irsoy & Cardie (NIPS, 2014) managed to beat paragraph vectors by going “deep” with their networks in multiple dimensions. Finally, Tai et al. (ACL, 2015) have improved the results again by combining LSTMs with parse trees.

The accuracies of these models on the Stanford 5-class sentiment dataset are as follows:

Method Accuracy
RNTN (Socher et al. 2013) 45.7
Paragraph Vector (Le & Mikolov 2014)  48.7
DRNN (Irsoy & Cardie 2014) 49.8
Tree LSTM (Tai et al. 2015) 50.9

So it seems that, at the moment, models using the parse tree are beating simpler approaches. I’m curious to see if and when the next syntax-free approach will emerge that will advance this race. After all, the goal of many neural models is not to discard the underlying grammar, but to implicitly capture it in the same network.

9. Distributed vs distributional

Chris Manning himself cleared up the confusion between the two words.

Distributed: A concept is represented as continuous activation levels in a number of elements. Like a dense word embedding, as opposed to 1-hot vectors.

Distributional: Meaning is represented by contexts of use. Word2vec is distributional, but so are count-based word vectors, as we use the contexts of the word to model the meaning.

10. The state of dependency parsing

Comparison of dependency parsers on the Penn Treebank:

Parser  Unlabelled Accuracy Labelled Acccuracy  Speed (sent/s)
MaltParser 89.8 87.2 469
MSTParser 91.4 88.1 10
TurboParser 92.3 89.6 8
Stanford Neural Dependency Parser 92.0 89.7 654
Google  94.3 92.4 ?

The last result is from Google “pulling out all the stops”, by putting massive amounts of resources into training the Stanford neural parser.

11. Theano

Well, I knew a bit about Theano before, but I learned a whole lot more during the summer school. And it is pretty awesome.

Since Theano originates from Montreal, it was especially helpful to be able to ask questions directly from the people who are developing it.

Most of the information that was presented is available online, in the form of interactive python tutorials.

12. Nvidia Digits

Nvidia has a toolkit called Digits that trains and visualises complex neural network models without needing to write any code. And they’re selling DevBox – a machine customised for running Digits and other deep learning software (Theano, Caffe, etc). It comes with 4 Titan X GPUs and currently costs $15,000.

13. Fuel

Fuel is a toolkit that manages iteration over your datasets – it can split them into minibatches, manage shuffling, apply various preprocessing steps, etc. There are prebuilt functions for some established datasets, such as MNIST, CIFAR-10, and Google’s 1B Word corpus. It is mainly designed for use with Blocks, a toolkit that simplifies network construction with Theano.

14. Multimodal linguistic regularities

Remember “king – man + woman = queen”? Turns out that works with images as well (Kiros et al., 2015).


15. Taylor series approximation

When we are at point \(x_0\) and take a step to \(x\), then we can estimate the function value in the new location by knowing the derivatives, using the Taylor series approximation.

f(x) = f(x_0) + (x – x_0)f'(x) + \frac{1}{2} (x – x_0)^2 f”(x) + …

Similarly, we can estimate the loss of a function, when we update parameters \(\theta_0\) to \(\theta\).

J(\theta) =J(\theta_0) + (\theta – \theta_0)^T g + \frac{1}{2} (\theta – \theta_0)^T H(\theta – \theta_0) + …

where \(g\) contains the derivatives with respect to \(\theta\), and \(H\) is the Hessian with second order derivatives with respect to \(\theta\).

This is the second-order Taylor approximation, but we could increase the accuracy by adding even higher-order derivatives.

16. Computational intensity

Adam Coates presented a strategy for analysing the speed of matrix operations on a GPU. It’s a simplified model that says your time is spent on either reading/writing to memory or doing calculations. It assumes you can do both in parallel so we are interested in which one of them takes more time.

Let’s say we are multiplying a matrix with a vector:


If \(M=1024\) and \(N=512\), then the number of bytes we need to read and store is:

\( 4\text{ bytes }\times (1024 \times 512 + 512 + 1024) = 2.1e6\text{ bytes} \)

And the number of calculations we need to do is:

\(2\times 1024\times 512 = 1e6\text{ FLOPs}\)

If we have a GPU that can do 6 TFLOP/s and has memory bandwidth of  300GB/s, then the total running time will be:

\(\text{max}\{2.1e6\text{ bytes }/ (300e9\text{ bytes}/s), 1e6\text{ FLOPs} / (6e12\text{ FLOP}/s) \} \\
= \text{max}\{ 7\mu s, 0.16\mu s \} \)

This means the process is bounded by the \(7\mu s\) spent on copying to/from the memory, and getting a faster GPU would not make any difference. As you can probably guess, this situation gets better with bigger matrices/vectors, and when doing matrix-matrix operations.

Adam also described the idea of calculating the intensity of an operation:

Intensity = (# arithmetic ops) / (# bytes to load or store)

In the previous scenario, this would be

Intensity = (1E6 FLOPs) / (2.1E6 bytes) = 0.5 FLOPs/bytes

Low intensity means the system is bottlenecked on memory, and high intensity means it’s bottlenecked by the GPU speed. This can be visualised, in order to find which of the two needs to improve in order to speed up the whole system, and where the sweet spot lies.


17. Minibatches

Continuing from the intensity calculations, one way of increasing the intensity of your network (in order to be limited by computation instead of memory), is to process data in minibatches. This avoids some memory operations, and GPUs are great at processing large matrices in parallel.

However, increasing the batch size too much will probably start to hurting the training algorithm and converging can take longer. It’s important to find a good balance in order to get the best results in the least amount of time.


18. Training on adversarial examples

It was recently revealed that neural networks are easily tricked by adversarial examples. In the example below, the image on the left is correctly classified as a goldfish. However, if we apply the noise pattern shown in the middle, resulting in the image on the right, the classifier becomes convinced this is a picture of a daisy. The image is from Andrej Karpathy’s blog post “Breaking Linear Classifiers on ImageNet”, and you can read more about it there. fish

The noise pattern isn’t random though – the noise is carefully calculated, in order to trick the network. But the point remains: the image on the right is clearly still a goldfish and not a daisy.

Apparently strategies like ensemble models, voting after multiple saccades, and unsupervised pretraining have all failed against this vulnerability. Applying heavy regularisation helps, but not before ruining the accuracy on the clean data.

Ian Goodfellow presented the idea of training on these adversarial examples. They can be automatically generated and added to the training set. The results below show that in addition to helping with the adversarial cases, this also improves accuracy on the clean examples.goodfellow_advFinally, we can improve this further by penalising the KL-divergence between the original predicted distribution and the predicted distribution on the adversarial example. This optimises the network to be more robust, and to predict similar class distributions for similar (adversarial) images.

19. Everything is language modelling

Phil Blunsom presented the idea that almost all NLP can be structured as a language model. We can do this by concatenating the output to the input and trying to predict the probability of the whole sequence.


\(P(\text{Les chiens aiment les os || Dogs love bones})\)

Question answering:

\(P(\text{What do dogs love? || bones .})\)


\(P(\text{How are you? || Fine thanks. And you?})\)

The latter two need to be additionally conditioned on some world knowledge. The second part doesn’t even need to be words, but could be labels or some structured output like dependency relations.

20. SMT had a rough start

When Frederick Jelinek and his team at IBM submitted one of the first papers on statistical machine translation to COLING in 1988, they got the following anonymous review:


The validity of a statistical (information theoretic) approach to MT has indeed been recognized, as the authors mention, by Weaver as early as 1949. And was universally recognized as mistaken by 1950 (cf. Hutchins, MT – Past, Present, Future, Ellis Horwood, 1986, p. 30ff and references therein). The crude force of computers is not science. The paper is simply beyond the scope of COLING.

21. The state of Neural Machine Translation

Apparently a very simple neural model can produce surprisingly good results. An example of translating from Chinese to English, from Phil Blunsom’s slides:


In this model, the vectors for the Chinese words are simply added together to form a sentence vector. The decoder consists of a conditional language model which takes the sentence vector, together with vectors from the two recently generated English words, and generates the next word in the translation.

However, neural models are still not outperforming the very best traditional MT systems. They do come very close though. Results from “Sequence to Sequence Learning with Neural Networks” by Sutskever et al. (2014):

Model  BLEU score
Baseline 33.30
Best WMT'14 result  37.0
Scoring with 5 LSTMs  36.5
Oracle (upper bound)  ∼45

Update: @stanfordnlp pointed out that there are some recent results where the neural model does indeed outperform the state-of-the-art traditional MT system. Check out “Effective Approaches to Attention-based Neural Machine Translation” (Luong et. al., 2015).

22. MetaMind classifier demo

Richard Socher demonstrated the MetaMind image classification demo, which you can train yourself by uploading images. I trained a classifier to detect Edison and Einstein (couldn’t find enough unique images of Tesla). 5 example images for both classes, testing on one held out image each. Seemed to work pretty well.


23. Optimising gradient updates

Mark Schmidt gave two presentations about numerical optimisation in different scenarios.

In a deterministic gradient method we calculate the gradient over the whole dataset and then apply the update. The iteration cost is linear with the dataset size.

In stochastic gradient methods we calculate the gradient on one datapoint and then apply the update. The iteration cost is independent of the dataset size.

Each iteration of the stochastic gradient descent is much faster, but it usually takes many more iterations to train the network, as this graph illustrates:


In order to get the best of both worlds, we can use batching. More specifically, we could do one pass of the dataset with stochastic gradient descent, in order to quickly get on the right track, and then start increasing the batch size. The gradient error decreases as the batch size increases, although eventually the iteration cost will become dependent on the dataset size again.

Stochastic Average Gradient (SAG) is a method that gets around this, providing a linear convergence rate with only 1 gradient per iteration. Unfortunately, it is not feasible for large neural networks, as it needs to remember the gradient updates for every datapoint, leading to large memory requirements. Stochastic Variance-Reduced Gradient (SVRG) is another method that reduces this memory cost, and only needs 2 gradient calculations per iteration (plus occasional full passes).

Mark said a student of his implemented a variety of optimisation methods (AdaGrad, momentum, SAG, etc). When asked, what he would use in a black box neural network system, the student said two methods: Streaming SVRG (Frostig et al., 2015), and a method they haven’t published yet.

24. Theano profiling

If you put “profile=True” into THEANO_FLAGS, it will analyse your program, showing a breakdown of how much is spent on each operation. Very handy for finding bottlenecks.

25. Adversarial nets framework

Following on from Ian Goodfellow’s talk on adversarial examples, Yoshua Bengio talked about having two systems competing with each other.

System D is a discriminative system that aims to classify between real data and artificially generated data.

System G is a generative system, that tries to generate artificial data, which D would incorrectly classify as real.

As we train one, the other needs to get better as well. In practice this does work, although the step needs to be quite small to make sure D can keep up with G. Below are some examples from “Deep Generative Image Models using a Laplacian Pyramid of Adversarial Networks” – a more advanced version of this model which tries to generate images of churches.


26. numbering

The arXiv number contains the year and month of the submission, followed by the sequence number. So paper 1508.03854 was number 3854 in August 2015. Good to know.

Transforming Images to Feature Vectors

I’m keen to explore some challenges in multimodal learning, such as jointly learning visual and textual semantics. However, I would rather not start by attempting to train an image recognition system from scratch, and prefer to leave this part to researchers who are more experienced in vision and image analysis.

Therefore, the goal is to use an existing image recognition system, in order to extract useful features for a dataset of images, which can then be used as input to a separate machine learning system or neural network. We start with a directory of images, and create a text file containing feature vectors for each image.

1. Install Caffe

Caffe is an open-source neural network library developed in Berkeley, with a focus on image recognition. It can be used to construct and train your own network, or load one of the pretrained models. A web demo is available if you want to test it out.

Follow the installation instructions to compile Caffe. You will need to install quite a few dependencies (Boost, OpenCV, ATLAS, etc), but at least for Ubuntu 14.04 they were all available in public repositories.

Once you’re done, run

make test
make runtest

This will run the tests and make sure the installation is working properly.

2. Prepare your dataset

Put all your images you want to process into one directory. Then generate a file containing the path to each image. One image per line. We will use this file to read the images, and it will help you map images to the correct vectors later.

You can run something like this:

find `pwd`/images -type f -exec echo {} \; > images.txt

This will find all files in subdirectory called “images” and write their paths to images.txt

3. Download the model

There are a number of pretrained models publically available for Caffe. Four main models are part of the original Caffe distribution, but more are available in the Model Zoo wiki page, provided by community members and other researchers.

We’ll be using the BVLC GoogLeNet model, which is based on the model described in Going Deeper with Convolutions by Szegedy et al. (2014). It is a 22-layer deep convolutional network, trained on ImageNet data to detect 1,000 different image types. Just for fun, here’s a diragram of the network, rotated 90 degrees:


The Caffe models consist of two parts:

  1. A description of the model (in the form of *.prototxt files)
  2. The trained parameters of the model (in the form of a *.caffemodel file)

The prototxt files are small, and they came included with the Caffe code. But the parameters are large and need to be downloaded separately. Run the following command in your main Caffe directory to download the parameters for the GoogLeNet model:

python scripts/ models/bvlc_googlenet

This will find out where to download the caffemodel file, based on information already in the models/bvlc_googlenet/ directory, and will then place it into the same directory.

In addition, run this command as well:


It will download some auxiliary files for the ImageNet dataset, including the file of class labels which we will be using later.

4. Process images and print vectors

Now is the time to load the model into Caffe, process each image, and print a corresponding vector into a file. I created a script for that (see below, also available as a Gist):

import numpy as np
import os, sys, getopt

# Main path to your caffe installation
caffe_root = '/path/to/your/caffe/'

# Model prototxt file
model_prototxt = caffe_root + 'models/bvlc_googlenet/deploy.prototxt'

# Model caffemodel file
model_trained = caffe_root + 'models/bvlc_googlenet/bvlc_googlenet.caffemodel'

# File containing the class labels
imagenet_labels = caffe_root + 'data/ilsvrc12/synset_words.txt'

# Path to the mean image (used for input processing)
mean_path = caffe_root + 'python/caffe/imagenet/ilsvrc_2012_mean.npy'

# Name of the layer we want to extract
layer_name = 'pool5/7x7_s1'

sys.path.insert(0, caffe_root + 'python')
import caffe

def main(argv):
    inputfile = ''
    outputfile = ''

        opts, args = getopt.getopt(argv,"hi:o:",["ifile=","ofile="])
    except getopt.GetoptError:
        print ' -i <inputfile> -o <outputfile>'

    for opt, arg in opts:
        if opt == '-h':
            print ' -i <inputfile> -o <outputfile>'
        elif opt in ("-i"):
            inputfile = arg
        elif opt in ("-o"):
            outputfile = arg

    print 'Reading images from "', inputfile
    print 'Writing vectors to "', outputfile

    # Setting this to CPU, but feel free to use GPU if you have CUDA installed
    # Loading the Caffe model, setting preprocessing parameters
    net = caffe.Classifier(model_prototxt, model_trained,
                           image_dims=(256, 256))

    # Loading class labels
    with open(imagenet_labels) as f:
        labels = f.readlines()

    # This prints information about the network layers (names and sizes)
    # You can uncomment this, to have a look inside the network and choose which layer to print
    #print [(k, for k, v in net.blobs.items()]

    # Processing one image at a time, printint predictions and writing the vector to a file
    with open(inputfile, 'r') as reader:
        with open(outputfile, 'w') as writer:
            for image_path in reader:
                image_path = image_path.strip()
                input_image =
                prediction = net.predict([input_image], oversample=False)
                print os.path.basename(image_path), ' : ' , labels[prediction[0].argmax()].strip() , ' (', prediction[0][prediction[0].argmax()] , ')'
                np.savetxt(writer, net.blobs[layer_name].data[0].reshape(1,-1), fmt='%.8g')

if __name__ == "__main__":

You will first need to set the caffe_root variable to point to your Caffe installation. Then run it with:

python -i <inputfile> -o <outputfile>

It will first print out a lot of model-specific debugging information, and will then print a line for each input image containing the image name, the label of the most probable class, and the class probability.

flower.jpg  :  n11939491 daisy  ( 0.576037 )
horse.jpg  :  n02389026 sorrel  ( 0.996444 )
beach.jpg  :  n09428293 seashore, coast, seacoast, sea-coast  ( 0.568305 )

At the same time, it will also print vectors into the output file. By default, it will extract the layer pool5/7x7_s1 after processing each image. This is the last layer before the final softmax in the end, and it contains 1024 elements. I haven’t experimented with choosing different layers yet, but this seemed like a reasonable place to start – it should contain all the high-level processing done in the network, but before forcing it to choose a specific class. Feel free to choose a different layer though, just change the corresponding parameter in the script. If you find that specific layers work better, let me know as well.

The outputfile will contain vectors for each image. There will be one line of values for each input image, and every line will contain 1024 values (if you printed the default layer). Mission accomplished!


Below are some tips for when you run into problems.

First, it’s worth making sure you have compiled the python bindings in the Caffe directory:

make pycaffe

I was getting some unusual errors when this code was in a subdirectory of the main Caffe folder. After some googling I found that others had similar problems with other projects, and apparently overlapping library names were causing the wrong dependencies to be included. The simple solution was to move this code out of the Caffe directory, and put it somewhere else.

I installed Caffe with CUDA support, and even though I turned GPU support off in the script, it was still complaining when I didn’t set the CUDA path. For example, I run the code like this (you may need to change the paths to match your system):

LD_LIBRARY_PATH=/usr/local/cuda-7.0/lib64/:$LD_LIBRARY_PATH PYTHONPATH=$PYTHONPATH:/path/to/caffe/python python -i images.txt -o vectors.txt

Finally, Caffe is compiled against a specific version of CUDA. I initially had CUDA 6.5 installed, but after upgrading to CUDA 7.0 the Caffe library had to be recompiled.


There you have it – going from images to vectors. Now you can use these vectors to represent your images in various tasks, such as classification, multi-modal learning, or clustering. Ideally, you will probably want to train the whole network on a specific task, including the visual component, but for starters these pretrained vectors should be quite helpful as well.

These instructions and the script are loosely based on Caffe examples on ImageNet classification and filter visualisation. If the code here isn’t doing quite what you want it to, it’s worth looking at these other similar applications.

If you have any suggestions or fixes, let me know and I’ll be happy to incorporate them in this post.

Multilingual Semantic Models

In this post I’ll discuss a model for learning word embeddings, such that they end up in the same space in different languages. This means we can find the similarity between some English and German words, or even compare the meaning of two sentences in different languages. It is a summary and analysis of the paper by Karl Moritz Hermann and Phil Blunsom, titled “Multilingual Models for Compositional Distributional Semantics“, published at ACL 2014.

The Task

The goal of this work is to extend the distributional hypothesis to multilingual data and joint-space embeddings. This would give us the ability to compare words and sentences in different languages, and also make use of labelled training data from languages other than the target language. For example, below is an illustration of English words and their Estonian translations in the same semantic space.



This actually turns out to be a very difficult task, because the distributional hypothesis stops working across different languages. While “fish” is an important feature of “cat”, because they occur together often, “kass” never occurs with “fish”, because they are in different languages and therefore used in separate sets of documents.

In order to learn these representations in the same space, the authors construct a neural network that learns from parallel sentences (pairs of the same sentence in different languages). The model is then evaluated on the task of topic classification, training on one language and testing on the other.

A bit of a spoiler, but here is a visualisation of some words from the final model, mapped into 2 dimensions.



The words from English, German and French are successfully mapped into clusters based on meaning. The colours indicate gender (blue=male, red=female, green=neutral).

The Multilingual Model

The main idea is as follows: We have sentence \(a\) in one language, and we have a function \(f(a)\) which maps that sentence into a vector representation (we’ll come back to that function). We then have sentence \(b\), which is the same sentence just in a different language, and function \(g(b)\) for mapping it into a vector representation. Our goal is to have \(f(a)\) and \(g(b)\) be identical, because both of these sentences have the same meaning. So during training, we show the model a series of parallel sentences \(a\) and \(b\), and each time we adjust the functions \(f(a)\) and \(g(b)\) so that they would produce more similar vectors.

Here is a graphical representation of the model:


\(b1\), \(b2\) and \(b3\) are words in sentence \(b\); \(a1\), \(a2\),\(a3\) and \(a4\) are words in sentence \(a\). The red vectors in the middle are the sentence representations that we want to be similar.

Next, let’s talk about the functions \(f(a)\) and \(g(b)\) that map a sentence to a vector. As you can see from the image above, each word is represented as a vector as well. The simplest option of going from words to sentences is to just add the individual word vectors together (the ADD model):

\(f_{ADD}(a) = \sum_{i=1}^{n} a_i\)

Here, \(a_i\) is the vector for word number \(i\) in sentence \(a\). This addition is similar to a basic bag-of-words model, because it doesn’t preserve any information about the order of the words. Therefore, the authors have also proposed a bigram version of this function (the BI model):

\(f_{BI}(a) = \sum_{i=1}^{n} tanh(a_{i-1} + a_i)\)

In this function, we step though the sentence, add together vectors for two consecutive words, and pass them through a nonlinearity (tanh). The result is then summed together into a sentence vector. This is essentially a multi-layer compositional network, where word vectors are first combined to bigram vectors, and then bigram vectors are combined to sentence vectors.

One more component to make this model work – the optimization function. The authors define an energy function given two sentences:

\(E(a,b) = || f(a) – g(b) ||^ 2\)

This means we find the Euclidean distance between the two vector representations and take the square of it. This value will be big when the vectors are different, and small when they are similar.

But we can’t directly use this for optimization, because functions \(f(a)\) and \(g(b)\) that always returned zero vectors would be the most optimal solution. We want the model to give similar vectors for similar sentences, but different vectors for semantically different sentences. Here’s a function for that:

\(E_{nc}(a,b,c) = [m + E(a,b) – E(a,c)]_{+}\)

We’ve introduced a randomly selected sentence \(c\) that probably has nothing to do with \(a\) or \(b\). Our objective is to minimze the \(E_{nc}(a,b,c)\) function, which means we want \(E(a,b)\) (for related sentences) to be small, and \(E(a,c)\) (for unrelated sentences) to be large. This form of training – teaching the model to distinguish between correct and incorrect samples – is called noise contrastive estimation. The formula also includes \(m\), which is the margin we want to have between the values of \(E(a,b)\) and \(E(a,c)\). The whole thing is passed through the function \([x]_{+} = max(x,0)\), which means that if \(E(a,c)\) is greater than \(E(a,b)\) by margin \(m\), then we’re already optimal and don’t need to adjust the model further.

The authors also experiment with a document-level variation of the model (the DOC model), where individual sentence vectors are combined into document vectors and these are also optimized to be similar, in addition to the sentence vectors.


The authors evaluate the system on the task of topic classification. The classifier is trained on one language (eg English) and the test results are reported on another language (eg German) for which no labelled training data was used. They run two main experiments:

  1. The cross-lingual document classification (CLDC) task, described by Klementiev et al. (2012). The system is trained on the parallel Europarl corpus, and tested on Reuters RCV1/RCV2. The language pairs used were English-German and English-French.
  2. The authors built a new corpus from parallel subtitles of TED talks (not yet online at the time of writing this), based on a previous TED corpus for IWSLT. Each talk also has topic tags assigned to them, and the task is to assign a correct tag to every talk, using the document-level vector.



First, results on the CLDC task:

I-Matrix is the previous state-of-the-art system, and all the models described here manage to outperform it. The +-variations of the model (ADD+ and BI+) use the French data as an extra training resource, thereby improving performance. This is an interesting result, as the classifier is trained on English and tested on German, and French seems completely unrelated to the task. But by adding it into the model, the English word representations are improved by having more information available, which in turn propagates on to having better German representations.

Next, experiments on the TED corpus:


The authors have performed a much larger number of experiments, and I’ve only chosen a few examples to show here.

The MT System is a machine translation baseline, where the test data is translated into the source language using a machine translation system. The most interesting scenarios for application are where the source language is English, and this is where the MT baseline often still wins. So if we want to topic classification in Dutch, but we only have English labelled data, it’s best to just automatically translate the Dutch text into English before classification.

Experiments in the other direction (where the target language is English) show different results, and the multilingual neural models manage to win on most languages. I’m curious about this difference – perhaps the MT systems are better tuned to translate into English, but not as good when translating from English into other languages? In any case, I think with some additional developments the neural network model will be able to beat the baseline in both directions.

In most cases, adding the document-level training signal (ADD/DOC) helped accuracy quite a bit. The bigram models (BI) however were outperformed by the basic ADD models on this task, and the authors suspect this is due to sparsity issues caused by less training data.

Finally, the ADD/DOC/joint model was trained on all languages simultaneously, taking advantage of parallel data in all the languages, and mapping all vectors into the same space. The results of this experiment seem to be mixed, leading to an improvement on some languages and decrease on others.

In conclusion, this is definitely a very interesting model, and it bridges the gap between vector representations of different languages, using only sentence-aligned plain text. Combining sentence-level and document-level training signals seems to give a fairly consistent improvement in classification accuracy. Unfortunately, in the most interesting scenario, mapping from English to other resource-poor languages, this system does not yet beat the MT baseline. But hopefully this is only the first step, and future research will further improve the results.


Hermann, K. M., & Blunsom, P. (2014). Multilingual Models for Compositional Distributed Semantics. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (pp. 58–68).

Klementiev, A., Titov, I., & Bhattarai, B. (2012). Inducing crosslingual distributed representations of words. In COLING 2012.

Political ideology detection

Neural networks have a range of interesting applications, and here I will discuss on one them: recursive neural networks and the detection of political ideology. This post is a summary and analysis of a recent publication by Mohit Iyyer, Peter Anns, Jordan Boyd-Graber and Philip Resnik: “Political Ideology Detection Using Recursive Neural Networks“.

The Task

Given a sentence, we want the model to detect the political ideology expressed in that sentence. In this research, the authors deal with US politics, so the possible options are liberal (democrats) or conservative (republicans). As a practical application we might consider a system that processes a large amount of news articles or public speeches to detect and measure explicit or hidden political bias of the authors.


A traditional approach to this problem is a simple bag-of-words model, where each word is treated as a separate feature, but this ignores any syntactic structure and even word order. As shown below, political ideology can be compositionally complicated – while certain sections of the sentence are locally conservative, the way they are used in context makes the overall sentence liberal.political_sample_1

Figure 1: Sample sentence from Iyyer et al. (2014). Blue nodes are liberal, red nodes are conservative, grey nodes are neutral.

Recursive Neural Network

The recursive neural network in this work is based on Socher et al. (2011), where it was used for sentiment detection. The idea of semantic composition is that the meaning of a sentence or text is composed of its smaller subparts, and recursive neural networks aim to model this by recursively combining vector representations of words and phrases.

We represent each word as a vector of length d. We can then construct a network that takes two word vectors, concatenated into a vector of length 2d, as input and outputs a new vector of length d. We can then stack these networks together and use the output of one network as one of the inputs to another network. This way we can combine multiple words to phrases and phrases to sentences. We don’t need to have a separate network for each of these levels, and can just reuse the same network each time – which is why these are called recursive neural networks.

Mathematically, the composition happens as follows:

x_c = f(W_L\times x_a + W_R\times x_b + b)

\(x_a\) and \(x_b\) are the vectors for the two input words/phrases; \(x_c\) is the output vector; \(W_L\) and \(W_R\) are the weight matrices for the left and right input; \(b\) is the bias vector. If we think of the input as a concatenation of the vectors, this formula actually becomes the traditional single-layer neural network:

x_c = f(W\times x_{ab} + b)

An example of this recurrent neural network in action is shown below:


During training, an additional softmax layer is added on top, which calculates the probability for both classes (liberal or conservative). The model is then trained to minimize the negative log-probability of the correct class, with L2-regularization.  The error is backpropagated through the whole tree, so the information on the sentence level will be used to modify the individual word vectors, along with all the weights used in the composition.

Also, if I understand the paper correctly, the final prediction from the model at test time is actually generated by averaging vectors for all the nodes in the sentence tree, concatenating this with the root vector, and training a separate logistic regression model over that. Which seems a bit odd, because the recurrent network is quite capable of predicting the class label on the root node, and has been optimised to do so. I wonder what the performance difference would be, if the root-level predictions from the RNN were used directly for evaluation.


The authors use two datasets for evaluation:

  1. The Convote dataset (Thomas et al., 2006) contains transcripts of spoken text from the US Congressional floor debate. The experiments here use 7,816 sentences from the corpus.
  2. The Ideological Books Corpus (IBC) (Gross et al., 2013) is a collection of books and articles by authors whose political bias is well-known. Iyyer et al. (2014) extend the dataset by providing sentence-level and partial phrase-level annotation for 4,062 sentences, crowd-sourced through Crowdflower.

Both of these datasets are originally annotated on the level of the author and their political views, but this information can be used to also label each sentence with its most likely political class.

On both datasets, the authors apply heuristic filters to only keep sentences containing explicit bias in either direction, and remove neutral sentences which take up majority of the corpora. However, as this filtering is based on surface forms (essentially bag-of-words or bag-of phrases), it is unknown how this affects the final dataset. It would be interesting to also see results where the test set contains all of the original sentences.


The authors compare a number of different models.

  • Baselines
    • Random: The label (conservative/liberal) is chosen randomly.
    • LR1: Basic logistic regression using bag-of-words (BoW) features.
    • LR2: As previous, but adding phrase annotations as more training data.
    • LR3: Logistic regression with BoW and dependency-based features.
    • LR-w2v: Logistic regression over averaged word vectors from word2vec.
  • RNN Models
    • RNN1: The basic RNN model, trained on sentence-level annotations, using random initialization.
    • RNN1-w2v: As previous, but initialized with vectors from word2vec.
    • RNN2-w2v: The RNN model, trained on sentence-level and phrase-level annotations, initialized with vectors from word2vec.

The results (accuracy) can be seen in the graph.


Some conclusions we can draw from the results:

  • All the models outperform the random baseline by quite a margin.
  • All the recurrent neural network models outperform all the logistic regression models.
  • Adding the phrase-level annotations as extra training data for the logistic regression decreases performance (LR2), but adding them to the RNN model improves performance (RNN2-w2v).
  • Logistic regression trained on averaged word2vec vectors (LR-w2v) outperforms logistic regression with BoW features (LR1), and even dependency-based features (LR3, on the IBC dataset).
  • Initializing the RNN with word2vec vectors gives a little boost to the overall accuracy.


Gross, J., Acree, B., Sim, Y., & Smith, N. A. (2013). Testing the Etch-a-Sketch Hypothesis : Measuring Ideological Signaling via Candidates’ Use of Key Phrases. In APSA 2013.

Iyyer, M., Enns, P., Boyd-Graber, J., & Resnik, P. (2014). Political Ideology Detection Using Recursive Neural Networks. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (pp. 1113–1122).

Socher, R., Pennington, J., Huang, E. H., Ng, A. Y., & Manning, C. D. (2011). Semi-supervised recursive autoencoders for predicting sentiment distributions. Proceedings of EMNLP. Retrieved from

Thomas, M., Pang, B., & Lee, L. (2006). Get out the vote: Determining support or opposition from Congressional floor-debate transcripts. In Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing – EMNLP ’06 (pp. 327–335).

Don’t count, predict

In the past couple of years, neural networks have nearly taken over the field of NLP, as they are being used in recent state-of-the-art systems for many tasks. One interesting application is distributional semantics, as they can be used to learn intelligent dense vector representations for words. Marco Baroni, Georgiana Dinu and German Kruszewski presented a paper in ACL 2014 called “Don’t count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors“, where they compare these new neural-network models with more traditional context vectors on a range of different tasks. Here I will try to give an overview and a summary of their work.

Distributional hypothesis

The goal is to find how similar two words are to each other semantically. The distributional hypothesis states:

Words which are similar in meaning occur in similar contexts
(Rubenstein & Goodenough, 1965).

Therefore, if we want to find a word similar to “magazine”, we can look for words that occur in similar contexts, such as “newspaper”.

I was reading a magazine today I was reading a newspaper today
The magazine published an article The newspaper published an article
He buys this magazine every day He buys this newspaper every day


Also, if we want to find how similar “magazine” and “newspaper” are, we can compare how similar are all the contexts in which they appear. For example, to find the similarity between two words, we can represent the contexts as feature vectors and calculate the cosine similarity between their corresponding vectors.

The counting model

In a traditional counting model, we count how many times a certain context word appears near our main word. We have to manually choose a window size – the number of words we count on either side of the main word.

For example, given sentences “He buys this newspaper every day” and “I read a newspaper every day”, we can look at window size 2 (on either side of the main word) and create the following count vector:

buys this every day read a
1 1 2 2 1 1

As you can see, “every” and “day” get a count of 2, because they appear in both sentences, whereas “he” doesn’t get included because it doesn’t fit into the window of 2 words.

In order to get the best performance, it is important to use a weighting function which turns these counts into more informative values. A good weighting scheme would downweight frequent words such as “this” and “a”, and upweight more informative words such as “read”. The authors experimented with two weighting schemes: Pointwise Mutual Information  and Local Mutual Information. An overview of various weighting methods can be found in the thesis of Curran (2003).

These vectors are also often compressed. The motivation is that feature words such as “buy” and “purchase” are very similar and compression techniques should be able to represent them as a single semantic class. In addition, this gives smaller dense vectors, which can be easier to operate on. The authors here experimented with SVD and Non-negative Matrix Factorization.

The predicting model

A the neural network (predicting) model, this work uses the word2vec implementation of Continuous Bag-of-Words (CBOW). Word2vec is a toolkit for efficiently learning word vectors from very large text corpora (Mikolov et al., 2013). The CBOW architecture is shown below:


The vectors of context words are given as input to the network. They are summed and then used to predict the main word. During training, the error is backpropagated and the context vectors are updated so that they would predict the correct word. Since similar words appear in similar contexts, the vectors of similar words will also be updated towards similar directions.

A normal network would require finding the probabilities of all possible words in our vocabulary (using softmax), which can be computationally very demanding. Word2vec implements two alternatives to speed this up:

  1. Hierarchical softmax, where the words are arranged in a tree (similar to a decision tree), making the complexity logarithmic.
  2. Negative sampling, Where the system learns to distinguish the correct answer from a sample of a few incorrect answers.

In addition, word2vec can downsample very frequent words (for example, function words such as “a” and “the”) which are not very informative.

From personal experience I have found skip-gram models (also implemented in word2vec) to perform better than CBOW, although they are slightly slower. Skip-grams were not compared in this work, so it is still an open question which of the two models gives better vectors, given the same training time.


The models were trained on a corpus of 2.8 billion tokens, combining ukWac, the English Wikipedia and the British National Corpus. They used 300,000 most frequent words as both the context and target words.

The authors performed on a number of experiments on different semantic similarity tasks and datasets.

  1. Semantic relatedness: In this task, humans were asked to rate the relatedness of two words, and the system correlation with these scores is measured.
    • rg: 65 noun pairs by Rubenstein & Goodenough (1965)
    • ws: Wordsim353, 353 word pairs by Finkelstein et al. (2002)
    • wss: Subset of Wordsim353 focused on similarity (Agirre et al., 2009)
    • wsr: Subset of Wordsim353 focused on relatedness (Agirre et al., 2009)
    • men: 1000 word pairs by Bruni et al. (2014)
  2. Synonym detection: The system needs to find the correct synonym for a word.
    • toefl: 80 multiple-choice questions with 4 synonym candidates (Landauer & Dumais, 1997).
  3. Concept categorization: Given a set of concepts, the system needs to group them into semantic categories.
    • ap: 402 concepts in 21 categories (Almuhareb, 2006)
    • esslli: 44 concepts in 6 categories (Baroni et al., 2008)
    • battig: 83 concepts in 10 categories (Baroni et al., 2010)
  4. Selectional preferences: Using some examples, the system needs to decide how likely a noun is to be a subject or object for a specific verb.
    • up: 221 word pairs (Pado, 2007)
    • mcrae: 100 noun-verb pairs (McRae et al., 1998)
  5. Analogy: Using examples, the system needs to find a suitable word to solve analogy questions, such as: “X is to ‘uncle’ as ‘sister’ is to ‘brother'”
    • an: ~19,500 analogy questions (Mikolov et al., 2013b)
    • ansyn: Subset of the analogy questions, focused on syntactic analogies
    • ansem: Subset of the analogy questions, focused on semantic analogies


The graph below illustrates the performance of different models on all the tasks. The blue bars represent the counting models, and the red bars are for neural network models. The “individual” models are the best models on that specific task, whereas the “best-overall” is the single best model across all the tasks.

In conclusion, the neural networks win with a large margin. The neural models have given a large improvement on the tasks of semantic relatedness, synonym detection and analogy detection. The performance is equal or slightly better on categorization and selectional preferences.

The best parameter choices for counting models are as follows:

  • window size 2 (bigger is not always better)
  • weighted with PMI, not LMI
  • no dimensionality reduction (not using SVD or NNMF)

The best parameters for the neural network model were:

  • window size 5
  • negative sampling (not hierarchical softmax)
  • subsampling of frequent words
  • dimensionality 400

The paper also finds that the neural models are much less sensitive to parameter choices, as even the worst neural models perform relatively well, whereas counting models can thoroughly fail with unsuitable values.

It seems appropriate to quote the final conclusion of the authors:

Our secret wish was to discover that it is all hype, and count vectors are far superior to their predictive counterparts. A more realistic expectation was that a complex picture would emerge, with predict and count vectors beating each other on different tasks. Instead, we found that the predict models are so good that, while triumphalist overtones still sound excessive, there are very good reasons to switch to the new architecture.

The results certainly support that vectors created by neural models are more suitable for distributional similarity tasks. We have also performed a similar comparison on the task of hyponym generation (Rei & Briscoe, 2014). Our findings match those here – the neural models outperform simple bag-of-words models. However, the neural models are in turn outperformed by a dependency-based model. It remains to be seen whether this finding also applies on a wider range of distributional similarity models.


Agirre, E., Alfonseca, E., Hall, K., Kravalova, J., Paşca, M., & Soroa, A. (2009). A study on similarity and relatedness using distributional and WordNet-based approaches. Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics on – NAACL ’09, (June), 19. doi:10.3115/1620754.1620758

Almuhareb, A. (2006) Attributes in Lexical Acquisition. Phd thesis, University of Essex.

Baroni, M., Barbu, E., Murphy, B., & Poesio, M. (2010). Strudel: A distributional semantic model based on properties and types. Cognitive Science, 34.

Baroni, M., Dinu, G., & Kruszewski, G. (2014). Don’t count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (pp. 238–247).

Baroni, M., Evert, S., & Lenci, A., editors. (2008). Bridging the gap between Semantic
Theory and Computational Simulations: Proceedings of the ESSLLI Workshop on Distributional Lexical Semantics.

Bruni, E., Tran, N. K., & Baroni, M. (2014). Multimodal Distributional Semantics. J. Artif. Intell. Res.(JAIR), 49, 1–47.

Curran, J. R. (2003). From distributional to semantic similarity. University of Edinburgh. University of Edinburgh.

Finkelstein, L., Gabrilovich, E., Matias, Y., Rivlin, E., Solan, Z., Wolfman, G., & Ruppin, E. (2002). Placing search in context: the concept revisited. In ACM Transactions on Information Systems (Vol. 20, pp. 116–131). ACM. doi:10.1145/503104.503110

McRae, K., Spivey-Knowlton, M., & Tanenhaus, M. (1998). Modeling the influence of the- matic fit (and other constraints) in on-line sentence comprehension. Journal of Memory and Language, 38:283–312.

Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient Estimation of Word Representations in Vector Space. ICLR Workshop, 1–12.

Mikolov, T., Yih, W., & Zweig, G. (2013b). Linguistic Regularities in Continuous Space Word Representations, (June), 746–751.

Pado, U. (2007). The Integration of Syntax and Semantic Plausibility in a Wide-Coverage Model of Sentence Processing. Dissertation, Saarland University, Saarbrücken.

Rei, M., & Briscoe, T. (2014). Looking for Hyponyms in Vector Space. In CoNLL-2014 (pp. 68–77).

Rubenstein, H., & Goodenough, J. (1965). Contextual correlates of synonymy.
Communications of the ACM, 8 (10), 627–633.