57 Summaries of Machine Learning and NLP Research

Staying on top of recent work is an important part of being a good researcher, but this can be quite difficult. Thousands of new papers are published every year at the main ML and NLP conferences, not to mention all the specialised workshops and everything that shows up on ArXiv. Going through all of them, even just to find the papers that you want to read in more depth, can be very time-consuming.

In this post, I have summarised 50 papers. After going through a paper, if I had the chance, I would write down a few notes and summarise the work in a couple of sentences. These are not meant as reviews – I’m not commenting on whether I think the paper is good or not. But I do try to present the crux of the paper as bluntly as possible, without unnecessary sales tactics. Hopefully this can give you the general idea of 50 papers, in roughly 20 minutes of reading time.

The papers are not selected or ordered based on any criteria. It is not a list of the best papers I have read, more like a random sample. The only filter that I applied was to exclude papers older than 2016, as the goal is to give an overview of the more recent work.

I set out to summarise 50 papers. Once I was done, I thought this would be a sensible place to summarise my own work as well. So at the end of the list you will also find brief summaries of the papers I published in 2017.

Let’s get started.

1. A Thorough Examination of the CNN/Daily Mail Reading Comprehension Task
Danqi Chen, Jason Bolton, Christopher D. Manning. Stanford. ACL 2016.

Hermann et al (2015) created a dataset for testing reading comprehension by extracting summarised bullet points from CNN and Daily Mail. All the entities in the text are anonymised and the task is to place correct entities into empty slots based on the news article.


The authors hand-review 100 samples from the dataset and conclude that around 25% of the questions are difficult or impossible to answer even for a human, mostly due to the anonymisation process. They present a simple classifier that achieves unexpectedly good results, and a neural network based on attention that beats all previous results by quite a margin.

2. Word Translation Without Parallel Data
Alexis Conneau, Guillaume Lample, Marc’Aurelio Ranzato, Ludovic Denoyer, Hervé Jégou. Facebook, Le Mans, Sorbonne. ArXiv 2017.

Inducing word translations using only monolingual corpora for two languages. Separate embeddings are trained for each language and a mapping is learned through an adversarial objective, along with an orthogonality constraint on the most frequent words. A strategy for an unsupervised stopping criterion is also proposed.

3. A Nested Attention Neural Hybrid Model for Grammatical Error Correction
Jianshu Ji, Qinlong Wang, Kristina Toutanova, Yongen Gong, Steven Truong, Jianfeng Gao. ACL 2017.

Proposing character-based extensions to a neural MT system for grammatical error correction. OOV words are represented in the encoder and decoder using character-based RNNs. They evaluate on the CoNLL-14 dataset, integrate probabilities from a large language model, and achieve good results.

4. On-line Active Reward Learning for Policy Optimisation in Spoken Dialogue Systems
Pei-Hao Su, Milica Gasic, Nikola Mrksic, Lina Rojas-Barahona, Stefan Ultes, David Vandyke, Tsung-Hsien Wen, Steve Young. Cambridge. ACL 2016.

The goal is to improve the training process for a spoken dialogue system, more specifically a telephone-based system providing restaurant information for the Cambridge (UK) area. They train a supervised system which tries to predict the success of the current dialogue – if the model is certain about the outcome, the predicted label is used for training the dialogue system; if the model is uncertain, the user is asked to provide a label. Essentially it reduces the amount of annotation that is required, by choosing which examples should be annotated through active learning.


The dialogue is mapped to a vector representation using a bidirectional LSTM trained like an autoencoder, and a Gaussian Process is used for modelling dialogue success.
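
As a rough illustration of the labelling decision described above, here is a minimal sketch. It assumes the success model outputs a single probability and that certainty is checked against a fixed threshold; the function name, the threshold value and the ask_user callback are all hypothetical and not taken from the paper, which models uncertainty with a Gaussian Process.

```python
def choose_label(p_success, ask_user, threshold=0.85):
    """Decide where the training label for one dialogue comes from.

    p_success: predicted probability that the dialogue was successful.
    ask_user:  callback that queries the user and returns True/False.
    threshold: hypothetical certainty cut-off (an assumption, not from the paper).
    """
    # Certainty is how far the prediction is from a coin flip.
    confidence = max(p_success, 1.0 - p_success)
    if confidence >= threshold:
        # The model is certain: use its own prediction as the label.
        return (p_success >= 0.5, "model")
    # The model is uncertain: spend annotation effort on this dialogue.
    return (ask_user(), "user")
```

Only the dialogues that fall below the threshold cost any user effort, which is where the reduction in annotation comes from.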

5. Vision and Feature Norms: Improving automatic feature norm learning through cross-modal maps
Luana Bulat, Douwe Kiela, Stephen Clark. Cambridge. NAACL 2016.

The task is to predict feature norms – object properties, for example is_yellow and is_edible for the word banana. They experiment with adding in image recognition features, in addition to using distributional word vectors.

An input word is used to retrieve 10 images from Google, these are passed through an ImageNet classifier to get feature vectors, and then averaged to get a vector representation for that word. A supervised model (partial least-squares regression) is then trained to predict vectors of feature norms based on the input vectors (image-based, distributional, or a combination). Including the image information helps quite a bit, especially for detecting properties like colour and shape.

Examples of predicted feature norms using the visual features.

6. Adversarial examples in the physical world
Alexey Kurakin, Ian J. Goodfellow, Samy Bengio. Google, OpenAI. ArXiv.

Adversarial examples are datapoints that are designed to fool a classifier. For example, we can take an image that is classified correctly using a neural network, then backprop through the model to find which changes we need to make in order for it to be classified as something else. And these changes can be quite small, such that a human would hardly notice a difference.

Examples of adversarial images.

In this paper, they show that much of this property holds even when the images are fed into the classifier from the real world – after being photographed with a cell phone camera. While the accuracy goes from 85.3% to 36.3% when adversarial modifications are applied on the source images, the performance still drops from 79.8% to 36.4% when the images are photographed. They also propose two modifications to the process of generating adversarial images  – making it into a more gradual iterative process, and optimising for a specific adversarial class.
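
To make the idea of a gradual, iterative attack concrete, here is a small sketch. It uses a tiny logistic regression in place of an image classifier (an assumption made purely to keep the example self-contained), and it omits the clipping to valid pixel ranges that a real image attack would need. All function names are illustrative.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(w, b, x):
    """Probability of class 1 under a logistic regression model."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

def sign(v):
    return (v > 0) - (v < 0)

def adversarial_step(w, b, x, y, eps):
    """Move x by eps in the direction that increases the loss for label y."""
    p = predict(w, b, x)
    # Gradient of the cross-entropy loss with respect to the input.
    grad = [(p - y) * wi for wi in w]
    return [xi + eps * sign(g) for xi, g in zip(x, grad)]

def iterative_attack(w, b, x, y, eps, steps):
    """Split the total budget eps into several smaller gradient-sign steps."""
    x_adv = list(x)
    for _ in range(steps):
        x_adv = adversarial_step(w, b, x_adv, y, eps / steps)
    return x_adv
```

With w = [2.0, -1.0], b = 0.0 and x = [0.5, 0.2], the input starts out classified as class 1, and five steps with a total budget of 0.5 are enough to push the prediction below 0.5.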

7. Extracting token-level signals of syntactic processing from fMRI – with an application to POS induction
Joachim Bingel, Maria Barrett, Anders Søgaard. Copenhagen. ACL 2016.

They incorporate fMRI features into POS tagging, under the assumption that reading semantically/functionally different words will activate the brain in different ways. For this they use a dataset of fMRI recordings, where the subjects were reading a chapter of Harry Potter. The main issue is that fMRI has very low temporal resolution – there is only one fMRI reading per 4 tokens, and in general it takes around 4-14 seconds for something to show up in fMRI. Nevertheless, they construct token-level vectors by using a Gaussian weighted average, integrate them into an unsupervised POS tagger, and show that it is able to improve performance.

Neural activity by brain region, from Wehbe et al. (2014).


8. Joint Extraction of Events and Entities within a Document Context
Bishan Yang, Tom Mitchell. Carnegie Mellon. NAACL 2016.

They propose a joint model for 1) identifying event keywords in a text, 2) identifying entities, and 3) identifying the connections between these events and entities. They also do this across different sentences, jointly for the whole text.

Example of the entity and event annotation that the system is modelling.

The entity detection part is done with a CRF; the structure of an event is learned with a probabilistic graphical model; information is integrated from surrounding sentences using a Stanford coreference system; and these are all tied together across the whole document using Integer Linear Programming.

9. Candidate re-ranking for SMT-based grammatical error correction
Zheng Yuan, Ted Briscoe, Mariano Felice. Cambridge. BEA Workshop 2016.

They improve an existing error correction system by re-ranking its predictions. The basic approach uses machine translation to perform error correction on learner texts – the incorrect text is essentially translated into correct text. Here, they include a ranking SVM to score and reorder the n-best lists from the translation model.

The reranking features include various internal scores from the translation model, the rank in the original ordering, language model probabilities trained on large corpora, language model scores based on only the n-best list, word-level translation probabilities, and sentence length features. They show improvement on two error correction datasets.
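
The reranking step can be pictured as scoring each candidate with a weighted sum of its features and sorting. The sketch below assumes the weights have already been learned (the paper uses a ranking SVM for that); the feature names and the dictionary layout are hypothetical.

```python
def rerank(candidates, weights):
    """Reorder an n-best list by a linear score over candidate features.

    candidates: list of dicts with a "text" and a "features" dict.
    weights:    feature name -> learned weight (assumed to be given).
    """
    def score(features):
        # Features without a weight contribute nothing to the score.
        return sum(weights.get(name, 0.0) * value
                   for name, value in features.items())

    return sorted(candidates, key=lambda c: score(c["features"]), reverse=True)
```

For example, a candidate with a slightly worse translation score but a much better language model score can move to the top of the list once the language model feature gets a non-zero weight.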

Example output from the models.

10. Variational Neural Machine Translation
Biao Zhang, Deyi Xiong, Jinsong Su, Hong Duan, Min Zhang. Soochow University, Xiamen University. ArXiv.

They start with the neural machine translation model using alignment, by Bahdanau et al. (2014), and add an extra variational component.


The authors use two neural variational components to model a distribution over latent variables z that captures the semantics of a sentence being translated. First, they model the posterior probability of z, conditioned on both input and output. Then they also model the prior of z, conditioned only on the input. During training, these two distributions are optimised to be similar using Kullback-Leibler divergence, and during testing the prior is used. They report improvements on Chinese-English and English-German translation, compared to using the original encoder-decoder NMT framework.
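
The term that pulls the posterior and the prior together has a simple closed form if both are diagonal Gaussians. That is an assumption made here for illustration; the summary above does not state the form of the distributions. A minimal sketch:

```python
import math

def kl_diag_gaussians(mu_q, var_q, mu_p, var_p):
    """KL(q || p) between two Gaussians with diagonal covariance.

    q plays the role of the posterior (conditioned on input and output),
    p the role of the prior (conditioned on the input only).
    """
    total = 0.0
    for mq, vq, mp, vp in zip(mu_q, var_q, mu_p, var_p):
        total += 0.5 * (math.log(vp / vq) + (vq + (mq - mp) ** 2) / vp - 1.0)
    return total
```

The value is zero when the two distributions are identical and grows as the prior drifts away from the posterior, so minimising it during training makes the prior a usable substitute at test time.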

11. Numerically Grounded Language Models for Semantic Error Correction
Georgios P. Spithourakis, Isabelle Augenstein, Sebastian Riedel. UCL. EMNLP 2016.

They create an LSTM neural language model that 1) has better handling of numerical values, and 2) is conditioned on a knowledge base.


First, the numerical value of each token is given as an additional signal to the network at each time step. While we normally represent token “25” as a normal word embedding, we now also have an extra feature with numerical value float(25). Second, they condition the language model on text in a knowledge base. All the information in the KB is converted to a string, passed through an LSTM and then used to condition the main LM.
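
The numerical grounding can be sketched as appending one extra input feature to each token's embedding. The helper names, the zero default for non-numeric tokens and the plain-list embeddings below are assumptions for illustration only.

```python
def numeric_value(token):
    """Return float(token) for numeric tokens, and 0.0 otherwise (an assumed default)."""
    try:
        return float(token)
    except ValueError:
        return 0.0

def grounded_input(token, embeddings, unk):
    """Concatenate the word embedding with the token's numerical value."""
    return embeddings.get(token, unk) + [numeric_value(token)]
```

With this layout the token "25" keeps its usual embedding but also carries the value 25.0, while a word like "blood" gets 0.0 in the extra slot.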

They evaluate on a dataset of 16,003 clinical records which come paired with small KB tuples of 20 possible attributes. The numerical grounding helps quite a bit, and the best results are obtained when the KB conditioning is also added.

12. Black Holes and White Rabbits : Metaphor Identification with Visual Features
Ekaterina Shutova, Douwe Kiela, Jean Maillard. Cambridge. NAACL 2016.

They build a system for detecting metaphors (“blind alley”, “honest meal”, etc) from literal word pairs.

Annotated metaphor examples from Tsvetkov et al. (2014), used in this work.

The basic system uses word embedding similarity – cosine between the word embeddings. Then they explore variations using phrase embeddings, cos(phrase-word2, word2), which is similar to the operations with word regularities by Mikolov.

Finally, they create vector representations for words and phrases using visual information. The words are used as queries in Google Image Search, and the returned images are passed through an image detection network in order to obtain vector representations. The best final system performs the task separately using linguistic and visual vectors, and then combines the resulting scores.

13. Counter-fitting Word Vectors to Linguistic Constraints
Nikola Mrkšić, Diarmuid Ó Séaghdha, Blaise Thomson, Milica Gašić, Lina Rojas-Barahona, Pei-Hao Su, David Vandyke, Tsung-Hsien Wen, Steve Young. Cambridge, Apple. NAACL 2016.

They describe a method for augmenting existing word embeddings with knowledge of semantic constraints. The idea is similar to retrofitting by Faruqui et al. (2015), but using additional constraints and a different optimisation function.

Existing word vectors are further optimised to 1) have high similarity for known synonyms, 2) have low similarity for known antonyms, and 3) have high similarity to words that were highly similar in the original space. They evaluate on SimLex-999, showing state-of-the-art performance. Also, they use the method to improve a dialogue tracking system.

14. Bidirectional RNN for Medical Event Detection in Electronic Health Records
Abhyuday N. Jagannatha, Hong Yu. University of Massachusetts. NAACL 2016.

The authors have a dataset of 780 electronic health records and they use it to detect various medical events such as adverse drug events, drug dosage, etc. The task is done by assigning a label to each word in the document.

Annotation statistics for the corpus of health records.

They look at CRFs, LSTMs and GRUs. Both LSTMs and GRUs outperform the CRF, but the best performance is achieved by a GRU trained on whole documents.

15. Symmetric Patterns and Coordinations: Fast and Enhanced Representations of Verbs and Adjectives
Roy Schwartz, Roi Reichart, Ari Rappoport. The Hebrew University, IIT. NAACL 2016.

They train word2vec skip-gram embeddings using coordinations as context. They use 11 manual patterns to extract coordinations (eg “X and Y”, “either X or Y”, etc). From “boats or planes”, “boats” will be a context of “planes” and “planes” will be a context of “boats”.
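
A rough sketch of how such contexts could be extracted is shown below. It uses only two illustrative regular expressions rather than the 11 manual patterns from the paper, and the function name is hypothetical.

```python
import re

# Two illustrative coordination patterns; the paper uses 11 manual ones.
PATTERNS = [
    re.compile(r"\b(\w+) and (\w+)\b"),
    re.compile(r"\b(\w+) or (\w+)\b"),
]

def coordination_contexts(sentence):
    """Return symmetric (word, context) pairs found by the patterns."""
    pairs = set()
    for pattern in PATTERNS:
        for x, y in pattern.findall(sentence.lower()):
            # Coordinations are symmetric, so each word is a context of the other.
            pairs.add((x, y))
            pairs.add((y, x))
    return pairs
```

Calling coordination_contexts("boats or planes") yields both ("boats", "planes") and ("planes", "boats"), and these pairs would then replace the usual window-based contexts when training skip-gram.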

They evaluate on SimLex-999 and find that this performs badly on nouns. However, it beats normal skip-gram and dependency-based skip-gram on verbs and adjectives.

16. Comparing Data Sources and Architectures for Deep Visual Representation Learning in Semantics
Douwe Kiela, Anita L. Verő, Stephen Clark. Cambridge. EMNLP 2016.

The authors compare different image recognition models and image data sources for multimodal word representation learning.

Image recognition models used for vector generation

Experiments are performed on SimLex-999 (similarity) and MEN (relatedness). The performance of different models (AlexNet, GoogLeNet, VGGNet) is found to be quite similar, with VGGNet performing slightly better at the cost of requiring more computation. Using search engines for image sources gives good coverage; ImageNet performs quite well with VGGNet; ESP Game dataset gave the lowest performance. Combining visual and linguistic vectors was found to be beneficial on both English and Italian.

17. Named Entity Recognition for Novel Types by Transfer Learning
Lizhen Qu, Gabriela Ferraro, Liyuan Zhou, Weiwei Hou, Timothy Baldwin. Melbourne. EMNLP 2016.

The authors tackle the problem of domain adaptation for NER, where the label set of the target domain is different from the source domain.

They first train a CRF model on the source domain. Next, they train a LR classifier to predict labels in the target domain, based on predicted label scores from the model. Finally, the weights from the classifier are used to initialise another CRF model, which is then fine-tuned on the target domain data.

Performance of the proposed model (TransInit) compared to baselines

18. Hybrid computing using a neural network with dynamic external memory
Alex Graves, Greg Wayne, Malcolm Reynolds et al. DeepMind. Nature.

The DeepMind guys present an extension to the Neural Turing Machine architecture.


They call it a Differentiable Neural Computer (DNC) and it uses 1) an attention mechanism to access information in a matrix that acts as a memory, 2) an attention mechanism to save information to that memory, and 3) a transition matrix that stores information about the order in which rows in the memory are modified, in order to better handle sequential information. They test on the bAbI question answering dataset, a graph inference task, and on solving a puzzle of arranging blocks.

19. A Neural Approach to Automated Essay Scoring
Kaveh Taghipour, Hwee Tou Ng. Singapore. EMNLP 2016.

The authors construct a neural network for automated essay scoring.


A convolution with a window of 3 is passed over the text, and its output is used as input to an LSTM. The output of the LSTM is averaged over all timesteps and then a single value in the range of [0,1] is predicted as a scaled-down score for the essay. They evaluate by measuring quadratic weighted Kappa on the Kaggle essay scoring dataset.
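
For readers unfamiliar with the metric, here is a minimal sketch of quadratic weighted Kappa for integer ratings. It assumes at least two possible rating values and some variation in the ratings, otherwise the denominator is zero; the function signature is my own.

```python
def quadratic_weighted_kappa(gold, pred, min_rating, max_rating):
    """Agreement between two lists of integer ratings, penalising
    disagreements by the squared distance between the ratings."""
    n = max_rating - min_rating + 1
    # Confusion matrix of gold rating (rows) against predicted rating (columns).
    observed = [[0.0] * n for _ in range(n)]
    for g, p in zip(gold, pred):
        observed[g - min_rating][p - min_rating] += 1
    total = float(len(gold))
    gold_hist = [sum(row) for row in observed]
    pred_hist = [sum(observed[i][j] for i in range(n)) for j in range(n)]

    num = 0.0
    den = 0.0
    for i in range(n):
        for j in range(n):
            weight = (i - j) ** 2 / float((n - 1) ** 2)
            # Counts expected if the two raters were independent.
            expected = gold_hist[i] * pred_hist[j] / total
            num += weight * observed[i][j]
            den += weight * expected
    return 1.0 - num / den
```

Perfect agreement gives 1.0, chance-level agreement gives a value around 0, and systematically reversed ratings give a negative value.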

20. Globally Coherent Text Generation with Neural Checklist Models
Chloe Kiddon, Luke Zettlemoyer, Yejin Choi. Washington. EMNLP 2016.

They describe a neural model for text generation, which keeps track of a checklist of items that need to be mentioned in the text.

The basic system is an encoder-decoder GRU model for text generation. On top of that, the model uses attention over items that need to be mentioned and items that have already been mentioned, both of which are encoded as vectors. An additional cost objective encourages the checklist to be filled by the end of the text. Evaluation is performed on recipe and dialogue generation.

21. Automatic Features for Essay Scoring – An Empirical Study
Fei Dong, Yue Zhang. Singapore. EMNLP 2016.

The authors investigate convolutional networks for essay scoring. They use a two-level convolution – first over words and then over sentences. Evaluation is performed on the Kaggle ASAP dataset, training separate models on individual topics, and also reporting some cross-topic results.


22. Learning Deep Structure-Preserving Image-Text Embeddings
Liwei Wang, Yin Li, Svetlana Lazebnik. University of Illinois, Georgia Tech. CVPR 2016.…CVPR_2016_paper.pdf

The authors present a neural model that maps images and sentences into the same space, in order to perform cross-modal retrieval – find images based on a sentence or find sentences based on an image.

The image vectors come from a pre-trained VGG image detection network. The sentence vectors are constructed using Fisher vectors, but they also explore simpler options, such as mean word2vec vectors and tfidf. Both are then mapped through nonlinearities and normalised, and Euclidean distance is used to measure vector similarity. They also investigate the task of mapping noun phrases from the image caption to specific areas of the image.

23. Understanding deep learning requires rethinking generalization
Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, Oriol Vinyals. Google Brain, DeepMind. ICLR 2017.

The authors investigate the generalisation properties of several well-known image recognition networks.


They show that these networks are able to overfit to the training set with 100% accuracy even if the labels on the images are random, or if the pixels are randomly generated. Regularisation, such as weight decay and dropout, doesn’t stop overfitting as much as expected, still resulting in ~90% accuracy on random training data. They then argue that these models likely make use of massive memorization, in combination with learning low-complexity patterns, in order to perform well on these tasks.

24. Reinforcement Learning with Unsupervised Auxiliary Tasks
Max Jaderberg, Volodymyr Mnih, Wojciech Marian Czarnecki, Tom Schaul, Joel Z Leibo, David Silver & Koray Kavukcuoglu. DeepMind. ICLR 2017.

They describe a version of reinforcement learning where the system also learns to solve some auxiliary tasks, which helps with the main objective.


In addition to normal Q-learning, which predicts the downstream reward, they have the system learning 1) a separate policy for maximally changing the pixels on the screen, 2) maximally activating units in a hidden layer, and 3) predicting the reward at the next step, using biased sampling. They show that this improves learning speed and performance on Atari games and Labyrinth (a Quake-like 3D game).

25. Modelling metaphor with attribute-based semantics
Luana Bulat, Stephen Clark, Ekaterina Shutova. Cambridge. EACL 2017.

They propose using attribute-based vectors for detecting metaphorical word pairs.

Traditional embeddings (word2vec and count-based) are mapped to attribute vectors, using a supervised system trained on McRae norms. These vectors for a word pair are then given as input to an SVM classifier and trained to detect metaphorical (black humour) vs literal (black dress) word pairs. They show that using the attribute vectors gives higher F score over using the original vector space.

26. Enriching Word Vectors with Subword Information
Piotr Bojanowski, Edouard Grave, Armand Joulin, Tomas Mikolov. Facebook. ArXiv 2016.

They extend skip-grams for word embeddings to use character n-grams. Each word is represented as a bag of character n-grams, 3-6 characters long, plus the word itself. Each of these has their own embedding which gets optimised to predict the surrounding context words using skip-gram optimisation. They evaluate on word similarity and analogy tasks, in different languages, and show improvement on most benchmarks.
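
The bag of character n-grams is easy to sketch. The "<" and ">" boundary markers follow the convention used in the paper; the function name and the list-based output are choices made for this example.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Return the character n-grams of a word, plus the whole word itself."""
    # Boundary markers distinguish prefixes and suffixes from word-internal n-grams.
    marked = "<" + word + ">"
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(marked) - n + 1):
            grams.append(marked[i:i + n])
    # The full word is kept as its own unit, unless an n-gram already covers it.
    if marked not in grams:
        grams.append(marked)
    return grams
```

Each returned unit would get its own embedding, and a word's representation is built from the embeddings of its units, which lets rare and unseen words share parameters with words that look similar.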

27. Learning to Compose Words into Sentences with Reinforcement Learning
Dani Yogatama, Phil Blunsom, Chris Dyer, Edward Grefenstette, Wang Ling. DeepMind. ICLR 2017.

The aim is to have the system discover a method for parsing that would benefit a downstream task.


They construct a neural shift-reduce parser – as it’s moving through the sentence, it can either shift the word to the stack or reduce two words on top of the stack by combining them. A Tree-LSTM is used for composing the nodes recursively. The whole system is trained using reinforcement learning, based on an objective function of the downstream task. The model learns parse rules that are beneficial for that specific task, either without any prior knowledge of parsing or by initially training it to act as a regular parser.
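
The shift and reduce mechanics can be sketched without any neural components. Below, nested tuples stand in for the Tree-LSTM composition, and the action sequence is given by hand rather than chosen by a learned policy; both are simplifications for illustration.

```python
def build_tree(words, actions):
    """Apply a sequence of "shift" / "reduce" actions to build a binary tree."""
    buffer = list(words)
    stack = []
    for action in actions:
        if action == "shift":
            # Move the next word from the sentence onto the stack.
            stack.append(buffer.pop(0))
        elif action == "reduce":
            # Combine the two topmost elements into a single node.
            right = stack.pop()
            left = stack.pop()
            stack.append((left, right))
        else:
            raise ValueError("unknown action: " + action)
    if buffer or len(stack) != 1:
        raise ValueError("actions do not form a complete tree")
    return stack[0]
```

Different action sequences over the same sentence give different trees, for example (("the", "cat"), "sat") versus ("the", ("cat", "sat")), and the reinforcement learning objective decides which sequences are worth preferring.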

28. Identifying beneficial task relations for multi-task learning in deep neural networks
Joachim Bingel, Anders Søgaard. Copenhagen. EACL 2017.

The authors investigate the benefit of different task combinations when performing multi-task learning.


They experiment with all possible pairs of 10 sequence labeling datasets, switching between the datasets during training. They find that multi-task learning helps more when the main task quickly plateaus while the auxiliary task does not, likely helping the model out of local minima.
There does not seem to be any auxiliary task that would help on all main tasks, but chunking and semantic tagging seem to perform best.

29. Literal and Metaphorical Senses in Compositional Distributional Semantic Models
E. Darío Gutiérrez, Ekaterina Shutova, Tyler Marghetis, Benjamin K. Bergen. UCSD, Cambridge, Bloomington. ACL 2016.

The paper investigates compositional semantic models specialised for metaphors.


They construct a dataset of 8592 adjective-noun phrases, covering 23 different adjectives, annotated for being metaphorical or literal. They then train compositional models to predict the phrase vector based on the noun vector, as a linear combination with an adjective-specific weight matrix. They show that it’s better to learn separate adjective matrices for literal and metaphorical uses of each adjective, even though the amount of training data is smaller.

30. Data Noising as Smoothing in Neural Network Language Models
Ziang Xie, Sida I. Wang, Jiwei Li, Daniel Levy, Aiming Nie, Daniel Jurafsky, Andrew Y. Ng. Stanford. ICLR 2017.

The paper investigates better noising techniques for RNN language models.


A noising technique from previous work would be to randomly replace words in the context or replace them with a blank token. Here they investigate ways of choosing better which words to replace and choosing the replacements from a better distribution, inspired by methods in n-gram smoothing. They show improvement on language modeling (PTB and text8) and machine translation (English-German).
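
The two baseline noising schemes mentioned above can be sketched as follows. The function name, the blank symbol and the fixed noising probability gamma are assumptions for this example; the paper's contribution is precisely in choosing gamma and the replacement distribution more cleverly than this.

```python
import random

def noise_context(tokens, gamma, rng, unigram=None, blank="_"):
    """Randomly noise a token sequence.

    gamma:   probability of noising each token.
    rng:     a random.Random instance, passed in for reproducibility.
    unigram: optional dict word -> probability; if given, noised tokens are
             replaced by a sample from it, otherwise by the blank token.
    """
    noised = []
    for token in tokens:
        if rng.random() < gamma:
            if unigram:
                words, probs = zip(*unigram.items())
                noised.append(rng.choices(words, weights=probs, k=1)[0])
            else:
                noised.append(blank)
        else:
            noised.append(token)
    return noised
```

Setting gamma to 0 leaves the input untouched, and larger values trade off more regularisation against less informative contexts.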

31. Neural Belief Tracker: Data-Driven Dialogue State Tracking
Nikola Mrkšić, Diarmuid Ó Séaghdha, Tsung-Hsien Wen, Blaise Thomson, Steve Young. Cambridge, Apple. ACL 2017.

They propose neural models for dialogue state tracking, making a binary decision for each possible slot-value pair, based on the latest context from the user and the system. The context utterances and the slot-value option are encoded into vectors, either by summing word representations or using a convnet. These vectors are then further combined to produce a binary output. The systems are evaluated on two dialogue datasets and show improvement over baselines that use hand-constructed lexicons.


32. Neural Architectures for Fine-grained Entity Type Classification
Sonse Shimaoka, Pontus Stenetorp, Kentaro Inui, Sebastian Riedel. Tohoku, UCL. EACL 2017.

They propose a neural architecture for assigning fine-grained labels to detected entity types. The model combines bidirectional LSTMs, attention over the context sequence, hand-engineered features, and the label hierarchy. They evaluate on Figer and OntoNotes datasets, showing improvements from each of the extensions.


33. Recurrent Additive Networks
Kenton Lee, Omer Levy, Luke Zettlemoyer. Washington, Allen Institute. ArXiv 2017.

The authors propose a simplified version of LSTMs. Some non-linearities and weighted components are removed, in order to arrive at the recurrent additive network (RAN). The model is evaluated on 3 language modeling datasets: PTB, the billion word benchmark, and character-level Text8.

34. A Sensitivity Analysis of (and Practitioners’ Guide to) Convolutional Neural Networks for Sentence Classification
Ye Zhang, Byron Wallace. UT Austin. IJCNLP 2017.

The authors perform a hyperparameter search for a single-layer CNN on 9 different sentence classification datasets.
They find that the optimal embedding initialisation, filter size and number of feature maps depend on the dataset and should be chosen through a search; ReLU and tanh are the best activation functions; 1-max pooling is the best pooling method; dropout may help when the number of feature maps gets large.


35. On Using Monolingual Corpora in Neural Machine Translation
Caglar Gulcehre, Orhan Firat, Kelvin Xu, Kyunghyun Cho, Loic Barrault, Huei-Chi Lin, Fethi Bougares, Holger Schwenk, Yoshua Bengio. Montreal, METech, Maine. Computer Speech and Language 2016.

The authors extend a seq2seq model for MT with a language model. They first pre-train a seq2seq model and a neural language model, then train a separate feedforward component that takes the hidden states from both and combines them together to make a prediction. They compare to simply combining the output probabilities from both models (shallow fusion) and show improvement on different MT datasets.
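
The shallow fusion baseline is simple enough to sketch for a single decoding step. The interpolation weight beta, the dictionary inputs and the function name are assumptions for illustration, and all probabilities are assumed to be strictly positive.

```python
import math

def shallow_fusion(tm_probs, lm_probs, beta=0.2):
    """Pick the next word by combining translation and language model scores.

    tm_probs: word -> probability from the translation model.
    lm_probs: word -> probability from the language model (same vocabulary).
    beta:     weight of the language model (hypothetical default).
    """
    scores = {
        word: math.log(p) + beta * math.log(lm_probs[word])
        for word, p in tm_probs.items()
    }
    return max(scores, key=scores.get)
```

The approach proposed in the paper differs in that the combination happens on hidden states through a trained component, rather than on the output probabilities as here.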


36. Semi-supervised sequence tagging with bidirectional language models
Matthew E. Peters, Waleed Ammar, Chandra Bhagavatula, Russell Power. Allen Institute. ACL 2017.

The paper proposes integrating a pre-trained language model into a sequence labeling model. The baseline model for sequence labeling is a two-layer LSTM/GRU. They concatenate the hidden states from pre-trained language models onto the output of the first LSTM layer. This provides an improvement on NER and chunking tasks.

37. Weakly Supervised Part-of-speech Tagging Using Eye-tracking Data
Maria Barrett, Joachim Bingel, Frank Keller, Anders Søgaard. Copenhagen. ACL 2016.

The paper explores the usefulness of eye tracking for the task of POS tagging. The assumption is that readers skip quickly over closed class words, and fixate longer on rare or ambiguous words.

The experiments are performed on unsupervised POS tagging – a second-order HMM uses constraints on possible tags for each word (based on a dictionary), but no explicit annotated data is required. They show that including the eye tracking features improves performance by quite a bit. Surprisingly, it seems to be better to average eye tracking features over all training tokens of the same type, as opposed to using the data for each individual token, which means eye tracking is only used during the training stage.

38. Massive Exploration of Neural Machine Translation Architectures
Denny Britz, Anna Goldie, Minh-Thang Luong, Quoc Le. Google Brain. EMNLP 2017.

Investigates different parameter choices for encoder-decoder NMT models. They find that LSTM is better than GRU, 2 bidirectional layers is enough, additive attention is the best, and a well-tuned beam search is important. They achieve good results on the WMT15 English->German task and release the code.

39. Learning to Reason: End-to-End Module Networks for Visual Question Answering
Ronghang Hu, Jacob Andreas, Marcus Rohrbach, Trevor Darrell, Kate Saenko. Berkeley, Facebook, Boston. ICCV 2017.

A modular neural architecture for visual question answering. A seq2seq component predicts the sequence of neural modules (eg find() and compare()) based on the textual question, which are then dynamically combined and trained end-to-end. Achieves good results on three separate benchmarks that focus on reasoning about the image.

40. Automatic Annotation and Evaluation of Error Types for Grammatical Error Correction
Christopher Bryant, Mariano Felice, Ted Briscoe. Cambridge. ACL 2017.

A toolkit for automatically annotating error correction data with error types. It takes original and corrected sentences as input, aligns them to infer error spans, and uses rules to assign error types. They use the tool to perform fine-grained evaluation of CoNLL-14 shared task participants.
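
The span-inference step can be approximated with a plain token alignment. The sketch below uses Python's difflib as a stand-in for the toolkit's own alignment (a swapped-in technique, not the one used in the paper) and does not attempt the rule-based error typing.

```python
import difflib

def edit_spans(original, corrected):
    """Align two sentences token by token and return the differing spans.

    Each edit is (start, end, original_text, corrected_text), with start and
    end being token offsets into the original sentence.
    """
    src = original.split()
    tgt = corrected.split()
    matcher = difflib.SequenceMatcher(a=src, b=tgt, autojunk=False)
    edits = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            edits.append((i1, i2, " ".join(src[i1:i2]), " ".join(tgt[j1:j2])))
    return edits
```

For "I has a apple" and "I have an apple" this returns a single edit covering tokens 1 to 3, replacing "has a" with "have an". A more careful aligner would split that into two separate edits before assigning error types.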

41. Dynamic Evaluation of Neural Sequence Models
Ben Krause, Emmanuel Kahembwe, Iain Murray, Steve Renals. Edinburgh. ArXiv 2017.

Updating the parameters in a LSTM language model based on the observed sequence during testing. A slice of text is first processed and then used for a gradient descent update step. A regularisation term is also proposed which draws the parameters back towards the original model.

42. Unsupervised Machine Translation Using Monolingual Corpora Only
Guillaume Lample, Ludovic Denoyer, Marc’Aurelio Ranzato. Facebook, Sorbonne. ArXiv 2017.

The model learns to translate using a seq2seq model, an autoencoder objective, and an adversarial objective for language identification.
The system is trained to correct noisy versions of its own output and iteratively improves performance.
Does not require parallel corpora, but relies on a separate method for inducing a parallel dictionary that bootstraps the translation.

43. Assessing the Ability of LSTMs to Learn Syntax-Sensitive Dependencies
Tal Linzen, Emmanuel Dupoux, Yoav Goldberg. ENS, Bar Ilan. TACL 2017.

Investigation of how well LSTMs capture long-distance dependencies. The task is to predict verb agreement (singular or plural) when the subject noun is separated by different numbers of distractors. They find that an LSTM trained explicitly for this task manages to handle even most of the difficult cases, but a regular language model is more prone to being misled by the distractors.

44. Multiple Instance Learning Networks for Fine-Grained Sentiment Analysis
Stefanos Angelidis, Mirella Lapata. Edinburgh. ArXiv 2017.

A model for document sentiment classification which can also return sentence-level sentiment predictions. They construct sentence-level representations using a convnet, use this to predict a sentence-level probability distribution over possible sentiment labels, and then combine these over all sentences either with a fixed weight vector or using an attention mechanism. They release a new dataset of 200 documents annotated on the level of sentences and discourse units.
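
The attention-based combination of sentence-level predictions can be sketched as a weighted average. Here the attention scores are passed in as plain numbers, whereas in the model they would be computed from the sentence representations; the function names are my own.

```python
import math

def softmax(scores):
    """Turn arbitrary scores into weights that sum to 1."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def document_distribution(sentence_dists, attention_scores):
    """Combine per-sentence label distributions into one document-level distribution."""
    weights = softmax(attention_scores)
    n_labels = len(sentence_dists[0])
    return [
        sum(w * dist[k] for w, dist in zip(weights, sentence_dists))
        for k in range(n_labels)
    ]
```

Because the document prediction is a convex combination of sentence predictions, the sentence-level distributions remain interpretable on their own, which is what allows the model to return sentence-level sentiment as a by-product.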

45. Learning how to Active Learn: A Deep Reinforcement Learning Approach
Meng Fang, Yuan Li, Trevor Cohn. Melbourne. EMNLP 2017.

Active learning (choosing which examples to annotate for training) is proposed as a reinforcement learning problem. The Q-learning network predicts for each sentence whether it should be annotated, and is trained based on the performance improvement from the main task. Evaluation is done on NER, with experiments on transferring the trained Q-learning function to other languages.

46. On the State of the Art of Evaluation in Neural Language Models
Gábor Melis, Chris Dyer, Phil Blunsom. DeepMind, Oxford. ArXiv 2017.

Comparison of three recurrent architectures for language modelling: LSTMs, Recurrent Highway Networks and the NAS architecture. Each model goes through a substantial hyperparameter search, under the constraint that the total number of parameters is kept constant. They conclude that basic LSTMs still outperform other architectures and achieve state-of-the-art perplexities on two datasets.

47. Dynamic Routing Between Capsules
Sara Sabour, Nicholas Frosst, Geoffrey E Hinton. Google Brain. NIPS 2017.

An attention-based architecture for combining information from different convolutional layers. The attention values are calculated using an iterative process, making use of a custom squashing function. The evaluations on MNIST show robustness to affine transformations.

48. Multilingual Part-of-Speech Tagging with Bidirectional Long Short-Term Memory Models and Auxiliary Loss
Barbara Plank, Anders Søgaard, Yoav Goldberg. Groningen, Copenhagen, Bar-Ilan. ACL 2016.

Doing POS tagging using a bidirectional LSTM with word- and character-based embeddings. They add an extra component to the loss function – predicting a frequency class for each word, together with its POS tag. Results show that overall performance remains similar, but there’s an improvement in tagging accuracy for low-frequency words.


49. Emergent Translation in Multi-Agent Communication
Jason Lee, Kyunghyun Cho, Jason Weston, Douwe Kiela. Facebook. ArXiv 2017.

Learning to translate using two monolingual image captioning datasets and pivoting through images. The model encodes an image and generates a caption in language A; this is then encoded into the same space as language B and the representation is optimised to be similar to the correct image. The model is trained end-to-end using Gumbel-softmax.

50. Efficient softmax approximation for GPUs
Edouard Grave, Armand Joulin, Moustapha Cissé, David Grangier, Hervé Jégou. Facebook. ICML 2017.

Modification of the 2-level hierarchical softmax for better efficiency. An equation of computational complexity is used to find the optimal number of words in each class. In addition, the most common words are considered on the same level as other classes.

51. Semi-supervised Multitask Learning for Sequence Labeling
Marek Rei. Cambridge. ACL 2017.

Incorporating an unsupervised language modeling objective to help train a bidirectional LSTM for sequence labeling. At the same time as training the tagger, the forward-facing LSTM is optimised to predict the next word and the backward-facing LSTM is optimised to predict the previous word. The model learns a better composition function and improves performance on NER, error detection, chunking and POS-tagging, without using additional data.

52. Grasping the Finer Point: A Supervised Similarity Network for Metaphor Detection
Marek Rei, Luana Bulat, Douwe Kiela, Ekaterina Shutova. Cambridge, Facebook. EMNLP 2017.

A specialised architecture for detecting metaphorical phrases. Uses a gating mechanism to condition one word based on the other, a neural version of weighted cosine similarity to make a prediction and hinge loss to optimise the model. Achieves high results on detecting metaphorical adjective-noun, verb-object and verb-subject phrases.

53. Neural Sequence-Labelling Models for Grammatical Error Correction
Helen Yannakoudakis, Marek Rei, Øistein E. Andersen, Zheng Yuan. Cambridge. EMNLP 2017.

Using error detection to improve error correction. A neural sequence labeling model is used to find correctness probabilities for every token, which are then used to rerank possible correction candidates. The process consistently improves the performance of different correction systems.

54. Artificial Error Generation with Machine Translation and Syntactic Patterns
Marek Rei, Mariano Felice, Zheng Yuan, Ted Briscoe. Cambridge. BEA 2017.

Investigating methods for generating artificial data in order to train better systems for detecting grammatical errors. The first approach uses regular machine translation, essentially translating from correct English to incorrect English. The second method uses local patterns with slots and POS tags to insert errors into new text.

55. Auxiliary Objectives for Neural Error Detection Models
Marek Rei, Helen Yannakoudakis. Cambridge. BEA 2017.

Investigating a range of auxiliary objectives for training a sequence labeling system for error detection. Automatically generated dependency relations and POS tags perform surprisingly well as gold labels for multi-task learning. Learning different objectives at the same time works better than doing them in sequence or switching.

56. An Error-Oriented Approach to Word Embedding Pre-Training
Youmna Farag, Marek Rei, Ted Briscoe. Cambridge. BEA 2017.

Introduces a process for pre-training word embeddings with an objective that optimises them to distinguish between grammatical and ungrammatical sequences. This is then extended to also distinguish between correct and incorrect versions of the same sentence. The embeddings are then used in a network for essay scoring, improving performance compared to previous methods.

57. Detecting Off-topic Responses to Visual Prompts
Marek Rei. Cambridge. BEA 2017.

A neural architecture for detecting off-topic written responses, with respect to visual prompts. The text is composed with an LSTM and then used to condition the image representation. The two representations are then compared to calculate a confidence score for the text being written in response to the prompt image.

ML/NLP Publications in 2017

It has been a very productive year for NLP and ML research. Both areas continued to grow, with conferences reaching record numbers of publications. In this post I will break these numbers down a bit more, by individual authors and organisations. The statistics cover the following venues: ACL, EMNLP, NAACL, EACL, COLING, TACL, CL, CoNLL, *Sem+SemEval, NIPS, ICML, ICLR. Compared to last year, I’ve now included ICLR which has grown very rapidly in the last two years and become a highly competitive conference.

The analysis is done automatically, by crawling publication information from the conference websites and ACL Anthology. Author names are usually listed in the proceedings and easily extractable, but the organisation names are trickier and need to be extracted straight from the PDFs. I’ve created a number of rules to map together alternative names and misspellings, but let me know if you notice any errors.


First, let’s look at different publication venues between 2012-2017. NIPS is clearly heading off the charts, with 677 publications this year. Most other venues are also growing rapidly, with 2017 being the biggest year ever for ICML, ICLR, EMNLP, EACL and CoNLL. In contrast, TACL and CL seem to be keeping a constant number of publications per year. NAACL and COLING were notably missing from 2017, but we can look forward to both of them in 2018.


The most prolific author of 2017 is Iryna Gurevych (TU Darmstadt) with 18 papers. Lawrence Carin (Duke University) has 16 publications, with an impressive 10 papers at NIPS. Following them closely are Yue Zhang (Singapore), Yoshua Bengio (Montreal), and Hinrich Schütze (Munich).

Looking at cumulative statistics from 2012-2017, Chris Dyer (DeepMind) is at the top with an impressive lead, followed by Iryna Gurevych (TU Darmstadt) and Noah A. Smith (Washington). Lawrence Carin (Duke), Zoubin Ghahramani (Cambridge) and Pradeep K. Ravikumar (CMU) are publishing mainly in the general ML venues, while the others are balanced between NLP and ML.

Separating the publications by year shows that Chris Dyer has scaled down the publication count to a more manageable level this year, and Iryna Gurevych is developing an impressive upward trajectory.

First Authors

Now let’s look at first authors, as these are usually the people implementing the code and running the experiments. Ivan Vulić (Cambridge), Ryan Cotterell (Johns Hopkins) and Zeyuan Allen-Zhu (Microsoft Research) have all produced an impressive 6 first-author publications in 2017. They are followed by Henning Wachsmuth (Weimar), Tsendsuren Munkhdalai (Microsoft Maluuba), Jiwei Li (Stanford) and Simon S. Du (CMU).


Looking at the publishing patterns of different organisations in 2017, Carnegie Mellon is leading the charge with 126 publications, followed by Google, Microsoft and Stanford. Universities that publish proportionally more in the general ML area compared to NLP include MIT, Columbia, Oxford, Harvard, Toronto, Princeton and Zürich. In contrast, universities and organisations that focus more on the NLP venues include Edinburgh, IBM, Peking, Washington, Johns Hopkins, Pennsylvania, CAS, Darmstadt, and Qatar.

Looking at the whole period between 2012-2017, CMU is again in the lead with Microsoft, Google and Stanford close behind.

Looking at the time series, it seems CMU, Stanford, MIT and Berkeley are on an upward trajectory in terms of publications. In contrast, the industry leaders Google, Microsoft and IBM have slightly scaled back their publication numbers.

Topic Clustering

Finally, I ran LDA on all the paper texts from authors that had 9 or more publications and visualised the results using t-SNE. In the middle is the topic of general machine learning, neural networks and adversarial learning. The top cluster covers reinforcement learning and different learning policies. The cluster on the left contains NLP applications, language modelling, parsing and machine translation. The cluster at the bottom covers information modelling and feature spaces.


That’s it for 2017. If you notice errors, let me know and I will continue updating this post.
Looking forward to all the exciting research coming in 2018!

Attending to characters in neural sequence labeling models

Word embeddings are great. They allow us to represent words as distributed vectors, such that semantically and functionally similar words have similar representations. Having similar vectors means these words also behave similarly in the model, which is what we want for good generalisation properties.

However, word embeddings have a couple of weaknesses:

  1. If a word doesn’t exist in the training data, we can’t have an embedding for it. Therefore, the best we can do is clump all unseen words together under a single OOV (out-of-vocabulary) token.
  2. If a word only occurs a couple of times, the word embedding likely has very poor quality. We simply don’t have enough information to learn how these words behave in different contexts.
  3. We can’t properly take advantage of character-level patterns. For example, there is no way to learn that all words ending with -ing are likely to be verbs. The best we can do is learn this for each word separately, but that doesn’t help when faced with new or rare words.

In this post I will look at different ways of extending word embeddings with character-level information, in the context of neural sequence labeling models. You can find more information in the COLING 2016 paper “Attending to characters in neural sequence labeling models”.

Sequence labeling

We’ll investigate word representations in order to improve on the task of sequence labeling. In a sequence labeling setting, a system gets a series of tokens as input and it needs to assign a label to every token. The correct label typically depends on both the context and the token itself. Quite a large number of NLP tasks can be formulated as sequence labeling, for example:

POS-tagging
DT  NN    VBD      NNS    IN      DT   DT  NN     CC  DT  NN   .
The pound extended losses against both the dollar and the euro .

Error detection
+ +    +  x       +   +      +   +    +    x      +
I like to playing the guitar and sing very louder .

Named entity recognition
PER _      _   _      _  ORG  ORG   _  TIME _
Jim bought 300 shares of Acme Corp. in 2006 .

Chunking
Service on   the  line is   expected to   resume by   noon today .

In each of these cases, the model needs to understand how a word is being used in a specific context, and could also take advantage of character-level patterns and morphology.
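To make the input and output format concrete, here is a minimal sketch that writes the named entity recognition example above as two parallel lists. The helper name `labelled_tokens` is only an illustrative assumption and is not taken from the released code.

```python
# One label per token, using the named entity recognition example from the post.
tokens = "Jim bought 300 shares of Acme Corp. in 2006 .".split()
labels = "PER _ _ _ _ ORG ORG _ TIME _".split()


def labelled_tokens(tokens, labels, ignore="_"):
    """Pair each token with its label and keep only the labelled ones.

    Hypothetical helper for illustration: a sequence labeler outputs exactly
    one label per input token, so the two lists must have the same length.
    """
    if len(tokens) != len(labels):
        raise ValueError("sequence labeling needs exactly one label per token")
    return [(tok, lab) for tok, lab in zip(tokens, labels) if lab != ignore]


print(labelled_tokens(tokens, labels))
```

The same representation works for the other tasks: only the label inventory changes.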

Basic neural sequence labeling

Our baseline model for sequence labeling is as follows. Each word is represented as a 300-dimensional word embedding. This is passed through a bidirectional LSTM with hidden layers of size 200. The representations from both directions are concatenated, in order to get a word representation that is conditioned on the whole sentence. Next, we pass it through a 50-dimensional hidden layer and then an output layer, which can be a softmax or a CRF.

This configuration is based on a combination of my previous work on error detection (Rei and Yannakoudakis, 2016), and the models from Irsoy and Cardie (2014) and Lample et al. (2016).

Concatenating character-based word representations

Now let’s look at an architecture that builds word representations from individual characters. We process each word separately and map characters to character embeddings. Next, these are passed through a bidirectional LSTM and the last states from either direction are concatenated. The resulting vector is passed through another feedforward layer, in order to map it to a suitable space and change the vector size as needed. We then have a word representation m, built from individual characters.


We still have a normal word embedding x for each word, and in order to get the best of both worlds, we can combine these two representations. Following Lample et al. (2016), one method is simply concatenating the character-based representation with the word embedding.

\widetilde{x} = [x; m]

The resulting vector can then be used in the word-level sequence labeling model, instead of the regular word embedding. The whole network is connected together, so that the character-based component is also optimised during training.

Attending to character-based representations

Concatenating the two representations works, but we can do even better. We start off the same – character embeddings are passed through a bidirectional LSTM to build a word representation m. However, instead of concatenating this vector with the word embedding, we now combine them using dynamically predicted weights.


A vector of weights z is predicted by the model, as a function of x and m. In this case, we use a two-layer feedforward component, with tanh activation in the first layer and sigmoid on the second layer.

z = \sigma(W^{(3)}_z \tanh(W^{(1)}_{z} x + W^{(2)}_{z} m))

Then, we combine x and m as a weighted sum, using z as the weights:

\widetilde{x} = z\cdot x + (1-z) \cdot m

This operation essentially looks at both word representations and decides, for each feature, whether it wants to take the value from the word embedding or from the character-based representation. Values close to 1 in z indicate higher weight for the word embedding, and values close to 0 assign more importance to the character-based vector.
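As a rough illustration of the gating step, here is a minimal pure-Python sketch of the two equations above. The helper names (`matvec`, `gate`, `combine`) and the toy vectors and weights are assumptions made for this example; the actual experiments were implemented in Theano with learned weight matrices.

```python
import math


def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * a for w, a in zip(row, v)) for row in W]


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def gate(x, m, W1, W2, W3):
    """z = sigmoid(W3 tanh(W1 x + W2 m)), one weight per feature."""
    hidden = [math.tanh(a + b) for a, b in zip(matvec(W1, x), matvec(W2, m))]
    return [sigmoid(a) for a in matvec(W3, hidden)]


def combine(x, m, z):
    """x_tilde = z * x + (1 - z) * m, applied element by element."""
    return [z_i * x_i + (1.0 - z_i) * m_i for x_i, m_i, z_i in zip(x, m, z)]


# Toy 2-dimensional example (illustrative values only).
x = [1.0, 0.0]  # word embedding
m = [0.0, 1.0]  # character-based representation
zeros = [[0.0, 0.0], [0.0, 0.0]]

# With all-zero weights every gate is sigmoid(0) = 0.5,
# so the result is the plain average of x and m.
z = gate(x, m, zeros, zeros, zeros)
print(combine(x, m, z))

# A gate of 1 copies the feature from x, a gate of 0 copies it from m.
print(combine(x, m, [1.0, 0.0]))
```

Note that the combined vector has the same length as x and m, which is why both representations need to have the same dimensionality in this setting.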

This combination requires that the two vectors are aligned – each feature position in the character-based representation needs to capture the same properties as that position in the word embedding. In order to encourage this property, we actively optimise for these vectors to be similar, by maximising their cosine similarity:

\widetilde{E} = E + \sum_{t=1}^{T} g_t (1 - \cos(m^{(t)}, x_t))

g_t = \begin{cases} 0, & \text{if}\ w_t = OOV \\ 1, & \text{otherwise} \end{cases}

E is the main sequence labeling loss function that we minimise during training, T is the length of the sequence or sentence. Many OOV words share the same representation, and we do not want to optimise for this, therefore we use a variable g that limits optimisation only to non-OOV words.
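The extra term added to E can be sketched in a few lines of plain Python. This is only an illustration of the formula, with hypothetical function names and toy vectors; it assumes non-zero vectors and does not show the gradient handling used in the real model.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def similarity_penalty(char_vectors, word_vectors, is_oov):
    """Sum of g_t * (1 - cos(m_t, x_t)) over the sentence.

    g_t is 0 for OOV words and 1 otherwise, so OOV positions
    contribute nothing to the penalty.
    """
    total = 0.0
    for m_t, x_t, oov in zip(char_vectors, word_vectors, is_oov):
        g_t = 0.0 if oov else 1.0
        total += g_t * (1.0 - cosine(m_t, x_t))
    return total


# Toy sentence with three words (illustrative values only).
char_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
word_vectors = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
is_oov = [False, False, True]

# Word 1: identical vectors, penalty 0. Word 2: orthogonal, penalty 1.
# Word 3: orthogonal but OOV, so it is masked out.
print(similarity_penalty(char_vectors, word_vectors, is_oov))
```

The full objective would then be the main labeling loss E plus this penalty.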

In this setting, the model essentially learns two alternative representations for each word – a regular word embedding and a character-based representation. The word embedding itself is kind of a universal memory – we assign 300 elements to a word, and the model is free to save any information into it, including approximations of character-level information. For frequent words, there is really no reason to think that the character-based representation can offer much additional benefit. However, using word embeddings to save information is very inefficient – each feature needs to be learned and saved for every word separately. Therefore, we hope to get two benefits from including characters into the model:

  1. Previously unseen (OOV) words and infrequent words with low-quality embeddings can get extra information from character features and morphemes.
  2. The character-based component can act as a highly-generalised model of typical character-level patterns, allowing the word embeddings to act as a memory for storing exceptions to these patterns for each specific word.

While we optimise for the cosine similarity of m and x to be high, we are essentially teaching the model to predict distributional properties based only on character-level patterns and morphology. However, while m is optimised to be similar to x, we implement it in such a way that x is not optimised to be similar to m (using disconnected_grad in Theano). Because word embeddings are more flexible, we want them to store exceptions as opposed to learning more general patterns.

The resulting combined word representation is again plugged directly into the sequence labeling model. All the components, including the attention component for dynamically calculating z, are optimised at the same time.


We evaluated the alternative architectures on 8 different datasets, covering 4 different tasks: NER, POS-tagging, error detection and chunking. See the paper for more detailed results, but here is a summary:

Dataset     Task             #labels  Measure   Word-based  Char concat  Char attn
CoNLL00     chunking         22       F1        91.23       92.35        92.67
CoNLL03     NER              8        F1        79.86       83.37        84.09
PTB-POS     POS-tagging      48       accuracy  96.42       97.22        97.27
FCEPUBLIC   error detection  2        F0.5      41.24       41.27        41.88
BC2GM       NER              3        F1*       84.21       87.75        87.99
CHEMDNER    NER              3        F1        79.74       83.56        84.53
JNLPBA      NER              11       F1        70.75       72.24        72.70
GENIA-POS   POS-tagging      42       accuracy  97.39       98.49        98.60

As can be seen, including a character-based component into the sequence labeling model helps on every task. In addition, using the attention-based architecture outperforms character concatenation on all benchmarks.

We also compared the number of parameters in each model. While both character-based models require more parameters compared to a basic architecture using only word embeddings, the attention-based architecture is actually more efficient compared to concatenation. If the vectors are simply concatenated, this increases the size of all the weight matrices in the subsequent LSTMs, whereas the attention framework combines them without increasing the vector length.

Miyamoto and Cho (2016) have independently also proposed a similar architecture, with some differences: 1) They focus on the task of language modeling, 2) they predict a scalar weight for combining the representations as opposed to making the decision separately for each element, 3) they do not condition the weights on the character-based representation, and 4) they do not have the component that optimises character-based representations to be similar to the word embeddings.

Since this model is aimed at learning morphological patterns, you might ask why we are not giving actual morphemes as input to the model, instead of starting from individual characters. The reason is that the definition of an informative morpheme is likely to change between tasks and datasets, and this allows the model to learn exactly what it finds most useful. The model for POS tagging can learn to detect specific suffixes, and the model for NER can focus more on capitalisation patterns.


Combining word embeddings with character-based representations makes neural models more powerful and allows us to have better representations for infrequent or unseen words. One option is to concatenate the two representations, treating them as separate sets of useful features. Alternatively, we can optimise them to be similar and combine them using a gating mechanism, essentially allowing the model to choose whether it wants to take each feature from the word embedding or from the character-based representation. We evaluated on 8 different sequence labeling datasets and found that the latter option performed consistently better, even with fewer parameters.

See the paper for more details:

I have made the code for running these experiments publicly available on GitHub:

Also, the dataset for performing error detection as a sequence labeling task is now available online:

NLP and ML Publications – Looking Back at 2016

After my last post on analysing publication patterns I received quite a lot of feedback and many feature requests, so I decided to create an update once 2016 is over. It is now quite a bit bigger than before, and includes 11 different conferences and journals: ACL, EACL, NAACL, EMNLP, COLING, CL, TACL, CoNLL, *Sem+SemEval, NIPS, and ICML.

The information used in these graphs was collected through crawling the web. ACL Anthology was very useful, listing papers in a consistent format. However, information such as the organisation names in each paper still needed to be extracted directly from the pdfs, which means there are likely to be some errors. I’ve tried to create exceptions to catch different spelling variations and other anomalies, but if you notice mistakes in the graphs, do let me know.

This analysis shouldn’t be taken too seriously – after all, quality of research matters much more than quantity, and that is considerably more difficult to measure. However, my motivation is to provide a high-level overview of what is happening in the field, where the big players are publishing, and perhaps supply a bit of inspiration and motivation for the new year.

Let’s start by looking at publications from 2016 for the 25 most active organisations:

Carnegie Mellon managed to beat Google by just 1 paper. Microsoft and Stanford also managed to publish more than 80 papers in 2016. IBM, Cambridge, Washington and MIT all reached the 50 publication barrier. Google, Stanford, MIT and Princeton are distinctly focused on the ML aspect, publishing mostly in NIPS and ICML. In fact, Google papers accounted for nearly 10% of all NIPS papers. IBM, Peking, Edinburgh and Darmstadt however are distinctly focused on the NLP applications.

Next, let’s look at individual authors:

Chris Dyer continued his impressive publication record, by managing 24 papers in 2016. I’m curious as to why Chris doesn’t publish in NIPS or ICML, but he did have a paper in every single NLP conference (barring EACL, which didn’t take place in 2016). Following him are Yue Zhang (18), Hinrich Schütze (15), Timothy Baldwin (14), and Trevor Cohn (14). Ting Liu from the Harbin Institute of Technology stands out with 10 papers in COLING. Anders Søgaard and Yang Liu both managed to have 6 papers in ACL.

Now let’s look at the most prolific first authors from 2016:

Three researchers managed to publish 6 first-author papers: Ellie Pavlick (University of Pennsylvania), Gustavo Paetzold (University of Sheffield), Zeyuan Allen-Zhu (Princeton University and Institute for Advanced Study). Alan Akbik (IBM) published 5 papers, and 7 more researchers had 4 publications.
In addition, there were 42 people with 3 first-author papers, and 231 with 2 first-author papers.

Let’s also look at some time series. First, the total number of papers published at different conferences:

NIPS has been a very big conference for several years, but this year it seems to have exploded. Also, COLING was bigger than expected this year, even beating out ACL. This was the first year since 2012 when NAACL and COLING coincided.

Next, the number of papers per organisation in each year:

CMU is leading the race, having overtaken Microsoft in 2015. However, Google has almost caught up as well, after accelerating at a breakneck speed. Stanford also has a very respectable track record, followed by IBM and Cambridge.

Finally, let’s look at individual authors:

Chris Dyer gets a very distinct upward line on this graph. Other researchers who have managed to keep increasing their output over the past 5 years are: Preslav Nakov, Alessandro Moschitti, Yoshua Bengio, and Anders Søgaard.

Finally, I also decided to do some topic modeling on the publications. First, I extracted all the plain text from the papers, tokenised and lowercased it, and removed stopwords. Next, I passed it through LDA in order to discover 10 latent topics. I then used t-SNE to visualise top authors and organisations in a 2-dimensional graph, based on their latent topic similarity. Finally, I manually labelled each of the clusters with one word from the highest-ranked terms found by LDA. Here is the visualisation for top 50 authors:
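The preprocessing step before LDA can be sketched as follows. This is a minimal illustration, not the script used for the graphs: the regular expression, the function name `preprocess` and the tiny stopword list are all assumptions, and a real run would use a much fuller stopword list.

```python
import re

# Tiny illustrative stopword list (an assumption for this sketch).
STOPWORDS = {"the", "a", "an", "of", "and", "in", "to", "is", "we", "for"}


def preprocess(text):
    """Lowercase the text, split it into alphabetic tokens, drop stopwords."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [tok for tok in tokens if tok not in STOPWORDS]


print(preprocess("We propose a model for parsing."))
```

The resulting token lists, one per paper, are the kind of input a topic model such as LDA expects.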


I created the same graph for organisations, but didn’t attempt to label them with single words, since major universities tend to publish in many different subfields. I leave it to you to interpret the clusters:


That’s all for now. If you have any corrections or suggestions, let me know.
Have a great 2017 and I hope to see you in this list next year!

Edit: Fixed a bug with the author-year graph.
Edit: Fixed a bug with mapping different forms of university names.

Analysing NLP publication patterns

Recently, I got curious about finding out how much different institutions publish in my area. Does Google publish more than Microsoft? Which university has the strongest publication record in NLP? And are there any interesting trends that can be seen in the recent years? Quantity does not necessarily equal quality, but the number of publications is still a reasonable indicator of general activity in the field, how big the research group is, and how outward-facing the research projects are.

My approach was to crawl papers from the 6 biggest conferences that are relevant to my research: ACL, EACL, NAACL, EMNLP, NIPS, ICML. The first 4 focus on NLP applications regardless of methods, and the latter 2 on machine learning algorithms regardless of tasks. The time window was restricted to 2012-2016, as I’m more interested in current publications.

Luckily, all these conferences have nice webpages listing all the papers published there. ACL Anthology contains records for ACL, EACL, NAACL and EMNLP, NIPS has a separate webpage for papers, and ICML proceedings are on the JMLR website (except for ICML12 which are on the conference website). I wrote Python scripts that crawled all the papers from these conferences, extracting author names and organisations. While authors can be crawled directly from the websites, in order to find the organisation names I had to parse the pdfs into text and extract anything that looked like a university or company name in the first 30 lines of the paper. I wrote a bunch of manual patterns to map names to canonical versions (“UCL” to “University College London” and “Google Inc” to “Google”), although it is likely that I still missed some edge cases.
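The name-mapping step can be sketched roughly like this. It is a simplified illustration under stated assumptions, not the actual crawler: the alias table contains only the two examples mentioned above, and the function names and the "contains the word university" heuristic are hypothetical.

```python
import re

# Alias table with the two examples from the post; a real one would be much longer.
ALIASES = {
    "ucl": "University College London",
    "google inc": "Google",
}


def _key(name):
    """Normalise a name for lookup: drop punctuation, collapse spaces, lowercase."""
    no_punct = re.sub(r"[^\w\s]", "", name)
    return re.sub(r"\s+", " ", no_punct).strip().lower()


def canonicalise(name):
    """Map a known alias to its canonical form, otherwise return the name as-is."""
    return ALIASES.get(_key(name), name.strip())


def organisations_in_header(paper_text, max_lines=30):
    """Collect organisation-like strings from the first lines of a paper."""
    found = []
    for line in paper_text.splitlines()[:max_lines]:
        for candidate in re.split(r"[,;]", line):
            key = _key(candidate)
            # Hypothetical heuristic: known alias, or mentions "university".
            if key in ALIASES or "university" in key:
                name = canonicalise(candidate)
                if name not in found:
                    found.append(name)
    return found


header = "A Paper Title\nJane Doe, UCL\nJohn Roe, Google Inc.\n"
print(organisations_in_header(header))
```

Keeping the normalisation in one place (`_key`) means spelling variations only need to be added to the alias table once.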

Below is the graph of top 25 organisations and the conferences where they publish.

CMU comes out as the most prolific publisher with 305 papers. A close second is Microsoft with 302 publications, also leading in the industry category. I was somewhat surprised to find that Microsoft publishes so much, almost twice as many papers compared to Google, especially as Google seems to get much more publicity with their research. Stanford is also among the top 3 organisations that publish substantially more than others. Edinburgh and Cambridge represent the UK camp with 121 and 117 papers respectively.

When we look at the distribution of conferences, Princeton and UCL stand out as having very little NLP-specific research, with nearly all of their papers in ICML and NIPS. Stanford, Berkeley and MIT also seem to focus more on machine learning algorithms. In contrast, Edinburgh, Johns Hopkins and University of Maryland have most of their publications in NLP-related conferences. CMU, Microsoft and Columbia are the most balanced among the top publishers, with roughly 50:50 division between NLP and ML.

We can also plot the number of publications per year, focusing on the top 15 institutions.

Carnegie Mellon has a very good track record, but has only just recently overtaken Microsoft as the top publisher. Google, MIT, Berkeley, Cambridge and Princeton have also stepped up their publishing game, showing upward trends in the recent years. The sudden drop for 2016 is due to incomplete data – at the time of writing, ACL, EMNLP and NIPS papers for this year are not available yet.

Now let’s look at the same graphs but for individual authors.

Chris Dyer comes out on top with 50 papers. This result is even more impressive given that he started with just 2 papers in 2012, then rocketing to the top by quite a margin in 2015. Almost all of his papers are in NLP conferences, with only 1 paper each for NIPS and ICML. Noah Smith, Chris Manning and Dan Klein rank 2nd-4th, with more stable publishing records, but also focusing mainly on NLP conferences. In contrast, Zoubin Ghahramani, Yoshua Bengio and Lawrence Carin are focused mostly on machine learning algorithms.

There seems to be a clear separation between the two research communities, with researchers specialising in publishing either in NLP or ML. This seems somewhat unexpected, especially considering the widespread trend of publishing novel neural network architectures for NLP tasks. Both fields would probably benefit from slightly tighter integration in the future.

I hope this little analysis was interesting to fellow researchers. I’m happy to post an update some time in the future, to see how things have changed. In the meantime, let me know if you find any bugs in the statistics.

Update: As requested, I’ve also added the statistics for first authors with highest publication counts. Jiwei Li from Stanford towers above others with 14 publications. William Yang Wang (CMU), Young-Bum Kim (Microsoft), Manaal Faruqui (CMU), Elad Hazan (Princeton), and Eunho Yang (IBM) have all managed an impressive 9 first-author publications.

Update 2: Added a fix for Jordan Boyd-Graber who publishes under Jordan L. Boyd-Graber in NIPS.

Update 3: Added a fix for Hal Daumé III, mapping together different spellings.

Update 4: By showing top N authors on the graphs, some authors with equal numbers of publications were being excluded. I’ve adjusted the value N for each graph so this doesn’t happen.

Update 5: Added a fix for Pradeep K. Ravikumar who also publishes under Pradeep Ravikumar.

Update 6: Added fixes to capture name variations for INRIA.
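
The fixes in the updates above come down to two small steps: mapping spelling variants to one canonical name before counting, and not cutting off authors who are tied with the last entry on a graph. Here is a minimal sketch of both, in plain Python. The alias table, the helper names and the paper list are made up for illustration, and including ties automatically is one possible alternative to adjusting N by hand, not the code used for the graphs.

```python
from collections import Counter

# Hypothetical alias table: maps spelling variants to one canonical name.
ALIASES = {
    "Jordan L. Boyd-Graber": "Jordan Boyd-Graber",
    "Pradeep K. Ravikumar": "Pradeep Ravikumar",
}


def canonical(name):
    """Return the canonical spelling for a name, or the name itself."""
    return ALIASES.get(name, name)


def top_n_with_ties(counts, n):
    """Return the top n entries, plus anyone tied with the n-th entry."""
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    if len(ranked) <= n:
        return ranked
    cutoff = ranked[n - 1][1]
    return [(name, count) for name, count in ranked if count >= cutoff]


# Made-up author lists, one per paper, purely for illustration.
papers = [
    ["Jordan L. Boyd-Graber"],
    ["Jordan Boyd-Graber"],
    ["Pradeep K. Ravikumar"],
    ["Pradeep Ravikumar"],
    ["Author A"],
    ["Author A"],
    ["Author B"],
]
counts = Counter(canonical(name) for authors in papers for name in authors)
top = top_n_with_ties(counts, 2)
```

With this toy data, three authors are tied on 2 papers each, so asking for the top 2 returns all three of them instead of dropping one arbitrarily.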

Theano Tutorial

This is an introductory tutorial on using Theano, the Python library. I’m going to start from scratch and assume no previous knowledge of Theano. However, understanding how neural networks work will be useful when getting to the code examples towards the end.

The plan for the tutorial is as follows:

  1. Give a basic introduction to Theano and explain the important concepts.
  2. Go over the main operations that we have available in Theano.
  3. Look at working code examples.

I recently gave this tutorial as a talk at the University of Cambridge and it turned out to be way more popular than expected. In order to give more people access to the material, I’m now writing it up as a blog post.

I do not claim to know everything about Theano, and I constantly learn new things myself. If you find any errors or have suggestions on how to improve this tutorial, do let me know.

The code examples can be found in the Github repository:

1. What is Theano?


Theano is a Python library for efficiently handling mathematical expressions involving multi-dimensional arrays (also known as tensors). It is a common choice for implementing neural network models. Theano has been developed at the University of Montreal, in a group led by Yoshua Bengio, since 2008.

Some of the features include:

  • automatic differentiation – you only have to implement the forward (prediction) part of the model, and Theano will automatically figure out how to calculate the gradients at various points, allowing you to perform gradient descent for model training.
  • transparent use of a GPU – you can write the same code and run it either on CPU or GPU. More specifically, Theano will figure out which parts of the computation should be moved to the GPU.
  • speed and stability optimisations – Theano will internally reorganise and optimise your computations, in order to make them run faster and be more numerically stable. It will also try to compile some operations into C code, in order to speed up the computation.

Technically, Theano isn’t actually a machine learning library, as it doesn’t provide you with pre-built models that you can train on your dataset. Instead, it is a mathematical library that provides you with tools to build your own machine learning models. But if you are looking for machine learning toolkits, there are several good ones implemented on top of Theano:

2. Python refresher

Theano is a Python library, so let’s go over some important points in Python.

  • Python is an interpreted language, which makes it more platform independent but generally slower than C, for example.
  • Python uses dynamic typing. While each variable does have a specific type during execution, these are not explicitly stated in the code.
  • Python uses indentation for block delimiting. So where C or Java would use curly brackets to separate a block, Python uses whitespace. Here we define a function f to take parameter x and return 2*x:
    def f(x):
        return 2*x
  • We define a list in Python with square brackets:
    a = [1,2,3,4,5]
    a[1] == 2
  • We define a dictionary (key-value mapping) with curly brackets:
    b = {'key1': 1, 'key2':2}
    b['key2'] == 2
  • List comprehension is a neat shorthand in Python for constructing lists. Here we loop for 5 steps (values 0-4), and each time add i+1 to the list:
    c = [i+1 for i in range(5)]
    c[1] == 2

3. Using Theano

In order to use Theano, you will need to install the dependencies and install Theano itself. If you’re using Ubuntu (tested for 14.04), you might get away with just running these two commands:

sudo apt-get install python-numpy python-scipy python-dev python-pip python-nose g++ libopenblas-dev git
sudo pip install Theano

If that doesn’t work for you, take a look at the original Theano homepage, which contains instructions for various platforms:

To use Theano in your Python script, include it using:

import theano

4. Minimal Working Example

Here is the smallest example I could come up with, which uses Theano and actually does something:

import theano
import numpy

x = theano.tensor.fvector('x')
W = theano.shared(numpy.asarray([0.2, 0.7]), 'W')
y = (x * W).sum()

f = theano.function([x], y)

output = f([1.0, 1.0])
print(output)

So what’s happening here?

We first define a Theano variable x to be a vector of 32-bit floats, and give it name ‘x’:

x = theano.tensor.fvector('x')

Next, we create a Theano variable W, assign its value to be vector [0.2, 0.7], and name it ‘W’:

W = theano.shared(numpy.asarray([0.2, 0.7]), 'W')

We define y to be the sum of all elements in the element-wise multiplication of x and W:

y = (x * W).sum()

We define a Theano function f, which takes as input x and outputs y:

f = theano.function([x], y)

Then call this function, giving as the argument vector [1.0, 1.0], essentially setting the value of variable x:

output = f([1.0, 1.0])

The script prints out the summed product of [0.2, 0.7] and [1.0, 1.0], which is:

0.2*1.0 + 0.7*1.0 = 0.9

Don’t worry if the code doesn’t fully make sense. We’ll go over the important parts in more detail.

5. Symbolic graphs in Theano (!)

I’d say this section contains the most crucial part to understanding Theano.

When we are creating a model with Theano, we first define a symbolic graph of all variables and operations that need to be performed. And then we can apply this graph on specific inputs to get outputs.

For example, what do you think happens when this line of Theano code is executed in our script?

y = (x * W).sum()

The system takes x and W, multiplies them together and sums the values. Right?


Not quite. Instead, we create a Theano object y that knows its values can be calculated as the dot-product of x and W. But the required mathematical operations are not performed here. In fact, when this line was executed in our example code above, x didn’t even have a value yet.

By chaining up various operations, we are creating a graph of all the variables and functions that need to be used to reach the output values. This symbolic graph is also the reason why we can only use Theano-specific operations when defining our models. If we tried to integrate functions from some random Python library into our network, they would attempt to perform the calculations immediately, instead of returning a Theano variable as needed. Exceptions do exist – Theano overrides some basic Python operators to act as expected, and NumPy is quite well integrated with Theano.
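
To make the idea of "build the graph first, compute later" more concrete, here is a toy sketch in plain Python. It is not how Theano is implemented; the Node class and its evaluate method are names I made up for this illustration. The point is only that the line defining y creates a description of the computation, and no arithmetic happens until concrete values are supplied.

```python
class Node:
    """A toy symbolic expression: stores how to compute a value, not the value."""

    def __init__(self, op, inputs=(), name=None):
        self.op = op
        self.inputs = inputs
        self.name = name

    def __mul__(self, other):
        return Node("mul", (self, other))

    def sum(self):
        return Node("sum", (self,))

    def evaluate(self, values):
        """Walk the graph and compute a result from concrete input values."""
        if self.op == "input":
            return values[self.name]
        args = [node.evaluate(values) for node in self.inputs]
        if self.op == "mul":
            return [a * b for a, b in zip(args[0], args[1])]
        if self.op == "sum":
            return sum(args[0])
        raise ValueError("unknown op: " + self.op)


x = Node("input", name="x")
W = Node("input", name="W")

# Nothing is multiplied or summed here; we only build a graph.
y = (x * W).sum()

# Only now, with concrete values supplied, is the arithmetic performed.
result = y.evaluate({"x": [1.0, 1.0], "W": [0.2, 0.7]})
```

This also hints at why ordinary functions from other libraries don't fit into such a graph: they would try to compute a number straight away instead of returning another node.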

6. Variables

We can define variables which don’t have any values yet. Normally, these would be used for inputs to our network.

The variables have to be of a specific type though. For example, here we define variable x to be a vector of 32-bit floats, and give it name ‘x’:

x = theano.tensor.fvector('x')

The names are generally useful for debugging and informative error messages. Theano won’t have access to your Python variable names, so you have to assign explicit Theano names for each variable if you want them to be referred to as something more useful than just “a tensor”.

There are a number of different variable types available, just have a look at the list here. Some of the more popular ones include:

Constructor  dtype    ndim
fvector      float32  1
ivector      int32    1
fscalar      float32  0
fmatrix      float32  2
ftensor3     float32  3
dtensor3     float64  3

You can also define a generic vector (or tensor) and set the type with an argument:

x = theano.tensor.vector('x', dtype='float32')

If you don’t set the dtype, you will create vectors of type config.floatX. This will become relevant in the section about GPUs.

7. Shared variables

We can also define shared variables, which are shared between different functions and different function calls. Normally, these would be used for weights in our neural network. Theano will automatically try to move shared variables to the GPU, provided one is available, in order to speed up computation.

Here we define a shared variable and set its value to [0.2, 0.7].

W = theano.shared(numpy.asarray([0.2, 0.7]), 'W')

The values in shared variables can be accessed and modified outside of our Theano functions using these commands:

W.get_value()
W.set_value([0.1, 0.9])

8. Functions

Theano functions are basically hooks for interacting with the symbolic graph. Commonly, we use them for passing input into our network and collecting the resulting output.

Here we define a Theano function f that takes x as input and returns y as output:

f = theano.function([x], y)

The first parameter is the list of input variables, and the second parameter is the list of output variables. If there’s only one output variable (like now), we don’t need to wrap it in a list.

When we construct a function, Theano takes over and performs some of its own magic. It builds the computational graph and optimises it as much as possible. It restructures mathematical operations to make them faster and more stable, compiles some parts to C, moves some tensors to the GPU, etc.

Theano compilation can be controlled by setting the value of mode in the environment variable THEANO_FLAGS:

  • FAST_COMPILE – Fast to compile, slow to run. Python implementations only, minimal graph optimisation.
  • FAST_RUN – Slow to compile, fast to run. C implementations where available, full range of optimisations.

9. Minimal Training Example

Here’s a minimal script for actually training something in Theano. We will be training the weights in W using gradient descent, so that the result from the model would be 20 instead of the original 0.9.

import theano
import numpy

x = theano.tensor.fvector('x')
target = theano.tensor.fscalar('target')

W = theano.shared(numpy.asarray([0.2, 0.7]), 'W')
y = (x * W).sum()

cost = theano.tensor.sqr(target - y)
gradients = theano.tensor.grad(cost, [W])
W_updated = W - (0.1 * gradients[0])
updates = [(W, W_updated)]

f = theano.function([x, target], y, updates=updates)

for i in range(10):
    output = f([1.0, 1.0], 20.0)
    print(output)

We create a second input variable called target, which will act as the target value we use for training:

target = theano.tensor.fscalar('target')

In order to train the model, we need a cost function. Here we use a simple squared distance from the target:

cost = theano.tensor.sqr(target - y)

Next, we want to calculate the partial gradients for the parameters that will be updated, with respect to the cost function. Luckily, Theano will do that for us. We simply call the grad function, pass in the real-valued cost and a list of all the variables we want gradients for, and it will return a list of those gradients:

gradients = theano.tensor.grad(cost, [W])
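
If you want to convince yourself of what the gradient should be for this tiny model, it can be worked out by hand: with cost = (target - y)^2 and y = sum(x * W), the chain rule gives d cost / d W_i = -2 * (target - y) * x_i. The sketch below, in plain Python without Theano, compares that formula against a finite-difference estimate. The function names are my own and this is only a sanity check, not what Theano does internally.

```python
def model_output(W, x):
    """y = sum of the element-wise product of x and W."""
    return sum(w_i * x_i for w_i, x_i in zip(W, x))


def cost(W, x, target):
    """Squared distance between the target and the model output."""
    return (target - model_output(W, x)) ** 2


def analytic_gradient(W, x, target):
    """d cost / d W_i = -2 * (target - y) * x_i, by the chain rule."""
    error = target - model_output(W, x)
    return [-2.0 * error * x_i for x_i in x]


def numeric_gradient(W, x, target, eps=1e-6):
    """Central finite differences, used only as a sanity check."""
    grads = []
    for i in range(len(W)):
        plus = list(W)
        minus = list(W)
        plus[i] += eps
        minus[i] -= eps
        grads.append((cost(plus, x, target) - cost(minus, x, target)) / (2 * eps))
    return grads


W = [0.2, 0.7]
x = [1.0, 1.0]
target = 20.0

analytic = analytic_gradient(W, x, target)
numeric = numeric_gradient(W, x, target)
```

For the starting weights, both versions give a gradient of about -38.2 for each element of W, so subtracting it moves the weights upwards, towards the target.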

Now let’s define a symbolic variable for what the updated version of the parameters will look like. Using gradient descent, the update rule is to subtract the gradient, multiplied by the learning rate:

W_updated = W - (0.1 * gradients[0])

And next we create a list of updates. More specifically, a list of tuples where the first element is the variable we want to update, and the second element is a variable containing the values that we want the first variable to contain after the update. This is just a syntax that Theano requires.

updates = [(W, W_updated)]

We have to define a Theano function again, with a couple of changes:

f = theano.function([x, target], y, updates=updates)

It now takes two input arguments – one for the input vector, and another for the target value used for training. The list of updates is attached to the function as well. Every time this function is called, we pass in values for x and target, get back the value for y as output, and Theano performs all the updates in the update list.

In order to train the parameters, we repeatedly call this function (10 times in this case). Normally, we’d pass in different examples from our training data, but for this example we use the same x=[1.0, 1.0] and target=20 each time:

for i in range(10):
    output = f([1.0, 1.0], 20.0)
    print(output)

When the script is executed, the output looks like this:


The first time the function is called, the output value is still 0.9 (like in the previous example), because the updates have not been applied yet. But with each consecutive step, the output value becomes closer and closer to the desired target 20.
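
The same training loop can be re-derived without Theano, which is a handy way to check what values to expect. The sketch below is plain Python with the gradient written out by hand; the train function is a name I made up, and it mirrors the script only in that it reports y before applying each update. Exact digits printed by Theano may differ slightly because of floating-point types.

```python
def train(W, x, target, learning_rate=0.1, steps=10):
    """Repeat the update W <- W - learning_rate * gradient of (target - y)^2."""
    W = list(W)
    outputs = []
    for _ in range(steps):
        y = sum(w_i * x_i for w_i, x_i in zip(W, x))
        outputs.append(y)  # like the Theano function, report y before updating
        error = target - y
        gradient = [-2.0 * error * x_i for x_i in x]
        W = [w_i - learning_rate * g_i for w_i, g_i in zip(W, gradient)]
    return outputs, W


outputs, final_W = train([0.2, 0.7], [1.0, 1.0], 20.0)
for value in outputs:
    print(round(value, 4))
```

With these settings, each update shrinks the distance to the target by a factor of 0.6, so the outputs go 0.9, 8.54, 13.124 and so on, reaching roughly 19.81 on the tenth call.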

10. Useful operations

This covers the basic logic behind building models with Theano. The example was very simple, but we are free to define increasingly complicated networks, as long as we use Theano-specific functions. Now let’s look at some of these building blocks that we have available.

Evaluate the value of a Theano variable

The eval() function forces the Theano variable to calculate and return its actual (numerical) value. If we try to just print the variable a, we only print its name. But if we use eval(), we get the actual square matrix that it is initialised to.

> a = theano.shared(numpy.asarray([[1.0,2.0],[3.0,4.0]]), 'a')
> a
a
> a.eval()
array([[1., 2.],
       [3., 4.]])

This eval() function isn’t really used for building models, but it can be useful for debugging and learning how Theano works. In the examples below, I will be using the matrix a and the eval() function to print the value of each variable and demonstrate how different operations work.

Basic element-wise operations: + - * /

c = ((a + a) / 4.0)

array([[ 0.5, 1. ],
       [ 1.5, 2. ]])

Dot product

c = theano.tensor.dot(a, a)

array([[ 7., 10.],
       [15., 22.]])

Activation functions

c = theano.tensor.nnet.sigmoid(a)
c = theano.tensor.tanh(a)

array([[ 0.76159416,  0.96402758],
       [ 0.99505475,  0.9993293 ]])

Softmax (row-wise)

c = theano.tensor.nnet.softmax(a)

array([[ 0.26894142,  0.73105858],
       [ 0.26894142,  0.73105858]])
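
For reference, here is what the row-wise softmax computes, written as a short plain-Python sketch (softmax_rows is my own helper name, not a Theano function). Each row is exponentiated and divided by its own sum, which is why both rows of a give the same result: softmax only depends on the differences between the values in a row.

```python
import math


def softmax_rows(matrix):
    """Apply softmax independently to every row of a list-of-lists matrix."""
    result = []
    for row in matrix:
        # Subtracting the row maximum does not change the result but avoids
        # overflow in exp() for large inputs.
        highest = max(row)
        exps = [math.exp(value - highest) for value in row]
        total = sum(exps)
        result.append([e / total for e in exps])
    return result


a = [[1.0, 2.0], [3.0, 4.0]]
c = softmax_rows(a)
```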


Sum

c = a.sum()
c = a.sum(axis=1)

array([ 3.,  7.])


Max

c = a.max()
c = a.max(axis=1)

array([ 2.,  4.])


Argmax

c = theano.tensor.argmax(a)
c = theano.tensor.argmax(a, axis=1)

array([1, 1])


Reshape

We sometimes need to change the dimensions of a tensor, and reshape() allows us to do that. It takes as input a tuple containing the new shape and returns a new tensor with that shape. In the first example below, we reshape a square matrix into a 1×4 matrix. In the second example, we use -1, which means “as big as the dimension needs to be”.

a = theano.shared(numpy.asarray([[1,2],[3,4]]), 'a')
c = a.reshape((1,4))
array([[1, 2, 3, 4]])

c = a.reshape((-1,))
array([1, 2, 3, 4])

Zeros-like, ones-like

These functions create new tensors with the same shape but all values set to zero or one.

c = theano.tensor.zeros_like(a)
array([[0, 0],
       [0, 0]])

Reorder the tensor dimensions

Sometimes we need to reorder the dimensions in a tensor. In the examples below, the dimensions in a two-dimensional matrix are first swapped. Then, ‘x’ is used to create a brand new dimension.

array([[1, 2],
       [3, 4]])

c = a.dimshuffle((1,0))
array([[1, 3],
       [2, 4]])

c = a.dimshuffle(('x',0,1))
array([[[1, 2],
        [3, 4]]])


Indexing

Using Python indexing tricks can make life so much easier. In the example below, we make a separate list b containing row indices, and use it to construct a new matrix which contains exactly the rows we want from the original matrix. This can be useful when dealing with word embeddings – we can put word ids into a list and use this to retrieve exactly the correct sequence of embeddings from the whole embedding matrix.

a = theano.shared(numpy.asarray([[1.0,2.0],[3.0,4.0]]), 'a')
array([[1., 2.],
       [3., 4.]])

b = [1,1,0]
c = a[b]
array([[ 3.,  4.],
       [ 3.,  4.],
       [ 1.,  2.]])

For assignment, we can’t do this:

a[0] = [0.0, 0.0]

But instead, we can use set_subtensor(), which takes as arguments the selection of the original matrix that we want to reassign, and the value we want to assign it to. It returns a new tensor that has the corresponding values modified.

c = theano.tensor.set_subtensor(a[0],[0.0, 0.0])
array([[ 0.,  0.],
       [ 3.,  4.]])

11. Classifier Code Example

At this point, it’s time to move on to some more realistic examples.

Take a look at the code for a very basic classifier, which tries to train a small network on a tiny (but real) dataset. I won’t walk you through it line-by-line any more; you’ve learned all the necessary parts by now and there are comments in the code as well.

The task is to predict whether the GDP per capita for a country is more than the average GDP, based on the following features:

  • Population density (per square km)
  • Population growth rate (%)
  • Urban population (%)
  • Life expectancy at birth (years)
  • Fertility rate (births per woman)
  • Infant mortality (deaths per 1000 births)
  • Enrolment in tertiary education (%)
  • Unemployment (%)
  • Estimated control of corruption (score)
  • Estimated government effectiveness (score)
  • Internet users (per 100 people)

The data/ directory contains the files for training (121 countries) and testing (40 countries). Each row represents one country, the first column is the label, followed by the features. The feature values have been normalised, by subtracting the mean and dividing by the standard deviation. The label is 1 if the GDP is more than average, and 0 otherwise.
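
The normalisation step mentioned above is simple enough to sketch in a few lines of plain Python. The values below are made up, and I am assuming the population standard deviation here; the exact variant used to prepare the data files is not important for the idea.

```python
import math


def normalise_column(values):
    """Subtract the mean and divide by the standard deviation."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    std = math.sqrt(variance)
    return [(v - mean) / std for v in values]


# Made-up life expectancy values, only to show the transformation.
life_expectancy = [52.0, 61.5, 70.0, 78.5, 83.0]
normalised = normalise_column(life_expectancy)
```

After this transformation each feature has mean 0 and standard deviation 1, so features measured in very different units (years, percentages, scores) end up on a comparable scale.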

Once you clone the github repository (or just download the data files), you can run the script with:

python data/countries-classify-gdp-normalised.train.txt data/countries-classify-gdp-normalised.test.txt

The script will print information about 10 training epochs and the result on the test set:

Epoch: 0, Training_cost: 28.4304042768, Training_accuracy: 0.578512396694
Epoch: 1, Training_cost: 24.5186290354, Training_accuracy: 0.619834710744
Epoch: 2, Training_cost: 22.1283727037, Training_accuracy: 0.619834710744
Epoch: 3, Training_cost: 20.7941253329, Training_accuracy: 0.619834710744
Epoch: 4, Training_cost: 19.9641569475, Training_accuracy: 0.619834710744
Epoch: 5, Training_cost: 19.3749411377, Training_accuracy: 0.619834710744
Epoch: 6, Training_cost: 18.8899216914, Training_accuracy: 0.619834710744
Epoch: 7, Training_cost: 18.4006371608, Training_accuracy: 0.677685950413
Epoch: 8, Training_cost: 17.7210185975, Training_accuracy: 0.793388429752
Epoch: 9, Training_cost: 16.315597037, Training_accuracy: 0.876033057851
Test_cost: 5.01800578051, Test_accuracy: 0.925

12. Recurrent functions with scan

One more important operation to cover is scan, which can be used to create various recurrent functions: RNN, GRU, LSTM, etc.

Here is sample code for using scan to define a simple RNN over word vectors in the input_vectors matrix:

def rnn_step(x, previous_hidden_vector, W_input, W_recurrent):
    hidden_vector = theano.tensor.dot(x, W_input) + theano.tensor.dot(previous_hidden_vector, W_recurrent)
    hidden_vector = theano.tensor.nnet.sigmoid(hidden_vector)
    return hidden_vector

W_input = self.create_parameter_matrix('W_input', (word_embedding_size, recurrent_size))
W_recurrent = self.create_parameter_matrix('W_recurrent', (recurrent_size, recurrent_size))
initial_hidden_vector = theano.tensor.alloc(numpy.array(0, dtype=floatX), recurrent_size)

hidden_vector, _ = theano.scan(
    fn = rnn_step,
    sequences = input_vectors,
    outputs_info = initial_hidden_vector,
    non_sequences = [W_input, W_recurrent]
)

hidden_vector = hidden_vector[-1]

The scan function is called on line 10 and it takes 4 important arguments:

  • fn: The function that is called at every step of the iteration.
  • sequences: The variables that we want to iterate over. If this is a matrix, we will be iterating over each row of that matrix.
  • outputs_info: The values that we use as the previous recurrent values for the very first step. Usually these are just set to 0.
  • non_sequences: Any additional variables that we want to pass into the function (fn) but don’t want to iterate over.

We’ve defined the helper function rnn_step on line 1, which gets called on each row of our input matrix. The scan function will be calling this rnn_step function internally, so we need to accept any arguments in the same order as Theano passes them. This is just something you need to know when dealing with scan. The order is as follows:

  1. First, the current items from the variables that we are iterating over. If we are iterating over a matrix, the current row is passed to the function.
  2. Next, anything that was output from the function at the previous time step. This is what we use to build recursive and recurrent representations. At the very first time step, the values will be those from outputs_info instead.
  3. Finally, anything we specified in non_sequences.
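
The argument order above is easier to remember once you see what scan is doing conceptually. Below is a rough emulation in plain Python, for a single sequence and a single recurrent output. The simple_scan and dot helpers, as well as the tiny weight matrices, are made up for this illustration; the real theano.scan builds a symbolic loop and supports many more options.

```python
import math


def simple_scan(fn, sequences, outputs_info, non_sequences):
    """Call fn on each item of sequences, feeding back the previous output."""
    previous = outputs_info
    outputs = []
    for item in sequences:
        # Argument order: current item, previous output, then non_sequences.
        previous = fn(item, previous, *non_sequences)
        outputs.append(previous)
    return outputs


def dot(vector, matrix):
    """Multiply a row vector by a matrix stored as a list of rows."""
    return [sum(v * row[j] for v, row in zip(vector, matrix))
            for j in range(len(matrix[0]))]


def rnn_step(x, previous_hidden_vector, W_input, W_recurrent):
    """One step of a simple sigmoid RNN."""
    summed = [a + b for a, b in zip(dot(x, W_input),
                                    dot(previous_hidden_vector, W_recurrent))]
    return [1.0 / (1.0 + math.exp(-value)) for value in summed]


# Tiny made-up weights: 2-dimensional inputs, 2-dimensional hidden state.
W_input = [[0.5, -0.5], [0.25, 0.75]]
W_recurrent = [[0.1, 0.0], [0.0, 0.1]]
input_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
initial_hidden_vector = [0.0, 0.0]

hidden_vectors = simple_scan(rnn_step, input_vectors, initial_hidden_vector,
                             [W_input, W_recurrent])
last_hidden_vector = hidden_vectors[-1]
```

Just like with the real scan, the result holds one hidden vector per input row, and the last one is retrieved by indexing with -1.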

What comes out from the scan function contains the hidden states (i.e. the rnn_step outputs) at each step. Not just the last step, but all of them. So if you only want the last step, you need to explicitly retrieve it by indexing with -1 (the last element). Theano is actually smart enough to figure out that you’re only using the last result, and will optimise to discard all the intermediate ones.

In order to construct the weight matrices, I’m using a helper function (self.create_parameter_matrix, definition not shown here) which takes as input the variable name and the shape. This means I don’t need to define the weight initialisation part again each time.

13. RNN Classifier Code Example

Time to look at some more code, this time using recurrent functions and scan. The script is available at the Github repository. In this example, I’m using Gated Recurrent Units (GRU) from “Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation” (Cho et al, 2014), which are essentially a simpler version of LSTMs.

The task is to classify sentences into 5 classes, based on their fine-grained sentiment (very negative, slightly negative, neutral, slightly positive, very positive). We use the dataset published in “Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank” (Socher et al., 2013).

Start by downloading the dataset from (the main zip file) and unpack it somewhere. Then, create training and test splits in the format that is more suitable for us, using the provided script in the repository:

python 1 full /path/to/sentiment/dataset/ > data/sentiment.train.txt
python 2 full /path/to/sentiment/dataset/ > data/sentiment.test.txt

Now we can run the classifier with:

python data/sentiment.train.txt data/sentiment.test.txt

The script will train for 3 passes over the training data, and will then print performance on the test data.

Epoch: 0 Cost: 25937.7372292 Accuracy: 0.285814606742
Epoch: 1 Cost: 21656.820174 Accuracy: 0.350655430712
Epoch: 2 Cost: 18020.619533 Accuracy: 0.429073033708
Test_cost: 4784.25137484 Test_accuracy: 0.388235294118

The accuracy on the test set is about 38%, which isn’t a great result. But it is quite a difficult task – the current state-of-the-art system (Tai et al., 2015) achieves 50.9% accuracy, using a large amount of additional phrase-level annotations, and a much bigger network based on LSTMs and parse trees. As there are 5 classes to choose from, a random system would get 20% accuracy.

14. Running on a GPU

Theano is smart enough to move some parts of the processing to the GPU, as long as CUDA is installed and a graphics card is made available. To install CUDA, follow instructions on one of these links:

Then, when running your Python script, you need to point Theano to the CUDA installation. I do this by setting the environment variables in the command line:

LD_LIBRARY_PATH=/usr/lib:/usr/local/cuda-7.5/lib64 THEANO_FLAGS='cuda.root=/usr/local/cuda-7.5,device=gpu,floatX=float32' python

This command is for CUDA-7.5 in my system. You’ll need to make sure that the paths match the CUDA installation paths in your machine. If it works and Theano is using a GPU, the first line that gets printed will explicitly say so. Something like this:

Using gpu device 0: GeForce GTX 780

If you don’t get something similar, it probably means Theano is not properly hooked up to use the GPU.

At the time of writing, Theano only supports 32-bit variables on the GPU, and this is where the floatX=float32 setting comes in. It just allows you to set the data type during the execution of the script, without writing it into your code. For example, you can define your vectors like this:

x = theano.tensor.vector('x', dtype=theano.config.floatX)

And now you can set floatX to be float32 when running the script on a GPU and float64 when running on your CPU.

Finally, if your machine has multiple GPUs, you can control which one is used for the script by setting device=gpu0, device=gpu1, etc. Based on personal experience, running multiple Theano jobs on the same GPU does not give any advantage, so it’s best to send them to different ones when possible.

15. Drawing the computation graph

Theano provides a command for printing a variable or a function, along with all the required computation, as an image:

f = theano.function([x], y)
theano.printing.pydotprint(f, outfile="f.png", var_with_name_simple=True)

When dealing with very simple models, this can give a nice graphical representation. For example, here is a model from our minimal working example:

Printed Theano function. Figure for the Theano tutorial.

However, when the models get more and more complicated, the images also tend to get less informative:

Printed graph of a much larger function. Figure for the Theano tutorial.


16. Profiling

Finally, Theano also provides a useful tool for analysing bottlenecks in your code. Just set profile=True in THEANO_FLAGS, and it will print information about how much time is spent on different operations in your code.

THEANO_FLAGS='profile=True' python

Example of profiling output from Theano. Figure for the Theano tutorial.

17. References

This concludes the Theano tutorial. If you haven’t yet had enough, take a look at the following links that I used for inspiration:
Official Theano homepage and documentation
Official Theano tutorial
A Simple Tutorial on Theano by Jiang Guo
Code samples for learning Theano by Alec Radford


Online Representation Learning in Recurrent Neural Language Models

In a basic neural language model, we optimise a fixed set of parameters based on a training corpus, and predictions on an unseen test set are a direct function of these parameters. What if, instead of using a static model, we constantly measured the types of errors the model is making and adjusted the parameters accordingly? It would potentially be closer to how humans operate, constantly making small adjustments in their decisions based on feedback.

The necessary information is already available – language models use the previous word in the sequence as context, which means they know the correct answer for the previous time step (or at least need to assume they know). We can use this to calculate error derivatives at each time step and update parameters even during testing. This sounds like it would require loads of extra computation at test time, but by updating only a small part of the model we can actually get better results with faster execution and fewer parameters.

This post is a summary of my EMNLP 2015 paper “Online Representation Learning in Recurrent Neural Language Models”.


First a short description of the RNN language model that I use as a baseline. It follows the implementation by Mikolov et al. (2011) in the RNNLM Toolkit.


The previous word goes into the network as a 1-hot vector which is then multiplied with a weight matrix, giving us a corresponding word embedding. This, together with the previous hidden state, acts as input to the current hidden state of the network:

\(hidden_t = \sigma(E \cdot input_t + W_h \cdot hidden_{t-1})\)

The hidden state is connected to the output layer, which predicts the next word in the sequence. In order to avoid performing a softmax operation over the whole vocabulary, all words are divided between classes and the probability of the next word is factored into the probability of the class and the probability of the next word given the class:

\(P(w_{t+1} | w_{1}^{t}) \approx classes_c \cdot output_{w_{t+1}}\)

\(classes = softmax(W_c \cdot hidden_t)\)

\(output = softmax(W_o^{(c)} \cdot hidden_t)\)
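
To see why this factorisation still gives a proper probability distribution, here is a small plain-Python sketch of the three formulas above. The vocabulary, the class assignment and all the weights are made up, and the helper names are my own; it only illustrates the arithmetic, not the RNNLM Toolkit implementation.

```python
import math


def softmax(scores):
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(matrix, vector):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def word_probability(word, hidden, word_class, class_members, W_c, W_o):
    """P(word) = P(class of word) * P(word | class)."""
    c = word_class[word]
    classes = softmax(dot(W_c, hidden))
    # Only the output rows for words in class c are normalised together.
    output = softmax(dot(W_o[c], hidden))
    return classes[c] * output[class_members[c].index(word)]


# Made-up toy setup: 4 words in 2 classes, 2-dimensional hidden state.
class_members = [["the", "of"], ["cat", "dog"]]
word_class = {w: c for c, members in enumerate(class_members) for w in members}
W_c = [[0.3, -0.2], [-0.1, 0.4]]                 # one row per class
W_o = [[[0.2, 0.1], [-0.3, 0.5]],                # rows for words in class 0
       [[0.7, -0.6], [0.0, 0.2]]]                # rows for words in class 1
hidden = [0.5, -1.0]

probabilities = {w: word_probability(w, hidden, word_class, class_members, W_c, W_o)
                 for w in word_class}
```

The probabilities over all four words sum to 1, but each prediction only needs a softmax over the classes and a softmax over the words in one class, instead of a softmax over the whole vocabulary.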

The words are divided into classes by frequency-based bucketing (following Mikolov et al., 2011), and the learning rate is divided by 2 if the improvement is not sufficient. The RNNLM Toolkit treats the training data as a continuous stream of tokens and performs backpropagation through time for a fixed number of steps – the text is essentially split into fixed-sized chunks for optimisation. Instead, we perform sentence splitting and backpropagate errors from the end of each sentence to the beginning.

RNNLM with online learning

Let’s introduce a special vector into the model, which will represent the current unit of text being processed (a sentence, a paragraph, or a document). We can then update it after each prediction, based on the errors the model has made on that document.


The output probabilities over classes and words are then conditioned on this new document vector:

\(classes = softmax(W_c \cdot hidden_t + W_{dc} \cdot doc)\)
\(output = softmax(W_o^{(c)} \cdot hidden_t + W_{do}^{(c)} \cdot doc)\)

Notice that there is no input going into the document vector. Instead of constructing it iteratively, like the values in a hidden layer, we treat it as a vector of parameters and optimise them both during training and testing. After predicting each word, we calculate the error in the output layer, backpropagate this into the document vector, and adjust the values. While the main language model is a smoothed static representation of the training data, the document vector will contain information about how a specific sentence/document differs from this main language model.

The document vector is connected directly to the output layers of the RNNLM, in parallel to the hidden layer. This allows us to update the document vector after every step, instead of waiting until the end of the sentence to perform backpropagation through time.
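
A stripped-down sketch of this update might look as follows. To keep it short I use a single softmax instead of the class-factored output, keep the hidden layer contribution fixed, start the document vector at zero, and pick the learning rate arbitrarily; all of these are simplifying assumptions for the illustration, not the settings from the paper. The names are also made up.

```python
import math


def softmax(scores):
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def output_distribution(hidden_scores, W_doc, doc):
    """softmax(hidden contribution + W_doc . doc)."""
    scores = [h + sum(w * d for w, d in zip(row, doc))
              for h, row in zip(hidden_scores, W_doc)]
    return softmax(scores)


def update_doc_vector(doc, W_doc, probs, target, learning_rate):
    """One gradient step on the document vector for cross-entropy loss."""
    # d loss / d scores = probs - one_hot(target)
    delta = [p - (1.0 if i == target else 0.0) for i, p in enumerate(probs)]
    # d loss / d doc = W_doc^T . delta
    gradient = [sum(W_doc[i][j] * delta[i] for i in range(len(delta)))
                for j in range(len(doc))]
    return [d - learning_rate * g for d, g in zip(doc, gradient)]


# Made-up toy model: 3-word vocabulary, 2-dimensional document vector.
W_doc = [[0.5, -0.3], [-0.2, 0.8], [0.1, 0.1]]
hidden_scores = [0.2, 0.1, -0.4]   # stands in for W_o . hidden_t, kept fixed here
doc = [0.0, 0.0]                   # assumption: the document vector starts at zero
observed_words = [1, 1, 1, 1, 1]   # a "document" that keeps using word 1

losses = []
for target in observed_words:
    probs = output_distribution(hidden_scores, W_doc, doc)
    losses.append(-math.log(probs[target]))
    # Only the document vector changes; the other weights stay frozen.
    doc = update_doc_vector(doc, W_doc, probs, target, learning_rate=0.5)
```

In this toy run the loss on word 1 drops after every step, because the document vector keeps shifting the output distribution towards the words this particular "document" actually uses.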

Le and Mikolov (2014) used a related approach for learning vector representations of sentences and achieved good results on the sentiment detection task. They added a vector for a sentence into a feedforward language model, stepped through the sentence, and used the values at the last step as a representation of that sentence. While they connected the vector as part of the input layer, we have connected it directly to the output layer – in an RNNLM the input layer only gets updated at the end of the sentence (during backpropagation-through-time), whereas we want to update the document vector after each time step.


We constructed a dataset from English Wikipedia to evaluate language modelling performance of the two models. The text was tokenised, sentence split and lowercased. The sentences were shuffled, in order to minimise any transfer effects between consecutive sentences, and then split into training, development and test sets. The final sentences were sampled randomly, in order to obtain reasonable training times for the experiments. Dataset sizes are as follows:

           Train      Dev      Test
Words      9,990,782  237,037  4,208,847
Sentences  419,278    10,000   176,564

The regular RNNLM with a 100-dimensional hidden layer (M = 100) and no document vector (D=0) is the baseline. In the experiments we increase the capacity of the model using different methods and measure how that affects the perplexity on the datasets.

                Train PPL  Dev PPL  Test PPL
Baseline M=100  92.65      103.56   102.51
M=120           88.60      98.78    97.79
M=100, D=20     87.28      95.36    94.39
M=135           85.17      96.33    95.71
M=100, D=35     80.11      91.05    90.29

Increasing the hidden layer size M does improve the model performance and perplexity decreases from 102.51 to 95.71. However, adding the same number of neurons into the actively-updated document vector gives an even lower perplexity of 90.29.
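
In case perplexity is unfamiliar: assuming the usual definition, it is the exponential of the average negative log-probability that the model assigns to each correct word, so lower is better. A minimal plain-Python sketch, with made-up probabilities:

```python
import math


def perplexity(word_probabilities):
    """exp of the average negative log-probability per word."""
    total_log_prob = sum(math.log(p) for p in word_probabilities)
    return math.exp(-total_log_prob / len(word_probabilities))


# A model that always assigns probability 1/4 to the correct word is as
# uncertain as a uniform choice between 4 words.
uniform = perplexity([0.25, 0.25, 0.25, 0.25])
mixed = perplexity([0.5, 0.1, 0.25])
```

Read this way, a test perplexity of around 90 means the model is, on average, about as uncertain as if it were choosing uniformly between 90 words at each step.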

Experiments with semantic similarity

The resulting document vector can also be used for calculating semantic similarity between texts. We sampled random sentences from the development data, processed them with the language model, and used the resulting document vector to find the 3 most similar sentences in the development set. Below are some examples.

Input: Both Hufnagel and Marston also joined the long-standing technical death metal band Gorguts.

  • The band eventually went on to become the post-hardcore band Adair.
  • The band members originally came from different death metal bands, bonding over a common interest in d-beat.
  • The proceeds went towards a home studio, which enabled him to concentrate on his solo output and songs that were to become his debut mini-album “Feeding The Wolves”.

Input: The Chiefs reclaimed the title on September 29, 2014 in a Monday Night Football game against the New England Patriots, hitting 142.2 decibels.

  • He played in twenty-four regular season games for the Colts, all off the bench.
  • In May 2009 the Warriors announced they had re-signed him until the end of the 2011 season.
  • The team played inconsistently throughout the campaign from the outset, losing the opening two matches before winning four consecutive games during September 1927.

Input: He was educated at Llandovery College and Jesus College, Oxford, where he obtained an M.A. degree.

  • He studied at the Orthodox High School, then at the Faculty of Mathematics.
  • Kaigama studied for the priesthood at St. Augustine’s Seminary in Jos with further study in theology in Rome.
  • Under his stewardship, Zahira College became one of the leading schools in the country.
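As a rough illustration of the retrieval step, the sketch below ranks sentences by the cosine similarity of their document vectors. The choice of cosine similarity is an assumption (the description above does not name the measure), and `doc_vectors` is a hypothetical array with one document vector per sentence.

```python
import numpy as np

def most_similar(query_index, doc_vectors, k=3):
    """Return indices of the k rows most similar to row query_index (cosine similarity)."""
    # Normalise every vector to unit length, so a dot product equals the cosine
    norms = np.linalg.norm(doc_vectors, axis=1, keepdims=True)
    unit = doc_vectors / np.maximum(norms, 1e-12)
    scores = unit @ unit[query_index]
    # Exclude the query sentence itself from the ranking
    scores[query_index] = -np.inf
    return np.argsort(-scores)[:k]

# Toy data: 4 "sentences" with 3-dimensional document vectors
doc_vectors = np.array([[1.0, 0.0, 0.0],
                        [0.9, 0.1, 0.0],
                        [0.0, 1.0, 0.0],
                        [0.7, 0.7, 0.0]])
neighbours = most_similar(0, doc_vectors, k=2)
```

With real data, each row would be the document vector left in the model after stepping through one sentence.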


There has been a lot of work on developing static models for machine learning – we train the model parameters on the training data and then apply them on the test data. However, there is a lot of potential for dynamical models, which take advantage of immediate feedback signals and are able to continuously adjust the model parameters. Our experiment showed that, at least for language modelling, such a model is indeed a viable option.

26 Things I Learned in the Deep Learning Summer School

In the beginning of August I got the chance to attend the Deep Learning Summer School in Montreal. It consisted of 10 days of talks from some of the most well-known neural network researchers. During this time I learned a lot, way more than I could ever fit into a blog post. Instead of trying to pass on 60 hours worth of neural network knowledge, I have made a list of small interesting nuggets of information that I was able to summarise in a paragraph.

At the moment of writing, the summer school website is still online, along with all the presentation slides. All of the information and most of the illustrations come from these slides and are the work of their original authors. The talks in the summer school were filmed as well; hopefully they will also find their way to the web.

Update: the Deep Learning Summer School videos are now online.

Alright, let’s get started.

1. The need for distributed representations

During his first talk, Yoshua Bengio said “This is my most important slide”. You can see that slide below:

Let’s say you have a classifier that needs to detect people that are male/female, have glasses or don’t have glasses, and are tall/short. With non-distributed representations, you are dealing with 2*2*2=8 different classes of people. In order to train an accurate classifier, you need to have enough training data for each of these 8 classes. However, with distributed representations, each of these properties could be captured by a different dimension. This means that even if your classifier has never encountered tall men with glasses, it would be able to detect them, because it has learned to detect gender, glasses and height independently from all the other examples.

2. Local minima are not a problem in high dimensions

The team of Yoshua Bengio have experimentally found that when optimising the parameters of high-dimensional neural nets, there effectively are no local minima. Instead, there are saddle points which are local minima in some dimensions but not all. This means that training can slow down quite a lot in these points, until the network figures out how to escape, but as long as we’re willing to wait long enough then it will find a way.

Below is a graph demonstrating a network during training, oscillating between two states: approaching a saddle point and then escaping it.


Given one specific dimension, there is some small probability \(p\) with which a point is a local minimum, but not a global minimum, in that dimension. Now, the probability of a point in a 1000-dimensional space being an incorrect local minimum in all of these would be \(p^{1000}\), which is just astronomically small. However, the probability of it being a local minimum in some of these dimensions is actually quite high. And when we get these minima in many dimensions at once, then training can appear to be stuck until it finds the right direction.

In addition, this probability \(p\) will increase as the loss function gets closer to the global minimum.  This means that if we do ever end up at a genuine local minimum, then for all intents and purposes it will be close enough to the global minimum that it will not matter.

3. Derivatives derivatives derivatives

Leon Bottou had some useful tables with activation functions, loss functions, and their corresponding derivatives. I’ll keep these here for later.




Update: As pointed out by commenters, the min and max functions in the ramp formula should be switched.

4. Weight initialisation strategy

The current recommended strategy for initialising weights in a neural network is to sample values \(W_{i,j}^{(k)}\) uniformly from \([-b,b]\), where

\(b = \sqrt{\frac{6}{H_k + H_{k+1}}}\)

\(H_k\) and \(H_{k+1}\) are the sizes of hidden layers before and after the weight matrix.

Recommended by Hugo Larochelle, published by Glorot & Bengio (2010).
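A minimal NumPy sketch of this initialisation, assuming the weight matrix has shape (size of layer before, size of layer after). The function name `glorot_uniform` and the layer sizes are illustrative choices, not something from the talk.

```python
import numpy as np

def glorot_uniform(n_in, n_out, rng):
    """Sample an (n_in, n_out) weight matrix uniformly from [-b, b]."""
    b = np.sqrt(6.0 / (n_in + n_out))
    return rng.uniform(-b, b, size=(n_in, n_out))

rng = np.random.default_rng(0)
W = glorot_uniform(100, 50, rng)
bound = np.sqrt(6.0 / 150)   # 0.2 for these layer sizes
```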

5. Neural net training tricks

A few practical suggestions from Hugo Larochelle:

  • Normalise real-valued data. Subtract the mean and divide by standard deviation.
  • Decrease the learning rate during training.
  • Can update using mini-batches – the gradient is more stable.
  • Can use momentum, to get through plateaus.
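The four suggestions can be put together in a short sketch. This is a toy linear regression rather than a neural network, and the learning rate schedule, momentum value and batch size are arbitrary assumptions chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=3.0, size=(256, 4))
true_w = np.array([1.0, -2.0, 0.5, 3.0])
y = X @ true_w

# 1. Normalise real-valued inputs: subtract the mean, divide by the standard deviation
X_norm = (X - X.mean(axis=0)) / X.std(axis=0)
y_centred = y - y.mean()

w = np.zeros(4)
velocity = np.zeros(4)
momentum, batch_size = 0.9, 32

def mse(weights):
    return np.mean((X_norm @ weights - y_centred) ** 2)

initial_loss = mse(w)
for epoch in range(50):
    # 2. Decrease the learning rate during training
    lr = 0.05 / (1.0 + 0.1 * epoch)
    order = rng.permutation(len(X_norm))
    for start in range(0, len(order), batch_size):
        # 3. Update on a mini-batch instead of a single example
        idx = order[start:start + batch_size]
        err = X_norm[idx] @ w - y_centred[idx]
        grad = 2.0 * X_norm[idx].T @ err / len(idx)
        # 4. Momentum: keep a running direction to get through plateaus
        velocity = momentum * velocity - lr * grad
        w = w + velocity
final_loss = mse(w)
```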

6. Gradient checking

If you implemented your backprop by hand and it’s not working, then there’s roughly a 99% chance that the gradient calculation has a bug. Use gradient checking to identify the issue. The idea is to use the definition of a gradient: how much will the model error change, if we increase a specific weight by a small amount.

\(\frac{\partial f(x)}{\partial x} \approx \frac{f(x+\epsilon) - f(x-\epsilon)}{2\epsilon}\)

A more in-depth explanation is available here: Gradient checking and advanced optimization
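Here is a small sketch of what such a check can look like. The function `f` is a made-up stand-in for a model's error, and `analytic_gradient` plays the role of the hand-written backprop that we want to verify.

```python
import numpy as np

def numerical_gradient(f, x, eps=1e-5):
    """Estimate df/dx with central differences, one coordinate at a time."""
    grad = np.zeros_like(x)
    for i in range(x.size):
        step = np.zeros_like(x)
        step.flat[i] = eps
        grad.flat[i] = (f(x + step) - f(x - step)) / (2 * eps)
    return grad

# Toy "model error": f(w) = sum(tanh(w)^2), with a hand-derived gradient to verify
def f(w):
    return np.sum(np.tanh(w) ** 2)

def analytic_gradient(w):
    t = np.tanh(w)
    return 2 * t * (1 - t ** 2)

w = np.array([0.3, -1.2, 2.0])
num = numerical_gradient(f, w)
ana = analytic_gradient(w)
relative_error = np.linalg.norm(num - ana) / (np.linalg.norm(num) + np.linalg.norm(ana))
```

A relative error that is many orders of magnitude below 1 suggests the two gradients agree. A value close to 1 usually points to a bug in the analytic version.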

7. Motion tracking

Human motion tracking can be done with impressive accuracy. Below are examples from the paper Dynamical Binary Latent Variable Models for 3D Human Pose Tracking by Graham Taylor et al. (2010). The method uses conditional restricted Boltzmann machines.


8. Syntax or no syntax? (aka, “is syntax a thing?”)

Chris Manning and Richard Socher have put a lot of effort into developing compositional models that combine neural embeddings with more traditional parsing approaches. This culminated with a Recursive Neural Tensor Network (Socher et al., 2013), which uses both additive and multiplicative interactions to combine word meanings along a parse tree.

And then, the model was beaten (by quite a margin) by the Paragraph Vector (Le & Mikolov, 2014), which knows absolutely nothing about the sentence structure or syntax. Chris Manning referred to this result as “a defeat for creating ‘good’ compositional vectors”.

However, more recent work using parse trees has again surpassed this result. Irsoy & Cardie (NIPS, 2014) managed to beat paragraph vectors by going “deep” with their networks in multiple dimensions. Finally, Tai et al. (ACL, 2015) have improved the results again by combining LSTMs with parse trees.

The accuracies of these models on the Stanford 5-class sentiment dataset are as follows:

Method Accuracy
RNTN (Socher et al. 2013) 45.7
Paragraph Vector (Le & Mikolov 2014)  48.7
DRNN (Irsoy & Cardie 2014) 49.8
Tree LSTM (Tai et al. 2015) 50.9

So it seems that, at the moment, models using the parse tree are beating simpler approaches. I’m curious to see if and when the next syntax-free approach will emerge that will advance this race. After all, the goal of many neural models is not to discard the underlying grammar, but to implicitly capture it in the same network.

9. Distributed vs distributional

Chris Manning himself cleared up the confusion between the two words.

Distributed: A concept is represented as continuous activation levels in a number of elements. Like a dense word embedding, as opposed to 1-hot vectors.

Distributional: Meaning is represented by contexts of use. Word2vec is distributional, but so are count-based word vectors, as we use the contexts of the word to model the meaning.

10. The state of dependency parsing

Comparison of dependency parsers on the Penn Treebank:

Parser  Unlabelled Accuracy  Labelled Accuracy  Speed (sent/s)
MaltParser 89.8 87.2 469
MSTParser 91.4 88.1 10
TurboParser 92.3 89.6 8
Stanford Neural Dependency Parser 92.0 89.7 654
Google  94.3 92.4 ?

The last result is from Google “pulling out all the stops”, by putting massive amounts of resources into training the Stanford neural parser.

11. Theano

Well, I knew a bit about Theano before, but I learned a whole lot more during the summer school. And it is pretty awesome.

Since Theano originates from Montreal, it was especially helpful to be able to ask questions directly from the people who are developing it.

Most of the information that was presented is available online, in the form of interactive python tutorials.

12. Nvidia Digits

Nvidia has a toolkit called Digits that trains and visualises complex neural network models without needing to write any code. And they’re selling DevBox – a machine customised for running Digits and other deep learning software (Theano, Caffe, etc). It comes with 4 Titan X GPUs and currently costs $15,000.

13. Fuel

Fuel is a toolkit that manages iteration over your datasets – it can split them into minibatches, manage shuffling, apply various preprocessing steps, etc. There are prebuilt functions for some established datasets, such as MNIST, CIFAR-10, and Google’s 1B Word corpus. It is mainly designed for use with Blocks, a toolkit that simplifies network construction with Theano.

14. Multimodal linguistic regularities

Remember “king – man + woman = queen”? Turns out that works with images as well (Kiros et al., 2015).


15. Taylor series approximation

When we are at point \(x_0\) and take a step to \(x\), then we can estimate the function value in the new location by knowing the derivatives, using the Taylor series approximation.

\(f(x) = f(x_0) + (x - x_0)f'(x_0) + \frac{1}{2} (x - x_0)^2 f''(x_0) + \dots\)

Similarly, we can estimate the loss of a function, when we update parameters \(\theta_0\) to \(\theta\).

\(J(\theta) = J(\theta_0) + (\theta - \theta_0)^T g + \frac{1}{2} (\theta - \theta_0)^T H(\theta - \theta_0) + \dots\)

where \(g\) is the gradient (the vector of first-order derivatives) with respect to \(\theta\), and \(H\) is the Hessian (the matrix of second-order derivatives) with respect to \(\theta\), both evaluated at \(\theta_0\).

This is the second-order Taylor approximation, but we could increase the accuracy by adding even higher-order derivatives.
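To get a feel for how good the approximation is, here is a one-dimensional sketch with the derivatives evaluated at the starting point \(x_0\). The helper name `taylor2` and the choice of the exponential function are my own, picked because all of its derivatives are easy to write down.

```python
import math

def taylor2(f, df, d2f, x0, x):
    """Second-order Taylor estimate of f(x), expanded around x0."""
    dx = x - x0
    return f(x0) + dx * df(x0) + 0.5 * dx ** 2 * d2f(x0)

# Example with f = exp, whose first and second derivatives are also exp
x0, x = 0.0, 0.1
estimate = taylor2(math.exp, math.exp, math.exp, x0, x)
error = abs(math.exp(x) - estimate)
```

For this small step the estimate is 1.105, and the remaining error comes from the third-order and higher terms that were dropped.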

16. Computational intensity

Adam Coates presented a strategy for analysing the speed of matrix operations on a GPU. It’s a simplified model that says your time is spent on either reading/writing to memory or doing calculations. It assumes you can do both in parallel so we are interested in which one of them takes more time.

Let’s say we are multiplying a matrix with a vector:


If \(M=1024\) and \(N=512\), then the number of bytes we need to read and store is:

\( 4\text{ bytes }\times (1024 \times 512 + 512 + 1024) = 2.1e6\text{ bytes} \)

And the number of calculations we need to do is:

\(2\times 1024\times 512 = 1e6\text{ FLOPs}\)

If we have a GPU that can do 6 TFLOP/s and has memory bandwidth of  300GB/s, then the total running time will be:

\(\text{max}\{2.1e6\text{ bytes }/ (300e9\text{ bytes}/s), 1e6\text{ FLOPs} / (6e12\text{ FLOP}/s) \} \\
= \text{max}\{ 7\mu s, 0.17\mu s \} = 7\mu s \)

This means the process is bounded by the \(7\mu s\) spent on copying to/from the memory, and getting a faster GPU would not make any difference. As you can probably guess, this situation gets better with bigger matrices/vectors, and when doing matrix-matrix operations.

Adam also described the idea of calculating the intensity of an operation:

Intensity = (# arithmetic ops) / (# bytes to load or store)

In the previous scenario, this would be

Intensity = (1E6 FLOPs) / (2.1E6 bytes) ≈ 0.5 FLOPs/byte

Low intensity means the system is bottlenecked on memory, and high intensity means it’s bottlenecked by the GPU speed. This can be visualised, in order to find which of the two needs to improve in order to speed up the whole system, and where the sweet spot lies.
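The calculation above is easy to script, which helps when trying out different matrix sizes. This is only a sketch of the simplified model described here. The function name `matvec_cost` is hypothetical, and the hardware numbers are the example values from the text.

```python
def matvec_cost(M, N, flops_per_s, bytes_per_s, bytes_per_value=4):
    """Estimate time and intensity for multiplying an M x N matrix by a vector."""
    # Read the matrix and the input vector, write the output vector
    n_bytes = bytes_per_value * (M * N + N + M)
    # One multiplication and one addition per matrix element
    n_flops = 2 * M * N
    memory_time = n_bytes / bytes_per_s
    compute_time = n_flops / flops_per_s
    return {
        "time_s": max(memory_time, compute_time),
        "memory_bound": memory_time > compute_time,
        "intensity": n_flops / n_bytes,
    }

cost = matvec_cost(1024, 512, flops_per_s=6e12, bytes_per_s=300e9)
```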


17. Minibatches

Continuing from the intensity calculations, one way of increasing the intensity of your network (in order to be limited by computation instead of memory), is to process data in minibatches. This avoids some memory operations, and GPUs are great at processing large matrices in parallel.

However, increasing the batch size too much will probably start hurting the training algorithm, and converging can take longer. It’s important to find a good balance in order to get the best results in the least amount of time.


18. Training on adversarial examples

It was recently revealed that neural networks are easily tricked by adversarial examples. In the example below, the image on the left is correctly classified as a goldfish. However, if we apply the noise pattern shown in the middle, resulting in the image on the right, the classifier becomes convinced this is a picture of a daisy. The image is from Andrej Karpathy’s blog post “Breaking Linear Classifiers on ImageNet”, and you can read more about it there.

The noise pattern isn’t random though – the noise is carefully calculated, in order to trick the network. But the point remains: the image on the right is clearly still a goldfish and not a daisy.

Apparently strategies like ensemble models, voting after multiple saccades, and unsupervised pretraining have all failed against this vulnerability. Applying heavy regularisation helps, but not before ruining the accuracy on the clean data.

Ian Goodfellow presented the idea of training on these adversarial examples. They can be automatically generated and added to the training set. The results below show that in addition to helping with the adversarial cases, this also improves accuracy on the clean examples.

Finally, we can improve this further by penalising the KL-divergence between the original predicted distribution and the predicted distribution on the adversarial example. This optimises the network to be more robust, and to predict similar class distributions for similar (adversarial) images.

19. Everything is language modelling

Phil Blunsom presented the idea that almost all NLP can be structured as a language model. We can do this by concatenating the output to the input and trying to predict the probability of the whole sequence.


Translation:

\(P(\text{Les chiens aiment les os || Dogs love bones})\)

Question answering:

\(P(\text{What do dogs love? || bones .})\)


Dialogue:

\(P(\text{How are you? || Fine thanks. And you?})\)

The latter two need to be additionally conditioned on some world knowledge. The second part doesn’t even need to be words, but could be labels or some structured output like dependency relations.

20. SMT had a rough start

When Frederick Jelinek and his team at IBM submitted one of the first papers on statistical machine translation to COLING in 1988, they got the following anonymous review:


The validity of a statistical (information theoretic) approach to MT has indeed been recognized, as the authors mention, by Weaver as early as 1949. And was universally recognized as mistaken by 1950 (cf. Hutchins, MT – Past, Present, Future, Ellis Horwood, 1986, p. 30ff and references therein). The crude force of computers is not science. The paper is simply beyond the scope of COLING.

21. The state of Neural Machine Translation

Apparently a very simple neural model can produce surprisingly good results. An example of translating from Chinese to English, from Phil Blunsom’s slides:


In this model, the vectors for the Chinese words are simply added together to form a sentence vector. The decoder consists of a conditional language model which takes the sentence vector, together with vectors from the two recently generated English words, and generates the next word in the translation.

However, neural models are still not outperforming the very best traditional MT systems. They do come very close though. Results from “Sequence to Sequence Learning with Neural Networks” by Sutskever et al. (2014):

Model  BLEU score
Baseline 33.30
Best WMT'14 result  37.0
Scoring with 5 LSTMs  36.5
Oracle (upper bound)  ∼45

Update: @stanfordnlp pointed out that there are some recent results where the neural model does indeed outperform the state-of-the-art traditional MT system. Check out “Effective Approaches to Attention-based Neural Machine Translation” (Luong et al., 2015).

22. MetaMind classifier demo

Richard Socher demonstrated the MetaMind image classification demo, which you can train yourself by uploading images. I trained a classifier to detect Edison and Einstein (couldn’t find enough unique images of Tesla). 5 example images for both classes, testing on one held out image each. Seemed to work pretty well.


23. Optimising gradient updates

Mark Schmidt gave two presentations about numerical optimisation in different scenarios.

In a deterministic gradient method we calculate the gradient over the whole dataset and then apply the update. The iteration cost is linear in the dataset size.

In stochastic gradient methods we calculate the gradient on one datapoint and then apply the update. The iteration cost is independent of the dataset size.

Each iteration of the stochastic gradient descent is much faster, but it usually takes many more iterations to train the network, as this graph illustrates:


In order to get the best of both worlds, we can use batching. More specifically, we could do one pass of the dataset with stochastic gradient descent, in order to quickly get on the right track, and then start increasing the batch size. The gradient error decreases as the batch size increases, although eventually the iteration cost will become dependent on the dataset size again.

Stochastic Average Gradient (SAG) is a method that gets around this, providing a linear convergence rate with only 1 gradient per iteration. Unfortunately, it is not feasible for large neural networks, as it needs to remember the gradient updates for every datapoint, leading to large memory requirements. Stochastic Variance-Reduced Gradient (SVRG) is another method that reduces this memory cost, and only needs 2 gradient calculations per iteration (plus occasional full passes).
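To make the “2 gradient calculations per iteration” concrete, here is a minimal sketch of the basic SVRG update on a toy least-squares problem. It is my own illustration under simple assumptions (noiseless data, a hand-picked step size and inner loop length), not the code discussed in the talk.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 5
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true                      # noiseless targets, so w_true is the exact optimum

def grad_i(w, i):
    """Gradient of 0.5 * (x_i . w - y_i)^2 for a single datapoint."""
    return (X[i] @ w - y[i]) * X[i]

def full_grad(w):
    """Average gradient over the whole dataset."""
    return X.T @ (X @ w - y) / n

lr = 0.1 / np.max(np.sum(X ** 2, axis=1))   # conservative step size (assumption)
w = np.zeros(d)
for epoch in range(30):
    w_snapshot = w.copy()
    mu = full_grad(w_snapshot)              # the occasional full pass over the data
    for _ in range(2 * n):
        i = rng.integers(n)
        # Two gradient evaluations per step: at the current and at the snapshot weights
        w = w - lr * (grad_i(w, i) - grad_i(w_snapshot, i) + mu)
distance = np.linalg.norm(w - w_true)
```

Only the snapshot weights and one averaged gradient are stored, which is where the memory saving compared to SAG comes from.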

Mark said a student of his implemented a variety of optimisation methods (AdaGrad, momentum, SAG, etc). When asked what he would use in a black box neural network system, the student named two methods: Streaming SVRG (Frostig et al., 2015), and a method they haven’t published yet.

24. Theano profiling

If you put “profile=True” into THEANO_FLAGS, it will analyse your program, showing a breakdown of how much time is spent on each operation. Very handy for finding bottlenecks.

25. Adversarial nets framework

Following on from Ian Goodfellow’s talk on adversarial examples, Yoshua Bengio talked about having two systems competing with each other.

System D is a discriminative system that aims to classify between real data and artificially generated data.

System G is a generative system, that tries to generate artificial data, which D would incorrectly classify as real.

As we train one, the other needs to get better as well. In practice this does work, although the step needs to be quite small to make sure D can keep up with G. Below are some examples from “Deep Generative Image Models using a Laplacian Pyramid of Adversarial Networks” – a more advanced version of this model which tries to generate images of churches.


26. arXiv numbering

The arXiv number contains the year and month of the submission, followed by the sequence number. So paper 1508.03854 was number 3854 in August 2015. Good to know.
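If you want to pull these parts out programmatically, a small sketch could look like this. It only handles identifiers of the YYMM.number form described above, and it assumes the two-digit year belongs to the 2000s. The function name `parse_arxiv_id` is made up for this example.

```python
import re

def parse_arxiv_id(arxiv_id):
    """Split an arXiv identifier of the form YYMM.number into its parts."""
    match = re.fullmatch(r"(\d{2})(\d{2})\.(\d{4,5})", arxiv_id)
    if match is None:
        raise ValueError("not a YYMM.number identifier: %r" % arxiv_id)
    yy, mm, seq = match.groups()
    # Assumption: two-digit years refer to 20YY
    return {"year": 2000 + int(yy), "month": int(mm), "sequence": int(seq)}

info = parse_arxiv_id("1508.03854")
```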

Transforming Images to Feature Vectors

I’m keen to explore some challenges in multimodal learning, such as jointly learning visual and textual semantics. However, I would rather not start by attempting to train an image recognition system from scratch, and prefer to leave this part to researchers who are more experienced in vision and image analysis.

Therefore, the goal is to use an existing image recognition system, in order to extract useful features for a dataset of images, which can then be used as input to a separate machine learning system or neural network. We start with a directory of images, and create a text file containing feature vectors for each image.

1. Install Caffe

Caffe is an open-source neural network library developed in Berkeley, with a focus on image recognition. It can be used to construct and train your own network, or load one of the pretrained models. A web demo is available if you want to test it out.

Follow the installation instructions to compile Caffe. You will need to install quite a few dependencies (Boost, OpenCV, ATLAS, etc), but at least for Ubuntu 14.04 they were all available in public repositories.

Once you’re done, run

make test
make runtest

This will run the tests and make sure the installation is working properly.

2. Prepare your dataset

Put all the images you want to process into one directory. Then generate a file containing the path to each image, one image per line. We will use this file to read the images, and it will help you map images to the correct vectors later.

You can run something like this:

find `pwd`/images -type f -exec echo {} \; > images.txt

This will find all files in the subdirectory called “images” and write their paths to images.txt.

3. Download the model

There are a number of pretrained models publicly available for Caffe. Four main models are part of the original Caffe distribution, but more are available in the Model Zoo wiki page, provided by community members and other researchers.

We’ll be using the BVLC GoogLeNet model, which is based on the model described in Going Deeper with Convolutions by Szegedy et al. (2014). It is a 22-layer deep convolutional network, trained on ImageNet data to detect 1,000 different image types. Just for fun, here’s a diagram of the network, rotated 90 degrees:


The Caffe models consist of two parts:

  1. A description of the model (in the form of *.prototxt files)
  2. The trained parameters of the model (in the form of a *.caffemodel file)

The prototxt files are small, and they came included with the Caffe code. But the parameters are large and need to be downloaded separately. Run the following command in your main Caffe directory to download the parameters for the GoogLeNet model:

python scripts/ models/bvlc_googlenet

This will find out where to download the caffemodel file, based on information already in the models/bvlc_googlenet/ directory, and will then place it into the same directory.

In addition, run this command as well:


It will download some auxiliary files for the ImageNet dataset, including the file of class labels which we will be using later.

4. Process images and print vectors

Now is the time to load the model into Caffe, process each image, and print a corresponding vector into a file. I created a script for that (see below, also available as a Gist):

import numpy as np
import os, sys, getopt

# Main path to your caffe installation
caffe_root = '/path/to/your/caffe/'

# Model prototxt file
model_prototxt = caffe_root + 'models/bvlc_googlenet/deploy.prototxt'

# Model caffemodel file
model_trained = caffe_root + 'models/bvlc_googlenet/bvlc_googlenet.caffemodel'

# File containing the class labels
imagenet_labels = caffe_root + 'data/ilsvrc12/synset_words.txt'

# Path to the mean image (used for input processing)
mean_path = caffe_root + 'python/caffe/imagenet/ilsvrc_2012_mean.npy'

# Name of the layer we want to extract
layer_name = 'pool5/7x7_s1'

sys.path.insert(0, caffe_root + 'python')
import caffe

def main(argv):
    inputfile = ''
    outputfile = ''
    usage = '%s -i <inputfile> -o <outputfile>' % sys.argv[0]

    try:
        opts, args = getopt.getopt(argv, "hi:o:", ["ifile=", "ofile="])
    except getopt.GetoptError:
        print(usage)
        sys.exit(2)

    for opt, arg in opts:
        if opt == '-h':
            print(usage)
            sys.exit()
        elif opt in ("-i", "--ifile"):
            inputfile = arg
        elif opt in ("-o", "--ofile"):
            outputfile = arg

    print('Reading images from "%s"' % inputfile)
    print('Writing vectors to "%s"' % outputfile)

    # Setting this to CPU, but feel free to use GPU if you have CUDA installed
    caffe.set_mode_cpu()
    # Loading the Caffe model, setting preprocessing parameters
    # (the mean, channel order and scale are assumed to follow Caffe's ImageNet classification example)
    net = caffe.Classifier(model_prototxt, model_trained,
                           mean=np.load(mean_path).mean(1).mean(1),
                           channel_swap=(2, 1, 0),
                           raw_scale=255,
                           image_dims=(256, 256))

    # Loading class labels
    with open(imagenet_labels) as f:
        labels = f.readlines()

    # This prints information about the network layers (names and sizes)
    # You can uncomment this, to have a look inside the network and choose which layer to print
    #print([(k, v.data.shape) for k, v in net.blobs.items()])

    # Processing one image at a time, printing predictions and writing the vector to a file
    with open(inputfile, 'r') as reader:
        with open(outputfile, 'w') as writer:
            for image_path in reader:
                image_path = image_path.strip()
                input_image = caffe.io.load_image(image_path)
                prediction = net.predict([input_image], oversample=False)
                best = prediction[0].argmax()
                print('%s  :  %s  ( %s )' % (os.path.basename(image_path), labels[best].strip(), prediction[0][best]))
                np.savetxt(writer, net.blobs[layer_name].data[0].reshape(1, -1), fmt='%.8g')

if __name__ == "__main__":
    main(sys.argv[1:])

You will first need to set the caffe_root variable to point to your Caffe installation. Then run it with:

python -i <inputfile> -o <outputfile>

It will first print out a lot of model-specific debugging information, and will then print a line for each input image containing the image name, the label of the most probable class, and the class probability.

flower.jpg  :  n11939491 daisy  ( 0.576037 )
horse.jpg  :  n02389026 sorrel  ( 0.996444 )
beach.jpg  :  n09428293 seashore, coast, seacoast, sea-coast  ( 0.568305 )

At the same time, it will also print vectors into the output file. By default, it will extract the layer pool5/7x7_s1 after processing each image. This is the last pooling layer before the final classifier and softmax, and it contains 1024 elements. I haven’t experimented with choosing different layers yet, but this seemed like a reasonable place to start – it should contain all the high-level processing done in the network, but before forcing it to choose a specific class. Feel free to choose a different layer though, just change the corresponding parameter in the script. If you find that specific layers work better, let me know as well.

The outputfile will contain vectors for each image. There will be one line of values for each input image, and every line will contain 1024 values (if you printed the default layer). Mission accomplished!
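To use the vectors afterwards, they need to be matched back to the image paths. A minimal sketch is below, relying only on the fact that both files have one line per image in the same order. The helper `load_image_vectors` and the tiny demo files are hypothetical. With real data you would pass your own images.txt and output file instead.

```python
import os
import tempfile

import numpy as np

def load_image_vectors(paths_file, vectors_file):
    """Pair each image path with its feature vector (both files share line order)."""
    with open(paths_file) as f:
        paths = [line.strip() for line in f if line.strip()]
    vectors = np.loadtxt(vectors_file, ndmin=2)
    if len(paths) != vectors.shape[0]:
        raise ValueError("number of paths and vectors differ")
    return dict(zip(paths, vectors))

# Tiny self-contained demo with made-up paths and 3-dimensional vectors
tmp = tempfile.mkdtemp()
paths_file = os.path.join(tmp, "images.txt")
vectors_file = os.path.join(tmp, "vectors.txt")
with open(paths_file, "w") as f:
    f.write("/data/a.jpg\n/data/b.jpg\n")
np.savetxt(vectors_file, np.array([[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]), fmt="%.8g")
features = load_image_vectors(paths_file, vectors_file)
```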


Below are some tips for when you run into problems.

First, it’s worth making sure you have compiled the python bindings in the Caffe directory:

make pycaffe

I was getting some unusual errors when this code was in a subdirectory of the main Caffe folder. After some googling I found that others had similar problems with other projects, and apparently overlapping library names were causing the wrong dependencies to be included. The simple solution was to move this code out of the Caffe directory, and put it somewhere else.

I installed Caffe with CUDA support, and even though I turned GPU support off in the script, it was still complaining when I didn’t set the CUDA path. For example, I run the code like this (you may need to change the paths to match your system):

LD_LIBRARY_PATH=/usr/local/cuda-7.0/lib64/:$LD_LIBRARY_PATH PYTHONPATH=$PYTHONPATH:/path/to/caffe/python python -i images.txt -o vectors.txt

Finally, Caffe is compiled against a specific version of CUDA. I initially had CUDA 6.5 installed, but after upgrading to CUDA 7.0 the Caffe library had to be recompiled.


There you have it – going from images to vectors. Now you can use these vectors to represent your images in various tasks, such as classification, multi-modal learning, or clustering. Ideally, you will probably want to train the whole network on a specific task, including the visual component, but for starters these pretrained vectors should be quite helpful as well.

These instructions and the script are loosely based on Caffe examples on ImageNet classification and filter visualisation. If the code here isn’t doing quite what you want it to, it’s worth looking at these other similar applications.

If you have any suggestions or fixes, let me know and I’ll be happy to incorporate them in this post.

Multilingual Semantic Models

In this post I’ll discuss a model for learning word embeddings, such that they end up in the same space in different languages. This means we can find the similarity between some English and German words, or even compare the meaning of two sentences in different languages. It is a summary and analysis of the paper by Karl Moritz Hermann and Phil Blunsom, titled “Multilingual Models for Compositional Distributional Semantics”, published at ACL 2014.

The Task

The goal of this work is to extend the distributional hypothesis to multilingual data and joint-space embeddings. This would give us the ability to compare words and sentences in different languages, and also make use of labelled training data from languages other than the target language. For example, below is an illustration of English words and their Estonian translations in the same semantic space.



This actually turns out to be a very difficult task, because the distributional hypothesis stops working across different languages. While “fish” is an important feature of “cat”, because they occur together often, “kass” never occurs with “fish”, because they are in different languages and therefore used in separate sets of documents.

In order to learn these representations in the same space, the authors construct a neural network that learns from parallel sentences (pairs of the same sentence in different languages). The model is then evaluated on the task of topic classification, training on one language and testing on the other.

A bit of a spoiler, but here is a visualisation of some words from the final model, mapped into 2 dimensions.



The words from English, German and French are successfully mapped into clusters based on meaning. The colours indicate gender (blue=male, red=female, green=neutral).

The Multilingual Model

The main idea is as follows: We have sentence \(a\) in one language, and we have a function \(f(a)\) which maps that sentence into a vector representation (we’ll come back to that function). We then have sentence \(b\), which is the same sentence just in a different language, and function \(g(b)\) for mapping it into a vector representation. Our goal is to have \(f(a)\) and \(g(b)\) be identical, because both of these sentences have the same meaning. So during training, we show the model a series of parallel sentences \(a\) and \(b\), and each time we adjust the functions \(f(a)\) and \(g(b)\) so that they would produce more similar vectors.

Here is a graphical representation of the model:


\(b1\), \(b2\) and \(b3\) are words in sentence \(b\); \(a1\), \(a2\), \(a3\) and \(a4\) are words in sentence \(a\). The red vectors in the middle are the sentence representations that we want to be similar.

Next, let’s talk about the functions \(f(a)\) and \(g(b)\) that map a sentence to a vector. As you can see from the image above, each word is represented as a vector as well. The simplest option of going from words to sentences is to just add the individual word vectors together (the ADD model):

\(f_{ADD}(a) = \sum_{i=1}^{n} a_i\)

Here, \(a_i\) is the vector for word number \(i\) in sentence \(a\). This addition is similar to a basic bag-of-words model, because it doesn’t preserve any information about the order of the words. Therefore, the authors have also proposed a bigram version of this function (the BI model):

\(f_{BI}(a) = \sum_{i=1}^{n} \tanh(a_{i-1} + a_i)\)

In this function, we step through the sentence, add together the vectors for two consecutive words, and pass the sum through a nonlinearity (tanh). The results are then summed together into a sentence vector. This is essentially a multi-layer compositional network, where word vectors are first combined into bigram vectors, and then bigram vectors are combined into sentence vectors.
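To make the difference between the two composition functions concrete, here is a minimal sketch in plain Python. The function names `f_add` and `f_bi` and the toy vectors are my own, and treating the missing \(a_0\) as a zero vector is an assumption, since the formula above does not say how the first position is handled.

```python
import math


def f_add(word_vectors):
    """ADD model: element-wise sum of the word vectors in a sentence."""
    dim = len(word_vectors[0])
    return [sum(vec[k] for vec in word_vectors) for k in range(dim)]


def f_bi(word_vectors):
    """BI model: sum of tanh(a_{i-1} + a_i) over positions i = 1..n.

    Assumption: a_0, the vector before the first word, is a zero vector.
    """
    dim = len(word_vectors[0])
    padded = [[0.0] * dim] + list(word_vectors)
    total = [0.0] * dim
    # Walk over consecutive pairs (a_{i-1}, a_i) and accumulate tanh of their sum.
    for prev, cur in zip(padded, padded[1:]):
        for k in range(dim):
            total[k] += math.tanh(prev[k] + cur[k])
    return total


# Toy 2-dimensional vectors for a three-word sentence (made-up values).
sentence = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
reordered = [sentence[2], sentence[0], sentence[1]]

print(f_add(sentence))   # [1.5, 1.5]
print(f_add(reordered))  # same as above: ADD ignores word order
print(f_bi(sentence))
print(f_bi(reordered))   # differs from the line above: BI is order-sensitive
```

Shuffling the words leaves the ADD vector unchanged but changes the BI vector, which is exactly the word-order information the bigram version is meant to capture.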

One more component to make this model work – the optimization function. The authors define an energy function given two sentences:

\(E(a,b) = \| f(a) - g(b) \|^2\)

This means we find the Euclidean distance between the two vector representations and take the square of it. This value will be big when the vectors are different, and small when they are similar.
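As a quick illustration, the energy can be computed in a couple of lines. This is only a sketch: the function name `energy` is my own, and the inputs are assumed to be sentence vectors that have already been produced by \(f\) and \(g\).

```python
def energy(u, v):
    """Squared Euclidean distance between two sentence vectors."""
    return sum((x - y) ** 2 for x, y in zip(u, v))


print(energy([1.0, 2.0], [1.0, 2.0]))  # 0.0 for identical vectors
print(energy([0.0, 0.0], [3.0, 4.0]))  # 25.0, the squared distance of a 3-4-5 triangle
```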

But we can’t directly use this for optimization, because functions \(f(a)\) and \(g(b)\) that always return zero vectors would then be the optimal solution: the energy would be zero for every sentence pair. We want the model to give similar vectors for similar sentences, but different vectors for semantically different sentences. Here’s a function for that:

\(E_{nc}(a,b,c) = [m + E(a,b) - E(a,c)]_{+}\)

We’ve introduced a randomly selected sentence \(c\) that probably has nothing to do with \(a\) or \(b\). Our objective is to minimize the \(E_{nc}(a,b,c)\) function, which means we want \(E(a,b)\) (for related sentences) to be small, and \(E(a,c)\) (for unrelated sentences) to be large. This form of training, where the model is taught to distinguish between correct and incorrect samples, is a noise-contrastive, large-margin approach. The formula also includes \(m\), which is the margin we want to have between the values of \(E(a,b)\) and \(E(a,c)\). The whole thing is passed through the function \([x]_{+} = \max(x,0)\), which means that if \(E(a,c)\) is greater than \(E(a,b)\) by at least the margin \(m\), then we’re already optimal and don’t need to adjust the model further.
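The margin-based objective can be sketched as follows. The names `energy` and `noise_contrastive_loss`, the default margin of 1.0 and the toy vectors are all my own assumptions; the arguments stand for the already-computed vectors \(f(a)\), \(g(b)\) and \(g(c)\).

```python
def energy(u, v):
    """Squared Euclidean distance between two sentence vectors."""
    return sum((x - y) ** 2 for x, y in zip(u, v))


def noise_contrastive_loss(fa, gb, gc, margin=1.0):
    """Hinge loss [m + E(a,b) - E(a,c)]_+ for one aligned pair and one noise sample."""
    return max(0.0, margin + energy(fa, gb) - energy(fa, gc))


fa = [1.0, 0.0]        # vector for sentence a
gb = [1.0, 0.1]        # its translation b, already close to a
far_noise = [-1.0, 0.0]   # unrelated sentence c, far away from a
near_noise = [1.0, 0.5]   # unrelated sentence c that is still too close to a

print(noise_contrastive_loss(fa, gb, far_noise))   # 0.0: margin satisfied, nothing to update
print(noise_contrastive_loss(fa, gb, near_noise))  # positive: the model still needs adjusting

# The degenerate "everything is a zero vector" solution is no longer optimal:
zero = [0.0, 0.0]
print(noise_contrastive_loss(zero, zero, zero))    # 1.0, i.e. the full margin
```

The last line shows why the noise sample matters: collapsing all sentences to the same vector now costs the full margin on every training example.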

The authors also experiment with a document-level variation of the model (the DOC model), where individual sentence vectors are combined into document vectors and these are also optimized to be similar, in addition to the sentence vectors.
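Here is a rough sketch of how the document-level signal might be combined with the sentence-level one. It makes two assumptions that go beyond the description above: sentence vectors are combined into a document vector by simple addition, and the sentence-level and document-level energies are summed with equal weight. The function names are my own, and the noise-contrastive part is left out to keep the example short.

```python
def energy(u, v):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(u, v))


def add_vectors(vectors):
    """Element-wise sum of a list of equally sized vectors."""
    dim = len(vectors[0])
    return [sum(vec[k] for vec in vectors) for k in range(dim)]


def doc_energy(doc_a, doc_b):
    """Sentence-level plus document-level energy for two aligned documents.

    doc_a and doc_b are lists of sentence vectors, aligned position by position.
    """
    sentence_term = sum(energy(sa, sb) for sa, sb in zip(doc_a, doc_b))
    document_term = energy(add_vectors(doc_a), add_vectors(doc_b))
    return sentence_term + document_term


# Two toy documents with two sentences each (made-up values).
doc_a = [[1.0, 0.0], [0.0, 1.0]]
doc_b = [[1.0, 0.0], [0.0, 2.0]]

print(doc_energy(doc_a, doc_a))  # 0.0 for identical documents
print(doc_energy(doc_a, doc_b))  # 2.0: 1.0 from the second sentence pair, 1.0 from the document vectors
```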


The authors evaluate the system on the task of topic classification. The classifier is trained on one language (e.g. English) and the test results are reported on another language (e.g. German) for which no labelled training data was used. They run two main experiments:

  1. The cross-lingual document classification (CLDC) task, described by Klementiev et al. (2012). The system is trained on the parallel Europarl corpus, and tested on Reuters RCV1/RCV2. The language pairs used were English-German and English-French.
  2. The authors built a new corpus from parallel subtitles of TED talks (not yet online at the time of writing this), based on a previous TED corpus for IWSLT. Each talk also has topic tags assigned to it, and the task is to assign a correct tag to every talk, using the document-level vector.



First, results on the CLDC task:

I-Matrix is the previous state-of-the-art system, and all the models described here manage to outperform it. The +-variations of the model (ADD+ and BI+) use the French data as an extra training resource, thereby improving performance. This is an interesting result, as the classifier is trained on English and tested on German, and French seems completely unrelated to the task. But by adding it into the model, the English word representations are improved by having more information available, which in turn propagates on to having better German representations.

Next, experiments on the TED corpus:


The authors have performed a much larger number of experiments, and I’ve only chosen a few examples to show here.

The MT System is a machine translation baseline, where the test data is translated into the source language using a machine translation system. The most interesting scenarios for application are where the source language is English, and this is where the MT baseline often still wins. So if we want to do topic classification in Dutch, but we only have English labelled data, it’s best to just automatically translate the Dutch text into English before classification.

Experiments in the other direction (where the target language is English) show different results, and the multilingual neural models manage to win on most languages. I’m curious about this difference – perhaps the MT systems are better tuned to translate into English, but not as good when translating from English into other languages? In any case, I think with some additional developments the neural network model will be able to beat the baseline in both directions.

In most cases, adding the document-level training signal (ADD/DOC) helped accuracy quite a bit. The bigram models (BI), however, were outperformed by the basic ADD models on this task, and the authors suspect this is due to sparsity issues caused by less training data.

Finally, the ADD/DOC/joint model was trained on all languages simultaneously, taking advantage of parallel data in all the languages, and mapping all vectors into the same space. The results of this experiment seem to be mixed, leading to an improvement on some languages and a decrease on others.

In conclusion, this is definitely a very interesting model, and it bridges the gap between vector representations of different languages, using only sentence-aligned plain text. Combining sentence-level and document-level training signals seems to give a fairly consistent improvement in classification accuracy. Unfortunately, in the most interesting scenario, mapping from English to other resource-poor languages, this system does not yet beat the MT baseline. But hopefully this is only the first step, and future research will further improve the results.


Hermann, K. M., & Blunsom, P. (2014). Multilingual Models for Compositional Distributed Semantics. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (pp. 58–68).

Klementiev, A., Titov, I., & Bhattarai, B. (2012). Inducing crosslingual distributed representations of words. In COLING 2012.