Menu Close

Category: Uncategorized

ML and NLP Publications in 2019

It is about time we once again take a look at the publication statistics of the past year. 2019 was another record breaking year in machine learning and NLP research. Nearly all conferences had more attendees and more publications than ever before. For example, NeurIPS had 6,743 submissions and 1,428 accepted papers, which eclipses all the previous iterations. Because the conference sold out so fast last year, the organizers had to implement a randomised lottery for the tickets this time.

In this post you will find a number of graphs to illustrate the publication patterns from 2019.
I have included the following conferences and journals: ACL, EMNLP, NAACL, EACL, COLING, TACL, CL, CoNLL, NeurIPS, ICML, ICLR, AAAI. The paper information is crawled and processed automatically from the corresponding proceedings. While names and titles are quite easily accessible, getting the author affiliations is the noisiest part of the process, as this needs to be extracted directly from the pdf files. I've kept improving the pipeline every year so it should be more accurate than any of the previous iterations. If you do spot some mistakes, let me know.

The analysis this year includes some brand new statistics and graphs. This is thanks to Jonas Pfeiffer (@PfeiffJo) and Andrew Caines (@cainesap) who annotated the extracted organization names with origin countries. Andrew started the process with a subset of 2018 conferences for our joint article on The Geographic Diversity of NLP Conferences. Jonas then scaled this up to cover all of the organizations in all the conferences in this analysis. This allows us to create some new country-specific visualisations, which you'll see below.

While this post highlights authors and organizations who have published the most in the past year, I want to emphasize that publication quantity should not be used as the main metric for good research. It's always better to have one really groundbreaking and influential piece of work over 10 forgettable and incremental papers. These statistics are just meant to give an informative overview of the most active publication centres in this field and hopefully inspire some new researchers to publish their own ideas.

Venues

Essentially all conferences had record-breaking numbers of publications in 2019. The journals have a fairly stable publication rate, COLING and EACL didn't happen this year, and every other conference just grew substantially. NeurIPS is by far the biggest conference and it now has a lead of nearly 300 papers over AAAI.

Organizations

Let's see which organizations published the most in 2019. Google has taken a comfortable lead, publishing a substantial number of papers in every venue. They had more than twice as many papers at ICML as the next closest organization (MIT). Worth noting that in previous years I included DeepMind papers also under Google, whereas this time DeepMind is separated as its own entity. Microsoft and CMU are also publishing an impressive amount of research.

Next up, we can see the overall stats between 2012-2019. While Google is dominating the previous year, CMU and Microsoft are still ahead in the marathon. Notably, the counts for CMU and Microsoft came out identical(!), with 1,215 papers from both organizations. After them, the main heavy hitters are Google, Stanford, MIT, IBM, Berkeley and Tsinghua.

If we look at the separation in time, we see that Google is sort of a late bloomer. Compared to Microsoft and CMU they were publishing much less between 2012-2016, but have overtaken everyone else by quite a bit after that. All of the top players seem to have an upward trend and they all published more in 2019 than ever before.

Authors

Looking at individual authors, Sergey Levine (Berkeley) published an impressive 33 papers in 2019 - 12 in NeurIPS, 6 in ICML and 15 in ICLR. The other top authors are Graham Neubig (CMU), Yoshua Bengio (Montreal), Zhiyuan Liu (Tsinghua), Tao Qin (MSR) and Tie-Yan Liu (MSR).

Looking at the whole period of 2012-2019, we see that Yoshua Bengio (Montreal) has overtaken Chris Dyer (DeepMind) as the most prolific author. Ming Zhou (MSR), Yue Zhang (Westlake), Noah A. Smith (Washington) and Ting Liu (Harbin) all have more than 90 papers from that period. I have had to remove Yang Liu from the list, as there seem to be two or more people publishing under this name and I was not able to automatically separate them.

And looking at the separation in years shows that Sergey Levine, Graham Neubig and Yoshua Bengio have all overtaken the previous publication record set by Chris Dyer in 2016. They each also published considerably more than they did in the previous years.

First Authors

The authors with the highest publication counts are often the leaders of large groups, coordinating the work. But now let's see the first authors, who are usually the people doing the actual implementation and writing. Gabriele Farina is a 4th year PhD at CMU and he has authored 6 papers in 2019, half of them at NeurIPS. Ilias Diakonikolas (UW Madison), Hanrui Zhang (Duke), Rui Zhang (NUS), Chuhan Wu (Tsinghua), Pengcheng Yang (Peking), Sanjeev Arora (Princeton), Zeyuan Allen-Zhu (MSR) and Mikhail Yurochkin (IBM) were all first authors to 5 publications.

Looking at first-author papers through time, Zeyuan Allen-Zhu (MSR), Jiwei Li (Shannon AI), Ivan Vulić (Cambridge), Ryan Cotterell (Cambridge), Young-Bum Kim (Amazon) and Sanjeev Arora (Princeton) have published the most overall.

Countries

For the first time we are now actually able to analyse which countries published the most in 2019. Admittedly, this graph mainly highlights just how much the US dominates the research in this area. China, UK, Germany and Canada are also putting in a strong effort. China has a proportionally very large presence at AAAI, whereas the US is publishing more in NeurIPS and ICML.

The picture looks very similar when looking at the whole 2012-2019 period.

Through the years, the US has always published much more than everyone else and now the pace has accelerated even more. China is also trying to match this and has increased the lead over all the others by quite a bit. The UK is in a respectable third position.

USA

We can also take a look at the 2019 publishing statistics of individual organizations within each country/continent. Given how much the US publishes, the graph here looks sort of similar to the overall statistics, with Google in the lead.

China

In China, Tsinghua and Peking universities distinctly stand out in terms of publication numbers. The other top ranks are also mostly held by universities, with Baidu and Alibaba being the main industry publishers.

UK

In the UK, DeepMind takes the lead. They are followed by Cambridge, Oxford, Edinburgh, UCL, Imperial and the Alan Turing Institute. Worth noting that the Turing institute is largely virtual, so academic there often have affiliations with other universities as well. Out of the top 7, Cambridge and Edinburgh publish quite a bit in NLP, while the others are focusing mainly on ML.

Germany

In Germany, Darmstadt is the top publisher, with 2/3 of the papers in published in the area of NLP. Bosch is putting up some competition, ranking second with mostly ML papers. Saarland, LMU Munich, Tübingen, TU Munich and the Max Planck Institute for Intelligent Systems are also represented at the top.

Canada

Among the Canadian organizations, University of Toronto stands out with an impressive publication count. Université de Montréal and the Vector Institute are also at the top, along with Alberta, McGill, Waterloo, MILA and University of British Columbia. University of Waterloo seems to be the only one with a larger focus on NLP, with the others publishing mostly in the general ML conferences.

Topic Similarity

To try and capture a bit about the topics in the papers as well, I ran them through LDA and then visualised the results with t-SNE. Looking at the organizations, it is interesting to see how geographical clusters emerge. Chinese universities are at the top, US mostly on the right, Europe on the left, and industry right in the middle.

We can do the same for authors. The closeness in the graph reflects a combination of topic similarity and the frequency of collaboration.

Finally, we can do the same for countries. Given that all countries are working on a range of different topics, this graph is likely more representative of their collaboration frequency.

Additional Statistics

Finally, let's finish off with a couple of fun stats.

That's it for this time. Looking forward to all the interesting work that is going to be published in 2020!

74 Summaries of Machine Learning and NLP Research

My previous post on summarising 57 research papers turned out to be quite useful for people working in this field, so it is about time for a sequel.

Below you will find short summaries of a number of different research papers published in the areas of Machine Learning and Natural Language Processing in the past couple of years (2017-2019). They cover a wide range of different topics, authors and venues. These are not meant to be reviews showing my subjective opinion, but instead I aim to provide a blunt and concise overview of the core contribution of each publication.

Given how many papers are published in our area every year, it is getting more and more difficult to keep track of all of them. The goal of this post is to save some time for both new and experienced readers in the field and allow them to get a quick overview of 74 research papers in about 30 minutes reading time.

I set out to post 60 summaries (up from 50 compared to last time). At the end, I also include the summaries for my own published papers since the last iteration (papers 61-74).

Here we go.

1. Improving Language Understanding by Generative Pre-Training

Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever. OpenAI. 2018.
https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf

A transformer architecture that is trained as a language model on a large corpus, then fine-tuned for individual text classification and similarity tasks. Multiple sentences are combined together into a single sequence using delimiters in order to work with the same model. Reporting high results on entailment, question answering and semantic similarity tasks.

2. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova. Google. NAACL 2019.
https://www.aclweb.org/anthology/N19-1423.pdf

A bidirectional transformer architecture for pre-training language representations. The model is optimized on unlabaled data by 1) predicting masked words in the input sequence, and 2) predicting whether the input sequences occur together. The parameters can then be fine-tuned for a specific task, such as classifying sentences, sentence pairs, or tokens.

3. LXMERT: Learning Cross-Modality Encoder Representations from Transformers

Hao Tan, Mohit Bansal. UNC. ArXiv 2019.
https://arxiv.org/pdf/1908.07490.pdf

Building a cross-modal pre-trained model for both vision and language. Both images and text are encoded and attended over jointly with a cross-modal encoder, the model is then optimized with both unimodal and multimodal tasks (masked LM, image classification, image-caption matching, visual QA).
The model achieves new state-of-the-art on several VQA datasets.

The Geographic Diversity of NLP Conferences

The growth of interest in NLP technology, fuelled largely by investment in AI applications, has been accompanied by unprecedented expansion of the preeminent NLP conferences: ACL, NAACL and EMNLP in particular. Grzegorz Chrupała’s graph shows the rapid growth in submissions to each conference in recent years:

Figure 1. Graph of conference submission counts, by Grzegorz Chrupała

These trends can also be seen in the blog post by the ACL 2019 Chairs. Meanwhile, acceptance rates have tended to remain constant, meaning that the conferences have grown in line with submissions, e.g. from ACL 2019:

Table 1. Acceptance rates at ACL 2019.

NLP has developed as a field narrowly focused on English, a point highlighted by the recent emergence (and need for) the #BenderRule. The ability of NLP models to deal with English is important, due to its status as an international language of politics, commerce and culture, and indeed with these tools we can handle, for instance, more than half the content on the internet. But with such a strong focus on English we miss cross-linguistic insights and lack coverage for the majority of languages in the world.

ML and NLP Publications in 2018

It is time for another yearly update of the publication statistics in Machine Learning and Natural Language Processing. The field has continued to grow very rapidly, both in number of publications and number of attendees, breaking all sorts of previous records. Perhaps most notably the initial release of NeurIPS conference tickets sold out in 11 minutes and 38 seconds. In this post I will provide some finer-grained statistics on these numbers, showing which authors and organizations are publishing most at specific conferences.

This year, I have included the following conferences/journals: ACL, EMNLP, NAACL, EACL, COLING, TACL, CL, CoNLL, NeurIPS, ICML, ICLR, AAAI. This selection aims to cover the most well-known and high-ranking venues for publishing work on both machine learning and language technologies. Compared to last year, I've removed SemEval, as it has a large focus on shared task papers and I'm not including these for other conferences either. I've also added AAAI, which is one of the bigger conferences and was previously missing from the rankings. NeurIPS (previously known as NIPS) changed its name this year, but for consistency I will use the new name to refer to all the previous iterations as well.

This analysis is done automatically with a collection of scripts that I've continued to improve over the years. The paper lists are crawled from online proceedings and author names can usually be found there as well. Organization names need to be extracted straight from the PDFs which can lead to quite a bit of noise. I've created various methods for detecting and mapping different types of names, but let me know if you spot any remaining errors.

While this post highlights authors and organizations who have published the most in the recent year, I want to specify that I do not think that publication quantity is something that we as a field should be pursuing or rewarding. As the graphs below show, the field is becoming more and more popular, and this rapid increase in numbers comes with very varying quality. Authoring 1 piece of groundbreaking work is always better than releasing 10 totally forgettable incremental papers. This post is just meant to give a light high-level view of who is currently publishing and at which conferences, and perhaps provide a bit of inspiration for new researchers with great ideas.

Venues

We start off by looking at the publications at all the conferences between 2012-2018. Most of the ML venues continued their growth in the number of published papers, with AAAI and NeurIPS going past the 1,000 paper mark. EMNLP and NAACL also had their record years by quite a margin, whereas ACL and COLING stayed closer to the previous numbers. EACL took this year to rest, and the number of papers in TACL and CL has remained relatively stable throughout the years.

57 Summaries of Machine Learning and NLP Research

Staying on top of recent work is an important part of being a good researcher, but this can be quite difficult. Thousands of new papers are published every year at the main ML and NLP conferences, not to mention all the specialised workshops and everything that shows up on ArXiv. Going through all of them, even just to find the papers that you want to read in more depth, can be very time-consuming.

In this post, I have summarised 50 papers. After going through a paper, if I had the chance, I would write down a few notes and summarise the work in a couple of sentences. These are not meant as reviews – I’m not commenting on whether I think the paper is good or not. But I do try to present the crux of the paper as bluntly as possible, without unnecessary sales tactics. Hopefully this can give you the general idea of 50 papers, in roughly 20 minutes of reading time.

The papers are not selected or ordered based on any criteria. It is not a list of the best papers I have read, more like a random sample. The only filter that I applied was to exclude papers older than 2016, as the goal is to give an overview of the more recent work.

I set out to summarise 50 papers. Once I was done, I thought this would be a sensible place to summarise my own work as well. So at the end of the list you will also find brief summaries of the papers I published in 2017.

Let’s get started.

1. A Thorough Examination of the CNN/Daily Mail Reading Comprehension Task
Danqi Chen, Jason Bolton, Christopher D. Manning. Stanford. ACL 2016.
https://arxiv.org/pdf/1606.02858.pdf

Hermann et al (2015) created a dataset for testing reading comprehension by extracting summarised bullet points from CNN and Daily Mail. All the entities in the text are anonymised and the task is to place correct entities into empty slots based on the news article.

cnn_daily_mail

This paper has hand-reviewed 100 samples from the dataset and concludes that around 25% of the questions are difficult or impossible to answer even for a human, mostly due to the anonymisation process. They present a simple classifier that achieves unexpectedly good results, and a neural network based on attention that beats all previous results by quite a margin.

2. Word Translation Without Parallel Data
Alexis Conneau, Guillaume Lample, Marc’Aurelio Ranzato, Ludovic Denoyer, Hervé Jégou. Facebook, Le Mans, Sorbonne. ArXiv 2017.
https://arxiv.org/pdf/1710.04087.pdf

Inducing word translations using only monolingual corpora for two languages. Separate embeddings are trained for each language and a mapping is learned though an adversarial objective, along with an orthogonality constraint on the most frequent words. A strategy for an unsupervised stopping criterion is also proposed.

Word Translation Without Parallel Data

ML/NLP Publications in 2017

It has been a very productive year for NLP and ML research. Both areas continued to grow, with conferences reaching record numbers of publications. In this post I will break these numbers down a bit more, by individual authors and organisations. The statistics cover the following venues: ACL, EMNLP, NAACL, EACL, COLING, TACL, CL, CoNLL, *Sem+SemEval, NIPS, ICML, ICLR. Compared to last year, I’ve now included ICLR which has grown very rapidly in the last two years and become a highly competitive conference.

The analysis is done automatically, by crawling publication information from the conference websites and ACL Anthology. Author names are usually listed in the proceedings and easily extractable, however the organisation names are more tricky and need to be extracted straight from the PDFs. I’ve created a number of rules to map together alternative names and misspellings, but let me know if you notice any errors.

Venues

First, let’s look at different publication venues between 2012-2017. NIPS is clearly heading off the charts, with 677 publications this year. Most other venues are also growing rapidly, with 2017 being the biggest year ever for ICML, ICLR, EMNLP, EACL and CoNLL. In contrast, TACL and CL seem to be keeping a constant number of publications per year. NAACL and COLING were notably missing from 2017, but we can look forward to both of them in 2018.

Attending to characters in neural sequence labeling models

Word embeddings are great. They allow us to represent words as distributed vectors, such that semantically and functionally similar words have similar representations. Having similar vectors means these words also behave similarly in the model, which is what we want for good generalisation properties.

However, word embeddings have a couple of weaknesses:

  1. If a word doesn’t exist in the training data, we can’t have an embedding for it. Therefore, the best we can do is clump all unseen words together under a single OOV (out-of-vocabulary) token.
  2. If a word only occurs a couple of times, the word embedding likely has very poor quality. We simply don’t have enough information to learn how these words behave in different contexts.
  3. We can’t properly take advantage of character-level patterns. For example, there is no way to learn that all words ending with -ing are likely to be verbs. The best we can do is learn this for each word separately, but that doesn’t help when faced with new or rare words.

In this post I will look at different ways of extending word embeddings with character-level information, in the context of neural sequence labeling models.  You can find more information in the Coling 2016 paper “Attending to characters in neural sequence labeling models“.

Sequence labeling

We’ll investigate word representations in order to improve on the task of sequence labeling. In a sequence labeling setting, a system gets a series of tokens as input and it needs to assign a label to every token. The correct label typically depends on both the context and the token itself. Quite a large number of NLP tasks can be formulated as sequence labeling, for example:

POS-tagging
DT  NN    VBD      NNS    IN      DT   DT  NN     CC  DT  NN   .
The pound extended losses against both the dollar and the euro .

Error detection
+ +    +  x       +   +      +   +    +    x      +
I like to playing the guitar and sing very louder .

Named entity recognition
PER _      _   _      _  ORG  ORG   _  TIME _
Jim bought 300 shares of Acme Corp. in 2006 .

Chunking
B-NP    B-PP B-NP I-NP B-VP I-VP     I-VP I-VP   B-PP B-NP B-NP  O
Service on   the  line is   expected to   resume by   noon today .

In each of these cases, the model needs to understand how a word is being used in a specific context, and could also take advantage of character-level patterns and morphology.

NLP and ML Publications – Looking Back at 2016

After my last post on analysing publication patterns I received quite a lot of feedback and many feature requests, so I decided to create an update once 2016 is over. It is now quite a bit bigger than before, and includes 11 different conferences and journals: ACL, EACL, NAACL, EMNLP, COLING, CL, TACL, CoNLL, *Sem+SemEval, NIPS, and ICML.

The information used in these graphs was collected through crawling the web. ACL Anthology was very useful, listing papers in a consistent format. However, information such as the organisation names in each paper still needed to be extracted directly from the pdfs, which means there are likely to be some errors. I’ve tried to create exceptions to catch different spelling variations and other anomalies, but if you notice mistakes in the graphs, do let me know.

This analysis shouldn’t be taken too seriously – after all, quality of research matters much more than quantity, and that is considerably more difficult to measure. However, my motivation is to provide a high-level overview of what is happening in the field, where the big players are publishing, and perhaps supply a bit of inspiration and motivation for the new year.

Analysing NLP publication patterns

Recently, I got curious about finding out how much different institutions publish in my area. Does Google publish more than Microsoft? Which university has the strongest publication record in NLP? And are there any interesting trends that can be seen in the recent years? Quantity does not necessarily equal quality, but the number of publications is still a reasonable indicator of general activity in the field, how big the research group is, and how outward-facing are the research projects.

My approach was to crawl papers from the 6 biggest conferences that are relevant to my research: ACL, EACL, NAACL, EMNLP, NIPS, ICML. The first 4 focus on NLP applications regardless of methods, and the latter 2 on machine learning algorithms regardless of tasks. The time window was restricted to 2012-2016, as I’m more interested in current publications.

Luckily, all these conferences have nice webpages listing all the papers published there. ACL Anthology contains records for ACL, EACL, NAACL and EMNLP, NIPS has a separate webpage for papers, and ICML proceedings are on the JMLR website (except for ICML12 which are on the conference website). I wrote python scripts that crawled all the papers from these conferences, extracting author names and organisations. While authors can be crawled directly from the websites, in order to find the organisation names I had to parse the pdfs into text and extract anything that looked like a university or company name in the first 30 lines of on the paper. I wrote a bunch of manual patterns to map names to canonical versions (“UCL” to “University College London” and “Google Inc” to “Google”), although it is likely that I still missed some edge cases.

Theano Tutorial

This is an introductory tutorial on using Theano, the Python library. I’m going to start from scratch and assume no previous knowledge of Theano. However, understanding how neural networks work will be useful when getting to the code examples towards the end.

The plan for the tutorial is as follows:

  1. Give a basic introduction to Theano and explain the important concepts.
  2. Go over the main operations that we have available in Theano.
  3. Look at working code examples.

I recently gave this tutorial as a talk in University of Cambridge and it turned out to be way more popular than expected. In order to give more people access to the material, I’m now writing it up as a blog post.

I do not claim to know everything about Theano, and I constantly learn new things myself. If you find any errors or have suggestions on how to improve this tutorial, do let me know.

The code examples can be found in the Github repository: https://github.com/marekrei/theano-tutorial

1. What is Theano?

CYh2GMnWkAELDTL

Theano is a Python library for efficiently handling mathematical expressions involving multi-dimensional arrays (also known as tensors). It is a common choice for implementing neural network models. Theano has been developed in University of Montreal, in a group led by Yoshua Bengio, since 2008.

Some of the features include:

  • automatic differentiation – you only have to implement the forward (prediction) part of the model, and Theano will automatically figure out how to calculate the gradients at various points, allowing you to perform gradient descent for model training.
  • transparent use of a GPU – you can write the same code and run it either on CPU or GPU. More specifically, Theano will figure out which parts of the computation should be moved to the GPU.
  • speed and stability optimisations – Theano will internally reorganise and optimise your computations, in order to make them run faster and be more numerically stable. It will also try to compile some operations into C code, in order to speed up the computation.