
Author: Marek

ML and NLP Publications in 2021

I am sharing here the yearly paper analysis for 2021, containing statistics about ML and NLP publications from the past year. It has arrived later than in previous years: the new data required some manual cleaning and updates to the pipeline, which took quite a bit longer than intended. But finally, here it is. 🙂

The analysis of the papers is done using a series of automated tools. These processes are not perfect so some noise and errors may occur. Some authors have also recently started releasing their papers in an obfuscated form for some reason, preventing any copying or automated extraction of the content, so these have had to be excluded from some of the analysis. But overall, each year the pipeline gets a bit better and the bugs from previous years get fixed, so it should provide a good picture of the field.

Many thanks to Chen Cecilia Liu and Jonas Pfeiffer for their help with matching organizations to countries!

This post isn't meant to glorify publishing huge numbers of papers. Quality is definitely more important than quantity. In fact, I think our field has gotten too focused on publishing rapidly in large quantities, which tends to give an advantage to quick iterative papers over thorough groundbreaking ideas. The aim of this analysis is just to provide a bit of a higher-level view of what is happening in the field, which institutions are currently the major players and which researchers have the largest groups.

Venues

Let's start by looking at the conferences themselves. The publication numbers for most of the conferences again kept going up and breaking new records. One exception seems to be ACL - this is likely due to a heavier use of the Findings format, which I didn't include in these statistics. AAAI seems to be almost levelling out, whereas NeurIPS still keeps a steady growth rate. COLING and AACL did not take place in 2021, but both EACL and NAACL did.

Organizations

The organization with the most published papers in 2021 (by a wide margin) was Google. Microsoft also manages to offer some competition in that space, with CMU, Stanford, Facebook and MIT ranking after that. Microsoft, CAS, Amazon, Tencent, Cambridge, Washington and Alibaba stand out as having quite a large proportion of papers at NLP conferences, whereas the other top organizations seem to focus mostly on ML venues.

Looking at the statistics for the whole period of 2012-2021, Google with 2170 papers has finally overtaken Microsoft with 2013 papers. CMU, with 1881 publications, completes the top 3.

Most of the organizations have also continued to increase their yearly publication count. Google has finally broken their streak of linear growth in publication numbers, but still released more papers than ever before. CMU had a plateau last year but has made up for it this year. IBM seems to be the only top organization with a slight downward trajectory, likely related to them selling large parts of Watson recently.

Authors

Next, let's look at the researchers who published the most papers in 2021. Sergey Levine (Berkeley, Google) towers above everyone else with 42 papers. Tie-Yan Liu (Microsoft), Jie Zhou (Tsinghua), Mohit Bansal (UNC) and Graham Neubig (CMU) are also among the top publishing researchers.

Looking at the whole period of 2012-2021, Sergey Levine (Berkeley, Google) is again at the top position. Having ranked 6th last year, he had a really prolific year and has overtaken everyone else. Yoshua Bengio (Montreal), Graham Neubig (CMU), Yue Zhang (Westlake), Ming Zhou (Sinovation, Microsoft) and Ting Liu (Harbin) make up the rest of the overall top publishers.

Note that a couple of names have been manually removed from the list. While trying to identify their affiliation, I found that publications from multiple researchers with identical names have been aggregated. Unfortunately I don't have the necessary technology at the moment to separate the publications in such cases.

The breakdown over the years gives an overview of when each researcher has published most. Sergey Levine has set the new overall record, by quite a margin. Mohit Bansal also increased his paper output by quite a lot, releasing 31 publications in 2021, the same number as Graham Neubig. Yoshua Bengio had a decrease in the number of papers in 2020, but that count is now back up again.

First Authors

The researchers with the largest numbers of papers are generally supervisors for many post-docs and students working in their group. In contrast, first authors are usually those who do the practical work, so it's good to analyse their counts separately.

Ramit Sawhney (Tower Research Capital, IIIT Delhi) published a very impressive 9 papers in 2021. Jason Wei (Google Research) and Tiago Pimentel (Cambridge) also stand out with 6 publications.

Looking at the year range 2012-2021, Ivan Vulić (Cambridge, PolyAI) and Zeyuan Allen-Zhu (Microsoft) have both managed to publish an impressive 24 papers as first authors. Yi Tay (Google) and Jiwei Li (Shannon.AI, Zhejiang, Stanford) are ranked next, with 23 and 22 papers, respectively. Ilias Diakonikolas (UW Madison) has a notable 15 first-author NeurIPS papers. Haris Aziz (UNSW Sydney) has published exclusively in AAAI.

Countries

Looking at the 2021 publication counts by country highlights how much the United States publishes. China and the UK are also among the top 3. NeurIPS has the largest proportion for the US and the UK, while AAAI is the preferred venue for China.

Nearly all the top countries have continued to increase their publication counts, setting new individual records in 2021. For the US, this increase is the largest, further widening the lead.

USA

As the US publishes so much, their graph for 2021 looks very similar to the overall graph. Google, Microsoft and CMU are again at the top of the publishing counts.

China

In China, Tsinghua, CAS and Peking published the most in 2021.

UK

DeepMind, Oxford and Cambridge are towering above the rest in the UK.

Canada

Toronto really stands out in Canada, with the Vector Institute, McGill and Montreal also being well-represented.

Germany

In Germany, Tübingen and Robert Bosch published more than 40 papers.

South Korea

In South Korea, KAIST has an impressive lead over the other organizations. Seoul and Samsung also stand out.

Topic Similarity

For this section, I ran the papers through LDA and then visualised them using t-SNE.

Visualization of the organizations shows that they are mostly clustered according to geographical proximity, with companies centered in the middle.

We can do the same visualization for the authors, although these clusters are a bit more difficult to interpret.

The countries are again clustered based on geography, this time with USA in the middle of the graph.

Keywords

We can also plot the proportion of papers that contain a particular keyword and track how this has changed over time.
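
As a rough illustration of the statistic being plotted, here is a minimal sketch that computes, for each year, the fraction of papers containing a keyword. The function name, the (year, text) input format and the case-insensitive substring matching are all assumptions made for this example, and the toy papers are invented; the actual pipeline may well count keywords differently.

```python
from collections import defaultdict


def keyword_proportion_by_year(papers, keyword):
    """Return {year: fraction of that year's papers containing `keyword`}.

    `papers` is an iterable of (year, text) pairs. Matching is a simple
    case-insensitive substring test, which is an assumption of this sketch.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    needle = keyword.lower()
    for year, text in papers:
        totals[year] += 1
        if needle in text.lower():
            hits[year] += 1
    return {year: hits[year] / totals[year] for year in sorted(totals)}


# Toy data, invented purely for illustration.
toy_papers = [
    (2019, "A recurrent neural model for tagging"),
    (2019, "Convolutional networks for parsing"),
    (2021, "A transformer for translation"),
    (2021, "Transformer pre-training at scale"),
    (2021, "Bayesian optimisation of kernels"),
]
proportions = keyword_proportion_by_year(toy_papers, "transformer")
```

The resulting dictionary maps each year to a value between 0 and 1, which can then be passed to any plotting library to draw one line per keyword or per venue.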

The word "neural" seems to have a very slight downward trend, although it can still be found in more than 80% of the papers. The proportions of papers containing "recurrent" and "convolutional" are also decreasing, whereas "transformer" can now be found in more than 30% of the papers.

Looking at just the keyword "adversarial", we see that it is particularly popular in ICLR, with almost half of the papers using it. The count for "adversarial" seems to have peaked for ICML and NeurIPS, while it has been steadily increasing for AAAI.

The keyword "transformer" has gotten very popular in the past couple of years. It is particularly widely used in NLP papers, with over 50% of the publications containing it, but its popularity is also steadily increasing at all the ML conferences.

Fun Facts

And a couple more random facts about 2021 to finish:

Advice for students doing research projects in ML/NLP

This is a collection of advice that I give to students doing research projects in NLP/ML/AI. It includes suggestions that I wish I had known when I myself first started, as well as lessons from supervising students in previous years.

I would recommend reading this once before starting your project, then again about a month or two into the project. Different things will seem relevant to you each time.

Implementation and debugging

  1. Don’t assume that your code is bug-free – check and debug. Deep learning code can be difficult to get right. Make sure the tensors have the right shapes and dimensions. Occasionally you’ll need to think in 5 dimensions and it’s easy to get it wrong. Make sure you know what the library functions are actually doing instead of blindly applying them. Some PyTorch and TensorFlow functions make it easy to apply complicated operations but they also make it easy to confuse what their input and output actually contain. It is always a good idea to test sections of the code with small toy examples.
  2. Whenever you assume something about your data or output, put an assert into the code which will check that this assumption is always true. For example, if you know how long the input list should be or that the output should contain only values in a certain range. Many times this has helped me catch issues that I didn’t expect.
  3. While automated checks are useful, they won’t catch everything. So you also have to manually check your model output and performance. Make sure that the output looks as expected and that the results make sense. If you expect one configuration to perform better than another, but the results show something different, then investigate and figure out why that is. Sometimes you find a bug, other times you learn something new about the model or the task.
  4. When your model isn’t learning anything, there are two diagnostic tests I recommend running: 1) Make sure that your model can memorize a small training set. For example, a model should be able to easily memorize (get 100% performance) a training set with 10 examples. If not, this indicates a problem with the code or the data. 2) If your model takes features, give the label as an input feature and train on the whole dataset. Evaluate on the dev set, also giving labels as input features. The model should get perfect performance both on the train and dev set. If not, this indicates issues with network connections or optimization.
  5. During training, evaluate performance on the training and development sets at certain intervals. For most models, the performance on both will initially go up. Then at some point performance on the development set will start to drop while the performance on the training set will keep improving, which indicates overfitting. To deal with this, use early stopping – store a copy of the model whenever it outperforms all previous iterations on the development set. Once the training is done, load back the stored best model. To avoid unnecessary training time, you can stop the training when the best performance on the development set hasn’t improved for a certain number of epochs. This is a robust default strategy for training neural models. If you use some other strategy, then it is best to cite the paper where you got it from. Note: Early stopping is becoming less important with large pre-trained language models and it’s common to just train for a fixed number of 2-3 epochs. We have some experiments investigating this here.
  6. Use version control and external backups. You don’t want to suddenly lose all your hard work. It’s also frustrating when you get your system to work, make some edits that break it and then have no idea which exact edits need to be reversed. Git is good; GitHub and Bitbucket both allow free private repositories.
  7. Hyperparameter tuning is still a dark art, as specific values often interact with other parameters and it’s not computationally feasible to actually try all of their combinations. With time and experience, you will get better at predicting which parameters work best for specific architectures. Until then, start with parameter values that others have used in related work and explore how changing different values, in isolation or in small parameter groups, affects the model.
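
The early stopping strategy from point 5 can be sketched roughly as follows. This is a minimal, framework-free illustration rather than a definitive implementation: `train_one_epoch` and `evaluate_dev` are hypothetical callbacks that you would supply yourself, a higher development score is assumed to be better, and the toy "model" is just a dictionary.

```python
import copy


def train_with_early_stopping(model, train_one_epoch, evaluate_dev,
                              max_epochs=100, patience=5):
    """Train until the dev score has not improved for `patience` epochs.

    `train_one_epoch(model)` and `evaluate_dev(model)` are hypothetical
    callbacks provided by the caller; higher dev scores are assumed better.
    Returns (best_model, best_score, best_epoch).
    """
    # In the spirit of point 2: check assumptions about the arguments.
    assert patience > 0 and max_epochs > 0
    best_model, best_score, best_epoch = None, float("-inf"), -1
    for epoch in range(max_epochs):
        train_one_epoch(model)
        score = evaluate_dev(model)
        if score > best_score:
            # Store a copy of the model whenever it beats all earlier epochs.
            best_model = copy.deepcopy(model)
            best_score, best_epoch = score, epoch
        elif epoch - best_epoch >= patience:
            # No improvement for `patience` epochs: stop training.
            break
    return best_model, best_score, best_epoch


# Toy usage: the "model" is a dict and the dev scores follow a fixed curve.
dev_curve = [0.5, 0.6, 0.7, 0.65, 0.64, 0.63, 0.62]
toy_model = {"step": 0}


def toy_train(model):
    model["step"] += 1


def toy_eval(model):
    return dev_curve[model["step"] - 1]


best_model, best_score, best_epoch = train_with_early_stopping(
    toy_model, toy_train, toy_eval, max_epochs=len(dev_curve), patience=3)
```

In the toy run the dev score peaks in the third epoch, so training stops three epochs later and the stored copy from the peak is returned instead of the final, overfitted state. With a real neural model you would typically save a checkpoint to disk rather than deep-copying the model in memory.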

Evaluation

  1. Put extra care into making sure that the evaluation is implemented correctly. If you have bugs in your model, then the model usually won’t work well and you would discard it. But if you have bugs in your evaluation, you might end up reporting high results for a model that doesn’t actually work. Finding out later that your published results are all imaginary is not a situation you want to be in.
  2. Make sure to set up separate train/dev/test sets. Use the training set for parameter optimization, the development set for choosing hyperparameters and early stopping, and the test set for final evaluation. You can use the training set in more flexible ways, but you’re not allowed to edit or subsample the test set. For example, you can leave out sentences that are too long in the training set, but you can’t do that for the test set. It needs to remain fixed in order to be comparable to other published results. Cross-validation is also an option if the dataset is too small for a fixed split. If previous researchers have split the data already, then you need to use the same splits as them; don’t create a new split without a strong reason.
  3. Learn to handle randomness in your models. Random seeds affect various parts of the processing stream – shuffling of the data, initialisation of the weights, etc., so it’s a good idea to try to set them. However, sometimes neural components will give random results anyway. This is down to how the toolkits are implemented and we don’t have much control over them. GPU operations are highly parallel and when results finish at different times, some functions can combine them in a different order, which results in small rounding errors. These can compound and cause noticeably different results when running the same experiment multiple times, even when controlling the random seeds. To compensate for this, I usually run each experiment several (e.g. 10) times and report the average result. For full scientific rigor, you should also calculate and report the standard deviation and/or confidence interval of these results. Unfortunately, many people just tend to report a single result from a neural model, which in some cases can be a misleadingly high value that is later difficult to replicate.
  4. Compare performance to published results to understand if your model is working as well as it should. Find a similar model in the literature, set up your model configuration based on theirs, train on the same data and see if you are able to replicate their results. If you are working on an established task, start by getting your baseline performance to the previous state-of-the-art level. That way any improvement that you achieve will set a new state-of-the-art. It is tempting to show improvements over a weaker baseline, but pretty much anything will give an improvement if the baseline is weak enough. Trying to explain why your best system is still outperformed by a stronger baseline would be a massive headache later when trying to publish or defend your work.
  5. Perform significance tests on your final results when possible. It adds credibility to your results. I recommend the randomization test, but any test is better than no test. If you show a significant improvement through an actual significance test, make that clear in the paper and state the test that was used. However, if you haven’t performed any significance test, then do not use phrases like “results are significantly better” when describing your work as that will appear misleading.
  6. If your novel system has multiple novel components, then for a convincing evaluation you need to perform an ablation experiment. This means either a) removing individual components from the full system to see how much performance drops, or b) incrementally adding individual components to the simple system to see how much performance improves at each step. The goal is to measure the benefit of each component separately, to make sure that they all are actually useful. Otherwise, if you only evaluate the combined system, then it could easily be that only one modification is providing all the improvements and the others are useless or even lowering performance.
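
Points 3 and 5 above can be illustrated with a short sketch: one helper that reports the mean and standard deviation over repeated runs, and one that estimates a p-value with a paired approximate randomization test. The function names, the number of trials and the add-one smoothing are choices made for this example, and the scores are invented toy values; treat it as one possible version of the test, not a reference implementation.

```python
import random
import statistics


def summarize_runs(scores):
    """Mean and sample standard deviation over repeated runs of one experiment."""
    return statistics.mean(scores), statistics.stdev(scores)


def paired_randomization_test(scores_a, scores_b, trials=10000, seed=0):
    """Estimate a two-sided p-value for the difference in mean score.

    `scores_a` and `scores_b` are per-example scores of two systems on the
    same test examples (e.g. 1 for correct, 0 for incorrect).
    """
    assert len(scores_a) == len(scores_b), "systems must share the same examples"
    rng = random.Random(seed)
    n = len(scores_a)
    observed = abs(sum(a - b for a, b in zip(scores_a, scores_b))) / n
    at_least_as_extreme = 0
    for _ in range(trials):
        diff = 0
        for a, b in zip(scores_a, scores_b):
            # Under the null hypothesis the system labels are exchangeable,
            # so swap each pair of outputs with probability 0.5.
            if rng.random() < 0.5:
                a, b = b, a
            diff += a - b
        if abs(diff) / n >= observed:
            at_least_as_extreme += 1
    # Add-one smoothing keeps the estimate away from an exact zero.
    return (at_least_as_extreme + 1) / (trials + 1)


# Toy values, invented for illustration.
mean_score, std_score = summarize_runs([80, 82, 84])
p_same = paired_randomization_test([1, 0, 1, 1], [1, 0, 1, 1], trials=200)
p_different = paired_randomization_test([1] * 20, [0] * 20, trials=2000)
```

Two identical systems give a p-value of 1, while a system that is correct on every example against one that is wrong on every example gives a very small p-value. The per-example scores here are integers; with floating-point scores the `>=` comparison may need a small tolerance.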

Project planning

  1. In the beginning of your project, start by reading the papers related to your topic. If you only know a small number of related papers, try looking at the papers that are referenced in these. Google Scholar also has a “Cited by” button at each paper that will show you other later papers that reference this particular paper. Simple keyword searches on Google and ArXiv are useful too. Get familiar with the current state-of-the-art, try to get ideas that you can integrate into your project, while also looking for existing models that you can compare against. If you don’t yet feel comfortable with the toolkits that you will be working with, focus on getting some practice with these as well.
  2. If there is any previous work on your task, ideally the state-of-the-art, then start by reimplementing that. If they already have the code available then great, just get it running. Otherwise, write the code, train the model on the same data and evaluate the same way as they do. If you get the same results, then great, you know that your system and evaluation are working. If not, debug and improve until they are working, or find out exactly what is the cause of the difference. You don’t want to have a fundamental bug somewhere deep in your code all throughout the project.
  3. Implement a simple version of your idea early and quickly. Don’t spend a long time designing and implementing a complex architecture in the beginning. Instead, start by implementing and training the simplest system you can imagine, which shouldn’t take longer than a week or two. Make sure you are getting reasonable results and that the evaluation is set up properly. Only then should you build on it and improve the results with more complex models. It is much easier to debug a basic system and this will give you a chance to get to know the problem and the dataset. Plus, you can always use the original model as a baseline in your work.
  4. Make a plan, even if you don’t follow it. Making a plan is a useful exercise, as it makes you think through all the different parts of the project that need to be completed, how they fit together, how much time they might take and what might go wrong. After you have created the plan, you will have a reasonable idea in which direction you are going, hopefully sanity-checked by your supervisor. However, once you actually start work on the project, you will learn new things about the problem and get better ideas, so this plan is likely to change and improve. And that is fine. The main thing is to always have a plan for where you are heading based on current information; when you get new information, feel free to revise that plan.
  5. Whenever possible, aim to push the state-of-the-art performance on your task. There are exceptions and chasing the highest number should not be a goal by itself, but it helps with making sure that your contribution is actually advancing the field. Otherwise, showing improvements only over a trivial baseline will imply that this problem has already been solved by other methods but you are trying to ignore or hide them. On the other hand, if you start with an existing state-of-the-art system and improve on that, then it is much easier to claim that you have made a novel and useful finding. Setting a new state-of-the-art also usually means it will be easier to publish this work as a paper.
  6. When brainstorming ideas, think in terms of what additional information or capability is not currently available to the model but would help it make better decisions. Trying to blindly integrate the flavor-of-the-month ideas into every model is not a good strategy.
  7. When using computation resources that have quotas, make sure to use your available time/credit sensibly. Estimate how long your experiment will take (or set a limit), multiply by the number of jobs you are starting and make sure that, at the rate you are spending them, your credits will last until the end of the project. If you’re just starting, it is better to start cautiously: your code is likely to still have bugs and you might need to rerun all of these experiments anyway. If you’re finishing the project and know exactly which experiments are left to write a good report, you can take full advantage of the parallel computation. Just be sensible – every year there is someone who blows their whole GPU cluster credit in the first week of the project. You don’t want to be stuck training your neural models on CPUs.
  8. Look at some existing dissertations to get a feel for what they should look like and what kind of content is expected. Do that in the beginning of the project, as opposed to the end.
  9. When choosing topics to work on, I recommend choosing something that is close to your heart and something that makes a difference and drives the field forward. In the past, I have found myself solving a question only to realise that it doesn’t really need to be answered.
  10. Reading related work is very important, but try to learn from other fields as well. Read about research in different areas of machine learning, NLP, cognitive science, neuroscience, linguistics, vision and signal analysis. By broadening your world view, you can bring new ideas to your research topic and find new exciting applications in other domains.
  11. If your group has reading groups, take active part in them. They motivate you to read papers outside of your narrow topic, which expands your view and can give new ideas that can be adapted for your project. Reading groups also encourage interaction between the group members and this can lead to very interesting discussions. Volunteering to present allows you to choose the topic that you would like the group to discuss, lets others know what you are working on and it’s a good way to get some experience with public speaking. 
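
The budgeting arithmetic from point 7 is simple enough to write down. The sketch below assumes credit is measured in GPU-hours; the function names and all the numbers are made up for illustration, so adapt the units to however your cluster accounts for usage.

```python
def days_of_credit_left(credit_gpu_hours, jobs_per_day, hours_per_job,
                        gpus_per_job=1):
    """Days until the credit runs out at the current spending rate."""
    daily_spend = jobs_per_day * hours_per_job * gpus_per_job
    if daily_spend <= 0:
        return float("inf")
    return credit_gpu_hours / daily_spend


def max_jobs_per_day(credit_gpu_hours, days_remaining, hours_per_job,
                     gpus_per_job=1):
    """Largest whole number of daily jobs that lasts until the project ends."""
    cost_per_daily_job = days_remaining * hours_per_job * gpus_per_job
    return int(credit_gpu_hours // cost_per_daily_job)


# Invented example: 2000 GPU-hours of credit, 90 days left in the project,
# and a plan to launch 8 single-GPU jobs per day at 5 hours each.
days_left = days_of_credit_left(2000, jobs_per_day=8, hours_per_job=5)
sustainable_jobs = max_jobs_per_day(2000, days_remaining=90, hours_per_job=5)
```

In this invented scenario the credit would run out after 50 days, well before the end of a 90-day project, and the sustainable rate is 4 jobs per day.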

Writing and Publishing

  1. Start writing early. Don’t underestimate the difficulty of writing up a project report. Communicating your research, motivation and architectures clearly to the reader and presenting a coherent story takes effort. Plus, you might find during writing that in order to present a full picture you need to go back and run some additional experiments. Finish your write-up early enough so that someone, probably your supervisor, has time to read it through and give you feedback and you have time to implement changes based on this feedback.
  2. Proof-read your writing. This applies to any writing – reports, dissertations and papers. Grammatical errors, spelling errors and repeated words leave an overall sloppy impression and can bias a reviewer in a negative direction even if the work itself is good.
  3. I usually recommend aiming to publish a paper based on your project. Publishing your findings is important for the field and having a peer-reviewed publication makes your CV positively stand out in future applications to any research-related positions. It is the same work you will be doing for your report/dissertation anyway, might as well also get a paper out of it. You just have to make sure that the work is good enough to be publishable.
  4. Many venues provide an opportunity for a rebuttal, to reply to the reviewers. These don’t result in an increased score very often, but it is still useful to write one – as an exercise for yourself and to better inform the reviewers, so that they get an informed picture of your paper. Focus on the areas that can be objectively argued, for example, if they have misunderstood something. Sometimes the reviewer just subjectively doesn’t like the paper and there’s not much that can be done about that.
  5. Papers get rejected, it’s very common and happens to everyone. Use this as a learning experience, improve the paper based on the feedback and then resubmit. Improvements can include additional experiments that the reviewers requested, clarifying areas that confused them or changing the way that the work is presented in order to focus more on a slightly different aspect.
  6. Most of the time you will not agree with the reviewers’ comments. You will probably think “Are they stupid? I clearly covered that issue in Section X”. You’ll have to accept the fact that reviewer quality varies quite a lot and most of them have very little time to devote to your paper. That is why it’s up to you to make the paper so clear and well-written that even the worst reviewers are able to follow it. Standing your ground and fighting with the reviewers will not get your paper accepted. Instead, think about how you can convince the reviewers or best clarify their misunderstandings. The reviewers represent a sample of your potential future readers as well, so any improvements you make also translate to your audience understanding the paper better.
  7. In terms of the author list, the main author usually goes in the first position, the main supervisor in the last position and everyone else who substantially contributed to the paper would be added in between.
  8. Read the submission requirements provided by the conference/journal and stick to them. These are quite strict. Being over the page limit by one line is enough to get the paper rejected even without a review. Messing with the page margins and font sizes will also disqualify a paper. Make sure to use the required LaTeX template for that particular conference – they will vary between conferences.
  9. ArXiv is a useful place to submit a paper for extra publicity. It acts as a version-controlled repository of non-peer-reviewed research papers. Some things to keep in mind: 1) Anyone can upload any paper to ArXiv so having it there is not the same as publishing it at a conference or journal. 2) Once you upload a paper, it is there forever. You can still update the paper but the old versions will also be publicly visible. 3) Many conferences now have a mandatory anonymity period, which starts before the submission deadline, during which you can’t upload the paper anywhere, including ArXiv. That means you either need to complete and upload the paper before this period starts or you have to wait until the conference decisions have been sent out. 4) ArXiv keeps the source LaTeX of your paper and it will be available for anyone to download. You might want to clean it up a bit before submitting, removing notes and comments that are not meant for the public.
  10. Don’t be afraid to reach out to researchers beyond your immediate research group. This can be in order to ask for clarification about a paper they have published or initiating a collaboration. They can bring in new expertise and ideas, which would benefit everybody.
  11. When collaboratively writing a paper, Overleaf is a good platform to use. Many universities have a paid subscription to Overleaf – try adding your university email to your account to see if you get upgraded for free. It should work for Imperial and Cambridge.
  12. Do not use Apple’s Preview or Quartz for creating your pdfs. They have been known to create pdfs that cannot be opened by Adobe Acrobat. This in turn can lead to a paper being rejected without review. Use pdflatex or download the pdf from Overleaf.
  13. If you need to submit an Early Stage Report (this is specific to Imperial) or a background report, create this as a very early version of your thesis/dissertation. Structure the report similarly to how you would structure the thesis. Have a clear background section that covers previous work. And use the style files that you would for a thesis. You can then extend this same report for your final work later.

Interacting with your supervisor

  1. Remember that you are in charge of your project. The role of the supervisor is to give advice and to help direct you onto a promising path. But this is your research project, so the supervisor is not supposed to plan out every experiment that needs to be run. It is up to you to plan your research (with the help of your supervisor), decide which experiments need to be run and also brainstorm new ideas. Take active charge of your project and set your own targets, instead of expecting to tick off a weekly checklist created by the supervisor.
  2. At the same time, listen to your supervisor. If they tell you to do something specific, you should probably do it. This can be your first similar project whereas they have done many of them before. They most likely know more about the field and what is necessary in a report or paper. Ideally, they will give you freedom to explore, but if they tell you to get something specific done then there is probably a good reason for it.
  3. Meet your supervisor, ideally once a week. Use this to give an update on your work and discuss any questions you might have. The regular weekly meeting helps to keep you motivated and progressing each week.
  4. Your supervisor is probably working on many other projects in addition to yours. When you meet up after a week, it’s good to give a bit of context and summary regarding the state of the project and which part you are currently focusing on, before jumping into the fine-grained details. This helps get the supervisor onto the right wavelength.
  5. The supervisor is not always right. Listen to them and their arguments, as they probably have more experience than you. But if you are still not convinced then find a way to learn more or try your idea out empirically. In the worst case, you will learn something new. The supervisor wants what’s best for you, but they’re not all-knowing oracles. I’ve heard plenty of stories about supervisors in the early 2010s telling their students not to work on neural networks because ‘that field isn’t going anywhere’.
  6. Feel free to contact your supervisor with questions. Don’t wait until the next meeting if you have a question that stops you from progressing. However, please don’t send your supervisor questions that you can answer by googling.
  7. During the project you should become more of an expert on your specific research topic than your supervisor. The supervisor provides a starting point, a broad overview and the experience. But it is your job to go in depth and find out everything you can about the specific questions that you are trying to answer in your project – get to know the previous related work, the latest developments, the available datasets and benchmarks, etc. For at least a short moment, you should be the world-leading expert on your very narrow research question.

If you made it this far, well done. If you are just starting with your project, I imagine some of these points might not mean much to you yet. I recommend going through this list again after you have worked on the project for a month or two.

ML and NLP Publications in 2020

I ran my paper analysis pipeline once again in order to get statistics for 2020. It certainly was an unusual year. While ML and NLP conferences again had more publications than ever before, most of them needed to quickly adapt to a new remote format. Each conference took a slightly different approach as everyone was trying to figure out how to make this work. I heard especially positive comments about EMNLP 2020, regarding their smooth organisation and engaging technical solutions. Overall, I think the remote format has its pluses and minuses - while it certainly complicates networking and socialising, it also makes these conferences much more accessible to a wider range of audience. Hopefully the online participation options will be made available even after we are able to have in-person meetings again.

This post includes the analysis of publications from the following conferences and journals: ACL, EMNLP, NAACL, EACL, AACL, COLING, TACL, CL, CoNLL, NeurIPS, ICML, ICLR, AAAI. All the information is crawled and processed automatically from the corresponding proceedings and directly from the pdf files. Some noise likely still remains, so the graphs are more indicative of general trends as opposed to specific paper counts. Big thanks to Jonas Pfeiffer (@PfeiffJo) for adding the country annotations to all the new organisations, so that we can also run the country-level analysis below.

These statistics are not meant to imply that the quantity of publications is an important measure of a researcher. Having one groundbreaking paper is always more meaningful than churning out many forgettable pieces. But it can be a good way of getting an overview of active research groups and hopefully it can inspire new researchers to publish their own work.

Venues

Let's start by looking at the overall number of papers published at different venues. Following the trends from the previous few years, most venues continued to break records in this regard. ML and AI conferences such as NeurIPS, AAAI and ICML had particularly large increases in paper numbers. ACL and EMNLP had more modest increments, but still published more than ever before. CoNLL saw a slight decrease in numbers, possibly due to narrowing their focus back towards linguistics as opposed to engineering. AACL was introduced as a brand new conference for the Asia-Pacific Chapter of ACL.

Organizations

Looking at which organizations published most papers in 2020, it is clear that Google dominates this space. Microsoft holds a respectable second position and CMU is the top publishing university. MIT, Berkeley, DeepMind and Oxford are mostly publishing only at ML conferences. In contrast, Microsoft, Tencent, Uni. of CAS, Alibaba and Amazon have significant proportions of their publications at NLP conferences.

Looking at the statistics for the whole 2012-2020 period, the top three positions are nailbitingly close: Microsoft still leads with 1580, then Google with 1570 and CMU with 1537.

Most of the top publishing organizations also have upward trajectories through the years. Google has a bizarrely straight line going up at 45 degrees for the past few years; I can almost imagine some strategist drawing this with a ruler in their company plan.

Authors

Next, let's look at the researchers who published the most papers in 2020. Graham Neubig from CMU ranks first in this aspect, with 31 publications. Others with impressive numbers of papers include Yue Zhang (Westlake University), Sergey Levine (UC Berkeley), Ting Liu (Harbin Institute of Technology), Zhiyuan Liu (Tsinghua University) and Ming Zhou (Microsoft Research Asia).

Comparing the counts across the whole period of 2012-2020, Ming Zhou from MSR Asia has taken the overall lead. He was ranked third in 2019 and published an impressive 28 papers in 2020, while Yoshua Bengio and Chris Dyer (who were the top two until last year) have considerably scaled down their paper numbers this year.

The breakdown over the years gives an overview of when each researcher has published most. Yue Zhang and Ting Liu have both improved their overall ranking, having had a particularly successful last year in terms of publishing. While Sergey Levine still holds the overall record of most papers per year, Graham Neubig managed to publish the most in 2020.

First Authors

The most published researchers are generally supervisors for many students/post-docs performing the practical experiments. In contrast, first authors are usually those who do the legwork, implementing the actual code and writing much of the paper.

Normally, I would make a chart showing the most publishing first authors from the last year. However, this time all the top ranks were cases where two or more people were publishing under the same name. I don't currently have any technology that can automatically identify and disambiguate these authors (future research project perhaps?) so I'm skipping this graph and jumping straight to the overall statistics between 2012-2020. Zeyuan Allen-Zhu (MSR AI), Jiwei Li (Shannon.AI) and Ivan Vulić (Cambridge/PolyAI) have the most impressive publication records as first authors, with Zeyuan publishing in the ML area, Jiwei and Ivan publishing mostly in NLP.

Countries

Separating the statistics by country highlights just how much the United States publishes. China is giving a strong effort as well, with the UK being third.

While China is definitely publishing more and more every year, it seems the US still manages to increase the lead. Overall, there is a general trend for more papers and most countries continue to increase their scientific output.

USA

As the United States publishes so much, the USA-only breakdown looks very similar to the combined graph, with Google, Microsoft and CMU in the lead.

China

In China, Tsinghua, Peking, the Chinese Academy of Sciences and Tencent are the top publishing organizations.

UK

In the UK, DeepMind is the top publisher and the large majority of their papers are in NeurIPS, ICLR or ICML. Oxford and Cambridge are also not far behind, with Cambridge having a much larger NLP proportion than the other two.

Germany

Robert Bosch GmbH is the top publishing organisation in Germany this year, with a very impressive NeurIPS result. Darmstadt comes in second as the top university and with a much larger NLP output. Tübingen ranks third with almost all papers in ML conferences.

Canada

In Canada, University of Toronto is the top ranking organisation in terms of publishing. Vector Institute ranks second and McGill third.

Topic Similarity

Just for fun, I also ran the papers through LDA and then visualised them using t-SNE.

Looking at the top publishing organisations, there is a clear geographic separation happening on the left and right side of the graph. In the middle we have global companies, such as Google, Amazon and Microsoft.

Another LDA graph for the top publishing authors of 2020. If you spot any interesting clusters here, let me know in the comments as well.

Finally, plotting the countries. Geographic proximity seems to carry over directly to paper similarity. One possible explanation is that neighbouring countries are more likely to publish together, which would make their paper collections also more similar.

Keywords

As a new addition, I thought it might be informative to also plot the usage of different keywords in papers. Here the y-axis represents the proportion of papers that contain that specific keyword, as opposed to an absolute count.
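To make the y-axis definition concrete, here is a minimal sketch of how such a proportion could be computed. It assumes each paper is available as a `(year, text)` pair and uses plain case-insensitive substring matching; the function name `keyword_proportions` and the toy data are invented for illustration and are not the actual pipeline behind the graphs.

```python
from collections import defaultdict


def keyword_proportions(papers, keyword):
    """Return {year: share of that year's papers whose text contains keyword}."""
    totals = defaultdict(int)  # papers seen per year
    hits = defaultdict(int)    # papers per year that mention the keyword
    needle = keyword.lower()
    for year, text in papers:
        totals[year] += 1
        if needle in text.lower():
            hits[year] += 1
    # Dividing by the yearly total gives a proportion rather than an absolute count.
    return {year: hits[year] / totals[year] for year in sorted(totals)}


# Toy data, invented for illustration only.
papers = [
    (2019, "A neural model for parsing"),
    (2019, "Bayesian inference for topic models"),
    (2020, "Neural machine translation with a transformer"),
    (2020, "Transformer pre-training; code at github"),
]
print(keyword_proportions(papers, "neural"))
```

Normalising by the number of papers per year is what keeps the curves comparable even as the venues grow; an absolute count would rise for almost any keyword simply because more papers are published.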

It is interesting to see how the word "neural" has increased in popularity over the years, now plateauing at over 80% of all the papers. "Recurrent" and "convolutional" have taken downward directions, while "transformer" has a steep rising trajectory. It is nice to see a gradual rise of "github", which implies that more papers are making their source code available.

Just looking at the keyword "neural", we see that almost all papers in ICLR and NeurIPS are mentioning it. The proportion is consistently high also in other conferences, although not quite to the same level. Interestingly, EMNLP, CoNLL and CL have actually reduced the proportion of neural papers in recent years.

At least in CoNLL, some of those "neural" papers have been replaced by "bayesian" papers instead. For EMNLP, the trend is more stable. For most conferences, the "bayesian" proportion seems to have a slight downward trend overall, with CoNLL and ICLR being the exceptions.

Distribution of "github" shows indeed that the practice of releasing source code is becoming more common in all the conferences. The NLP conferences are still doing much better in this respect, compared to all the ML conferences, with TACL and EMNLP having close to 80% of the papers mentioning github. In contrast, ICLR is around 60% and AAAI is the lowest with 40%.

Fun Facts

And a couple more random facts to finish:

That's it for this year. I hope you found something interesting. The full processed dataset of published papers is available in this repository: https://github.com/marekrei/ml_nlp_paper_data

Have a great year of publishing excellent work! May reviewer #2 be merciful to you. And if you invent AGI in the next 12 months, make it benevolent.

ML and NLP Publications in 2019

It is about time we once again take a look at the publication statistics of the past year. 2019 was another record breaking year in machine learning and NLP research. Nearly all conferences had more attendees and more publications than ever before. For example, NeurIPS had 6,743 submissions and 1,428 accepted papers, which eclipses all the previous iterations. Because the conference sold out so fast last year, the organizers had to implement a randomised lottery for the tickets this time.

In this post you will find a number of graphs to illustrate the publication patterns from 2019. I have included the following conferences and journals: ACL, EMNLP, NAACL, EACL, COLING, TACL, CL, CoNLL, NeurIPS, ICML, ICLR, AAAI. The paper information is crawled and processed automatically from the corresponding proceedings. While names and titles are quite easily accessible, getting the author affiliations is the noisiest part of the process, as this needs to be extracted directly from the pdf files. I've kept improving the pipeline every year so it should be more accurate than any of the previous iterations. If you do spot some mistakes, let me know.

The analysis this year includes some brand new statistics and graphs. This is thanks to Jonas Pfeiffer (@PfeiffJo) and Andrew Caines (@cainesap) who annotated the extracted organization names with origin countries. Andrew started the process with a subset of 2018 conferences for our joint article on The Geographic Diversity of NLP Conferences. Jonas then scaled this up to cover all of the organizations in all the conferences in this analysis. This allows us to create some new country-specific visualisations, which you'll see below.

While this post highlights authors and organizations who have published the most in the past year, I want to emphasize that publication quantity should not be used as the main metric for good research. It's always better to have one really groundbreaking and influential piece of work over 10 forgettable and incremental papers. These statistics are just meant to give an informative overview of the most active publication centres in this field and hopefully inspire some new researchers to publish their own ideas.

Venues

Essentially all conferences had record-breaking numbers of publications in 2019. The journals have a fairly stable publication rate, COLING and EACL didn't happen this year, and every other conference just grew substantially. NeurIPS is by far the biggest conference and it now has a lead of nearly 300 papers over AAAI.

Organizations

Let's see which organizations published the most in 2019. Google has taken a comfortable lead, publishing a substantial number of papers in every venue. They had more than twice as many papers at ICML as the next closest organization (MIT). Worth noting that in previous years I included DeepMind papers also under Google, whereas this time DeepMind is separated as its own entity. Microsoft and CMU are also publishing an impressive amount of research.

Next up, we can see the overall stats between 2012-2019. While Google dominated the past year, CMU and Microsoft are still ahead in the marathon. Notably, the counts for CMU and Microsoft came out identical(!), with 1,215 papers from both organizations. After them, the main heavy hitters are Google, Stanford, MIT, IBM, Berkeley and Tsinghua.

If we look at the separation in time, we see that Google is sort of a late bloomer. Compared to Microsoft and CMU they were publishing much less between 2012-2016, but have overtaken everyone else by quite a bit after that. All of the top players seem to have an upward trend and they all published more in 2019 than ever before.

Authors

Looking at individual authors, Sergey Levine (Berkeley) published an impressive 33 papers in 2019 - 12 in NeurIPS, 6 in ICML and 15 in ICLR. The other top authors are Graham Neubig (CMU), Yoshua Bengio (Montreal), Zhiyuan Liu (Tsinghua), Tao Qin (MSR) and Tie-Yan Liu (MSR).

Looking at the whole period of 2012-2019, we see that Yoshua Bengio (Montreal) has overtaken Chris Dyer (DeepMind) as the most prolific author. Ming Zhou (MSR), Yue Zhang (Westlake), Noah A. Smith (Washington) and Ting Liu (Harbin) all have more than 90 papers from that period. I have had to remove Yang Liu from the list, as there seem to be two or more people publishing under this name and I was not able to automatically separate them.

And looking at the separation in years shows that Sergey Levine, Graham Neubig and Yoshua Bengio have all overtaken the previous publication record set by Chris Dyer in 2016. They each also published considerably more than they did in the previous years.

First Authors

The authors with the highest publication counts are often the leaders of large groups, coordinating the work. But now let's see the first authors, who are usually the people doing the actual implementation and writing. Gabriele Farina is a 4th year PhD student at CMU and he authored 6 papers in 2019, half of them at NeurIPS. Ilias Diakonikolas (UW Madison), Hanrui Zhang (Duke), Rui Zhang (NUS), Chuhan Wu (Tsinghua), Pengcheng Yang (Peking), Sanjeev Arora (Princeton), Zeyuan Allen-Zhu (MSR) and Mikhail Yurochkin (IBM) were all first authors on 5 publications.

Looking at first-author papers through time, Zeyuan Allen-Zhu (MSR), Jiwei Li (Shannon AI), Ivan Vulić (Cambridge), Ryan Cotterell (Cambridge), Young-Bum Kim (Amazon) and Sanjeev Arora (Princeton) have published the most overall.

Countries

For the first time we are now actually able to analyse which countries published the most in 2019. Admittedly, this graph mainly highlights just how much the US dominates the research in this area. China, UK, Germany and Canada are also putting in a strong effort. China has a proportionally very large presence at AAAI, whereas the US is publishing more in NeurIPS and ICML.

The picture looks very similar when looking at the whole 2012-2019 period.

Through the years, the US has always published much more than everyone else and now the pace has accelerated even more. China is also trying to match this and has increased the lead over all the others by quite a bit. The UK is in a respectable third position.

USA

We can also take a look at the 2019 publishing statistics of individual organizations within each country/continent. Given how much the US publishes, the graph here looks sort of similar to the overall statistics, with Google in the lead.

China

In China, Tsinghua and Peking universities distinctly stand out in terms of publication numbers. The other top ranks are also mostly held by universities, with Baidu and Alibaba being the main industry publishers.

UK

In the UK, DeepMind takes the lead. They are followed by Cambridge, Oxford, Edinburgh, UCL, Imperial and the Alan Turing Institute. Worth noting that the Turing Institute is largely virtual, so academics there often have affiliations with other universities as well. Out of the top 7, Cambridge and Edinburgh publish quite a bit in NLP, while the others are focusing mainly on ML.

Germany

In Germany, Darmstadt is the top publisher, with 2/3 of the papers published in the area of NLP. Bosch is putting up some competition, ranking second with mostly ML papers. Saarland, LMU Munich, Tübingen, TU Munich and the Max Planck Institute for Intelligent Systems are also represented at the top.

Canada

Among the Canadian organizations, University of Toronto stands out with an impressive publication count. Université de Montréal and the Vector Institute are also at the top, along with Alberta, McGill, Waterloo, MILA and University of British Columbia. University of Waterloo seems to be the only one with a larger focus on NLP, with the others publishing mostly in the general ML conferences.

Topic Similarity

To try and capture a bit about the topics in the papers as well, I ran them through LDA and then visualised the results with t-SNE. Looking at the organizations, it is interesting to see how geographical clusters emerge. Chinese universities are at the top, US mostly on the right, Europe on the left, and industry right in the middle.

We can do the same for authors. The closeness in the graph reflects a combination of topic similarity and the frequency of collaboration.

Finally, we can do the same for countries. Given that all countries are working on a range of different topics, this graph is likely more representative of their collaboration frequency.

Additional Statistics

Finally, let's finish off with a couple of fun stats.

That's it for this time. Looking forward to all the interesting work that is going to be published in 2020!

The data used in this post can be found here.

74 Summaries of Machine Learning and NLP Research

My previous post on summarising 57 research papers turned out to be quite useful for people working in this field, so it is about time for a sequel.

Below you will find short summaries of a number of different research papers published in the areas of Machine Learning and Natural Language Processing in the past couple of years (2017-2019). They cover a wide range of different topics, authors and venues. These are not meant to be reviews showing my subjective opinion, but instead I aim to provide a blunt and concise overview of the core contribution of each publication.

Given how many papers are published in our area every year, it is getting more and more difficult to keep track of all of them. The goal of this post is to save some time for both new and experienced readers in the field and allow them to get a quick overview of 74 research papers in about 30 minutes reading time.

I set out to post 60 summaries (up from 50 compared to last time). At the end, I also include the summaries for my own published papers since the last iteration (papers 61-74).

Here we go.

1. Improving Language Understanding by Generative Pre-Training

Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever. OpenAI. 2018.
https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf

A transformer architecture that is trained as a language model on a large corpus, then fine-tuned for individual text classification and similarity tasks. Multiple sentences are combined together into a single sequence using delimiters in order to work with the same model. Reporting high results on entailment, question answering and semantic similarity tasks.

2. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova. Google. NAACL 2019.
https://www.aclweb.org/anthology/N19-1423.pdf

A bidirectional transformer architecture for pre-training language representations. The model is optimized on unlabeled data by 1) predicting masked words in the input sequence, and 2) predicting whether the input sequences occur together. The parameters can then be fine-tuned for a specific task, such as classifying sentences, sentence pairs, or tokens.

3. LXMERT: Learning Cross-Modality Encoder Representations from Transformers

Hao Tan, Mohit Bansal. UNC. ArXiv 2019.
https://arxiv.org/pdf/1908.07490.pdf

Building a cross-modal pre-trained model for both vision and language. Both images and text are encoded and attended over jointly with a cross-modal encoder. The model is then optimized with both unimodal and multimodal tasks (masked LM, image classification, image-caption matching, visual QA). It achieves new state-of-the-art results on several VQA datasets.

ML and NLP Publications in 2018

It is time for another yearly update of the publication statistics in Machine Learning and Natural Language Processing. The field has continued to grow very rapidly, both in number of publications and number of attendees, breaking all sorts of previous records. Perhaps most notably the initial release of NeurIPS conference tickets sold out in 11 minutes and 38 seconds. In this post I will provide some finer-grained statistics on these numbers, showing which authors and organizations are publishing most at specific conferences.

This year, I have included the following conferences/journals: ACL, EMNLP, NAACL, EACL, COLING, TACL, CL, CoNLL, NeurIPS, ICML, ICLR, AAAI. This selection aims to cover the most well-known and high-ranking venues for publishing work on both machine learning and language technologies. Compared to last year, I've removed SemEval, as it has a large focus on shared task papers and I'm not including these for other conferences either. I've also added AAAI, which is one of the bigger conferences and was previously missing from the rankings. NeurIPS (previously known as NIPS) changed its name this year, but for consistency I will use the new name to refer to all the previous iterations as well.

This analysis is done automatically with a collection of scripts that I've continued to improve over the years. The paper lists are crawled from online proceedings and author names can usually be found there as well. Organization names need to be extracted straight from the PDFs which can lead to quite a bit of noise. I've created various methods for detecting and mapping different types of names, but let me know if you spot any remaining errors.

While this post highlights authors and organizations who have published the most in the recent year, I want to specify that I do not think that publication quantity is something that we as a field should be pursuing or rewarding. As the graphs below show, the field is becoming more and more popular, and this rapid increase in numbers comes with very varying quality. Authoring 1 piece of groundbreaking work is always better than releasing 10 totally forgettable incremental papers. This post is just meant to give a light high-level view of who is currently publishing and at which conferences, and perhaps provide a bit of inspiration for new researchers with great ideas.

Venues

We start off by looking at the publications at all the conferences between 2012-2018. Most of the ML venues continued their growth in the number of published papers, with AAAI and NeurIPS going past the 1,000 paper mark. EMNLP and NAACL also had their record years by quite a margin, whereas ACL and COLING stayed closer to the previous numbers. EACL took this year to rest, and the number of papers in TACL and CL has remained relatively stable throughout the years.

57 Summaries of Machine Learning and NLP Research

Staying on top of recent work is an important part of being a good researcher, but this can be quite difficult. Thousands of new papers are published every year at the main ML and NLP conferences, not to mention all the specialised workshops and everything that shows up on ArXiv. Going through all of them, even just to find the papers that you want to read in more depth, can be very time-consuming.

In this post, I have summarised 50 papers. After going through a paper, if I had the chance, I would write down a few notes and summarise the work in a couple of sentences. These are not meant as reviews – I’m not commenting on whether I think the paper is good or not. But I do try to present the crux of the paper as bluntly as possible, without unnecessary sales tactics. Hopefully this can give you the general idea of 50 papers, in roughly 20 minutes of reading time.

The papers are not selected or ordered based on any criteria. It is not a list of the best papers I have read, more like a random sample. The only filter that I applied was to exclude papers older than 2016, as the goal is to give an overview of the more recent work.

I set out to summarise 50 papers. Once I was done, I thought this would be a sensible place to summarise my own work as well. So at the end of the list you will also find brief summaries of the papers I published in 2017.

Let’s get started.

1. A Thorough Examination of the CNN/Daily Mail Reading Comprehension Task
Danqi Chen, Jason Bolton, Christopher D. Manning. Stanford. ACL 2016.
https://arxiv.org/pdf/1606.02858.pdf

Hermann et al (2015) created a dataset for testing reading comprehension by extracting summarised bullet points from CNN and Daily Mail. All the entities in the text are anonymised and the task is to place correct entities into empty slots based on the news article.

This paper has hand-reviewed 100 samples from the dataset and concludes that around 25% of the questions are difficult or impossible to answer even for a human, mostly due to the anonymisation process. They present a simple classifier that achieves unexpectedly good results, and a neural network based on attention that beats all previous results by quite a margin.

2. Word Translation Without Parallel Data
Alexis Conneau, Guillaume Lample, Marc’Aurelio Ranzato, Ludovic Denoyer, Hervé Jégou. Facebook, Le Mans, Sorbonne. ArXiv 2017.
https://arxiv.org/pdf/1710.04087.pdf

Inducing word translations using only monolingual corpora for two languages. Separate embeddings are trained for each language and a mapping is learned through an adversarial objective, along with an orthogonality constraint on the most frequent words. A strategy for an unsupervised stopping criterion is also proposed.

ML/NLP Publications in 2017

It has been a very productive year for NLP and ML research. Both areas continued to grow, with conferences reaching record numbers of publications. In this post I will break these numbers down a bit more, by individual authors and organisations. The statistics cover the following venues: ACL, EMNLP, NAACL, EACL, COLING, TACL, CL, CoNLL, *Sem+SemEval, NIPS, ICML, ICLR. Compared to last year, I’ve now included ICLR which has grown very rapidly in the last two years and become a highly competitive conference.

The analysis is done automatically, by crawling publication information from the conference websites and ACL Anthology. Author names are usually listed in the proceedings and easily extractable, however the organisation names are more tricky and need to be extracted straight from the PDFs. I’ve created a number of rules to map together alternative names and misspellings, but let me know if you notice any errors.
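As a rough illustration of what mapping alternative names together can look like, here is a minimal sketch. The alias table and the function `normalise_org` are hypothetical and only show the general idea; the actual rules used for this analysis are not given in the post.

```python
import re

# Hypothetical alias table, invented for illustration.
# Keys are lowercased names with punctuation removed.
ALIASES = {
    "carnegie mellon university": "CMU",
    "carnegie mellon univ": "CMU",
    "cmu": "CMU",
    "massachusetts institute of technology": "MIT",
    "mit": "MIT",
}


def normalise_org(raw):
    """Map a raw affiliation string to a canonical name, if an alias is known."""
    key = re.sub(r"[^a-z0-9 ]", " ", raw.lower())  # replace punctuation with spaces
    key = re.sub(r"\s+", " ", key).strip()         # collapse repeated whitespace
    # Unknown names are returned unchanged (apart from surrounding whitespace).
    return ALIASES.get(key, raw.strip())


print(normalise_org("Carnegie  Mellon Univ."))
```

Normalising case, punctuation and whitespace before the lookup means one table entry can cover several surface variants, but genuine misspellings would still need their own entries or some fuzzy matching on top.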

Venues

First, let’s look at different publication venues between 2012-2017. NIPS is clearly heading off the charts, with 677 publications this year. Most other venues are also growing rapidly, with 2017 being the biggest year ever for ICML, ICLR, EMNLP, EACL and CoNLL. In contrast, TACL and CL seem to be keeping a constant number of publications per year. NAACL and COLING were notably missing from 2017, but we can look forward to both of them in 2018.

Attending to characters in neural sequence labeling models

Word embeddings are great. They allow us to represent words as distributed vectors, such that semantically and functionally similar words have similar representations. Having similar vectors means these words also behave similarly in the model, which is what we want for good generalisation properties.

However, word embeddings have a couple of weaknesses:

  1. If a word doesn’t exist in the training data, we can’t have an embedding for it. Therefore, the best we can do is clump all unseen words together under a single OOV (out-of-vocabulary) token.
  2. If a word only occurs a couple of times, the word embedding likely has very poor quality. We simply don’t have enough information to learn how these words behave in different contexts.
  3. We can’t properly take advantage of character-level patterns. For example, there is no way to learn that all words ending with -ing are likely to be verbs. The best we can do is learn this for each word separately, but that doesn’t help when faced with new or rare words.
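The first weakness can be seen in a few lines of code. This is a minimal sketch, assuming a plain word-to-index vocabulary of the kind that usually sits in front of an embedding matrix; the names `build_vocab`, `lookup` and the `<OOV>` symbol are illustrative choices, not taken from the paper.

```python
OOV = "<OOV>"  # single shared token for every unseen word


def build_vocab(training_tokens, min_count=1):
    """Assign an embedding index to each sufficiently frequent training word."""
    counts = {}
    for tok in training_tokens:
        counts[tok] = counts.get(tok, 0) + 1
    vocab = {OOV: 0}
    for tok, count in counts.items():
        if count >= min_count:
            vocab[tok] = len(vocab)
    return vocab


def lookup(vocab, token):
    """Return the index for a token, falling back to the shared OOV index."""
    return vocab.get(token, vocab[OOV])


train = "the pound extended losses against the dollar".split()
vocab = build_vocab(train)

# "euro" and "yen" were never seen in training, so both collapse onto index 0
# and the model cannot tell them apart.
print(lookup(vocab, "pound"), lookup(vocab, "euro"), lookup(vocab, "yen"))
```

Raising `min_count` pushes rare words into the same OOV bucket, which is one common way of dealing with the second weakness, at the cost of losing those words entirely.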

In this post I will look at different ways of extending word embeddings with character-level information, in the context of neural sequence labeling models. You can find more information in the COLING 2016 paper “Attending to characters in neural sequence labeling models”.

Sequence labeling

We’ll investigate word representations in order to improve on the task of sequence labeling. In a sequence labeling setting, a system gets a series of tokens as input and it needs to assign a label to every token. The correct label typically depends on both the context and the token itself. Quite a large number of NLP tasks can be formulated as sequence labeling, for example:

POS-tagging
DT  NN    VBD      NNS    IN      DT   DT  NN     CC  DT  NN   .
The pound extended losses against both the dollar and the euro .

Error detection
+ +    +  x       +   +      +   +    +    x      +
I like to playing the guitar and sing very louder .

Named entity recognition
PER _      _   _      _  ORG  ORG   _  TIME _
Jim bought 300 shares of Acme Corp. in 2006 .

Chunking
B-NP    B-PP B-NP I-NP B-VP I-VP     I-VP I-VP   B-PP B-NP B-NP  O
Service on   the  line is   expected to   resume by   noon today .

In each of these cases, the model needs to understand how a word is being used in a specific context, and could also take advantage of character-level patterns and morphology.
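The input/output format shared by all four examples can be sketched as two parallel lists, with exactly one label per token. The snippet below uses the named entity recognition example from above; the helper `labelled_spans` is a hypothetical name, and merging adjacent tokens that share a label is a simplification chosen here for readability.

```python
from itertools import groupby

tokens = "Jim bought 300 shares of Acme Corp. in 2006 .".split()
labels = "PER _ _ _ _ ORG ORG _ TIME _".split()


def labelled_spans(tokens, labels, outside="_"):
    """Group consecutive tokens that share a label, skipping the 'outside' label."""
    if len(tokens) != len(labels):
        raise ValueError("sequence labeling needs exactly one label per token")
    spans = []
    pairs = zip(tokens, labels)
    for label, group in groupby(pairs, key=lambda pair: pair[1]):
        if label != outside:
            spans.append((label, " ".join(tok for tok, _ in group)))
    return spans


print(labelled_spans(tokens, labels))
```

Note that this grouping would also merge two different entities of the same type if they happened to be adjacent, which is exactly the ambiguity that B-/I- prefixes, as in the chunking example, are designed to remove.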

NLP and ML Publications – Looking Back at 2016

After my last post on analysing publication patterns I received quite a lot of feedback and many feature requests, so I decided to create an update once 2016 is over. It is now quite a bit bigger than before, and includes 11 different conferences and journals: ACL, EACL, NAACL, EMNLP, COLING, CL, TACL, CoNLL, *Sem+SemEval, NIPS, and ICML.

The information used in these graphs was collected through crawling the web. ACL Anthology was very useful, listing papers in a consistent format. However, information such as the organisation names in each paper still needed to be extracted directly from the pdfs, which means there are likely to be some errors. I’ve tried to create exceptions to catch different spelling variations and other anomalies, but if you notice mistakes in the graphs, do let me know.

This analysis shouldn’t be taken too seriously – after all, quality of research matters much more than quantity, and that is considerably more difficult to measure. However, my motivation is to provide a high-level overview of what is happening in the field, where the big players are publishing, and perhaps supply a bit of inspiration and motivation for the new year.