Analysing NLP publication patterns

Recently, I got curious about how much different institutions publish in my area. Does Google publish more than Microsoft? Which university has the strongest publication record in NLP? And are there any interesting trends in recent years? Quantity does not necessarily equal quality, but the number of publications is still a reasonable indicator of general activity in the field, how big the research group is, and how outward-facing the research projects are.

My approach was to crawl papers from the 6 biggest conferences that are relevant to my research: ACL, EACL, NAACL, EMNLP, NIPS, ICML. The first 4 focus on NLP applications regardless of methods, and the latter 2 on machine learning algorithms regardless of tasks. The time window was restricted to 2012-2016, as I’m more interested in current publications.

Luckily, all these conferences have nice webpages listing all the papers published there. The ACL Anthology contains records for ACL, EACL, NAACL and EMNLP, NIPS has a separate webpage for papers, and ICML proceedings are on the JMLR website (except for ICML 2012, which is on the conference website). I wrote Python scripts that crawled all the papers from these conferences, extracting author names and organisations. While authors can be crawled directly from the websites, in order to find the organisation names I had to parse the PDFs into text and extract anything that looked like a university or company name in the first 30 lines of the paper. I wrote a bunch of manual patterns to map names to canonical versions (“UCL” to “University College London” and “Google Inc” to “Google”), although it is likely that I still missed some edge cases.
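The normalisation step can be sketched roughly as follows; the patterns below are illustrative stand-ins for the longer hand-written rule list, not the exact rules I used:

```python
import re

# Illustrative mapping from regex patterns matching common variants
# to a canonical organisation name; the real rule list was longer
# and tuned by hand against the extracted PDF text.
CANONICAL = {
    r"\bUCL\b": "University College London",
    r"\bGoogle(,? Inc\.?)?\b": "Google",
    r"\bMicrosoft Research\b": "Microsoft",
    r"\bCarnegie Mellon( University)?\b": "CMU",
}

def canonicalise(org: str) -> str:
    """Map an extracted organisation string to its canonical name."""
    for pattern, canonical in CANONICAL.items():
        if re.search(pattern, org):
            return canonical
    return org.strip()
```

Anything the rules do not recognise falls through unchanged, which is exactly where the remaining edge cases hide.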

Below is the graph of top 25 organisations and the conferences where they publish.

CMU comes out as the most prolific publisher with 305 papers. A close second is Microsoft with 302 publications, also leading in the industry category. I was somewhat surprised to find that Microsoft publishes so much, almost twice as many papers compared to Google, especially as Google seems to get much more publicity with their research. Stanford is also among the top 3 organisations that publish substantially more than others. Edinburgh and Cambridge represent the UK camp with 121 and 117 papers respectively.

When we look at the distribution of conferences, Princeton and UCL stand out as having very little NLP-specific research, with nearly all of their papers in ICML and NIPS. Stanford, Berkeley and MIT also seem to focus more on machine learning algorithms. In contrast, Edinburgh, Johns Hopkins and the University of Maryland have most of their publications in NLP-related conferences. CMU, Microsoft and Columbia are the most balanced among the top publishers, with a roughly 50:50 division between NLP and ML.

We can also plot the number of publications per year, focusing on the top 15 institutions.

Carnegie Mellon has a very good track record, but has only recently overtaken Microsoft as the top publisher. Google, MIT, Berkeley, Cambridge and Princeton have also stepped up their publishing game, showing upward trends in recent years. The sudden drop for 2016 is due to incomplete data: at the time of writing, the ACL, EMNLP and NIPS papers for this year are not available yet.

Now let’s look at the same graphs but for individual authors.

Chris Dyer comes out on top with 50 papers. This result is even more impressive given that he started with just 2 papers in 2012, then rocketed to the top by quite a margin in 2015. Almost all of his papers are in NLP conferences, with only 1 paper each in NIPS and ICML. Noah Smith, Chris Manning and Dan Klein rank 2nd-4th with more stable publishing records, also focusing mainly on NLP conferences. In contrast, Zoubin Ghahramani, Yoshua Bengio and Lawrence Carin focus mostly on machine learning algorithms.

There seems to be a clear separation between the two research communities, with researchers specialising in publishing either in NLP or in ML. This is somewhat unexpected, especially considering the widespread trend of publishing novel neural network architectures for NLP tasks. Both fields would probably benefit from slightly tighter integration in the future.

I hope this little analysis was interesting to fellow researchers. I’m happy to post an update some time in the future, to see how things have changed. In the meantime, let me know if you find any bugs in the statistics.

Update: As requested, I’ve also added the statistics for first authors with highest publication counts. Jiwei Li from Stanford towers above others with 14 publications. William Yang Wang (CMU), Young-Bum Kim (Microsoft), Manaal Faruqui (CMU), Elad Hazan (Princeton), and Eunho Yang (IBM) have all managed an impressive 9 first-author publications.

Update 2: Added a fix for Jordan Boyd-Graber who publishes under Jordan L. Boyd-Graber in NIPS.

Update 3: Added a fix for Hal Daumé III, mapping together different spellings.

Update 4: By showing top N authors on the graphs, some authors with equal numbers of publications were being excluded. I’ve adjusted the value N for each graph so this doesn’t happen.

Update 5: Added a fix for Pradeep K. Ravikumar who also publishes under Pradeep Ravikumar.

Update 6: Added fixes to capture name variations for INRIA.
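The author-name fixes in the updates above amount to a small alias table mapped over names before counting. A minimal sketch, populated only with the cases mentioned in the updates:

```python
# Alias table built from the fixes listed in the updates above;
# in practice this list keeps growing as readers report new variants.
AUTHOR_ALIASES = {
    "Jordan L. Boyd-Graber": "Jordan Boyd-Graber",
    "Pradeep K. Ravikumar": "Pradeep Ravikumar",
    "Hal Daume III": "Hal Daumé III",
}

def normalise_author(name: str) -> str:
    """Collapse known spelling variants to one canonical author name."""
    name = name.strip()
    return AUTHOR_ALIASES.get(name, name)
```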


    • Marek

      The full data on organisations is quite noisy at the lower ranks at the moment, as it is extracted from PDFs and then post-processed with manual rules. It still contains a long tail of alternative spellings and entries that are not institutions at all (e.g. College Park).
      Imperial College London comes up with 7 entries in there, although it is worth noting that I’m only looking at 6 specific conferences, and Imperial seems to be publishing in somewhat different areas.

    • Marek

      Thanks! Indeed, I’m not catching alternative names for authors at the moment. I will update it soon and add a fix for your name.

  1. Jason Eisner

    How about including TACL? It’s a journal, but deliberately set up to be another mechanism for publishing normal ACL-style papers, so leaving it out of the analysis is strange. The format is essentially the same as ACL/NAACL/EMNLP/EACL, and you get to present the work at one of those conferences. Downloading and scraping the papers should be no different than for ACL. Whether you submit via TACL or directly via the conferences is as much a matter of when the deadlines fall as anything else. (Although TACL papers arguably should count a bit more: they generally get more thorough reviews, are often required to make revisions for final acceptance, and tend to be longer.)

    There’s also a question of whether long-form journal papers (JMLR, CL, etc.) should be included in measures of productivity. Perhaps those are often just synthesizing and expanding previously published conference papers? – but I’m not sure.

    Of course, I hope that no one optimizes for your ranking.

    • Marek

      I chose the 6 conferences simply based on which sources I personally follow the most. I completely agree that there are many other conferences and journals that could be included: TACL, COLING, CoNLL, *Sem, IJCAI, IJNLP, LREC, JMLR, CL, CIKM, AAAI, WWW, etc.
      I intend to post an update at the end of the year, and will include a longer list of conferences. Feel free to suggest additional sources which I haven’t listed yet.

      • Wei Xu

        I second Jason. TACL is essentially equal to ACL/NAACL/EMNLP/EACL; it is quite different from COLING, CoNLL, *Sem, IJCAI, IJNLP, LREC, JMLR, CL, CIKM, AAAI, WWW, etc, and much closer to the center of NLP research. I would recommend that anyone interested in NLP follow TACL papers in addition to (if not more closely than) ACL/NAACL/EMNLP/EACL.

  2. Ryan

    Thanks for the nice post, but some of the numbers seem off, and the errors may be related to parsing Chinese names. For example, Yuchen Zhang does not have an EMNLP, and there are at least two Yuxin Chen working in this area but neither of them has 7 ICML+NIPS alone. Perhaps you double-counted other people named Y. Zhang or Y. Chen?

  3. Jochen L Leidner

    Nice infographic, thanks! Immediate feature requests: How about patents? Including IR? Speech? Top single authors? Or which university fosters the most team co-authoring? Citation impact per institution?

  4. EXG

    Nice infographics! Quick comment: I believe INRIA is missing. Just by counting NIPS 2012-2015, I get more than 60 papers.

    • Marek

      Good point, thanks for letting me know. I’ve added a fix for mapping together different ways of naming INRIA. They are now featured in the top 25.

  5. Trevor Cohn

    Interesting analysis, thanks for making this public. Also related is the ACL Anthology Network, which includes citation analysis over the conferences/journals in the ACL Anthology ({NA,E,}ACL, EMNLP, COLING etc). Sadly it hasn’t been updated for 3 years.

  6. John

    I wonder why you have included ICML and NIPS into your analysis. There is some spillover from ML into NLP and vice versa but generally within the NLP community, only the big four (ACL, NAACL, EMNLP, and EACL) matter. The other two are really machine learning conferences and are not that much of interest to researchers in Computational Linguistics/NLP, so the data from NIPS and ICML are more like noise and don’t give you much information on current trends in the field.

    • Marek

      I chose the conferences that influence my work the most. Totally subjective, I agree. On the spectrum of linguistics-NLP-ML, I am more on the ML side.
