Marek Rei

Creating Interpretable Models with Atomic Inference

Joe Stacey, November 7, 2024

This is a guest post from Joe Stacey about our quest to create interpretable Natural Language Inference (NLI) models. In this post he will share some of our ideas about interpretability, introduce the idea of atomic inference, and give an overview of the work in our 2022 and 2024 EMNLP papers [1,2]. At the end he’ll also share some of the other relevant papers that have inspired us along the way.

The goal of creating interpretable, high performing models

The focus of our research has been trying to create interpretable models that still perform competitively. After all, surely it’s better to have an interpretable model, where you can understand the reasons for each model prediction, rather than trying to interpret a black-box model (with no guarantees of faithfulness). Even though creating high performing, interpretable models seems like a sensible thing to want to do, there is little research in this area. A recent survey paper by Calderon and Reichart [10] found that fewer than 10% of NLP interpretability papers consider inherently interpretable self-explaining models, with the authors advocating for more research on causal-based interpretability methods.

Maybe it’s just too difficult to create interpretable models that also perform competitively? Our findings suggest that this isn’t the case, and we hope to convince you that trying to create more interpretable models is an exciting and worthwhile area of research (which could be of real benefit to the NLP community).

Atomic Inference, and why we think it’s important

Our efforts to create interpretable models have centred around methods that we describe as Atomic Inference. I’ll start by explaining what this means, and why we felt that we needed to introduce this term.

We use the term atomic inference to describe methods that involve breaking down instances into sub-component parts, which we describe as atoms (sounds cool, right?), with models independently making hard predictions for each atom. The model’s atom-level predictions are then aggregated in an interpretable way to make instance-level predictions. This means that we can trace exactly which atoms are responsible for each model prediction. We refer to this as a faithfulness guarantee, as we can be certain which components of an instance were responsible for each prediction.

You might be wondering, what happens if we need to make a prediction based on two different atoms? How can we do that if the predictions for each atom need to be independent? One option is to create an additional, new atom that contains all of the relevant content from the other atoms that you need to combine.

We felt we needed to introduce the term atomic inference to distinguish interpretable atom-based methods from other work that doesn’t have the same interpretability benefits, for example when the atomic decomposition is introduced in order to improve model performance.

There are some key parts of the atomic inference definition that separate these methods from other atom-based methods:

Atomic inference methods require models to make each atom-level prediction independently, or in other words, predictions for one atom do not also have access to the content of the other atoms. This is often not the case for other atom-based methods [3, 4, 5]. By making the atom-level predictions independently, we guarantee that the content of each atom is the reason for the corresponding atom-level predictions. Otherwise, we no longer have our faithfulness guarantee.
Atomic inference also requires hard predictions for each atom. In contrast, other methods use soft probability scores (for example considering a mean score across the atoms [6, 7, 8]). If the end goal is improving model performance, using soft probability scores makes a lot of sense. But if your goal is to improve interpretability, making hard predictions for each atom has considerable advantages, pinpointing the specific atoms that are responsible for each prediction.

So, in the end, atomic inference methods give us a faithfulness guarantee – we know, for certain, which atoms are responsible for each model prediction. Admittedly, the prediction for each individual atom remains a black box, but in our opinion this is fine! Atomic inference methods may not be perfect, but they are useful and practical interpretability methods.

This sounds intriguing! I get the idea, but how does this work in practice?

Great question! Let me walk through our EMNLP 2022 paper to give you a flavour of what atomic inference methods can look like in practice. The work focuses on NLI, but the same method could also be applied to other, similar tasks.

In this paper we proposed the idea that we could decompose NLI instances into atoms by segmenting the hypotheses into different spans, with each span being described as an atom. You can see the example below, where the hypothesis is segmented into spans. In this example, each span can be considered as an entailment span, except for the final span, which is neutral with respect to the premise (we don’t know if the man is carrying a surfboard or not).

We created the spans by segmenting the hypothesis based on the presence of nouns, but in our experience changing how we segment the text into spans didn’t make much difference.

The tricky part is what to do when you have some kind of long-range dependency across the sentence, e.g. if you have a negation word at the beginning of the sentence that will impact other spans towards the end of the input. Or perhaps you need to consider multiple spans together to correctly deduce the entailment relationship. To overcome these issues, we considered including consecutive spans as additional atoms (the number of consecutive spans we combine was a hyper-parameter). It’s not the perfect solution, but it was good enough for high performance on SNLI. We therefore had overlapping atoms, with single spans being considered as atoms alongside consecutive spans. While this might seem a bit unusual, the presence of overlapping (or duplicate) atoms isn’t a problem for atomic inference systems.
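To make the idea of overlapping atoms concrete, here is a minimal sketch of how single spans and runs of consecutive spans could be collected into one list of atoms. The function name `build_atoms`, the parameter `max_window` and the example spans are all made up for illustration; this is not the code from the paper.

```python
def build_atoms(spans, max_window=2):
    """Return every single span plus every run of up to `max_window` consecutive spans.

    `max_window` plays the role of the hyper-parameter controlling how many
    consecutive spans are combined (hypothetical name, chosen for this sketch).
    """
    atoms = []
    for width in range(1, max_window + 1):
        # Slide a window of `width` spans across the hypothesis.
        for start in range(len(spans) - width + 1):
            atoms.append(" ".join(spans[start:start + width]))
    return atoms


# Made-up hypothesis, already segmented into spans.
spans = ["a man", "in a wetsuit", "carries a surfboard"]
atoms = build_atoms(spans, max_window=2)
```

With three spans and a window of two, this yields the three single spans followed by the two pairs of neighbouring spans, so the same text appears in more than one atom.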

Next I’m going to talk about how inference works, and how we combine the model predictions for each individual atom to get a prediction for each instance. After that we’ll get to the problem of how to train a model to make accurate predictions for individual atoms (when we don’t have any atom-level labels to train with).

How inference works with our EMNLP 2022 paper

Our model makes predictions of either entailment, neutral or contradiction for each atom (our spans). We then aggregate these span predictions together using the following logic:

If any span in the hypothesis contradicts the premise, then the entire hypothesis will contradict the premise.
Otherwise, if any span in the hypothesis is predicted as neutral, then the hypothesis is neutral with respect to the premise.
Otherwise, every span is predicted as entailment, and the hypothesis is entailed by the premise.
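The aggregation rules above can be sketched in a few lines of Python. This is only an illustration under my own naming (`aggregate` and `responsible_atoms` are hypothetical helpers), and the fallback to entailment when no span is neutral or contradictory is how I read the rules.

```python
def aggregate(span_predictions):
    """Combine hard span-level labels into one instance-level label."""
    if "contradiction" in span_predictions:
        return "contradiction"
    if "neutral" in span_predictions:
        return "neutral"
    return "entailment"


def responsible_atoms(atoms, span_predictions):
    """Return the instance label and the atoms whose prediction matches it."""
    label = aggregate(span_predictions)
    return label, [a for a, p in zip(atoms, span_predictions) if p == label]


# The three neutral spans from the example discussed in this section.
atoms = ["the sisters", "are hugging goodbye", "after just eating lunch"]
predictions = ["neutral", "neutral", "neutral"]
label, reasons = responsible_atoms(atoms, predictions)
```

Because the aggregation is a fixed rule over hard labels, the list in `reasons` is exactly the set of atoms that produced the instance-level label.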

We provide an example below where the model predicts the neutral class:

In this case, the model predicts the following spans as being neutral: “the sisters”, “are hugging goodbye” and “after just eating lunch”. As we have no contradiction spans being predicted, and there is at least one neutral span, the prediction must be neutral overall. Moreover, we know exactly why the model is making this prediction – because two women aren’t necessarily sisters, embracing doesn’t necessarily mean hugging goodbye, and we don’t know if the women have just had lunch.

Our co-author Haim Dubossarsky once described this as “giving a model explainability super-powers” – because with the atomic inference faithfulness guarantee, we know exactly why the model made each prediction (up to the atom level).

Model training in our EMNLP 2022 paper

Now that we’ve sorted out how things work during inference, we need to face the tricky problem of how to train models to make accurate predictions for each specific span. We only have labels at the instance level, and we want to avoid having to manually annotate spans to obtain span-level training data. Instead, our idea is to use a semi-supervised method where the model decides for itself which spans are likely to be neutral or contradiction. Specifically, during training we encode the premise together with each hypothesis span in turn, with a simple MLP giving us a neutral score for every atom in the instance. We then supervise the maximum of these neutral scores, comparing this to a binary neutral label of 1 or 0 (depending on whether or not we expect a neutral span to be present).

This involves considering the following loss:

Here \(i\) indexes the atoms for that instance, \(n\) refers to the neutral class, \(y_n\) is the binary neutral label, and \(\widetilde{a}_{n,i}\) is the neutral score for atom \(i\).

We then follow the same process for contradiction, finding a contradiction score for each atom, and then supervising the maximum score for each instance. As with the neutral class, we compare this maximum score to a binary label of 1 or 0 based on if we are expecting there to be a contradiction span.
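As a rough sketch of the max supervision idea, the snippet below compares the largest atom-level score with the binary label. The squared-error form is an assumption made for this illustration (the text above only says that the maximum score is compared to the label), and the scores are made-up numbers rather than model outputs. In a real system the scores would come from the MLP and the loss would be backpropagated.

```python
def max_supervision_loss(atom_scores, label):
    """Penalise the gap between the highest atom score and the binary label.

    ASSUMPTION: squared error is used here purely for illustration.
    """
    return (max(atom_scores) - label) ** 2


# Made-up neutral scores for three atoms of one instance.
neutral_scores = [0.1, 0.8, 0.3]

# A neutral span is expected (y_n = 1): only the top-scoring atom is pushed up.
loss_when_expected = max_supervision_loss(neutral_scores, 1)

# No neutral span is expected (y_n = 0): the top score, and hence all scores, is pushed down.
loss_when_not_expected = max_supervision_loss(neutral_scores, 0)
```

Supervising only the maximum means that a label of 1 asks for at least one atom to score highly, while a label of 0 asks for every atom to score low. The same function can be reused with contradiction scores and the binary contradiction label.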

What remains is how to decide what these binary neutral and contradiction labels should be during training. To do this, we introduce even more rules that govern the interaction between the instance labels and the span labels:

If we have a contradiction example, we have \(y_c = 1\) (we must have a contradiction span) – although we don’t know if there is a neutral span or not.
If we have a neutral example, we have \(y_n = 1\) (we must have a neutral span). We also can’t have a contradiction span, so \(y_c = 0\).
Otherwise, for entailment examples, we can’t have either a neutral or a contradiction span, so both \(y_c = 0\) and \(y_n = 0\).
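These label rules can be written as a small lookup. The function name `span_level_targets` is hypothetical, and returning `None` for the neutral label of contradiction examples (meaning "leave the neutral score unsupervised") is my reading of "we don’t know if there is a neutral span or not", not something the post states explicitly.

```python
def span_level_targets(instance_label):
    """Map an instance-level NLI label to the binary targets (y_n, y_c).

    ASSUMPTION: None marks a target that is unknown and therefore not supervised.
    """
    if instance_label == "contradiction":
        return None, 1   # must contain a contradiction span; neutral spans unknown
    if instance_label == "neutral":
        return 1, 0      # must contain a neutral span and no contradiction span
    if instance_label == "entailment":
        return 0, 0      # no neutral span and no contradiction span
    raise ValueError(f"unknown NLI label: {instance_label!r}")


y_n, y_c = span_level_targets("neutral")
```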

You might wonder where we get all these rules from. Can we just make them up like this? We argue that these rules follow from the inherent nature of NLI, and they just work.

The training process in the paper is a little bit more complicated than what I’ve described above, because we include an additional instance-level loss. We include this extra loss because it makes a small improvement in performance, but if you understand everything above, you understand the most important parts of the system.

The last thing to mention is that this method is model agnostic, and that you can apply this method to a range of baseline models.

The diagram below summarises how the full training process works in the paper (in this case, using a BERT model to encode the premise and each span):

Performance and limitations

I’ve talked about the “explainability superpowers” that atomic inference methods can offer, and at this point you probably agree that it’s good to have an interpretable model. But what’s the trade-off with performance? What is the cost to get these atomic inference guarantees about model faithfulness? It turns out, there is only a small loss in performance with our system, with our method almost matching the baseline model. In contrast, other interpretable methods for SNLI show considerably worse performance.

The table below shows the performance of our method (SLR-NLI), compared to the BERT baseline, and other interpretable methods for SNLI. The results show that additional span-level supervision using the e-SNLI rationales [11] also improves performance.

At this point, we have an interpretable model, with strong faithfulness guarantees, with performance almost in line with the baseline. Taking a step back, this seems like quite a big achievement, showing that inherently interpretable models can perform competitively compared to black-box models.

What’s the catch? There must be a catch, right? In this case, the catch is the dataset. Our atomic inference method works great for simple NLI datasets such as SNLI. But after testing our span-based method on ANLI (the focus of our 2024 paper), the results are no longer so impressive!

Atomic Inference methods for ANLI

ANLI (Adversarial NLI) is created with a human-in-the-loop method where humans create challenging hypotheses that fool existing fine-tuned models. As ANLI premises also consist of multiple sentences, a span-level decomposition is no longer feasible (there would be too many spans), and our method of considering consecutive spans would also not work. So we need to find another way to define our atoms.

We experiment with a sentence-segmentation of the premise, which performs OK. But we find even better performance when we use an LLM to generate a list of facts to itemise all of the information contained within the premise.

You might notice that we’re now talking about decomposing an NLI premise into atoms, rather than the NLI hypothesis. Therefore, all of those training and evaluation rules we had before are no longer going to work. Is it even possible to construct a set of training and evaluation rules that would allow us to use the same model architecture which proved so successful in our previous work? It turns out it is, and the rules are summarised below (the figure is contained in the Appendix of our EMNLP 2024 paper if you’re interested).

I love how all these different rules just follow from the inherent nature of NLI. It just seems so elegant.

Our 2024 paper talks in more detail about how we create comprehensive fact-lists (one of the main challenges we faced). The paper then compares our method to a range of baselines (both atomic inference baselines, and uninterpretable black-box models). Surprisingly, when we implemented our interpretable fact-based method with a DeBERTa-base model, performance actually improved! Our interpretable model also outperformed some LLMs with over 100 times the number of parameters (see our results table below).

This is further evidence that, given the right choice of atoms and training framework, atomic inference models can perform competitively.

More detail about our EMNLP 2024 paper results

Now for some more detail on the results from our EMNLP 2024 paper. If you’re not interested in this level of detail, you may want to skip to the next section!

The training method we described earlier was introduced to overcome the issue where we don’t have atom-level labels during training. This was essential when dealing with text spans as atoms, when a model couldn’t be expected to make sensible predictions for text spans unless it was specifically fine-tuned to do this. This is not necessarily the case when we have facts (or sentences) as our atoms, where models trained on full premise and hypothesis NLI pairs may still be able to make accurate fact or sentence-level predictions. To test this, we train a standard ANLI model on the full premise and hypothesis pairs, and use this model to make the atom-level predictions, rather than training a model using the max supervision method we described earlier.

We find that using a standard ANLI-trained model performs worse than our training method for either sentence atoms or LLM-generated fact atoms. For sentence atoms, our training method leads to a 1.9% accuracy improvement on ANLI. However, when we consider generated facts as atoms, the improvement from our training method is even larger (at 3.3%). This suggests that the generated facts are further from the training distribution than the sentences.

We also perform ablation experiments to understand to what extent the improvements from our method are a result of either: 1) Our training method, 2) The strategies we implement to make our fact lists more comprehensive (including an additional fact conditioned on the hypothesis). We find that both components are important.

The table below shows the performance of:

Using the standard ANLI-trained model to make the fact-level predictions (ablation #1 below)
Using the standard ANLI-trained model to make the fact-level predictions, but with the inclusion of the hypothesis-conditioned facts to make the fact lists more comprehensive (ablation #2 below)
Using our training method (with the max supervision), but without making the fact lists more comprehensive with the hypothesis-conditioned facts (ablation #3 below)

Interestingly, we found that if we don’t use our training method, and instead use a model trained on the full ANLI premises and hypotheses, then using sentence atoms performs better than using fact atoms. However, if we do use our training method (with the max supervision), and we also make the fact lists comprehensive, then using facts as atoms gives the best results!

What’s missing from our existing work, and some thoughts for future research

Atomic inference sets a high bar for interpretability. We have to break down instances into atoms, make independent predictions for each atom, ensure these are hard predictions, and aggregate the atom-level predictions in an interpretable and deterministic way. Yet with the right choice of atoms, performance can still be just as good as a baseline model!

But there remains lots of work to do. First of all, I keep talking about NLI. How could this approach apply to a very different task? Perhaps for example, for the task of VQA – could you use patches as your choice of atoms? Or is there another, better choice of atom?

We also assume that either the NLI hypothesis or the premise is decomposed into atoms, but what happens if we need to decompose both of these? And perhaps more importantly, are there better methods for reasoning across multiple atoms which would maintain the interpretability guarantees from atomic inference?

And finally, there is the question of model robustness. In our 2024 paper we show that existing atomic inference methods often decrease out-of-distribution performance. Is there a way to address this weakness, so we can have both interpretable and robust models that perform as well as their black-box model counterparts?

We believe this area of research is a work in progress, but atomic inference methods can be a practical and highly effective way of creating interpretable neural networks where we know exactly why the models make each prediction.

Other papers that have inspired us along the way

It would be impossible to mention all the papers that have influenced us (so I’m sorry if your paper isn’t mentioned below), but here are a select few papers that are worth reading if you enjoyed this work:

Papers that could be described as being atomic inference methods:

Stretching Sentence-pair NLI Models to Reason over Long Documents and Clusters [6]: Decomposes a premise into sentences, with an NLI model used to predict whether hypothesis atoms are supported by a given premise statement. The ‘hard aggregation’ variant could be considered as atomic inference.
Zero-Shot Fact Verification via Natural Logic and Large Language Models [9]: The cool thing about this paper is that the hard predictions for each atom are natural logical operators, which are then aggregated together into either ‘supports’, ‘refutes’ or ‘neither’ labels. This is an exciting idea that the atom-level label space can be different from the overall task labels.

Other relevant papers that we also particularly like:

WiCE: Real-World Entailment for Claims in Wikipedia [7]: This work also considers an atomic fact-level decomposition, with a mean score used to inform instance-level predictions.
SUMMAC: Re-Visiting NLI-based Models for Inconsistency Detection in Summarization [8]: Involves decomposing a document and summary into sentences, with an NLI model providing a score for how well supported each summary sentence is. This work predates our EMNLP 2022 paper!
PROPSEGMENT: A Large-Scale Corpus for Proposition-Level Segmentation and Entailment Recognition [3]: This paper considers a new type of atom, considering ‘propositions’ instead of sentences or facts. I love the idea of this kind of atom, and it opens the door to other ideas for how we can create atoms.

If you’re aware of other relevant papers that we haven’t cited in either our 2022 or 2024 EMNLP papers, I would love to hear from you!

Thanks for reading, and please do get in touch if you’d like to discuss any of this further. I’ll be presenting this work as a poster at EMNLP 2024 in Miami (Thursday 14th, 10:30-12:00). You can also reach me on j.stacey20@imperial.ac.uk. Thanks also to Pasquale Minervini, Haim Dubossarsky, Oana-Maria Camburu and Marek Rei for being fantastic collaborators for this work.

References:

[1] Joe Stacey, Pasquale Minervini, Haim Dubossarsky and Marek Rei. 2022. Logical Reasoning with Span-Level Predictions for Interpretable and Robust NLI Models. In EMNLP.
[2] Joe Stacey, Pasquale Minervini, Haim Dubossarsky, Oana-Maria Camburu and Marek Rei. 2024. Atomic Inference for NLI with Generated Facts as Atoms. In EMNLP.
[3] Sihao Chen, Senaka Buthpitiya, Alex Fabrikant, Dan Roth and Tal Schuster. 2023. PROPSEGMENT: A Large-Scale Corpus for Proposition-Level Segmentation and Entailment Recognition. In ACL Findings.
[4] Zijun Wu, Zi Xuan Zhang, Atharva Naik, Zhijian Mei, Mauajama Firdaus and Lili Mou. 2023. Weakly Supervised Explainable Phrasal Reasoning with Neural Fuzzy Logic. In ICLR.
[5] Yufei Feng, Xiaoyu Yang, Xiaodan Zhu and Michael Greenspan. 2022. Neuro-symbolic Natural Logic with Introspective Revision for Natural Language Inference. In TACL.
[6] Tal Schuster, Sihao Chen, Senaka Buthpitiya, Alex Fabrikant and Donald Metzler. 2022. Stretching Sentence-pair NLI Models to Reason over Long Documents and Clusters. In EMNLP Findings.
[7] Ryo Kamoi, Tanya Goyal, Juan Diego Rodriguez and Greg Durrett. 2023. WiCE: Real-World Entailment for Claims in Wikipedia. In EMNLP.
[8] Philippe Laban, Tobias Schnabel, Paul N. Bennett and Marti A. Hearst. 2022. SummaC: Re-Visiting NLI-based Models for Inconsistency Detection in Summarization. In TACL.
[9] Marek Strong, Rami Aly and Andreas Vlachos. 2024. Zero-Shot Fact Verification via Natural Logic and Large Language Models. In EMNLP.
[10] Nitay Calderon and Roi Reichart. 2024. On Behalf of the Stakeholders: Trends in NLP Model Interpretability in the Era of LLMs.
[11] Oana-Maria Camburu, Tim Rocktäschel, Thomas Lukasiewicz and Phil Blunsom. 2018. e-SNLI: Natural Language Inference with Natural Language Explanations. In NeurIPS.

68 Summaries of Machine Learning and NLP Research

Marek, November 4, 2024

I have written short summaries of 68 different research papers published in the areas of Machine Learning and Natural Language Processing. They cover a wide range of different topics, authors and venues. These are not meant to be reviews showing my subjective opinion, but instead I aim to provide a blunt and concise overview of the core contribution of each publication. At the end of the list I have also included a selection of my own papers, published together with my students and collaborators.

Given how many papers are published in our area every year, it is getting more and more difficult to keep track of all of them. The goal of this post is to save some time for both new and experienced readers in the field and allow them to get a quick overview of 68 research papers.

These summaries are written in a way that tries to relay the core idea and main takeaway of the paper without any overhype or marketing wrapper.

It is probably also good to mention that I wrote all of these summaries myself and they are not generated by any language models.

Here we go.

1. PromptRobust: Towards Evaluating the Robustness of Large Language Models on Adversarial Prompts

Kaijie Zhu, Jindong Wang, Jiaheng Zhou, Zichen Wang, Hao Chen, Yidong Wang, Linyi Yang, Wei Ye, Yue Zhang, Neil Zhenqiang Gong, Xing Xie. Microsoft Research, CAS, CMU, Peking University, Westlake University, Duke University. ArXiv 2023.
https://arxiv.org/abs/2306.04528

The paper investigates LLM robustness to prompt perturbations, measuring how much task performance drops for different models under different attacks. Prompts are changed by introducing spelling errors, replacing words with synonyms, concatenating irrelevant information or translating from a different language. Word replacement attacks are found to be most effective, with an average performance drop of 33%. Character-level attacks rank second. GPT-4 and UL2 outperformed other investigated models in terms of robustness.

2. System 2 Attention (is something you might need too)

Jason Weston, Sainbayar Sukhbaatar. Meta. ArXiv 2023.
https://arxiv.org/abs/2311.11829

The paper proposes query rewriting as the solution to the problem of LLMs being overly affected by irrelevant information in the prompts. The first step asks the LLM to rewrite the prompt to remove the irrelevant parts. This edited prompt is then given to the LLM to get the final answer, which improves robustness when the prompts include irrelevant information.

3. Dwell in the Beginning: How Language Models Embed Long Documents for Dense Retrieval

João Coelho, Bruno Martins, João Magalhães, Jamie Callan, Chenyan Xiong. CMU, University of Lisbon, NOVA School of Science and Technology. ArXiv 2024.
https://arxiv.org/abs/2404.04163

The paper investigates positional biases when encoding long documents into a vector for similarity-based retrieval. They start with a pre-trained T5-based model and show that this by itself isn’t biased. However, when using contrastive training (either unsupervised or supervised) to optimize the model for retrieval, the model starts to perform much better when the evidence is at the beginning of the text. They find evidence to indicate that this bias is part of the task itself – important information tends to be towards the beginning of the texts, so that is where the models learn to look more.

4. Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention

Tsendsuren Munkhdalai, Manaal Faruqui, Siddharth Gopal. Google. ArXiv 2024.
https://arxiv.org/abs/2404.07143

The paper extends the attention in transformers (which have inefficient quadratic complexity) with a memory component that considerably increases the effective context length. The memory approximates a key-value store by recursively adding previous values into a matrix of parameters as the model moves through context. They show state-of-the-art results on long-context language modelling, finding a hidden passcode from a 1M token length context, and summarizing 500K length books.

5. BooookScore: A systematic exploration of book-length summarization in the era of LLMs

Yapei Chang, Kyle Lo, Tanya Goyal, Mohit Iyyer. UMass Amherst, AllenAI, Princeton. ICLR 2024.
https://arxiv.org/abs/2310.00785

The paper investigates two strategies for summarizing full-length books with LLMs: hierarchically combining chunk-level summaries and incrementally building up a summary while going through the book. They focus on coherence, as opposed to correctness, and develop an automated LLM-based score (BooookScore) for assessing summaries. They first have humans assess each sentence of a sample of generated summaries, then check that the automated metric correlates with the human assessment. The results indicate that hierarchical summarisation produces more coherent summaries while incremental summarisation leads to more details being retained in the summary.

6. DE-COP: Detecting Copyrighted Content in Language Models Training Data

André V. Duarte, Xuandong Zhao, Arlindo L. Oliveira, Lei Li. ULisboa, UCSB, CMU. ICML 2024.
https://arxiv.org/abs/2402.09910

The paper proposes a simple approach to determining whether a particular book has been used for training an LLM. A paragraph from the book is presented to the model, along with multiple paraphrases. The model is then asked to choose which paragraph came from the book. The results are surprisingly good, improving over baselines using probabilities and perplexities.

7. STaR-GATE: Teaching Language Models to Ask Clarifying Questions

Chinmaya Andukuri, Jan-Philipp Fränken, Tobias Gerstenberg, Noah D. Goodman. Stanford. ArXiv 2024.
https://arxiv.org/abs/2403.19154

The paper focuses on teaching LLMs to ask clarifying questions instead of trying to immediately answer user questions and instructions in one turn. They set up two LLMs to chat with each other – the Roleplayer has a secret persona and asks a scripted question, while the Questioner is tasked with answering the question after asking clarifying questions. They sample alternative dialogue traces, identify the one that led to the best final answer, then supervise the Questioner with this best dialogue. The results show that the trained Questioner is able to provide better persona-specific answers.

8. Are Emergent Abilities of Large Language Models a Mirage?

Rylan Schaeffer, Brando Miranda, Sanmi Koyejo. Stanford. NeurIPS 2023.
https://arxiv.org/abs/2304.15004

This paper questions the claim that LLMs have emergent abilities – unexpected skills that suddenly appear only with models that are sufficiently large. They show that this property is mostly due to the use of discontinuous metrics, which only credit models once they reach sufficiently high levels of abilities. When using smooth continuous metrics, and increasing test sets to sufficient size, the “sudden appearance” of abilities is replaced by gradual improvements.
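A toy illustration of the metric effect, not taken from the paper's experiments: if per-token accuracy improves smoothly, an all-or-nothing exact-match score over a multi-token answer can still look like a sudden jump. The accuracy values, the answer length and the assumption that tokens are independent are all made up for this sketch.

```python
# Made-up per-token accuracies for a series of increasingly capable models.
per_token_accuracy = [0.5, 0.6, 0.7, 0.8, 0.9, 0.95, 0.99]
answer_length = 10  # number of tokens that must all be correct


def exact_match(p, length):
    """Chance that all `length` tokens are right, assuming independent tokens."""
    return p ** length


exact = [exact_match(p, answer_length) for p in per_token_accuracy]

for p, em in zip(per_token_accuracy, exact):
    print(f"per-token accuracy {p:.2f} -> exact match {em:.3f}")
```

The per-token column rises steadily, while the exact-match column stays close to zero for the weaker models and only climbs steeply near the end, which is the kind of apparent discontinuity the summary describes.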

9. Distinguishing the Knowable from the Unknowable with Language Models

Gustaf Ahdritz, Tian Qin, Nikhil Vyas, Boaz Barak, Benjamin L. Edelman. Harvard. ICML 2024.
https://arxiv.org/abs/2402.03563

The paper investigates the possibility of distinguishing epistemic uncertainty (due to lack of knowledge) from aleatoric uncertainty (due to entropy in the underlying distribution) in LLMs. They make the assumption that a large LLM has no (or less) epistemic uncertainty and train a probe to predict the uncertainty of a large LLM based on the frozen activations of a smaller LLM. This is largely successful, indicating that the model activations in the smaller model are able to differentiate between the different uncertainty types. They also propose that in-context information affects the LLM probabilities more in the case of epistemic uncertainty, and less with aleatoric uncertainty.

10. Do Large Language Models Latently Perform Multi-Hop Reasoning?

Sohee Yang, Elena Gribovskaya, Nora Kassner, Mor Geva, Sebastian Riedel. DeepMind, UCL, Google Research, Tel Aviv University. ArXiv 2024.
https://arxiv.org/abs/2402.16837

The paper investigates the ability of LLMs to perform latent multi-hop reasoning, by completing sentences such as “The author of the novel Ubik was born in the city of …”. They construct experiments to investigate each hop separately: 1) whether the internal representation for the intermediate entity (“Philip K. Dick”) required for the first hop strengthens, and 2) whether increased recall of the intermediate entity improves the consistency of the final answer. In these experiments they find strong evidence for the first hop and moderate evidence for the second hop. They construct a dataset of 45,595 pairs of one-hop and two-hop prompts, to be released.

11. PhaseEvo: Towards Unified In-Context Prompt Optimization for Large Language Models

Wendi Cui, Jiaxin Zhang, Zhuohang Li, Hao Sun, Damien Lopez, Kamalika Das, Bradley Malin, Sricharan Kumar. Intuit, Vanderbilt, Cambridge. ArXiv 2024.
https://arxiv.org/abs/2402.11347

The paper describes an evolution strategy for finding optimal LLM prompts for specific tasks. Prompts are first initialised either by experts or by asking an LLM to recover the prompt based on example input-output pairs. These initial prompts then go through multiple mutation steps, by having the LLM rewrite the prompts according to different strategies. The fitness of the prompts is measured by performance on the dev set and the best prompt is then evaluated on the test set.

12. Can LLMs Reason with Rules? Logic Scaffolding for Stress-Testing and Improving LLMs

Siyuan Wang, Zhongyu Wei, Yejin Choi, Xiang Ren. Fudan University, University of Washington, USC, AllenAI. ArXiv 2024.
https://arxiv.org/abs/2402.11442

The paper first constructs a dataset of 14000 commonsense logical rules, using LLMs to generate and check the rules. They then assess LLM understanding of these rules by turning the rules into a binary entailment classification task, showing that all models decrease in performance as the complexity of the rules increases. Finally, they train an LLM with these rules and show benefit on downstream tasks that require commonsense reasoning.

13. An LLM Compiler for Parallel Function Calling

Sehoon Kim, Suhong Moon, Ryan Tabrizi, Nicholas Lee, Michael W. Mahoney, Kurt Keutzer, Amir Gholami. UC Berkeley, ICSI, LBNL. ICML 2024.
https://arxiv.org/abs/2312.04511

While most LLMs perform tool calls sequentially, in a linear iteration loop, this paper investigates executing tool calls in parallel. The planner stage first predicts which tool calls will be needed and what the dependencies between them are. These tools are then called in parallel and the results are combined for the final answer. They show performance improvements in some settings and speed improvements in all evaluated settings, showing particular usefulness in settings where the LLM needs to retrieve information about multiple entities (e.g. do background research) in order to reach the final solution.
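To make the idea of dependency-aware parallel tool calls concrete, here is a minimal sketch. It is not the paper's implementation: the plan format, the `run_plan` function and the toy tools are all assumptions made up for illustration, and a real planner would be an LLM producing the plan.

```python
from concurrent.futures import ThreadPoolExecutor


def run_plan(plan, tools):
    """Execute a plan of tool calls, running independent calls in parallel.

    plan:  {task_id: (tool_name, [ids of tasks it depends on])}
    tools: {tool_name: callable taking the list of dependency results}
    """
    results = {}
    pending = dict(plan)
    with ThreadPoolExecutor() as pool:
        while pending:
            # A task is ready once all of its dependencies have results.
            ready = [t for t, (_, deps) in pending.items()
                     if all(d in results for d in deps)]
            if not ready:
                raise ValueError("cyclic or unsatisfiable dependencies")
            # Submit every ready task at once so they run concurrently.
            futures = {
                t: pool.submit(tools[pending[t][0]],
                               [results[d] for d in pending[t][1]])
                for t in ready
            }
            for t, fut in futures.items():
                results[t] = fut.result()
                del pending[t]
    return results


# Hypothetical tools: two independent lookups, then a step that joins them.
tools = {
    "lookup_a": lambda deps: 3,
    "lookup_b": lambda deps: 4,
    "add": lambda deps: sum(deps),
}
plan = {
    "t1": ("lookup_a", []),
    "t2": ("lookup_b", []),
    "t3": ("add", ["t1", "t2"]),
}
final = run_plan(plan, tools)
```

Here `t1` and `t2` have no dependencies and are submitted together, while `t3` waits for both results.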

14. Interpreting Language Models with Contrastive Explanations

Kayo Yin, Graham Neubig. UC Berkeley, CMU. EMNLP 2022.
https://aclanthology.org/2022.emnlp-main.14/

Proposes an explainability method for language modelling that explains why one word was predicted instead of a specific other word. Adapts three different explainability methods to this contrastive approach and evaluates on a dataset of minimally different sentences. The method is shown to better highlight the differences between these specific words, instead of just assigning most of the focus to the previous word.

15. Crossmodal-3600: A Massively Multilingual Multimodal Evaluation Dataset

Ashish V. Thapliyal, Jordi Pont-Tuset, Xi Chen, Radu Soricut. Google Research. EMNLP 2022.
https://arxiv.org/abs/2205.12522

Multilingual image captioning dataset containing captions in 36 languages for 3600 images. An annotation process has been designed to make sure all the annotations resemble the same style, similar to an automated captioning output. Results show that the dataset provides a much higher correlation to human judgements, compared to the silver annotations from COCO-dev.

16. Locating and Editing Factual Associations in GPT

Kevin Meng, David Bau, Alex Andonian, Yonatan Belinkov. MIT, Northeastern, Technion IIT. NeurIPS 2022.
https://arxiv.org/abs/2202.05262

Proposes a method for editing a specific relational fact in a pre-trained language model. The specific feedforward layer that is responsible for recalling the target of a specific fact is identified using noisy perturbations of the model activations. The feedforward layer is then directly trained to produce a more optimal output when given the subject of that fact as input.

17. M2D2: A Massively Multi-Domain Language Modeling Dataset

Machel Reid, Victor Zhong, Suchin Gururangan, Luke Zettlemoyer. The University of Tokyo, University of Washington. EMNLP 2022.
https://aclanthology.org/2022.emnlp-main.63/

Assembling a dataset for language model domain adaptation, from Wikipedia and ArXiv, containing 22 broad domains and 145 fine-grained domains. Experiments show that when fine-tuning a model for in-domain data, it is best to tune on related broad domain data first, then further only on the specific fine-grained domain data. Out-of-domain performance is shown to strongly correlate with vocabulary overlap between the different domains.

18. Binding Language Models in Symbolic Languages

Zhoujun Cheng, Tianbao Xie, Peng Shi, Chengzu Li, Rahul Nadkarni, Yushi Hu, Caiming Xiong, Dragomir Radev, Mari Ostendorf, Luke Zettlemoyer, Noah A. Smith, Tao Yu. The University of Hong Kong, Shanghai Jiao Tong University, University of Washington, AllenAI, University of Waterloo, Salesforce Research, Yale University, Meta AI. ICLR 2023.
https://arxiv.org/abs/2210.02875

Uses a large pre-trained language model (GPT-3 Codex) to translate a natural language question into an SQL query. This query can then contain API calls, which also get executed by the language model, in order to populate additional columns of information in a database, over which the SQL can then operate. Outperforms previous methods on datasets of questions about data tables, using only contextual examples without fine-tuning the language model.

19. Hungry Hungry Hippos: Towards Language Modeling with State Space Models

Daniel Y. Fu, Tri Dao, Khaled K. Saab, Armin W. Thomas, Atri Rudra, Christopher Ré. Stanford, University at Buffalo. ICLR 2023.
https://arxiv.org/abs/2212.14052

Proposes a modification for state space models, a version of RNN with linear operations that can be separated into components of cumulative sums. The modification gives them abilities similar to attention, being able to copy and compare tokens across the sequence. The architecture scales O(N log N) with sequence length N, as opposed to N^2 for regular attention.

20. Why Can GPT Learn In-Context? Language Models Implicitly Perform Gradient Descent as Meta-Optimizers

Damai Dai, Yutao Sun, Li Dong, Yaru Hao, Shuming Ma, Zhifang Sui, Furu Wei. ACL 2023.
https://aclanthology.org/2023.findings-acl.247/

Shows how the equations for attention during in-context learning (showing the model examples in the input) can be thought of as a form of gradient descent. Experiments indicate that there are also similarities in how these methods affect the model in practice.

21. PAL: Program-aided Language Models

Luyu Gao, Aman Madaan, Shuyan Zhou, Uri Alon, Pengfei Liu, Yiming Yang, Jamie Callan, Graham Neubig. CMU, Inspired Cognition. ICML 2023.
https://arxiv.org/abs/2211.10435

Question answering with a language model while generating a chain-of-thought, outputting the necessary reasoning steps to get to the answer. In addition to natural language reasoning steps, the model generates Python code that is then executed in order to output the final answer. This is shown to improve results, particularly when the result requires arithmetic over large numbers.

22. Explaining black box text modules in natural language with language models

Chandan Singh, Aliyah R. Hsu, Richard Antonello, Shailee Jain, Alexander G. Huth, Bin Yu, Jianfeng Gao. MSR, UC Berkeley, UT Austin. NeurIPS XAIA 2023.
https://arxiv.org/abs/2305.09863

A method for providing natural language explanations to black-box modules that take text as input and return a score as output. A large number of ngrams are passed through the model in order to identify ngrams that result in the largest score. These ngrams are sampled and fed into an LLM for summarization, generating explanation candidates. An LLM is then used for generating positive and negative example sentences based on each candidate explanation, and the score differences between these examples are used for selecting the best explanation.

23. Do CoNLL-2003 Named Entity Taggers Still Work Well in 2023?

Shuheng Liu, Alan Ritter. Georgia Institute of Technology. ACL 2023.
https://arxiv.org/abs/2212.09747

A thorough investigation of well-known NER models and how their performance is affected on modern data when trained on CoNLL 2003. They annotate a new test set of news data from 2020 and find that performance of certain models holds up very well and the field luckily hasn’t overfitted to the CoNLL 2003 test set. For best results on modern data, large models pre-trained on contemporary corpora come out on top, with RoBERTa, T5 and LUKE standing out.

24. The False Promise of Imitating Proprietary LLMs

Arnav Gudibande, Eric Wallace, Charlie Snell, Xinyang Geng, Hao Liu, Pieter Abbeel, Sergey Levine, Dawn Song. UC Berkeley. ArXiv 2023.
https://arxiv.org/abs/2305.15717

Analyzing the performance of open-source LLMs fine-tuned on the outputs of proprietary LLMs. They conclude that when fine-tuned on general-purpose dialogues, the models learn to mimic the style of the teacher model and can fool human assessors, but lack core knowledge and can more easily generate factually incorrect claims. However, when fine-tuning only for a specific task, then this imitation strategy can reach near-parity with the teacher models.

25. UnifiedSKG: Unifying and Multi-Tasking Structured Knowledge Grounding with Text-to-Text Language Models

Tianbao Xie, Chen Henry Wu, Peng Shi, Ruiqi Zhong, Torsten Scholak, Michihiro Yasunaga, Chien-Sheng Wu, Ming Zhong, Pengcheng Yin, Sida I. Wang, Victor Zhong, Bailin Wang, Chengzu Li, Connor Boyle, Ansong Ni, Ziyu Yao, Dragomir Radev, Caiming Xiong, Lingpeng Kong, Rui Zhang, Noah A. Smith, Luke Zettlemoyer, Tao Yu. Misc. EMNLP 2022.
https://aclanthology.org/2022.emnlp-main.39

Unifies 21 generation tasks that require accessing structured information sources into a general sequence-to-sequence benchmark for language models. It includes datasets on semantic parsing, question answering, data-to-text, dialogue and fact verification. The input, structured data and output are linearised for these datasets, so that a general-purpose language model can be used for all of them. A fine-tuned T5 model is shown to outperform existing SOTA models on many of these datasets.

26. Toolformer: Language Models Can Teach Themselves to Use Tools

Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Luke Zettlemoyer, Nicola Cancedda, Thomas Scialom. Meta, Universitat Pompeu Fabra. NeurIPS 2023.
https://openreview.net/forum?id=Yacmpz84TH

The paper describes a method for teaching large language models to use external tools. A supervised dataset is created by trying to insert results from API calls into various points in the text, then only retaining the cases where doing that improves perplexity. The LM is then fine-tuned on this dataset and manages to improve performance on several tasks by using tools such as a QA system, Wikipedia search and a calculator.
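The filtering step can be illustrated with a small sketch. This is a simplified reading of the description above, not the paper's code: the candidate format, field names and log-probability values are hypothetical, and in practice the log-probabilities would come from the language model itself.

```python
import math


def perplexity(token_logprobs):
    """Perplexity of a token sequence given its per-token log-probabilities."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def keep_helpful_calls(candidates):
    """Retain only candidates where inserting the API result lowers perplexity."""
    return [
        c for c in candidates
        if perplexity(c["logprobs_with_call"]) < perplexity(c["logprobs_without_call"])
    ]


# Hypothetical candidates: log-probabilities of the following tokens,
# scored with and without the inserted API result.
candidates = [
    {"tool": "calculator",
     "logprobs_without_call": [-2.0, -1.5],
     "logprobs_with_call": [-0.5, -0.4]},
    {"tool": "search",
     "logprobs_without_call": [-1.0, -1.0],
     "logprobs_with_call": [-1.2, -1.1]},
]
kept = [c["tool"] for c in keep_helpful_calls(candidates)]
```

Only the calculator example survives, since the search result made the following text less likely rather than more.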

27. ToolkenGPT: Augmenting Frozen Language Models with Massive Tools via Tool Embeddings

Shibo Hao, Tianyang Liu, Zhen Wang, Zhiting Hu. UC San Diego, Mohamed bin Zayed University of Artificial Intelligence. NeurIPS 2023.
https://openreview.net/forum?id=BHXsb69bSx

The paper presents ToolkenGPT, a framework for extending LMs with tool use. For each new tool, a new token is added to the output vocabulary of the LM and the embedding for that token is trained with annotated or synthetic examples. When the LM generates that token, the model is switched to a different mode and prompted with examples for that particular tool in order to generate the necessary arguments for that tool call.

28. Voyager: An Open-Ended Embodied Agent with Large Language Models

Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, Anima Anandkumar. NVIDIA, Caltech, UT Austin, Stanford, UW Madison. TMLR 2024.
https://arxiv.org/abs/2305.16291

Combining together black-box LLMs through different prompting strategies to create a capable agent for playing Minecraft. One component receives information about the current state, reasons about the next possible goals and formulates a suitable task in natural language. Another component receives the desired task along with descriptions of existing API functions and skills, then iteratively generates and improves a code function for performing that task.

29. Analyzing the Performance of GPT-3.5 and GPT-4 in Grammatical Error Correction

Steven Coyne, Keisuke Sakaguchi, Diana Galvan-Sosa, Michael Zock, Kentaro Inui. Tohoku University, RIKEN, Aix-Marseille University. ArXiv 2023.
https://arxiv.org/abs/2303.14342

Evaluating well-known LLMs on two established benchmarks for grammatical error correction (GEC). They find that specific prompts matter quite a bit and lower temperature is better for GEC. The results show that LLMs tend to over-correct and perform fluency edits, achieving state-of-the-art performance on a dataset designed for this type of edit (JFLEG). However, performance is considerably lower on a benchmark that focuses on minimal edits and only fixing grammaticality (BEA-2019).

30. Backpack Language Models

John Hewitt, John Thickstun, Christopher D. Manning, Percy Liang. Stanford. ACL 2023.
https://aclanthology.org/2023.acl-long.506

Proposes a language model architecture that maps each word to multiple sense vectors, uses a separate model to predict the weights for these senses, then directly outputs a prediction as a log-linear function. Experiments show that performance degrades, as many more parameters are required to reach a perplexity comparable to a transformer LM. They find that editing the weights of these sense vectors can be used for mitigating gender bias or editing specific information.

31. What the DAAM: Interpreting Stable Diffusion Using Cross Attention

Raphael Tang, Linqing Liu, Akshat Pandey, Zhiying Jiang, Gefei Yang, Karun Kumar, Pontus Stenetorp, Jimmy Lin, Ferhan Ture. Comcast Applied AI, UCL, University of Waterloo. ACL 2023.
https://aclanthology.org/2023.acl-long.310

A method for analysing text-to-image models, indicating which areas of the generated image are attributed to a particular word in the input. They use the attention scores in Stable Diffusion between convolutional blocks and word embeddings, upscaling and aggregating them between different heads, layers and time steps. The result is competitive with supervised image segmentation models and they use it for linguistic analysis. For example, they show that cohyponyms (e.g. a giraffe and a zebra) in the input can have their concepts merged, resulting in only one of the objects being generated.

32. ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs

Yujia Qin, Shihao Liang, Yining Ye, Kunlun Zhu, Lan Yan, Yaxi Lu, Yankai Lin, Xin Cong, Xiangru Tang, Bill Qian, Sihan Zhao, Lauren Hong, Runchu Tian, Ruobing Xie, Jie Zhou, Mark Gerstein, Dahai Li, Zhiyuan Liu, Maosong Sun. Tsinghua University, ModelBest, Renmin University of China, Yale University, WeChat AI, Tencent, Zhihu. ICLR 2024.
https://arxiv.org/abs/2307.16789

They crawl documentation for 16K APIs from RapidAPI and synthesize an instruction tuning dataset for using these APIs.
Instruction examples are generated using ChatGPT, by asking it to generate examples that make use of one or multiple sample APIs.
The LLM is allowed multiple rounds of API calls and responses, which are explored using a depth-first search strategy, until a final answer or termination is generated.

33. ZeroSCROLLS: A Zero-Shot Benchmark for Long Text Understanding

Uri Shaham, Maor Ivgi, Avia Efrat, Jonathan Berant, Omer Levy. Tel Aviv University, Meta AI. EMNLP 2023.
https://aclanthology.org/2023.findings-emnlp.536

Constructing a benchmark for zero-shot understanding of long texts with LLMs. Includes existing datasets (summarization, QA) from the Scrolls benchmark, along with two new tasks: determining the ratio of positive reviews in a set of reviews, and sorting a shuffled list of book chapter summaries. Results indicate that GPT-4 is best overall, even though it loses points in automatic evaluation as it doesn’t follow the instructed format.

34. Parameter-Efficient Instruction Tuning of Large Language Models For Extreme Financial Numeral Labelling

Subhendu Khatuya, Rajdeep Mukherjee, Akash Ghosh, Manjunath Hegde, Koustuv Dasgupta, Niloy Ganguly, Saptarshi Ghosh, Pawan Goyal. Indian Institute of Technology Kharagpur, Goldman Sachs. NAACL 2024.
https://aclanthology.org/2024.naacl-long.410/

The paper approaches the task of tagging entities with a large number of labels using a generative model. The LLM is instruction-tuned to generate the description of the label. The generated description is then mapped to the closest real label description using sentence embeddings and cosine similarity. The method achieves strong performance over other traditional tagging models on this task.
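The mapping from a generated description to the closest real label can be sketched as follows. This is only an illustration: the bag-of-words `embed` function is a stand-in for a real sentence embedding model, and the label names and descriptions are hypothetical.

```python
import math
from collections import Counter


def embed(text):
    # Stand-in for a sentence embedding model: a bag-of-words count vector.
    return Counter(text.lower().split())


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[w] * v[w] for w in u)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def closest_label(generated, label_descriptions):
    """Return the label whose description is most similar to the generated text."""
    g = embed(generated)
    return max(label_descriptions,
               key=lambda name: cosine(g, embed(label_descriptions[name])))


# Hypothetical label inventory: label name -> label description.
labels = {
    "interest_rate": "interest rate of the debt instrument",
    "share_repurchase": "amount of shares repurchased",
}
predicted = closest_label("stated interest rate on debt", labels)
```

Because the final label is always chosen from the real inventory, the generative model can never output a label that does not exist, even if its generated description is slightly off.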

35. CascadER: Cross-Modal Cascading for Knowledge Graph Link Prediction

Tara Safavi, Doug Downey, Tom Hope. University of Michigan, Allen Institute for Artificial Intelligence, Northwestern University, The Hebrew University of Jerusalem. AKBC 2022.
https://arxiv.org/abs/2205.08012

Proposes a pipeline of increasingly complex models for predicting links in a graph. Simpler models are used to perform initial filtering, then more complex models that need much processing time are applied only on a small chosen sample. The number of candidates to retain at each stage is learned in a supervised way. The model makes it feasible to apply very large models to large tasks while also improving the SOTA results.

36. Revisiting Transformer-based Models for Long Document Classification

Xiang Dai, Ilias Chalkidis, Sune Darkner, Desmond Elliott. CSIRO Data61, University of Copenhagen. EMNLP 2022.
https://aclanthology.org/2022.findings-emnlp.534.pdf

Comparing sparse attention (Longformer) and hierarchical transformers for long document classification, focusing on electronic health records. They find that the performance is close, with Longformer slightly better out-of-the-box and hierarchical models better after tuning hyperparameters. Results also indicate that splitting long text into overlapping sections and using Label-Wise Attention Network helps improve performance.

37. Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language

Andy Zeng, Maria Attarian, Brian Ichter, Krzysztof Choromanski, Adrian Wong, Stefan Welker, Federico Tombari, Aveek Purohit, Michael Ryoo, Vikas Sindhwani, Johnny Lee, Vincent Vanhoucke, Pete Florence. Google. ICLR 2023.
https://arxiv.org/abs/2204.00598

Constructs a pipeline of pre-trained language models with different modalities for scene understanding. Visual LMs rank possible locations and objects, audio LMs rank possible sounds, regular LMs take this information in a filled-out template and generate summaries or answers for QA. LMs can also be used to generate candidate activities, based on the detected places and objects, then these activities are re-ranked by a visual LM to choose ones that match the scene.

38. LIFT: Language-Interfaced Fine-Tuning for Non-Language Machine Learning Tasks

Tuan Dinh, Yuchen Zeng, Ruisu Zhang, Ziqian Lin, Michael Gira, Shashank Rajput, Jy-yong Sohn, Dimitris Papailiopoulos, Kangwook Lee. University of Wisconsin-Madison. NeurIPS 2022.
https://arxiv.org/abs/2206.06565

The paper investigates the application of pre-trained language models on the task of classifying non-textual data, without any architecture changes.
The input features are linearized into a text-like sequence and given as context, the output is then collected as a language model prediction.
While it doesn’t perform best overall, it does achieve surprisingly competitive performance.

39. Few-Shot Tabular Data Enrichment Using Fine-Tuned Transformer Architectures

Asaf Harari, Gilad Katz. Ben-Gurion University of the Negev. ACL 2022.
https://aclanthology.org/2022.acl-long.111.pdf

System for creating additional feature columns for a tabular dataset, which can then be useful for classification. Entities (rows) in the data are matched to Wikipedia articles in order to retrieve plain text descriptions. Binary classifiers are then trained to classify this text according to properties from the tabular data, and the output probabilities are included as new features in the dataset. Evaluation is performed on the main classification task by using these new extra features.

40. Counterfactual Memorization in Neural Language Models

Chiyuan Zhang, Daphne Ippolito, Katherine Lee, Matthew Jagielski, Florian Tramèr, Nicholas Carlini. Google Research, Carnegie Mellon University, Google DeepMind, ETH Zürich. NeurIPS 2023.
https://arxiv.org/abs/2112.12938

The paper proposes a measure of counterfactual memorization for pre-trained language models.
A large number of different models are trained on subsets of the training set.
The expected performance on a particular sentence is then calculated for models whose training subset contained that sentence, versus models whose training subset did not contain it.
A high score difference indicates that the model tends to memorize this sentence when it is included in the training data.
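The score described above can be written down in a few lines. This is a sketch under assumptions: the `runs` data structure and the numbers in it are invented for illustration, and real scores would come from evaluating each trained model on the example.

```python
from statistics import mean


def counterfactual_memorization(runs, example_id):
    """Difference in expected performance on one example between models that
    saw it during training and models that did not.

    runs: list of (train_ids, scores) pairs, one per trained model, where
          train_ids is the set of example ids in that model's training subset
          and scores maps an example id to the model's performance on it.
    """
    seen = [scores[example_id] for ids, scores in runs if example_id in ids]
    unseen = [scores[example_id] for ids, scores in runs if example_id not in ids]
    return mean(seen) - mean(unseen)


# Hypothetical results from four models trained on different subsets.
runs = [
    ({"a", "b"}, {"a": 1.0, "b": 0.9}),
    ({"a"}, {"a": 0.9, "b": 0.8}),
    ({"b"}, {"a": 0.2, "b": 0.8}),
    (set(), {"a": 0.1, "b": 0.9}),
]
score_a = counterfactual_memorization(runs, "a")
score_b = counterfactual_memorization(runs, "b")
```

In this toy data, example `a` is only handled well by models that trained on it (a large score, suggesting memorization), whereas example `b` is handled equally well either way (a score near zero).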

41. Relation-Constrained Decoding for Text Generation

Xiang Chen, Zhixian Yang, Xiaojun Wan. Peking University. NeurIPS 2022.
https://openreview.net/forum?id=dIUQ5haSOI

The paper describes a model for text generation, based on target dependency relations that should be in the output.
The word-level output probabilities are modified to increase the likelihood of generating words that match the target relation.
During beam decoding, the candidate construction method also takes the target relations into account.
Evaluation is performed on several datasets, formulating the task as text generation based on dependency relations.

42. Interpretability for Language Learners Using Example-Based Grammatical Error Correction

Masahiro Kaneko, Sho Takase, Ayana Niwa, Naoaki Okazaki. Tokyo Institute of Technology. ACL 2022.
https://aclanthology.org/2022.acl-long.496

Describes a system for performing error correction, while also returning examples of similar corrections from the training set. Each token in each sentence is encoded with the GEC model and the representation is used for finding other similar correction examples. The kNN-based similarity is also incorporated into the output distribution of the error correction model, improving performance on closed-class errors while reducing some performance on open-class errors.

43. Ensembling and Knowledge Distilling of Large Sequence Taggers for Grammatical Error Correction

Maksym Tarnavskyi, Artem Chernodub, Kostiantyn Omelianchuk. Ukrainian Catholic University, Grammarly. ACL 2022.
https://aclanthology.org/2022.acl-long.266

The paper extends the GECToR sequence tagging architecture for grammatical error correction.
The models are scaled up to bigger versions, multiple versions are ensembled together and then distilled back to a single model through generated training data. Improvements are shown on the BEA-2019 dataset both for the ensembled configuration and the single best model.

44. Linguistic Parameters of Spontaneous Speech for Identifying Mild Cognitive Impairment and Alzheimer Disease

Veronika Vincze, Martina Katalin Szabó, Ildikó Hoffmann, László Tóth, Magdolna Pákáski, János Kálmán, Gábor Gosztolya. University of Szeged. Computational Linguistics 2022.
https://aclanthology.org/2022.cl-1.5

Developing a system for the detection of cognitive impairment based on linguistic features.
Patients were recorded when answering free-text questions, their answers transcribed, and a large number of features extracted for classification.
Morphological and statistical features perform well across different tasks and the overall classifier achieves 2-class F1 of 84-86%.

45. Black-box Prompt Learning for Pre-trained Language Models

Shizhe Diao, Zhichao Huang, Ruijia Xu, Xuechun Li, Yong Lin, Xiao Zhou, Tong Zhang. The Hong Kong University of Science and Technology, University of California San Diego. TMLR 2023.
https://arxiv.org/abs/2201.08531

The paper proposes the task of tuning prompts for pre-trained models in a setting where the model weights or activations are not available and need to be treated as a black box.
A first stage of white-box fine-tuning using a small dataset is assumed, followed by a black-box tuning stage using additional data just for updating the prompts.
The prompts are updated by randomly sampling perturbations to the existing prompts, then approximating the gradient using the natural evolution strategy (NES) algorithm.
Evaluation shows that having the extra black-box training step on additional data is beneficial over only doing white-box prompt tuning using a smaller dataset, but is outperformed by using the full dataset for white-box prompt tuning.
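For readers unfamiliar with gradient estimation from black-box function values, here is a generic sketch of a natural-evolution-strategy style estimator on a toy continuous objective. It is not the paper's prompt-tuning procedure: the objective `score`, the parameter vector and all hyperparameters are assumptions chosen for illustration.

```python
import random


def nes_gradient(f, x, sigma=0.1, n_samples=5000, seed=0):
    """Estimate the gradient of f at x using only function evaluations,
    with antithetic (mirrored) Gaussian perturbations."""
    rng = random.Random(seed)
    grad = [0.0] * len(x)
    for _ in range(n_samples):
        eps = [rng.gauss(0.0, 1.0) for _ in x]
        plus = f([xi + sigma * e for xi, e in zip(x, eps)])
        minus = f([xi - sigma * e for xi, e in zip(x, eps)])
        # Perturbation directions that raise the score get positive weight.
        for i, e in enumerate(eps):
            grad[i] += (plus - minus) * e
    return [g / (2 * sigma * n_samples) for g in grad]


def score(x):
    # Toy black-box objective, maximised at the origin.
    return -sum(xi * xi for xi in x)


estimate = nes_gradient(score, [1.0, -2.0])
```

The true gradient of this toy objective at `[1.0, -2.0]` is `[-2.0, 4.0]`, and the estimate should land close to it without ever differentiating `score` directly.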

46. Mind the gap: Challenges of deep learning approaches to Theory of Mind

Jaan Aru, Aqeel Labash, Oriol Corcoll, Raul Vicente. University of Tartu. ArXiv 2022.
https://arxiv.org/abs/2203.16540

An opinion paper on deep learning models in connection to the Theory of Mind – the skill of humans to understand the minds of others, imagine that they might have hidden knowledge or emotions. Gives a summary of different stages of this skill developing in humans, along with a review of this work in the deep learning field. Proposes that this is not a skill that will develop from one task, and that it should be evaluated through the interpretation of neural networks (for example whether a specific neuron can be identified to detect the emotion of others).

47. Atomic Inference for NLI with Generated Facts as Atoms

Joe Stacey, Pasquale Minervini, Haim Dubossarsky, Oana-Maria Camburu, Marek Rei. Imperial, Edinburgh, QMUL, UCL. EMNLP 2024.
https://arxiv.org/abs/2305.13214

Long texts are broken down into individual self-contained facts using an LLM. A special architecture then learns to make decisions about the text by making individual entailment decisions about those facts. The resulting system is able to point to specific facts as explanations for the overall decision, as these are guaranteed to be a faithful explanation for the final prediction.

48. Continuous Predictive Modeling of Clinical Notes and ICD Codes in Patient Health Records

Mireia Hernandez Caralt, Clarence Boon Liang Ng, Marek Rei. Imperial. BioNLP 2024.
https://arxiv.org/pdf/2405.11622

Investigates the task of early prediction of hospital diagnoses and necessary procedures, based on textual notes in the electronic health records. A causal hierarchical model is created, which is able to make predictions about overall ICD codes at every timestep of the hospital stay. As the note sequences are very long, an extended context algorithm is proposed, which samples a subset of notes during training but is able to iteratively use the whole sequence during testing.

49. Modelling Temporal Document Sequences for Clinical ICD Coding

Clarence Boon Liang Ng, Diogo Santos, Marek Rei. Imperial, Transformative AI. EACL 2023.
https://aclanthology.org/2023.eacl-main.120.pdf

Assigning ICD codes to discharge summaries in electronic health records, which indicate the diagnoses and procedures for each patient. The model is designed to integrate additional information from the previous notes in the health record. Additive embeddings are used for representing metadata about each note. The system achieves state-of-the-art results on the task of ICD coding.

50. When and Why Does Bias Mitigation Work?

Abhilasha Ravichander, Joe Stacey, Marek Rei. Ai2, Imperial. EMNLP 2023.
https://aclanthology.org/2023.findings-emnlp.619

Targeted testing of different model debiasing methods, in order to investigate their effect on the model. Creating six datasets that contain very specific controlled biases for probing debiasing methods. Experiments show that specifically debiasing against one bias actually increases reliance on another bias.

51. Did the Neurons Read your Book? Document-level Membership Inference for Large Language Models

Matthieu Meeus, Shubham Jain, Marek Rei, Yves-Alexandre de Montjoye. Imperial College London, Sense Street. USENIX Security Symposium 2024.
https://arxiv.org/abs/2310.15007

Introducing the task of document-level membership inference for LLMs – determining whether a particular document (e.g. book or article) has been used during LLM training, while only having query access to the resulting model. All token-level probabilities are collected from the language model, these are normalised by how rare each token is overall, and then aggregated into features and given to a supervised classifier. Experiments show that most documents can be detected, with the detection working better for longer documents.
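The feature extraction step might look roughly like the sketch below. The specific normalisation (a log-ratio against overall token frequency) and the choice of aggregate statistics are assumptions made for illustration, as are the numbers. The resulting features would then be passed to a supervised classifier, which is omitted here.

```python
import math
from statistics import mean


def membership_features(token_probs, token_frequencies):
    """Aggregate per-token evidence into document-level features.

    token_probs:       probability the LM assigned to each token in the document
    token_frequencies: overall relative frequency of each of those tokens
                       (assumed to come from some reference corpus)
    """
    # Normalise: a high LM probability is less surprising for a common token.
    ratios = [math.log(p / f) for p, f in zip(token_probs, token_frequencies)]
    return {"mean": mean(ratios), "min": min(ratios), "max": max(ratios)}


# Hypothetical values for a three-token document.
features = membership_features(
    token_probs=[0.5, 0.2, 0.1],
    token_frequencies=[0.05, 0.02, 0.1],
)
```

The first two tokens are predicted ten times more confidently than their frequency would suggest, which is the kind of signal a classifier could pick up on.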

52. An Extended Sequence Tagging Vocabulary for Grammatical Error Correction

Stuart Mesham, Christopher Bryant, Marek Rei, Zheng Yuan. University of Cambridge, Imperial College London, King’s College London. EACL 2023.
https://aclanthology.org/2023.findings-eacl.119

Introducing tool usage into a tagging-based grammatical error correction model. Instead of training the model to correct every type of error itself, the model detects when a word should be sent to a separate spellcheck or an inflection system. The modification improves performance on the targeted error types and overall, also leaving room for introducing additional tools for other error types.

53. On the application of Large Language Models for language teaching and assessment technology

Andrew Caines, Luca Benedetto, Shiva Taslimipoor, Christopher Davis, Yuan Gao, Oeistein Andersen, Zheng Yuan, Mark Elliott, Russell Moore, Christopher Bryant, Marek Rei, Helen Yannakoudakis, Andrew Mullooly, Diane Nicholls, Paula Buttery. Cambridge, KCL, CUPA, Writer Inc, Imperial, ELiT. AIED LLM 2023.
https://arxiv.org/abs/2307.08393

Discussing the possible applications of generative language models in the area of language teaching. Covering tasks such as automated test creation, question difficulty estimation, automated essay scoring and feedback generation. The paper concludes that there is a lot of potential but for best results the automated systems need to be paired with human intervention.

54. Prompting open-source and commercial language models for grammatical error correction of English learner text

Christopher Davis, Andrew Caines, Øistein Andersen, Shiva Taslimipoor, Helen Yannakoudakis, Zheng Yuan, Christopher Bryant, Marek Rei, Paula Buttery. Cambridge, Writer Inc, KCL, Imperial. ACL 2024.
https://arxiv.org/pdf/2401.07702

Investigating the abilities of LLMs to perform grammatical error correction. 7 open-source and 3 commercial models are evaluated with 11 different prompts on established error correction benchmarks. LLMs perform the best on fluency edits but do not come close to state-of-the-art performance on minimal corrections of grammatical errors.

55. Control Prefixes for Parameter-Efficient Text Generation

Jordan Clive, Kris Cao, Marek Rei. Imperial, DeepMind, Cambridge. GEM 2022.
https://aclanthology.org/2022.gem-1.31

Introducing control prefixes, which are plug-and-play modules for steering text generation in a particular direction. By switching on a particular control prefix, the model can generate text in a particular domain, in a particular style or at a specific length. Experiments also include predicting values for a previously unseen generation category.

56. Logical Reasoning with Span-Level Predictions for Interpretable and Robust NLI Models

Joe Stacey, Pasquale Minervini, Haim Dubossarsky, Marek Rei. Imperial College London, University of Edinburgh, UCL, Queen Mary University of London. EMNLP 2022.
https://aclanthology.org/2022.emnlp-main.251

A neural architecture for entailment detection (NLI) that has guarantees on the faithfulness of its explanations. Text is broken into shorter spans, the model makes predictions about each span separately and the final overall decision is found deterministically based on the span-level predictions. The updated model retains performance while being more explainable, and also outperforms all previous logical architectures for entailment detection.
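As a rough illustration of how a deterministic aggregation can give a faithfulness guarantee, consider the sketch below. The rule set used here (any contradicting span gives contradiction, otherwise any neutral span gives neutral, otherwise entailment) is an assumption for the example, and the spans and labels are invented.

```python
def aggregate_span_predictions(span_labels):
    """Deterministically combine span-level NLI labels into one label."""
    if "contradiction" in span_labels:
        return "contradiction"
    if "neutral" in span_labels:
        return "neutral"
    return "entailment"


def responsible_spans(spans, span_labels):
    """Return the final label and the spans whose labels produced it."""
    final = aggregate_span_predictions(span_labels)
    return final, [s for s, label in zip(spans, span_labels) if label == final]


# Hypothetical hypothesis spans with span-level model predictions.
spans = ["a dog", "is asleep"]
span_labels = ["entailment", "contradiction"]
decision = responsible_spans(spans, span_labels)
```

Because the overall label is a fixed function of the span labels, the spans returned alongside it are exactly the ones that determined the prediction.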

57. Probing for targeted syntactic knowledge through grammatical error detection

Christopher Davis, Christopher Bryant, Andrew Caines, Marek Rei, Paula Buttery. Cambridge, Imperial. CoNLL 2022.
https://aclanthology.org/2022.conll-1.25

Investigating how much pre-trained language models capture syntactic information and how well they are able to detect syntactic errors out-of-the-box. Different language models are frozen and small probes are trained on top of them to identify specific errors in text. Analysis shows that the final layers of ELECTRA and BERT capture subject-verb agreement errors best.

58. Memorisation versus Generalisation in Pre-trained Language Models

Michael Tänzer, Sebastian Ruder, Marek Rei. Imperial, Google Research. ACL 2022.
https://aclanthology.org/2022.acl-long.521

Investigating the learning abilities of language models under controlled experiments. Results show that LMs are surprisingly resilient to noise in the training data, with the resulting performance being nearly unaffected as long as the learning is stopped at an optimal time. However, the models are not able to differentiate between label noise and low-resource classes, with overall performance deteriorating just as the rare classes start to be learned. The paper proposes a model based on class prototypes to get the best of both worlds.

59. Multimodal Conversation Modelling for Topic Derailment Detection

Zhenhao Li, Marek Rei, Lucia Specia. Imperial. EMNLP 2022.
https://aclanthology.org/2022.findings-emnlp.376

Creating and releasing a dataset of Reddit threads that contain images. Posts that derail the conversation are then identified and annotated with the derailment type, such as starting a new topic, spamming or making toxic comments. A multimodal architecture for this task is also described, which encodes the post, the image and the context in order to make accurate decisions.

60. DiffuseDef: Improved Robustness to Adversarial Attacks

Zhenhao Li, Marek Rei, Lucia Specia. Imperial. ArXiv.
https://arxiv.org/abs/2407.00248

Introducing a diffusion module as a denoising step for text representations in order to defend against adversarial attacks.
The diffusion layer is trained on top of a frozen encoder to predict the randomly inserted noise in the representation.
During inference, this noise is then subtracted over multiple iterations to get a clean representation.
Experiments show state-of-the-art performance in resisting adversarial attacks.

61. Supervising Model Attention with Human Explanations for Robust Natural Language Inference

Joe Stacey, Yonatan Belinkov, Marek Rei. Imperial, Technion. AAAI 2022.
https://cdn.aaai.org/ojs/21386/21386-13-25399-1-2-20220628.pdf

Investigating how natural language explanations of particular label assignments can be used to improve model performance.
The method identifies important words in the input, either based on explanation text or highlights, and then trains the model to assign higher self-attention weights to those tokens. Experiments show that this method consistently improves performance, making the model focus more on the relevant evidence.

62. Guiding Visual Question Generation

Nihir Vedd, Zixu Wang, Marek Rei, Yishu Miao, Lucia Specia. Imperial. NAACL 2022.
https://aclanthology.org/2022.naacl-main.118.pdf

Investigates guiding the process of visual question generation, where the system generates questions about a given image.
The architecture allows the user to specify categories or objects in the image, which the system will then ask about.
These choices can also be modeled as latent variables inside the model, leading to state-of-the-art results in regular VQG.

63. Finding the Needle in a Haystack: Unsupervised Rationale Extraction from Long Text Classifiers

Kamil Bujel, Andrew Caines, Helen Yannakoudakis, Marek Rei. Imperial, Cambridge, KCL. Arxiv 2023.
https://arxiv.org/pdf/2303.07991

Designing a hierarchical transformer model that is able to explain itself by pointing to relevant tokens in the input.
It indirectly supervises the attention on a certain number of tokens at each turn, in order to make the model behave like a binary importance classifier on the token level. Previous methods that work well on shorter texts (e.g. individual sentences) are shown to not work well when applied to longer texts.

64. Distilling Robustness into Natural Language Inference Models with Domain-Targeted Augmentation

Joe Stacey, Marek Rei. Imperial. ACL 2024.
https://aclanthology.org/2024.findings-acl.132.pdf

Investigating model distillation methods that are better able to generalise to previously unseen domains. The methods either generate new unlabeled examples or upweight existing examples from the training data that are more similar to a particular domain, then use these as input during the distillation process. Experiments show that this also increases robustness towards domains which are not considered during distillation.

65. The alignment of companies’ sustainability behavior and emissions with global climate targets

Simone Cenci, Matteo Burato, Marek Rei, Maurizio Zollo. Imperial. Nature Communications 2024.

Applying NLP systems to analyse thousands of company reports and the sustainability initiatives described in those reports.
The system crawls public reports online, extracts sentences that refer to sustainability initiatives implemented by that company, and classifies them based on type, stakeholder and one of the 17 Sustainable Development Goals established by the UN.
Analysis indicates that companies are mostly investing in risk-prevention initiatives, as opposed to innovation and cooperation.

66. Predicting cell type-specific epigenomic profiles accounting for distal genetic effects

Alan E Murphy, William Beardall, Marek Rei, Mike Phuycharoen, Nathan G Skene. Imperial, Manchester. Nature Communications 2024.
https://www.biorxiv.org/content/10.1101/2024.02.15.580484

Using architectures based on pre-trained transformer language models and extending them for the domain of genome modeling.
Proposing and releasing Enformer Celltyping, which can incorporate long-range context of DNA base pairs and predict epigenetic signals while being cell type-agnostic.

67. SoK: Membership Inference Attacks on LLMs are Rushing Nowhere (and How to Fix It)

Matthieu Meeus, Igor Shilov, Shubham Jain, Manuel Faysse, Marek Rei, Yves-Alexandre de Montjoye. Imperial, Sense Street, Université Paris-Saclay. Arxiv 2024.
https://arxiv.org/pdf/2406.17975

In order to develop methods for detecting whether specific copyrighted work has been used for training a particular LLM, researchers have collected datasets of known documents that are (or are not) part of specific training sets. However, these datasets are normally collected post-hoc, leading to distribution differences. The experiments show that many detection models misleadingly react to those differences, instead of truly detecting the documents in the training data. Suggestions are provided for preventing this issue in future experiments.

68. StateAct: State Tracking and Reasoning for Acting and Planning with Large Language Models

Nikolai Rozanov, Marek Rei. Imperial. Arxiv 2024.
https://arxiv.org/abs/2410.02810

Improving LLMs for long-range planning and reasoning, for tasks which require a large number of steps to complete.
As the length of the few-shot examples and the generated trace grows longer, LLMs can lose track of what they are supposed to do and what they have done already. Periodically reminding them of the task and explicitly keeping track of their state provides consistent improvements in performance, establishing a new state-of-the-art result for the Alfworld benchmark.

ML and NLP Publications in 2021

Marek April 28, 2022 Uncategorized 5 Comments

I am sharing here the yearly paper analysis for 2021, containing statistics about ML and NLP publications from the past year. It has arrived later than in previous years - preparing it this time took quite a bit longer than intended. The new data required some manual cleaning and updating of the pipeline, which meant the analysis got delayed quite a bit. But finally, here it is now. :)

The analysis of the papers is done using a series of automated tools. These processes are not perfect so some noise and errors may occur. Some authors have also recently started releasing their papers in an obfuscated form for some reason, preventing any copying or automated extraction of the content, so these have had to be excluded from some of the analysis. But overall, each year the pipeline gets a bit better and the bugs from previous years get fixed, so it should provide a good picture of the field.

Many thanks to Chen Cecilia Liu and Jonas Pfeiffer for their help with matching organizations to countries!

This post isn't meant to glorify publishing huge amounts of papers. Quality is definitely more important than quantity. In fact, I think our field has gotten too focused on publishing rapidly in large quantities, which tends to give an advantage to quick iterative papers over thorough groundbreaking ideas. The aim of this analysis is just to provide a bit of a higher-level view of what is happening in the field, which institutions are currently the major players and which researchers have the largest groups.

Venues

Let's start by looking at the conferences themselves. The publication numbers for most of the conferences again kept going up and breaking new records. One exception seems to be ACL - this is likely due to a heavier use of the Findings format, which I didn't include in these statistics. AAAI seems to be almost levelling out, whereas NeurIPS still keeps a steady growth rate. COLING and AACL did not take place in 2021, but both EACL and NAACL did.

Organizations

The organization with the most published papers in 2021 (by a wide margin) was Google. Microsoft also manages to offer some competition in that space, with CMU, Stanford, Facebook and MIT ranking after that. Microsoft, CAS, Amazon, Tencent, Cambridge, Washington and Alibaba stand out as having quite a large proportion of papers at NLP conferences, whereas the other top organizations seem to focus mostly on ML venues.

Looking at the statistics for the whole period of 2012-2021, Google with 2170 papers has finally overtaken Microsoft with 2013 papers. CMU with 1881 publications is also represented in the top 3 cluster.

Most of the organizations have also continued to increase their yearly publication count. Google has finally broken their linear acceleration in publication numbers, but still released more papers than ever before. CMU had a plateau last year but has made up for it this year. IBM seems to be the only top organization with a slight downward trajectory, likely related to them selling large parts of Watson recently.

Authors

Next, let's look at the researchers who published the most papers in 2021. Sergey Levine (Berkeley, Google) towers above everyone else with 42 papers. Tie-Yan Liu (Microsoft), Jie Zhou (Tsinghua), Mohit Bansal (UNC) and Graham Neubig (CMU) are also among the top publishing researchers.

Looking at the whole period of 2012-2021, Sergey Levine (Berkeley, Google) is again at the top position. Having ranked 6th last year, he had a really prolific year and has overtaken everyone else. Yoshua Bengio (Montreal), Graham Neubig (CMU), Yue Zhang (Westlake), Ming Zhou (Sinovation, Microsoft) and Ting Liu (Harbin) make up the rest of the overall top publishers.

Note that a couple of names have been manually removed from the list. While trying to identify their affiliation, I found that publications from multiple researchers with identical names have been aggregated. Unfortunately I don't have the necessary technology at the moment to separate the publications in such cases.

The breakdown over the years gives an overview of when each researcher has published most. Sergey Levine has set the new overall record, by quite a margin. Mohit Bansal also increased his paper output by quite a lot, releasing 31 publications in 2021, the same number as Graham Neubig. Yoshua Bengio had a decrease in the number of papers in 2020, but that count is now back up again.

First Authors

The researchers with the largest numbers of papers are generally supervisors for many post-docs and students working in their group. In contrast, first authors are usually those who do the practical work, so it's good to analyse their counts separately.

Ramit Sawhney (Tower Research Capital, IIIT Delhi) published a very impressive 9 papers in 2021. Jason Wei (Google Research) and Tiago Pimentel (Cambridge) also stand out with 6 publications.

Looking at the year range 2012-2021, Ivan Vulić (Cambridge, PolyAI) and Zeyuan Allen-Zhu (Microsoft) have both managed to publish an impressive 24 papers as first authors. Yi Tay (Google) and Jiwei Li (Shannon.AI, Zhejiang, Stanford) are ranked next, with 23 and 22 papers, respectively. Ilias Diakonikolas (UW Madison) has an impressive 15 first-author NeurIPS papers. Haris Aziz (UNSW Sydney) has published exclusively in AAAI.

Countries

Looking at the 2021 publication counts by country highlights how much the United States publishes. China and the UK are also among the top 3. NeurIPS has the largest proportion for the US and the UK, while AAAI is the preferred venue for China.

Nearly all the top countries have continued to increase their publication counts, setting new individual records in 2021. For the US, this increase is the largest, further widening the lead.

USA

As the US publishes so much, their graph for 2021 looks very similar to the overall graph. Google, Microsoft and CMU are again at the top of the publishing counts.

China

In China, Tsinghua, CAS and Peking published the most in 2021.

UK

DeepMind, Oxford and Cambridge are towering above the rest in the UK.

Canada

Toronto really stands out in Canada, with the Vector Institute, McGill and Montreal also being well-represented.

Germany

In Germany, Tübingen and Robert Bosch published more than 40 papers.

South Korea

In South Korea, KAIST has an impressive lead over the other organizations. Seoul and Samsung also stand out.

Topic Similarity

For this section, I ran the papers through LDA and then visualised them using t-SNE.

Visualization of the organizations shows that they are mostly clustered according to geographical proximity, with companies centered in the middle.

We can do the same visualization for the authors, although these clusters are a bit more difficult to interpret.

The countries are again clustered based on geography, this time with USA in the middle of the graph.

Keywords

We can also plot the proportion of papers that contain a particular keyword and track how this has changed over time.

The word "neural" seems to have a very slight downward trend, although it can still be found in more than 80% of the papers. The proportion of words "recurrent" and "convolutional" is also decreasing, whereas "transformer" can now be found in more than 30% of the papers.

Looking at just the keyword "adversarial", we see that it is particularly popular in ICLR, with almost half of the papers using it. The count for "adversarial" seems to have peaked for ICML and NeurIPS, while it has been steadily increasing for AAAI.

The keyword "transformer" has gotten very popular in the past couple of years. It is particularly widely used in NLP papers, with over 50% of the publications containing it, but the popularity is steadily increasing also in all the ML conferences.
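The keyword statistics above boil down to a simple per-year proportion. Below is a minimal sketch of that calculation, not the actual pipeline used for this analysis: the function name `keyword_proportion_by_year`, the `(year, text)` input format and the case-insensitive substring matching are all assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def keyword_proportion_by_year(
    papers: Iterable[Tuple[int, str]], keyword: str
) -> Dict[int, float]:
    """Return, for each year, the fraction of papers whose text contains the keyword.

    `papers` is an iterable of (year, text) pairs. Matching is a
    case-insensitive substring check, which is an assumption: it means
    "transformer" also matches "Transformers".
    """
    totals: Dict[int, int] = defaultdict(int)
    hits: Dict[int, int] = defaultdict(int)
    needle = keyword.lower()
    for year, text in papers:
        totals[year] += 1
        if needle in text.lower():
            hits[year] += 1
    # Every year in `totals` has at least one paper, so no division by zero.
    return {year: hits[year] / totals[year] for year in sorted(totals)}
```

The resulting dictionary can be passed to any plotting library to draw one line per keyword, or computed separately per venue to compare conferences.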

Fun Facts

And a couple more random facts about 2021 to finish:

Most authors on a single paper: "MasakhaNER: Named Entity Recognition for African Languages" in TACL with 61 authors. This doesn't quite take the all-time record but gets an honorable second position.
The longest paper title: "A theory of high dimensional regression with arbitrary correlations between input features and target functions: sample complexity, multiple descent curves and a hierarchy of phase transitions" at ICML. This title actually takes the all-time record as well.
The shortest paper title: "Light RUMs" at ICML.

Advice for students doing research projects in ML/NLP

Marek March 14, 2022 Uncategorized 3 Comments

This is a collection of advice that I give to students doing research projects in NLP/ML/AI. It includes suggestions that I wish I had known when I myself first started, as well as lessons from supervising students in previous years.

I would recommend reading this once before starting your project, then again after about a month or two into the project. Different things will seem relevant to you.

Implementation and debugging

Don’t assume that your code is bug free – check and debug. Deep learning code can be difficult to get right. Make sure the tensors have the right shapes and dimensions. Occasionally you’ll need to think in 5 dimensions and it’s easy to get it wrong. Make sure you know what the library functions are actually doing instead of blindly applying them. Some pytorch and tensorflow functions make it easy to apply complicated operations but they also make it easy to confuse what their input and output actually contain. It is always a good idea to test sections of the code with small toy examples.
Whenever you assume something about your data or output, put an assert into the code which will check that this assumption is always true. For example, if you know how long the input list should be or that the output should contain only values in a certain range. Many times this has helped me catch issues that I didn’t expect.
While automated checks are useful, they won’t catch everything. So you also have to manually check your model output and performance. Make sure that the output looks as expected and that the results make sense. If you expect one configuration to perform better than another, but the results show something different, then investigate and figure out why that is. Sometimes you find a bug, other times you learn something new about the model or the task.
When your model isn’t learning anything, there are two diagnostic tests I recommend running: 1) Make sure that your model can memorize a small training set. For example, a model should be able to easily memorize (get 100% performance) a training set with 10 examples. If not, this indicates a problem with the code or the data. 2) If your model takes features, give the label as an input feature and train on the whole dataset. Evaluate on the dev set, also giving labels as input features. The model should get perfect performance both on the train and dev set. If not, this indicates issues with network connections or optimization.
During training, evaluate performance on the training and development sets at certain intervals. For most models, the performance on both will initially go up. Then at some point performance on the development set will start to drop while the performance on the training set will keep improving, which indicates overfitting. To deal with this, use early stopping – store a copy of the model whenever it outperforms all previous iterations on the development set. Once the training is done, load back the stored best model. To avoid unnecessary training time, you can stop the training when the best performance on the development set hasn’t improved for a certain number of epochs. This is a robust default strategy for training neural models. If you use some other strategy, then best to cite the paper where you got it from. Note: Early stopping is becoming less important with large pre-trained language models and it’s common to just train for a fixed number of 2-3 epochs. We have some experiments investigating this here.
Use version control and external backups. You don’t want to suddenly lose all your hard work. It’s also frustrating when you get your system to work, make some edits that break it and then have no idea which exact edits need to be reversed. Git is good, Github and Bitbucket both allow free private repositories.
Hyperparameter tuning is still a dark art, as specific values often interact with other parameters and it’s not computationally feasible to actually try all of their combinations. With time and experience, you will get better at predicting which parameters work best for specific architectures. Until then, start with parameter values that others have used in related work and explore how changing different values, in isolation or in small parameter groups, affects the model.
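
The early stopping strategy described above can be sketched in a few lines of framework-independent Python. This is only an illustration under stated assumptions: `train_one_epoch` and `evaluate_dev` are hypothetical callables that you would supply (one updates the model in place, the other returns a development-set score where higher is better), and `copy.deepcopy` stands in for saving a checkpoint to disk.

```python
import copy
from typing import Any, Callable, Tuple


def train_with_early_stopping(
    model: Any,
    train_one_epoch: Callable[[Any], None],  # hypothetical: updates the model in place
    evaluate_dev: Callable[[Any], float],    # hypothetical: dev score, higher is better
    max_epochs: int = 100,
    patience: int = 5,
) -> Tuple[Any, float, int]:
    """Train until the dev score has not improved for `patience` epochs.

    Returns (best_model, best_dev_score, best_epoch).
    """
    # Assumptions about the arguments are checked explicitly.
    assert max_epochs > 0, "need at least one epoch"
    assert patience > 0, "patience must be positive"

    best_model = copy.deepcopy(model)
    best_score = float("-inf")
    best_epoch = -1

    for epoch in range(max_epochs):
        train_one_epoch(model)
        score = evaluate_dev(model)
        if score > best_score:
            # New best on the development set: store a copy of the model.
            best_score, best_epoch = score, epoch
            best_model = copy.deepcopy(model)
        elif epoch - best_epoch >= patience:
            # No improvement for `patience` epochs: stop training.
            break

    return best_model, best_score, best_epoch
```

With a real deep learning toolkit you would typically save and reload the model parameters instead of deep-copying the whole object, but the control flow stays the same.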

Evaluation

Put extra care into making sure that the evaluation is implemented correctly. If you have bugs in your model, then the model usually won’t work well and you would discard it. But if you have bugs in your evaluation, you might end up reporting high results for a model that doesn’t actually work. Finding out later that your published results are all imaginary is not a situation you want to be in.
Make sure to set up separate train/dev/test sets. Use the training set for parameter optimization, the development set for choosing hyperparameters and early stopping, and the test set for final evaluation. You can use the training set in more flexible ways, but you’re not allowed to edit or subsample the test set. For example, you can leave out sentences that are too long in the training set, but you can’t do that for the test set. It needs to remain fixed in order to be comparable to other published results. Cross-validation is also an option if the dataset is too small for a fixed split. If previous researchers have split the data already, then you need to use the same splits as them; don’t create a new split without a strong reason.
Learn to handle randomness in your models. Random seeds affect various parts of the processing stream – shuffling of the data, initialisation of the weights, etc, so it’s a good idea to try to set them. However, sometimes neural components will give random results anyway. This is down to how the toolkits are implemented and we don’t have much control over them. GPU operations are highly parallel and when results finish at different times, some functions can combine in different order, which results in small rounding errors. These can compound and cause noticeably different results when running the same experiment multiple times, even when controlling the random seeds. To compensate for this, I usually run each experiment several (e.g. 10) times and report the average result. For full scientific rigor, you should also calculate and report the standard deviation and/or confidence interval of these results. Unfortunately, many people just tend to report a single result from a neural model, which in some cases can be a misleadingly high value that is later difficult to replicate.
Compare performance to published results to understand if your model is working as well as it should. Find a similar model in literature, set up your model configuration based on theirs, train on the same data and see if you are able to replicate their results. If you are working on an established task, start by getting your baseline performance to the previous state-of-the-art level. That way any improvement that you achieve will set a new state-of-the-art. It is tempting to show improvements over a weaker baseline, but pretty much anything will give an improvement if the baseline is weak enough. Trying to explain why your best system is still outperformed by a stronger baseline would be a massive headache later when trying to publish or defend your work.
Perform significance tests on your final results when possible. It adds credibility to your results. I recommend the randomization test, but any test is better than no test. If you show a significant improvement through an actual significance test, make that clear in the paper and state the test that was used. However, if you haven’t performed any significance test, then do not use phrases like “results are significantly better” when describing your work as that will appear misleading.
If your novel system has multiple novel components, then for a convincing evaluation you need to perform an ablation experiment. This means either a) removing individual components from the full system to see how much performance drops, or b) incrementally adding individual components to the simple system to see how much performance improves at each step. The goal is to measure the benefit of each component separately, to make sure that they all are actually useful. Otherwise, if you only evaluate the combined system, then it could easily be that only one modification is providing all the improvements and the others are useless or even lowering performance.
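
Two of the points above, reporting the mean and standard deviation over several runs and running a randomization test, can be illustrated with the standard library alone. This is a rough sketch rather than a reference implementation: the function names are made up for this example, and the test shown is one common paired variant (randomly swapping the two systems' scores on each test example), which is assumed here and may differ in detail from the variant you end up citing.

```python
import random
import statistics
from typing import Sequence, Tuple


def summarise_runs(scores: Sequence[float]) -> Tuple[float, float]:
    """Mean and sample standard deviation over repeated runs (e.g. 10 seeds)."""
    assert len(scores) >= 2, "need at least two runs to estimate the deviation"
    return statistics.mean(scores), statistics.stdev(scores)


def paired_randomization_test(
    scores_a: Sequence[float],
    scores_b: Sequence[float],
    num_rounds: int = 10000,
    seed: int = 0,
) -> float:
    """Two-sided paired randomization test on per-example scores.

    `scores_a[i]` and `scores_b[i]` are the scores of systems A and B on the
    same test example i (e.g. 1.0 for correct, 0.0 for incorrect).
    Returns an estimated p-value for the observed difference.
    """
    assert len(scores_a) == len(scores_b) and len(scores_a) > 0
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(sum(diffs))

    at_least_as_extreme = 0
    for _ in range(num_rounds):
        # Swapping the two systems on an example flips the sign of its difference.
        total = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(total) >= observed:
            at_least_as_extreme += 1

    # Add-one smoothing so the estimate is never exactly zero.
    return (at_least_as_extreme + 1) / (num_rounds + 1)
```

A small returned value (for example below 0.05) suggests that the difference between the two systems is unlikely to be due to chance alone. Whichever test you use, state it explicitly in the paper.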

Project planning

In the beginning of your project, start by reading the papers related to your topic. If you only know a small number of related papers, try looking at the papers that are referenced in these. Google Scholar also has a “Cited by” button at each paper that will show you other later papers that reference this particular paper. Simple keyword searches on Google and ArXiv are useful too. Get familiar with the current state-of-the-art, try to get ideas that you can integrate into your project, while also looking for existing models that you can compare against. If you don’t yet feel comfortable with the toolkits that you will be working with, focus on getting some practice with these as well.
If there is any previous work on your task, ideally the state-of-the-art, then start by reimplementing that. If they already have the code available then great, just get it running. Otherwise, write the code, train the model on the same data and evaluate the same way as they do. If you get the same results, then great, you know that your system and evaluation are working. If not, debug and improve until they are working, or find out exactly what is the cause of the difference. You don’t want to have a fundamental bug somewhere deep in your code all throughout the project.
Implement a simple version of your idea early and quickly. Don’t spend a long time designing and implementing a complex architecture in the beginning. Instead, start by implementing and training the simplest system you can imagine, which shouldn’t take longer than a week or two. Make sure you are getting reasonable results and that the evaluation is set up properly. Only then should you build on it and improve the results with more complex models. It is much easier to debug a basic system and this will give you a chance to get to know the problem and the dataset. Plus, you can always use the original model as a baseline in your work.
Make a plan, even if you don’t follow it. Making a plan is a useful exercise, as it makes you think through all the different parts of the project that need to be completed, how they fit together, how much time they might take and what might go wrong. After you have created the plan, you will have a reasonable idea in which direction you are going, hopefully sanity-checked by your supervisor. However, once you actually start work on the project, you will learn new things about the problem and get better ideas, so this plan is likely to change and improve. And that is fine. The main thing is to always have a plan for where you are heading based on current information; when you get new information, feel free to revise that plan.
Whenever possible, aim to push the state-of-the-art performance on your task. There are exceptions and chasing the highest number should not be a goal by itself, but it helps with making sure that your contribution is actually advancing the field. Otherwise, showing improvements only over a trivial baseline will imply that this problem has already been solved by other methods but you are trying to ignore or hide them. On the other hand, if you start with an existing state-of-the-art system and improve on that, then it is much easier to claim that you have made a novel and useful finding. Setting a new state-of-the-art also usually means it will be easier to publish this work as a paper.
When brainstorming ideas, think in terms of what additional information or capability is not currently available to the model but would help it make better decisions. Trying to blindly integrate the flavor-of-the-month ideas into every model is not a good strategy.
When using computation resources that have quotas, make sure to use your available time/credit sensibly. Estimate how long your experiment will take (or set a limit), multiply by the number of jobs you are starting and make sure the speed at which you are spending credits lasts you until the end of the project. If you’re just starting, better start cautiously, your code is likely to still have bugs and you might need to rerun all of these experiments anyway. If you’re finishing the project and know exactly which experiments are left to write a good report, you can take full advantage of the parallel computation. Just be sensible – every year there is someone who blows their whole GPU cluster credit in the first week of the project. You don’t want to be stuck training your neural models on CPUs.
Look at some existing dissertations to get a feel for what they should look like and what kind of content is expected. Do that in the beginning of the project, as opposed to the end.
When choosing topics to work on, I recommend choosing something that is close to your heart and something that makes a difference and drives the field forward. In the past, I have found myself solving a question only to realise that it doesn’t really need to be answered.
Reading related work is very important, but try to learn from other fields as well. Read about research in different areas of machine learning, NLP, cognitive science, neuroscience, linguistics, vision and signal analysis. By broadening your world view, you can bring new ideas to your research topic and find new exciting applications in other domains.
If your group has reading groups, take active part in them. They motivate you to read papers outside of your narrow topic, which expands your view and can give new ideas that can be adapted for your project. Reading groups also encourage interaction between the group members and this can lead to very interesting discussions. Volunteering to present allows you to choose the topic that you would like the group to discuss, lets others know what you are working on and it’s a good way to get some experience with public speaking.

Writing and Publishing

Start writing early. Don’t underestimate the difficulty of writing up a project report. Communicating your research, motivation and architectures clearly to the reader and presenting a coherent story takes effort. Plus, you might find during writing that in order to present a full picture you need to go back and run some additional experiments. Finish your write-up early enough so that someone, probably your supervisor, has time to read it through and give you feedback and you have time to implement changes based on this feedback.
Proof-read your writing. This applies to any writing – reports, dissertations and papers. Grammatical errors, spelling errors and repeated words leave an overall sloppy impression and can bias a reviewer in a negative direction even if the work itself is good.
I usually recommend aiming to publish a paper based on your project. Publishing your findings is important for the field and having a peer-reviewed publication makes your CV positively stand out in future applications to any research-related positions. It is the same work you will be doing for your report/dissertation anyway, might as well also get a paper out of it. You just have to make sure that the work is good enough to be publishable.
Many venues provide an opportunity for a rebuttal, to reply to the reviewers. These don’t result in an increased score very often, but it is still useful to write one – as an exercise for yourself and to better inform the reviewers, so that they get an informed picture of your paper. Focus on the areas that can be objectively argued, for example, if they have misunderstood something. Sometimes the reviewer just subjectively doesn’t like the paper and there’s not much that can be done about that.
Papers get rejected, it’s very common and happens to everyone. Use this as a learning experience, improve the paper based on the feedback and then resubmit. Improvements can include additional experiments that the reviewers requested, clarifying areas that confused them or changing the way that the work is presented in order to focus more on a slightly different aspect.
Most of the time you will not agree with the reviewers’ comments. You will probably think “Are they stupid? I clearly covered that issue in Section X”. You’ll have to accept the fact that reviewer quality varies quite a lot and most of them have very little time to devote to your paper. That is why it’s up to you to make the paper so clear and well-written that even the worst reviewers are able to follow it. Standing your ground and fighting with the reviewers will not get your paper accepted. Instead, think about how you can convince the reviewers or best clarify their misunderstandings. The reviewers represent a sample of your potential future readers as well, so any improvements you make also translate to your audience understanding the paper better.
In terms of the author list, the main author usually goes in the first position, the main supervisor in the last position and everyone else who substantially contributed to the paper would be added in between.
Read the submission requirements provided by the conference/journal and stick to them. These are quite strict. Being over the page limit by one line is enough to get the paper rejected even without a review. Messing with the page margins and font sizes will also disqualify a paper. Make sure to use the required LaTeX template for that particular conference – they will vary between conferences.
ArXiv is a useful place to submit a paper for extra publicity. It acts as a version-controlled repository of non-peer-reviewed research papers. Some things to keep in mind: 1) Anyone can upload any paper to ArXiv so having it there is not the same as publishing it at a conference or journal. 2) Once you upload a paper, it is there forever. You can still update the paper but the old versions will also be publicly visible. 3) Many conferences now have a mandatory anonymity period, which extends before the submission deadline, during which you can’t upload the paper anywhere, including ArXiv. That means you either need to complete and upload the paper before this period starts or you have to wait until the conference decisions have been sent out. 4) ArXiv keeps the LaTeX source of your paper and it will be available for anyone to download. You might want to clean it up a bit before submitting, removing notes and comments that are not meant for the public.
Don’t be afraid to reach out to researchers beyond your immediate research group. This can be in order to ask for clarification about a paper they have published or initiating a collaboration. They can bring in new expertise and ideas, which would benefit everybody.
When collaboratively writing a paper, Overleaf is a good platform to use. Many universities have a paid subscription to Overleaf – try adding your university email to your account to see if you get upgraded for free. It should work for Imperial and Cambridge.
Do not use Apple’s Preview or Quartz for creating your PDFs. They have been known to create PDFs that cannot be opened by Adobe Acrobat. This in turn can lead to a paper being rejected without review. Use pdflatex or download the PDF from Overleaf.
If you need to submit an Early Stage Report (this is specific to Imperial) or a background report, create this as a very early version of your thesis/dissertation. Structure the report similarly to how you would structure the thesis. Have a clear background section that covers previous work. And use the style files that you would for a thesis. You can then extend this same report for your final work later.

Interacting with your supervisor

Remember that you are in charge of your project. The role of the supervisor is to give advice and to help direct you onto a promising path. But this is your research project, so the supervisor is not supposed to plan out every experiment that needs to be run. It is up to you to plan your research (with the help of your supervisor), decide which experiments need to be run and also brainstorm new ideas. Take active charge of your project and set your own targets, instead of expecting to tick off a weekly checklist created by the supervisor.
At the same time, listen to your supervisor. If they tell you to do something specific, you should probably do it. This may be your first project of this kind, whereas they have done many of them before. They most likely know more about the field and what is necessary in a report or paper. Ideally, they will give you freedom to explore, but if they tell you to get something specific done then there is probably a good reason for it.
Meet your supervisor, ideally once a week. Use this to give an update on your work and discuss any questions you might have. The regular weekly meeting helps to keep you motivated and progressing each week.
Your supervisor is probably working on many other projects in addition to yours. When you meet up after a week, it’s good to give a bit of context and summary regarding the state of the project and which part you are currently focusing on, before jumping into the fine-grained details. This helps get the supervisor onto the right wavelength.
The supervisor is not always right. Listen to them and their arguments, as they probably have more experience than you. But if you are still not convinced then find a way to learn more or try your idea out empirically. In the worst case, you will learn something new. The supervisor wants what’s best for you, but they’re not all-knowing oracles. I’ve heard plenty of stories about supervisors in the early 2010s telling their students not to work on neural networks because ‘that field isn’t going anywhere’.
Feel free to contact your supervisor with questions. Don’t wait until the next meeting if you have a question that stops you from progressing. However, please don’t send your supervisor questions that you can answer by googling.
During the project you should become more of an expert on your specific research topic than your supervisor. The supervisor provides a starting point, a broad overview and the experience. But it is your job to go in depth and find out everything you can about the specific questions that you are trying to answer in your project – get to know the previous related work, the latest developments, the available datasets and benchmarks, etc. For at least a short moment, you should be the world-leading expert on your very narrow research question.

If you made it this far, well done. If you are just starting with your project, I imagine some of these points might not mean much to you yet. I recommend going through this list again after you have worked on the project for a month or two.

ML and NLP Publications in 2020

Marek February 11, 2021 Uncategorized 2 Comments

I ran my paper analysis pipeline once again in order to get statistics for 2020. It certainly was an unusual year. While ML and NLP conferences again had more publications than ever before, most of them needed to quickly adapt to a new remote format. Each conference took a slightly different approach as everyone was trying to figure out how to make this work. I heard especially positive comments about EMNLP 2020, regarding their smooth organisation and engaging technical solutions. Overall, I think the remote format has its pluses and minuses - while it certainly complicates networking and socialising, it also makes these conferences much more accessible to a wider audience. Hopefully the online participation options will be made available even after we are able to have in-person meetings again.

This post includes the analysis of publications from the following conferences and journals: ACL, EMNLP, NAACL, EACL, AACL, COLING, TACL, CL, CoNLL, NeurIPS, ICML, ICLR, AAAI. All the information is crawled and processed automatically from the corresponding proceedings and directly from the pdf files. Some noise likely still remains, so the graphs are more indicative of general trends as opposed to specific paper counts. Big thanks to Jonas Pfeiffer (@PfeiffJo) for adding the country annotations to all the new organisations, so that we can also run the country-level analysis below.

These statistics are not meant to imply that the quantity of publications is an important measure of a researcher. Having one groundbreaking paper is always more meaningful than churning out many forgettable pieces. But it can be a good way of getting an overview of active research groups and hopefully it can inspire new researchers to publish their own work.

Venues

Let's start by looking at the overall number of papers published at different venues. Following the trends from the previous few years, most venues continued to break records in this regard. ML and AI conferences such as NeurIPS, AAAI and ICML had particularly large increases in paper numbers. ACL and EMNLP had more modest increases, but still published more than ever before. CoNLL saw a slight decrease in numbers, possibly due to narrowing their focus back towards linguistics as opposed to engineering. AACL was introduced as a brand new conference for the Asia-Pacific Chapter of ACL.

Organizations

Looking at which organizations published the most papers in 2020, it is clear that Google manages to dominate this space. Microsoft holds a respectable second position and CMU is the top publishing university. MIT, Berkeley, DeepMind and Oxford are mostly publishing only at ML conferences. In contrast, Microsoft, Tencent, Uni. of CAS, Alibaba and Amazon have significant proportions of their publications at NLP conferences.

Looking at the statistics for the whole 2012-2020 period, the top three positions are nailbitingly close: Microsoft still leads with 1580, then Google with 1570 and CMU with 1537.

Most of the top publishing organizations also have upward trajectories through the years. Google has a bizarrely straight line going up at 45 degrees for the past few years; I can almost imagine some strategist drawing this with a ruler in their company plan.

Authors

Next, let's look at the researchers who published the most papers in 2020. Graham Neubig from CMU ranks first in this aspect, with 31 publications. Others with impressive numbers of papers include Yue Zhang (Westlake University), Sergey Levine (UC Berkeley), Ting Liu (Harbin Institute of Technology), Zhiyuan Liu (Tsinghua University) and Ming Zhou (Microsoft Research Asia).

Comparing the counts across the whole period of 2012-2020, Ming Zhou from MSR Asia has taken the overall lead. He was ranked third in 2019 and published an impressive 28 papers in 2020, while Yoshua Bengio and Chris Dyer (who were the top two until last year) have considerably scaled down their paper numbers this year.

The breakdown over the years gives an overview of when each researcher has published most. Yue Zhang and Ting Liu have both improved their overall ranking, having had a particularly successful last year in terms of publishing. While Sergey Levine still holds the overall record of most papers per year, Graham Neubig managed to publish the most in 2020.

First Authors

The most published researchers are generally supervisors for many students/post-docs performing the practical experiments. In contrast, first authors are usually those who do the legwork, implementing the actual code and writing much of the paper.

Normally, I would make a chart showing the most prolific first authors from the last year. However, this time all the top ranks were cases where two or more people were publishing under the same name. I don't currently have any technology that can automatically identify and disambiguate these authors (future research project perhaps?) so I'm skipping this graph and jumping straight to the overall statistics for 2012-2020. Zeyuan Allen-Zhu (MSR AI), Jiwei Li (Shannon.AI) and Ivan Vulić (Cambridge/PolyAI) have the most impressive publication records as first authors, with Zeyuan publishing in the ML area, and Jiwei and Ivan publishing mostly in NLP.

Countries

Separating the statistics by country highlights just how much the United States publishes. China is giving a strong effort as well, with the UK being third.

While China is definitely publishing more and more every year, it seems the US still manages to increase the lead. Overall, there is a general trend for more papers and most countries continue to increase their scientific output.