This is a collection of advice that I give to students doing research projects in NLP/ML/AI. It includes suggestions that I wish I had known when I myself first started, as well as lessons from supervising students in previous years.
I would recommend reading this once before starting your project, then again after about a month or two into the project. Different things will seem relevant to you.
Implementation and debugging
- Don’t assume that your code is bug-free – check and debug. Deep learning code can be difficult to get right. Make sure the tensors have the right shapes and dimensions. Occasionally you’ll need to think in 5 dimensions and it’s easy to get this wrong. Make sure you know what the library functions are actually doing instead of applying them blindly. Some PyTorch and TensorFlow functions make it easy to apply complicated operations, but they also make it easy to confuse what their inputs and outputs actually contain. It is always a good idea to test sections of the code with small toy examples.
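As an illustration, a toy check with hand-computable inputs can confirm both the output shape and a single known value of an operation before you trust it inside a larger model (a minimal sketch in NumPy; the shapes are arbitrary):

```python
import numpy as np

# Toy check: does batched matrix multiplication do what we think?
# Hypothetical shapes: batch=2, seq=3, hidden=4, output=5.
x = np.arange(2 * 3 * 4, dtype=float).reshape(2, 3, 4)  # (batch, seq, hidden)
W = np.ones((4, 5))                                     # (hidden, output)

y = x @ W                        # matmul broadcasts over the batch dimension
assert y.shape == (2, 3, 5), y.shape

# Verify one element by hand: with W all ones, y[0, 0, 0] is just sum(x[0, 0, :]).
assert y[0, 0, 0] == x[0, 0].sum()
```

The same pattern works for any framework function you are unsure about: build inputs small enough to compute the answer on paper, then assert it.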
- Whenever you assume something about your data or output, put an assert into the code to check that this assumption actually holds. For example, assert the expected length of an input list, or that the output contains only values in a certain range. Many times this has helped me catch issues that I didn’t expect.
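For instance, such assumptions can be collected into a small validation helper that is called on every batch (the names and the data format below are hypothetical – encode whatever you know about your own data):

```python
def validate_batch(token_ids, labels, vocab_size, num_classes):
    """Check assumptions about a batch before training on it.

    token_ids: list of token-id sequences; labels: one label per sequence.
    These particular checks are illustrative.
    """
    assert len(token_ids) == len(labels), "every sequence needs exactly one label"
    for seq in token_ids:
        assert len(seq) > 0, "empty input sequence"
        assert all(0 <= t < vocab_size for t in seq), "token id out of range"
    assert all(0 <= y < num_classes for y in labels), "label out of range"

validate_batch([[4, 2, 7], [1]], [0, 1], vocab_size=10, num_classes=2)
```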
- While automated checks are useful, they won’t catch everything. So you also have to manually check your model output and performance. Make sure that the output looks as expected and that the results make sense. If you expect one configuration to perform better than another, but the results show something different, then investigate and figure out why that is. Sometimes you find a bug, other times you learn something new about the model or the task.
- When your model isn’t learning anything, there are two diagnostic tests I recommend running: 1) Make sure that your model can memorize a small training set. For example, a model should be able to easily memorize (get 100% performance) a training set with 10 examples. If not, this indicates a problem with the code or the data. 2) If your model takes features, give the label as an input feature and train on the whole dataset. Evaluate on the dev set, also giving labels as input features. The model should get perfect performance both on the train and dev set. If not, this indicates issues with network connections or optimization.
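Test 1 can be sketched as follows, with a small NumPy logistic regression standing in for an arbitrary model (the data and hyperparameters are made up; the point is only the pattern of the check):

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny stand-in for "your model": logistic regression trained by gradient
# descent. The data is deliberately easy: 10 examples, features shifted by class.
X = rng.normal(size=(10, 5))
X[:5] += 3.0                              # class 1 examples
X[5:] -= 3.0                              # class 0 examples
y = np.array([1.0] * 5 + [0.0] * 5)

w = np.zeros(5)
b = 0.0
for _ in range(2000):
    scores = np.clip(X @ w + b, -30.0, 30.0)
    p = 1.0 / (1.0 + np.exp(-scores))     # predicted probabilities
    grad = p - y                          # gradient of the log loss w.r.t. scores
    w -= 0.5 * (X.T @ grad) / len(y)
    b -= 0.5 * grad.mean()

train_acc = float((((X @ w + b) > 0) == (y > 0.5)).mean())
# A working setup should memorize all 10 examples, i.e. train_acc == 1.0.
```

If a real model fails this check, the fault is in the code or the data pipeline, not in the difficulty of the task.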
- During training, evaluate performance on the training and development sets at certain intervals. For most models, the performance on both will initially go up. Then at some point performance on the development set will start to drop while the performance on the training set will keep improving, which indicates overfitting. To deal with this, use early stopping – store a copy of the model whenever it outperforms all previous iterations on the development set. Once the training is done, load back the stored best model. To avoid unnecessary training time, you can stop the training when the best performance on the development set hasn’t improved for a certain number of epochs. This is a robust default strategy for training neural models. If you use some other strategy, then best to cite the paper where you got it from. Note: Early stopping is becoming less important with large pre-trained language models and it’s common to just train for a fixed number of 2-3 epochs. We have some experiments investigating this here.
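A generic version of this loop might look as follows (a sketch; `train_one_epoch` and `evaluate_on_dev` are placeholders for your own training and evaluation functions):

```python
import copy

def train_with_early_stopping(model, train_one_epoch, evaluate_on_dev,
                              max_epochs=100, patience=5):
    """Early stopping: keep the best model by dev score, stop after
    `patience` epochs without improvement. Higher score = better."""
    best_score = float("-inf")
    best_model = copy.deepcopy(model)
    epochs_since_best = 0
    for epoch in range(max_epochs):
        train_one_epoch(model)
        score = evaluate_on_dev(model)
        if score > best_score:
            best_score = score
            best_model = copy.deepcopy(model)   # snapshot the best model so far
            epochs_since_best = 0
        else:
            epochs_since_best += 1
            if epochs_since_best >= patience:
                break                           # dev score stopped improving
    return best_model, best_score
```

In a real setup, the snapshot would typically be saved to disk rather than kept in memory.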
- Use version control and external backups. You don’t want to suddenly lose all your hard work. It’s also frustrating when you get your system to work, make some edits that break it, and then have no idea which exact edits need to be reversed. Git is good; GitHub and Bitbucket both offer free private repositories.
- Hyperparameter tuning is still a dark art, as specific values often interact with other parameters and it’s not computationally feasible to actually try all of their combinations. With time and experience, you will get better at predicting which parameters work best for specific architectures. Until then, start with parameter values that others have used in related work and explore how changing different values, in isolation or in small parameter groups, affects the model.
- Put extra care into making sure that the evaluation is implemented correctly. If you have bugs in your model, then the model usually won’t work well and you would discard it. But if you have bugs in your evaluation, you might end up reporting high results for a model that doesn’t actually work. Finding out later that your published results are all imaginary is not a situation you want to be in.
- Make sure to set up separate train/dev/test sets. Use the training set for parameter optimization, the development set for choosing hyperparameters and early stopping, and the test set for final evaluation. You can use the training set in more flexible ways, but you’re not allowed to edit or subsample the test set. For example, you can leave out sentences that are too long in the training set, but you can’t do that for the test set. It needs to remain fixed in order to be comparable to other published results. Cross-validation is also an option if the dataset is too small for a fixed split. If previous researchers have split the data already, then you need to use the same splits as them; don’t create a new split without a strong reason.
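When no standard split exists, a fixed-seed split function along these lines keeps the split reproducible (a sketch; the fractions are arbitrary):

```python
import random

def split_data(examples, dev_fraction=0.1, test_fraction=0.1, seed=42):
    """Shuffle once with a fixed seed and split into train/dev/test.

    Only do this when no standard split exists - if previous work already
    defined splits, reuse those instead.
    """
    examples = list(examples)
    random.Random(seed).shuffle(examples)   # fixed seed: split must be reproducible
    n = len(examples)
    n_dev = int(n * dev_fraction)
    n_test = int(n * test_fraction)
    dev = examples[:n_dev]
    test = examples[n_dev:n_dev + n_test]
    train = examples[n_dev + n_test:]
    return train, dev, test
```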
- Learn to handle randomness in your models. Random seeds affect various parts of the processing stream – shuffling of the data, initialisation of the weights, etc., so it’s a good idea to set them. However, sometimes neural components will give random results anyway. This is down to how the toolkits are implemented and we don’t have much control over it. GPU operations are highly parallel and, when computations finish at different times, some functions can combine their results in a different order, which introduces small rounding errors. These can compound and cause noticeably different results when running the same experiment multiple times, even when controlling the random seeds. To compensate, I usually run each experiment several (e.g. 10) times and report the average result. For full scientific rigour, you should also calculate and report the standard deviation and/or confidence interval of these results. Unfortunately, many people tend to report a single result from a neural model, which in some cases can be a misleadingly high value that is later difficult to replicate.
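The mean and sample standard deviation over repeated runs are easy to compute; the scores below are made-up numbers for illustration:

```python
import math

def summarize_runs(scores):
    """Mean and sample standard deviation over repeated runs of one experiment."""
    mean = sum(scores) / len(scores)
    variance = sum((s - mean) ** 2 for s in scores) / (len(scores) - 1)
    return mean, math.sqrt(variance)

# e.g. dev-set accuracies from 5 runs that differ only in the random seed
mean, std = summarize_runs([0.831, 0.825, 0.840, 0.828, 0.836])
```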
- Compare performance to published results to understand whether your model is working as well as it should. Find a similar model in the literature, set up your model configuration based on theirs, train on the same data and see if you are able to replicate their results. If you are working on an established task, start by getting your baseline performance to the previous state-of-the-art level. That way, any improvement that you achieve will set a new state-of-the-art. It is tempting to show improvements over a weaker baseline, but pretty much anything will give an improvement if the baseline is weak enough. Trying to explain why your best system is still outperformed by a stronger baseline will be a massive headache later when trying to publish or defend your work.
- Perform significance tests on your final results when possible. It adds credibility to your results. I recommend the randomization test, but any test is better than no test. If you show a significant improvement through an actual significance test, make that clear in the paper and state the test that was used. However, if you haven’t performed any significance test, then do not use phrases like “results are significantly better” when describing your work as that will appear misleading.
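A paired approximate randomization test can be implemented in a few lines (a sketch; `scores_a` and `scores_b` are per-example scores, e.g. 1/0 correctness, of two systems on the same test set):

```python
import random

def randomization_test(scores_a, scores_b, trials=10000, seed=0):
    """Approximate (Monte Carlo) paired randomization test.

    Returns an estimated p-value: the probability of seeing a difference at
    least as large as the observed one if the two systems were interchangeable
    on each example.
    """
    assert len(scores_a) == len(scores_b)
    rng = random.Random(seed)
    observed = abs(sum(scores_a) - sum(scores_b))
    count = 0
    for _ in range(trials):
        diff = 0
        for a, b in zip(scores_a, scores_b):
            if rng.random() < 0.5:   # randomly swap this pair between systems
                a, b = b, a
            diff += a - b
        if abs(diff) >= observed:
            count += 1
    return (count + 1) / (trials + 1)   # add-one smoothing for a valid p-value
```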
- If your novel system has multiple novel components, then for a convincing evaluation you need to perform an ablation experiment. This means either a) removing individual components from the full system to see how much performance drops, or b) incrementally adding individual components to the simple system to see how much performance improves at each step. The goal is to measure the benefit of each component separately, to make sure that they all are actually useful. Otherwise, if you only evaluate the combined system, then it could easily be that only one modification is providing all the improvements and the others are useless or even lowering performance.
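Variant (a) can be organized as a simple loop over components (a sketch; `evaluate` stands for whatever trains and scores a system with the given components enabled):

```python
def ablation_results(evaluate, components):
    """Score the full system and every leave-one-out configuration.

    evaluate: function taking a set of enabled components, returning a score.
    """
    full = frozenset(components)
    results = {"full": evaluate(full)}
    for c in components:
        results[f"-{c}"] = evaluate(full - {c})   # drop one component at a time
    return results
```

Each `-component` entry then shows how much the score drops without that component.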
Doing research
- At the beginning of your project, start by reading the papers related to your topic. If you only know a small number of related papers, try looking at the papers that these reference. Google Scholar also has a “Cited by” button for each paper, which shows later papers that reference it. Simple keyword searches on Google and arXiv are useful too. Get familiar with the current state-of-the-art, try to get ideas that you can integrate into your project, and look out for existing models that you can compare against. If you don’t yet feel comfortable with the toolkits that you will be working with, focus on getting some practice with these as well.
- If there is any previous work on your task, ideally the state-of-the-art, then start by reimplementing that. If they already have code available then great, just get it running. Otherwise, write the code, train the model on the same data and evaluate the same way as they do. If you get the same results, great – you know that your system and evaluation are working. If not, debug and improve until they are, or find out exactly what causes the difference. You don’t want a fundamental bug lurking deep in your code throughout the project.
- Implement a simple version of your idea early and quickly. Don’t spend a long time designing and implementing a complex architecture in the beginning. Instead, start by implementing and training the simplest system you can imagine, which shouldn’t take longer than a week or two. Make sure you are getting reasonable results and that the evaluation is set up properly. Only then should you build on it and improve the results with more complex models. It is much easier to debug a basic system, and this will give you a chance to get to know the problem and the dataset. Plus, you can always use the original model as a baseline in your work.
- Make a plan, even if you don’t follow it. Making a plan is a useful exercise, as it makes you think through all the different parts of the project that need to be completed, how they fit together, how much time they might take and what might go wrong. After you have created the plan, you will have a reasonable idea in which direction you are going, hopefully sanity-checked by your supervisor. However, once you actually start work on the project, you will learn new things about the problem and get better ideas, so this plan is likely to change and improve. And that is fine. The main thing is to always have a plan for where you are heading based on current information; when you get new information, feel free to revise that plan.
- Whenever possible, aim to push the state-of-the-art performance on your task. There are exceptions and chasing the highest number should not be a goal by itself, but it helps with making sure that your contribution is actually advancing the field. Otherwise, showing improvements only over a trivial baseline will imply that this problem has already been solved by other methods but you are trying to ignore or hide them. On the other hand, if you start with an existing state-of-the-art system and improve on that, then it is much easier to claim that you have made a novel and useful finding. Setting a new state-of-the-art also usually means it will be easier to publish this work as a paper.
- When brainstorming ideas, think in terms of what additional information or capability is not currently available to the model but would help it make better decisions. Trying to blindly integrate the flavor-of-the-month ideas into every model is not a good strategy.
- When using computation resources that have quotas, make sure to use your available time/credit sensibly. Estimate how long each experiment will take (or set a limit), multiply by the number of jobs you are starting, and make sure that at this rate your credit lasts until the end of the project. If you’re just starting, begin cautiously – your code is likely to still have bugs and you might need to rerun all of these experiments anyway. If you’re finishing the project and know exactly which experiments are left to write a good report, you can take full advantage of parallel computation. Just be sensible – every year there is someone who blows their whole GPU cluster credit in the first week of the project. You don’t want to be stuck training your neural models on CPUs.
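The estimate itself is simple arithmetic; for example (made-up numbers):

```python
def planned_gpu_hours(hours_per_run, num_configs, runs_per_config):
    """Back-of-the-envelope cost of a batch of experiments."""
    return hours_per_run * num_configs * runs_per_config

# e.g. 6-hour runs, 4 configurations, 3 random seeds each
total = planned_gpu_hours(6, 4, 3)   # 72 GPU-hours
```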
- Look at some existing dissertations to get a feel for what they should look like and what kind of content is expected. Do that at the beginning of the project, as opposed to the end.
- When choosing topics to work on, I recommend choosing something that is close to your heart and that makes a difference, driving the field forward. In the past, I have found myself answering a question only to realise that it didn’t really need to be answered.
- Reading related work is very important, but try to learn from other fields as well. Read about research in different areas of machine learning, NLP, cognitive science, neuroscience, linguistics, vision and signal analysis. By broadening your world view, you can bring new ideas to your research topic and find new exciting applications in other domains.
- If your group has reading groups, take active part in them. They motivate you to read papers outside of your narrow topic, which expands your view and can give new ideas that can be adapted for your project. Reading groups also encourage interaction between the group members and this can lead to very interesting discussions. Volunteering to present allows you to choose the topic that you would like the group to discuss, lets others know what you are working on and it’s a good way to get some experience with public speaking.
Writing and Publishing
- Start writing early. Don’t underestimate the difficulty of writing up a project report. Communicating your research, motivation and architectures clearly to the reader and presenting a coherent story takes effort. Plus, you might find during writing that, in order to present a full picture, you need to go back and run some additional experiments. Finish your write-up early enough so that someone, probably your supervisor, has time to read it through and give you feedback, and so that you have time to implement changes based on this feedback.
- Proof-read your writing. This applies to any writing – reports, dissertations and papers. Grammatical errors, spelling errors and repeated words leave an overall sloppy impression and can bias a reviewer in a negative direction even if the work itself is good.
- I usually recommend aiming to publish a paper based on your project. Publishing your findings is important for the field, and having a peer-reviewed publication makes your CV stand out positively in future applications to any research-related positions. It is the same work you will be doing for your report/dissertation anyway, so you might as well get a paper out of it. You just have to make sure that the work is good enough to be publishable.
- Many venues provide an opportunity for a rebuttal, to reply to the reviewers. These don’t result in an increased score very often, but it is still useful to write one – as an exercise for yourself and to better inform the reviewers, so that they get an informed picture of your paper. Focus on the areas that can be objectively argued, for example, if they have misunderstood something. Sometimes the reviewer just subjectively doesn’t like the paper and there’s not much that can be done about that.
- Papers get rejected, it’s very common and happens to everyone. Use this as a learning experience, improve the paper based on the feedback and then resubmit. Improvements can include additional experiments that the reviewers requested, clarifying areas that confused them or changing the way that the work is presented in order to focus more on a slightly different aspect.
- Most of the time you will not agree with the reviewers’ comments. You will probably think “Are they stupid? I clearly covered that issue in Section X”. You’ll have to accept the fact that reviewer quality varies quite a lot and most of them have very little time to devote to your paper. That is why it’s up to you to make the paper so clear and well-written that even the worst reviewers are able to follow it. Standing your ground and fighting with the reviewers will not get your paper accepted. Instead, think about how you can convince the reviewers or best clarify their misunderstandings. The reviewers represent a sample of your potential future readers as well, so any improvements you make also translate to your audience understanding the paper better.
- In terms of the author list, the main author usually goes in the first position, the main supervisor in the last position and everyone else who substantially contributed to the paper would be added in between.
- Read the submission requirements provided by the conference/journal and stick to them. These are quite strict. Being over the page limit by one line is enough to get a paper rejected without review. Messing with the page margins and font sizes will also disqualify a paper. Make sure to use the required LaTeX template for that particular conference – templates vary between conferences.
- arXiv is a useful place to submit a paper for extra publicity. It acts as a version-controlled repository of non-peer-reviewed research papers. Some things to keep in mind: 1) Anyone can upload a paper to arXiv, so having it there is not the same as publishing it at a conference or journal. 2) Once you upload a paper, it is there forever. You can still update the paper, but the old versions will also remain publicly visible. 3) Many conferences now have a mandatory anonymity period, starting before the submission deadline, during which you can’t upload the paper anywhere, including arXiv. That means you either need to complete and upload the paper before this period starts, or you have to wait until the conference decisions have been sent out. 4) arXiv keeps the LaTeX source of your paper and makes it available for anyone to download. You might want to clean it up a bit before submitting, removing notes and comments that are not meant for the public.
- Don’t be afraid to reach out to researchers beyond your immediate research group, whether to ask for clarification about a paper they have published or to initiate a collaboration. They can bring in new expertise and ideas, which benefits everybody.
- When collaboratively writing a paper, Overleaf is a good platform to use. Many universities have a paid subscription to Overleaf – try adding your university email to your account to see if you get upgraded for free. It should work for Imperial and Cambridge.
- Do not use Apple’s Preview or Quartz for creating your PDFs. They have been known to create PDFs that cannot be opened by Adobe Acrobat, which in turn can lead to a paper being rejected without review. Use pdflatex or download the PDF from Overleaf.
- If you need to submit an Early Stage Report (this is specific to Imperial) or a background report, create this as a very early version of your thesis/dissertation. Structure the report similar to how you would structure the thesis. Have a clear background section that covers previous work. And use the style files that you would for a thesis. You can then extend this same report for your final work later.
Interacting with your supervisor
- Remember that you are in charge of your project. The role of the supervisor is to give advice and to help direct you onto a promising path. But this is your research project, so the supervisor is not supposed to plan out every experiment that needs to be run. It is up to you to plan your research (with the help of your supervisor), decide which experiments need to be run and also brainstorm new ideas. Take active charge of your project and set your own targets, instead of expecting to tick off a weekly checklist created by the supervisor.
- At the same time, listen to your supervisor. If they tell you to do something specific, you should probably do it. This can be your first similar project whereas they have done many of them before. They most likely know more about the field and what is necessary in a report or paper. Ideally, they will give you freedom to explore, but if they tell you to get something specific done then there is probably a good reason for it.
- Meet your supervisor, ideally once a week. Use this to give an update on your work and discuss any questions you might have. The regular weekly meeting helps to keep you motivated and progressing each week.
- Your supervisor is probably working with many more projects in addition to yours. When you meet up after a week, it’s good to give a bit of context and summary regarding the state of the project and which part you are currently focusing on, before jumping into the fine-grained details. This helps get the supervisor onto the right wavelength.
- The supervisor is not always right. Listen to them and their arguments, as they probably have more experience than you. But if you are still not convinced, find a way to learn more or try your idea out empirically. In the worst case, you will learn something new. The supervisor wants what’s best for you, but they’re not an all-knowing oracle. I’ve heard plenty of stories about supervisors in the early 2010s telling their students not to work on neural networks because ‘that field isn’t going anywhere’.
- Feel free to contact your supervisor with questions. Don’t wait until the next meeting if you have a question that stops you from progressing. However, please don’t send your supervisor questions that you can answer by googling.
- During the project you should become more of an expert on your specific research topic than your supervisor. The supervisor provides a starting point, a broad overview and the experience. But it is your job to go in depth and find out everything you can about the specific questions that you are trying to answer in your project – get to know the previous related work, the latest developments, the available datasets and benchmarks, etc. For at least a short moment, you should be the world-leading expert on your very narrow research question.
If you made it this far, well done. If you are just starting with your project, I imagine some of these points might not mean much to you yet. I recommend going through this list again after you have worked on the project for a month or two.