This page is for the course on Machine Learning for Language Modelling that I am teaching at the University of Tartu.

Time and location

              Date            Time   Room
  Lecture 1   6. April 2015   14:15  J. Liivi 2 - 612
  Lecture 2   7. April 2015   14:15  J. Liivi 2 - 611
  Lecture 3   8. April 2015   14:15  J. Liivi 2 - 202
  Lecture 4   9. April 2015   14:15  J. Liivi 2 - 202
  Practical   10. April 2015  14:15  J. Liivi 2 - 004

Lecture slides

Homework

Deadline: 5. May 2015, 23:59 Estonian time

For your submission of both language models:

The neural network skeleton code is available on GitHub: neurallm-exercise. More information is in the README file.

Dataset

I have prepared and preprocessed a dataset for you to use when developing and training your language models. It is created from Wikipedia text, separated into sentences, tokenised, and lowercased. The data is split into training, development, and test sets. The training set contains approximately 10M words. You can process these files as you wish, create separate subsets, etc.
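
A quick sanity check on the sizes can be done with wc; the file names below are placeholders for whatever the archive actually contains:

    # count lines and words in each set; the training set should have ~10M words
    wc -lw train.txt dev.txt test.txt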

The "unk100" files have been preprocessed so that all words that occur fewer than 100 times in the training data are replaced by a special UNK token.
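
As an illustration, this kind of replacement can be done with a two-pass awk script along the following lines; the file names are placeholders, and the actual preprocessing may have differed in detail:

    # pass 1: count how often each token occurs in the training data
    # pass 2: replace tokens seen fewer than 100 times with UNK
    awk 'NR == FNR { for (i = 1; i <= NF; i++) count[$i]++; next }
         { for (i = 1; i <= NF; i++) { if (count[$i] < 100) $i = "UNK" }; print }' \
        train.txt train.txt > train.unk100.txt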

The "topNK" files contain the first N thousand lines of the full file. Training on the full dataset is likely too time-consuming for the neural network language model, so I have made some subsets.
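
For instance, a "top100K" subset (the first 100 thousand lines) could be recreated like this, again with placeholder file names:

    # take the first 100,000 lines of the full training file
    head -n 100000 train.txt > train.top100K.txt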

Download the dataset here: lm-dataset.tar.gz

Practical

We'll train two neural network models. You can use the dataset described in the previous section as training data.

First, we'll look at the word2vec toolkit: http://code.google.com/p/word2vec/

  1. Download and compile the code.
  2. Train word vectors: ./word2vec -train trainingfile -output vectors.bin -cbow 1 -size 200 -window 8 -negative 25 -hs 0 -sample 1e-4 -threads 20 -binary 1 -iter 15
  3. Run ./distance vectors.bin and enter words to find other words with the most similar vectors. For example, try these words as input: island, university, france, night (a scripted variant is sketched right after this list).
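
The distance tool is interactive, but it reads its queries from standard input, so piping a list of words should work as well; entering EXIT makes the tool quit cleanly. A minimal sketch, assuming vectors.bin from step 2:

    # query several words non-interactively; EXIT terminates the session
    printf 'island\nuniversity\nfrance\nnight\nEXIT\n' | ./distance vectors.bin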

Second, we'll try out RNNLM, the recurrent neural network language model toolkit: http://rnnlm.org/

  1. Download the toolkit and compile it. You might need to set "CC = g++" (or whichever compiler you have installed) in the makefile.
  2. Train a model: ./rnnlm -train trainingfile -valid validationfile -rnnlm model -hidden 15 -class 100 -bptt 4
  3. Test the model and measure its perplexity (PPL): ./rnnlm -rnnlm model -test testfile
  4. Change some parameters and see how the perplexity on the test set changes. For example, try a larger hidden layer (it is only 15 at the moment). To see the full list of settings, run ./rnnlm without parameters. Delete the model file before each run, or use a different model name; otherwise the toolkit continues training the existing model instead of starting fresh (see the sketch below).
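
As a sketch, such a sweep over hidden-layer sizes could look like this; trainingfile, validationfile and testfile are the same placeholders as in the commands above:

    # try increasingly large hidden layers; remove the saved model first
    # so that each run starts fresh instead of continuing training
    for h in 15 30 60 120; do
        rm -f model
        ./rnnlm -train trainingfile -valid validationfile -rnnlm model -hidden $h -class 100 -bptt 4
        ./rnnlm -rnnlm model -test testfile
    done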