Vectorsets

This is a collection of lexical vector sets. Each word in the vocabulary is represented as a real-valued vector, which can be used for various word similarity applications. The vectors are built using different techniques and therefore have varying properties. The following vector sets are currently available:

1. Window

The vectors are created by counting word co-occurrences in a fixed context window. Every word that occurs within a window of three words before or after is counted as a feature for the target word. Pointwise mutual information is then used for weighting.
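As a rough illustration of the counting and weighting described above, the following Python sketch builds window-based vectors for a toy token list. The function name, the natural logarithm, the lack of smoothing, the lack of sentence-boundary handling, and keeping every PMI value (including negative ones) are assumptions made for this example; the exact settings used to build the released file are not stated here.

```python
import math
from collections import Counter

def window_pmi_vectors(tokens, window=3):
    """Count co-occurrences within +/- `window` tokens, then weight them by PMI."""
    pair_counts = Counter()
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pair_counts[(target, tokens[j])] += 1

    # Marginal counts over all (target, feature) pairs.
    total = sum(pair_counts.values())
    target_counts = Counter()
    feature_counts = Counter()
    for (target, feature), count in pair_counts.items():
        target_counts[target] += count
        feature_counts[feature] += count

    vectors = {}
    for (target, feature), count in pair_counts.items():
        # PMI = log( P(target, feature) / (P(target) * P(feature)) )
        pmi = math.log(count * total / (target_counts[target] * feature_counts[feature]))
        vectors.setdefault(target, {})[feature] = pmi
    return vectors

# Toy input, not taken from the corpus.
tokens = "the cat sat on the mat".split()
vectors = window_pmi_vectors(tokens)
```

Each entry of `vectors` maps a target word to a dictionary of context words and their PMI weights, which corresponds to one sparse vector per word.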

The file is in the sparse vector format. Each line contains the vector for one word. The first token is the word, followed by tokens of the form <feature-id>:<feature-value>.
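A minimal Python sketch for reading this sparse format is given below. It assumes that tokens are separated by whitespace, that the file is UTF-8 encoded, and that feature ids can be kept as plain strings; the function names and the sample line are made up for illustration.

```python
import gzip

def parse_sparse_line(line):
    """Parse '<word> <feature-id>:<feature-value> ...' into (word, {id: value})."""
    tokens = line.split()
    word = tokens[0]
    features = {}
    for token in tokens[1:]:
        # Split on the last colon, in case a feature id itself contains ':'.
        feature_id, _, value = token.rpartition(":")
        features[feature_id] = float(value)
    return word, features

def load_sparse_vectors(path):
    """Read a gzipped sparse vector file into {word: {feature_id: value}}."""
    vectors = {}
    with gzip.open(path, "rt", encoding="utf-8") as handle:  # encoding is an assumption
        for line in handle:
            if line.strip():
                word, features = parse_sparse_line(line)
                vectors[word] = features
    return vectors

# Hypothetical sample line; the ids and values are invented.
word, features = parse_sparse_line("guitar 17:2.5 204:-0.31")
```

Keeping each vector as a dictionary avoids allocating a dense array over the full feature space, which suits similarity measures that only touch non-zero features.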

2. Word2Vec

Representations created using the word2vec toolkit. The tool is based on a feedforward neural network language model, with modifications to make representation learning more efficient (Mikolov et al., 2013a). We make use of the skip-gram model, which takes each word in a sequence as an input to a log-linear classifier with a continuous projection layer, and predicts words within a certain range before and after the input word. The window size was set to 5 and vectors were trained with both 100 and 500 dimensions.

The file is in the dense vector format used by word2vec. The first line contains two tokens: the number of words and the length of the vectors. After that, there is one line per word. The first token is the word, followed by the feature values.
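The dense format can be read with a few lines of Python. The sketch below assumes whitespace-separated text and works on any open text handle, so it can be combined with gzip.open(path, "rt") for the downloaded files; the function name and the sample data are invented for the example.

```python
import io

def read_dense_vectors(handle):
    """Read the text word2vec format: a header line, then one word per line."""
    header = handle.readline().split()
    num_words, dim = int(header[0]), int(header[1])
    vectors = {}
    for line in handle:
        tokens = line.split()
        if not tokens:
            continue
        values = [float(token) for token in tokens[1:]]
        if len(values) != dim:
            raise ValueError(f"expected {dim} values for {tokens[0]!r}, got {len(values)}")
        vectors[tokens[0]] = values
    return num_words, dim, vectors

# Invented sample with 2 words and 3 dimensions.
sample = io.StringIO("2 3\nplay 0.1 -0.2 0.3\nguitar 0.0 0.5 -1.0\n")
num_words, dim, vectors = read_dense_vectors(sample)
```

The header values are returned alongside the vectors so that a caller can check them against the number of words actually read.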

3. Dependencies

Vector representations are created by using dependency relations from a parser as features. Every incoming and outgoing dependency relation is counted as a feature, together with the connected term. For example, given the dependency relation (play, dobj, guitar), the tuple (dobj, guitar) is extracted as a feature for play, and (!dobj, play) as a feature for guitar. We use only features that occur more than once in the dataset, and weight them using pointwise mutual information to construct feature vectors for every term. Features with negative weights were retained, as they proved to be beneficial for some similarity measures.
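The feature extraction step can be sketched in Python as follows. The triples are assumed to arrive as (head, relation, dependent) tuples, matching the (play, dobj, guitar) example above. The function names are hypothetical, and the frequency filter reflects one reading of "occur more than once" (total count of the feature across all terms); the original implementation may differ.

```python
from collections import Counter

def dependency_features(triples):
    """Count (term, feature) pairs from (head, relation, dependent) triples."""
    counts = Counter()
    for head, relation, dependent in triples:
        # Outgoing relation: (relation, dependent) is a feature of the head.
        counts[(head, (relation, dependent))] += 1
        # Incoming relation: marked with '!' and attached to the dependent.
        counts[(dependent, ("!" + relation, head))] += 1
    return counts

def frequent_features(counts, min_count=2):
    """Return features whose total count is at least `min_count` (assumed reading)."""
    feature_totals = Counter()
    for (_, feature), count in counts.items():
        feature_totals[feature] += count
    return {feature for feature, count in feature_totals.items() if count >= min_count}

# Invented triples for illustration.
triples = [
    ("play", "dobj", "guitar"),
    ("strum", "dobj", "guitar"),
    ("play", "dobj", "piano"),
]
counts = dependency_features(triples)
kept = frequent_features(counts)
```

The surviving counts would then be weighted with pointwise mutual information, keeping negative values, to give the final sparse vectors.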

The file uses the same sparse vector format as the window-based vectors.

Training

The vector sets were all trained on 112M words from the British National Corpus, with preprocessing steps for lowercasing and lemmatising. Any numbers were grouped and substituted by more generic tokens. For constructing the dependency-based vector representations, we used the parsed version of the BNC created by Andersen et al. (2008) with the RASP toolkit (Briscoe et al., 2006). The vocabulary sizes of these vector sets differ: word2vec discards words that occur fewer than 5 times, and some words do not occur in dependency graphs.

Referencing

When you use any of these vector sets, please reference the following paper:

Looking for hyponyms in vector space. Marek Rei and Ted Briscoe. In Proceedings of the Eighteenth Conference on Computational Natural Language Learning (CoNLL-14), Baltimore, Maryland, United States, 2014.

Download

These vector sets are available for download. As the files are rather large, please download them once and keep a local copy.

Dependency-based vectors:
https://s3-eu-west-1.amazonaws.com/vectorsets/vectors_dep.txt.gz

Word2vec vectors (size 100):
https://s3-eu-west-1.amazonaws.com/vectorsets/vectors_nnet_100.txt.gz

Word2vec vectors (size 500):
https://s3-eu-west-1.amazonaws.com/vectorsets/vectors_nnet_500.txt.gz

Window-based vectors:
https://s3-eu-west-1.amazonaws.com/vectorsets/vectors_window_3.txt.gz
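
One way to download a file once and reuse the local copy is sketched below in Python. The helper names and the choice to store files in the current directory are assumptions for illustration, and the download call is left as a comment so that running the snippet does not fetch anything.

```python
import os
import urllib.parse
import urllib.request

def local_name(url):
    """Derive a local file name from the last path component of the URL."""
    return os.path.basename(urllib.parse.urlparse(url).path)

def fetch_once(url, directory="."):
    """Download `url` into `directory` unless a local copy already exists."""
    path = os.path.join(directory, local_name(url))
    if not os.path.exists(path):
        urllib.request.urlretrieve(url, path)
    return path

# Example call (not executed here):
# path = fetch_once("https://s3-eu-west-1.amazonaws.com/vectorsets/vectors_dep.txt.gz")
```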