Weighted Embeddings

This is a system for learning word weights, optimised for sentence-level vector similarity.

A popular method of constructing sentence vectors is to add together the word embeddings for all the words in the sentence. We show that this simple model can be improved by learning a unique scalar weight for every word in the vocabulary. These weights are trained on a corpus of plain text, with the objective that nearby sentences should have high similarity and randomly paired sentences should have low similarity. Applying the resulting weights in an additive model improves performance on the task of topic relevance detection.
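
The weighted additive model described above can be sketched as follows. This is a minimal illustration, not the training code from the repository: the function names, the toy two-dimensional embeddings, the toy weights, the default weight of 1.0 for words without a learned weight, and the use of cosine as the similarity measure are all assumptions made for the example.

```python
import math


def sentence_vector(tokens, embeddings, weights, default_weight=1.0):
    """Return the weighted sum of the word embeddings for `tokens`.

    Words without an embedding are skipped. Words without a learned
    weight fall back to `default_weight` (an assumption of this sketch).
    """
    dim = len(next(iter(embeddings.values())))
    total = [0.0] * dim
    for token in tokens:
        vec = embeddings.get(token)
        if vec is None:
            continue
        w = weights.get(token, default_weight)
        for i, x in enumerate(vec):
            total[i] += w * x
    return total


def cosine(a, b):
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy data, made up for illustration only.
embeddings = {"the": [1.0, 0.0], "cat": [0.0, 1.0], "dog": [0.1, 0.9]}
weights = {"the": 0.1, "cat": 1.0, "dog": 1.0}

s1 = sentence_vector(["the", "cat"], embeddings, weights)
s2 = sentence_vector(["the", "dog"], embeddings, weights)
similarity = cosine(s1, s2)
```

Setting every weight to 1.0 recovers the plain additive model, so the two variants can be compared by swapping the `weights` dictionary.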

You can find more details in the following paper:

Sentence Similarity Measures for Fine-Grained Estimation of Topical Relevance in Learner Essays
Marek Rei and Ronan Cummins
In Proceedings of the 11th Workshop on Innovative Use of NLP for Building Educational Applications (BEA)
San Diego, United States, 2016

The source code for training the weights is available at: https://github.com/marekrei/weighted-embeddings

The trained weights are in weightedembeddings_word_weights.txt.
They are designed to be used together with the 300-dimensional word2vec vectors, pretrained on Google News:
https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit?usp=sharing
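
As a starting point for using the weights file, here is a hedged sketch of a loader. The layout of weightedembeddings_word_weights.txt is not documented here, so the code assumes one word per line followed by whitespace and a numeric weight; check the file and adjust the parsing if it differs. The function name load_word_weights is hypothetical. The word2vec vectors themselves can be read separately, for example with gensim's KeyedVectors.load_word2vec_format(path, binary=True).

```python
def load_word_weights(path):
    """Read word weights into a dict mapping word -> float.

    Assumed format (not confirmed by the README): each line holds a
    word, then whitespace, then its weight. Lines that do not match
    this shape are skipped.
    """
    weights = {}
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            parts = line.split()
            if len(parts) < 2:
                continue
            try:
                weights[parts[0]] = float(parts[-1])
            except ValueError:
                # Not a numeric weight, e.g. a header line.
                continue
    return weights
```

A typical call would be `weights = load_word_weights("weightedembeddings_word_weights.txt")`, after which each word's embedding can be multiplied by its weight before summing.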