Don’t count, predict

Marek, September 22, 2014

In the past couple of years, neural networks have nearly taken over the field of NLP, and they now appear in state-of-the-art systems for many tasks. One interesting application is distributional semantics, where they can be used to learn dense vector representations for words. Marco Baroni, Georgiana Dinu and German Kruszewski presented a paper at ACL 2014 called “Don’t count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors”, where they compare these new neural network models with more traditional context vectors on a range of different tasks. Here I will try to give an overview and a summary of their work.

Distributional hypothesis

The goal is to find how similar two words are to each other semantically. The distributional hypothesis states:

Words which are similar in meaning occur in similar contexts
(Rubenstein & Goodenough, 1965).

Therefore, if we want to find a word similar to “magazine”, we can look for words that occur in similar contexts, such as “newspaper”.

I was reading a magazine today	I was reading a newspaper today
The magazine published an article	The newspaper published an article
He buys this magazine every day	He buys this newspaper every day

Also, if we want to measure how similar “magazine” and “newspaper” are, we can compare the contexts in which they appear. For example, we can represent the contexts of each word as a feature vector and calculate the cosine similarity between the two vectors.
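To make the cosine comparison concrete, here is a minimal sketch in Python. The feature names and counts are invented for illustration and do not come from the paper.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for empty context vectors
    return dot / (norm_u * norm_v)

# Hypothetical context counts over the features [reading, published, buys, drives]
magazine = [1, 1, 1, 0]
newspaper = [1, 1, 1, 0]
car = [0, 0, 1, 1]

print(cosine_similarity(magazine, newspaper))  # identical contexts, close to 1
print(cosine_similarity(magazine, car))        # one shared context, much lower
```

Words with identical context counts get a similarity of (approximately) 1, while words that share few contexts score closer to 0.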

The counting model

In a traditional counting model, we count how many times a certain context word appears near our main word. We have to manually choose a window size – the number of words we count on either side of the main word.

For example, given sentences “He buys this newspaper every day” and “I read a newspaper every day”, we can look at window size 2 (on either side of the main word) and create the following count vector:

[table]
buys,this,every,day,read,a
1,1,2,2,1,1
[/table]

As you can see, “every” and “day” get a count of 2, because they appear in both sentences, whereas “he” doesn’t get included because it doesn’t fit into the window of 2 words.
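The counting step above can be sketched in a few lines of Python. The function name `count_contexts` and the whitespace tokenisation are my own simplifications, not something taken from the paper.

```python
from collections import Counter

def count_contexts(sentences, target, window=2):
    """Count words that occur within `window` positions of `target`."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()  # naive whitespace tokenisation
        for i, token in enumerate(tokens):
            if token != target:
                continue
            left = tokens[max(0, i - window):i]
            right = tokens[i + 1:i + 1 + window]
            counts.update(left + right)
    return counts

sentences = [
    "He buys this newspaper every day",
    "I read a newspaper every day",
]
counts = count_contexts(sentences, "newspaper", window=2)
print(counts)
```

With a window of 2 this reproduces the table above: “every” and “day” are counted twice, and “he” never enters the window.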

In order to get the best performance, it is important to use a weighting function which turns these counts into more informative values. A good weighting scheme would downweight frequent words such as “this” and “a”, and upweight more informative words such as “read”. The authors experimented with two weighting schemes: Pointwise Mutual Information and Local Mutual Information. An overview of various weighting methods can be found in the thesis of Curran (2003).
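As a rough illustration of the weighting idea, the sketch below computes Pointwise Mutual Information from raw co-occurrence counts. The counts are invented toy values, and the paper's exact setup (smoothing, handling of negative values) may differ.

```python
import math
from collections import Counter

def pmi_weights(cooc):
    """Map (word, context) -> PMI, given raw co-occurrence counts."""
    total = sum(cooc.values())
    word_totals = Counter()
    context_totals = Counter()
    for (word, context), n in cooc.items():
        word_totals[word] += n
        context_totals[context] += n
    weights = {}
    for (word, context), n in cooc.items():
        p_joint = n / total
        p_word = word_totals[word] / total
        p_context = context_totals[context] / total
        # PMI compares the observed joint probability with what we would
        # expect if the word and the context were independent.
        weights[(word, context)] = math.log2(p_joint / (p_word * p_context))
    return weights

# Toy counts (hypothetical): "a" occurs with everything, "read" only with "newspaper"
cooc = {
    ("newspaper", "a"): 4,
    ("newspaper", "read"): 4,
    ("car", "a"): 4,
    ("car", "drive"): 4,
}
weights = pmi_weights(cooc)
print(weights[("newspaper", "a")])     # uninformative context, PMI of 0
print(weights[("newspaper", "read")])  # informative context, PMI of 1
```

The frequent context “a” appears equally often with both words, so it receives a weight of 0, whereas “read” is specific to “newspaper” and is weighted up.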

These vectors are also often compressed. The motivation is that feature words such as “buy” and “purchase” are very similar and compression techniques should be able to represent them as a single semantic class. In addition, this gives smaller dense vectors, which can be easier to operate on. The authors here experimented with SVD and Non-negative Matrix Factorization.

The predicting model

As the neural network (predicting) model, this work uses the word2vec implementation of Continuous Bag-of-Words (CBOW). Word2vec is a toolkit for efficiently learning word vectors from very large text corpora (Mikolov et al., 2013). The CBOW architecture is shown below:

The vectors of context words are given as input to the network. They are summed and then used to predict the main word. During training, the error is backpropagated and the context vectors are updated so that they would predict the correct word. Since similar words appear in similar contexts, the vectors of similar words will also be updated towards similar directions.
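The description above can be turned into a small toy implementation. This is only a sketch using a full softmax and plain gradient descent; the vocabulary, vector size and learning rate are arbitrary choices of mine, and the real word2vec code is considerably more optimised.

```python
import math
import random

random.seed(0)

vocab = ["he", "buys", "this", "newspaper", "every", "day"]
word_to_id = {word: i for i, word in enumerate(vocab)}
dim = 4  # toy vector size, chosen only for illustration

def random_matrix(rows, cols):
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

context_vectors = random_matrix(len(vocab), dim)  # input (context) vectors
output_vectors = random_matrix(len(vocab), dim)   # output-layer weights

def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def forward(context_ids):
    # Sum the context vectors, then score every word in the vocabulary.
    hidden = [sum(context_vectors[i][d] for i in context_ids) for d in range(dim)]
    scores = [sum(h * o for h, o in zip(hidden, out)) for out in output_vectors]
    return hidden, softmax(scores)

def train_step(context_words, target_word, lr=0.1):
    """One update; returns the target probability *before* the update."""
    context_ids = [word_to_id[w] for w in context_words]
    target_id = word_to_id[target_word]
    hidden, probs = forward(context_ids)
    # Gradient of the cross-entropy loss with respect to each output score.
    errors = [p - (1.0 if j == target_id else 0.0) for j, p in enumerate(probs)]
    # Backpropagate to the hidden layer before the output weights change.
    hidden_grad = [sum(errors[j] * output_vectors[j][d] for j in range(len(vocab)))
                   for d in range(dim)]
    for j, err in enumerate(errors):
        for d in range(dim):
            output_vectors[j][d] -= lr * err * hidden[d]
    # The hidden layer is a plain sum, so each context vector gets the same gradient.
    for i in context_ids:
        for d in range(dim):
            context_vectors[i][d] -= lr * hidden_grad[d]
    return probs[target_id]

context = ["buys", "this", "every", "day"]
before = train_step(context, "newspaper")
for _ in range(20):
    after = train_step(context, "newspaper")
print(before, after)  # the probability of "newspaper" should rise with training
```

Because the same update is applied to every context vector in the window, words that keep appearing around the same targets are pushed in similar directions.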

A normal network would require finding the probabilities of all possible words in our vocabulary (using softmax), which can be computationally very demanding. Word2vec implements two alternatives to speed this up:

Hierarchical softmax, where the words are arranged in a tree (similar to a decision tree), making the complexity logarithmic.
Negative sampling, where the system learns to distinguish the correct answer from a sample of a few incorrect answers.
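To give a feel for the second option, here is a minimal sketch of a negative sampling loss for a single training example. The vectors are made-up numbers, and the negative words are fixed by hand rather than sampled, so this only illustrates the shape of the objective.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def negative_sampling_loss(hidden, target_vec, negative_vecs):
    """Binary logistic loss: the target should score high, the negatives low."""
    loss = -math.log(sigmoid(dot(hidden, target_vec)))
    for neg in negative_vecs:
        loss -= math.log(sigmoid(-dot(hidden, neg)))
    return loss

# Hypothetical values: a summed context vector, two candidate targets
# and two negative words.
hidden = [0.5, -0.2, 0.1]
good_target = [1.0, -1.0, 0.5]   # points the same way as the context
bad_target = [-1.0, 1.0, -0.5]   # points the opposite way
negatives = [[-0.3, 0.4, 0.0], [0.1, 0.1, -0.2]]

loss_good = negative_sampling_loss(hidden, good_target, negatives)
loss_bad = negative_sampling_loss(hidden, bad_target, negatives)
print(loss_good, loss_bad)
```

Each example only touches the target and the handful of negative vectors, instead of every word in the vocabulary as the full softmax does.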

In addition, word2vec can downsample very frequent words (for example, function words such as “a” and “the”) which are not very informative.
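As a sketch of how this downsampling can work, the function below uses the discard probability 1 - sqrt(t / f) described by Mikolov and colleagues, where f is the relative frequency of a word and t is a threshold. The threshold value and the example frequencies are assumptions for illustration, and actual implementations may use a slightly different variant of the formula.

```python
import math

def discard_probability(relative_frequency, threshold=1e-5):
    """Probability of dropping one occurrence of a word during training."""
    if relative_frequency <= 0:
        return 0.0
    return max(0.0, 1.0 - math.sqrt(threshold / relative_frequency))

# Hypothetical relative frequencies
frequent = discard_probability(0.05)   # e.g. a function word such as "the"
rare = discard_probability(1e-6)       # a rare content word

print(frequent)  # most occurrences are dropped
print(rare)      # never dropped
```

Words that are rarer than the threshold are always kept, while very frequent function words lose most of their occurrences.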

From personal experience I have found skip-gram models (also implemented in word2vec) to perform better than CBOW, although they are slightly slower. Skip-grams were not compared in this work, so it is still an open question which of the two models gives better vectors, given the same training time.

Evaluation

The models were trained on a corpus of 2.8 billion tokens, combining ukWaC, the English Wikipedia and the British National Corpus. They used the 300,000 most frequent words as both the context and target words.

The authors performed a number of experiments on different semantic similarity tasks and datasets.

Semantic relatedness: In this task, humans were asked to rate the relatedness of two words, and the correlation between the system's scores and these human ratings is measured.
- rg: 65 noun pairs by Rubenstein & Goodenough (1965)
- ws: Wordsim353, 353 word pairs by Finkelstein et al. (2002)
- wss: Subset of Wordsim353 focused on similarity (Agirre et al., 2009)
- wsr: Subset of Wordsim353 focused on relatedness (Agirre et al., 2009)
- men: 1000 word pairs by Bruni et al. (2014)
Synonym detection: The system needs to find the correct synonym for a word.
- toefl: 80 multiple-choice questions with 4 synonym candidates (Landauer & Dumais, 1997).
Concept categorization: Given a set of concepts, the system needs to group them into semantic categories.
- ap: 402 concepts in 21 categories (Almuhareb, 2006)
- esslli: 44 concepts in 6 categories (Baroni et al., 2008)
- battig: 83 concepts in 10 categories (Baroni et al., 2010)
Selectional preferences: Using some examples, the system needs to decide how likely a noun is to be a subject or object for a specific verb.
- up: 221 word pairs (Pado, 2007)
- mcrae: 100 noun-verb pairs (McRae et al., 1998)
Analogy: Using examples, the system needs to find a suitable word to solve analogy questions, such as: “X is to ‘uncle’ as ‘sister’ is to ‘brother’”
- an: ~19,500 analogy questions (Mikolov et al., 2013b)
- ansyn: Subset of the analogy questions, focused on syntactic analogies
- ansem: Subset of the analogy questions, focused on semantic analogies
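
For the analogy task, a common approach with word vectors is the vector offset method: add and subtract the known vectors, then pick the nearest remaining word by cosine similarity. The sketch below uses hand-made two-dimensional vectors purely for illustration, and `solve_analogy` is a hypothetical helper, not code from the paper.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Toy vectors: the first dimension loosely encodes gender, the second generation.
vectors = {
    "uncle":   [1.0, 1.0],
    "aunt":    [-1.0, 1.0],
    "brother": [1.0, 0.0],
    "sister":  [-1.0, 0.0],
    "table":   [0.1, -1.0],
}

def solve_analogy(known, positive, negative):
    """Find X such that X is to `known` as `positive` is to `negative`."""
    query = [k + p - n for k, p, n in
             zip(vectors[known], vectors[positive], vectors[negative])]
    # The three question words are excluded from the candidates.
    candidates = [w for w in vectors if w not in (known, positive, negative)]
    return max(candidates, key=lambda w: cosine(query, vectors[w]))

# "X is to 'uncle' as 'sister' is to 'brother'"
print(solve_analogy("uncle", "sister", "brother"))
```

Here the query vector is uncle + sister - brother, which in this toy space lands exactly on “aunt”.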

Results

The graph below illustrates the performance of different models on all the tasks. The blue bars represent the counting models, and the red bars are for neural network models. The “individual” models are the best models on that specific task, whereas the “best-overall” is the single best model across all the tasks.

In conclusion, the neural networks win by a large margin. The neural models give a large improvement on semantic relatedness, synonym detection and analogy tasks. Performance is equal or slightly better on categorization and selectional preferences.

The best parameter choices for counting models are as follows:

window size 2 (bigger is not always better)
weighted with PMI, not LMI
no dimensionality reduction (not using SVD or NMF)

The best parameters for the neural network model were:

window size 5
negative sampling (not hierarchical softmax)
subsampling of frequent words
dimensionality 400

The paper also finds that the neural models are much less sensitive to parameter choices, as even the worst neural models perform relatively well, whereas counting models can thoroughly fail with unsuitable values.

It seems appropriate to quote the final conclusion of the authors:

Our secret wish was to discover that it is all hype, and count vectors are far superior to their predictive counterparts. A more realistic expectation was that a complex picture would emerge, with predict and count vectors beating each other on different tasks. Instead, we found that the predict models are so good that, while triumphalist overtones still sound excessive, there are very good reasons to switch to the new architecture.

The results certainly support the claim that vectors created by neural models are more suitable for distributional similarity tasks. We have also performed a similar comparison on the task of hyponym generation (Rei & Briscoe, 2014). Our findings match those here: the neural models outperform simple bag-of-words models. However, the neural models are in turn outperformed by a dependency-based model. It remains to be seen whether this finding also applies to a wider range of distributional similarity tasks.

References

Agirre, E., Alfonseca, E., Hall, K., Kravalova, J., Paşca, M., & Soroa, A. (2009). A study on similarity and relatedness using distributional and WordNet-based approaches. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL ’09), 19. doi:10.3115/1620754.1620758

Almuhareb, A. (2006). Attributes in Lexical Acquisition. PhD thesis, University of Essex.

Baroni, M., Barbu, E., Murphy, B., & Poesio, M. (2010). Strudel: A distributional semantic model based on properties and types. Cognitive Science, 34.

Baroni, M., Dinu, G., & Kruszewski, G. (2014). Don’t count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (pp. 238–247).

Baroni, M., Evert, S., & Lenci, A., editors. (2008). Bridging the gap between Semantic Theory and Computational Simulations: Proceedings of the ESSLLI Workshop on Distributional Lexical Semantics.

Bruni, E., Tran, N. K., & Baroni, M. (2014). Multimodal Distributional Semantics. J. Artif. Intell. Res.(JAIR), 49, 1–47.

Curran, J. R. (2003). From distributional to semantic similarity. PhD thesis, University of Edinburgh.

Finkelstein, L., Gabrilovich, E., Matias, Y., Rivlin, E., Solan, Z., Wolfman, G., & Ruppin, E. (2002). Placing search in context: the concept revisited. In ACM Transactions on Information Systems (Vol. 20, pp. 116–131). ACM. doi:10.1145/503104.503110

McRae, K., Spivey-Knowlton, M., & Tanenhaus, M. (1998). Modeling the influence of thematic fit (and other constraints) in on-line sentence comprehension. Journal of Memory and Language, 38, 283–312.

Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient Estimation of Word Representations in Vector Space. ICLR Workshop, 1–12.

Mikolov, T., Yih, W., & Zweig, G. (2013b). Linguistic Regularities in Continuous Space Word Representations. In Proceedings of NAACL-HLT 2013 (pp. 746–751).

Pado, U. (2007). The Integration of Syntax and Semantic Plausibility in a Wide-Coverage Model of Sentence Processing. Dissertation, Saarland University, Saarbrücken.

Rei, M., & Briscoe, T. (2014). Looking for Hyponyms in Vector Space. In CoNLL-2014 (pp. 68–77).

Rubenstein, H., & Goodenough, J. (1965). Contextual correlates of synonymy. Communications of the ACM, 8(10), 627–633.

8 Comments

Pingback: Linguistic Regularities in Word Representations - Marek Rei
anon
December 12, 2014 at 11:24 am

You state in the last paragraph that dependency-based models outperform neural-net models. Do you have a citation for that?
- Marek Post author
  June 22, 2015 at 12:34 am
  
  This was a finding in my own work: Looking for Hyponyms in Vector Space
  
  Omer Levy also found that dependency-based neural embeddings outperform traditional neural embeddings: Dependency-Based Word Embeddings
  
  But I’m sure this will entirely depend on the specific task where they are applied – dependency based features will put more emphasis on word similarity (car – bus) as opposed to relatedness (car – wheel).
Acron
September 29, 2016 at 12:42 am

Thanks Marek, great post.

Can we use word2vec to predict the next word in a sentence (autocomplete)? Most of the evaluation metrics use word analogies or lexical similarity; not really touching on finding the most appropriate next word that could work.
- Marek Post author
  October 1, 2016 at 3:34 pm
  
  Hi Acron,
  
  Word2vec isn’t really designed for next word prediction. It looks at words on both sides of the target word to build a better context representation, whereas next word prediction assumes you’ve only seen words on the left.
  You might want to look at language modelling, which can be done using similar approaches but is more suitable for next word prediction.