Word embeddings are great. They allow us to represent words as distributed vectors, such that semantically and functionally similar words have similar representations. Having similar vectors means these words also behave similarly in the model, which is what we want for good generalisation properties.
However, word embeddings have a couple of weaknesses:
- If a word doesn’t exist in the training data, we can’t have an embedding for it. Therefore, the best we can do is clump all unseen words together under a single OOV (out-of-vocabulary) token.
- If a word occurs only a few times, its embedding is likely of very poor quality. We simply don’t have enough information to learn how these words behave in different contexts.
- We can’t properly take advantage of character-level patterns. For example, there is no way to learn that all words ending with -ing are likely to be verbs. The best we can do is learn this for each word separately, but that doesn’t help when faced with new or rare words.
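The first weakness can be made concrete with a minimal sketch, assuming a toy vocabulary and randomly initialised embeddings: every word outside the training vocabulary is mapped to one shared `<OOV>` token, so all unseen words end up with the same representation.

```python
import numpy as np

# Toy vocabulary and randomly initialised embeddings (illustrative only).
rng = np.random.default_rng(0)
vocab = {"<OOV>": 0, "the": 1, "pound": 2, "extended": 3}
embeddings = rng.normal(size=(len(vocab), 4))  # 4-dimensional toy vectors

def embed(word):
    # Any word missing from the vocabulary falls back to the <OOV> row,
    # discarding whatever character-level clues the word carries.
    return embeddings[vocab.get(word, vocab["<OOV>"])]

# "euro" and "zzzz" are both unseen, so they receive identical vectors.
```

In practice the OOV vector is usually trained as well, for instance by randomly replacing rare words during training, but it still collapses all unseen words into a single point.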
In this post I will look at different ways of extending word embeddings with character-level information, in the context of neural sequence labeling models. You can find more information in the Coling 2016 paper “Attending to characters in neural sequence labeling models”.
We’ll investigate word representations with the goal of improving sequence labeling. In a sequence labeling setting, a system receives a sequence of tokens as input and has to assign a label to every token. The correct label typically depends on both the context and the token itself. Quite a large number of NLP tasks can be formulated as sequence labeling, for example:
POS-tagging
DT NN VBD NNS IN DT DT NN CC DT NN .
The pound extended losses against both the dollar and the euro .

Error detection
+ + + x + + + + + x +
I like to playing the guitar and sing very louder .

Named entity recognition
PER _ _ _ _ ORG ORG _ TIME _
Jim bought 300 shares of Acme Corp. in 2006 .

Chunking
B-NP B-PP B-NP I-NP B-VP I-VP I-VP I-VP B-PP B-NP B-NP O
Service on the line is expected to resume by noon today .
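All of these tasks share the same data shape: a sequence of tokens in, exactly one label per token out. A minimal sketch of that shape, reusing the named entity example above:

```python
# One label per input token; the two sequences must always match in length.
tokens = ["Jim", "bought", "300", "shares", "of", "Acme", "Corp.", "in", "2006", "."]
labels = ["PER", "_", "_", "_", "_", "ORG", "ORG", "_", "TIME", "_"]
assert len(tokens) == len(labels)

# Pair each token with its label: the form a sequence labeler is trained on.
pairs = list(zip(tokens, labels))
```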
In each of these cases, the model needs to understand how a word is being used in a specific context, and could also take advantage of character-level patterns and morphology.
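As a toy illustration of the general idea (my own simplification, not the paper’s model), the sketch below concatenates a word embedding with a sum of character embeddings, so that even unseen words receive a distinct, character-aware representation.

```python
import numpy as np

rng = np.random.default_rng(1)

# Character and word embedding tables (illustrative sizes only).
char_vocab = {c: i for i, c in enumerate("abcdefghijklmnopqrstuvwxyz")}
char_emb = rng.normal(size=(len(char_vocab), 3))
word_vocab = {"<OOV>": 0, "playing": 1, "guitar": 2}
word_emb = rng.normal(size=(len(word_vocab), 4))

def represent(word):
    # Word-level part: falls back to <OOV> for unseen words.
    w = word_emb[word_vocab.get(word, word_vocab["<OOV>"])]
    # Character-level part: a simple sum of character embeddings stands in
    # here for the learned character-level composition discussed in the post.
    c = np.zeros(char_emb.shape[1])
    for ch in word.lower():
        if ch in char_vocab:
            c += char_emb[char_vocab[ch]]
    return np.concatenate([w, c])  # 4 word dims + 3 character dims

# Unseen words like "singing" now differ from other unseen words, and
# shared suffixes such as -ing contribute shared character vectors.
```

Even this crude composition addresses all three weaknesses above: unseen words no longer collapse onto one vector, and character-level regularities like the -ing suffix can be shared across words.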