# Linguistic Regularities in Word Representations

In 2013, Mikolov et al. (2013) published a paper showing that complicated semantic analogy problems could be solved simply by adding and subtracting vectors learned with a neural network. Since then, there has been some more investigation into what is actually behind this method, and also some suggested improvements. This post is a summary/discussion of the paper “Linguistic Regularities in Sparse and Explicit Word Representations“, by Omer Levy and Yoav Goldberg, published at ACL 2014.

The task under consideration is analogy recovery. These are questions in the form:

a is to b as c is to d

In a usual setting, the system is given words a, b, c, and it needs to find d. For example:

‘apple’ is to ‘apples’ as ‘car’ is to ?

where the correct answer is ‘cars’. Or the well-known example:

‘man’ is to ‘woman’ as ‘king’ is to ?

where the desired answer is ‘queen’.

While methods such as relation extraction would also be completely reasonable approaches to this problem, the research is mainly focused on solving it by using vector similarity methods. This means we create vector representations for each of the words, and then use their positions in the high-dimensional feature space to determine what the missing word should be.

## Vector Spaces

As the first step, we need to create feature vectors for each word in our vocabulary. The two main ways of doing this, which are also considered by this paper, are:

1. BOW: The bag-of words approach. We count all the words that a certain word appears with, treat the counts as features in a vector, and weight them using something like positive pointwise mutual information (PPMI).
2. word2vec: Vectors created using the word2vec toolkit. Low-dimensional dense vector representations are learned for each word using a neural network.

If you want to learn more details about these models, take a look at an earlier post about comparing both of these models.

## Analogy Artihmetic

We have the vectors, but how can we find the correct words in the analogy task?

The idea of Mikolov et al. (2013) was that the vector offset of vectors should reflect their relation:

$$a – b \approx c – d$$
$$man -woman \approx king -queen$$

We can rearrange this to find d:

$$d \approx c – a +b$$
$$queen \approx king -man +woman$$

Therefore, we can calculate the vector for $$c – a + b$$, compare it to the vectors of all words in our vocabulary, and return the highest-scoring word as the answer.

$$d_w = arg max_{d_w’ \in V}(cos(d’, c – a +b))$$

I’m using $$d_w$$ here to denote the actual word, whereas the $$d$$ stands for the vector representation of $$d_w$$. $$V$$ is our whole vocabulary. $$cos()$$ is the cosine similarity between two vectors. Let’s refer to this as the addition method.

An alternative to this is to consider the directions of the offset directly. We find the word that has the most similar offset (in terms of direction), compared to the offset of the sample words.

$$d_w = arg max_{d’_w \in V}(cos(a – b, c – d’))$$

It only looks at the direction of the relational step, ignoring the actual magnitude of this step. We’ll see later that sometimes this is good, sometimes not so much. We’ll refer to this method as direction. Apparently this method was actually used by Mikolov in some experiments, although it wasn’t included in the paper.

Now onto the modification proposed by Levy & Goldberg (2014). The addition method basically looks for a word that is similar to c and b, and not similar to a, with each contributing equally. This can lead to cases where one large term dominates the equation, especially if the magnitudes of the similarities are very different. Therefore, the authors have proposed that instead of adding these similarities, they could be multiplied:

$$d_w = arg max_{d’_w \in V}\frac{cos(d’, c) cos(d’, b)}{cos(d’, a) + \epsilon}$$

If the vectors are normalised (unit length), then this formula actually becomes a multiplication of their scalar products:

$$d_w = arg max_{d’_w \in V}\frac{(d’ \cdot c) \times (d’ \cdot b)}{(d’ \cdot a) + \epsilon}$$

$$\epsilon$$ is set to 0.001 to avoid division by zero. Also, the formula doesn’t handle negative similarity scores, whereas cosine (and the scalar product) can be negative. Therefore, the values are transformed into a positive range by doing $$x_{new} = (x + 1)/2$$. We’ll refer to this as the multiplication method.

## Evaluation

The alternative methods are evaluated on the task of analogy recovery, using three different datasets:

1. The MSR dataset containing 8,000 analogy questions.
2. The GOOGLE dataset with 19,544 analogy questions.
3. The SEMEVAL dataset, covering 79 distinct relation types.

In the MSR and GOOGLE datasets, the system needs to find the correct answer from an open vocabulary. In contrast, the SEMEVAL task provides candidates for each relation that need to be reranked. Accuracy is used as the evaluation metric on all datasets.

The word vectors were trained on 1.5 billion words of English Wikipedia. The vocabulary was restricted to contain only words that occurred at least 100 times in the corpus, resulting in 189,533 words.

## Results

Some observations based on the results:

• The multiplication method outperforms the additon method on two tasks, and gives comparable results on the third one.
• The direction method performs horribly on the MSR and GOOGLE tasks, but gives the best results on SEMEVAL. This is because the method only looks at the direction of the relation, but doesn’t care if the result is actually positioned very far in the semantic space. The SEMEVAL task restricts the candidates to a small list, and the method is able to pick the best option, but doesn’t perform as well on open vocabulary.
• The BOW model is generally worse than word2vec when using the addition method, but is comparable to the best results when using multiplication. This indicates that the property of linguistic regularities applies to both classical BOW and neural network representations.

## Discussion

Based on my experience with applying different vector operations to discover linguistic regularities, I’d like to point out two observations.

First, while the ability to recover semantic relations in such a way is remarkable, it is not quite as impressive as it seems at first glance. Let’s take the well-known example of king – man + woman = queen. Depending on the parameters I used, several vector spaces I built actually gave queen as the most similar word to king, with no vector operations needed. This makes sense, because king and queen are semantically very related, and probably occur in very similar contexts. But it also means that (woman – man) only needs to be a zero vector or some small random noise in order to give the correct answer. I’m not saying this is a major factor in these experiments, but I would encourage future work to also report baselines without using any vector operations.

Second, I observed that the quality of these regularities highly depends on the examples that are provided, and how they relate to the query word. For my work (Rei & Briscoe, 2014) I looked at hyponym relations, and below are some examples when trying to find hyponyms for bird.

The greyed out terms are the query terms, and it is common practice to ignore these in the results; the bold results are correct answers. As you can see, just the plain vector similarity returns one bird species correctly in the top results. When using (salmon,fish) as the example pair, the system is quite good and returns a whole lot more. But as I start to use hyponym examples from other domains, the quality degrades fast, and (bird-treatment+therapy) actually returns worse results than the basic vector similarity.

I believe there is still plenty of space to further improve out understanding of how we can operate on various types of features, in order to model different semantic relations and compositional semantics.

## References

Levy, O., & Goldberg, Y. (2014). Linguistic Regularities in Sparse and Explicit Word Representations. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (pp. 171–180).

Mikolov, T., Yih, W., & Zweig, G. (2013). Linguistic Regularities in Continuous Space Word Representations, (June), 746–751.

Rei, M., & Briscoe, T. (2014). Looking for Hyponyms in Vector Space. In CoNLL-2014 (pp. 68–77).

# Neural Networks, Part 3: The Network

We have learned about individual neurons in the previous section, now it’s time to put them together to form an actual neural network.

The idea is quite simple – we line multiple neurons up to form a layer, and connect the output of the first layer to the input of the next layer. Here is an illustration:

Each red circle in the diagram represents a neuron, and  the blue circles represent fixed values. From left to right, there are four columns: the input layer, two hidden layers, and an output layer. The output from neurons in the previous layer is directed into the input of each of the neurons in the next layer.

We have 3 features (vector space dimensions) in the input layer that we use for learning: $$x_1$$, $$x_2$$ and $$x_3$$. The first hidden layer has 3 neurons, the second one has 2 neurons, and the output layer has 2 output values. The size of these layers is up to you – on complex real-world problems we would use hundreds or thousands of neurons in each layer.

The number of neurons in the output layer depends on the task. For example, if we have a binary classification task (something is true or false), we would only have one neuron. But if we have a large number of possible classes to choose from, our network can have a separate output neuron for each class.

The network in Figure 1 is a deep neural network, meaning that it has two or more hidden layers, allowing the network to learn more complicated patterns. Each neuron in the first hidden layer receives the input signals and learns some pattern or regularity. The second hidden layer, in turn, receives input from these patterns from the first layer, allowing it to learn “patterns of patterns” and higher-level regularities. However, the cost of adding more layers is increased complexity and possibly lower generalisation capability, so finding the right network structure is important.

## Implementation

I have implemented a very simple neural network for demonstration. You can find the code here: SimpleNeuralNetwork.java

The first important method is initialiseNetwork(), which sets up the necessary structures:

	public void initialiseNetwork(){
input = new double[1 + M]; // 1 is for the bias
hidden = new double[1 + H];
weights1 = new double[1 + M][H];
weights2 = new double[1 + H];

input[0] = 1.0; // Setting the bias
hidden[0] = 1.0;
}


$$M$$ is the number of features in the feature vectors, $$H$$ is the number of neurons in the hidden layer. We add 1 to these, since we also use the bias constants.

We represent the input and hidden layer as arrays of doubles. For example, hidden[i] stores the current output value of the i-th neuron in the hidden layer.

The first set of weights, between the input and hidden layer, are stored as a matrix. Each of the $$(1 + M)$$ neurons in the input layer connects to $$H$$ neurons in the hidden layer, leading to a total of $$(1 + M) \times H$$ weights. We only have one output neuron, so the second set of weights between hidden and output layers is technically a $$(1 + H) \times 1$$ matrix, but we can just represent that as a vector.

The second important function is forwardPass(), which takes an input vector and performs the computation to reach an output value.

	public void forwardPass(){
for(int j = 1; j < hidden.length; j++){
hidden[j] = 0.0;
for(int i = 0; i < input.length; i++){
hidden[j] += input[i] * weights1[i][j-1];
}
hidden[j] = sigmoid(hidden[j]);
}

output = 0.0;
for(int i = 0; i < hidden.length; i++){
output += hidden[i] * weights2[i];
}
output = sigmoid(output);
}


The first for-loop calculates the values in the hidden layer, by multiplying the input vector with the weight vector and applying the sigmoid function. The last part calculates the output value by multiplying the hidden values with the second set of weights, and also applying the sigmoid.

## Evaluation

To test out this network, I have created a sample dataset using the database at quandl.com. This dataset contains sociodemographic statistics for 141 countries:

• Population density (per suqare km)
• Population growth rate (%)
• Urban population (%)
• Life expectancy at birth (years)
• Fertility rate (births per woman)
• Infant mortality (deaths per 1000 births)
• Enrolment in tertiary education (%)
• Unemployment (%)
• Estimated control of corruption (score)
• Estimated government effectiveness (score)
• Internet users (per 100)

Based on this information, we want to train a neural network that can predict whether the GDP per capita is more than average for that country (label 1 if it is, 0 if it’s not).

I’ve separated the dataset for training (121 countries) and testing (40 countries). The values have been normalised, by subtracting the mean and dividing by the standard deviation, using a script from a previous article. I’ve also pre-trained a model that we can load into this network and evaluate. You can download these from here: original data, training data, test data, pretrained model.

You can then execute the neural network (remember to compile and link the binaries):

java neuralnet.SimpleNeuralNetwork data/model.txt data/countries-classify-gdp-normalised.test.txt


The output should be something like this:

Label: 0	Prediction: 0.01
Label: 0	Prediction: 0.00
Label: 1	Prediction: 0.99
Label: 0	Prediction: 0.00
...
Label: 0	Prediction: 0.20
Label: 0	Prediction: 0.01
Label: 1	Prediction: 0.99
Label: 0	Prediction: 0.00
Accuracy: 0.9


The network is in verbose mode, so it prints out the labels and predictions for each test item. At the end, it also prints out the overall accuracy. The test data contains 14 positive and 26 negative examples; a random system would have had accuracy 50%, whereas a biased system would have accuracy 65%. Our network managed 90%, which means it has learned some useful patterns in the data.

In this case we simply loaded a pre-trained model. In the next section, I will describe how to learn this model from some training data.

# Neural Networks, Part 2: The Neuron

A neuron is a very basic classifier. It takes a number of input signals (a feature vector) and outputs a single value (a prediction). A neuron is also a basic building block of neural networks, and by combining together many neurons we can build systems that are capable of learning very complicated patterns. This is part 2 of an introductory series on neural networks. If you haven’t done so yet, you might want to start by learning about the background to neural networks in part 1.

Neurons in artificial neural networks are inspired by biological neurons in nervous systems (shown below). A biological neuron has three main parts: the main body (also known as the soma), dendrites and an axon. There are often many dendrites attached to a neuron body, but only one axon, which can be up to a meter long. In most cases (although there are exceptions), the neuron receives input signals from dendrites, and then outputs its own signals through the axon. Axons in turn connect to the dendrites of other neurons, using special connections called synapses, forming complex neural networks.

Below is an illustration of an artificial neuron, where the input is passed in from the left and the prediction comes out from the right. Each input position has a specific weight in the neuron, and they determine what output to give, given a specific input vector. For example, a neuron could be trained to detect cities. We can then take the vector for London from the previous section, give it as input to our neuron, and it will tell us it’s a city by outputting value 1. If we do the same for the word Tuesday, it will give a 0 instead, meaning that it’s not a city.

You might notice that there’s a constant value of $$+1$$ as one of the input signals, and it has a separate weight $$w_0$$. This is called a bias, and it allows the network to shift the activation function up or down. Biases are not strictly required for building neurons or neural networks, but they can be very important to the performance, depending on the types of feature values you are using.

Let’s say we have an input vector $$[1, x_1, x_2]$$ and a weight vector $$[w_0, w_1, w_2]$$. Internally, we first multiply the corresponding input values with their weights, and add them together:

$$z = (w_0 \times 1) + (w_1 \times x_1) + (w_2 \times x_2)$$

Then, we pass the sum through an activation function. In this case we will use the sigmoid function (also known as the logistic curve) as our activation function.

$$y = f(z) = f((w_0 \times 1) + (w_1 \times x_1) + (w_2 \times x_2))$$

where

$$f(t) = \frac{1}{1 + e^{-t}}$$

The sigmoid function takes any real value and maps it to a range between 0 and 1. When plotted, a sigmoid function looks like this:

Using a sigmoid as our activation function has some benefits:

• Regardless of input, it will map everything to a range between 0 and 1. We don’t have to worry about output values exploding for unexpected input vectors.
• The function is non-linear, which allows us to learn more complicated non-linear relationships in our neural networks.
• It is differentiable, which comes in handy when we try to perform backpropagation to train our neural networks.

However, it is not required to use an activation function, and there exist many successful network architectures that don’t use one. We could also use a different activation, such as a hyperbolic tangent or a rectifier.

We have now looked at all the components of a neuron, but we’re not going to stop here. Richard Feynman once said “What I cannot create, I do not understand”, so let us create a working example of a neuron. Here is the complete code required to test a neuron with pre-trained weights:

public class SimpleNeuron {

private double[] weights;

public SimpleNeuron(double[] weights){
this.weights = weights;
}

public double classify(double[] input){
double value = 0.0;

value += weights[0] * 1.0;

// Adding the rest of the weights
for(int i = 0; i < input.length; i++)
value += weights[i + 1] * input[i];

// Passing the value through the sigmoid activation function
value = 1.0 / (1.0 + Math.exp(-1.0 * value));

return value;
}

public static void main(String[] args) {
// Creating data structures to hold the data
String[] names = new String[5];
double[][] vectors = new double[5][2];

// Inserting the data
names[0] = "London";
vectors[0][0] = 0.86;
vectors[0][1] = 0.09;
names[1] = "Paris";
vectors[1][0] = 0.74;
vectors[1][1] = 0.11;
names[2] = "Tuesday";
vectors[2][0] = 0.15;
vectors[2][1] = 0.77;
names[3] = "Friday";
vectors[3][0] = 0.05;
vectors[3][1] = 0.82;
names[4] = "???";
vectors[4][0] = 0.59;
vectors[4][1] = 0.19;

// Initialising the weights
double[] weights = {0.0, 100.0, -100.0};
SimpleNeuron neuron = new SimpleNeuron(weights);

// Classifying each of the data points
for(int i = 0; i < names.length; i++){
double prediction = neuron.classify(vectors[i]);
System.out.println(names[i] + " : " + (int)prediction);
}
}
}


The classify() function is the interesting part in this code. It takes a feature vector as an argument and returns the prediction value y. For testing, we use the examples from the previous section and try to classify each of them. The output of running this code will be as follows:

London : 1
Paris : 1
Tuesday : 0
Friday : 0
??? : 1


As you can see, the neuron has successfully separated cities from days. It has also provided a label for the previously-unknown example – apparently the last data point should belong with cities as well.

For this code example, I manually chose and hard-coded weight values, so that it would provide a good classification. In later sections we will see how to have the system learn these values automatically, using some training data.

Now that we know about individual neurons, in the next section we’ll look at how to connect them together and form neural networks.

# Neural Networks, Part 1: Background

Artificial neural networks (NN for short) are practical, elegant, and mathematically fascinating models for machine learning. They are inspired by the central nervous systems of humans and animals – smaller processing units (neurons) are connected together to form a complex network that is capable of learning and adapting.

The idea of such neural networks is not new. McCulloch-Pitts (1943) described binary threshold neurons already back in 1940’s. Rosenblatt (1958) popularised the use of perceptrons, a specific type of neurons, as very flexible tools for performing a variety of tasks. The rise of neural networks was halted after Minsky and Papert (1969) published a book about the capabilities of perceptrons, and mathematically proved that they can’t really do very much. This result was quickly generalised to all neural networks, whereas it actually applied only to a specific type of perceptrons, leading to neural networks being disregarded as a viable machine learning method.

In recent years, however, the neural network has made an impressive comeback. Research in the area has become much more active, and neural networks have been found to be more than capable learners, breaking state-of-the-art results on a wide variety of tasks. This has been substantially helped by developments in computing hardware, allowing us to train very large complex networks in reasonable time. In order to learn more about neural networks, we must first understand the concept of vector space, and this is where we’ll start.

A vector space is a space where we can represent the position of a specific point or object as a vector (a sequence of numbers). You’re probably familiar with 2 or 3-dimensional coordinate systems, but we can easily extend this to much higher dimensions (think hundreds or thousands). However, it’s quite difficult to imagine a 1000-dimensional space, so we’ll stick to 2-dimensional illustations.

In the graph below, we have placed 4 objects in a 2-dimensional space, and each of them has a 2-dimensional vector that represents their position in this space. For machine learning and classification we make the assumption that similar objects have similar coordinates and are therefore positioned close to each other. This is true in our example, as cities are positioned in the upper left corner, and days-of-the-week are positioned a bit further in the lower right corner.

Let’s say we now get a new object (see image below) and all we know are its coordinates. What do you think, is this object a city or a day-of-the-week? It’s probably a city, because it is positioned much closer to other existing cities we already know about.

This is the kind of reasoning that machine learning tries to perform. Our example was very simple, but this problem gets more difficult when dealing with thousands of dimensions and millions of noisy datapoints.

In a traditional machine learning context these vectors are given as input to the classifier, both at training and testing time. However, there exist methods of representation learning where these vectors are learned automatically, together with the model.

Now that we know about vector spaces, it’s time to look at how the neuron works.

### References

1. McCulloch, Warren S., and Walter Pitts. “A logical calculus of the ideas immanent in nervous activity.” The Bulletin of Mathematical Biophysics 5.4 (1943): 115-133.
2. Minsky, Marvin, and Papert Seymour. “Perceptrons.” (1969).
3. Rosenblatt, Frank. “The perceptron: a probabilistic model for information storage and organization in the brain.” Psychological review 65.6 (1958): 386.