Most parsers exploit supervised machine learning methods and a syntactically annotated dataset (i.e., a treebank), incorporating a wide range of features in the training process to deliver competitive performance. The use of lexically-conditioned features, such as relations between lemmas or word forms, is often critical when choosing the correct syntactic analysis in ambiguous contexts. However, utilising such features leads the parser to learn information that is often specific to the domain and/or genre of the training data. Several experiments have demonstrated that lexical features learnt in one domain provide little if any benefit when parsing text from different domains and genres (Sekine, 1997; Gildea, 2001). Furthermore, manual creation of in-domain treebanks is an expensive and time-consuming process, which can only be performed by experts with sufficient linguistic and domain knowledge.

Instead of trying to adapt a lexicalised parser to new domains, we have developed a framework for integrating bilexical features with an unlexicalised parser, without using supervision. As our novel self-learning framework requires only a large unannotated corpus, lexical features can be easily tuned to a specific domain or genre by selecting a suitable dataset.

The framework has three main components, described in the sections below:

1. Parse reranking using bilexical features derived from a background corpus.
2. Graph expansion through rule-based operations on the dependency graphs.
3. Smoothing of edge scores using distributional similarity.

Our experiments showed a statistically significant increase in F-score on both the DepBank/GR and Genia-GR datasets.

Parse reranking

The underlying hypothesis is that a large corpus will often contain examples of dependency relations in non-ambiguous contexts, and these will mostly be correctly parsed by an unlexicalised parser. Lexical statistics derived from the corpus can then be used to select the correct parse in a more difficult context.

The system assigns an individual confidence score to every dependency edge in the graph. These scores are then combined into an overall graph score, which is used to rerank alternative derivations of the same sentence.
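The combination step above can be sketched as follows. This is a minimal illustration, not the system's actual implementation: it assumes each candidate parse is a list of (label, head, dependent) triples, and combines edge scores by simple averaging (the real system may use a different combination function).

```python
# Hypothetical sketch of the reranking step: score every edge of each
# candidate parse, combine the edge scores into a graph score, and keep
# the highest-scoring candidate.

def rerank(candidates, edge_score):
    """Return the candidate parse whose edges have the highest mean score.

    candidates: list of parses, each a list of (label, head, dep) triples.
    edge_score: function mapping one triple to a confidence score.
    """
    def graph_score(parse):
        if not parse:
            return 0.0
        # Average the per-edge confidences into one graph-level score.
        return sum(edge_score(edge) for edge in parse) / len(parse)

    return max(candidates, key=graph_score)


# Toy example: prefer the derivation whose edges are well-attested.
scores = {
    ("dobj", "meets", "members"): 0.9,
    ("dobj", "meets", "officials"): 0.2,
}
parse_a = [("dobj", "meets", "members")]
parse_b = [("dobj", "meets", "officials")]
best = rerank([parse_a, parse_b], lambda e: scores.get(e, 0.0))
# best is parse_a, since its only edge has the higher confidence
```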

We experimented with several edge scoring methods. One example scores an edge by the probability of a specific dependency relation holding between w1 and w2, relative to the probability of seeing w1 and w2 together in the same sentence.
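Such a score can be realised from corpus counts as sketched below. The count tables and their normalisation are assumptions for illustration; only the ratio itself follows the description above.

```python
# Sketch of one edge-scoring function: the numerator counts sentences in
# which w1 and w2 occur in relation rel, the denominator counts sentences
# in which w1 and w2 merely co-occur.

def edge_score(rel, w1, w2, rel_counts, cooc_counts):
    """Estimate P(rel holds between w1 and w2 | w1, w2 in the same sentence)."""
    cooc = cooc_counts.get((w1, w2), 0)
    if cooc == 0:
        return 0.0  # the pair was never observed together
    return rel_counts.get((rel, w1, w2), 0) / cooc


# Toy counts: "read" and "book" co-occur in 40 sentences, and in 30 of
# those "book" is the direct object of "read".
rel_counts = {("dobj", "read", "book"): 30}
cooc_counts = {("read", "book"): 40}
s = edge_score("dobj", "read", "book", rel_counts, cooc_counts)
# s == 0.75
```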

Graph expansion

We apply a series of rule-based modifications to the dependency graphs, creating additional nodes and edges. The motivation for this graph expansion step is similar to that motivating the move from first-order to higher-order dependency path feature types (e.g., Carreras (2007)). However, compared to using all nth-order paths, these rules are chosen to maximise the utility and minimise the sparsity of the resulting bilexical features.

Figure 1 illustrates the graph modification process.

Figure 1: Modified graph for the sentence ‘Italian PM meets with Cabinet members and senior officials’ after four steps. Edges above the text are created by the parser; edges below the text are created automatically by the graph expansion operations.
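One plausible expansion rule, suggested by the coordination in the example sentence, can be sketched as follows. The rule itself is an assumption for illustration, not necessarily part of the system's actual rule set: when a head governs the first conjunct of a coordination, the same relation is copied to the other conjunct, yielding extra bilexical features.

```python
# Illustrative graph expansion rule: propagate relations across 'conj'
# edges, so that both conjuncts receive the governing relation.

def expand_conjunctions(edges):
    """Return the input edges plus copies propagated across 'conj' relations."""
    # Collect (first_conjunct, second_conjunct) pairs.
    conj = {(h, d) for (lbl, h, d) in edges if lbl == "conj"}
    # For every non-conj edge pointing at a first conjunct, add the same
    # relation to the second conjunct.
    extra = [
        (lbl, h, d2)
        for (lbl, h, d) in edges
        if lbl != "conj"
        for (h2, d2) in conj
        if h2 == d
    ]
    return edges + extra


# 'meets' governs 'members', and 'members' is coordinated with
# 'officials', so the rule also attaches 'officials' to 'meets'.
edges = [
    ("dobj", "meets", "members"),
    ("conj", "members", "officials"),
]
expanded = expand_conjunctions(edges)
# adds ("dobj", "meets", "officials")
```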

Smoothing edge scores

Most successful edge scoring methods rely on correctly estimating the probability of seeing a specific dependency triple (the label, head and dependent). Even in a large background corpus these triples will be very sparse, and it can be useful to find approximate methods for estimating the edge scores.

We use a directed distributional similarity measure to automatically generate candidate substitutes for every word in the relation. These substitutes are then also used to calculate edge confidence scores, and the results are averaged for a more robust estimate.

For example, if (dobj, read, publication) is infrequent in the data, the system might predict that book is a reasonable substitute for publication and use (dobj, read, book) to estimate the original probability.
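The smoothing idea can be sketched as below. The substitute lists and scores here are toy values, and averaging over the dependent's substitutes is one simple variant; the system may also substitute the head or weight substitutes by similarity.

```python
# Sketch of substitute-based smoothing: average the score of the original
# triple with scores of triples built from distributionally similar words.

def smoothed_score(rel, head, dep, score, substitutes, k=2):
    """Average edge scores over the dependent and its top-k substitutes."""
    words = [dep] + substitutes.get(dep, [])[:k]
    return sum(score((rel, head, w)) for w in words) / len(words)


scores = {
    ("dobj", "read", "publication"): 0.0,  # unseen in the corpus
    ("dobj", "read", "book"): 0.9,
    ("dobj", "read", "article"): 0.6,
}
subs = {"publication": ["book", "article"]}
s = smoothed_score("dobj", "read", "publication",
                   lambda t: scores.get(t, 0.0), subs)
# s == 0.5, a more robust estimate than the raw 0.0
```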

More information

For more details, please refer to the following publications:

The code

The system is currently implemented as a prototype for running specific experiments, rather than as an easy-to-use tool for improving general parsing. However, we make the code publicly available as part of the SemSim project:

Refer to src/sem/apps/parsererank/ as the main starting point for rerunning our parse reranking experiments.