Most parsers exploit supervised machine learning methods and a syntactically annotated dataset (i.e. a treebank), incorporating a wide range of features in the training process to deliver competitive performance. The use of lexically-conditioned features, such as relations between lemmas or word forms, is often critical when choosing the correct syntactic analysis in ambiguous contexts. However, utilising such features leads the parser to learn information that is often specific to the domain and/or genre of the training data. Several experiments have demonstrated that lexical features learnt in one domain provide little if any benefit when parsing text from different domains and genres (Sekine, 1997; Gildea, 2001). Furthermore, manual creation of in-domain treebanks is an expensive and time-consuming process, which can only be performed by experts with sufficient linguistic and domain knowledge.
Instead of trying to adapt a lexicalised parser to new domains, we have developed a framework for integrating bilexical features with an unlexicalised parser, without using supervision. As our novel self-learning framework requires only a large unannotated corpus, lexical features can be easily tuned to a specific domain or genre by selecting a suitable dataset.
The framework has three main components:
- Parse reranking. We parse a large background corpus and use bilexical co-occurrence statistics to estimate confidence scores for dependency relations.
- Graph expansion. The system applies modifications to dependency graphs, creating new edges that model selected higher-order dependencies.
- Score smoothing. A directional distributional similarity measure is used to smooth individual edge scores.
Our experiments showed a statistically significant increase in F-score on both the DepBank/GR and Genia-GR datasets.
The underlying hypothesis is that a large corpus will often contain examples of dependency relations in non-ambiguous contexts, and these will mostly be correctly parsed by an unlexicalised parser. Lexical statistics derived from the corpus can then be used to select the correct parse in a more difficult context.
The system assigns an individual confidence score to every dependency edge in the graph. These scores are then combined into an overall graph score, which is used to rerank alternative derivations of the same sentence.
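The reranking step described above can be sketched as follows. This is a minimal illustration, not the project's implementation: the names `graph_score` and `rerank` are hypothetical, and combining edge scores with a geometric mean is an assumption made here so that graphs with different numbers of edges stay comparable.

```python
from math import exp, log
from typing import Callable, List, Tuple

# A dependency edge is a (relation, head, dependent) triple.
Edge = Tuple[str, str, str]


def graph_score(edges: List[Edge],
                edge_score: Callable[[Edge], float],
                floor: float = 1e-6) -> float:
    """Combine per-edge confidence scores into one graph score.

    Assumption: the geometric mean is used as the combination function.
    Scores are floored to avoid taking log(0) for unseen edges.
    """
    if not edges:
        return 0.0
    logs = [log(max(edge_score(e), floor)) for e in edges]
    return exp(sum(logs) / len(logs))


def rerank(derivations: List[List[Edge]],
           edge_score: Callable[[Edge], float]) -> List[List[Edge]]:
    """Return alternative derivations of one sentence, best graph first."""
    return sorted(derivations,
                  key=lambda d: graph_score(d, edge_score),
                  reverse=True)


# Toy edge scores, invented for illustration only.
toy_scores = {
    ("ncsubj", "read", "she"): 0.8,
    ("dobj", "read", "book"): 0.9,
    ("ncsubj", "read", "book"): 0.1,
}
derivation_a = [("ncsubj", "read", "she"), ("ncsubj", "read", "book")]
derivation_b = [("ncsubj", "read", "she"), ("dobj", "read", "book")]
best = rerank([derivation_a, derivation_b],
              lambda e: toy_scores.get(e, 0.0))[0]
```

With these toy scores, the derivation that attaches "book" as a direct object is ranked first, because its edges are individually more plausible.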
We experimented with different edge scoring methods. One of them estimates the probability of a specific dependency relation between w1 and w2, relative to the probability of seeing w1 and w2 in the same sentence.
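One plausible reading of this scoring function is a ratio of two counts taken from the parsed background corpus: the number of sentences in which the triple occurs, divided by the number of sentences in which both words occur. The sketch below follows that reading; the function names and the per-sentence counting scheme are assumptions, and the exact formulation is given in the publications listed below.

```python
from collections import Counter
from itertools import permutations


def collect_counts(parsed_sentences):
    """Count dependency triples and same-sentence word pairs.

    parsed_sentences: iterable of (tokens, edges) pairs, where edges are
    (relation, head, dependent) triples produced by the parser.
    Each triple and each word pair is counted at most once per sentence.
    """
    triple_counts = Counter()
    pair_counts = Counter()
    for tokens, edges in parsed_sentences:
        triple_counts.update(set(edges))
        # Ordered pairs of distinct word types that share a sentence.
        pair_counts.update(permutations(set(tokens), 2))
    return triple_counts, pair_counts


def edge_score(edge, triple_counts, pair_counts):
    """Estimate P(relation between w1 and w2 | w1 and w2 co-occur)."""
    _, head, dependent = edge
    together = pair_counts[(head, dependent)]
    return triple_counts[edge] / together if together else 0.0


# Toy background corpus, invented for illustration only.
corpus = [
    (["she", "read", "book"],
     [("ncsubj", "read", "she"), ("dobj", "read", "book")]),
    (["book", "read", "well"],
     [("ncsubj", "read", "book")]),
]
triples, pairs = collect_counts(corpus)
score = edge_score(("dobj", "read", "book"), triples, pairs)
```

Here "read" and "book" share two sentences but stand in a dobj relation in only one of them, so the estimate is 1/2.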
We apply a series of rule-based modifications to the dependency graphs, creating additional nodes and edges. The motivation for this graph expansion step is similar to that motivating the move from first-order to higher-order dependency path feature types (e.g. Carreras, 2007). However, compared to using all nth-order paths, these rules are chosen to maximise the utility and minimise the sparsity of the resulting bilexical features.
Figure 1 illustrates the graph modification process.
Figure 1: Modified graph for the sentence ‘Italian PM meets with Cabinet members and senior officials’ after four steps. Edges above the text are created by the parser, edges below the text are automatically created using the graph expansion operations.
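The specific expansion rules are not listed here, so the following is only a hypothetical example of what a rule-based graph modification can look like. It assumes a rule that bypasses prepositions, adding a direct edge from a head to the object of its preposition. The function name, the relation labels and the use of word forms as node identifiers are all illustrative assumptions.

```python
def bypass_prepositions(edges, prepositions):
    """Add direct head -> object edges that skip an intervening preposition.

    For every path head -[rel]-> prep -[*]-> obj, a new edge
    (rel + "_" + prep, head, obj) is created. Original edges are kept.
    Hypothetical rule, used here only to illustrate graph expansion.
    """
    new_edges = []
    for rel, head, middle in edges:
        if middle not in prepositions:
            continue
        for _, inner_head, obj in edges:
            if inner_head == middle:
                new_edges.append((f"{rel}_{middle}", head, obj))
    return list(edges) + new_edges


# Parser edges for part of the Figure 1 sentence; labels are illustrative.
parser_edges = [
    ("ncsubj", "meets", "PM"),
    ("iobj", "meets", "with"),
    ("dobj", "with", "members"),
]
expanded = bypass_prepositions(parser_edges, {"with"})
```

The new edge links "meets" directly to "members", which gives a bilexical feature between two content words instead of two features that each involve the preposition.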
Smoothing edge scores
Most successful edge scoring methods rely on correctly estimating the probability of seeing a specific dependency triple (the label, head and dependent). Even in a large background corpus these triples will be very sparse, and it can be useful to find approximate methods for estimating the edge scores.
We use a directed distributional similarity measure to automatically generate candidate substitutes for every word in the relation. These substitutes are then also used to calculate edge confidence scores, and the results are averaged for a more robust estimate.
For example, if (dobj, read, publication) is infrequent in the data, the system might predict that book is a reasonable substitute for publication and use (dobj, read, book) to estimate the original probability.
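A minimal sketch of this smoothing step is shown below. It assumes that candidate substitutes have already been produced by some directed similarity measure and are passed in as a plain dictionary; the function name and the unweighted average over the original triple and its substituted variants are assumptions made for illustration.

```python
def smoothed_edge_score(edge, edge_score, substitutes):
    """Average the edge score over the original triple and its substitutes.

    edge_score: callable mapping a (relation, head, dependent) triple
        to a confidence score.
    substitutes: dict mapping a word to a list of candidate replacement
        words (assumed output of a directed similarity measure).
    """
    rel, head, dependent = edge
    variants = [edge]
    # Replace one word at a time, keeping the relation label fixed.
    variants += [(rel, h, dependent) for h in substitutes.get(head, [])]
    variants += [(rel, head, d) for d in substitutes.get(dependent, [])]
    return sum(edge_score(v) for v in variants) / len(variants)


# Toy scores and substitutes, invented for illustration only.
raw_scores = {
    ("dobj", "read", "publication"): 0.0,
    ("dobj", "read", "book"): 0.8,
}
smoothed = smoothed_edge_score(
    ("dobj", "read", "publication"),
    lambda e: raw_scores.get(e, 0.0),
    {"publication": ["book"]},
)
```

In this toy setting the unseen triple (dobj, read, publication) receives a non-zero score because the frequent triple (dobj, read, book) contributes to the average.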
For more details, please refer to the following publications:
- Marek Rei and Ted Briscoe (2013). Parser lexicalisation through self-learning. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT 2013). Atlanta, United States, 2013.
- Marek Rei (2013). Minimally supervised dependency-based methods for natural language processing. PhD thesis. University of Cambridge, United Kingdom, 2013.
The system is currently implemented as a prototype for running specific experiments, not as an easy-to-use system for improving general parsing. However, we make the code publicly available as part of the SemSim project: http://www.marekrei.com/projects/semsim/
Use src/sem/apps/parsererank/ParseRerank.java as the main starting point for rerunning our parse reranking experiments.