Project proposal for the 2016/2017 MPhil course in Advanced Computer Science.
Note: this project has already been completed.

Domain Adaptation for Neural Named Entity Recognition

Proposer: Marek Rei
Supervisors: Marek Rei, Felix Sanchez-Garcia
Special Resources: Guardian NER Dataset

Description

Named Entity Recognition (NER) is the task of tagging entities of specific types in text (person, organisation, location, etc.). For example:
PER  _       _    _       _   ORG   ORG    _   TIME  _
Jim  bought  300  shares  of  Acme  Corp.  in  2006  .
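As a minimal sketch of how such token-level tags map back to entities, the Python snippet below groups consecutive tokens that share the same tag. The function name `extract_entities` and the use of `_` as the "no entity" tag are assumptions made for illustration. The scheme is deliberately simpler than BIO-style tagging: two adjacent entities of the same type would be merged into one.

```python
def extract_entities(tokens, tags, outside="_"):
    """Group consecutive tokens with the same non-outside tag into entities.

    Hypothetical helper: returns a list of (tag, text) pairs.
    """
    entities = []
    current_tokens = []
    current_tag = outside
    for token, tag in zip(tokens, tags):
        if tag != current_tag:
            # The tag changed, so close the entity that was open (if any).
            if current_tag != outside:
                entities.append((current_tag, " ".join(current_tokens)))
            current_tokens = []
            current_tag = tag
        if tag != outside:
            current_tokens.append(token)
    # Close an entity that runs to the end of the sentence.
    if current_tag != outside:
        entities.append((current_tag, " ".join(current_tokens)))
    return entities


tokens = "Jim bought 300 shares of Acme Corp. in 2006 .".split()
tags = "PER _ _ _ _ ORG ORG _ TIME _".split()
entities = extract_entities(tokens, tags)
print(entities)
# [('PER', 'Jim'), ('ORG', 'Acme Corp.'), ('TIME', '2006')]
```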

NER models commonly learn from annotated data which words and grammatical structures indicate named entities. However, models trained on data from one domain or genre often perform very poorly when applied to novel types of input. While there is existing work on domain adaptation that uses a small training set in the target domain, we will investigate strategies for improving the performance of a neural NER system in a novel domain, both without using additional annotation and by using active learning.

Rei and Yannakoudakis (2016) describe a neural network for sequence tagging, based on a bidirectional recurrent neural network, which only requires the word sequence as input. Rei et al. (2016) extend the model by incorporating a character-level neural component which takes sub-word information into account. This project aims to extend the neural network architecture in order to make it more applicable to unseen domains and datasets. Additional features, such as word shape, capitalisation and part-of-speech (POS) tags, can be integrated with the neural model, and a self-learning framework can be used to learn from unlabeled data. Evaluation will be performed by training on the CoNLL-03 dataset and testing on the financial NER dataset provided by Alvarado et al. (2015).
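To make the word shape and capitalisation features concrete, here is a small Python sketch of one common way to compute them. The function names, the `X`/`x`/`d` shape alphabet and the capitalisation categories are assumptions chosen for illustration, not the feature set of the models cited above. Such features abstract away from the word identity, which is why they may transfer better to vocabulary not seen in the training domain.

```python
import re


def word_shape(word):
    """Map each character to a class: X (upper), x (lower), d (digit)."""
    shape = []
    for ch in word:
        if ch.isupper():
            shape.append("X")
        elif ch.islower():
            shape.append("x")
        elif ch.isdigit():
            shape.append("d")
        else:
            shape.append(ch)  # keep punctuation and other symbols as-is
    return "".join(shape)


def collapsed_shape(word):
    """Word shape with runs of the same class collapsed to one symbol."""
    return re.sub(r"(.)\1+", r"\1", word_shape(word))


def capitalisation(word):
    """Coarse capitalisation category (hypothetical label set)."""
    if word.isupper():
        return "all_caps"
    if word[:1].isupper():
        return "initial_cap"
    if any(ch.isupper() for ch in word):
        return "mixed"
    if word.islower():
        return "lower"
    return "no_letters"


for token in ["Jim", "Corp.", "2006", "ACME", "bought"]:
    print(token, word_shape(token), collapsed_shape(token), capitalisation(token))
```

In a neural tagger, each of these categorical values would typically be mapped to a small embedding or one-hot vector and concatenated with the word representation. That integration step is not shown here.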

The project is proposed in collaboration with The Guardian, who are interested in adapting traditional named entity recognition to work better on datasets such as the Panama Papers, the Snowden files or Hillary Clinton's released emails. Felix Sanchez-Garcia, a data science researcher at The Guardian, will be co-supervising the project, and a small annotated dataset will be provided by The Guardian for additional evaluation.

Aims of the project:

  1. investigate the performance of neural sequence tagging models on out-of-domain NER
  2. extend the existing model with features that are more likely to generalise to unseen domains
  3. implement a framework for self-learning or active learning, allowing the model to take advantage of unlabeled data
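
The third aim can be illustrated with a rough Python sketch of how one round of self-learning and active learning might divide up unlabeled sentences. Everything here is an assumption made for illustration: the function names, the confidence measure (the lowest per-token maximum probability in a sentence), the threshold and the annotation budget. The tagger itself is not shown; its output is represented as one tag-to-probability dictionary per token, and every sentence is assumed to be non-empty.

```python
def sentence_confidence(token_probs):
    """Confidence of a sentence: its least confident token prediction."""
    return min(max(dist.values()) for dist in token_probs)


def predicted_tags(token_probs):
    """Most probable tag for each token."""
    return [max(dist, key=dist.get) for dist in token_probs]


def split_unlabeled(predictions, threshold=0.9, budget=1):
    """Split model predictions on unlabeled sentences into two groups.

    predictions: list of (tokens, token_probs) pairs.
    Returns (pseudo_labeled, to_annotate):
      - pseudo_labeled: confident sentences with their predicted tags,
        to be added to the training data (self-learning);
      - to_annotate: the `budget` least confident sentences, to be sent
        to a human annotator (active learning).
    """
    scored = [(sentence_confidence(p), tokens, p) for tokens, p in predictions]
    pseudo_labeled = [
        (tokens, predicted_tags(p)) for conf, tokens, p in scored if conf >= threshold
    ]
    uncertain = sorted(
        (item for item in scored if item[0] < threshold), key=lambda item: item[0]
    )
    to_annotate = [tokens for _, tokens, _ in uncertain[:budget]]
    return pseudo_labeled, to_annotate


# Made-up probabilities standing in for the output of a trained tagger.
predictions = [
    (["Acme", "rose"], [{"ORG": 0.97, "_": 0.03}, {"_": 0.99, "ORG": 0.01}]),
    (["Panama", "papers"], [{"LOC": 0.55, "ORG": 0.40, "_": 0.05}, {"_": 0.6, "ORG": 0.4}]),
    (["Jim", "left"], [{"PER": 0.8, "_": 0.2}, {"_": 0.95, "PER": 0.05}]),
]
pseudo_labeled, to_annotate = split_unlabeled(predictions)
print(pseudo_labeled)  # [(['Acme', 'rose'], ['ORG', '_'])]
print(to_annotate)     # [['Panama', 'papers']]
```

In a full loop, the model would be retrained after each round on the original data plus the pseudo-labeled and newly annotated sentences, and the split would be repeated on the remaining unlabeled text.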

Datasets

References:

Compositional Sequence Labeling Models for Error Detection in Learner Writing
Marek Rei and Helen Yannakoudakis. 2016.

Attending to characters in neural sequence labeling models
Marek Rei, Sampo Pyysalo and Gamal K.O. Crichton. 2016.

Domain Adaption of Named Entity Recognition to Support Credit Risk Assessment
Julio Cesar Salinas Alvarado, Karin Verspoor, Timothy Baldwin. 2015.

Multi-Criteria-based Active Learning for Named Entity Recognition
Dan Shen, Jie Zhang, Jian Su, Guodong Zhou, Chew-Lim Tan. 2004.

Domain Adaptation for Sequence Labeling Tasks with a Probabilistic Language Adaptation Model
Min Xiao, Yuhong Guo. 2013.

A study of active learning methods for named entity recognition in clinical text
Yukun Chen, Thomas A. Lasko, Qiaozhu Mei, Joshua C. Denny, Hua Xu. 2015.

Domain Adaptation for Named Entity Recognition Using CRFs
Tian Tian, Marco Dinarelli, Isabelle Tellier and Pedro Dias Cardoso. 2016.