{"id":789,"date":"2017-01-06T23:09:50","date_gmt":"2017-01-06T23:09:50","guid":{"rendered":"http:\/\/www.marekrei.com\/blog\/?p=789"},"modified":"2019-09-27T23:23:42","modified_gmt":"2019-09-27T23:23:42","slug":"attending-to-characters-in-neural-sequence-labeling-models","status":"publish","type":"post","link":"https:\/\/www.marekrei.com\/blog\/attending-to-characters-in-neural-sequence-labeling-models\/","title":{"rendered":"Attending to characters in neural sequence labeling models"},"content":{"rendered":"<p>Word embeddings are great. They allow us to represent words as distributed vectors, such that semantically and functionally similar words have similar representations. Having similar vectors means these words also behave similarly in the model, which is what we want for good generalisation properties.<\/p>\n<p>However, word embeddings have a couple of weaknesses:<\/p>\n<ol>\n<li>If a word doesn&#8217;t exist in the training data, we can&#8217;t have\u00a0an embedding for it. Therefore,\u00a0the best we can do is clump all unseen words together under a single OOV (out-of-vocabulary) token.<\/li>\n<li>If a word only occurs a couple of times, the word embedding likely has very poor quality. We simply don&#8217;t have enough information to learn how these words behave in different contexts.<\/li>\n<li>We can&#8217;t properly take advantage of character-level patterns. For example, there is no way to learn that all words ending with\u00a0<em>-ing<\/em> are likely to be verbs. The best we can do is learn this for each word separately, but that doesn&#8217;t help when faced with new or rare words.<\/li>\n<\/ol>\n<p>In this post\u00a0I will look at different ways of extending word embeddings with character-level information, in the context of neural sequence labeling models. 
\u00a0You can find more information in the Coling 2016 paper\u00a0&#8220;<a href=\"https:\/\/aclweb.org\/anthology\/C\/C16\/C16-1030.pdf\">Attending to characters in neural sequence labeling models<\/a>&#8220;.<\/p>\n<h2>Sequence labeling<\/h2>\n<p>We&#8217;ll investigate word representations in order to improve on the task of sequence labeling. In a sequence labeling setting, a system gets a series of tokens as input and it needs to assign a label to every token. The correct\u00a0label typically depends on both the context and the token itself. Quite a large number of NLP tasks can be formulated as sequence labeling, for example:<\/p>\n<p><em><strong>POS-tagging<\/strong><\/em><br \/>\n<code><span style=\"color: #3366ff;\">DT<\/span> \u00a0<span style=\"color: #008000;\">NN<\/span> \u00a0 \u00a0<span style=\"color: #ff6600;\">VBD<\/span> \u00a0 \u00a0 \u00a0<span style=\"color: #008000;\">NNS<\/span> \u00a0 \u00a0IN \u00a0 \u00a0 \u00a0<span style=\"color: #3366ff;\">DT<\/span> \u00a0 <span style=\"color: #3366ff;\">DT<\/span> \u00a0<span style=\"color: #008000;\">NN<\/span> \u00a0 \u00a0 CC \u00a0<span style=\"color: #3366ff;\">DT<\/span> \u00a0<span style=\"color: #008000;\">NN<\/span> \u00a0 .<br \/>\nThe pound extended losses against both the dollar and the euro .<\/code><\/p>\n<p><em><strong>Error detection<\/strong><\/em><br \/>\n<code>+ + \u00a0 \u00a0+ \u00a0<span style=\"color: #ff0000;\">x<\/span>\u00a0 \u00a0 \u00a0 \u00a0+ \u00a0 + \u00a0 \u00a0 \u00a0+ \u00a0 + \u00a0 \u00a0+ \u00a0 \u00a0<span style=\"color: #ff0000;\">x<\/span>\u00a0 \u00a0 \u00a0 +<br \/>\nI like to playing the guitar and sing very louder .<\/code><\/p>\n<p><em><strong>Named entity recognition<\/strong><\/em><br \/>\n<code><span style=\"color: #ff6600;\">PER<\/span> _ \u00a0 \u00a0 \u00a0_ \u00a0 _ \u00a0 \u00a0 \u00a0_ \u00a0<span style=\"color: #3366ff;\">ORG<\/span> \u00a0<span style=\"color: #3366ff;\">ORG<\/span> \u00a0 _ \u00a0<span style=\"color: #008000;\">TIME<\/span> _<br 
\/>\nJim bought 300 shares of Acme Corp. in 2006 .<\/code><\/p>\n<p><em><strong>Chunking<\/strong><\/em><br \/>\n<code><span style=\"color: #008000;\">B-NP<\/span> \u00a0 \u00a0B-PP <span style=\"color: #008000;\">B-NP<\/span> <span style=\"color: #008000;\">I-NP<\/span> <span style=\"color: #ff6600;\">B-VP<\/span> <span style=\"color: #ff6600;\">I-VP<\/span> \u00a0 \u00a0 <span style=\"color: #ff6600;\">I-VP<\/span> <span style=\"color: #ff6600;\">I-VP<\/span> \u00a0 B-PP <span style=\"color: #008000;\">B-NP<\/span> <span style=\"color: #008000;\">B-NP<\/span> \u00a0O<br \/>\nService on \u00a0 the \u00a0line is \u00a0 expected to \u00a0 resume by \u00a0 noon today .<\/code><\/p>\n<p>In each of these cases, the model needs to understand how a word is being used in a specific context, and could also take advantage of character-level patterns and morphology.<\/p>\n<p><!--more--><\/p>\n<h2>Basic neural sequence labeling<\/h2>\n<p>Our baseline model for sequence labeling is as follows. Each word is represented as a 300-dimensional word embedding. This is passed through a bidirectional LSTM with hidden layers of size 200. The representations from both directions are concatenated, in order to get a word representation that is conditioned on the whole sentence. 
Next, we pass it through a 50-dimensional hidden layer and then an output layer, which can be a softmax or a CRF.<\/p>\n<p><a href=\"https:\/\/www.marekrei.com\/blog\/wp-content\/uploads\/2016\/12\/baseline_graph.png\" rel=\"attachment wp-att-790\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter wp-image-790 size-medium\" src=\"https:\/\/www.marekrei.com\/blog\/wp-content\/uploads\/2016\/12\/baseline_graph-300x153.png\" alt=\"\" width=\"300\" height=\"153\" srcset=\"https:\/\/www.marekrei.com\/blog\/wp-content\/uploads\/2016\/12\/baseline_graph-300x153.png 300w, https:\/\/www.marekrei.com\/blog\/wp-content\/uploads\/2016\/12\/baseline_graph-150x76.png 150w, https:\/\/www.marekrei.com\/blog\/wp-content\/uploads\/2016\/12\/baseline_graph-768x391.png 768w, https:\/\/www.marekrei.com\/blog\/wp-content\/uploads\/2016\/12\/baseline_graph-1024x522.png 1024w, https:\/\/www.marekrei.com\/blog\/wp-content\/uploads\/2016\/12\/baseline_graph.png 1203w\" sizes=\"auto, (max-width: 300px) 100vw, 300px\" \/><\/a><\/p>\n<p>This configuration is based on a combination of my previous work on\u00a0error detection (<a href=\"http:\/\/aclweb.org\/anthology\/P\/P16\/P16-1112.pdf\">Rei and Yannakoudakis, 2016<\/a>), and the models from <a href=\"https:\/\/aclweb.org\/anthology\/D\/D14\/D14-1080.pdf\">Irsoy and Cardie (2014)<\/a> and\u00a0<a href=\"https:\/\/arxiv.org\/abs\/1603.01360\">Lample et al. (2016)<\/a>.<\/p>\n<h2>Concatenating character-based word representations<\/h2>\n<p>Now let&#8217;s look at an architecture that builds word representations from individual characters. We process each word separately and map characters to character embeddings. Next, these are passed through a bidirectional LSTM and the last states from either direction are concatenated. 
The resulting vector is passed through another feedforward layer,\u00a0in order to map it to a suitable space and change the vector size as needed.\u00a0We then have a word representation <strong>m<\/strong>, built from individual characters.<\/p>\n<p><a href=\"https:\/\/www.marekrei.com\/blog\/wp-content\/uploads\/2016\/12\/concat_graph.png\" rel=\"attachment wp-att-791\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-medium wp-image-791\" src=\"https:\/\/www.marekrei.com\/blog\/wp-content\/uploads\/2016\/12\/concat_graph-300x245.png\" alt=\"concat_graph\" width=\"300\" height=\"245\" srcset=\"https:\/\/www.marekrei.com\/blog\/wp-content\/uploads\/2016\/12\/concat_graph-300x245.png 300w, https:\/\/www.marekrei.com\/blog\/wp-content\/uploads\/2016\/12\/concat_graph-150x122.png 150w, https:\/\/www.marekrei.com\/blog\/wp-content\/uploads\/2016\/12\/concat_graph-768x627.png 768w, https:\/\/www.marekrei.com\/blog\/wp-content\/uploads\/2016\/12\/concat_graph.png 796w\" sizes=\"auto, (max-width: 300px) 100vw, 300px\" \/><\/a><\/p>\n<p>We still have a normal word embedding <strong>x<\/strong> for each word, and\u00a0in order to get the best of both worlds, we can combine these two representations. Following <a href=\"https:\/\/arxiv.org\/abs\/1603.01360\">Lample et al. (2016)<\/a>, one method is simply concatenating the\u00a0character-based representation with the word embedding.<\/p>\n<p style=\"text-align: center;\">\\(<br \/>\n\\widetilde{x} = [x; m]<br \/>\n\\)<\/p>\n<p style=\"text-align: left;\">The resulting vector can then be used in the word-level sequence labeling model, instead of the regular word embedding. The whole network is connected together, so that the character-based component is also optimised during training.<\/p>\n<h2>Attending to\u00a0character-based representations<\/h2>\n<p>Concatenating the two representations works, but we can do even better. 
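<\/p>
<p>As a minimal sketch of the concatenation step described above (NumPy; the 100-dimensional character-based vector is a made-up size, while the 300-dimensional word embedding matches the baseline):<\/p>

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=300)  # word embedding (300-dim, as in the baseline)
m = rng.normal(size=100)  # character-based representation (assumed size)

# Concatenation: the combined vector is simply [x; m]
x_tilde = np.concatenate([x, m])
assert x_tilde.shape == (400,)
```

<p>Because the combined vector is longer than the word embedding alone, the weight matrices in the following word-level LSTM grow accordingly.<\/p>
<p>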
We start off\u00a0the same &#8211; character embeddings are passed through a bidirectional LSTM to build a word representation <strong>m<\/strong>. However, instead of\u00a0concatenating this vector with the word embedding, we now combine them using dynamically predicted weights.<\/p>\n<p><a href=\"https:\/\/www.marekrei.com\/blog\/wp-content\/uploads\/2016\/12\/attention_graph.png\" rel=\"attachment wp-att-792\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-medium wp-image-792\" src=\"https:\/\/www.marekrei.com\/blog\/wp-content\/uploads\/2016\/12\/attention_graph-300x238.png\" alt=\"attention_graph\" width=\"300\" height=\"238\" srcset=\"https:\/\/www.marekrei.com\/blog\/wp-content\/uploads\/2016\/12\/attention_graph-300x238.png 300w, https:\/\/www.marekrei.com\/blog\/wp-content\/uploads\/2016\/12\/attention_graph-150x119.png 150w, https:\/\/www.marekrei.com\/blog\/wp-content\/uploads\/2016\/12\/attention_graph-768x611.png 768w, https:\/\/www.marekrei.com\/blog\/wp-content\/uploads\/2016\/12\/attention_graph.png 834w\" sizes=\"auto, (max-width: 300px) 100vw, 300px\" \/><\/a><\/p>\n<p>A vector of weights <strong>z<\/strong> is predicted by the model, as a function of <strong>x<\/strong> and <strong>m<\/strong>. 
In this case, we use a two-layer feedforward component, with <em>tanh<\/em> activation in the first layer and <em>sigmoid<\/em> on the second layer.<\/p>\n<p style=\"text-align: center;\">\\(<br \/>\nz = \\sigma(W^{(3)}_z \\tanh(W^{(1)}_{z} x + W^{(2)}_{z} m))<br \/>\n\\)<\/p>\n<p style=\"text-align: left;\">Then, we combine\u00a0<strong>x<\/strong> and <strong>m<\/strong> as a weighted sum, using <strong>z<\/strong> as the weights:<\/p>\n<p style=\"text-align: center;\">\\(<br \/>\n\\widetilde{x} = z\\cdot x + (1-z) \\cdot m<br \/>\n\\)<\/p>\n<p style=\"text-align: left;\">This operation essentially looks at both word representations and decides, for each feature, whether it wants to take the value from the word embedding or from the character-based representation. Values close to 1\u00a0in <strong>z<\/strong> indicate higher weight for the word embedding, and values close to 0 assign more importance to the character-based vector.<\/p>\n<p style=\"text-align: left;\">This combination requires that the two vectors are aligned &#8211; each feature position\u00a0in the character-based representation needs to capture the same properties as that position in the word embedding. In order to encourage this property, we actively optimise for these vectors to be similar, by maximising their cosine similarity:<\/p>\n<p style=\"text-align: center;\">\\(<br \/>\n\\widetilde{E} = E + \\sum_{t=1}^{T} g_t (1 - \\cos(m^{(t)}, x_t)) \\hspace{3em}<br \/>\ng_t =<br \/>\n\\begin{cases}<br \/>\n0, &amp; \\text{if}\\ w_t = OOV \\\\<br \/>\n1, &amp; \\text{otherwise}<br \/>\n\\end{cases}<br \/>\n\\)<\/p>\n<p style=\"text-align: left;\">E is the main sequence labeling loss function that we minimise during training, and T is the length of the sequence or sentence. 
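<\/p>
<p>The gating computation above can be sketched as follows (NumPy; the dimensions and random weight matrices are stand-in assumptions, not trained values):<\/p>

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Assumed sizes: 300-dim word vectors, 50-dim hidden layer in the gate;
# the small random weights stand in for the trained parameters.
d, h = 300, 50
rng = np.random.default_rng(0)
W1 = 0.1 * rng.normal(size=(h, d))
W2 = 0.1 * rng.normal(size=(h, d))
W3 = 0.1 * rng.normal(size=(d, h))

x = rng.normal(size=d)  # word embedding
m = rng.normal(size=d)  # character-based representation

z = sigmoid(W3 @ np.tanh(W1 @ x + W2 @ m))  # per-feature gate in (0, 1)
x_tilde = z * x + (1 - z) * m               # combined vector, still d-dimensional
assert x_tilde.shape == (d,)
```

<p>Note that, unlike concatenation, the combined vector keeps the original dimensionality.<\/p>
<p>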
Many OOV words share the same representation, and we do not want to optimise for this, therefore we use a variable g that limits\u00a0optimisation only\u00a0to non-OOV words.<\/p>\n<p style=\"text-align: left;\">In this setting, the model essentially learns two alternative representations for each word &#8211; a regular word embedding and a character-based representation. The word embedding itself is kind of a universal memory &#8211; we assign 300 elements to a word, and the model is free to save any information into it, including approximations of character-level information. For frequent words, there is really no reason to think that the character-based representation can offer much additional benefit. However, using word embeddings to save information is very inefficient &#8211; each feature needs to be learned and saved for every word separately. Therefore, we hope to get two benefits from including characters into the model:<\/p>\n<ol>\n<li style=\"text-align: left;\">Previously unseen (OOV) words and infrequent words with low-quality embeddings can get extra information from character features and morphemes.<\/li>\n<li style=\"text-align: left;\">The character-based component\u00a0can act as a highly-generalised model of typical character-level patterns, allowing the word embeddings to act as a memory\u00a0for storing\u00a0exceptions to these patterns for each specific word.<\/li>\n<\/ol>\n<p>While we optimise for the cosine similarity of <strong>m<\/strong> and <strong>x<\/strong> to be high, we are essentially teaching the model to predict distributional properties based only on character-level patterns and morphology. 
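<\/p>
<p>A toy illustration of this auxiliary objective (NumPy; the sentence length, vectors and loss value are made up):<\/p>

```python
import numpy as np

def cos_sim(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy sentence of T=3 tokens; g_t is 0 for OOV tokens and 1 otherwise.
rng = np.random.default_rng(1)
xs = [rng.normal(size=300) for _ in range(3)]  # word embeddings x_t
ms = [rng.normal(size=300) for _ in range(3)]  # character-based vectors m_t
g = [1, 0, 1]  # the second token is OOV, so it is excluded from the penalty

E = 0.42  # stand-in value for the main sequence labeling loss
E_tilde = E + sum(gt * (1 - cos_sim(m, x)) for gt, m, x in zip(g, ms, xs))
assert E_tilde >= E  # each penalty term is non-negative
```

<p>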
However, while\u00a0<strong>m<\/strong> is optimised to be similar to <strong>x<\/strong>, we implement it in such a way that <strong>x<\/strong> is not optimised to be similar to <strong>m<\/strong> (using\u00a0<a href=\"http:\/\/deeplearning.net\/software\/theano\/library\/gradient.html#theano.gradient.disconnected_grad\">disconnected_grad<\/a> in Theano). Because word embeddings are more flexible,\u00a0we want them to store exceptions as opposed to learning more general patterns.<\/p>\n<p>The resulting combined word representation is again plugged directly into the sequence labeling model. All the components, including the attention component for dynamically calculating <strong>z<\/strong>, are\u00a0optimised at the same time.<\/p>\n<h2>Evaluation<\/h2>\n<p>We evaluated the alternative architectures on 8 different datasets, covering 4 different tasks: NER, POS-tagging, error detection and chunking. See the paper for more detailed results, but here is a summary:<\/p>\n<p>[table]<br \/>\nDataset,Task,#labels,Measure,Word-based,Char concat,Char attn<br \/>\nCoNLL00,chunking,22,F1,91.23,92.35,<strong>92.67<\/strong><br \/>\nCoNLL03,NER,8,F1,79.86,83.37,<strong>84.09<\/strong><br \/>\nPTB-POS,POS-tagging,48,accuracy, 96.42 ,97.22,<strong>97.27<\/strong><br \/>\nFCEPUBLIC,error detection,2,F0.5,41.24,41.27,<strong>41.88<\/strong><br \/>\nBC2GM,NER,3,F1*,84.21,87.75,<strong>87.99<\/strong><br \/>\nCHEMDNER,NER,3,F1,79.74,83.56,<strong>84.53<\/strong><br \/>\nJNLPBA,NER,11,F1,70.75,72.24,<strong>72.70<\/strong><br \/>\nGENIA-POS,POS-tagging,42,accuracy,97.39,98.49,<strong>98.60<\/strong><br \/>\n[\/table]<\/p>\n<p>As can be seen, including a character-based component into the sequence labeling model helps on every task. In addition, using the attention-based architecture outperforms character concatenation on all benchmarks.<\/p>\n<p>We also compared\u00a0the number of parameters in each model. 
While both character-based models require more parameters compared to a basic architecture using only word embeddings, the attention-based architecture is actually more efficient than concatenation. If the vectors are simply concatenated, this increases the size of all the weight matrices in the subsequent LSTMs, whereas the attention framework\u00a0combines them without increasing the vector length.<\/p>\n<p><a href=\"https:\/\/arxiv.org\/abs\/1606.01700\">Miyamoto and Cho (2016)<\/a> have independently also proposed a similar architecture, with some\u00a0differences: 1) they focus on\u00a0the task of language modeling, 2) they predict a scalar weight for combining the representations as opposed to making the decision separately for each element, 3) they do not condition the weights on the character-based representation, and 4) they do not have the component that optimises character-based representations to be similar to the word embeddings.<\/p>\n<p>Since this model is aimed at learning morphological patterns, you might ask why we are not giving\u00a0actual morphemes as input to the model, instead of starting from individual characters. The reason is that the definition of an informative morpheme is likely to change between tasks and datasets, and starting from characters allows the model to learn exactly what it finds most useful. The model for POS tagging can learn to detect specific suffixes, and the model for NER can focus more on capitalisation patterns.<\/p>\n<h2>Conclusion<\/h2>\n<p>Combining word embeddings with character-based representations makes neural models more powerful and allows us to have better representations for infrequent or unseen words. One option is to concatenate the two representations, treating them as separate sets of useful features. 
Alternatively, we can optimise them to be similar and combine them using a gating\u00a0mechanism, essentially allowing the model to choose whether it wants to take each feature from the word embedding or from the character-based representation. We evaluated on 8 different sequence labeling datasets and found that the latter option performed consistently better, even with fewer parameters.<\/p>\n<p>See the paper for more details:<br \/>\n<a href=\"https:\/\/aclweb.org\/anthology\/C\/C16\/C16-1030.pdf\">https:\/\/aclweb.org\/anthology\/C\/C16\/C16-1030.pdf<\/a><\/p>\n<p>I\u00a0have made the code for running these experiments publicly available on GitHub:<br \/>\n<a href=\"https:\/\/github.com\/marekrei\/sequence-labeler\">https:\/\/github.com\/marekrei\/sequence-labeler<\/a><\/p>\n<p>Also, the dataset for performing error detection as a sequence labeling task\u00a0is now available online:<br \/>\n<a href=\"http:\/\/ilexir.co.uk\/datasets\/index.html\">http:\/\/ilexir.co.uk\/datasets\/index.html<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Word embeddings are great. They allow us to represent words as distributed vectors, such that semantically and functionally similar words have similar representations. 
Having similar&hellip;<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[],"class_list":["post-789","post","type-post","status-publish","format-standard","hentry","category-uncategorized"],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v23.7 - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>Attending to characters in neural sequence labeling models - Marek Rei<\/title>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/www.marekrei.com\/blog\/attending-to-characters-in-neural-sequence-labeling-models\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Attending to characters in neural sequence labeling models - Marek Rei\" \/>\n<meta property=\"og:description\" content=\"Word embeddings are great. They allow us to represent words as distributed vectors, such that semantically and functionally similar words have similar representations. Having similar&hellip;\" \/>\n<meta property=\"og:url\" content=\"https:\/\/www.marekrei.com\/blog\/attending-to-characters-in-neural-sequence-labeling-models\/\" \/>\n<meta property=\"og:site_name\" content=\"Marek Rei\" \/>\n<meta property=\"article:published_time\" content=\"2017-01-06T23:09:50+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2019-09-27T23:23:42+00:00\" \/>\n<meta property=\"og:image\" content=\"http:\/\/www.marekrei.com\/blog\/wp-content\/uploads\/2016\/12\/baseline_graph-300x153.png\" \/>\n<meta name=\"author\" content=\"Marek\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"Marek\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. 
reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"9 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"WebPage\",\"@id\":\"https:\/\/www.marekrei.com\/blog\/attending-to-characters-in-neural-sequence-labeling-models\/\",\"url\":\"https:\/\/www.marekrei.com\/blog\/attending-to-characters-in-neural-sequence-labeling-models\/\",\"name\":\"Attending to characters in neural sequence labeling models - Marek Rei\",\"isPartOf\":{\"@id\":\"https:\/\/www.marekrei.com\/blog\/#website\"},\"primaryImageOfPage\":{\"@id\":\"https:\/\/www.marekrei.com\/blog\/attending-to-characters-in-neural-sequence-labeling-models\/#primaryimage\"},\"image\":{\"@id\":\"https:\/\/www.marekrei.com\/blog\/attending-to-characters-in-neural-sequence-labeling-models\/#primaryimage\"},\"thumbnailUrl\":\"http:\/\/www.marekrei.com\/blog\/wp-content\/uploads\/2016\/12\/baseline_graph-300x153.png\",\"datePublished\":\"2017-01-06T23:09:50+00:00\",\"dateModified\":\"2019-09-27T23:23:42+00:00\",\"author\":{\"@id\":\"https:\/\/www.marekrei.com\/blog\/#\/schema\/person\/a145eb0a06ed4acf5b0f84a24b7a1191\"},\"breadcrumb\":{\"@id\":\"https:\/\/www.marekrei.com\/blog\/attending-to-characters-in-neural-sequence-labeling-models\/#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/www.marekrei.com\/blog\/attending-to-characters-in-neural-sequence-labeling-models\/\"]}]},{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/www.marekrei.com\/blog\/attending-to-characters-in-neural-sequence-labeling-models\/#primaryimage\",\"url\":\"http:\/\/www.marekrei.com\/blog\/wp-content\/uploads\/2016\/12\/baseline_graph-300x153.png\",\"contentUrl\":\"http:\/\/www.marekrei.com\/blog\/wp-content\/uploads\/2016\/12\/baseline_graph-300x153.png\"},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/www.marekrei.com\/blog\/attending-to-characters-in-neural-se
quence-labeling-models\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\/\/www.marekrei.com\/blog\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"Attending to characters in neural sequence labeling models\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/www.marekrei.com\/blog\/#website\",\"url\":\"https:\/\/www.marekrei.com\/blog\/\",\"name\":\"Marek Rei\",\"description\":\"Thoughts on Machine Learning and Natural Language Processing\",\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/www.marekrei.com\/blog\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"en-US\"},{\"@type\":\"Person\",\"@id\":\"https:\/\/www.marekrei.com\/blog\/#\/schema\/person\/a145eb0a06ed4acf5b0f84a24b7a1191\",\"name\":\"Marek\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/www.marekrei.com\/blog\/#\/schema\/person\/image\/\",\"url\":\"https:\/\/secure.gravatar.com\/avatar\/48a65414bfda6485aaa0703e548de0ed25292b5fe0d979ed8c28ad83cf5a82c0?s=96&d=mm&r=g\",\"contentUrl\":\"https:\/\/secure.gravatar.com\/avatar\/48a65414bfda6485aaa0703e548de0ed25292b5fe0d979ed8c28ad83cf5a82c0?s=96&d=mm&r=g\",\"caption\":\"Marek\"},\"url\":\"https:\/\/www.marekrei.com\/blog\/author\/marek\/\"}]}<\/script>\n<!-- \/ Yoast SEO plugin. 
-->","yoast_head_json":{"title":"Attending to characters in neural sequence labeling models - Marek Rei","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/www.marekrei.com\/blog\/attending-to-characters-in-neural-sequence-labeling-models\/","og_locale":"en_US","og_type":"article","og_title":"Attending to characters in neural sequence labeling models - Marek Rei","og_description":"Word embeddings are great. They allow us to represent words as distributed vectors, such that semantically and functionally similar words have similar representations. Having similar&hellip;","og_url":"https:\/\/www.marekrei.com\/blog\/attending-to-characters-in-neural-sequence-labeling-models\/","og_site_name":"Marek Rei","article_published_time":"2017-01-06T23:09:50+00:00","article_modified_time":"2019-09-27T23:23:42+00:00","og_image":[{"url":"http:\/\/www.marekrei.com\/blog\/wp-content\/uploads\/2016\/12\/baseline_graph-300x153.png"}],"author":"Marek","twitter_misc":{"Written by":"Marek","Est. 
reading time":"9 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"WebPage","@id":"https:\/\/www.marekrei.com\/blog\/attending-to-characters-in-neural-sequence-labeling-models\/","url":"https:\/\/www.marekrei.com\/blog\/attending-to-characters-in-neural-sequence-labeling-models\/","name":"Attending to characters in neural sequence labeling models - Marek Rei","isPartOf":{"@id":"https:\/\/www.marekrei.com\/blog\/#website"},"primaryImageOfPage":{"@id":"https:\/\/www.marekrei.com\/blog\/attending-to-characters-in-neural-sequence-labeling-models\/#primaryimage"},"image":{"@id":"https:\/\/www.marekrei.com\/blog\/attending-to-characters-in-neural-sequence-labeling-models\/#primaryimage"},"thumbnailUrl":"http:\/\/www.marekrei.com\/blog\/wp-content\/uploads\/2016\/12\/baseline_graph-300x153.png","datePublished":"2017-01-06T23:09:50+00:00","dateModified":"2019-09-27T23:23:42+00:00","author":{"@id":"https:\/\/www.marekrei.com\/blog\/#\/schema\/person\/a145eb0a06ed4acf5b0f84a24b7a1191"},"breadcrumb":{"@id":"https:\/\/www.marekrei.com\/blog\/attending-to-characters-in-neural-sequence-labeling-models\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/www.marekrei.com\/blog\/attending-to-characters-in-neural-sequence-labeling-models\/"]}]},{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.marekrei.com\/blog\/attending-to-characters-in-neural-sequence-labeling-models\/#primaryimage","url":"http:\/\/www.marekrei.com\/blog\/wp-content\/uploads\/2016\/12\/baseline_graph-300x153.png","contentUrl":"http:\/\/www.marekrei.com\/blog\/wp-content\/uploads\/2016\/12\/baseline_graph-300x153.png"},{"@type":"BreadcrumbList","@id":"https:\/\/www.marekrei.com\/blog\/attending-to-characters-in-neural-sequence-labeling-models\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/www.marekrei.com\/blog\/"},{"@type":"ListItem","position":2,"name":"Attending to characters 
in neural sequence labeling models"}]},{"@type":"WebSite","@id":"https:\/\/www.marekrei.com\/blog\/#website","url":"https:\/\/www.marekrei.com\/blog\/","name":"Marek Rei","description":"Thoughts on Machine Learning and Natural Language Processing","potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/www.marekrei.com\/blog\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Person","@id":"https:\/\/www.marekrei.com\/blog\/#\/schema\/person\/a145eb0a06ed4acf5b0f84a24b7a1191","name":"Marek","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.marekrei.com\/blog\/#\/schema\/person\/image\/","url":"https:\/\/secure.gravatar.com\/avatar\/48a65414bfda6485aaa0703e548de0ed25292b5fe0d979ed8c28ad83cf5a82c0?s=96&d=mm&r=g","contentUrl":"https:\/\/secure.gravatar.com\/avatar\/48a65414bfda6485aaa0703e548de0ed25292b5fe0d979ed8c28ad83cf5a82c0?s=96&d=mm&r=g","caption":"Marek"},"url":"https:\/\/www.marekrei.com\/blog\/author\/marek\/"}]}},"_links":{"self":[{"href":"https:\/\/www.marekrei.com\/blog\/wp-json\/wp\/v2\/posts\/789","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.marekrei.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.marekrei.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.marekrei.com\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.marekrei.com\/blog\/wp-json\/wp\/v2\/comments?post=789"}],"version-history":[{"count":61,"href":"https:\/\/www.marekrei.com\/blog\/wp-json\/wp\/v2\/posts\/789\/revisions"}],"predecessor-version":[{"id":1294,"href":"https:\/\/www.marekrei.com\/blog\/wp-json\/wp\/v2\/posts\/789\/revisions\/1294"}],"wp:attachment":[{"href":"https:\/\/www.marekrei.com\/blog\/wp-json\/wp\/v2\/media?parent=789"}],"wp:term":[{"taxonomy":"category","embeddable":
true,"href":"https:\/\/www.marekrei.com\/blog\/wp-json\/wp\/v2\/categories?post=789"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.marekrei.com\/blog\/wp-json\/wp\/v2\/tags?post=789"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}