{"id":111,"date":"2014-02-08T20:26:11","date_gmt":"2014-02-08T20:26:11","guid":{"rendered":"http:\/\/www.marekrei.com\/blog\/?p=111"},"modified":"2019-09-27T23:40:31","modified_gmt":"2019-09-27T23:40:31","slug":"normalise-feature-vectors","status":"publish","type":"post","link":"https:\/\/www.marekrei.com\/blog\/normalise-feature-vectors\/","title":{"rendered":"How to normalise feature vectors"},"content":{"rendered":"<p>I was trying to create a sample file for training a neural network and ran into a common problem: the feature values are all over the place. In this example I&#8217;m working with real-world demographic values for countries. For example, a feature for GDP per person in a country ranges from 551.27 to 88286.0, whereas estimates for corruption range from -1.56 to 2.42. This can be very confusing for machine learning algorithms, as they can end up treating bigger values as more important signals.<\/p>\n<p>To handle this issue, we want to scale all the feature values into roughly the same range. We can do this by taking each feature value, subtracting the mean of that feature (thereby shifting the mean to 0), and dividing by its standard deviation (thereby scaling the standard deviation to 1). This is a piece of code I&#8217;ve implemented a number of times for various projects, so it&#8217;s time to write a nice reusable script. Hopefully it can be helpful for others as well. I chose to do this in Python, as it&#8217;s easier to run than C++ or Java (it doesn&#8217;t need to be compiled), and has better support for real-valued numbers than bash scripting.<br \/>\n<!--more--><\/p>\n<p>Each line in the input file is assumed to be a feature vector, with values separated by whitespace. The first element is an integer class label that will be left untouched. This is followed by a number of floating point feature values which will be normalised. 
For example:<\/p>\n<pre class=\"brush: plain; title: ; notranslate\" title=\"\">1 0.563 13498174.2 -21.3\r\n0 0.114 42234434.3 15.67<\/pre>\n<p>We&#8217;re assuming dense vectors, meaning that each line has an equal number of features.<\/p>\n<p>To execute it, simply use<\/p>\n<pre class=\"brush: plain; title: ; notranslate\" title=\"\">python feature-normaliser.py &lt; in.txt &gt; out.txt<\/pre>\n<p>The complete script that will normalise feature vectors is here:<\/p>\n<pre class=\"brush: python; title: ; notranslate\" title=\"\">\r\n\r\nimport sys\r\nimport fileinput\r\nimport numpy\r\n\r\n# Read the input column-wise: the first column holds the class labels, the rest hold features\r\ndata = &#x5B;]\r\nlinecount = 0\r\nfor line in fileinput.input():\r\n  if line.strip():\r\n    index = 0\r\n    for value in line.split():\r\n      if linecount == 0:\r\n        data.append(&#x5B;])\r\n      if index == 0:\r\n        data&#x5B;index].append(int(value))\r\n      else:\r\n        data&#x5B;index].append(float(value))\r\n      index+=1\r\n    linecount+=1\r\n\r\n# Compute the mean and standard deviation of each column once, not once per value\r\nmeans = &#x5B;numpy.mean(column) for column in data]\r\nstds = &#x5B;numpy.std(column) for column in data]\r\n\r\nfor row in range(0, linecount):\r\n  for col in range(0, len(data)):\r\n    if col == 0:\r\n      sys.stdout.write(str(data&#x5B;col]&#x5B;row]))\r\n    else:\r\n      # A constant feature has zero deviation, so write 0.0 instead of dividing by zero\r\n      val = (data&#x5B;col]&#x5B;row] - means&#x5B;col])\/stds&#x5B;col] if stds&#x5B;col] &gt; 0 else 0.0\r\n      sys.stdout.write(&quot;\\t&quot; + str(val))\r\n  sys.stdout.write(&quot;\\n&quot;)\r\n\r\n<\/pre>\n","protected":false},"excerpt":{"rendered":"<p>I was trying to create a sample file for training a neural network and ran into a common problem: the feature values are all over&hellip;<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[4],"tags":[],"class_list":["post-111","post","type-post","status-publish","format-standard","hentry","category-resources"],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v23.7 - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>How to normalise feature 
vectors - Marek Rei<\/title>\n<meta name=\"description\" content=\"Normalise feature vectors by subtracting the mean and dividing by standard deviation, using this script.\" \/>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/www.marekrei.com\/blog\/normalise-feature-vectors\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"How to normalise feature vectors - Marek Rei\" \/>\n<meta property=\"og:description\" content=\"Normalise feature vectors by subtracting the mean and dividing by standard deviation, using this script.\" \/>\n<meta property=\"og:url\" content=\"https:\/\/www.marekrei.com\/blog\/normalise-feature-vectors\/\" \/>\n<meta property=\"og:site_name\" content=\"Marek Rei\" \/>\n<meta property=\"article:published_time\" content=\"2014-02-08T20:26:11+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2019-09-27T23:40:31+00:00\" \/>\n<meta name=\"author\" content=\"Marek\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"Marek\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. 
reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"2 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"WebPage\",\"@id\":\"https:\/\/www.marekrei.com\/blog\/normalise-feature-vectors\/\",\"url\":\"https:\/\/www.marekrei.com\/blog\/normalise-feature-vectors\/\",\"name\":\"How to normalise feature vectors - Marek Rei\",\"isPartOf\":{\"@id\":\"https:\/\/www.marekrei.com\/blog\/#website\"},\"datePublished\":\"2014-02-08T20:26:11+00:00\",\"dateModified\":\"2019-09-27T23:40:31+00:00\",\"author\":{\"@id\":\"https:\/\/www.marekrei.com\/blog\/#\/schema\/person\/a145eb0a06ed4acf5b0f84a24b7a1191\"},\"description\":\"Normalise feature vectors by subtracting the mean and dividing by standard deviation, using this script.\",\"breadcrumb\":{\"@id\":\"https:\/\/www.marekrei.com\/blog\/normalise-feature-vectors\/#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/www.marekrei.com\/blog\/normalise-feature-vectors\/\"]}]},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/www.marekrei.com\/blog\/normalise-feature-vectors\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\/\/www.marekrei.com\/blog\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"How to normalise feature vectors\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/www.marekrei.com\/blog\/#website\",\"url\":\"https:\/\/www.marekrei.com\/blog\/\",\"name\":\"Marek Rei\",\"description\":\"Thoughts on Machine Learning and Natural Language 
Processing\",\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/www.marekrei.com\/blog\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"en-US\"},{\"@type\":\"Person\",\"@id\":\"https:\/\/www.marekrei.com\/blog\/#\/schema\/person\/a145eb0a06ed4acf5b0f84a24b7a1191\",\"name\":\"Marek\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/www.marekrei.com\/blog\/#\/schema\/person\/image\/\",\"url\":\"https:\/\/secure.gravatar.com\/avatar\/48a65414bfda6485aaa0703e548de0ed25292b5fe0d979ed8c28ad83cf5a82c0?s=96&d=mm&r=g\",\"contentUrl\":\"https:\/\/secure.gravatar.com\/avatar\/48a65414bfda6485aaa0703e548de0ed25292b5fe0d979ed8c28ad83cf5a82c0?s=96&d=mm&r=g\",\"caption\":\"Marek\"},\"url\":\"https:\/\/www.marekrei.com\/blog\/author\/marek\/\"}]}<\/script>\n<!-- \/ Yoast SEO plugin. -->","yoast_head_json":{"title":"How to normalise feature vectors - Marek Rei","description":"Normalise feature vectors by subtracting the mean and dividing by standard deviation, using this script.","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/www.marekrei.com\/blog\/normalise-feature-vectors\/","og_locale":"en_US","og_type":"article","og_title":"How to normalise feature vectors - Marek Rei","og_description":"Normalise feature vectors by subtracting the mean and dividing by standard deviation, using this script.","og_url":"https:\/\/www.marekrei.com\/blog\/normalise-feature-vectors\/","og_site_name":"Marek Rei","article_published_time":"2014-02-08T20:26:11+00:00","article_modified_time":"2019-09-27T23:40:31+00:00","author":"Marek","twitter_misc":{"Written by":"Marek","Est. 
reading time":"2 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"WebPage","@id":"https:\/\/www.marekrei.com\/blog\/normalise-feature-vectors\/","url":"https:\/\/www.marekrei.com\/blog\/normalise-feature-vectors\/","name":"How to normalise feature vectors - Marek Rei","isPartOf":{"@id":"https:\/\/www.marekrei.com\/blog\/#website"},"datePublished":"2014-02-08T20:26:11+00:00","dateModified":"2019-09-27T23:40:31+00:00","author":{"@id":"https:\/\/www.marekrei.com\/blog\/#\/schema\/person\/a145eb0a06ed4acf5b0f84a24b7a1191"},"description":"Normalise feature vectors by subtracting the mean and dividing by standard deviation, using this script.","breadcrumb":{"@id":"https:\/\/www.marekrei.com\/blog\/normalise-feature-vectors\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/www.marekrei.com\/blog\/normalise-feature-vectors\/"]}]},{"@type":"BreadcrumbList","@id":"https:\/\/www.marekrei.com\/blog\/normalise-feature-vectors\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/www.marekrei.com\/blog\/"},{"@type":"ListItem","position":2,"name":"How to normalise feature vectors"}]},{"@type":"WebSite","@id":"https:\/\/www.marekrei.com\/blog\/#website","url":"https:\/\/www.marekrei.com\/blog\/","name":"Marek Rei","description":"Thoughts on Machine Learning and Natural Language 
Processing","potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/www.marekrei.com\/blog\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Person","@id":"https:\/\/www.marekrei.com\/blog\/#\/schema\/person\/a145eb0a06ed4acf5b0f84a24b7a1191","name":"Marek","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.marekrei.com\/blog\/#\/schema\/person\/image\/","url":"https:\/\/secure.gravatar.com\/avatar\/48a65414bfda6485aaa0703e548de0ed25292b5fe0d979ed8c28ad83cf5a82c0?s=96&d=mm&r=g","contentUrl":"https:\/\/secure.gravatar.com\/avatar\/48a65414bfda6485aaa0703e548de0ed25292b5fe0d979ed8c28ad83cf5a82c0?s=96&d=mm&r=g","caption":"Marek"},"url":"https:\/\/www.marekrei.com\/blog\/author\/marek\/"}]}},"_links":{"self":[{"href":"https:\/\/www.marekrei.com\/blog\/wp-json\/wp\/v2\/posts\/111","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.marekrei.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.marekrei.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.marekrei.com\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.marekrei.com\/blog\/wp-json\/wp\/v2\/comments?post=111"}],"version-history":[{"count":8,"href":"https:\/\/www.marekrei.com\/blog\/wp-json\/wp\/v2\/posts\/111\/revisions"}],"predecessor-version":[{"id":1310,"href":"https:\/\/www.marekrei.com\/blog\/wp-json\/wp\/v2\/posts\/111\/revisions\/1310"}],"wp:attachment":[{"href":"https:\/\/www.marekrei.com\/blog\/wp-json\/wp\/v2\/media?parent=111"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.marekrei.com\/blog\/wp-json\/wp\/v2\/categories?post=111"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.marekrei.com\/blog\/wp-json\/wp\/v2\/tags?post=111"}],"curies":[{"name":"wp","href":"htt
ps:\/\/api.w.org\/{rel}","templated":true}]}}