I was trying to create a sample file for training a neural network and ran into a common problem: the feature values are all over the place. In this example I’m working with real-world demographic values for countries. For example, a feature for GDP per person in a country ranges from 551.27 to 88286.0, whereas estimates for corruption range from -1.56 to 2.42. This can be very confusing for machine learning algorithms, as they can end up treating bigger values as more important signals.
To handle this issue, we want to scale all the feature values into roughly the same range. We can do this by taking each feature value, subtracting the mean of that feature (thereby shifting the mean to 0), and dividing by its standard deviation (thereby scaling the standard deviation to 1). This is a piece of code I’ve implemented a number of times for various projects, so it’s time to write a nice reusable script. Hopefully it can be helpful for others as well. I chose to do this in Python, as it’s easier to run than C++ or Java (it doesn’t need to be compiled) and has better support for real-valued numbers than bash scripting.
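As a minimal sketch of that calculation, the snippet below standardises two tiny features built only from the extreme values quoted above. The function name standardise is my own choice for illustration, and it uses numpy's default population standard deviation. With only two values per feature, both end up at roughly -1 and 1, even though the raw ranges differ by several orders of magnitude.

```python
import numpy

def standardise(values):
    """Shift values to mean 0 and scale them to standard deviation 1."""
    values = numpy.asarray(values, dtype=float)
    mean = values.mean()
    std = values.std()  # population standard deviation (ddof=0), numpy's default
    return (values - mean) / std

# Only the minimum and maximum quoted in the text, not a full dataset.
gdp = standardise([551.27, 88286.0])
corruption = standardise([-1.56, 2.42])

print(gdp)
print(corruption)
```

After scaling, neither feature dominates simply because its raw numbers are larger.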
Each line in the input file is assumed to be a feature vector, with values separated by whitespace. The first element is an integer class label that will be left untouched. This is followed by a number of floating point feature values which will be normalised. For example:
1 0.563 13498174.2 -21.3
0 0.114 42234434.3 15.67
We’re assuming dense vectors, meaning that each line has an equal number of features.
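To make the expected format concrete, here is a small sketch that parses lines of this shape and checks the dense-vector assumption. The helper names parse_line and check_dense are hypothetical and not part of the script; the sample values are the ones from the example above, read as two vectors with three features each.

```python
def parse_line(line):
    """Split one whitespace-separated line into (label, features)."""
    fields = line.split()
    label = int(fields[0])                      # first element: integer class label
    features = [float(v) for v in fields[1:]]   # the rest: real-valued features
    return label, features

def check_dense(rows):
    """Raise ValueError if the feature vectors differ in length."""
    lengths = {len(features) for _, features in rows}
    if len(lengths) > 1:
        raise ValueError("rows have differing feature counts: %s" % sorted(lengths))

lines = ["1 0.563 13498174.2 -21.3", "0 0.114 42234434.3 15.67"]
rows = [parse_line(line) for line in lines if line.strip()]
check_dense(rows)
print(rows)
```

A check like this fails early on ragged input instead of producing misaligned columns later.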
To execute it, simply use:
python feature-normaliser.py < in.txt > out.txt
The complete script that will normalise feature vectors is here:
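import sys
import fileinput
import numpy

# data[col] holds every value of one column; column 0 is the class label.
data = []
linecount = 0
for line in fileinput.input():
    if line.strip():  # skip blank lines
        index = 0
        for value in line.split():
            if linecount == 0:
                data.append([])  # first line: create one list per column
            if index == 0:
                data[index].append(int(value))
            else:
                data[index].append(float(value))
            index += 1
        linecount += 1

# Compute each feature column's mean and standard deviation once.
means = [numpy.mean(column) for column in data[1:]]
stds = [numpy.std(column) for column in data[1:]]

for row in range(0, linecount):
    for col in range(0, index):
        if col == 0:
            # The class label is written out unchanged.
            sys.stdout.write(str(data[col][row]))
        else:
            val = (data[col][row] - means[col - 1]) / stds[col - 1]
            sys.stdout.write("\t" + str(val))
    sys.stdout.write("\n")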
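For comparison, the same normalisation can be written with numpy array operations instead of explicit loops. This is only a sketch of an alternative, not the script itself: the function name normalise_matrix is hypothetical, and the handling of constant columns (standard deviation 0) is my own assumption, since the script above does not treat that case specially.

```python
import numpy

def normalise_matrix(matrix):
    """Standardise every column except the first, which holds the class label."""
    matrix = numpy.asarray(matrix, dtype=float)
    labels = matrix[:, 0].astype(int)
    features = matrix[:, 1:]
    means = features.mean(axis=0)  # one mean per feature column
    stds = features.std(axis=0)    # one standard deviation per feature column
    # Assumption: a constant column is mapped to 0 rather than divided by zero.
    stds[stds == 0] = 1.0
    return labels, (features - means) / stds

labels, features = normalise_matrix([[1, 2.0, 5.0],
                                     [0, 4.0, 5.0]])
print(labels)
print(features)
```

Here the first feature becomes -1 and 1, while the constant second feature becomes 0 in both rows.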