Handling missing values


You have probably encountered datasets in which some values in a column are missing. There can be many reasons for this: the field might have been optional in an input form, or in some cases it was simply not possible to obtain the value. Such gaps are called missing values.

There are some standard techniques for handling missing values, but you must understand that these techniques are problem specific. There may be situations where none of them is useful; in that case you have to come up with your own estimate, and that is the beauty of machine learning.

  1. Selective deletion: Here we delete the whole row if any of its values is missing. This method is useful only when few rows have missing values. It also introduces human bias into the algorithm, which is undesirable if you want the algorithm's behavior to generalize over the data.
  2. Feature deletion: Here we eliminate the entire column if it has any missing values. This method is not widely used, because eliminating a feature in most cases means a loss of information, which is not a good idea. It is only reasonable when the feature is a combination of other features, so that little is lost by dropping it.
  3. Mean replacement: Here we replace the missing values with the average over the entire column. This is one of the most popular techniques and probably the simplest, since we neither discard any rows or columns nor add human bias. The trade-off is that every filled entry gets the same value, so model accuracy might be affected, but this is problem specific.
  4. Most occurred value: Here we replace the missing value with the most frequently occurring value (the mode) of that feature. This is often a good approximation, because the most frequent value usually has little effect on the performance of the model. However, this technique adds bias to the model if a large number of values are missing.
  5. Regression substitution: In this technique we first fit a regression model that predicts the feature with missing values from the other features, and then use that model to approximate the missing values. This is a more sophisticated method and requires a fair amount of technical work to estimate the hidden values.
  6. Educated guess: This technique is problem specific and requires strong domain knowledge. It is not easy to apply, because acquiring domain knowledge for every problem you work on is difficult.
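
The first five techniques can be sketched in a few lines of plain Python. This is only a minimal illustration under stated assumptions: the toy data, the column names, and the helper functions below are all hypothetical, None stands for a missing value, and the regression step uses a simple one-predictor least-squares line rather than any particular library model.

```python
from statistics import mean, mode

# Hypothetical toy data set; None marks a missing value.
rows = [
    {"id": 1, "age": 25, "city": "Pune", "income": 30.0},
    {"id": 2, "age": 32, "city": "Delhi", "income": None},
    {"id": 3, "age": None, "city": "Pune", "income": 41.0},
    {"id": 4, "age": 41, "city": None, "income": 52.0},
    {"id": 5, "age": 38, "city": "Pune", "income": 47.0},
]

def drop_incomplete_rows(rows):
    """Technique 1: keep only rows that have no missing value."""
    return [r for r in rows if all(v is not None for v in r.values())]

def drop_incomplete_columns(rows):
    """Technique 2: remove every column that has at least one missing value."""
    complete = [c for c in rows[0] if all(r[c] is not None for r in rows)]
    return [{c: r[c] for c in complete} for r in rows]

def fill_with(rows, column, statistic):
    """Techniques 3 and 4: fill gaps with a statistic of the observed values."""
    observed = [r[column] for r in rows if r[column] is not None]
    fill = statistic(observed)
    return [{**r, column: fill if r[column] is None else r[column]} for r in rows]

def fill_by_regression(rows, target, predictor):
    """Technique 5: fit target = a + b * predictor on complete pairs,
    then predict the target wherever it is missing."""
    pairs = [(r[predictor], r[target]) for r in rows
             if r[predictor] is not None and r[target] is not None]
    x_bar = mean([x for x, _ in pairs])
    y_bar = mean([y for _, y in pairs])
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in pairs)
             / sum((x - x_bar) ** 2 for x, _ in pairs))
    intercept = y_bar - slope * x_bar
    filled = []
    for r in rows:
        if r[target] is None and r[predictor] is not None:
            r = {**r, target: intercept + slope * r[predictor]}
        filled.append(r)
    return filled

complete_rows = drop_incomplete_rows(rows)        # only ids 1 and 5 survive
complete_columns = drop_incomplete_columns(rows)  # only the "id" column survives
mean_filled = fill_with(rows, "age", mean)        # numeric feature
mode_filled = fill_with(rows, "city", mode)       # categorical feature
regression_filled = fill_by_regression(rows, "income", "age")
```

Notice how much the two deletion techniques throw away even on this tiny example, while the replacement techniques keep every row and column.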

There is no single technique that is good for every problem. The best way to decide which one works is to compare the techniques using an evaluation metric such as log-loss, squared-loss or hinge-loss. This comparison will eventually lead you to a suitable way of handling missing values for your problem.
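
One simple way to run such a comparison, assuming you have a column whose values are fully known, is to hide some of those values on purpose, fill them back in with each technique, and measure the squared-loss against the truth. The data and names below are hypothetical, and this sketch scores the filled values directly; in a real project you might instead compare the loss of the final model trained on each filled version of the data.

```python
from statistics import mean, mode

# Hypothetical column with every value known.
true_values = [21, 22, 22, 23, 24, 25, 22, 60, 23, 22]
hidden_positions = [1, 4, 8]  # indices we pretend are missing

# Values each technique is allowed to see.
observed = [v for i, v in enumerate(true_values) if i not in hidden_positions]

def squared_loss(fill_value):
    """Mean squared error when fill_value is used at every hidden position."""
    return mean([(true_values[i] - fill_value) ** 2 for i in hidden_positions])

# Techniques to compare: mean replacement and most occurred value.
strategies = {"mean": mean, "mode": mode}
losses = {name: squared_loss(stat(observed)) for name, stat in strategies.items()}
best_strategy = min(losses, key=losses.get)

for name, loss in losses.items():
    print(f"{name}: squared-loss = {loss:.2f}")
print("best:", best_strategy)
```

On this particular column the single large value 60 pulls the mean upwards, so the most occurred value gives the lower loss. On a different column the result could easily be the other way round, which is why the choice is problem specific.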

About the author


I write blogs about Machine Learning and data science

By abhinavsinghml
