Hypothesis testing is probably one of the most confusing topics in statistics; even experienced machine learning engineers often lack a crisp understanding of it. In this tutorial we will try to answer the following questions:

- What is hypothesis testing?
- What is it used for?

**What is hypothesis testing?**

We start by assuming that the input data has a specific structure, and from that assumption we work out what outcome we should expect. We then test whether the observed outcome agrees with the assumption: if it does, the assumption is retained; otherwise it is rejected. The assumption is the **hypothesis**, and the procedure for testing it is **hypothesis testing**. In other words, we test a quantity computed from the data under the premise that the data follows an assumption. The whole idea resembles **proof by contradiction**: we assume something is true and then look for evidence that contradicts it.

One concrete example of an assumption is:

- The data follows a normal distribution.

“The assumption that the data follows a normal distribution is called the *null hypothesis*, or **H0**. *H0 is the assumption that the statistical test holds initially; it can be interpreted as the default assumption.*”

“The alternative assumption, that is, a violation of the **null hypothesis**, is called the **first hypothesis**, the **alternative hypothesis**, or **H1**.”

**Null hypothesis (H0):** The default assumption of the statistical test. It is not rejected unless the data provides enough evidence against it at the chosen significance level.

**First hypothesis (H1):** The default assumption does not hold. This is the alternative to the default assumption, and it is what we conclude when H0 is rejected.

It is important to understand the definitions of **H0** and **H1** correctly.
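To make H0 and H1 concrete for the normality example above, here is a minimal sketch of a test where H0 is "the data follows a normal distribution". It is only an illustration, not the standard way to do this: it uses the Jarque-Bera statistic (which measures how far the skewness and kurtosis are from those of a normal distribution) and estimates the p-value by simulation. The function names, sample sizes, seeds, and generated data are all assumptions made for this sketch; in practice you would more likely call a library routine such as `scipy.stats.shapiro`.

```python
import random
import statistics


def jarque_bera_statistic(sample):
    """Jarque-Bera statistic: grows as skewness/kurtosis depart from a normal's."""
    n = len(sample)
    mean = statistics.fmean(sample)
    # Central moments (population form, dividing by n).
    m2 = sum((x - mean) ** 2 for x in sample) / n
    m3 = sum((x - mean) ** 3 for x in sample) / n
    m4 = sum((x - mean) ** 4 for x in sample) / n
    skewness = m3 / m2 ** 1.5
    kurtosis = m4 / m2 ** 2
    return n / 6 * (skewness ** 2 + (kurtosis - 3) ** 2 / 4)


def normality_p_value(sample, n_sim=1000, seed=0):
    """Monte Carlo p-value for H0: the sample comes from a normal distribution."""
    rng = random.Random(seed)
    observed = jarque_bera_statistic(sample)
    n = len(sample)
    # The statistic does not depend on the mean or scale of the data, so
    # standard normal samples of the same size show how it behaves under H0.
    at_least_as_extreme = sum(
        jarque_bera_statistic([rng.gauss(0, 1) for _ in range(n)]) >= observed
        for _ in range(n_sim)
    )
    # Add 1 to numerator and denominator so the estimate is never exactly 0.
    return (at_least_as_extreme + 1) / (n_sim + 1)


# Hypothetical data: one sample drawn from a normal, one from a skewed distribution.
data_rng = random.Random(42)
normal_data = [data_rng.gauss(170, 10) for _ in range(200)]
skewed_data = [data_rng.expovariate(1.0) for _ in range(200)]

print("normal-looking data, p-value:", normality_p_value(normal_data))
print("skewed data, p-value:", normality_p_value(skewed_data))
```

A small p-value for the skewed sample is evidence against H0, which supports H1 (the data is not normally distributed). A large p-value for the other sample only means we found no evidence against normality.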

**Interpreting P-Value**

This is one of the most misunderstood parts of hypothesis testing. We apply a statistical test to the data under an assumption, and the test returns what we call a p-value. On the basis of that p-value we either reject or fail to reject our null hypothesis, by comparing the p-value with the significance level.

**p-value**: The probability of observing a test result at least as extreme as the one actually obtained, given that our null hypothesis is true.

That definition is hard to digest on first reading, so let's take an example. Suppose we have two groups of 20 people each, group A and group B, and we want to know whether the weights of the people in the two groups differ. One simple test statistic is the difference between the mean weights of the two groups. Here:

H0: There is no difference between the weights of the people in the two groups.

H1: There is a difference between the weights of the people in the two groups.

So our p-value can be defined as **“the probability of observing a difference in mean weight at least as large as the one we measured, given that our null hypothesis is true.”**
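One way to see this definition at work on the two-group example is a permutation test, sketched below. If H0 is true, the group labels carry no information, so we can shuffle the people between the groups many times and count how often the shuffled difference in means is at least as large as the observed one. That fraction is an estimate of the p-value. This is a sketch under assumptions: the function name, the number of permutations, the seeds, and the generated measurements are all made up for illustration.

```python
import random
import statistics


def permutation_p_value(group_a, group_b, n_permutations=10000, seed=0):
    """Estimate the two-sided p-value for H0: both groups share one distribution."""
    rng = random.Random(seed)
    observed = abs(statistics.fmean(group_a) - statistics.fmean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        # Under H0 the labels are arbitrary, so reassign them at random.
        rng.shuffle(pooled)
        diff = abs(statistics.fmean(pooled[:n_a]) - statistics.fmean(pooled[n_a:]))
        if diff >= observed:
            at_least_as_extreme += 1
    # Add 1 to numerator and denominator so the estimate is never exactly 0.
    return (at_least_as_extreme + 1) / (n_permutations + 1)


# Hypothetical measurements for two groups of 20 people each.
data_rng = random.Random(7)
group_a = [data_rng.gauss(70, 12) for _ in range(20)]
group_b = [data_rng.gauss(70, 12) for _ in range(20)]

observed_diff = abs(statistics.fmean(group_a) - statistics.fmean(group_b))
print("observed difference in means:", round(observed_diff, 2))
print("estimated p-value:", permutation_p_value(group_a, group_b))
```

Note that the result is a statement about the data under H0, not the probability that H0 itself is true.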

Now, if

**p-value > significance level (threshold):** We fail to reject our null hypothesis. This is often loosely described as accepting it.

**p-value ≤ significance level (threshold):** We reject our null hypothesis.
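The decision rule above fits in a few lines of code. The sketch below uses a hypothetical helper named `decide` and assumes a significance level of 0.05, which is a common convention rather than something fixed by the method.

```python
def decide(p_value, alpha=0.05):
    """Apply the decision rule: compare the p-value with the significance level."""
    if not 0.0 <= p_value <= 1.0:
        raise ValueError("a p-value must lie between 0 and 1")
    if p_value <= alpha:
        return "reject H0"
    # A large p-value is not proof that H0 is true, only a lack of evidence against it.
    return "fail to reject H0"


print(decide(0.8))   # p-value above the threshold
print(decide(0.03))  # p-value below the threshold
```

The wording of the second outcome is deliberate: the test never proves the null hypothesis, it only fails to find enough evidence against it.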

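A common misinterpretation of the p-value is that it is the probability of the null hypothesis being true or false. It is not: it is a probability about the data, computed under the assumption that the null hypothesis is true.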

So, returning to the example of the weights of the people in the two groups:

Let's say the p-value is 0.8 for an observed difference of 12 kg. Our null hypothesis is that there is no difference between the weights of the two groups. A p-value of 0.8 then means that, if the null hypothesis is true, a difference of 12 kg or more would still show up by chance about 80% of the time, which can easily happen with so few people in each group. The observed difference is therefore not strong evidence against the null hypothesis, and we fail to reject it.

The statement above is slightly tricky, so read it carefully.

**What is it used for?**

During statistical analysis we often make assumptions about the data before applying an algorithm. To be sure we are applying the right algorithm, it is important to check that those assumptions are reasonable. A simple example of such an assumption is "the data follows a binomial distribution". If a hypothesis test gives us no reason to reject it, we can apply a matching model with more confidence, for example a naive Bayes variant, since the common variants model the features as Bernoulli, multinomial, or Gaussian.

Understanding hypothesis testing is very important, and in many places it is explained incorrectly, so if you have any doubts, please post them in the comment section below.