Okay, folks, we have reached the last installment of our data science series. So far we have studied the different layers of this complicated science, so now is the time to test our knowledge in a practical setting.
Taking one case study, we are going to apply what we have learned about data science so far to achieve the desired result.
Imagine the scenario: what if we could predict the occurrence of diabetes in advance and take concrete precautions to prevent it?
So, in our case study, we are going to predict the occurrence of diabetes by following the data science life cycle that we studied in the previous article. Let’s go through the whole process step by step:
- First, we collect data based on the medical history of the patients, as per phase 1.
- The data has a number of different attributes. You will find the following attributes in the data: npreg – Number of times pregnant, glucose – Plasma glucose concentration, bp – Blood pressure, skin – Triceps skinfold thickness, bmi – Body mass index, ped – Diabetes pedigree function, age – Age, income – Income.
- After getting the data, the next step is to prepare and clean it for analysis.
- That’s because this data contains numerous inconsistencies, such as missing values, blank columns, abrupt values, and incorrect entries. For an accurate analysis, you have to clean the data properly.
- To clean the data, we can organize it into a single table with one column per attribute. This gives a proper structure to your unsorted data.
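To make the cleaning step concrete, here is a minimal sketch in Python using only the standard library. The records are made up for illustration, and filling gaps with the column median is just one possible choice that we assume here; the helper names `to_number` and `clean` are hypothetical.

```python
from statistics import median

# Hypothetical raw records; every value here is made up for illustration.
RAW_ROWS = [
    {"npreg": "2", "glucose": "120", "bp": "70", "bmi": "28.5", "age": "33"},
    {"npreg": "0", "glucose": "", "bp": "64", "bmi": "31.0", "age": "27"},
    {"npreg": "5", "glucose": "165", "bp": "n/a", "bmi": "35.2", "age": "51"},
    {"npreg": "1", "glucose": "98", "bp": "72", "bmi": "", "age": "24"},
]


def to_number(text):
    """Return the value as a float, or None when it is blank or not numeric."""
    try:
        return float(text)
    except (TypeError, ValueError):
        return None


def clean(rows):
    """Build one table (a list of dicts) and fill gaps with the column median."""
    columns = list(rows[0].keys())
    parsed = [{c: to_number(r.get(c)) for c in columns} for r in rows]
    for c in columns:
        known = [r[c] for r in parsed if r[c] is not None]
        if not known:
            continue  # a completely blank column is left as it is
        fill = median(known)
        for r in parsed:
            if r[c] is None:
                r[c] = fill
    return parsed


CLEAN_ROWS = clean(RAW_ROWS)
```

Whether you fill missing values, drop the affected rows, or go back to the source depends on how much data is missing and why, so treat the median fill as a starting point only.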
Once your data is properly structured, the next phase is to analyze it, as we discussed in phase 3.
- For the analysis, load all the data into the sandbox and apply different statistical functions to it. For example, you can use R functions to find the number of missing values and unique values. We can also use the summary function to get statistical information such as the mean, median, range, minimum, and maximum.
- Next, we use visualization techniques such as histograms, line graphs, and box plots to get a clear view of how the data is distributed.
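The summary statistics described above are easy to try out. The sketch below uses Python's standard library rather than R, purely for illustration; the glucose readings are made up, and the `summarize` helper is a hypothetical name.

```python
from statistics import mean, median

# Hypothetical glucose readings; None marks a missing value.
glucose = [120, None, 165, 98, 120, 142, None, 110]


def summarize(values):
    """Count missing and unique values and compute basic statistics."""
    present = [v for v in values if v is not None]
    return {
        "missing": len(values) - len(present),
        "unique": len(set(present)),
        "min": min(present),
        "max": max(present),
        "range": max(present) - min(present),
        "mean": mean(present),
        "median": median(present),
    }


summary = summarize(glucose)
for name, value in summary.items():
    print(f"{name}: {value}")
```

Running the same helper over every attribute gives a quick first picture of the data before you move on to plots.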
Now, based on the information gathered in the steps above, we will design the decision tree. Let’s see how.
- Since we have already identified the major attributes, we use a supervised learning technique, which learns from records whose outcome is already known, to build the model.
- We use a decision tree because it takes all attributes into consideration, whether their relationships are linear or nonlinear. In our case, we found a linear relationship between npreg and age and a nonlinear relationship between npreg and ped.
- The decision tree model is very robust, as we can use different attributes to test the decision and then implement the one that offers the best efficiency.
- In our decision tree, the glucose level is the most important factor, so it is the root node.
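How does an attribute end up as the root node? The series does not say which criterion was used, so the sketch below assumes Gini impurity, a common choice for decision trees. The patient records are made up so that glucose separates the two groups cleanly, and the function names are hypothetical.

```python
# Hypothetical patients; diabetic is 1 for yes and 0 for no. Values are made up.
PATIENTS = [
    {"glucose": 85, "age": 25, "diabetic": 0},
    {"glucose": 90, "age": 45, "diabetic": 0},
    {"glucose": 100, "age": 31, "diabetic": 0},
    {"glucose": 110, "age": 52, "diabetic": 0},
    {"glucose": 140, "age": 29, "diabetic": 1},
    {"glucose": 150, "age": 48, "diabetic": 1},
    {"glucose": 165, "age": 35, "diabetic": 1},
    {"glucose": 180, "age": 50, "diabetic": 1},
]


def gini(labels):
    """Gini impurity of a list of 0/1 labels (0.0 means a pure group)."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)


def best_split(rows, attribute):
    """Return (weighted Gini, threshold) of the best two-way split on one attribute."""
    values = sorted({r[attribute] for r in rows})
    best = (float("inf"), None)
    for low, high in zip(values, values[1:]):
        threshold = (low + high) / 2  # midpoint between neighbouring values
        left = [r["diabetic"] for r in rows if r[attribute] <= threshold]
        right = [r["diabetic"] for r in rows if r[attribute] > threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
        if score < best[0]:
            best = (score, threshold)
    return best


def choose_root(rows, attributes):
    """Pick the attribute whose best split leaves the least impurity."""
    return min(attributes, key=lambda a: best_split(rows, a)[0])


root = choose_root(PATIENTS, ["glucose", "age"])
print(root, best_split(PATIENTS, root))
```

On this toy data the glucose split leaves no impurity at all, which is why it is chosen as the root. A real tree repeats the same search inside each branch.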
In this phase, we run a small pilot test to determine whether our project performs as expected. We also look for performance constraints, if there are any. If the result isn’t satisfactory, we have to rebuild the model.
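A pilot test can be as simple as checking the model against a handful of records it has not seen. The sketch below assumes a one-rule stand-in model, made-up pilot records, and a hypothetical acceptance bar of 90% accuracy; none of these come from the case study itself.

```python
# Hypothetical pilot records that were not used to build the model.
PILOT = [
    {"glucose": 95, "diabetic": 0},
    {"glucose": 130, "diabetic": 1},
    {"glucose": 118, "diabetic": 1},
    {"glucose": 160, "diabetic": 1},
    {"glucose": 105, "diabetic": 0},
]


def predict(patient, threshold=125.0):
    """Stand-in model: flag diabetes when glucose is above the threshold."""
    return 1 if patient["glucose"] > threshold else 0


def accuracy(model, patients):
    """Share of patients for whom the prediction matches the known outcome."""
    correct = sum(1 for p in patients if model(p) == p["diabetic"])
    return correct / len(patients)


pilot_accuracy = accuracy(predict, PILOT)
needs_rebuild = pilot_accuracy < 0.9  # hypothetical acceptance bar

print(f"pilot accuracy: {pilot_accuracy:.0%}, rebuild needed: {needs_rebuild}")
```

Here the stand-in model misses one patient out of five, so under the assumed bar it would go back for a rebuild.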
After executing the project successfully, we share the final deployment with the stakeholders.
So, folks, that is the end of the series. We truly hope that you have gained all the data science knowledge you were looking for, and that you enjoyed the series as much as we enjoyed sharing it with you. See you next time with another great series.