Heart Disease Prediction

Ajay Mane
6 min read · Aug 21, 2020


In this machine learning project, I use a dataset collected from Kaggle and apply Machine Learning to predict whether or not a person is suffering from heart disease.

Import libraries

Let’s first import all the necessary libraries. I’ll use numpy and pandas to start with. For visualization, I will use the pyplot subpackage of matplotlib, rcParams to style the plots, and rainbow for colors. For implementing Machine Learning models and processing the data, I will use the sklearn library.
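A minimal sketch of those imports (rainbow is the colormap from matplotlib.cm):

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib import rcParams       # for plot styling
from matplotlib.cm import rainbow     # colormap used for bar colors later
```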

For processing the data, I’ll import a few utilities from sklearn. To split the available dataset for testing and training, I’ll use the `train_test_split` method. To scale the features, I’ll use `StandardScaler`.

Next, I’ll import all the Machine Learning algorithms I will be using.
1. K Neighbors Classifier
2. Support Vector Classifier
3. Decision Tree Classifier
4. Random Forest Classifier
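A sketch of the corresponding sklearn imports, covering both the processing utilities mentioned above and the four classifiers (the later snippets assume these imports):

```python
# Data-processing utilities.
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# The four classifiers used in this project.
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
```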

Import dataset

Now that we have all the libraries we will need, I can import the dataset and take a look at it. The dataset is stored in the file dataset.csv. I'll use the pandas read_csv method to read the dataset.
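Loading it is a one-liner with pandas:

```python
# Read the CSV file into a DataFrame.
dataset = pd.read_csv('dataset.csv')
```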

The dataset is now loaded into the variable dataset. I'll just take a glimpse of the data using the describe() and info() methods before I actually start processing and visualizing it.
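Something like:

```python
# Column types and non-null counts, then summary statistics per column.
dataset.info()
dataset.describe()
```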

Looks like the dataset has a total of 303 rows and there are no missing values. There are a total of 13 features along with one target value that we wish to find.

The scale of each feature column is different and quite varied as well. While the maximum for age reaches 77, the maximum of chol (serum cholesterol) is 564.
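Here is one way to plot a correlation matrix for the features (the figure size and tick labelling are my own choices):

```python
# Heatmap of pairwise correlations, with feature names on both axes.
rcParams['figure.figsize'] = 20, 14
plt.matshow(dataset.corr())
plt.yticks(np.arange(dataset.shape[1]), dataset.columns)
plt.xticks(np.arange(dataset.shape[1]), dataset.columns)
plt.colorbar()
plt.show()
```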

Taking a look at the correlation matrix above, it’s easy to see that a few features have a negative correlation with the target value while others have a positive one. Next, I’ll take a look at the histograms for each variable.
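pandas makes this easy:

```python
# One histogram per column in a single grid.
dataset.hist()
plt.show()
```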

Taking a look at the histograms above, I can see that each feature has a different distribution range. Thus, scaling the features before making predictions should be of great use. Also, the categorical features do stand out.

It’s always good practice to work with a dataset where the target classes are of approximately equal size. Thus, let’s check the class balance.
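A quick sketch, assuming the label column is named target as in the Kaggle file (the colors are an arbitrary choice):

```python
# Count each class and draw one bar per class.
counts = dataset['target'].value_counts().sort_index()
plt.bar(counts.index, counts.values, color=['red', 'green'])
plt.xticks([0, 1])
plt.xlabel('Target class')
plt.ylabel('Count')
plt.title('Count of each target class')
plt.show()
```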

The two classes are not exactly 50% each but the ratio is good enough to continue without dropping/increasing our data.

Data Processing

After exploring the dataset, I observed that I need to convert some categorical variables into dummy variables and scale all the values before training the Machine Learning models. First, I’ll use the get_dummies method to create dummy columns for categorical variables.
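A sketch of that step; the column list is my assumption, based on the usual categorical columns in this Kaggle heart-disease dataset:

```python
# One-hot encode the categorical columns into dummy variables.
# NOTE: this column list is an assumption about the dataset's schema.
dataset = pd.get_dummies(dataset, columns=['sex', 'cp', 'fbs', 'restecg',
                                           'exang', 'slope', 'ca', 'thal'])
```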

Now, I will use the StandardScaler from sklearn to scale my dataset.
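Roughly like this; the list of continuous columns is again my assumption:

```python
# Standardize the continuous columns to zero mean and unit variance.
# NOTE: the column list is an assumption. Also, fitting the scaler before
# the train/test split leaks test statistics; in stricter setups you would
# fit it on the training set only.
scaler = StandardScaler()
columns_to_scale = ['age', 'trestbps', 'chol', 'thalach', 'oldpeak']
dataset[columns_to_scale] = scaler.fit_transform(dataset[columns_to_scale])
```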

The data is now ready for our Machine Learning application.

Machine Learning

I’ll now use `train_test_split` to split our dataset into training and testing sets, and then train and test each of the Machine Learning models imported earlier.
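A minimal split; the test_size and random_state values here are my own choices:

```python
# Separate the features from the label, then hold out a test set.
y = dataset['target']
X = dataset.drop('target', axis=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=0)  # split ratio is an assumption
```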

K Neighbors Classifier

The classification score varies based on the number of neighbors we choose. Thus, I’ll plot a score graph for different values of K (neighbors) and check where I achieve the best score.
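A sketch of that sweep; the range of K values (1 to 20) is my assumption:

```python
# Train and score a K Neighbors Classifier for each K from 1 to 20.
knn_scores = []
for k in range(1, 21):
    knn_classifier = KNeighborsClassifier(n_neighbors=k)
    knn_classifier.fit(X_train, y_train)
    knn_scores.append(knn_classifier.score(X_test, y_test))
```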

I have the scores for different neighbor values in the array knn_scores. I'll now plot it and see for which value of K I get the best score.
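One way to plot it:

```python
# Line plot of test accuracy against K, with each point labelled.
plt.plot(range(1, 21), knn_scores, color='red')
for i in range(1, 21):
    plt.text(i, knn_scores[i - 1], f'({i}, {knn_scores[i - 1]:.2f})')
plt.xticks(range(1, 21))
plt.xlabel('Number of neighbors (K)')
plt.ylabel('Score')
plt.title('K Neighbors Classifier scores for different K values')
plt.show()
```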

From the plot above, it is clear that the maximum score achieved was `0.87`, for 8 neighbors.

Support Vector Classifier

There are several kernels for the Support Vector Classifier. I’ll test some of them and check which has the best score.
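A sketch; which kernels to try is my choice, here the four string kernels built into sklearn's SVC:

```python
# Train and score an SVC with each kernel.
kernels = ['linear', 'poly', 'rbf', 'sigmoid']
svc_scores = []
for kernel in kernels:
    svc_classifier = SVC(kernel=kernel)
    svc_classifier.fit(X_train, y_train)
    svc_scores.append(svc_classifier.score(X_test, y_test))
```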

I’ll now plot a bar plot of scores for each kernel and see which performed the best.
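For example:

```python
# One bar per kernel, colored with the rainbow colormap imported earlier.
plt.bar(kernels, svc_scores, color=rainbow(np.linspace(0, 1, len(kernels))))
for i, score in enumerate(svc_scores):
    plt.text(i, score, f'{score:.2f}')
plt.xlabel('Kernel')
plt.ylabel('Score')
plt.title('Support Vector Classifier scores for different kernels')
plt.show()
```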

The linear kernel performed the best, slightly better than the rbf kernel.

Decision Tree Classifier

Here, I’ll use the Decision Tree Classifier to model the problem at hand. I’ll vary the max_features parameter and see which value returns the best accuracy.
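A sketch of that sweep (as noted below, max_features runs from 1 to 30; the random_state is my own addition, for reproducibility):

```python
# Train and score a decision tree for each max_features value.
dt_scores = []
for i in range(1, 31):
    dt_classifier = DecisionTreeClassifier(max_features=i, random_state=0)
    dt_classifier.fit(X_train, y_train)
    dt_scores.append(dt_classifier.score(X_test, y_test))
```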

I selected the maximum number of features from 1 to 30 for the split. Now, let’s see the scores for each of those cases.
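Plotted the same way as before:

```python
# Line plot of accuracy against max_features, with each point labelled.
plt.plot(range(1, 31), dt_scores, color='green')
for i in range(1, 31):
    plt.text(i, dt_scores[i - 1], f'({i}, {dt_scores[i - 1]:.2f})')
plt.xticks(range(1, 31))
plt.xlabel('Max features')
plt.ylabel('Score')
plt.title('Decision Tree Classifier scores for different max_features values')
plt.show()
```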

The model achieved the best accuracy at three values of maximum features, 2, 4 and 18.

Random Forest Classifier

Now, I’ll use the ensemble method, Random Forest Classifier, to create the model and vary the number of estimators to see their effect.
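A sketch; the exact list of estimator counts is my assumption (it includes the 100 and 500 discussed below):

```python
# Train and score a random forest for each estimator count.
estimators = [10, 100, 200, 500, 1000]
rf_scores = []
for n in estimators:
    rf_classifier = RandomForestClassifier(n_estimators=n, random_state=0)
    rf_classifier.fit(X_train, y_train)
    rf_scores.append(rf_classifier.score(X_test, y_test))
```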

The model is trained and the scores are recorded. Let’s plot a bar plot to compare the scores.
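For instance:

```python
# One bar per estimator count (string labels keep the bars evenly spaced).
plt.bar([str(n) for n in estimators], rf_scores,
        color=rainbow(np.linspace(0, 1, len(estimators))))
for i, score in enumerate(rf_scores):
    plt.text(i, score, f'{score:.2f}')
plt.xlabel('Number of estimators')
plt.ylabel('Score')
plt.title('Random Forest Classifier scores for different numbers of estimators')
plt.show()
```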

The maximum score is achieved when the number of estimators is 100 or 500.

Conclusion

In this project, I used Machine Learning to predict whether a person is suffering from heart disease. After importing the data, I analyzed it using plots. Then, I generated dummy variables for the categorical features and scaled the remaining features. I then applied four Machine Learning algorithms: K Neighbors Classifier, Support Vector Classifier, Decision Tree Classifier and Random Forest Classifier. I varied the parameters of each model to improve its score. In the end, the K Neighbors Classifier achieved the highest score of 87% with 8 nearest neighbors.

Code and repository: click here to download!
