Naïve Bayes Classification
Exploring the Naïve Bayes Algorithm for Classification Tasks
This README provides instructions and information for implementing the Naïve Bayes algorithm for classification. The experiment requires prior knowledge of Python programming and the following libraries: Pandas, NumPy, Matplotlib, and Seaborn.
Aim
The aim of this project is to implement the Naïve Bayes algorithm for classification.
Prerequisite
To successfully complete this experiment, you should have knowledge of Python programming and the following libraries: Pandas, NumPy, Matplotlib, and Seaborn.
Outcome
After successfully completing this experiment, you will be able to:
- Implement the Naïve Bayes technique for classification.
- Compare the results of Naïve Bayes and KNN algorithms.
- Understand and infer the results of different classification metrics.
Theory
Naïve Bayes Classifier
The Naïve Bayes algorithm is a supervised learning algorithm based on Bayes’ theorem. It is used for solving classification problems and is particularly effective for text classification with high-dimensional training datasets. The Naïve Bayes classifier is a simple yet effective algorithm that can build fast machine learning models for quick predictions. It is a probabilistic classifier: it predicts on the basis of the probability that an object belongs to each class. Examples of Naïve Bayes applications include spam filtering, sentiment analysis, and article classification.
Bayes’ Theorem
Bayes’ theorem, also known as Bayes’ Rule or Bayes’ law, is used to determine the probability of a hypothesis given prior knowledge. It relies on conditional probability. The formula for Bayes’ theorem is as follows:
P(A|B) = (P(B|A) * P(A)) / P(B)
Where:
- P(A|B) is the posterior probability: the probability of hypothesis A given the observed event B.
- P(B|A) is the likelihood: the probability of observing evidence B given that hypothesis A is true.
- P(A) is the prior probability: the probability of the hypothesis before observing the evidence.
- P(B) is the marginal probability: the probability of the evidence.
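The “naïve” assumption behind the classifier is that the features are conditionally independent given the class. For a class y and features x1, …, xn, the posterior therefore factorizes as:
P(y | x1, …, xn) ∝ P(y) * P(x1 | y) * P(x2 | y) * … * P(xn | y)
The predicted class is the one that maximizes this product.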
Working of Naïve Bayes Classifier
The working of the Naïve Bayes Classifier involves the following steps:
- Convert the given dataset into frequency tables.
- Generate a likelihood table by finding the probabilities of given features.
- Use Bayes’ theorem to calculate the posterior probability for each class and predict the class with the highest posterior (a small worked sketch follows this list).
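To make these three steps concrete, here is a minimal Python sketch on a made-up toy car dataset (the values below are illustrative only, not the graded dataset from Task 1):
# Worked sketch of the three steps on a toy dataset (illustrative values only).
import pandas as pd

toy = pd.DataFrame({
    "color":  ["Red", "Red", "Yellow", "Yellow", "Red"],
    "type":   ["Sports", "Sports", "Sports", "SUV", "SUV"],
    "origin": ["Domestic", "Imported", "Domestic", "Imported", "Domestic"],
    "stolen": ["Yes", "No", "Yes", "No", "No"],
})
test_point = {"color": "Yellow", "type": "Sports", "origin": "Domestic"}

priors = toy["stolen"].value_counts(normalize = True)  # step 1: class frequencies
posteriors = {}
for cls in priors.index:
    subset = toy[toy["stolen"] == cls]
    likelihood = 1.0
    for feature, value in test_point.items():
        # step 2: likelihood P(feature = value | class) from the frequency table
        likelihood *= (subset[feature] == value).mean()
    # step 3: unnormalized posterior P(class) * product of the likelihoods
    posteriors[cls] = priors[cls] * likelihood
print(max(posteriors, key = posteriors.get), posteriors)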
Tasks
Perform the following tasks to implement the Naïve Bayes algorithm and compare it with KNN:
Task 1: Implementing Naïve Bayes Algorithm on Car Dataset
- Apply the Naïve Bayes algorithm to the given car dataset.
- Show all the steps of the training phase.
- Identify the class for the test data point (Color = Yellow, Type = Sports, Origin = Domestic).
- Work out the answer on paper and upload an image of your solution.
Task 2: Operations on Adult.csv Dataset
- Upload the dataset into a dataframe.
- Check the shape of the dataset.
- Find out all the categorical columns from the dataset.
- Check if null values exist in all the categorical columns.
- Identify the problems with the “workclass,” “occupation,” and “native_country” columns and rectify them.
- Explore numeric columns and check for any null values.
- Create a feature vector with x = all the columns except income and y = income.
- Perform feature engineering on the train-test split data:
- Check the data types of columns of the input features of the training dataset.
- Identify categorical columns that have null values and fill them with the most probable value in the dataset.
- Repeat the above step for the input features of the test dataset.
- Apply one-hot encoding on all the categorical columns.
- Apply feature scaling using a robust scaler.
Task 3: Implement the KNN algorithm using scikit-learn on the same dataset with k = 5.
Task 4: Implement Naïve Bayes Algorithm on the given dataset.
Task 5: Compare the confusion matrix for both classifiers.
Task 6: Compare the accuracy score of both classifiers.
Task 7: Draw the ROC curve to compare both models.
Follow the instructions provided for each task and analyze the results to gain a better understanding of the Naïve Bayes algorithm and its comparison with KNN.
# import libraries
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
# Read the dataset and load it into a dataframe.
df = pd.read_csv("/content/adultPrac7.csv")
EDA:
df.shape
(32561, 15)
df.head(15)
| | age | workclass | fnlwgt | education | education_num | marital_status | occupation | relationship | race | sex | capital_gain | capital_loss | hours_per_week | native_country | income |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 39 | State-gov | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States | <=50K |
| 1 | 50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | <=50K |
| 2 | 38 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K |
| 3 | 53 | Private | 234721 | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States | <=50K |
| 4 | 28 | Private | 338409 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 0 | 40 | Cuba | <=50K |
| 5 | 37 | Private | 284582 | Masters | 14 | Married-civ-spouse | Exec-managerial | Wife | White | Female | 0 | 0 | 40 | United-States | <=50K |
| 6 | 49 | Private | 160187 | 9th | 5 | Married-spouse-absent | Other-service | Not-in-family | Black | Female | 0 | 0 | 16 | Jamaica | <=50K |
| 7 | 52 | Self-emp-not-inc | 209642 | HS-grad | 9 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 45 | United-States | >50K |
| 8 | 31 | Private | 45781 | Masters | 14 | Never-married | Prof-specialty | Not-in-family | White | Female | 14084 | 0 | 50 | United-States | >50K |
| 9 | 42 | Private | 159449 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 5178 | 0 | 40 | United-States | >50K |
| 10 | 37 | Private | 280464 | Some-college | 10 | Married-civ-spouse | Exec-managerial | Husband | Black | Male | 0 | 0 | 80 | United-States | >50K |
| 11 | 30 | State-gov | 141297 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Husband | Asian-Pac-Islander | Male | 0 | 0 | 40 | India | >50K |
| 12 | 23 | Private | 122272 | Bachelors | 13 | Never-married | Adm-clerical | Own-child | White | Female | 0 | 0 | 30 | United-States | <=50K |
| 13 | 32 | Private | 205019 | Assoc-acdm | 12 | Never-married | Sales | Not-in-family | Black | Male | 0 | 0 | 50 | United-States | <=50K |
| 14 | 40 | Private | 121772 | Assoc-voc | 11 | Married-civ-spouse | Craft-repair | Husband | Asian-Pac-Islander | Male | 0 | 0 | 40 | ? | >50K |
df.dtypes
age int64
workclass object
fnlwgt int64
education object
education_num int64
marital_status object
occupation object
relationship object
race object
sex object
capital_gain int64
capital_loss int64
hours_per_week int64
native_country object
income object
dtype: object
df.describe  # note: without parentheses this returns the bound method (shown below) rather than summary statistics; call df.describe() for the numeric summary
<bound method NDFrame.describe of age workclass fnlwgt education education_num \
0 39 State-gov 77516 Bachelors 13
1 50 Self-emp-not-inc 83311 Bachelors 13
2 38 Private 215646 HS-grad 9
3 53 Private 234721 11th 7
4 28 Private 338409 Bachelors 13
... ... ... ... ... ...
32556 27 Private 257302 Assoc-acdm 12
32557 40 Private 154374 HS-grad 9
32558 58 Private 151910 HS-grad 9
32559 22 Private 201490 HS-grad 9
32560 52 Self-emp-inc 287927 HS-grad 9
marital_status occupation relationship race \
0 Never-married Adm-clerical Not-in-family White
1 Married-civ-spouse Exec-managerial Husband White
2 Divorced Handlers-cleaners Not-in-family White
3 Married-civ-spouse Handlers-cleaners Husband Black
4 Married-civ-spouse Prof-specialty Wife Black
... ... ... ... ...
32556 Married-civ-spouse Tech-support Wife White
32557 Married-civ-spouse Machine-op-inspct Husband White
32558 Widowed Adm-clerical Unmarried White
32559 Never-married Adm-clerical Own-child White
32560 Married-civ-spouse Exec-managerial Wife White
sex capital_gain capital_loss hours_per_week native_country \
0 Male 2174 0 40 United-States
1 Male 0 0 13 United-States
2 Male 0 0 40 United-States
3 Male 0 0 40 United-States
4 Female 0 0 40 Cuba
... ... ... ... ... ...
32556 Female 0 0 38 United-States
32557 Male 0 0 40 United-States
32558 Female 0 0 40 United-States
32559 Male 0 0 20 United-States
32560 Female 15024 0 40 United-States
income
0 <=50K
1 <=50K
2 <=50K
3 <=50K
4 <=50K
... ...
32556 <=50K
32557 >50K
32558 <=50K
32559 <=50K
32560 >50K
[32561 rows x 15 columns]>
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32561 entries, 0 to 32560
Data columns (total 15 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 age 32561 non-null int64
1 workclass 32561 non-null object
2 fnlwgt 32561 non-null int64
3 education 32561 non-null object
4 education_num 32561 non-null int64
5 marital_status 32561 non-null object
6 occupation 32561 non-null object
7 relationship 32561 non-null object
8 race 32561 non-null object
9 sex 32561 non-null object
10 capital_gain 32561 non-null int64
11 capital_loss 32561 non-null int64
12 hours_per_week 32561 non-null int64
13 native_country 32561 non-null object
14 income 32561 non-null object
dtypes: int64(6), object(9)
memory usage: 3.7+ MB
# Checking Labels in workclass variable:
df.workclass.unique()
array([' State-gov', ' Self-emp-not-inc', ' Private', ' Federal-gov',
' Local-gov', ' ?', ' Self-emp-inc', ' Without-pay',
' Never-worked'], dtype=object)
# Showing the value counts of each workclass category:
df.workclass.value_counts()
Private 22696
Self-emp-not-inc 2541
Local-gov 2093
? 1836
State-gov 1298
Self-emp-inc 1116
Federal-gov 960
Without-pay 14
Never-worked 7
Name: workclass, dtype: int64
# Replace the ' ?' placeholder with NaN so that pandas can handle it as a missing value.
df['workclass'].replace(' ?', np.nan, inplace = True)  # note the leading space in ' ?'
df.workclass.value_counts()
Private 22696
Self-emp-not-inc 2541
Local-gov 2093
State-gov 1298
Self-emp-inc 1116
Federal-gov 960
Without-pay 14
Never-worked 7
Name: workclass, dtype: int64
df.workclass.unique()
array([' State-gov', ' Self-emp-not-inc', ' Private', ' Federal-gov',
' Local-gov', nan, ' Self-emp-inc', ' Without-pay',
' Never-worked'], dtype=object)
# Apply the same replacement across the whole frame, which also covers the occupation and native_country columns:
df.replace(' ?', np.nan, inplace = True)
# Checking that no '?' values remain in any feature (the raw values carry a leading space, i.e. ' ?'):
df[df == ' ?'].count()
age 0
workclass 0
fnlwgt 0
education 0
education_num 0
marital_status 0
occupation 0
relationship 0
race 0
sex 0
capital_gain 0
capital_loss 0
hours_per_week 0
native_country 0
income 0
dtype: int64
# X holds the features, Y holds the labels
X = df.drop(['income'], axis = 1)
Y = df["income"]
X.dtypes
age int64
workclass object
fnlwgt int64
education object
education_num int64
marital_status object
occupation object
relationship object
race object
sex object
capital_gain int64
capital_loss int64
hours_per_week int64
native_country object
dtype: object
Y.dtypes
dtype('O')
# Displaying the categorical features:
categorical = [col for col in X.columns if X[col].dtypes == 'O']
categorical
['workclass',
'education',
'marital_status',
'occupation',
'relationship',
'race',
'sex',
'native_country']
# Displaying the numerical features:
numerical = [col for col in X.columns if X[col].dtypes != 'O']
numerical
['age',
'fnlwgt',
'education_num',
'capital_gain',
'capital_loss',
'hours_per_week']
# Print the fraction of missing values in each categorical variable of the feature set
X[categorical].isnull().mean()
workclass 0.056386
education 0.000000
marital_status 0.000000
occupation 0.056601
relationship 0.000000
race 0.000000
sex 0.000000
native_country 0.017905
dtype: float64
# Since these are categorical values we cannot use the mean, so we use the mode to replace the NaN values.
# Three features - workclass, occupation and native_country - have null values, so we fill each with that feature's most frequent value.
for col in ['workclass', 'occupation', 'native_country']:
    X[col].fillna(X[col].mode()[0], inplace = True)
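The same imputation can be written more compactly with scikit-learn's SimpleImputer; this is an optional alternative to the loop above and was not run in this notebook:
# Optional alternative using SimpleImputer (mode imputation, matching the loop above):
# from sklearn.impute import SimpleImputer
# imputer = SimpleImputer(strategy = 'most_frequent')
# cols = ['workclass', 'occupation', 'native_country']
# X[cols] = imputer.fit_transform(X[cols])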
# Checking missing values in the feature set:
X.isnull().sum()
age 0
workclass 0
fnlwgt 0
education 0
education_num 0
marital_status 0
occupation 0
relationship 0
race 0
sex 0
capital_gain 0
capital_loss 0
hours_per_week 0
native_country 0
dtype: int64
# With all null values handled, we now encode the categorical features. (Label encoding is used here; Task 2 mentions one-hot encoding, shown as an alternative further below.)
X[categorical].head()
| | workclass | education | marital_status | occupation | relationship | race | sex | native_country |
|---|---|---|---|---|---|---|---|---|
| 0 | State-gov | Bachelors | Never-married | Adm-clerical | Not-in-family | White | Male | United-States |
| 1 | Self-emp-not-inc | Bachelors | Married-civ-spouse | Exec-managerial | Husband | White | Male | United-States |
| 2 | Private | HS-grad | Divorced | Handlers-cleaners | Not-in-family | White | Male | United-States |
| 3 | Private | 11th | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | United-States |
| 4 | Private | Bachelors | Married-civ-spouse | Prof-specialty | Wife | Black | Female | Cuba |
from sklearn import preprocessing
# Note: LabelEncoder is designed for target labels; for input features, OrdinalEncoder (or one-hot encoding) is the usual choice, but label encoding suffices for this exercise.
label_encoder = preprocessing.LabelEncoder()
for i in X[categorical]:
    X[i] = label_encoder.fit_transform(X[i])
# The above for loop eliminates the need for transforming each categorical feature individually like this:
# X['workclass'] = label_encoder.fit_transform(X['workclass'])
# X['education'] = label_encoder.fit_transform(X['education'])
# X['marital_status'] = label_encoder.fit_transform(X['marital_status'])
# X['occupation'] = label_encoder.fit_transform(X['occupation'])
# X['relationship'] = label_encoder.fit_transform(X['relationship'])
# X['race'] = label_encoder.fit_transform(X['race'])
# X['sex'] = label_encoder.fit_transform(X['sex'])
# X['native_country'] = label_encoder.fit_transform(X['native_country'])
X.head(15)
| | age | workclass | fnlwgt | education | education_num | marital_status | occupation | relationship | race | sex | capital_gain | capital_loss | hours_per_week | native_country |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 39 | 6 | 77516 | 9 | 13 | 4 | 0 | 1 | 4 | 1 | 2174 | 0 | 40 | 38 |
| 1 | 50 | 5 | 83311 | 9 | 13 | 2 | 3 | 0 | 4 | 1 | 0 | 0 | 13 | 38 |
| 2 | 38 | 3 | 215646 | 11 | 9 | 0 | 5 | 1 | 4 | 1 | 0 | 0 | 40 | 38 |
| 3 | 53 | 3 | 234721 | 1 | 7 | 2 | 5 | 0 | 2 | 1 | 0 | 0 | 40 | 38 |
| 4 | 28 | 3 | 338409 | 9 | 13 | 2 | 9 | 5 | 2 | 0 | 0 | 0 | 40 | 4 |
| 5 | 37 | 3 | 284582 | 12 | 14 | 2 | 3 | 5 | 4 | 0 | 0 | 0 | 40 | 38 |
| 6 | 49 | 3 | 160187 | 6 | 5 | 3 | 7 | 1 | 2 | 0 | 0 | 0 | 16 | 22 |
| 7 | 52 | 5 | 209642 | 11 | 9 | 2 | 3 | 0 | 4 | 1 | 0 | 0 | 45 | 38 |
| 8 | 31 | 3 | 45781 | 12 | 14 | 4 | 9 | 1 | 4 | 0 | 14084 | 0 | 50 | 38 |
| 9 | 42 | 3 | 159449 | 9 | 13 | 2 | 3 | 0 | 4 | 1 | 5178 | 0 | 40 | 38 |
| 10 | 37 | 3 | 280464 | 15 | 10 | 2 | 3 | 0 | 2 | 1 | 0 | 0 | 80 | 38 |
| 11 | 30 | 6 | 141297 | 9 | 13 | 2 | 9 | 0 | 1 | 1 | 0 | 0 | 40 | 18 |
| 12 | 23 | 3 | 122272 | 9 | 13 | 4 | 0 | 3 | 4 | 0 | 0 | 0 | 30 | 38 |
| 13 | 32 | 3 | 205019 | 7 | 12 | 4 | 11 | 1 | 2 | 1 | 0 | 0 | 50 | 38 |
| 14 | 40 | 3 | 121772 | 8 | 11 | 2 | 2 | 0 | 1 | 1 | 0 | 0 | 40 | 38 |
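Task 2 mentions one-hot encoding, whereas this run used label encoding above. For reference, a minimal sketch of the one-hot alternative with pandas; it starts from the raw frame again (the categorical columns of X are already numeric) and was not executed in this run:
# One-hot sketch: expand each raw categorical column into 0/1 indicator columns.
# X_onehot = pd.get_dummies(df.drop(['income'], axis = 1), columns = categorical)
# X_onehot.shape  # far more columns than the 14 label-encoded ones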
# Before scaling, split the data into training and test sets (each feature has a varying range, so scaling follows the split):
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size = 0.3, random_state = 0)
# Checking the shape of training and testing samples after splitting
X_train.shape, X_test.shape
((22792, 14), (9769, 14))
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)  # transform (not fit_transform) so the test set is scaled with the training set's statistics
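Task 2 mentions a robust scaler, while this run used StandardScaler; swapping one in is a one-line change. A sketch (not what produced the results below):
# RobustScaler scales by the median and IQR, so it is less sensitive to outliers
# such as the extreme capital_gain values:
# from sklearn.preprocessing import RobustScaler
# scaler = RobustScaler()
# X_train = scaler.fit_transform(X_train)
# X_test = scaler.transform(X_test)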
# Train a Gaussian Naive Bayes Classifier on the training set
from sklearn.naive_bayes import GaussianNB
# instantiate the model
gnb = GaussianNB()
# Fit the model:
gnb.fit(X_train, y_train)
GaussianNB()
y_pred = gnb.predict(X_test)
y_pred
array([' <=50K', ' <=50K', ' <=50K', ..., ' >50K', ' <=50K', ' <=50K'],
dtype='<U6')
from sklearn.metrics import confusion_matrix, accuracy_score
cm = confusion_matrix(y_test, y_pred)
ac = accuracy_score(y_test, y_pred)
print(cm, "\n", ac)
[[7037 370]
[1510 852]]
0.8075545091616337
print("True Positives : ", cm[0, 0])
print("True Negatives : ", cm[1, 1])
print("False Positives : ", cm[0, 1])
print("False Negatives : ", cm[1, 0])
True Positives : 7037
True Negatives : 852
False Positives : 370
False Negatives : 1510
cm_matrix = pd.DataFrame(data = cm, index = ["Actual <=50K", "Actual >50K"], columns = ["Predicted <=50K", "Predicted >50K"])
import seaborn as sns
sns.heatmap(cm_matrix, annot = True, fmt = 'd', cmap = 'YlGnBu')
<AxesSubplot:>
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))
precision recall f1-score support
<=50K 0.82 0.95 0.88 7407
>50K 0.70 0.36 0.48 2362
accuracy 0.81 9769
macro avg 0.76 0.66 0.68 9769
weighted avg 0.79 0.81 0.78 9769
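These figures follow directly from the confusion matrix above. For the >50K class: precision = 852 / (852 + 370) ≈ 0.70 (of all points predicted >50K, the fraction that truly are), and recall = 852 / (852 + 1510) ≈ 0.36 (of all actual >50K points, the fraction that were found). The low recall shows the model misses most high-income cases even though the overall accuracy is 81%.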
from sklearn.neighbors import KNeighborsClassifier
classifier = KNeighborsClassifier(n_neighbors = 5, p = 2) # We set the parameters to use
# 5 nearest neighbors (the default value) and p = 2 (the default, i.e. Euclidean distance)
classifier.fit(X_train, y_train) # Feed the training data to the classifier.
KNeighborsClassifier()
y_pred = classifier.predict(X_test) # Predicting for x_test data
y_pred
array([' <=50K', ' <=50K', ' <=50K', ..., ' >50K', ' >50K', ' <=50K'],
dtype=object)
cm = confusion_matrix(y_test, y_pred)
ac = accuracy_score(y_test, y_pred)
print(cm, "\n", ac)
[[6685 722]
[ 990 1372]]
0.8247517657897431
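Task 7 asks for a ROC curve comparing the two models, which is not shown in the run above. A minimal sketch, assuming gnb and classifier are the fitted models from the cells above and treating ' >50K' (note the leading space in the labels) as the positive class:
from sklearn.metrics import roc_curve, auc

y_test_bin = (y_test == ' >50K').astype(int)  # binarize the string labels

for name, model in [("Naive Bayes", gnb), ("KNN (k = 5)", classifier)]:
    pos_idx = list(model.classes_).index(' >50K')  # predict_proba column of the positive class
    probs = model.predict_proba(X_test)[:, pos_idx]
    fpr, tpr, _ = roc_curve(y_test_bin, probs)
    plt.plot(fpr, tpr, label = f"{name} (AUC = {auc(fpr, tpr):.2f})")

plt.plot([0, 1], [0, 1], linestyle = '--', color = 'grey')  # chance diagonal
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.legend()
plt.show()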
Conclusion:
- The accuracy of Naïve Bayes is 80.7%.
- The accuracy of KNN (with k = 5) is 82.4%.
- From the acquired results, we can conclude that KNN performs better than Naïve Bayes on this sample.
From this experiment, we learnt how to:
- Implement the Naïve Bayes technique for classification.
- Compare the results of the Naïve Bayes and KNN algorithms.
- Understand and infer the results of different classification metrics.