Naive Bayes Classification

Exploring the Naïve Bayes Algorithm for Classification Tasks

Naïve Bayes Algorithm for Classification

This README file provides instructions and information for implementing the Naïve Bayes algorithm for classification. The experiment requires prior knowledge of Python programming and the following libraries: Pandas, NumPy, Matplotlib, and Seaborn.

Table of Contents

  1. Aim
  2. Prerequisite
  3. Outcome
  4. Theory
  5. Tasks

Aim

The aim of this project is to implement the Naïve Bayes algorithm for classification.

Prerequisite

To successfully complete this experiment, you should have knowledge of Python programming and the following libraries: Pandas, NumPy, Matplotlib, and Seaborn.

Outcome

After successfully completing this experiment, you will be able to:

  1. Implement the Naïve Bayes technique for classification.
  2. Compare the results of Naïve Bayes and KNN algorithms.
  3. Understand and infer the results of different classification metrics.

Theory

Naïve Bayes Classifier

The Naïve Bayes algorithm is a supervised learning algorithm based on Bayes’ theorem. It is called “naïve” because it assumes that the features are conditionally independent of one another given the class. It is used for classification problems and is particularly effective for text classification with high-dimensional training data. The classifier is simple, fast to train, and quick to predict. It is probabilistic: it predicts the class with the highest posterior probability given the observed features. Typical applications include spam filtering, sentiment analysis, and article classification.

Bayes’ Theorem

Bayes’ theorem, also known as Bayes’ Rule or Bayes’ law, is used to determine the probability of a hypothesis given prior knowledge. It relies on conditional probability. The formula for Bayes’ theorem is as follows:

P(A|B) = (P(B|A) * P(A)) / P(B)

Where:

  • P(A|B) is the posterior probability: the probability of hypothesis A given the observed event B.
  • P(B|A) is the likelihood: the probability of the evidence B given that hypothesis A is true.
  • P(A) is the prior probability: the probability of the hypothesis before observing the evidence.
  • P(B) is the marginal probability: the probability of the evidence.
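
As a quick numeric illustration of the formula, here is a minimal sketch using made-up probabilities for a spam filter (the numbers and the helper name `bayes_posterior` are assumptions for illustration only, not part of the experiment):

```python
def bayes_posterior(prior, likelihood, evidence):
    """Return P(A|B) = P(B|A) * P(A) / P(B)."""
    return likelihood * prior / evidence

# Hypothetical numbers: A = "message is spam", B = "message contains the word 'offer'"
p_spam = 0.2               # P(A): prior probability of spam
p_word_given_spam = 0.6    # P(B|A): likelihood of the word in spam
p_word_given_ham = 0.05    # P(B|not A): likelihood of the word in non-spam

# P(B): marginal probability of the evidence, via the law of total probability
p_word = p_word_given_spam * p_spam + p_word_given_ham * (1 - p_spam)

p_spam_given_word = bayes_posterior(p_spam, p_word_given_spam, p_word)
print(round(p_spam_given_word, 2))  # 0.75
```

Seeing the word raises the probability of spam from the prior of 0.2 to a posterior of 0.75.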

Working of Naïve Bayes Classifier

The working of the Naïve Bayes Classifier involves the following steps:

  1. Convert the given dataset into frequency tables.
  2. Generate a likelihood table by finding the probabilities of given features.
  3. Use Bayes’ theorem to calculate the posterior probability.
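
The three steps above can be sketched in plain Python. This is a minimal illustration on a tiny hypothetical weather dataset (not the car dataset from Task 1); the rows, the query, and the helper name `likelihood` are all invented for the example, and no smoothing is applied:

```python
from collections import Counter

# Hypothetical toy records: (outlook, windy, play)
rows = [
    ("Sunny", "No", "Yes"),
    ("Sunny", "Yes", "No"),
    ("Rainy", "No", "No"),
    ("Overcast", "No", "Yes"),
    ("Rainy", "No", "Yes"),
    ("Sunny", "No", "Yes"),
]
query = ("Sunny", "No")  # features of the point to classify

# Step 1: frequency tables (class counts and per-class feature-value counts)
class_counts = Counter(label for *_, label in rows)
feature_counts = Counter()
for *features, label in rows:
    for i, value in enumerate(features):
        feature_counts[(i, value, label)] += 1

# Step 2: likelihood table entry, P(feature_i = value | class = label)
def likelihood(i, value, label):
    return feature_counts[(i, value, label)] / class_counts[label]

# Step 3: Bayes' theorem with the naive independence assumption
scores = {}
for label, count in class_counts.items():
    score = count / len(rows)                  # prior P(class)
    for i, value in enumerate(query):
        score *= likelihood(i, value, label)   # multiply the per-feature likelihoods
    scores[label] = score

evidence = sum(scores.values())                # P(query), the normalizing constant
posterior = {label: s / evidence for label, s in scores.items()}
prediction = max(posterior, key=posterior.get)
print(posterior, prediction)
```

With these made-up rows, the posterior for "Yes" is 0.8 and for "No" is 0.2, so the query is classified as "Yes". A feature value never seen with a class would zero out that class's score; Laplace smoothing is the usual remedy and is omitted here for brevity.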

Tasks

Perform the following tasks to implement the Naïve Bayes algorithm and compare it with KNN:

Task 1: Implementing Naïve Bayes Algorithm on Car Dataset

  • Apply the Naïve Bayes algorithm to the given car dataset.
  • Show all the steps of the training phase.
  • Identify the class for the test data point (Color = Yellow, Type = Sports, Origin = Domestic).
  • Work out the answer on paper and upload the image.

Task 2: Operations on Adult.csv Dataset

  • Load the dataset into a dataframe.
  • Check the shape of the dataset.
  • Find out all the categorical columns from the dataset.
  • Check whether null values exist in any of the categorical columns.
  • Identify the problems with the “workclass,” “occupation,” and “native_country” columns and rectify them.
  • Explore numeric columns and check for any null values.
  • Create a feature vector with x = all the columns except income and y = income.
  • Implement feature engineering for the train-test split dataset:
    • Check the data types of columns of the input features of the training dataset.
    • Identify categorical columns that have null values and fill them with the most probable value in the dataset.
    • Repeat the above step for the input features of the test dataset.
    • Apply one-hot encoding on all the categorical columns.
    • Apply feature scaling using a robust scaler.
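
The feature-engineering steps listed above (mode imputation, one-hot encoding, robust scaling) could look roughly like the following. This is only a sketch on two tiny hypothetical frames, `train` and `test`, with one numeric and one categorical column; it assumes scikit-learn's `ColumnTransformer`, `OneHotEncoder` and `RobustScaler`, and it is not the code used in the notebook below, which label-encodes and standard-scales instead:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, RobustScaler

# Hypothetical miniature train/test frames standing in for the split Adult data
train = pd.DataFrame({
    "age": [25, 38, 52, 41],
    "workclass": ["Private", "State-gov", "Private", None],
})
test = pd.DataFrame({
    "age": [30, 60],
    "workclass": ["Local-gov", None],
})
categorical = ["workclass"]
numerical = ["age"]

# Fill missing categories with the most frequent value learned from the training set
mode = train["workclass"].mode()[0]
train = train.fillna({"workclass": mode})
test = test.fillna({"workclass": mode})

# One-hot encode categorical columns and robust-scale numeric ones.
# handle_unknown="ignore" encodes categories unseen in training as all zeros;
# sparse_threshold=0.0 forces a dense array as output.
preprocess = ColumnTransformer(
    transformers=[
        ("num", RobustScaler(), numerical),
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
    ],
    sparse_threshold=0.0,
)
X_train_enc = preprocess.fit_transform(train)  # fit on the training data only
X_test_enc = preprocess.transform(test)        # reuse the fitted statistics on the test data

print(X_train_enc.shape, X_test_enc.shape)
```

Fitting the imputation value, encoder and scaler on the training split and only applying them to the test split keeps test-set statistics out of the model.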

Task 3: Implement KNN Algorithm on Sklearn Dataset with k=5.

Task 4: Implement Naïve Bayes Algorithm on the given dataset.

Task 5: Compare the confusion matrix for both classifiers.

Task 6: Compare the accuracy score of both classifiers.

Task 7: Draw the ROC curve to compare both models.

Follow the instructions provided for each task and analyze the results to gain a better understanding of the Naïve Bayes algorithm and its comparison with KNN.

# import libraries
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
# Reading the Dataset and loading as a dataframe.
df = pd.read_csv("/content/adultPrac7.csv")

EDA:

df.shape
(32561, 15)
df.head(15)

    age         workclass  fnlwgt     education  education_num         marital_status         occupation
0    39         State-gov   77516     Bachelors             13          Never-married       Adm-clerical
1    50  Self-emp-not-inc   83311     Bachelors             13     Married-civ-spouse    Exec-managerial
2    38           Private  215646       HS-grad              9               Divorced  Handlers-cleaners
3    53           Private  234721          11th              7     Married-civ-spouse  Handlers-cleaners
4    28           Private  338409     Bachelors             13     Married-civ-spouse     Prof-specialty
5    37           Private  284582       Masters             14     Married-civ-spouse    Exec-managerial
6    49           Private  160187           9th              5  Married-spouse-absent      Other-service
7    52  Self-emp-not-inc  209642       HS-grad              9     Married-civ-spouse    Exec-managerial
8    31           Private   45781       Masters             14          Never-married     Prof-specialty
9    42           Private  159449     Bachelors             13     Married-civ-spouse    Exec-managerial
10   37           Private  280464  Some-college             10     Married-civ-spouse    Exec-managerial
11   30         State-gov  141297     Bachelors             13     Married-civ-spouse     Prof-specialty
12   23           Private  122272     Bachelors             13          Never-married       Adm-clerical
13   32           Private  205019    Assoc-acdm             12          Never-married              Sales
14   40           Private  121772     Assoc-voc             11     Married-civ-spouse       Craft-repair

     relationship                race     sex  capital_gain  capital_loss  hours_per_week  native_country  income
0   Not-in-family               White    Male          2174             0              40   United-States   <=50K
1         Husband               White    Male             0             0              13   United-States   <=50K
2   Not-in-family               White    Male             0             0              40   United-States   <=50K
3         Husband               Black    Male             0             0              40   United-States   <=50K
4            Wife               Black  Female             0             0              40            Cuba   <=50K
5            Wife               White  Female             0             0              40   United-States   <=50K
6   Not-in-family               Black  Female             0             0              16         Jamaica   <=50K
7         Husband               White    Male             0             0              45   United-States    >50K
8   Not-in-family               White  Female         14084             0              50   United-States    >50K
9         Husband               White    Male          5178             0              40   United-States    >50K
10        Husband               Black    Male             0             0              80   United-States    >50K
11        Husband  Asian-Pac-Islander    Male             0             0              40           India    >50K
12      Own-child               White  Female             0             0              30   United-States   <=50K
13  Not-in-family               Black    Male             0             0              50   United-States   <=50K
14        Husband  Asian-Pac-Islander    Male             0             0              40               ?    >50K

df.dtypes
age                int64
workclass         object
fnlwgt             int64
education         object
education_num      int64
marital_status    object
occupation        object
relationship      object
race              object
sex               object
capital_gain       int64
capital_loss       int64
hours_per_week     int64
native_country    object
income            object
dtype: object
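# Note: without parentheses, df.describe is the bound method itself, so the notebook prints its repr (the whole dataframe).
# Call df.describe() to get summary statistics of the numeric columns instead.
df.describe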
<bound method NDFrame.describe of        age          workclass  fnlwgt    education  education_num  \
0       39          State-gov   77516    Bachelors             13   
1       50   Self-emp-not-inc   83311    Bachelors             13   
2       38            Private  215646      HS-grad              9   
3       53            Private  234721         11th              7   
4       28            Private  338409    Bachelors             13   
...    ...                ...     ...          ...            ...   
32556   27            Private  257302   Assoc-acdm             12   
32557   40            Private  154374      HS-grad              9   
32558   58            Private  151910      HS-grad              9   
32559   22            Private  201490      HS-grad              9   
32560   52       Self-emp-inc  287927      HS-grad              9   

            marital_status          occupation    relationship    race  \
0            Never-married        Adm-clerical   Not-in-family   White   
1       Married-civ-spouse     Exec-managerial         Husband   White   
2                 Divorced   Handlers-cleaners   Not-in-family   White   
3       Married-civ-spouse   Handlers-cleaners         Husband   Black   
4       Married-civ-spouse      Prof-specialty            Wife   Black   
...                    ...                 ...             ...     ...   
32556   Married-civ-spouse        Tech-support            Wife   White   
32557   Married-civ-spouse   Machine-op-inspct         Husband   White   
32558              Widowed        Adm-clerical       Unmarried   White   
32559        Never-married        Adm-clerical       Own-child   White   
32560   Married-civ-spouse     Exec-managerial            Wife   White   

           sex  capital_gain  capital_loss  hours_per_week  native_country  \
0         Male          2174             0              40   United-States   
1         Male             0             0              13   United-States   
2         Male             0             0              40   United-States   
3         Male             0             0              40   United-States   
4       Female             0             0              40            Cuba   
...        ...           ...           ...             ...             ...   
32556   Female             0             0              38   United-States   
32557     Male             0             0              40   United-States   
32558   Female             0             0              40   United-States   
32559     Male             0             0              20   United-States   
32560   Female         15024             0              40   United-States   

       income  
0       <=50K  
1       <=50K  
2       <=50K  
3       <=50K  
4       <=50K  
...       ...  
32556   <=50K  
32557    >50K  
32558   <=50K  
32559   <=50K  
32560    >50K  

[32561 rows x 15 columns]>
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32561 entries, 0 to 32560
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   age             32561 non-null  int64 
 1   workclass       32561 non-null  object
 2   fnlwgt          32561 non-null  int64 
 3   education       32561 non-null  object
 4   education_num   32561 non-null  int64 
 5   marital_status  32561 non-null  object
 6   occupation      32561 non-null  object
 7   relationship    32561 non-null  object
 8   race            32561 non-null  object
 9   sex             32561 non-null  object
 10  capital_gain    32561 non-null  int64 
 11  capital_loss    32561 non-null  int64 
 12  hours_per_week  32561 non-null  int64 
 13  native_country  32561 non-null  object
 14  income          32561 non-null  object
dtypes: int64(6), object(9)
memory usage: 3.7+ MB
# Checking Labels in workclass variable:
df.workclass.unique()
array([' State-gov', ' Self-emp-not-inc', ' Private', ' Federal-gov',
       ' Local-gov', ' ?', ' Self-emp-inc', ' Without-pay',
       ' Never-worked'], dtype=object)
# Show the number of rows in each workclass category.
df.workclass.value_counts()
 Private             22696
 Self-emp-not-inc     2541
 Local-gov            2093
 ?                    1836
 State-gov            1298
 Self-emp-inc         1116
 Federal-gov           960
 Without-pay            14
 Never-worked            7
Name: workclass, dtype: int64
# Replace ' ?' with NaN so that pandas treats these entries as missing values.
df['workclass'] = df['workclass'].replace(' ?', np.nan)
df.workclass.value_counts()
 Private             22696
 Self-emp-not-inc     2541
 Local-gov            2093
 State-gov            1298
 Self-emp-inc         1116
 Federal-gov           960
 Without-pay            14
 Never-worked            7
Name: workclass, dtype: int64
df.workclass.unique()
array([' State-gov', ' Self-emp-not-inc', ' Private', ' Federal-gov',
       ' Local-gov', nan, ' Self-emp-inc', ' Without-pay',
       ' Never-worked'], dtype=object)
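# Replace ' ?' with NaN in the other two affected columns as well
# (capital_gain is numeric and contains no ' ?' entries, so it needs no replacement).
df['occupation'] = df['occupation'].replace(' ?', np.nan)
df['native_country'] = df['native_country'].replace(' ?', np.nan)
# Check for any remaining ' ?' values (note the leading space) in every feature
(df == ' ?').sum()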
age               0
workclass         0
fnlwgt            0
education         0
education_num     0
marital_status    0
occupation        0
relationship      0
race              0
sex               0
capital_gain      0
capital_loss      0
hours_per_week    0
native_country    0
income            0
dtype: int64
# X holds the features, Y holds the labels
X = df.drop(['income'], axis = 1)
Y = df["income"]
X.dtypes
age                int64
workclass         object
fnlwgt             int64
education         object
education_num      int64
marital_status    object
occupation        object
relationship      object
race              object
sex               object
capital_gain       int64
capital_loss       int64
hours_per_week     int64
native_country    object
dtype: object
Y.dtypes
dtype('O')
# Displaying the categorical features:
categorical = [col for col in X.columns if X[col].dtypes == 'O']
categorical
['workclass',
 'education',
 'marital_status',
 'occupation',
 'relationship',
 'race',
 'sex',
 'native_country']
# Displaying the numerical features:
numerical = [col for col in X.columns if X[col].dtypes != 'O']
numerical
['age',
 'fnlwgt',
 'education_num',
 'capital_gain',
 'capital_loss',
 'hours_per_week']
# Fraction of missing values in each categorical feature (computed on the full feature set, before splitting)
X[categorical].isnull().mean()
workclass         0.056386
education         0.000000
marital_status    0.000000
occupation        0.056601
relationship      0.000000
race              0.000000
sex               0.000000
native_country    0.017905
dtype: float64
# These are categorical columns, so the mean is undefined; impute missing values with the mode instead.
# Three features (workclass, occupation and native_country) have null values; fill each with its most frequent value.
for col in ['workclass', 'occupation', 'native_country']:
    X[col] = X[col].fillna(X[col].mode()[0])
# Checking missing values in the feature set:
X.isnull().sum()
age               0
workclass         0
fnlwgt            0
education         0
education_num     0
marital_status    0
occupation        0
relationship      0
race              0
sex               0
capital_gain      0
capital_loss      0
hours_per_week    0
native_country    0
dtype: int64
# Now we do label encoding after eliminating all null values from the dataset:
X[categorical].head()

           workclass  education      marital_status         occupation   relationship   race     sex  native_country
0          State-gov  Bachelors       Never-married       Adm-clerical  Not-in-family  White    Male   United-States
1   Self-emp-not-inc  Bachelors  Married-civ-spouse    Exec-managerial        Husband  White    Male   United-States
2            Private    HS-grad            Divorced  Handlers-cleaners  Not-in-family  White    Male   United-States
3            Private       11th  Married-civ-spouse  Handlers-cleaners        Husband  Black    Male   United-States
4            Private  Bachelors  Married-civ-spouse     Prof-specialty           Wife  Black  Female            Cuba

from sklearn import preprocessing
label_encoder = preprocessing.LabelEncoder()

for i in X[categorical]:
  X[i] = label_encoder.fit_transform(X[i])

# The loop above replaces the need to transform each categorical feature individually, like this:
# X['workclass'] = label_encoder.fit_transform(X['workclass'])
# X['education'] = label_encoder.fit_transform(X['education'])
# X['marital_status'] = label_encoder.fit_transform(X['marital_status'])
# X['occupation'] = label_encoder.fit_transform(X['occupation'])
# X['relationship'] = label_encoder.fit_transform(X['relationship'])
# X['race'] = label_encoder.fit_transform(X['race'])
# X['sex'] = label_encoder.fit_transform(X['sex'])
# X['native_country'] = label_encoder.fit_transform(X['native_country'])
X.head(15)

    age  workclass  fnlwgt  education  education_num  marital_status  occupation  relationship  race  sex  capital_gain  capital_loss  hours_per_week  native_country
0    39          6   77516          9             13               4           0             1     4    1          2174             0              40              38
1    50          5   83311          9             13               2           3             0     4    1             0             0              13              38
2    38          3  215646         11              9               0           5             1     4    1             0             0              40              38
3    53          3  234721          1              7               2           5             0     2    1             0             0              40              38
4    28          3  338409          9             13               2           9             5     2    0             0             0              40               4
5    37          3  284582         12             14               2           3             5     4    0             0             0              40              38
6    49          3  160187          6              5               3           7             1     2    0             0             0              16              22
7    52          5  209642         11              9               2           3             0     4    1             0             0              45              38
8    31          3   45781         12             14               4           9             1     4    0         14084             0              50              38
9    42          3  159449          9             13               2           3             0     4    1          5178             0              40              38
10   37          3  280464         15             10               2           3             0     2    1             0             0              80              38
11   30          6  141297          9             13               2           9             0     1    1             0             0              40              18
12   23          3  122272          9             13               4           0             3     4    0             0             0              30              38
13   32          3  205019          7             12               4          11             1     2    1             0             0              50              38
14   40          3  121772          8             11               2           2             0     1    1             0             0              40              38

# Split the data into training and test sets (scaling is applied afterwards, since each feature has a different range):
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size = 0.3, random_state = 0)
# Checking the shape of training and testing samples after splitting
X_train.shape, X_test.shape
((22792, 14), (9769, 14))
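# Standardize the features: fit the scaler on the training set only, then reuse it on the test set
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)  # transform only; refitting on the test set would leak test statistics
# Note: the outputs shown below came from a run that called fit_transform on X_test, so rerunning may give slightly different numbers.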
# Train a Gaussian Naive Bayes Classifier on the training set
from sklearn.naive_bayes import GaussianNB

# instantiate the model
gnb = GaussianNB()

# Fit the model:
gnb.fit(X_train, y_train)
GaussianNB()
y_pred = gnb.predict(X_test)
y_pred
array([' <=50K', ' <=50K', ' <=50K', ..., ' >50K', ' <=50K', ' <=50K'],
      dtype='<U6')
from sklearn.metrics import confusion_matrix, accuracy_score
cm = confusion_matrix(y_test, y_pred)
ac = accuracy_score(y_test, y_pred)
print(cm, "\n", ac)
[[7037  370]
 [1510  852]] 
 0.8075545091616337
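# scikit-learn puts actual classes on rows and predicted classes on columns.
# ' <=50K' (index 0) is treated as the positive class here.
print("True Positives : ", cm[0, 0])
print("True Negatives : ", cm[1, 1])
print("False Positives : ", cm[1, 0])  # actual >50K, predicted <=50K
print("False Negatives : ", cm[0, 1])  # actual <=50K, predicted >50K
True Positives :  7037
True Negatives :  852
False Positives :  1510
False Negatives :  370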
# Rows of cm are actual classes and columns are predicted classes, so label the heatmap accordingly
cm_matrix = pd.DataFrame(data = cm, index = ["Actual <=50K", "Actual >50K"], columns = ["Predicted <=50K", "Predicted >50K"])
sns.heatmap(cm_matrix, annot = True, fmt = 'd', cmap = 'YlGnBu')
<AxesSubplot:>

[Figure: heatmap of the Naïve Bayes confusion matrix]

from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))
              precision    recall  f1-score   support

       <=50K       0.82      0.95      0.88      7407
        >50K       0.70      0.36      0.48      2362

    accuracy                           0.81      9769
   macro avg       0.76      0.66      0.68      9769
weighted avg       0.79      0.81      0.78      9769
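
The figures in the classification report above can be reproduced by hand from the Naïve Bayes confusion matrix. A minimal sketch, hard-coding the matrix printed earlier and assuming scikit-learn's convention (rows are actual classes, columns are predicted classes); the helper name `per_class_metrics` is made up for illustration:

```python
# Confusion matrix reported above for Gaussian Naive Bayes.
# Rows = actual class, columns = predicted class, in the order [<=50K, >50K].
cm_nb = [[7037, 370],
         [1510, 852]]
labels = ["<=50K", ">50K"]

def per_class_metrics(cm, k):
    """Precision, recall and F1 for class index k (hypothetical helper)."""
    tp = cm[k][k]
    predicted_k = sum(row[k] for row in cm)  # column sum: everything predicted as class k
    actual_k = sum(cm[k])                    # row sum: everything that really is class k
    precision = tp / predicted_k
    recall = tp / actual_k
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

for k, name in enumerate(labels):
    p, r, f1 = per_class_metrics(cm_nb, k)
    print(f"{name:>6}  precision={p:.2f}  recall={r:.2f}  f1={f1:.2f}")

# Accuracy is the diagonal (correct predictions) divided by the total count
accuracy = (cm_nb[0][0] + cm_nb[1][1]) / sum(map(sum, cm_nb))
print(f"accuracy={accuracy:.4f}")
```

Rounded to two decimals, these values match the report: the low recall of 0.36 for >50K comes from the 1510 high-income rows that the model labelled as <=50K.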
from sklearn.neighbors import KNeighborsClassifier
classifier = KNeighborsClassifier(n_neighbors = 5, p = 2) # Use the 5 nearest neighbours (the default value)
# and p = 2 (the default value), which gives the Euclidean distance
classifier.fit(X_train, y_train) # Feed the training data to the classifier.
KNeighborsClassifier()
y_pred = classifier.predict(X_test) # Predicting for x_test data
y_pred
array([' <=50K', ' <=50K', ' <=50K', ..., ' >50K', ' >50K', ' <=50K'],
      dtype=object)
cm = confusion_matrix(y_test, y_pred)
ac = accuracy_score(y_test, y_pred)
print(cm, "\n", ac)
[[6685  722]
 [ 990 1372]] 
 0.8247517657897431
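
Task 7 asks for a ROC curve comparing both models, which the cells above do not cover. The following is a rough sketch of one way to obtain the curve points and the AUC with scikit-learn. It runs on synthetic data from `make_classification` rather than the Adult dataset, and the names `X_demo`, `y_demo` and `roc_results` are invented for the example:

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic binary classification data (a stand-in, NOT the Adult dataset)
X_demo, y_demo = make_classification(n_samples=1000, n_features=10, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)

# Fit the scaler on the training split only, then apply it to both splits
demo_scaler = StandardScaler().fit(Xtr)
Xtr, Xte = demo_scaler.transform(Xtr), demo_scaler.transform(Xte)

models = [
    ("Naive Bayes", GaussianNB()),
    ("KNN (k=5)", KNeighborsClassifier(n_neighbors=5)),
]

roc_results = {}
for name, model in models:
    model.fit(Xtr, ytr)
    # ROC needs a continuous score: use the predicted probability of the positive class
    scores = model.predict_proba(Xte)[:, 1]
    fpr, tpr, _ = roc_curve(yte, scores)
    roc_results[name] = (fpr, tpr, roc_auc_score(yte, scores))
    print(f"{name}: AUC = {roc_results[name][2]:.3f}")
```

Each `(fpr, tpr)` pair can then be drawn with `plt.plot(fpr, tpr, label=name)` on shared axes. On the Adult data the labels are strings, so `roc_curve` would need `pos_label=' >50K'`, with the scores taken from the `predict_proba` column that corresponds to that class in `model.classes_`.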

Conclusion:

  1. The accuracy of Naïve Bayes is 80.8%.
  2. The accuracy of KNN (with k = 5) is 82.5%.

From these results, KNN performs slightly better than Naïve Bayes on this test split.

Learnt the following from the above experiment:

  1. Implement Naïve Bayes technique for the classification
  2. Compare results of Naïve Bayes and KNN
  3. Understand and infer results of different classification metrics


Srihari Thyagarajan
B Tech AI Senior Student

Hi, I’m Haleshot, a final-year student studying B Tech Artificial Intelligence. I like projects relating to ML, AI, DL, CV, NLP, Image Processing, etc. Currently exploring Python, FastAPI, projects involving AI and platforms such as HuggingFace and Kaggle.
