EDA, One-Hot and Label Encoding

Exploring Data with Pandas Techniques

Image credit: Analytics Arora

Data Exploration using Pandas Library

This README provides instructions and background for understanding and implementing data exploration techniques using the Pandas library. The tasks are performed on a car dataset (Toyota.csv) and on the “adults” dataset.

Table of Contents

  1. Aim
  2. Prerequisite
  3. Outcome
  4. Theory
  5. Task 1: Exploratory Data Analysis on Car Dataset
  6. Task 2: One Hot and Label Encoding on “adults” Dataset

Aim

The aim of this project is to understand and implement data exploration techniques using the Pandas library.

Prerequisite

In order to complete this experiment, you should have prior knowledge of Python programming and the Pandas library.

Outcome

After successfully completing this experiment, you will be able to:

  • Read different types of data files (csv, excel, text file, etc.).
  • Obtain metadata of a given dataset.
  • Understand finding null values and replacing them.
  • Understand and implement class label encoding.
  • Understand and implement one hot encoding.

Theory

Exploratory Data Analysis (EDA)

Exploratory Data Analysis (EDA) is an open-ended process where we calculate statistics and make figures to find trends, anomalies, patterns, or relationships within the data. The goal of EDA is to learn what our data can tell us. It generally starts out with a high-level overview, then narrows in to specific areas as we find intriguing areas of the data. The findings may be interesting in their own right, or they can be used to inform our modeling choices, such as by helping us decide which features to use.

Pandas Library

Pandas is a powerful Python library for data manipulation and analysis. It provides a DataFrame, which is a 2-dimensional labeled data structure with columns of potentially different types. You can think of it like a spreadsheet or SQL table, or a dict of Series objects. It is generally the most commonly used pandas object. The DataFrame accepts many different kinds of input, including dictionaries, lists, arrays, and other DataFrames.
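
As a minimal sketch of the point about input types, the snippet below builds the same small DataFrame from a dictionary and from a list of rows. The column names and values are hypothetical and are not taken from the car dataset.

```python
import pandas as pd

# Hypothetical toy records, used only for illustration
from_dict = pd.DataFrame({"model": ["A", "B", "C"], "doors": [3, 5, 4]})
from_rows = pd.DataFrame([["A", 3], ["B", 5], ["C", 4]], columns=["model", "doors"])

# Each column is a Series and can have its own dtype
print(from_dict.dtypes)

# Both constructions describe the same table
print(from_dict.equals(from_rows))
```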

Encoding

One Hot Encoding

One-hot encoding converts categorical data into numeric data by splitting the column into multiple columns. Each unique value in the column becomes a new column, and the values are replaced by 1s and 0s, depending on which column has what value.
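
A small illustration of the idea, assuming a made-up FuelType column with three distinct values: each value becomes its own 0/1 column.

```python
import pandas as pd

# Hypothetical toy column, used only for illustration
fuel = pd.DataFrame({"FuelType": ["Petrol", "Diesel", "CNG", "Petrol"]})

# One new column per unique value; dtype=int asks for 1s and 0s rather than booleans
encoded = pd.get_dummies(fuel, columns=["FuelType"], dtype=int)
print(encoded)
```

Every row has exactly one 1, in the column that matches its original value.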

Label Encoding

Label encoding is a simple approach that involves converting each value in a column into a number. Each unique value is assigned a unique integer label.
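
A minimal sketch of label encoding using only pandas category codes (an assumption made for illustration; scikit-learn's LabelEncoder is a common alternative). The toy values are hypothetical.

```python
import pandas as pd

# Hypothetical toy column, used only for illustration
fuel = pd.Series(["Petrol", "Diesel", "CNG", "Petrol"], name="FuelType")

# Categories are sorted, so CNG -> 0, Diesel -> 1, Petrol -> 2
as_category = fuel.astype("category")
labels = as_category.cat.codes
mapping = dict(enumerate(as_category.cat.categories))

print(labels.tolist())
print(mapping)
```

The integers carry an implied order (0 < 1 < 2) that the original categories may not have, which is one reason one-hot encoding is often preferred for unordered categories.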

Task 1: Exploratory Data Analysis on Car Dataset

Perform the following exploratory data analysis tasks on the car dataset:

  1. Read the Toyota.csv file into a DataFrame.
  2. Explore the size, shape, and data types of each column in the dataset.
  3. List down the columns of the dataset.
  4. Find out the ‘Fuel Type’ for the 4th row.
  5. Find out the value for the second column for the 4th row.
  6. Select all rows for the column “Fuel Type”.
  7. Select all rows for the columns “KM”, “HP”, and “Automatic”.
  8. Display the first five rows for columns 2 to 4 (excluding row 5 and column 4).
  9. Display the info of the dataset and state your observations.
  10. Identify unique values for the columns “KM”, “HP”, and “Doors”.
  11. Create a new data frame, replacing “???” with NaN.
  12. Replace the categorical values in the “Doors” column with their corresponding numeric values.
  13. Convert the data types of columns “Doors”, “MetColor”, and “Automatic” to int and object.
  14. Identify the total number of null values in each column of the dataset.
  15. Drop rows with null values.
  16. Identify the total number of cars that run on “Petrol”, “Diesel”, or “CNG”.
  17. Identify the mean of “KM” for the cars that run on “Diesel”.

Task 2: One Hot and Label Encoding on “adults” Dataset

Perform one hot encoding and label encoding on the relationship column of the “adults” dataset.

# ML Practical Experiment 2
# import libraries
import pandas as pd
import numpy as np
import statistics as st

Task 1

# Toyota.csv is a CSV file, so it is read with read_csv rather than read_excel
df = pd.read_csv("/content/Toyota.csv")
df

      Price   Age     KM  FuelType   HP  MetColor  Automatic    CC  Doors  Weight
0     13500  23.0  46986    Diesel   90       1.0          0  2000  three    1165
1     13750  23.0  72937    Diesel   90       1.0          0  2000      3    1165
2     13950  24.0  41711    Diesel   90       NaN          0  2000      3    1165
3     14950  26.0  48000    Diesel   90       0.0          0  2000      3    1165
4     13750  30.0  38500    Diesel   90       0.0          0  2000      3    1170
...     ...   ...    ...       ...  ...       ...        ...   ...    ...     ...
1431   7500   NaN  20544    Petrol   86       1.0          0  1300      3    1025
1432  10845  72.0     ??    Petrol   86       0.0          0  1300      3    1015
1433   8500   NaN  17016    Petrol   86       0.0          0  1300      3    1015
1434   7250  70.0     ??       NaN   86       1.0          0  1300      3    1015
1435   6950  76.0      1    Petrol  110       0.0          0  1600      5    1114

1436 rows × 10 columns

# Size
df.size
14360
# Shape
df.shape
(1436, 10)
# Data Types
df.dtypes
Price          int64
Age          float64
KM            object
FuelType      object
HP            object
MetColor      object
Automatic     object
CC             int64
Doors          int64
Weight         int64
dtype: object
# Columns of a Dataset
for column in df.columns:
  print(column)
Price
Age
KM
FuelType
HP
MetColor
Automatic
CC
Doors
Weight
# Fuel Type of the 4th row
df['FuelType'][3]
'Diesel'
# Value in the second column for the 4th row (positions are 0-based: row 3, column 1)
df.iloc[3, 1]
26.0
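
The difference between label-based and position-based selection can be sketched on a hypothetical four-row miniature of the car table (the values below are illustrative, not the real file):

```python
import pandas as pd

# Hypothetical miniature of the car table
cars = pd.DataFrame({
    "Price": [13500, 13750, 13950, 14950],
    "Age": [23.0, 23.0, 24.0, 26.0],
    "FuelType": ["Diesel", "Diesel", "Diesel", "Petrol"],
})

fourth_row_fuel = cars.loc[3, "FuelType"]   # label-based: row label 3, column name
fourth_row_second_col = cars.iloc[3, 1]     # position-based: 4th row, 2nd column
block = cars.iloc[0:2, 0:2]                 # slices exclude the end position

print(fourth_row_fuel, fourth_row_second_col)
print(block)
```

With the default integer index, row label 3 and row position 3 coincide; after filtering or re-indexing they may not, so it helps to pick loc or iloc deliberately.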
df['FuelType']
0       Diesel
1       Diesel
2       Diesel
3       Diesel
4       Diesel
         ...  
1431    Petrol
1432    Petrol
1433    Petrol
1434         0
1435    Petrol
Name: FuelType, Length: 1436, dtype: object
df[["FuelType", "KM", "HP"]]

      FuelType     KM   HP
0       Diesel  46986   90
1       Diesel  72937   90
2       Diesel  41711   90
3       Diesel  48000   90
4       Diesel  38500   90
...        ...    ...  ...
1431    Petrol  20544   86
1432    Petrol    NaN   86
1433    Petrol  17016   86
1434         0    NaN   86
1435    Petrol      1  110

1436 rows × 3 columns

# Rows at positions 1-4 and columns at positions 2-3 (iloc slices exclude the end position, so row 5 and column 4 are left out)
df.iloc[1:5, 2:4]

      KM FuelType
1  72937   Diesel
2  41711   Diesel
3  48000   Diesel
4  38500   Diesel

# Info of dataset:
df

      Price   Age     KM  FuelType   HP  MetColor  Automatic    CC  Doors  Weight
0     13500  23.0  46986    Diesel   90       1.0          0  2000      3    1165
1     13750  23.0  72937    Diesel   90       1.0          0  2000      3    1165
2     13950  24.0  41711    Diesel   90       0.0          0  2000      3    1165
3     14950  26.0  48000    Diesel   90       0.0          0  2000      3    1165
4     13750  30.0  38500    Diesel   90       0.0          0  2000      3    1170
...     ...   ...    ...       ...  ...       ...        ...   ...    ...     ...
1431   7500   0.0  20544    Petrol   86       1.0          0  1300      3    1025
1432  10845  72.0    NaN    Petrol   86       0.0          0  1300      3    1015
1433   8500   0.0  17016    Petrol   86       0.0          0  1300      3    1015
1434   7250  70.0    NaN         0   86       1.0          0  1300      3    1015
1435   6950  76.0      1    Petrol  110       0.0          0  1600      5    1114

1436 rows × 10 columns


Observations from the Dataset:

From the dataset we observe that:

  1. The KM (kilometres) and Doors columns should hold integers, but some of their entries are not numeric ("??" in KM and words such as "three" in Doors).
  2. As a result, both columns are stored with the “object” datatype.
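
One way to locate the entries that force a column to the object dtype is to attempt a numeric conversion and look at what fails. The sketch below uses a hypothetical KM-like column; the values are illustrative.

```python
import pandas as pd

# Hypothetical column mixing numbers with a "??" marker
km = pd.Series([46986, "??", 41711, "??", 38500], name="KM")

# Entries that cannot be parsed as numbers become NaN
numeric_km = pd.to_numeric(km, errors="coerce")

# The original values at those positions are the offending entries
bad_entries = km[numeric_km.isna()]

print(km.dtype, numeric_km.dtype)
print(bad_entries.tolist())
```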
df["KM"].unique()
array([46986, 72937, 41711, ..., 30964, 20544, 17016], dtype=object)
df["HP"].unique()
array([90, '????', 192, 110, 97, 71, 116, 98, 69, 86, 72, 107, 73],
      dtype=object)
df["Doors"].unique()
array(['three', 3, 5, 4, 'four', 'five', 2], dtype=object)
df[["KM","HP","Doors"]].nunique()
KM       1256
HP         13
Doors       7
dtype: int64
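# Fill the genuinely missing cells with 0, then turn the "??" markers into the string 'NaN'
df = df.fillna(0)
df.replace('??', 'NaN', inplace = True)

# Forming a new dataframe; any remaining "??" / "????" markers become the string "NAN".
# These are string placeholders, so isnull() and dropna() do not treat them as missing.
newdf = df.replace(to_replace = ["??","????"], value = "NAN")
newdf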

      Price   Age     KM  FuelType   HP  MetColor  Automatic    CC  Doors  Weight
0     13500  23.0  46986    Diesel   90       1.0          0  2000  three    1165
1     13750  23.0  72937    Diesel   90       1.0          0  2000      3    1165
2     13950  24.0  41711    Diesel   90       0.0          0  2000      3    1165
3     14950  26.0  48000    Diesel   90       0.0          0  2000      3    1165
4     13750  30.0  38500    Diesel   90       0.0          0  2000      3    1170
...     ...   ...    ...       ...  ...       ...        ...   ...    ...     ...
1431   7500   0.0  20544    Petrol   86       1.0          0  1300      3    1025
1432  10845  72.0    NaN    Petrol   86       0.0          0  1300      3    1015
1433   8500   0.0  17016    Petrol   86       0.0          0  1300      3    1015
1434   7250  70.0    NaN         0   86       1.0          0  1300      3    1015
1435   6950  76.0      1    Petrol  110       0.0          0  1600      5    1114

1436 rows × 10 columns
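
If the goal is for the markers to count as real missing values, one option is to replace them with np.nan instead of a string. This is a sketch on a hypothetical miniature of the table, not the notebook's own approach.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the car table
raw = pd.DataFrame({
    "KM": [46986, "??", 41711, 38500],
    "HP": [90, 90, "????", 86],
    "FuelType": ["Diesel", "Petrol", None, "Petrol"],
})

# np.nan is recognised by isnull() and dropna(); the string "NaN" is not
cleaned = raw.replace(["??", "????"], np.nan)
null_counts = cleaned.isnull().sum()
complete_rows = cleaned.dropna()

print(null_counts)
print(complete_rows)
```

pd.read_csv also accepts an na_values argument (for example na_values=["??", "????"]), which may be a simpler way to get the same effect at load time.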

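# Replace the door counts written as words with their numeric values
# (assigning the result back avoids relying on an in-place replace on a selected column)
newdf["Doors"] = newdf["Doors"].replace(["three", "four", "five"], [3, 4, 5])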
newdf

      Price   Age     KM  FuelType   HP  MetColor  Automatic    CC  Doors  Weight
0     13500  23.0  46986    Diesel   90       1.0          0  2000      3    1165
1     13750  23.0  72937    Diesel   90       1.0          0  2000      3    1165
2     13950  24.0  41711    Diesel   90       0.0          0  2000      3    1165
3     14950  26.0  48000    Diesel   90       0.0          0  2000      3    1165
4     13750  30.0  38500    Diesel   90       0.0          0  2000      3    1170
...     ...   ...    ...       ...  ...       ...        ...   ...    ...     ...
1431   7500   0.0  20544    Petrol   86       1.0          0  1300      3    1025
1432  10845  72.0    NaN    Petrol   86       0.0          0  1300      3    1015
1433   8500   0.0  17016    Petrol   86       0.0          0  1300      3    1015
1434   7250  70.0    NaN         0   86       1.0          0  1300      3    1015
1435   6950  76.0      1    Petrol  110       0.0          0  1600      5    1114

1436 rows × 10 columns

newdf["Doors"] = newdf["Doors"].astype(int)
newdf["MetColor"] = newdf["MetColor"].astype(object)
newdf["Automatic"] = newdf["Automatic"].astype(object)
# Reload the raw file so that the original missing values can be counted
df_1 = pd.read_csv("/content/Toyota.csv")
df_1.isnull().sum()
Price          0
Age          100
KM             0
FuelType     100
HP             0
MetColor     150
Automatic      0
CC             0
Doors          0
Weight         0
dtype: int64
newdf.dropna()

      Price   Age     KM  FuelType   HP  MetColor  Automatic    CC  Doors  Weight
0     13500  23.0  46986    Diesel   90       1.0          0  2000      3    1165
1     13750  23.0  72937    Diesel   90       1.0          0  2000      3    1165
2     13950  24.0  41711    Diesel   90       0.0          0  2000      3    1165
3     14950  26.0  48000    Diesel   90       0.0          0  2000      3    1165
4     13750  30.0  38500    Diesel   90       0.0          0  2000      3    1170
...     ...   ...    ...       ...  ...       ...        ...   ...    ...     ...
1431   7500   0.0  20544    Petrol   86       1.0          0  1300      3    1025
1432  10845  72.0    NaN    Petrol   86       0.0          0  1300      3    1015
1433   8500   0.0  17016    Petrol   86       0.0          0  1300      3    1015
1434   7250  70.0    NaN         0   86       1.0          0  1300      3    1015
1435   6950  76.0      1    Petrol  110       0.0          0  1600      5    1114

1436 rows × 10 columns

newdf["FuelType"].value_counts()
Petrol    1177
Diesel     144
0          100
CNG         15
Name: FuelType, dtype: int64
# indexKM = newdf[(newdf['KM'] == 'NAN')].index
# newdf.drop(indexKM, inplace = True)
# newdf = newdf.reset_index()
newdf
# Mean of KM for the Diesel cars; placeholder strings are coerced to NaN and skipped by mean()
diesel_km = pd.to_numeric(newdf.loc[newdf['FuelType'] == 'Diesel', 'KM'], errors='coerce')
diesel_km.mean()
114927.87857142858
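
The same kind of figure can be obtained for every fuel type at once with groupby. The sketch below uses a hypothetical five-row table, so its numbers are unrelated to the result above.

```python
import pandas as pd

# Hypothetical rows; "NaN" is a string placeholder as in the cleaned car data
cars = pd.DataFrame({
    "FuelType": ["Diesel", "Diesel", "Petrol", "Diesel", "CNG"],
    "KM": [46986, "NaN", 20544, 38500, 1],
})

# Convert KM to numbers first so that placeholders are ignored by mean()
km = pd.to_numeric(cars["KM"], errors="coerce")
mean_by_fuel = km.groupby(cars["FuelType"]).mean()

print(mean_by_fuel)
```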

Task 2

from sklearn.preprocessing import OneHotEncoder
# adult.csv is a CSV file, so it is read with read_csv rather than read_excel
df = pd.read_csv("/content/adult.csv")
df_1 = pd.read_csv("/content/adult.csv")
# Checking for the labels in the categorical parameters
df_1["relationship"].unique()
array(['Own-child', 'Husband', 'Not-in-family', 'Unmarried', 'Wife',
       'Other-relative'], dtype=object)
# Checking for the label counts in the categorical parameters
df_1["relationship"].value_counts()
Husband           19716
Not-in-family     12583
Own-child          7581
Unmarried          5125
Wife               2331
Other-relative     1506
Name: relationship, dtype: int64

Method 1:

One-hot encoding using the scikit-learn library:

# Creating an instance of the one-hot encoder
encoder = OneHotEncoder(handle_unknown='ignore')

# Perform one-hot encoding on 'relationship' column 
encoder_df = pd.DataFrame(encoder.fit_transform(df[['relationship']]).toarray())
# Merging one-hot encoded columns back with original DataFrame df.
final_df = df.join(encoder_df)
final_df

       age     workclass  fnlwgt     education  educational-num      marital-status         occupation  relationship   race  gender  ...  capital-loss  hours-per-week  native-country  income    0    1    2    3    4    5
0       25       Private  226802          11th                7       Never-married  Machine-op-inspct     Own-child  Black    Male  ...             0              40   United-States   <=50K  0.0  0.0  0.0  1.0  0.0  0.0
1       38       Private   89814       HS-grad                9  Married-civ-spouse    Farming-fishing       Husband  White    Male  ...             0              50   United-States   <=50K  1.0  0.0  0.0  0.0  0.0  0.0
2       28     Local-gov  336951    Assoc-acdm               12  Married-civ-spouse    Protective-serv       Husband  White    Male  ...             0              40   United-States    >50K  1.0  0.0  0.0  0.0  0.0  0.0
3       44       Private  160323  Some-college               10  Married-civ-spouse  Machine-op-inspct       Husband  Black    Male  ...             0              40   United-States    >50K  1.0  0.0  0.0  0.0  0.0  0.0
4       18             ?  103497  Some-college               10       Never-married                  ?     Own-child  White  Female  ...             0              30   United-States   <=50K  0.0  0.0  0.0  1.0  0.0  0.0
...    ...           ...     ...           ...              ...                 ...                ...           ...    ...     ...  ...           ...             ...             ...     ...  ...  ...  ...  ...  ...  ...
48837   27       Private  257302    Assoc-acdm               12  Married-civ-spouse       Tech-support          Wife  White  Female  ...             0              38   United-States   <=50K  0.0  0.0  0.0  0.0  0.0  1.0
48838   40       Private  154374       HS-grad                9  Married-civ-spouse  Machine-op-inspct       Husband  White    Male  ...             0              40   United-States    >50K  1.0  0.0  0.0  0.0  0.0  0.0
48839   58       Private  151910       HS-grad                9             Widowed       Adm-clerical     Unmarried  White  Female  ...             0              40   United-States   <=50K  0.0  0.0  0.0  0.0  1.0  0.0
48840   22       Private  201490       HS-grad                9       Never-married       Adm-clerical     Own-child  White    Male  ...             0              20   United-States   <=50K  0.0  0.0  0.0  1.0  0.0  0.0
48841   52  Self-emp-inc  287927       HS-grad                9  Married-civ-spouse    Exec-managerial          Wife  White  Female  ...             0              40   United-States    >50K  0.0  0.0  0.0  0.0  0.0  1.0

48842 rows × 21 columns

# Dropping the original relationship column, since only the generated numerical columns will be used

final_df.drop('relationship', axis=1, inplace=True)
final_df

       age     workclass  fnlwgt     education  educational-num      marital-status         occupation   race  gender  capital-gain  capital-loss  hours-per-week  native-country  income    0    1    2    3    4    5
0       25       Private  226802          11th                7       Never-married  Machine-op-inspct  Black    Male             0             0              40   United-States   <=50K  0.0  0.0  0.0  1.0  0.0  0.0
1       38       Private   89814       HS-grad                9  Married-civ-spouse    Farming-fishing  White    Male             0             0              50   United-States   <=50K  1.0  0.0  0.0  0.0  0.0  0.0
2       28     Local-gov  336951    Assoc-acdm               12  Married-civ-spouse    Protective-serv  White    Male             0             0              40   United-States    >50K  1.0  0.0  0.0  0.0  0.0  0.0
3       44       Private  160323  Some-college               10  Married-civ-spouse  Machine-op-inspct  Black    Male          7688             0              40   United-States    >50K  1.0  0.0  0.0  0.0  0.0  0.0
4       18             ?  103497  Some-college               10       Never-married                  ?  White  Female             0             0              30   United-States   <=50K  0.0  0.0  0.0  1.0  0.0  0.0
...    ...           ...     ...           ...              ...                 ...                ...    ...     ...           ...           ...             ...             ...     ...  ...  ...  ...  ...  ...  ...
48837   27       Private  257302    Assoc-acdm               12  Married-civ-spouse       Tech-support  White  Female             0             0              38   United-States   <=50K  0.0  0.0  0.0  0.0  0.0  1.0
48838   40       Private  154374       HS-grad                9  Married-civ-spouse  Machine-op-inspct  White    Male             0             0              40   United-States    >50K  1.0  0.0  0.0  0.0  0.0  0.0
48839   58       Private  151910       HS-grad                9             Widowed       Adm-clerical  White  Female             0             0              40   United-States   <=50K  0.0  0.0  0.0  0.0  1.0  0.0
48840   22       Private  201490       HS-grad                9       Never-married       Adm-clerical  White    Male             0             0              20   United-States   <=50K  0.0  0.0  0.0  1.0  0.0  0.0
48841   52  Self-emp-inc  287927       HS-grad                9  Married-civ-spouse    Exec-managerial  White  Female         15024             0              40   United-States    >50K  0.0  0.0  0.0  0.0  0.0  1.0

48842 rows × 20 columns

# OneHotEncoder orders its output columns by sorted category (Husband, Not-in-family,
# Other-relative, Own-child, Unmarried, Wife), so the new names must follow that order
final_df.columns = ['age', 'workclass', 'fnlwgt',
                    'education', 'educational-num', 'marital-status',
                    'occupation', 'race', 'gender',
                    'capital-gain', 'capital-loss', 'hours-per-week',
                    'native-country', 'income',
                    'Husband', 'Not-in-family', 'Other-relative', 'Own-child', 'Unmarried', 'Wife']
final_df

       age     workclass  fnlwgt     education  educational-num      marital-status         occupation   race  gender  capital-gain  capital-loss  hours-per-week  native-country  income  Husband  Not-in-family  Other-relative  Own-child  Unmarried  Wife
0       25       Private  226802          11th                7       Never-married  Machine-op-inspct  Black    Male             0             0              40   United-States   <=50K      0.0            0.0             0.0        1.0        0.0   0.0
1       38       Private   89814       HS-grad                9  Married-civ-spouse    Farming-fishing  White    Male             0             0              50   United-States   <=50K      1.0            0.0             0.0        0.0        0.0   0.0
2       28     Local-gov  336951    Assoc-acdm               12  Married-civ-spouse    Protective-serv  White    Male             0             0              40   United-States    >50K      1.0            0.0             0.0        0.0        0.0   0.0
3       44       Private  160323  Some-college               10  Married-civ-spouse  Machine-op-inspct  Black    Male          7688             0              40   United-States    >50K      1.0            0.0             0.0        0.0        0.0   0.0
4       18             ?  103497  Some-college               10       Never-married                  ?  White  Female             0             0              30   United-States   <=50K      0.0            0.0             0.0        1.0        0.0   0.0
...    ...           ...     ...           ...              ...                 ...                ...    ...     ...           ...           ...             ...             ...     ...      ...            ...             ...        ...        ...   ...
48837   27       Private  257302    Assoc-acdm               12  Married-civ-spouse       Tech-support  White  Female             0             0              38   United-States   <=50K      0.0            0.0             0.0        0.0        0.0   1.0
48838   40       Private  154374       HS-grad                9  Married-civ-spouse  Machine-op-inspct  White    Male             0             0              40   United-States    >50K      1.0            0.0             0.0        0.0        0.0   0.0
48839   58       Private  151910       HS-grad                9             Widowed       Adm-clerical  White  Female             0             0              40   United-States   <=50K      0.0            0.0             0.0        0.0        1.0   0.0
48840   22       Private  201490       HS-grad                9       Never-married       Adm-clerical  White    Male             0             0              20   United-States   <=50K      0.0            0.0             0.0        1.0        0.0   0.0
48841   52  Self-emp-inc  287927       HS-grad                9  Married-civ-spouse    Exec-managerial  White  Female         15024             0              40   United-States    >50K      0.0            0.0             0.0        0.0        0.0   1.0

48842 rows × 20 columns


Method 2

One-Hot encoding the categorical parameters using get_dummies()

one_hot_encoded_data = pd.get_dummies(df_1, columns = ['relationship'])
one_hot_encoded_data

       age     workclass  fnlwgt     education  educational-num      marital-status         occupation   race  gender  capital-gain  capital-loss  hours-per-week  native-country  income  relationship_Husband  relationship_Not-in-family  relationship_Other-relative  relationship_Own-child  relationship_Unmarried  relationship_Wife
0       25       Private  226802          11th                7       Never-married  Machine-op-inspct  Black    Male             0             0              40   United-States   <=50K                     0                           0                            0                       1                       0                  0
1       38       Private   89814       HS-grad                9  Married-civ-spouse    Farming-fishing  White    Male             0             0              50   United-States   <=50K                     1                           0                            0                       0                       0                  0
2       28     Local-gov  336951    Assoc-acdm               12  Married-civ-spouse    Protective-serv  White    Male             0             0              40   United-States    >50K                     1                           0                            0                       0                       0                  0
3       44       Private  160323  Some-college               10  Married-civ-spouse  Machine-op-inspct  Black    Male          7688             0              40   United-States    >50K                     1                           0                            0                       0                       0                  0
4       18             ?  103497  Some-college               10       Never-married                  ?  White  Female             0             0              30   United-States   <=50K                     0                           0                            0                       1                       0                  0
...    ...           ...     ...           ...              ...                 ...                ...    ...     ...           ...           ...             ...             ...     ...                   ...                         ...                          ...                     ...                     ...                ...
48837   27       Private  257302    Assoc-acdm               12  Married-civ-spouse       Tech-support  White  Female             0             0              38   United-States   <=50K                     0                           0                            0                       0                       0                  1
48838   40       Private  154374       HS-grad                9  Married-civ-spouse  Machine-op-inspct  White    Male             0             0              40   United-States    >50K                     1                           0                            0                       0                       0                  0
48839   58       Private  151910       HS-grad                9             Widowed       Adm-clerical  White  Female             0             0              40   United-States   <=50K                     0                           0                            0                       0                       1                  0
48840   22       Private  201490       HS-grad                9       Never-married       Adm-clerical  White    Male             0             0              20   United-States   <=50K                     0                           0                            0                       1                       0                  0
48841   52  Self-emp-inc  287927       HS-grad                9  Married-civ-spouse    Exec-managerial  White  Female         15024             0              40   United-States    >50K                     0                           0                            0                       0                       0                  1

48842 rows × 20 columns
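
One detail worth checking when naming encoded columns by hand: get_dummies emits its columns in sorted category order, while unique() lists values in order of first appearance. The sketch below shows the difference on a hypothetical four-row column.

```python
import pandas as pd

# Hypothetical toy column, used only for illustration
rel = pd.DataFrame({"relationship": ["Own-child", "Husband", "Wife", "Husband"]})

# unique() follows the order in which values first appear
order_of_appearance = list(rel["relationship"].unique())

# get_dummies sorts the categories before creating the columns
dummies = pd.get_dummies(rel, columns=["relationship"], dtype=int)
dummy_columns = list(dummies.columns)

print(order_of_appearance)
print(dummy_columns)
```

Because the generated names already carry the category, no manual renaming is needed with this method.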


Srihari Thyagarajan
B Tech AI Senior Student

Hi, I’m Haleshot, a final-year student studying B Tech Artificial Intelligence. I like projects relating to ML, AI, DL, CV, NLP, Image Processing, etc. Currently exploring Python, FastAPI, projects involving AI and platforms such as HuggingFace and Kaggle.
