EDA, One-Hot and Label Encoding
Exploring Data with Pandas Techniques
Data Exploration using Pandas Library
This README explains how to explore data with the Pandas library and how to encode categorical columns. The tasks below use a Toyota car dataset and the “adults” dataset.
Table of Contents
- Aim
- Prerequisite
- Outcome
- Theory
- Task 1: Exploratory Data Analysis on Car Dataset
- Task 2: One Hot and Label Encoding on “adults” Dataset
Aim
The aim of this project is to understand and implement data exploration techniques using the Pandas library.
Prerequisite
In order to complete this experiment, you should have prior knowledge of Python programming and the Pandas library.
Outcome
After successfully completing this experiment, you will be able to:
- Read different types of data files (csv, excel, text file, etc.).
- Obtain metadata of a given dataset.
- Find null values and replace them.
- Understand and implement class label encoding.
- Understand and implement one hot encoding.
Theory
Exploratory Data Analysis (EDA)
Exploratory Data Analysis (EDA) is an open-ended process in which we calculate statistics and draw figures to find trends, anomalies, patterns, or relationships within the data. The goal of EDA is to learn what our data can tell us. It generally starts with a high-level overview, then narrows to specific areas as we find intriguing parts of the data. The findings may be interesting in their own right, or they can inform our modeling choices, for example by helping us decide which features to use.
Pandas Library
Pandas is a powerful Python library for data manipulation and analysis. It provides a DataFrame, which is a 2-dimensional labeled data structure with columns of potentially different types. You can think of it like a spreadsheet or SQL table, or a dict of Series objects. It is generally the most commonly used pandas object. The DataFrame accepts many different kinds of input, including dictionaries, lists, arrays, and other DataFrames.
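As a small illustration of the input types mentioned above, the sketch below builds the same DataFrame from a dictionary of columns and from a list of row dictionaries. The values are hypothetical toy data, not rows loaded from the Toyota file.

```python
import pandas as pd

# Hypothetical toy data (assumption: not loaded from Toyota.csv).
from_dict = pd.DataFrame({"Price": [13500, 13750], "FuelType": ["Diesel", "Petrol"]})

# The same table built from a list of row dictionaries.
from_rows = pd.DataFrame([
    {"Price": 13500, "FuelType": "Diesel"},
    {"Price": 13750, "FuelType": "Petrol"},
])

# Each column is a Series with its own dtype.
print(from_dict.dtypes)
print(from_dict.equals(from_rows))
```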
Encoding
One Hot Encoding
One-hot encoding converts categorical data into numeric data by splitting the column into multiple columns. Each unique value in the column becomes a new column, and the values are replaced by 1s and 0s, depending on which column has what value.
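A minimal sketch of this idea, assuming a hypothetical four-row FuelType column rather than the real dataset. `pd.get_dummies` creates one column per unique value, and each row has a 1 in exactly one of them.

```python
import pandas as pd

# Hypothetical toy column (assumption: not the real dataset).
toy = pd.DataFrame({"FuelType": ["Diesel", "Petrol", "CNG", "Petrol"]})

# One new column per unique value; dtype=int gives 1s and 0s instead of booleans.
encoded = pd.get_dummies(toy, columns=["FuelType"], dtype=int)
print(encoded)
```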
Label Encoding
Label encoding is a simple approach that involves converting each value in a column into a number. Each unique value is assigned a unique integer label.
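A minimal sketch of label encoding, using pandas categorical codes on a hypothetical toy column. This is one possible approach and not necessarily the one intended for Task 2; scikit-learn's LabelEncoder is another option.

```python
import pandas as pd

# Hypothetical toy column (assumption: not the real dataset).
toy = pd.DataFrame({"FuelType": ["Diesel", "Petrol", "CNG", "Petrol"]})

# Converting to a categorical assigns one integer code per unique value.
categories = toy["FuelType"].astype("category")
toy["FuelType_label"] = categories.cat.codes

# Keep the code-to-label mapping so the encoding can be reversed later.
mapping = dict(enumerate(categories.cat.categories))
print(toy)
print(mapping)
```

Note that the integer codes imply an order that the categories may not really have, which is one reason one-hot encoding is often preferred for nominal data.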
Task 1: Exploratory Data Analysis on Car Dataset
Perform the following exploratory data analysis tasks on the car dataset:
- Read the Toyota.csv file into a DataFrame.
- Explore the size, shape, and data types of each column in the dataset.
- List down the columns of the dataset.
- Find out the ‘Fuel Type’ for the 4th row.
- Find out the value for the second column for the 4th row.
- Select all rows for the column “Fuel Type”.
- Select all rows for the columns “KM”, “HP”, and “Automatic”.
- Display the first five rows for columns 2 to 4 (excluding row 5 and column 4).
- Display the info of the dataset and state your observations.
- Identify unique values for the columns “KM”, “HP”, and “Doors”.
- Create a new data frame, replacing “???” with NaN.
- Replace the categorical values in the “Doors” column with their corresponding numeric values.
- Convert the data types of columns “Doors”, “MetColor”, and “Automatic” to int and object.
- Identify the total number of null values in each column of the dataset.
- Drop rows with null values.
- Identify the total number of cars that run on “Petrol”, “Diesel”, or “CNG”.
- Identify the mean of “KM” for the cars that run on “Diesel”.
Task 2: One Hot and Label Encoding on “adults” Dataset
Perform one hot encoding and label encoding on the relationship column of the “adults” dataset.
# ML Practical Experiment 2
# import libraries
import pandas as pd
import numpy as np
import statistics as st
Task 1
df = pd.read_excel("/content/Toyota.csv")
df
| | Price | Age | KM | FuelType | HP | MetColor | Automatic | CC | Doors | Weight |
---|---|---|---|---|---|---|---|---|---|---|
0 | 13500 | 23.0 | 46986 | Diesel | 90 | 1.0 | 0 | 2000 | three | 1165 |
1 | 13750 | 23.0 | 72937 | Diesel | 90 | 1.0 | 0 | 2000 | 3 | 1165 |
2 | 13950 | 24.0 | 41711 | Diesel | 90 | NaN | 0 | 2000 | 3 | 1165 |
3 | 14950 | 26.0 | 48000 | Diesel | 90 | 0.0 | 0 | 2000 | 3 | 1165 |
4 | 13750 | 30.0 | 38500 | Diesel | 90 | 0.0 | 0 | 2000 | 3 | 1170 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
1431 | 7500 | NaN | 20544 | Petrol | 86 | 1.0 | 0 | 1300 | 3 | 1025 |
1432 | 10845 | 72.0 | ?? | Petrol | 86 | 0.0 | 0 | 1300 | 3 | 1015 |
1433 | 8500 | NaN | 17016 | Petrol | 86 | 0.0 | 0 | 1300 | 3 | 1015 |
1434 | 7250 | 70.0 | ?? | NaN | 86 | 1.0 | 0 | 1300 | 3 | 1015 |
1435 | 6950 | 76.0 | 1 | Petrol | 110 | 0.0 | 0 | 1600 | 5 | 1114 |
1436 rows × 10 columns
# Size
df.size
14360
# Shape
df.shape
(1436, 10)
# Data Types
df.dtypes
Price int64
Age float64
KM object
FuelType object
HP object
MetColor object
Automatic object
CC int64
Doors int64
Weight int64
dtype: object
# Columns of a Dataset
for column in df.columns:
    print(column)
Price
Age
KM
FuelType
HP
MetColor
Automatic
CC
Doors
Weight
# Fuel Type of the 4th row
df['FuelType'][3]
'Diesel'
# Value in the second column (position 1) for the 4th row (position 3)
df.iloc[3, 1]
26.0
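Because positions are zero-based, the 4th row is position 3 and the second column is position 1. The sketch below shows label-based (`loc`) and position-based (`iloc`) selection on a small frame whose values are copied from the first four rows displayed above; the variable names are illustrative.

```python
import pandas as pd

# First four rows of three columns, copied from the table shown earlier.
toy = pd.DataFrame({
    "Price": [13500, 13750, 13950, 14950],
    "Age": [23.0, 23.0, 24.0, 26.0],
    "FuelType": ["Diesel", "Diesel", "Diesel", "Diesel"],
})

# Label-based: row label 3, column "FuelType".
fuel_4th_row = toy.loc[3, "FuelType"]

# Position-based: 4th row (position 3), 2nd column (position 1).
second_col_4th_row = toy.iloc[3, 1]

# Slices exclude their end point: rows 0-2 and columns 1-2.
block = toy.iloc[0:3, 1:3]
print(fuel_4th_row, second_col_4th_row)
print(block)
```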
df['FuelType']
0 Diesel
1 Diesel
2 Diesel
3 Diesel
4 Diesel
...
1431 Petrol
1432 Petrol
1433 Petrol
1434 0
1435 Petrol
Name: FuelType, Length: 1436, dtype: object
df[["FuelType", "KM", "HP"]]
| | FuelType | KM | HP |
---|---|---|---|
0 | Diesel | 46986 | 90 |
1 | Diesel | 72937 | 90 |
2 | Diesel | 41711 | 90 |
3 | Diesel | 48000 | 90 |
4 | Diesel | 38500 | 90 |
... | ... | ... | ... |
1431 | Petrol | 20544 | 86 |
1432 | Petrol | NaN | 86 |
1433 | Petrol | 17016 | 86 |
1434 | 0 | NaN | 86 |
1435 | Petrol | 1 | 110 |
1436 rows × 3 columns
# Rows 1-4 and columns 2-3 (the slice end points, row 5 and column 4, are excluded)
df.iloc[1:5, 2:4]
| | KM | FuelType |
---|---|---|
1 | 72937 | Diesel |
2 | 41711 | Diesel |
3 | 48000 | Diesel |
4 | 38500 | Diesel |
# Info of dataset:
df
| | Price | Age | KM | FuelType | HP | MetColor | Automatic | CC | Doors | Weight |
---|---|---|---|---|---|---|---|---|---|---|
0 | 13500 | 23.0 | 46986 | Diesel | 90 | 1.0 | 0 | 2000 | 3 | 1165 |
1 | 13750 | 23.0 | 72937 | Diesel | 90 | 1.0 | 0 | 2000 | 3 | 1165 |
2 | 13950 | 24.0 | 41711 | Diesel | 90 | 0.0 | 0 | 2000 | 3 | 1165 |
3 | 14950 | 26.0 | 48000 | Diesel | 90 | 0.0 | 0 | 2000 | 3 | 1165 |
4 | 13750 | 30.0 | 38500 | Diesel | 90 | 0.0 | 0 | 2000 | 3 | 1170 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
1431 | 7500 | 0.0 | 20544 | Petrol | 86 | 1.0 | 0 | 1300 | 3 | 1025 |
1432 | 10845 | 72.0 | NaN | Petrol | 86 | 0.0 | 0 | 1300 | 3 | 1015 |
1433 | 8500 | 0.0 | 17016 | Petrol | 86 | 0.0 | 0 | 1300 | 3 | 1015 |
1434 | 7250 | 70.0 | NaN | 0 | 86 | 1.0 | 0 | 1300 | 3 | 1015 |
1435 | 6950 | 76.0 | 1 | Petrol | 110 | 0.0 | 0 | 1600 | 5 | 1114 |
1436 rows × 10 columns
Observations from the Dataset:
From the dataset we observe that:
- The KM (kilometres) and Doors columns should hold integers, but some of their entries are not numeric, for example '??' in KM and 'three' in Doors.
- Because of these entries, both columns have the “object” datatype.
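Observations like these are usually read off `DataFrame.info()`, which lists each column with its non-null count and dtype. The sketch below uses a hypothetical three-row frame that mixes numbers with placeholder strings, as the KM and Doors columns do; it is not output from the real dataset.

```python
import io
import pandas as pd

# Hypothetical toy frame (assumption): numbers mixed with placeholder strings.
toy = pd.DataFrame({
    "KM": [46986, "??", 41711],
    "Doors": ["three", 3, 5],
    "CC": [2000, 2000, 1300],
})

# info() prints to stdout by default; a buffer captures the text instead.
buffer = io.StringIO()
toy.info(buf=buffer)
print(buffer.getvalue())

# Mixed columns fall back to the generic object dtype; CC stays an integer column.
print(toy.dtypes)
```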
df["KM"].unique()
array([46986, 72937, 41711, ..., 30964, 20544, 17016], dtype=object)
df["HP"].unique()
array([90, '????', 192, 110, 97, 71, 116, 98, 69, 86, 72, 107, 73],
dtype=object)
df["Doors"].unique()
array(['three', 3, 5, 4, 'four', 'five', 2], dtype=object)
df[["KM","HP","Doors"]].nunique()
KM 1256
HP 13
Doors 7
dtype: int64
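# Fill missing cells with 0, then overwrite the '??' placeholders in place.
# Note: 'NaN' here is a string, not a real missing value (np.nan).
df = df.fillna(0)
df.replace('??', 'NaN', inplace=True)
# Form a new DataFrame with the remaining placeholders replaced
# (again with a string, 'NAN', so isnull() and dropna() will not detect them)
newdf = df.replace(to_replace=["??", "????"], value="NAN")
newdf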
| | Price | Age | KM | FuelType | HP | MetColor | Automatic | CC | Doors | Weight |
---|---|---|---|---|---|---|---|---|---|---|
0 | 13500 | 23.0 | 46986 | Diesel | 90 | 1.0 | 0 | 2000 | three | 1165 |
1 | 13750 | 23.0 | 72937 | Diesel | 90 | 1.0 | 0 | 2000 | 3 | 1165 |
2 | 13950 | 24.0 | 41711 | Diesel | 90 | 0.0 | 0 | 2000 | 3 | 1165 |
3 | 14950 | 26.0 | 48000 | Diesel | 90 | 0.0 | 0 | 2000 | 3 | 1165 |
4 | 13750 | 30.0 | 38500 | Diesel | 90 | 0.0 | 0 | 2000 | 3 | 1170 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
1431 | 7500 | 0.0 | 20544 | Petrol | 86 | 1.0 | 0 | 1300 | 3 | 1025 |
1432 | 10845 | 72.0 | NaN | Petrol | 86 | 0.0 | 0 | 1300 | 3 | 1015 |
1433 | 8500 | 0.0 | 17016 | Petrol | 86 | 0.0 | 0 | 1300 | 3 | 1015 |
1434 | 7250 | 70.0 | NaN | 0 | 86 | 1.0 | 0 | 1300 | 3 | 1015 |
1435 | 6950 | 76.0 | 1 | Petrol | 110 | 0.0 | 0 | 1600 | 5 | 1114 |
1436 rows × 10 columns
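The replacement above writes text into the cells, so pandas does not treat them as missing. A minimal sketch of an alternative, assuming a hypothetical toy frame: replace the placeholders with `np.nan` and convert the columns to numeric, after which `isna()` can count the gaps.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame (assumption) with the same placeholder strings.
toy = pd.DataFrame({
    "KM": [46986, "??", 41711],
    "HP": [90, "????", 110],
})

# np.nan is a real missing value, unlike the strings 'NaN' or 'NAN'.
cleaned = toy.replace(["??", "????"], np.nan)

# With the placeholders gone, the columns can be converted to numeric dtypes.
cleaned = cleaned.apply(pd.to_numeric)

print(cleaned.dtypes)
print(cleaned.isna().sum())
```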
# New dataframe containing ?? replaced with NaN
newdf
| | Price | Age | KM | FuelType | HP | MetColor | Automatic | CC | Doors | Weight |
---|---|---|---|---|---|---|---|---|---|---|
0 | 13500 | 23.0 | 46986 | Diesel | 90 | 1.0 | 0 | 2000 | three | 1165 |
1 | 13750 | 23.0 | 72937 | Diesel | 90 | 1.0 | 0 | 2000 | 3 | 1165 |
2 | 13950 | 24.0 | 41711 | Diesel | 90 | 0.0 | 0 | 2000 | 3 | 1165 |
3 | 14950 | 26.0 | 48000 | Diesel | 90 | 0.0 | 0 | 2000 | 3 | 1165 |
4 | 13750 | 30.0 | 38500 | Diesel | 90 | 0.0 | 0 | 2000 | 3 | 1170 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
1431 | 7500 | 0.0 | 20544 | Petrol | 86 | 1.0 | 0 | 1300 | 3 | 1025 |
1432 | 10845 | 72.0 | NaN | Petrol | 86 | 0.0 | 0 | 1300 | 3 | 1015 |
1433 | 8500 | 0.0 | 17016 | Petrol | 86 | 0.0 | 0 | 1300 | 3 | 1015 |
1434 | 7250 | 70.0 | NaN | 0 | 86 | 1.0 | 0 | 1300 | 3 | 1015 |
1435 | 6950 | 76.0 | 1 | Petrol | 110 | 0.0 | 0 | 1600 | 5 | 1114 |
1436 rows × 10 columns
# Replace the spelled-out door counts with their numeric values
newdf["Doors"] = newdf["Doors"].replace(["three", "four", "five"], [3, 4, 5])
newdf
| | Price | Age | KM | FuelType | HP | MetColor | Automatic | CC | Doors | Weight |
---|---|---|---|---|---|---|---|---|---|---|
0 | 13500 | 23.0 | 46986 | Diesel | 90 | 1.0 | 0 | 2000 | 3 | 1165 |
1 | 13750 | 23.0 | 72937 | Diesel | 90 | 1.0 | 0 | 2000 | 3 | 1165 |
2 | 13950 | 24.0 | 41711 | Diesel | 90 | 0.0 | 0 | 2000 | 3 | 1165 |
3 | 14950 | 26.0 | 48000 | Diesel | 90 | 0.0 | 0 | 2000 | 3 | 1165 |
4 | 13750 | 30.0 | 38500 | Diesel | 90 | 0.0 | 0 | 2000 | 3 | 1170 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
1431 | 7500 | 0.0 | 20544 | Petrol | 86 | 1.0 | 0 | 1300 | 3 | 1025 |
1432 | 10845 | 72.0 | NaN | Petrol | 86 | 0.0 | 0 | 1300 | 3 | 1015 |
1433 | 8500 | 0.0 | 17016 | Petrol | 86 | 0.0 | 0 | 1300 | 3 | 1015 |
1434 | 7250 | 70.0 | NaN | 0 | 86 | 1.0 | 0 | 1300 | 3 | 1015 |
1435 | 6950 | 76.0 | 1 | Petrol | 110 | 0.0 | 0 | 1600 | 5 | 1114 |
1436 rows × 10 columns
newdf["Doors"] = newdf["Doors"].astype(int)
newdf["MetColor"] = newdf["MetColor"].astype(object)
newdf["Automatic"] = newdf["Automatic"].astype(object)
# Count nulls on a freshly loaded copy, since df had its nulls filled with 0 above
df_1 = pd.read_excel("/content/Toyota.csv")
df_1.isnull().sum()
Price 0
Age 100
KM 0
FuelType 100
HP 0
MetColor 150
Automatic 0
CC 0
Doors 0
Weight 0
dtype: int64
newdf.dropna()
| | Price | Age | KM | FuelType | HP | MetColor | Automatic | CC | Doors | Weight |
---|---|---|---|---|---|---|---|---|---|---|
0 | 13500 | 23.0 | 46986 | Diesel | 90 | 1.0 | 0 | 2000 | 3 | 1165 |
1 | 13750 | 23.0 | 72937 | Diesel | 90 | 1.0 | 0 | 2000 | 3 | 1165 |
2 | 13950 | 24.0 | 41711 | Diesel | 90 | 0.0 | 0 | 2000 | 3 | 1165 |
3 | 14950 | 26.0 | 48000 | Diesel | 90 | 0.0 | 0 | 2000 | 3 | 1165 |
4 | 13750 | 30.0 | 38500 | Diesel | 90 | 0.0 | 0 | 2000 | 3 | 1170 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
1431 | 7500 | 0.0 | 20544 | Petrol | 86 | 1.0 | 0 | 1300 | 3 | 1025 |
1432 | 10845 | 72.0 | NaN | Petrol | 86 | 0.0 | 0 | 1300 | 3 | 1015 |
1433 | 8500 | 0.0 | 17016 | Petrol | 86 | 0.0 | 0 | 1300 | 3 | 1015 |
1434 | 7250 | 70.0 | NaN | 0 | 86 | 1.0 | 0 | 1300 | 3 | 1015 |
1435 | 6950 | 76.0 | 1 | Petrol | 110 | 0.0 | 0 | 1600 | 5 | 1114 |
1436 rows × 10 columns
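The result above still has all 1436 rows because the missing cells had already been filled (with 0 or with placeholder text) before `dropna()` ran, so nothing was left to drop. The sketch below shows the difference on a hypothetical toy frame; the fill values are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame (assumption) with real missing values.
toy = pd.DataFrame({
    "Age": [23.0, np.nan, 24.0, 26.0],
    "FuelType": ["Diesel", "Petrol", None, "Petrol"],
})

# Number of missing cells per column.
null_counts = toy.isna().sum()

# dropna() keeps only the rows without any missing value.
complete_rows = toy.dropna()

# After filling, no cell counts as missing, so dropna() removes nothing.
filled = toy.fillna({"Age": 0, "FuelType": "unknown"})

print(null_counts)
print(len(complete_rows), len(filled.dropna()))
```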
newdf["FuelType"].value_counts()
Petrol 1177
Diesel 144
0 100
CNG 15
Name: FuelType, dtype: int64
# Mean of "KM" for Diesel cars; non-numeric placeholders are coerced to NaN and skipped by mean()
diesel_km = pd.to_numeric(newdf.loc[newdf['FuelType'] == 'Diesel', 'KM'], errors='coerce')
diesel_km.mean()
114927.87857142858
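The same kind of summary can be computed for every fuel type at once with `groupby`. This is a sketch on hypothetical toy values, not a result from the Toyota data.

```python
import pandas as pd

# Hypothetical toy frame (assumption), including one '??' placeholder.
toy = pd.DataFrame({
    "FuelType": ["Diesel", "Petrol", "Diesel", "CNG"],
    "KM": [40000, 20000, "??", 30000],
})

# Coerce non-numeric entries to NaN so that mean() skips them.
toy["KM"] = pd.to_numeric(toy["KM"], errors="coerce")

# Boolean mask for a single fuel type.
diesel_mean = toy.loc[toy["FuelType"] == "Diesel", "KM"].mean()

# One mean per fuel type.
mean_by_fuel = toy.groupby("FuelType")["KM"].mean()

print(diesel_mean)
print(mean_by_fuel)
```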
Task 2
from sklearn.preprocessing import OneHotEncoder
df = pd.read_excel("/content/adult.csv")
df_1 = pd.read_excel("/content/adult.csv")
# Checking for the labels in the categorical parameters
df_1["relationship"].unique()
array(['Own-child', 'Husband', 'Not-in-family', 'Unmarried', 'Wife',
'Other-relative'], dtype=object)
# Checking for the label counts in the categorical parameters
df_1["relationship"].value_counts()
Husband 19716
Not-in-family 12583
Own-child 7581
Unmarried 5125
Wife 2331
Other-relative 1506
Name: relationship, dtype: int64
Method 1:
One-hot encoding using the scikit-learn library:
# Creating an instance of the one-hot encoder
encoder = OneHotEncoder(handle_unknown='ignore')
# Perform one-hot encoding on 'relationship' column
encoder_df = pd.DataFrame(encoder.fit_transform(df[['relationship']]).toarray())
# Merging one-hot encoded columns back with original DataFrame df.
final_df = df.join(encoder_df)
final_df
| | age | workclass | fnlwgt | education | educational-num | marital-status | occupation | relationship | race | gender | ... | capital-loss | hours-per-week | native-country | income | 0 | 1 | 2 | 3 | 4 | 5 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 25 | Private | 226802 | 11th | 7 | Never-married | Machine-op-inspct | Own-child | Black | Male | ... | 0 | 40 | United-States | <=50K | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
1 | 38 | Private | 89814 | HS-grad | 9 | Married-civ-spouse | Farming-fishing | Husband | White | Male | ... | 0 | 50 | United-States | <=50K | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
2 | 28 | Local-gov | 336951 | Assoc-acdm | 12 | Married-civ-spouse | Protective-serv | Husband | White | Male | ... | 0 | 40 | United-States | >50K | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
3 | 44 | Private | 160323 | Some-college | 10 | Married-civ-spouse | Machine-op-inspct | Husband | Black | Male | ... | 0 | 40 | United-States | >50K | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
4 | 18 | ? | 103497 | Some-college | 10 | Never-married | ? | Own-child | White | Female | ... | 0 | 30 | United-States | <=50K | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
48837 | 27 | Private | 257302 | Assoc-acdm | 12 | Married-civ-spouse | Tech-support | Wife | White | Female | ... | 0 | 38 | United-States | <=50K | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
48838 | 40 | Private | 154374 | HS-grad | 9 | Married-civ-spouse | Machine-op-inspct | Husband | White | Male | ... | 0 | 40 | United-States | >50K | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
48839 | 58 | Private | 151910 | HS-grad | 9 | Widowed | Adm-clerical | Unmarried | White | Female | ... | 0 | 40 | United-States | <=50K | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
48840 | 22 | Private | 201490 | HS-grad | 9 | Never-married | Adm-clerical | Own-child | White | Male | ... | 0 | 20 | United-States | <=50K | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
48841 | 52 | Self-emp-inc | 287927 | HS-grad | 9 | Married-civ-spouse | Exec-managerial | Wife | White | Female | ... | 0 | 40 | United-States | >50K | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
48842 rows × 21 columns
# Drop the original relationship column, since only the generated numerical columns will be used
final_df.drop('relationship', axis=1, inplace=True)
final_df
| | age | workclass | fnlwgt | education | educational-num | marital-status | occupation | race | gender | capital-gain | capital-loss | hours-per-week | native-country | income | 0 | 1 | 2 | 3 | 4 | 5 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 25 | Private | 226802 | 11th | 7 | Never-married | Machine-op-inspct | Black | Male | 0 | 0 | 40 | United-States | <=50K | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
1 | 38 | Private | 89814 | HS-grad | 9 | Married-civ-spouse | Farming-fishing | White | Male | 0 | 0 | 50 | United-States | <=50K | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
2 | 28 | Local-gov | 336951 | Assoc-acdm | 12 | Married-civ-spouse | Protective-serv | White | Male | 0 | 0 | 40 | United-States | >50K | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
3 | 44 | Private | 160323 | Some-college | 10 | Married-civ-spouse | Machine-op-inspct | Black | Male | 7688 | 0 | 40 | United-States | >50K | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
4 | 18 | ? | 103497 | Some-college | 10 | Never-married | ? | White | Female | 0 | 0 | 30 | United-States | <=50K | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
48837 | 27 | Private | 257302 | Assoc-acdm | 12 | Married-civ-spouse | Tech-support | White | Female | 0 | 0 | 38 | United-States | <=50K | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
48838 | 40 | Private | 154374 | HS-grad | 9 | Married-civ-spouse | Machine-op-inspct | White | Male | 0 | 0 | 40 | United-States | >50K | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
48839 | 58 | Private | 151910 | HS-grad | 9 | Widowed | Adm-clerical | White | Female | 0 | 0 | 40 | United-States | <=50K | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
48840 | 22 | Private | 201490 | HS-grad | 9 | Never-married | Adm-clerical | White | Male | 0 | 0 | 20 | United-States | <=50K | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
48841 | 52 | Self-emp-inc | 287927 | HS-grad | 9 | Married-civ-spouse | Exec-managerial | White | Female | 15024 | 0 | 40 | United-States | >50K | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
48842 rows × 20 columns
# OneHotEncoder orders its output columns by sorted category (see encoder.categories_),
# so the new column names must follow that alphabetical order
final_df.columns = ['age', 'workclass', 'fnlwgt',
                    'education', 'educational-num', 'marital-status',
                    'occupation', 'race', 'gender',
                    'capital-gain', 'capital-loss', 'hours-per-week',
                    'native-country', 'income', 'Husband', 'Not-in-family', 'Other-relative', 'Own-child', 'Unmarried', 'Wife']
final_df
| | age | workclass | fnlwgt | education | educational-num | marital-status | occupation | race | gender | capital-gain | capital-loss | hours-per-week | native-country | income | Husband | Not-in-family | Other-relative | Own-child | Unmarried | Wife |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 25 | Private | 226802 | 11th | 7 | Never-married | Machine-op-inspct | Black | Male | 0 | 0 | 40 | United-States | <=50K | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
1 | 38 | Private | 89814 | HS-grad | 9 | Married-civ-spouse | Farming-fishing | White | Male | 0 | 0 | 50 | United-States | <=50K | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
2 | 28 | Local-gov | 336951 | Assoc-acdm | 12 | Married-civ-spouse | Protective-serv | White | Male | 0 | 0 | 40 | United-States | >50K | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
3 | 44 | Private | 160323 | Some-college | 10 | Married-civ-spouse | Machine-op-inspct | Black | Male | 7688 | 0 | 40 | United-States | >50K | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
4 | 18 | ? | 103497 | Some-college | 10 | Never-married | ? | White | Female | 0 | 0 | 30 | United-States | <=50K | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
48837 | 27 | Private | 257302 | Assoc-acdm | 12 | Married-civ-spouse | Tech-support | White | Female | 0 | 0 | 38 | United-States | <=50K | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
48838 | 40 | Private | 154374 | HS-grad | 9 | Married-civ-spouse | Machine-op-inspct | White | Male | 0 | 0 | 40 | United-States | >50K | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
48839 | 58 | Private | 151910 | HS-grad | 9 | Widowed | Adm-clerical | White | Female | 0 | 0 | 40 | United-States | <=50K | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
48840 | 22 | Private | 201490 | HS-grad | 9 | Never-married | Adm-clerical | White | Male | 0 | 0 | 20 | United-States | <=50K | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
48841 | 52 | Self-emp-inc | 287927 | HS-grad | 9 | Married-civ-spouse | Exec-managerial | White | Female | 15024 | 0 | 40 | United-States | >50K | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
48842 rows × 20 columns
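Typing the encoded column names by hand is error-prone, because the encoder's column order follows its sorted categories rather than the order in which values first appear. A minimal sketch, on a hypothetical four-row frame, that takes the names from the fitted encoder's `categories_` attribute instead:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy column (assumption: not the full adult dataset).
toy = pd.DataFrame({"relationship": ["Own-child", "Husband", "Wife", "Husband"]})

encoder = OneHotEncoder(handle_unknown="ignore")
encoded = encoder.fit_transform(toy[["relationship"]]).toarray()

# categories_[0] lists the categories in the same order as the encoded columns.
encoded_df = pd.DataFrame(encoded, columns=encoder.categories_[0], index=toy.index)

result = toy.join(encoded_df)
print(result)
```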
Method 2
One-Hot encoding the categorical parameters using get_dummies()
one_hot_encoded_data = pd.get_dummies(df_1, columns = ['relationship'])
one_hot_encoded_data
| | age | workclass | fnlwgt | education | educational-num | marital-status | occupation | race | gender | capital-gain | capital-loss | hours-per-week | native-country | income | relationship_Husband | relationship_Not-in-family | relationship_Other-relative | relationship_Own-child | relationship_Unmarried | relationship_Wife |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 25 | Private | 226802 | 11th | 7 | Never-married | Machine-op-inspct | Black | Male | 0 | 0 | 40 | United-States | <=50K | 0 | 0 | 0 | 1 | 0 | 0 |
1 | 38 | Private | 89814 | HS-grad | 9 | Married-civ-spouse | Farming-fishing | White | Male | 0 | 0 | 50 | United-States | <=50K | 1 | 0 | 0 | 0 | 0 | 0 |
2 | 28 | Local-gov | 336951 | Assoc-acdm | 12 | Married-civ-spouse | Protective-serv | White | Male | 0 | 0 | 40 | United-States | >50K | 1 | 0 | 0 | 0 | 0 | 0 |
3 | 44 | Private | 160323 | Some-college | 10 | Married-civ-spouse | Machine-op-inspct | Black | Male | 7688 | 0 | 40 | United-States | >50K | 1 | 0 | 0 | 0 | 0 | 0 |
4 | 18 | ? | 103497 | Some-college | 10 | Never-married | ? | White | Female | 0 | 0 | 30 | United-States | <=50K | 0 | 0 | 0 | 1 | 0 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
48837 | 27 | Private | 257302 | Assoc-acdm | 12 | Married-civ-spouse | Tech-support | White | Female | 0 | 0 | 38 | United-States | <=50K | 0 | 0 | 0 | 0 | 0 | 1 |
48838 | 40 | Private | 154374 | HS-grad | 9 | Married-civ-spouse | Machine-op-inspct | White | Male | 0 | 0 | 40 | United-States | >50K | 1 | 0 | 0 | 0 | 0 | 0 |
48839 | 58 | Private | 151910 | HS-grad | 9 | Widowed | Adm-clerical | White | Female | 0 | 0 | 40 | United-States | <=50K | 0 | 0 | 0 | 0 | 1 | 0 |
48840 | 22 | Private | 201490 | HS-grad | 9 | Never-married | Adm-clerical | White | Male | 0 | 0 | 20 | United-States | <=50K | 0 | 0 | 0 | 1 | 0 | 0 |
48841 | 52 | Self-emp-inc | 287927 | HS-grad | 9 | Married-civ-spouse | Exec-managerial | White | Female | 15024 | 0 | 40 | United-States | >50K | 0 | 0 | 0 | 0 | 0 | 1 |
48842 rows × 20 columns