DBScan Clustering Algorithm

Exploring Density-Based Spatial Clustering with DBSCAN

Srihari Thyagarajan

Last updated on May 25, 2024 15 min read Programming, Data Science, AI, Algorithms, Academic

Program output

DBSCAN Clustering Algorithm

This README file provides information on the DBSCAN (Density-Based Spatial Clustering of Applications with Noise) algorithm and its implementation. DBSCAN is a density-based clustering algorithm that groups together data points that are close to each other in dense regions and identifies outliers as points in low-density regions. Below is a breakdown of the content covered in this README.

Part A: DBSCAN Clustering Algorithm
Part B: Implementing DBSCAN
Part C: Handling Outliers
Part D: Exploring Different Parameters
Part E: DBSCAN on Circular Dataset
Part F: DBSCAN on Moon-shaped Dataset
Part G: DBSCAN on Diabetes Dataset

Part A: DBSCAN Clustering Algorithm

In this section, the DBSCAN algorithm is introduced. The make_blobs function from sklearn.datasets is used to generate a synthetic dataset for demonstration purposes.

Part B: Implementing DBSCAN

The DBSCAN algorithm is implemented using the DBSCAN class from sklearn.cluster. The eps (epsilon) and min_samples parameters are set to define the clustering behavior.

Part C: Handling Outliers

This section discusses how DBSCAN identifies outliers as noise points. The outliers are plotted separately to visualize their presence in the dataset.

Part D: Exploring Different Parameters

The impact of changing the eps and min_samples parameters on the DBSCAN algorithm is explored. The dataset is visualized with different parameter values to observe the clustering behavior and outlier detection.

Part E: DBSCAN on Circular Dataset

DBSCAN is applied to a circular dataset generated using the make_circles function from sklearn.datasets. The resulting clusters are visualized with different colors.

Part F: DBSCAN on Moon-shaped Dataset

DBSCAN is applied to a moon-shaped dataset generated using the make_moons function from sklearn.datasets. The resulting clusters are visualized with different colors.

Part G: DBSCAN on Diabetes Dataset

The DBSCAN algorithm is applied to the diabetes dataset. The steps performed in the previous sections are repeated on this dataset, showcasing how DBSCAN can be used in a real-world scenario.

In this README, you will find code examples, visualizations, and explanations of the DBSCAN algorithm and its application to different datasets. Follow the provided instructions and explore the implementation of DBSCAN for clustering and outlier detection purposes.

# import libraries
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

DBScan Clustering Algorithm

from sklearn.datasets import make_blobs

# Generating the dataset
# Classes denote the labels here:
dataset, classes = make_blobs(n_samples = 250, n_features = 2, centers = 1, cluster_std = 0.3, random_state = 1)

# Dataset:
plt.scatter(dataset[:, 0], dataset[:, 1])

<matplotlib.collections.PathCollection at 0x7f3fa21fa7f0>

classes

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0])

from sklearn.cluster import DBSCAN
dbscan = DBSCAN(eps = 0.3, min_samples = 20) # Epsilon default value = 0.5.

pred = dbscan.fit_predict(dataset)

pred

array([ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0, -1,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0, -1,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0, -1,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0, -1,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0, -1,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0, -1,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0, -1,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0])

# Plot the data points of the cluster and show the outliers(those with 1 value):

# Sample Number of Outliers:
outlier_index = np.where(pred == -1)

# Value of the outlier:
outlier_val = dataset[outlier_index]

print("Outlier Index : \n", outlier_index, "\nOutlier Value : \n", outlier_val)

Outlier Index : 
 (array([ 20,  41,  56,  92, 154, 190, 202]),) 
Outlier Value : 
 [[-0.75030277  4.65386526]
 [-2.4114921   3.77224069]
 [-2.38361081  3.87321996]
 [-1.00234999  3.83758159]
 [-2.27072027  3.82371311]
 [-2.49748541  4.98774851]
 [-1.50503782  3.57172953]]

plt.scatter(dataset[:, 0], dataset[:, 1])
plt.scatter(outlier_val[:, 0], outlier_val[:, 1], color = 'r')

<matplotlib.collections.PathCollection at 0x7f3fa218baf0>

On Changing the Eps value and min samples

dbscan = DBSCAN(eps = 0.1, min_samples = 5) # Epsilon default value = 0.5.
pred = dbscan.fit_predict(dataset)

# Sample Number of Outliers:
outlier_index = np.where(pred == -1)

# Value of the outlier:
outlier_val = dataset[outlier_index]

plt.scatter(dataset[:, 0], dataset[:, 1])
plt.scatter(outlier_val[:, 0], outlier_val[:, 1], color = 'r')

<matplotlib.collections.PathCollection at 0x7f3fa2271640>

Hence from the above values, we see that taking eps value of 0.5 and above helps in eliminating outliers.

from sklearn.datasets import make_circles

dataset, classes = make_circles(n_samples = 500, factor = 0.3, noise = 0.1)

plt.scatter(dataset[:, 0], dataset[:, 1], c = classes) # c = classes gives different colours to the clusters

<matplotlib.collections.PathCollection at 0x7f3fa057a880>

dbscan = DBSCAN(eps = 0.2, min_samples = 15) # Epsilon default value = 0.5.
pred = dbscan.fit_predict(dataset)
# Sample Number of Outliers:
outlier_index = np.where(pred == -1)

# Value of the outlier:
outlier_val = dataset[outlier_index]

plt.scatter(dataset[:, 0], dataset[:, 1], c = classes)
plt.scatter(outlier_val[:, 0], outlier_val[:, 1], color = 'r')

<matplotlib.collections.PathCollection at 0x7f3fa04f4580>

pred

array([ 0,  0,  2,  2,  0,  8,  1,  0,  0,  0,  4,  8,  2,  0,  5,  0,  9,
        1,  0,  0,  0,  0,  1,  0,  9,  0,  1,  0,  0,  0,  0,  0,  0,  7,
        0,  1,  6,  0,  6,  0, -1,  2,  9,  3,  4,  2,  0,  8,  0,  0,  1,
        0,  2,  0,  0,  0,  0,  5,  1,  0,  0,  0,  4,  4,  5, 10, -1,  4,
        6,  0,  2,  0, -1,  6,  0,  0,  2,  3,  0,  0,  0,  5,  9,  0,  0,
        7,  0,  0,  6,  0,  2,  0,  2,  0,  9,  1,  0,  0,  4,  1, 10,  6,
        0,  2,  0,  1,  0,  0,  5,  4,  7,  0,  0, -1,  2,  0,  3,  1,  5,
        2,  0,  0,  0,  0,  0,  7,  6,  4,  0,  0,  8,  0,  0,  0,  4,  0,
        0,  0,  7,  0,  0,  0,  2,  4,  8,  0,  0,  0, -1,  9, 10,  2,  0,
        1,  0,  0, -1,  0, -1,  1,  2,  0,  0,  8,  0,  0,  1,  0,  2,  0,
        0,  0, 10,  8,  0,  8,  0, -1,  0,  0,  9, 10, -1,  0, -1,  2,  0,
        0,  0,  0,  6,  0,  7,  8,  3,  0,  0,  0,  0,  0,  0,  2, 10,  0,
        0, -1,  0,  3,  2,  0,  0,  0,  2,  8,  0,  0,  4,  0, 10,  0,  8,
        6,  9, -1,  7, -1,  0,  0,  2,  0,  9,  8,  0,  0,  0,  0,  0,  0,
        3,  0,  0,  0,  6,  0,  9,  0,  0,  0,  0,  1,  8,  8,  0,  0,  0,
        3,  0,  0,  4,  0,  9,  0,  0,  0,  0,  2,  0,  0,  4,  1,  4,  0,
       -1,  0,  0,  1, 10,  0,  0,  6,  2,  4,  3,  4,  0, -1,  0, -1,  0,
        0,  0,  0,  5,  6,  4,  0,  0, 10,  0, -1, -1,  0,  7,  9,  4,  8,
       10,  0, 10,  0,  0,  7,  5,  8,  0,  6,  5,  0, 10,  6,  2,  3,  5,
        8,  0,  0,  0,  8,  0,  8,  0,  0, -1,  0,  0,  0,  6,  2,  1,  0,
        0,  6,  0,  1,  2, 10,  9,  2,  0,  3,  7,  0,  0,  0,  0,  2,  0,
        9,  0,  1,  0,  8,  2,  1,  0,  0,  0,  0,  0,  2,  0,  0,  0,  9,
        9,  2,  0,  0,  0,  0,  0,  0,  4,  8, 10, -1,  9,  3,  2,  4,  9,
       -1,  0,  0, -1,  0,  0, -1,  2, -1,  0,  0,  0,  0,  0,  0,  9,  1,
       -1,  0,  0, -1, -1,  2, -1,  0,  0,  0,  0, 10,  0,  0,  3,  0,  8,
       -1,  0, -1,  0,  8,  3,  2,  1,  8,  0,  0,  0,  6,  0,  0,  4,  0,
        7, -1,  0,  0,  0,  3,  5,  0, -1,  1,  2, 10,  2,  5,  0,  0,  4,
        8,  6,  7,  0,  0,  6,  0,  6,  0,  6,  0, 10,  2,  0,  0,  0,  5,
        0,  0,  0,  1,  0,  0,  6,  0,  0,  1,  7,  5,  0,  0,  6,  5,  4,
        6,  4,  0,  0,  7,  0,  3])

from sklearn.datasets import make_moons

dataset, classes = make_moons(n_samples = 500, random_state = 1, noise = 0.15)
plt.scatter(dataset[:, 0], dataset[:, 1], c = classes) # c = classes gives different colours to the clusters

<matplotlib.collections.PathCollection at 0x7f3fa0455f10>

dbscan = DBSCAN(eps = 0.2, min_samples = 15) # Epsilon default value = 0.5.
pred = dbscan.fit_predict(dataset)
# Sample Number of Outliers:
outlier_index = np.where(pred == -1)

# Value of the outlier:
outlier_val = dataset[outlier_index]

plt.scatter(dataset[:, 0], dataset[:, 1], c = classes)
plt.scatter(outlier_val[:, 0], outlier_val[:, 1], color = 'r')

<matplotlib.collections.PathCollection at 0x7f3fa22f0c10>

On Changing the Eps value and min samples

dbscan = DBSCAN(eps = 0.3, min_samples = 15) # Epsilon default value = 0.5.
pred = dbscan.fit_predict(dataset)
# Sample Number of Outliers:
outlier_index = np.where(pred == -1)

# Value of the outlier:
outlier_val = dataset[outlier_index]

plt.scatter(dataset[:, 0], dataset[:, 1], c = classes)
plt.scatter(outlier_val[:, 0], outlier_val[:, 1], color = 'r')

<matplotlib.collections.PathCollection at 0x7f3f9ffab9a0>

Hence from the above values, we see that taking eps value of 0.3 and above helps in eliminating outliers.

On increasing the number of samples, outliers seem to increase.

Loading the dataset:

df = pd.read_csv("/content/diabetes.csv")

df

	Glucose	BMI	Outcome
0	148	33.6	1
1	85	26.6	0
2	183	23.3	1
3	89	28.1	0
4	137	43.1	1
...	...	...	...
763	101	32.9	0
764	122	36.8	0
765	121	26.2	0
766	126	30.1	1
767	93	30.4	0

768 rows × 3 columns

  <script>
    const buttonEl =
      document.querySelector('#df-633213a4-5b73-49c1-af8c-02b29db5f3f6 button.colab-df-convert');
    buttonEl.style.display =
      google.colab.kernel.accessAllowed ? 'block' : 'none';

    async function convertToInteractive(key) {
      const element = document.querySelector('#df-633213a4-5b73-49c1-af8c-02b29db5f3f6');
      const dataTable =
        await google.colab.kernel.invokeFunction('convertToInteractive',
                                                 [key], {});
      if (!dataTable) return;

      const docLinkHtml = 'Like what you see? Visit the ' +
        '<a target="_blank" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'
        + ' to learn more about interactive tables.';
      element.innerHTML = '';
      dataTable['output_type'] = 'display_data';
      await google.colab.output.renderOutput(dataTable, element);
      const docLink = document.createElement('div');
      docLink.innerHTML = docLinkHtml;
      element.appendChild(docLink);
    }
  </script>
</div>

EDA:

df.head()

	Glucose	BMI	Outcome
0	148	33.6	1
1	85	26.6	0
2	183	23.3	1
3	89	28.1	0
4	137	43.1	1

  <script>
    const buttonEl =
      document.querySelector('#df-5cefb7b3-8099-4c43-9c9a-0020b9880abb button.colab-df-convert');
    buttonEl.style.display =
      google.colab.kernel.accessAllowed ? 'block' : 'none';

    async function convertToInteractive(key) {
      const element = document.querySelector('#df-5cefb7b3-8099-4c43-9c9a-0020b9880abb');
      const dataTable =
        await google.colab.kernel.invokeFunction('convertToInteractive',
                                                 [key], {});
      if (!dataTable) return;

      const docLinkHtml = 'Like what you see? Visit the ' +
        '<a target="_blank" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'
        + ' to learn more about interactive tables.';
      element.innerHTML = '';
      dataTable['output_type'] = 'display_data';
      await google.colab.output.renderOutput(dataTable, element);
      const docLink = document.createElement('div');
      docLink.innerHTML = docLinkHtml;
      element.appendChild(docLink);
    }
  </script>
</div>

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   Glucose  768 non-null    int64  
 1   BMI      768 non-null    float64
 2   Outcome  768 non-null    int64  
dtypes: float64(1), int64(2)
memory usage: 18.1 KB

df.shape

(768, 3)

df.describe()

	Glucose	BMI	Outcome
count	768.000000	768.000000	768.000000
mean	120.894531	31.992578	0.348958
std	31.972618	7.884160	0.476951
min	0.000000	0.000000	0.000000
25%	99.000000	27.300000	0.000000
50%	117.000000	32.000000	0.000000
75%	140.250000	36.600000	1.000000
max	199.000000	67.100000	1.000000

  <script>
    const buttonEl =
      document.querySelector('#df-c7b13b05-315d-4802-ad90-5d29f753a613 button.colab-df-convert');
    buttonEl.style.display =
      google.colab.kernel.accessAllowed ? 'block' : 'none';

    async function convertToInteractive(key) {
      const element = document.querySelector('#df-c7b13b05-315d-4802-ad90-5d29f753a613');
      const dataTable =
        await google.colab.kernel.invokeFunction('convertToInteractive',
                                                 [key], {});
      if (!dataTable) return;

      const docLinkHtml = 'Like what you see? Visit the ' +
        '<a target="_blank" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'
        + ' to learn more about interactive tables.';
      element.innerHTML = '';
      dataTable['output_type'] = 'display_data';
      await google.colab.output.renderOutput(dataTable, element);
      const docLink = document.createElement('div');
      docLink.innerHTML = docLinkHtml;
      element.appendChild(docLink);
    }
  </script>
</div>

Dropping the outcome column as we are building an unsupervised model. So outcome is not required.

X = df.drop(['Outcome'], axis=1)

plt.scatter(X["Glucose"], X["BMI"], c = df["Outcome"])

<matplotlib.collections.PathCollection at 0x7f3f9fbaf970>

dbscan = DBSCAN(eps = 4, min_samples = 6) # Epsilon default value = 0.5.
pred = dbscan.fit_predict(X)

pred

array([ 0,  0, -1,  0,  0,  0,  0,  0, -1, -1,  0,  0,  0,  0, -1,  0,  0,
        0,  0,  0,  0,  0,  2,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  1,  0,  0,  0,  0,  0, -1,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0, -1,  0, -1,  0,  0,  0,  0, -1,
        0,  0,  0,  0,  0,  0,  0, -1, -1,  0,  0,  0,  0, -1,  0,  0, -1,
        0,  0,  0,  0,  0, -1,  0, -1,  0,  0,  0,  0, -1,  0, -1, -1,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0, -1,  0,  0,  0,  0, -1,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0, -1, -1,  0,  0,  0,  0,  0,  0,
        0, -1, -1,  0,  0,  0, -1,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0, -1,  0,  0,  0, -1,  0,  0,  0,  0, -1,  0,  0,  2,  0,
        0,  0,  0,  0,  0,  0, -1,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  2, -1,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0, -1,  2,  0,  0, -1,  0,  0,  0,  1,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0, -1,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0, -1,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0, -1,  0,  0,  0,  0,  0,  0,  0,  0, -1, -1,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0, -1,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0, -1,  0,  0,  0,  0,
        0,  0, -1,  0,  0,  0,  0,  0,  0, -1,  0,  0, -1,  0,  0,  0,  0,
        0,  0,  2,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0, -1,  0,  0,
        0,  0,  0,  0, -1,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
       -1,  0,  0,  0,  0,  0,  0, -1,  0,  0,  0,  0,  0,  0,  0,  0,  0,
       -1,  1,  0,  0,  0,  0,  0,  0,  0,  0, -1,  0,  0,  0,  0,  0,  0,
        0, -1,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0, -1,  0,  0,  0,  0, -1,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0, -1,  0,  0,  0,  0,  0,  0,  0, -1,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  1,  0, -1,  0,  0,  0,
        0, -1,  0,  0,  0, -1,  0,  0,  0, -1,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0, -1,  0, -1,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0, -1,  0,  0,  0,  0,  0,  0,
        0,  0, -1,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
       -1,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  2,  0,  0,  0,  0,  0,  0,  0,  0,  0, -1, -1,  0,  0,  0,  0,
        0, -1,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0, -1,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
       -1,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0, -1,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0, -1, -1,  0,  0,  0,  0,  0,  0,
       -1, -1,  0,  0, -1,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0, -1,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0, -1,  0,  0,
        0,  1,  0,  0,  0,  0, -1,  0,  0,  0,  0,  0,  0,  0,  0, -1, -1,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  1,  0,  0,  0,
        0,  0,  0])

plt.scatter(X["Glucose"], X["BMI"], c = pred)
plt.scatter(outlier_val[:, 0], outlier_val[:, 1], color = 'r')

<matplotlib.collections.PathCollection at 0x7f3f9f7de220>

Hence for eps value of 14 and samples as 25 helps in getting a clear distinction in the clusters.

dbscan = DBSCAN(eps = 14, min_samples = 25) # Epsilon default value = 0.5.
pred = dbscan.fit_predict(X)

plt.scatter(X["Glucose"], X["BMI"], c = pred)
plt.scatter(outlier_val[:, 0], outlier_val[:, 1], color = 'r')

<matplotlib.collections.PathCollection at 0x7f3f9f764820>

Edit this page

Machine Learning Clustering DBSCAN Density-Based Clustering Outlier Detection

DBScan Clustering Algorithm

DBSCAN Clustering Algorithm

Table of Contents

Part A: DBSCAN Clustering Algorithm

Part B: Implementing DBSCAN

Part C: Handling Outliers

Part D: Exploring Different Parameters

Part E: DBSCAN on Circular Dataset

Part F: DBSCAN on Moon-shaped Dataset

Part G: DBSCAN on Diabetes Dataset

DBScan Clustering Algorithm

On Changing the Eps value and min samples

Hence from the above values, we see that taking eps value of 0.5 and above helps in eliminating outliers.

On Changing the Eps value and min samples

Hence from the above values, we see that taking eps value of 0.3 and above helps in eliminating outliers.

On increasing the number of samples, outliers seem to increase.

Loading the dataset:

EDA:

Dropping the outcome column as we are building an unsupervised model. So outcome is not required.

Hence for eps value of 14 and samples as 25 helps in getting a clear distinction in the clusters.

Srihari Thyagarajan

B Tech AI Senior Student

Related