Implementing Sentiment Analysis in Natural Language Processing

Comprehensive Guide to Sentiment Analysis

Srihari Thyagarajan

Last updated on Jun 3, 2024 6 min read Programming, Machine Learning, Data Science, Natural Language Processing

Program output

Natural Language Processing

Aim
Prerequisite
Outcome
Theory
Task to be completed in PART B
References

Aim

Sentiment Analysis

Select a dataset and identify the problem statement.
Perform EDA, text preprocessing, feature engineering.
Implement sentiment analysis on the given dataset in Natural Language Processing.
Analyze and comprehend the results obtained.

Prerequisite

Python

Outcome

After successful completion of this experiment, students will be able to:

Perform end-to-end implementation of sentiment analysis using various NLP concepts.

Theory

Understand concepts such as EDA, text preprocessing, and feature engineering for the selected dataset.

Task to be completed in PART B

Task

A.5.1. Task

Implement word embedding using Word2Vec.
Find similarity between two documents using Word2Vec.

For further information and datasets, refer to:

Data.World Sentiment Datasets

# import libraries used:
import pandas as pd
import numpy as np
import nltk
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

import spacy
from gensim.models import Word2Vec
from numpy import dot, nan_to_num
from numpy.linalg import norm

Task 1: Select a dataset and identify the problem statement

df = pd.read_csv('/content/Apple-Twitter-Sentiment-DFE.csv', encoding='ISO-8859-1')

Task 2: Perform EDA, text preprocessing, feature engineering

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3886 entries, 0 to 3885
Data columns (total 12 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   _unit_id              3886 non-null   int64  
 1   _golden               3886 non-null   bool   
 2   _unit_state           3886 non-null   object 
 3   _trusted_judgments    3886 non-null   int64  
 4   _last_judgment_at     3783 non-null   object 
 5   sentiment             3886 non-null   object 
 6   sentiment:confidence  3886 non-null   float64
 7   date                  3886 non-null   object 
 8   id                    3886 non-null   float64
 9   query                 3886 non-null   object 
 10  sentiment_gold        103 non-null    object 
 11  text                  3886 non-null   object 
dtypes: bool(1), float64(2), int64(2), object(7)
memory usage: 337.9+ KB

df.describe()

	_unit_id	_trusted_judgments	sentiment:confidence	id
count	3.886000e+03	3886.000000	3886.000000	3.886000e+03
mean	6.234975e+08	3.687082	0.829526	5.410039e+17
std	1.171906e+03	2.004595	0.175864	7.942752e+14
min	6.234955e+08	3.000000	0.332700	5.400000e+17
25%	6.234965e+08	3.000000	0.674475	5.400000e+17
50%	6.234975e+08	3.000000	0.811250	5.410000e+17
75%	6.234984e+08	3.000000	1.000000	5.420000e+17
max	6.235173e+08	27.000000	1.000000	5.420000e+17

<script>
  const buttonEl =
    document.querySelector('#df-32850436-6608-4845-93de-1644b1359bf2 button.colab-df-convert');
  buttonEl.style.display =
    google.colab.kernel.accessAllowed ? 'block' : 'none';

  async function convertToInteractive(key) {
    const element = document.querySelector('#df-32850436-6608-4845-93de-1644b1359bf2');
    const dataTable =
      await google.colab.kernel.invokeFunction('convertToInteractive',
                                                [key], {});
    if (!dataTable) return;

    const docLinkHtml = 'Like what you see? Visit the ' +
      '<a target="_blank" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'
      + ' to learn more about interactive tables.';
    element.innerHTML = '';
    dataTable['output_type'] = 'display_data';
    await google.colab.output.renderOutput(dataTable, element);
    const docLink = document.createElement('div');
    docLink.innerHTML = docLinkHtml;
    element.appendChild(docLink);
  }
</script>

df.head()

	_unit_id	_golden	_unit_state	_trusted_judgments	_last_judgment_at	sentiment	sentiment:confidence	date	id	query	sentiment_gold	text
0	623495513	True	golden	10	NaN	3	0.6264	Mon Dec 01 19:30:03 +0000 2014	5.400000e+17	#AAPL OR @Apple	3\nnot_relevant	#AAPL:The 10 best Steve Jobs emails ever...htt...
1	623495514	True	golden	12	NaN	3	0.8129	Mon Dec 01 19:43:51 +0000 2014	5.400000e+17	#AAPL OR @Apple	3\n1	RT @JPDesloges: Why AAPL Stock Had a Mini-Flas...
2	623495515	True	golden	10	NaN	3	1.0000	Mon Dec 01 19:50:28 +0000 2014	5.400000e+17	#AAPL OR @Apple	3	My cat only chews @apple cords. Such an #Apple...
3	623495516	True	golden	17	NaN	3	0.5848	Mon Dec 01 20:26:34 +0000 2014	5.400000e+17	#AAPL OR @Apple	3\n1	I agree with @jimcramer that the #IndividualIn...
4	623495517	False	finalized	3	12/12/14 12:14	3	0.6474	Mon Dec 01 20:29:33 +0000 2014	5.400000e+17	#AAPL OR @Apple	NaN	Nobody expects the Spanish Inquisition #AAPL

<script>
  const buttonEl =
    document.querySelector('#df-3668bf32-2cb2-4df9-9707-6cc01f8683ee button.colab-df-convert');
  buttonEl.style.display =
    google.colab.kernel.accessAllowed ? 'block' : 'none';

  async function convertToInteractive(key) {
    const element = document.querySelector('#df-3668bf32-2cb2-4df9-9707-6cc01f8683ee');
    const dataTable =
      await google.colab.kernel.invokeFunction('convertToInteractive',
                                                [key], {});
    if (!dataTable) return;

    const docLinkHtml = 'Like what you see? Visit the ' +
      '<a target="_blank" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'
      + ' to learn more about interactive tables.';
    element.innerHTML = '';
    dataTable['output_type'] = 'display_data';
    await google.colab.output.renderOutput(dataTable, element);
    const docLink = document.createElement('div');
    docLink.innerHTML = docLinkHtml;
    element.appendChild(docLink);
  }
</script>

df['sentiment'].value_counts()

3               2162
1               1219
5                423
not_relevant      82
Name: sentiment, dtype: int64

def preprocess_text(text):
    text = text.lower()
    text = ' '.join([word for word in text.split() if word.isalnum()])
    text = ' '.join([word for word in text.split() if word not in stop_words])
    return text

df['text_cleaned'] = df['text'].apply(preprocess_text)

tfidf_vectorizer = TfidfVectorizer(max_features=5000, ngram_range=(1, 2))
X = tfidf_vectorizer.fit_transform(df['text_cleaned'])
y = df['sentiment']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

classifier = LogisticRegression(max_iter=1000)
classifier.fit(X_train, y_train)

LogisticRegression(max_iter=1000)

In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.

y_pred = classifier.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
confusion = confusion_matrix(y_test, y_pred)
report = classification_report(y_test, y_pred)

/usr/local/lib/python3.10/dist-packages/sklearn/metrics/_classification.py:1344: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, msg_start, len(result))
/usr/local/lib/python3.10/dist-packages/sklearn/metrics/_classification.py:1344: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, msg_start, len(result))
/usr/local/lib/python3.10/dist-packages/sklearn/metrics/_classification.py:1344: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, msg_start, len(result))

confusion

array([[140,  97,   3,   0],
       [ 26, 397,   1,   0],
       [ 12,  67,  20,   0],
       [  1,  14,   0,   0]])

accuracy

0.7159383033419023

report

'              precision    recall  f1-score   support\n\n           1       0.78      0.58      0.67       240\n           3       0.69      0.94      0.79       424\n           5       0.83      0.20      0.33        99\nnot_relevant       0.00      0.00      0.00        15\n\n    accuracy                           0.72       778\n   macro avg       0.58      0.43      0.45       778\nweighted avg       0.72      0.72      0.68       778\n'

Task 3: Implement sentiment analysis on the given dataset in Natural Language Processing

nlp = spacy.load("en_core_web_sm")

def preprocess(text):
    doc = nlp(text)
    return [token.text for token in doc if not token.is_punct and not token.is_space]

tokenized_tweets = [preprocess(tweet) for tweet in df['text']]

model = Word2Vec(tokenized_tweets, vector_size=100, window=5, min_count=1, sg=0)
document_embeddings = []

for tokenized_tweet in tokenized_tweets:

    valid_tokens = [token for token in tokenized_tweet if token in model.wv]

    if valid_tokens:

        avg_embedding = sum(model.wv[token] for token in valid_tokens) / len(valid_tokens)
        document_embeddings.append(avg_embedding)
    else:

        document_embeddings.append(None)

Task 4: Analyze and comprehend the results obtained

tweet1_embedding = document_embeddings[0]
tweet2_embedding = document_embeddings[1]

def cosine_similarity(vec1, vec2):
    return dot(vec1, vec2) / (norm(vec1) * norm(vec2))

if tweet1_embedding is not None and tweet2_embedding is not None:
    similarity = cosine_similarity(tweet1_embedding, tweet2_embedding)
else:
    similarity = None
print(f"Similarity between tweet 1 and tweet 2: {similarity}")

Similarity between tweet 1 and tweet 2: 0.9979607462882996

Conclusion:

Based on the experiment conducted, we can conclude that the sentiment analysis performed on the given dataset in Natural Language Processing was successful. The dataset was selected and the problem statement was identified. EDA, text preprocessing, and feature engineering were performed to prepare the dataset for sentiment analysis. The results obtained were analyzed and comprehended.

Edit this page

Sentiment Analysis Natural Language Processing Text Preprocessing Feature Engineering