Implementing Sentiment Analysis in Natural Language Processing

Comprehensive Guide to Sentiment Analysis


Aim

Sentiment Analysis

  • Select a dataset and identify the problem statement.
  • Perform EDA, text preprocessing, feature engineering.
  • Implement sentiment analysis on the given dataset in Natural Language Processing.
  • Analyze and comprehend the results obtained.

Prerequisite

  • Python

Outcome

After successful completion of this experiment, students will be able to:

  1. Perform end-to-end implementation of sentiment analysis using various NLP concepts.

Theory

Understand concepts such as EDA, text preprocessing, and feature engineering for the selected dataset.

Task to be completed in PART B

A.5.1. Task

  • Implement word embedding using Word2Vec.
  • Find similarity between two documents using Word2Vec.

For further information and datasets, refer to:

# import libraries used:
import pandas as pd
import numpy as np
import nltk
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

import spacy
from gensim.models import Word2Vec
from numpy import dot, nan_to_num
from numpy.linalg import norm

Task 1: Select a dataset and identify the problem statement

df = pd.read_csv('/content/Apple-Twitter-Sentiment-DFE.csv', encoding='ISO-8859-1')
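The ISO-8859-1 encoding is specified presumably because some tweets contain bytes that are not valid UTF-8 and would otherwise raise a decode error; that is an assumption about the file rather than something stated in the notebook. A quick sanity check right after loading can confirm the frame looks as expected:

# Quick sanity checks on the loaded dataset
print(df.shape)                       # number of rows and columns
print(df['text'].isna().sum())        # tweets with missing text
print(df['text'].duplicated().sum())  # exact duplicate tweets (e.g. retweets)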

Task 2: Perform EDA, text preprocessing, feature engineering

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3886 entries, 0 to 3885
Data columns (total 12 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   _unit_id              3886 non-null   int64  
 1   _golden               3886 non-null   bool   
 2   _unit_state           3886 non-null   object 
 3   _trusted_judgments    3886 non-null   int64  
 4   _last_judgment_at     3783 non-null   object 
 5   sentiment             3886 non-null   object 
 6   sentiment:confidence  3886 non-null   float64
 7   date                  3886 non-null   object 
 8   id                    3886 non-null   float64
 9   query                 3886 non-null   object 
 10  sentiment_gold        103 non-null    object 
 11  text                  3886 non-null   object 
dtypes: bool(1), float64(2), int64(2), object(7)
memory usage: 337.9+ KB
df.describe()

           _unit_id  _trusted_judgments  sentiment:confidence            id
count  3.886000e+03         3886.000000           3886.000000  3.886000e+03
mean   6.234975e+08            3.687082              0.829526  5.410039e+17
std    1.171906e+03            2.004595              0.175864  7.942752e+14
min    6.234955e+08            3.000000              0.332700  5.400000e+17
25%    6.234965e+08            3.000000              0.674475  5.400000e+17
50%    6.234975e+08            3.000000              0.811250  5.410000e+17
75%    6.234984e+08            3.000000              1.000000  5.420000e+17
max    6.235173e+08           27.000000              1.000000  5.420000e+17
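The summary statistics show that sentiment:confidence ranges from about 0.33 to 1.0, so some labels are low-confidence crowd judgments. An optional refinement, not applied in this notebook, is to keep only rows above a confidence threshold (0.5 here is a hypothetical cut-off):

# Optional: drop low-confidence labels before modelling
df_high_conf = df[df['sentiment:confidence'] >= 0.5]
print(len(df_high_conf), 'of', len(df), 'rows kept')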

df.head()

|   | _unit_id | _golden | _unit_state | _trusted_judgments | _last_judgment_at | sentiment | sentiment:confidence | date | id | query | sentiment_gold | text |
|---|----------|---------|-------------|--------------------|-------------------|-----------|----------------------|------|----|-------|----------------|------|
| 0 | 623495513 | True | golden | 10 | NaN | 3 | 0.6264 | Mon Dec 01 19:30:03 +0000 2014 | 5.400000e+17 | #AAPL OR @Apple | 3\nnot_relevant | #AAPL:The 10 best Steve Jobs emails ever...htt... |
| 1 | 623495514 | True | golden | 12 | NaN | 3 | 0.8129 | Mon Dec 01 19:43:51 +0000 2014 | 5.400000e+17 | #AAPL OR @Apple | 3\n1 | RT @JPDesloges: Why AAPL Stock Had a Mini-Flas... |
| 2 | 623495515 | True | golden | 10 | NaN | 3 | 1.0000 | Mon Dec 01 19:50:28 +0000 2014 | 5.400000e+17 | #AAPL OR @Apple | 3 | My cat only chews @apple cords. Such an #Apple... |
| 3 | 623495516 | True | golden | 17 | NaN | 3 | 0.5848 | Mon Dec 01 20:26:34 +0000 2014 | 5.400000e+17 | #AAPL OR @Apple | 3\n1 | I agree with @jimcramer that the #IndividualIn... |
| 4 | 623495517 | False | finalized | 3 | 12/12/14 12:14 | 3 | 0.6474 | Mon Dec 01 20:29:33 +0000 2014 | 5.400000e+17 | #AAPL OR @Apple | NaN | Nobody expects the Spanish Inquisition #AAPL |

df['sentiment'].value_counts()
3               2162
1               1219
5                423
not_relevant      82
Name: sentiment, dtype: int64
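The label distribution is heavily imbalanced: class 3 accounts for 2162 of the 3886 tweets, while 5 and not_relevant have only 423 and 82 examples. One possible mitigation, not used in the run below, is to give minority classes larger weights in the classifier:

# Optional: inverse-frequency class weights to counteract the imbalance
balanced_clf = LogisticRegression(max_iter=1000, class_weight='balanced')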
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

def preprocess_text(text):
    text = text.lower()
    # keep only alphanumeric tokens, then drop English stopwords
    text = ' '.join([word for word in text.split() if word.isalnum()])
    text = ' '.join([word for word in text.split() if word not in stop_words])
    return text

df['text_cleaned'] = df['text'].apply(preprocess_text)
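To verify what the cleaning step actually does, it helps to compare one raw tweet with its cleaned version (output not shown here):

# Inspect a tweet before and after cleaning
print(df['text'].iloc[2])          # raw tweet
print(df['text_cleaned'].iloc[2])  # lowercased, stopwords and non-alphanumeric tokens removed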
tfidf_vectorizer = TfidfVectorizer(max_features=5000, ngram_range=(1, 2))
X = tfidf_vectorizer.fit_transform(df['text_cleaned'])
y = df['sentiment']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
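Because the classes are imbalanced, a stratified split keeps the label proportions the same in the train and test sets; this is a possible variant of the split above, not the one used for the reported results:

# Optional: stratified split preserves the class distribution in both sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)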

classifier = LogisticRegression(max_iter=1000)
classifier.fit(X_train, y_train)
LogisticRegression(max_iter=1000)
y_pred = classifier.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
confusion = confusion_matrix(y_test, y_pred)
report = classification_report(y_test, y_pred)
/usr/local/lib/python3.10/dist-packages/sklearn/metrics/_classification.py:1344: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, msg_start, len(result))
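The UndefinedMetricWarning appears because the classifier never predicts the not_relevant class (its column in the confusion matrix below is all zeros), so precision for that label is 0/0. The warning can be silenced, and the fallback made explicit, with the zero_division parameter:

# Treat undefined precision/F-score as 0 instead of warning
report = classification_report(y_test, y_pred, zero_division=0)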
confusion
array([[140,  97,   3,   0],
       [ 26, 397,   1,   0],
       [ 12,  67,  20,   0],
       [  1,  14,   0,   0]])
accuracy
0.7159383033419023
print(report)
              precision    recall  f1-score   support

           1       0.78      0.58      0.67       240
           3       0.69      0.94      0.79       424
           5       0.83      0.20      0.33        99
not_relevant       0.00      0.00      0.00        15

    accuracy                           0.72       778
   macro avg       0.58      0.43      0.45       778
weighted avg       0.72      0.72      0.68       778
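Per-class recall can be read straight off the confusion matrix: each row is a true class and the diagonal entry counts correct predictions. A short check that reproduces the recall column of the report, assuming the label order 1, 3, 5, not_relevant given by classifier.classes_:

# Recall per class = diagonal / row sum of the confusion matrix
recalls = confusion.diagonal() / confusion.sum(axis=1)
for label, r in zip(classifier.classes_, recalls):
    print(f'{label}: recall = {r:.2f}')  # e.g. 140/240 ≈ 0.58 for class 1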

Task 3: Implement sentiment analysis on the given dataset in Natural Language Processing

nlp = spacy.load("en_core_web_sm")
def preprocess(text):
    doc = nlp(text)
    return [token.text for token in doc if not token.is_punct and not token.is_space]

tokenized_tweets = [preprocess(tweet) for tweet in df['text']]
# Train a CBOW Word2Vec model (sg=0) with 100-dimensional vectors on the tokenized tweets
model = Word2Vec(tokenized_tweets, vector_size=100, window=5, min_count=1, sg=0)
# Average each tweet's word vectors to get a single document embedding
document_embeddings = []

for tokenized_tweet in tokenized_tweets:
    valid_tokens = [token for token in tokenized_tweet if token in model.wv]

    if valid_tokens:
        avg_embedding = sum(model.wv[token] for token in valid_tokens) / len(valid_tokens)
        document_embeddings.append(avg_embedding)
    else:
        # no usable tokens for this tweet
        document_embeddings.append(None)
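Before using the averaged embeddings, it is worth sanity-checking the trained Word2Vec model itself; a minimal sketch that peeks at the learned vocabulary and the nearest neighbours of its most frequent token:

# Inspect the Word2Vec vocabulary and a token's nearest neighbours
print(len(model.wv.index_to_key), 'tokens in vocabulary')
sample_token = model.wv.index_to_key[0]             # most frequent token
print(model.wv.most_similar(sample_token, topn=5))  # its closest neighbours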

Task 4: Analyze and comprehend the results obtained

tweet1_embedding = document_embeddings[0]
tweet2_embedding = document_embeddings[1]
def cosine_similarity(vec1, vec2):
    return dot(vec1, vec2) / (norm(vec1) * norm(vec2))
if tweet1_embedding is not None and tweet2_embedding is not None:
    similarity = cosine_similarity(tweet1_embedding, tweet2_embedding)
else:
    similarity = None
print(f"Similarity between tweet 1 and tweet 2: {similarity}")
Similarity between tweet 1 and tweet 2: 0.9979607462882996
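The very high similarity (≈0.998) is not surprising: averaged Word2Vec vectors trained on a small corpus with min_count=1 tend to point in similar directions, so cosine similarities between documents cluster near 1. Comparing the first tweet against a few more makes the narrow spread visible, reusing the helpers defined above:

# Compare the first tweet against the next few tweets
for i in range(1, 5):
    other = document_embeddings[i]
    if other is not None:
        print(f'tweet 0 vs tweet {i}: {cosine_similarity(tweet1_embedding, other):.4f}')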

Conclusion:

Sentiment analysis was implemented end to end on the Apple Twitter sentiment dataset. After EDA, text preprocessing (lowercasing, removal of non-alphanumeric tokens and stopwords), and TF-IDF feature engineering, a logistic regression classifier reached about 72% accuracy; it performs well on the dominant class 3, but recall drops sharply for the minority classes 5 and not_relevant, as the confusion matrix and classification report show. Word2Vec embeddings were then trained on the spaCy-tokenized tweets and averaged into document vectors, and cosine similarity between these vectors (≈0.998 for the first two tweets) gives a simple measure of document similarity.
