Building a Text Analyzer with Python and Natural Language Processing (NLP)

Natural Language Processing (NLP) is an exciting field that gives code the ability to understand and work with human language.

In this article, we’ll walk through building a basic text analyzer. We’ll focus on extracting keywords, performing sentiment analysis, and creating simple text summaries. We’ll use two popular NLP libraries, NLTK (the Natural Language Toolkit) and SpaCy, and create a dataset of chat messages for analysis.

Let’s set up our environment.

Setting Up a Virtual Environment

To keep things organized and dependencies separate, we’ll use a virtual environment. Here’s how to set it up:

Open your terminal (or Command Prompt on Windows).

Run the following commands:

# For Windows
python -m venv nlp_env
nlp_env\Scripts\activate

# For Linux / macOS
python3 -m venv nlp_env
source nlp_env/bin/activate

Install the required libraries for our project.

pip install nltk spacy pandas

SpaCy also requires downloading a language model, which we can do with:

python -m spacy download en_core_web_sm

NOTE 1: The en_core_web_sm model for SpaCy is approximately 12 MB in size. It is a lightweight model with vocabulary, syntax, and entities, making it ideal for general-purpose NLP tasks that don’t require a large amount of memory or processing power.

If you need more advanced features, SpaCy also offers larger models, like en_core_web_md (about 50 MB) and en_core_web_lg (about 790 MB), which include additional word vectors and may provide better accuracy for tasks requiring more context or detail.

NOTE 2: SpaCy installs multiple dependencies and several large machine learning packages, which results in a long installation time — just give it a few minutes…

Step 1: Creating a Fake Chat Dataset

In this project, we’ll analyze chat messages about a fictional product called “TechWidget.” We’ll create a dataset of 60 fake comments.

Save the following data in a file named chat_data.csv:

username,message
user1,"I love TechWidget! It's amazing!"
user2,"TechWidget could be better. The battery life is too short."
user3,"Just bought TechWidget and it's incredible for the price."
user4,"I'm not impressed with TechWidget. Expected more."
user5,"TechWidget works well for what I need. Good value."
user6,"Battery life on TechWidget is disappointing."
user7,"TechWidget is so easy to use. Perfect for beginners."
user8,"I think TechWidget needs a software update."
user9,"TechWidget saved me a lot of time!"
user10,"I wouldn't recommend TechWidget. Too many issues."
user11,"TechWidget has changed the way I work. Great product!"
user12,"Not happy with TechWidget. It doesn't work as expected."
user13,"TechWidget is okay but not worth the hype."
user14,"Love the sleek design of TechWidget!"
user15,"TechWidget is super useful for daily tasks."
user16,"My TechWidget broke after a month. Very disappointing."
user17,"TechWidget is affordable and does the job."
user18,"Would buy TechWidget again. Very reliable."
user19,"TechWidget could use more features."
user20,"I had a great experience with TechWidget support."
user21,"TechWidget makes things so much easier for me."
user22,"Not worth the money. TechWidget didn't meet my needs."
user23,"TechWidget has great functionality!"
user24,"I wish TechWidget had a better battery."
user25,"TechWidget is very responsive and quick."
user26,"Just bought TechWidget, and it's fantastic!"
user27,"TechWidget could use more customization options."
user28,"Absolutely love TechWidget. Worth every penny."
user29,"TechWidget is solid but has room for improvement."
user30,"TechWidget support was very helpful with my questions."
user31,"I think TechWidget is overrated."
user32,"TechWidget has been a game-changer for my work."
user33,"I'm returning my TechWidget. It didn’t meet my expectations."
user34,"TechWidget works as advertised. No complaints."
user35,"TechWidget is reliable and easy to set up."
user36,"TechWidget is amazing but expensive."
user37,"Happy with my purchase of TechWidget."
user38,"TechWidget software needs a lot of updates."
user39,"TechWidget has a great battery life for my needs."
user40,"TechWidget saved me time, but it's too bulky."
user41,"TechWidget quality is very good."
user42,"I expected more from TechWidget after reading reviews."
user43,"TechWidget performs well under heavy use."
user44,"Not happy with the recent TechWidget update."
user45,"TechWidget is good for its price range."
user46,"I don't like the new interface on TechWidget."
user47,"TechWidget exceeded my expectations."
user48,"TechWidget makes my work much more efficient."
user49,"Not sure if I would recommend TechWidget."
user50,"TechWidget is exactly what I needed!"
user51,"The battery life on TechWidget could be better."
user52,"TechWidget is perfect for beginners like me."
user53,"TechWidget performance is fantastic!"
user54,"I love using TechWidget for my daily tasks."
user55,"TechWidget needs more features to be worth the price."
user56,"TechWidget works well but feels a bit slow at times."
user57,"Happy with TechWidget. It's very reliable."
user58,"TechWidget could improve in terms of design."
user59,"TechWidget makes my workflow much faster."
user60,"I'm disappointed with TechWidget's customer support."

This CSV file contains two columns: username and message. The message field will be our main target for analysis. You can adjust the messages as needed.

Step 2: Loading the Dataset and Preparing for Analysis

We’ll use Python’s pandas library to load and manipulate the data. In a file named text_analyzer.py, add the following code:

import pandas as pd

# Load the dataset
data = pd.read_csv("chat_data.csv")

# Display the first few records
print(data.head())

Running this script should display the first few messages, helping us confirm that the data loaded correctly.
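Before moving on, it can also help to sanity-check the loaded frame: confirm its shape and make sure no messages are missing. Here’s a minimal sketch, using an inline two-row sample in place of the real chat_data.csv so it runs on its own:

```python
import io
import pandas as pd

# Two sample rows standing in for chat_data.csv (a hypothetical subset)
sample = io.StringIO(
    'username,message\n'
    'user1,"I love TechWidget! It\'s amazing!"\n'
    'user2,"TechWidget could be better. The battery life is too short."\n'
)
data = pd.read_csv(sample)

# Confirm shape and check for missing values before analysis
print(data.shape)         # (2, 2)
print(data.isna().sum())  # missing values per column
```

With the real file, `data.shape` should report 60 rows; a nonzero count from `isna().sum()` would mean some rows failed to parse.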

Step 3: Analyzing Product Mentions with SpaCy

To extract keywords like “TechWidget” from each message, we’ll use SpaCy. Start by initializing a SpaCy model and defining a function to extract product mentions.

Add the following code to text_analyzer.py:

import spacy

# Load SpaCy's English model
nlp = spacy.load("en_core_web_sm")

def extract_product_mentions(text):
    doc = nlp(text)
    products = [ent.text for ent in doc.ents if ent.label_ == "PRODUCT"]
    return products

# Apply the function to the messages
data['product_mentions'] = data['message'].apply(extract_product_mentions)

print(data[['message', 'product_mentions']])

The extract_product_mentions function processes each message and collects entities labeled PRODUCT. Keep in mind that the small English model’s named-entity recognizer may not tag an invented name like “TechWidget” consistently, so some of the resulting lists may be empty. Running this code will display each message alongside the product names the model identified.

Step 4: Performing Sentiment Analysis with NLTK

Sentiment analysis helps determine the emotional tone of each message. NLTK’s VADER (Valence Aware Dictionary and Sentiment Reasoner) is ideal for analyzing short, informal text like chat messages.

First, import and initialize the SentimentIntensityAnalyzer:

from nltk.sentiment.vader import SentimentIntensityAnalyzer
import nltk

# Download NLTK's VADER lexicon for sentiment analysis
nltk.download("vader_lexicon")
sia = SentimentIntensityAnalyzer()

# Define a function for sentiment analysis
def analyze_sentiment(text):
    score = sia.polarity_scores(text)
    return score['compound']

# Apply sentiment analysis to the messages
data['sentiment'] = data['message'].apply(analyze_sentiment)

print(data[['message', 'sentiment']])

This code calculates a sentiment score for each message, which is then stored in a new column called sentiment. The compound score provides an overall sentiment rating between -1 (negative) and 1 (positive).
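If you prefer labels over raw numbers, the compound score can be bucketed into coarse categories. A minimal sketch using the ±0.05 cutoffs commonly suggested for VADER (the exact values are a convention, not a requirement):

```python
def label_sentiment(compound):
    """Map a VADER compound score (-1..1) to a coarse label."""
    if compound >= 0.05:
        return "positive"
    if compound <= -0.05:
        return "negative"
    return "neutral"

print(label_sentiment(0.72))   # positive
print(label_sentiment(-0.61))  # negative
print(label_sentiment(0.0))    # neutral
```

You could apply this with `data['sentiment'].apply(label_sentiment)` to add a readable label column next to the numeric score.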

Step 5: Summarizing Messages

While basic text summarization is more advanced, we can create a simple summary by focusing on messages with high or low sentiment scores. These extremes often represent the most enthusiastic or critical opinions.

Add this code at the end of text_analyzer.py:

# Define thresholds for positive and negative sentiment
positive_threshold = 0.5
negative_threshold = -0.5

# Identify positive and negative messages
positive_messages = data[data['sentiment'] > positive_threshold]['message']
negative_messages = data[data['sentiment'] < negative_threshold]['message']

print("Positive Messages Summary:")
for msg in positive_messages:
    print("-", msg)

print("\nNegative Messages Summary:")
for msg in negative_messages:
    print("-", msg)

This code filters the messages based on sentiment scores. Messages with a score above 0.5 are positive, and those below -0.5 are negative. Displaying these messages provides a quick summary of the strongest feedback.
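Beyond listing the extremes, a couple of aggregate numbers can summarize the overall tone of the dataset. A small sketch, with hypothetical compound scores standing in for the real `data['sentiment']` column:

```python
import pandas as pd

# Hypothetical compound scores standing in for data['sentiment']
scores = pd.Series([0.8, -0.6, 0.1, 0.9, -0.2])

positive_threshold = 0.5
negative_threshold = -0.5

# Count how many messages fall past each threshold, and the mean tone
print("positive:", (scores > positive_threshold).sum())   # 2
print("negative:", (scores < negative_threshold).sum())   # 1
print("mean sentiment:", round(scores.mean(), 2))         # 0.2
```

The same expressions work unchanged on the real column, giving a one-line summary of how the feedback skews.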

Step 6: Running the Text Analyzer

With everything in place, run the analyzer with:

python text_analyzer.py

This will display each message along with the identified product mentions and sentiment scores. The positive and negative message summaries offer insights into user feedback trends.

Here’s the complete text_analyzer.py file that combines all the snippets for loading, analyzing, and summarizing the data:

import pandas as pd
import spacy
from nltk.sentiment.vader import SentimentIntensityAnalyzer
import nltk

# Step 1: Load the dataset
data = pd.read_csv("chat_data.csv")

# Display the first few records
print("Data preview:")
print(data.head())

# Step 2: Initialize SpaCy for product mention extraction
nlp = spacy.load("en_core_web_sm")

def extract_product_mentions(text):
    doc = nlp(text)
    products = [ent.text for ent in doc.ents if ent.label_ == "PRODUCT"]
    return products

# Apply the function to extract product mentions
data['product_mentions'] = data['message'].apply(extract_product_mentions)
print("\nData with Product Mentions:")
print(data[['message', 'product_mentions']])

# Step 3: Perform Sentiment Analysis using NLTK's VADER
nltk.download("vader_lexicon")
sia = SentimentIntensityAnalyzer()

def analyze_sentiment(text):
    score = sia.polarity_scores(text)
    return score['compound']

# Apply sentiment analysis to the messages
data['sentiment'] = data['message'].apply(analyze_sentiment)
print("\nData with Sentiment Scores:")
print(data[['message', 'sentiment']])

# Step 4: Summarize Positive and Negative Messages
positive_threshold = 0.5
negative_threshold = -0.5

# Identify positive and negative messages
positive_messages = data[data['sentiment'] > positive_threshold]['message']
negative_messages = data[data['sentiment'] < negative_threshold]['message']

print("\nPositive Messages Summary:")
for msg in positive_messages:
    print("-", msg)

print("\nNegative Messages Summary:")
for msg in negative_messages:
    print("-", msg)

The Output

Note that the message “I’m not impressed with TechWidget. Expected more.” will not be classified as negative because of the threshold we set for negative messages. In the code, we specified that only messages with a sentiment score below -0.5 count as negative. Since its score of -0.3724 is above this threshold, it was not included in the negative messages summary.

You can lower the negative threshold so that comments with milder negative sentiments are included. For example, changing the threshold to -0.3 would capture more negative comments. Here’s the updated line to adjust the threshold:

negative_threshold = -0.3

Limitations

While sentiment analysis is a powerful tool for extracting insights from textual data, it is not without its limitations — particularly when applied to short texts such as the ones used in this article.

Many sentiment analysis models, especially basic ones, rely heavily on word associations or polarity to determine sentiment. Short texts, such as individual sentences or phrases, often lack the contextual clues necessary for accurate sentiment classification.

For instance, consider the role of negations, where a single word like “not” can completely alter the sentiment. A sentence like “Today is not the end of the world” could be interpreted as negative due to the presence of “not,” even though its intended meaning is neutral or even optimistic.

Similarly, mixed sentiments within a single sentence can confound basic models. For example, “The product is good, but I don’t like the packaging” contains both a positive and a negative sentiment. Without the ability to understand the context and weight of each clause, the model might incorrectly classify the overall sentiment as entirely negative or fail to recognize the dual sentiment.

Sarcasm is another significant challenge. A statement such as “The crazy lady is happy today” could be marked as negative simply due to the presence of “crazy,” even though the sentiment intended by the user might be positive or humorous.

These examples illustrate that short texts often require a more nuanced approach to sentiment analysis, one that goes beyond simple keyword-based methods. This limitation is particularly evident in English, a language known for its rich complexity and flexible syntax, where the same words can convey vastly different sentiments depending on context.

To mitigate these challenges, advanced techniques like isolating key components of a sentence (e.g., verbs and objects), leveraging context-aware models, and analyzing sentiments within clusters of related data can provide more reliable results. However, it is crucial to understand these limitations upfront and account for them when interpreting the results of sentiment analysis in short-text scenarios.


Thank you for reading this article. We hope you found it helpful and informative. If you have any questions, or if you would like to suggest new Python code examples or topics for future tutorials, please feel free to join the discussion and leave a comment. Your feedback and suggestions are always welcome!

You can find the same tutorial on Medium.com.
