I have fetched Amazon reviews for a product and am now trying to train a logistic regression model on them to categorize customer reviews. The model reports 100 percent accuracy, and I cannot work out why. Here is a sample from my dataset:
Name | Stars | Title | Date | Description |
---|---|---|---|---|
Dipam | 5 | 5.0 out of 5 stars | N/A | A very good fragrance. Recommended Seller - Sun Fragrances |
sanket shah | 5 | 5.0 out of 5 stars | N/A | Yes |
Manoranjidham | 5 | 5.0 out of 5 stars | N/A | This perfume is ranked No 3 .. Good one :) |
Moukthika | 5 | 5.0 out of 5 stars | N/A | I was gifted Versace Bright on my 25th Birthday. Fragrance stays for at least for 24 hours. I love it. This is one of my best collections. |
megh | 5 | 5.0 out of 5 stars | N/A | I have this perfume but didn't get it online..the smell is just amazing.it stays atleast for 2 days even if you take bath or wash d cloth. I have got so many compliments.. |
riya | 5 | 5.0 out of 5 stars | N/A | Bought it from somewhere else,awesome fragrance, pure rose kind of smell stays for long,my guy loves this purchase of mine n fragrance too. |
manisha.chauhan0091 | 5 | 5.0 out of 5 stars | N/A | Its light n long lasting i like it |
UPS | 1 | 1.0 out of 5 stars | N/A | Absolutely fake. Fragrance barely lasts for 15 minutes. Extremely harsh on the skin as well. |
sanaa | 1 | 1.0 out of 5 stars | N/A | a con game. fake product. dont fall for it |
Juliana Soares Ferreira | N/A | Ótimo produto | N/A | Produto verdadeiro, com cheio da riqueza, não fixa muito, mas é delicioso. Dura na minha pele umas 3 horas e depois fica um cheirinho leve...Super recomendo |
Here is my code:
import re
import nltk
import numpy as np
import pandas as pd
from bs4 import BeautifulSoup
from nltk.sentiment import SentimentIntensityAnalyzer
from nltk.tokenize import word_tokenize
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.utils.class_weight import compute_class_weight

# Ensure necessary NLTK datasets and models are downloaded
# nltk.download('punkt')
# nltk.download('vader_lexicon')

# Load the data
df = pd.read_csv("reviews.csv")  # Make sure to replace 'reviews.csv' with your actual file path

# Preprocess data
df['Stars'] = df['Stars'].fillna(3.0)  # Handle missing values
df['Title'] = df['Title'].str.lower()  # Standardize text formats
df['Description'] = df['Description'].str.lower()
df = df.drop(['Name', 'Date'], axis=1)  # Drop unnecessary columns
print(df)

# Categorize sentiment based on star ratings
def categorize_sentiment(stars):
    if stars >= 4.0:
        return 'Positive'
    elif stars <= 2.0:
        return 'Negative'
    else:
        return 'Neutral'

df['Sentiment'] = df['Stars'].apply(categorize_sentiment)

# Clean and tokenize text
def clean_text(text):
    text = BeautifulSoup(text, "html.parser").get_text()
    letters_only = re.sub("[^a-zA-Z]", " ", text)
    return letters_only.lower()

def tokenize(text):
    return word_tokenize(text)

df['Clean_Description'] = df['Description'].apply(clean_text)
df['Tokens'] = df['Clean_Description'].apply(tokenize)

# Apply NLTK's VADER for sentiment analysis
sia = SentimentIntensityAnalyzer()

def get_sentiment(text):
    score = sia.polarity_scores(text)
    if score['compound'] >= 0.05:
        return 'Positive'
    elif score['compound'] <= -0.05:
        return 'Negative'
    else:
        return 'Neutral'

df['NLTK_Sentiment'] = df['Clean_Description'].apply(get_sentiment)
print("df['NLTK_Sentiment'].value_counts()")
print(df['NLTK_Sentiment'].value_counts())

# Prepare data for machine learning
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(tokenizer=tokenize)
X = vectorizer.fit_transform(df['Clean_Description'])
y = df['NLTK_Sentiment'].apply(lambda x: 1 if x == 'Positive' else 0)

# Split the dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=80)

# Train a Logistic Regression model
# Compute class weights
class_weights = compute_class_weight('balanced', classes=np.unique(y_train), y=y_train)
class_weights_dict = dict(enumerate(class_weights))
print(f"class_weights_dict {class_weights_dict}")

# Apply to Logistic Regression
# model = LogisticRegression(class_weight=class_weights_dict)
model = LogisticRegression(C=0.001, penalty='l2', class_weight='balanced')
model.fit(X_train, y_train)

# Predict sentiments on the test set
predictions = model.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, predictions)
precision = precision_score(y_test, predictions, average='weighted')
recall = recall_score(y_test, predictions, average='weighted')
f1 = f1_score(y_test, predictions, average='weighted')
print(f"Accuracy: {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1 Score: {f1:.4f}")
Here are the results of the print statements:
NLTK_Sentiment
Positive 8000
Negative 2000
Name: count, dtype: int64
class_weights_dict {0: 2.3696682464454977, 1: 0.6337135614702155}
Accuracy: 1.0000
Precision: 1.0000
Recall: 1.0000
F1 Score: 1.0000
I am unable to find the reason why my model is always giving 100 percent accuracy.
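One sanity check that may help narrow this down, sketched under the assumption that identical review texts could be landing in both halves of the 50/50 split. Whether that is actually happening here is not established. The helper name `duplicate_leak_fraction` and the sample strings are hypothetical and only for illustration:

```python
def duplicate_leak_fraction(train_texts, test_texts):
    """Return the fraction of test texts that also appear verbatim in the training texts."""
    test_texts = list(test_texts)
    if not test_texts:
        return 0.0
    train_seen = set(train_texts)
    leaked = sum(1 for text in test_texts if text in train_seen)
    return leaked / len(test_texts)


# Made-up sample data, not taken from the real reviews.csv
train = ["a very good fragrance", "fake product", "i love it"]
test = ["fake product", "its light n long lasting", "i love it", "yes"]

leak = duplicate_leak_fraction(train, test)
print(f"Fraction of test reviews also present in train: {leak:.2f}")
```

With the script above, the same helper could be applied to `df['Clean_Description']` after splitting it with the same `test_size` and `random_state`. A value close to 1.0 would suggest the test set largely repeats the training set, which is one way a perfect score could arise.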