Channel: Active questions tagged python - Stack Overflow

My LogisticRegression model is producing 100 percent accuracy


I have fetched Amazon reviews for a product and am now trying to train a logistic regression model on them to categorize customer reviews. The model reports 100 percent accuracy, and I cannot work out why. Here is a sample from my dataset:

| Name | Stars | Title | Date | Description |
| --- | --- | --- | --- | --- |
| Dipam | 5 | 5.0 out of 5 stars | N/A | A very good fragrance. Recommended Seller - Sun Fragrances |
| sanket shah | 5 | 5.0 out of 5 stars | N/A | Yes |
| Manoranjidham | 5 | 5.0 out of 5 stars | N/A | This perfume is ranked No 3 .. Good one :) |
| Moukthika | 5 | 5.0 out of 5 stars | N/A | I was gifted Versace Bright on my 25th Birthday. Fragrance stays for at least for 24 hours. I love it. This is one of my best collections. |
| megh | 5 | 5.0 out of 5 stars | N/A | I have this perfume but didn't get it online..the smell is just amazing.it stays atleast for 2 days even if you take bath or wash d cloth. I have got so many compliments.. |
| riya | 5 | 5.0 out of 5 stars | N/A | Bought it from somewhere else,awesome fragrance, pure rose kind of smell stays for long,my guy loves this purchase of mine n fragrance too. |
| manisha.chauhan0091 | 5 | 5.0 out of 5 stars | N/A | Its light n long lasting i like it |
| UPS | 1 | 1.0 out of 5 stars | N/A | Absolutely fake. Fragrance barely lasts for 15 minutes. Extremely harsh on the skin as well. |
| sanaa | 1 | 1.0 out of 5 stars | N/A | a con game. fake product. dont fall for it |
| Juliana Soares Ferreira | N/A | Ótimo produto | N/A | Produto verdadeiro, com cheio da riqueza, não fixa muito, mas é delicioso. Dura na minha pele umas 3 horas e depois fica um cheirinho leve...Super recomendo |

Here is my code:

import re

import nltk
import numpy as np
import pandas as pd
from bs4 import BeautifulSoup
from nltk.sentiment import SentimentIntensityAnalyzer
from nltk.tokenize import word_tokenize
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.utils.class_weight import compute_class_weight

# Ensure necessary NLTK datasets and models are downloaded
# nltk.download('punkt')
# nltk.download('vader_lexicon')

# Load the data
df = pd.read_csv("reviews.csv")  # Make sure to replace 'reviews.csv' with your actual file path

# Preprocess data
df['Stars'] = df['Stars'].fillna(3.0)  # Handle missing values
df['Title'] = df['Title'].str.lower()  # Standardize text formats
df['Description'] = df['Description'].str.lower()
df = df.drop(['Name', 'Date'], axis=1)  # Drop unnecessary columns
print(df)

# Categorize sentiment based on star ratings
def categorize_sentiment(stars):
    if stars >= 4.0:
        return 'Positive'
    elif stars <= 2.0:
        return 'Negative'
    else:
        return 'Neutral'

df['Sentiment'] = df['Stars'].apply(categorize_sentiment)

# Clean and tokenize text
def clean_text(text):
    text = BeautifulSoup(text, "html.parser").get_text()
    letters_only = re.sub("[^a-zA-Z]", " ", text)
    return letters_only.lower()

def tokenize(text):
    return word_tokenize(text)

df['Clean_Description'] = df['Description'].apply(clean_text)
df['Tokens'] = df['Clean_Description'].apply(tokenize)

# Apply NLTK's VADER for sentiment analysis
sia = SentimentIntensityAnalyzer()

def get_sentiment(text):
    score = sia.polarity_scores(text)
    if score['compound'] >= 0.05:
        return 'Positive'
    elif score['compound'] <= -0.05:
        return 'Negative'
    else:
        return 'Neutral'

df['NLTK_Sentiment'] = df['Clean_Description'].apply(get_sentiment)
print("df['NLTK_Sentiment'].value_counts()")
print(df['NLTK_Sentiment'].value_counts())

# Prepare data for machine learning
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(tokenizer=tokenize)
X = vectorizer.fit_transform(df['Clean_Description'])
y = df['NLTK_Sentiment'].apply(lambda x: 1 if x == 'Positive' else 0)

# Split the dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=80)

# Train a Logistic Regression model
# Compute class weights
class_weights = compute_class_weight('balanced', classes=np.unique(y_train), y=y_train)
class_weights_dict = dict(enumerate(class_weights))
print(f"class_weights_dict {class_weights_dict}")

# Apply to Logistic Regression
# model = LogisticRegression(class_weight=class_weights_dict)
model = LogisticRegression(C=0.001, penalty='l2', class_weight='balanced')
model.fit(X_train, y_train)

# Predict sentiments on the test set
predictions = model.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, predictions)
precision = precision_score(y_test, predictions, average='weighted')
recall = recall_score(y_test, predictions, average='weighted')
f1 = f1_score(y_test, predictions, average='weighted')
print(f"Accuracy: {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1 Score: {f1:.4f}")

Here are the results of the print statements:

NLTK_Sentiment
Positive 8000
Negative 2000
Name: count, dtype: int64

class_weights_dict {0: 2.3696682464454977, 1: 0.6337135614702155}
Accuracy: 1.0000
Precision: 1.0000
Recall: 1.0000
F1 Score: 1.0000

I am unable to find the reason why my model is always giving 100 percent accuracy.
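
One check that might help narrow this down is a sketch only, not part of the code above. It assumes, without confirmation, that the scraped file may contain repeated reviews, so that the same text lands in both the training and the test split. The helper `duplicate_report` is a hypothetical name introduced here. It uses only the standard library and reports how many test texts also occur in the training texts.

```python
def duplicate_report(train_texts, test_texts):
    """Summarise how much the test split repeats text seen in training.

    Hypothetical helper for illustration; both arguments are iterables of strings.
    """
    train_list = list(train_texts)
    test_list = list(test_texts)
    train_seen = set(train_list)

    # Count test rows whose exact text also appears somewhere in the training rows.
    leaked = sum(1 for text in test_list if text in train_seen)

    return {
        "unique_texts": len(train_seen | set(test_list)),
        "total_texts": len(train_list) + len(test_list),
        "test_rows_seen_in_train": leaked,
        "test_leak_fraction": leaked / len(test_list) if test_list else 0.0,
    }


# Toy data standing in for the cleaned review descriptions (made up for this sketch).
toy_train = ["great smell", "fake product", "great smell"]
toy_test = ["great smell", "lasts long"]
print(duplicate_report(toy_train, toy_test))
```

To apply it to the real data, one option would be to call `train_test_split` on `df['Clean_Description']` with the same `test_size` and `random_state` as above and pass the two resulting pieces to the helper. A leak fraction close to 1.0 would suggest that the test set mostly contains reviews the model has already seen, which would make the perfect scores less surprising.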

