Ecommerce Product Category Classification Project Using Logistic Regression
Introduction:
Logistic regression is a statistical technique that models the relationship between one or more input features and a categorical outcome. It estimates the probability of each possible outcome from the inputs and predicts the most likely one. The outcome takes a finite number of values, such as yes or no, or, as in this project, one of several product categories.
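To make the idea concrete, here is a minimal sketch of the two-outcome case: a weighted sum of the inputs is passed through the sigmoid function to get a probability, which is then thresholded at 0.5. The weights, bias, and feature values are made up for illustration; in practice they are learned from data.

```python
import math

def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1), avoiding overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def predict_proba(features, weights, bias):
    """Probability of the positive outcome for one example."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return sigmoid(z)

# Hypothetical weights and bias (a trained model would learn these).
weights = [1.5, -2.0]
bias = 0.25

p = predict_proba([1.0, 0.5], weights, bias)
label = "yes" if p >= 0.5 else "no"
print(f"probability={p:.3f}, prediction={label}")
```

For more than two categories, as in the product classifier in this post, scikit-learn's LogisticRegression generalizes the same idea to one probability per class.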
CODE 😃👇:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import nltk
from nltk.corpus import stopwords
import re
import string
# Download NLTK stopwords
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))
# Step 1: Load the Dataset
# Load the CSV file
data_path = '/path/to/amazon_reviews.csv'  # Update with your dataset path
df = pd.read_csv(data_path)
# Sample data structure — keep only relevant columns
# Ensure dataset has columns: 'product_title' and 'category'
df = df[['product_title', 'category']]
df = df.dropna()
# Step 2: Data Preprocessing
def clean_text(text):
    text = text.lower()
    text = re.sub(f'[{re.escape(string.punctuation)}]', '', text)  # Remove punctuation
    text = re.sub(r'\d+', '', text)  # Remove numbers
    text = ' '.join([word for word in text.split() if word not in stop_words])  # Remove stopwords
    return text

df['cleaned_title'] = df['product_title'].apply(clean_text)
# Step 3: Vectorization (Convert text data to numerical data)
vectorizer = TfidfVectorizer(max_features=5000)
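# Keep the sparse matrix: LogisticRegression accepts it directly, and a dense array can exhaust memory on large datasets
X = vectorizer.fit_transform(df['cleaned_title'])
y = df['category']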
# Step 4: Split Data into Training and Testing Sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Step 5: Train the Model
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
# Step 6: Make Predictions
y_pred = model.predict(X_test)
# Step 7: Evaluate the Model
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy:.4f}')
print("Classification Report:\n", classification_report(y_test, y_pred))
# Confusion matrix (optional)
conf_matrix = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:\n", conf_matrix)
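One thing to note about the script above: the TF-IDF vectorizer is fitted on the whole dataset before the train/test split, so the vocabulary and IDF weights see the test titles too. A possible alternative, sketched here under assumptions, is to wrap the vectorizer and the model in a scikit-learn pipeline so both are fitted on training text only, and the same object can then classify new titles. The titles and categories below are made-up toy data, not rows from the actual dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Made-up toy titles and categories, for illustration only.
titles = [
    "wireless bluetooth headphones",
    "noise cancelling headphones",
    "usb charging cable",
    "cotton t shirt",
    "slim fit denim jeans",
    "wool winter jacket",
]
categories = [
    "electronics", "electronics", "electronics",
    "clothing", "clothing", "clothing",
]

# The pipeline fits the vectorizer and the classifier together,
# using only the text passed to fit().
pipeline = make_pipeline(
    TfidfVectorizer(max_features=5000),
    LogisticRegression(max_iter=1000),
)
pipeline.fit(titles, categories)

# Classify unseen titles; raw strings go in, category labels come out.
new_titles = ["bluetooth headphones with usb cable", "denim jacket"]
predictions = pipeline.predict(new_titles)
probabilities = pipeline.predict_proba(new_titles)

for title, category in zip(new_titles, predictions):
    print(f"{title!r} -> {category}")
```

With the real dataset, you would split the cleaned titles first and call pipeline.fit on the training portion only, then evaluate with pipeline.predict on the test portion.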
DATASET 😃👇
https://www.kaggle.com/datasets/lakritidis/product-classification-and-categorization