Misinformation spreads faster than corrections. My Fake News Detector project tackled this as an NLP classification problem: given the text of a news article, predict whether it is real or fake. In this article I walk through the complete pipeline — data preparation, TF-IDF feature extraction, model selection, ensemble construction, and hyperparameter optimisation — that produced 99.48% accuracy on a held-out test set.
The Dataset
I combined two publicly available datasets: ISOT Fake News Dataset and the Kaggle Fake News corpus, producing 44,898 articles — roughly 50/50 real/fake. This balanced split is important: an unbalanced dataset can yield artificially high accuracy while completely failing on the minority class.
import pandas as pd
real = pd.read_csv('True.csv')
fake = pd.read_csv('Fake.csv')
real['label'] = 1
fake['label'] = 0
df = pd.concat([real, fake]).sample(frac=1, random_state=42).reset_index(drop=True)
df['text'] = df['title'] + ' ' + df['text'] # Combine title + body
print(df['label'].value_counts())
# 1 23481
# 0 21417
Text Preprocessing
Raw news text is noisy. I applied a lightweight cleaning pipeline before feature extraction:
import re

def clean_text(text):
    text = text.lower()
    text = re.sub(r'https?://\S+', '', text)  # Remove URLs
    text = re.sub(r'\[.*?\]', '', text)       # Remove bracketed text
    text = re.sub(r'[^a-z\s]', '', text)      # Keep letters only
    text = re.sub(r'\s+', ' ', text).strip()
    return text

df['text'] = df['text'].apply(clean_text)
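As a quick sanity check, here is the cleaner applied to a made-up headline (the input string is my own example, not from the dataset):

```python
import re

# clean_text as defined above
def clean_text(text):
    text = text.lower()
    text = re.sub(r'https?://\S+', '', text)  # Remove URLs
    text = re.sub(r'\[.*?\]', '', text)       # Remove bracketed text
    text = re.sub(r'[^a-z\s]', '', text)      # Keep letters only
    return re.sub(r'\s+', ' ', text).strip()

print(clean_text('BREAKING: Read more at https://bit.ly/xyz [Reuters]!'))
# breaking read more at
```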
I deliberately did not apply stemming or stop-word removal. With a large enough TF-IDF vocabulary (80K features), the model learns which stop words are discriminative for fake news style (e.g., excessive use of "breaking", "shocking").
TF-IDF Feature Extraction
Term Frequency–Inverse Document Frequency (TF-IDF) represents each article as a sparse vector of weighted word frequencies. Words that appear frequently in one article but rarely across the corpus get high weight — they are the most discriminative features.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    df['text'], df['label'], test_size=0.2, random_state=42, stratify=df['label']
)
vectorizer = TfidfVectorizer(
    max_features=80_000,
    ngram_range=(1, 2),  # Unigrams + bigrams
    sublinear_tf=True,   # Log-scale TF — reduces impact of very frequent terms
    min_df=2,            # Ignore terms appearing in only 1 document
)
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)
Using ngram_range=(1,2) captures phrases like "breaking news" and "fake president" as single features — these bigrams are highly predictive of fake news.
Model Selection and Ensemble
I trained and evaluated three classifiers individually, then combined them via soft-voting:
from sklearn.ensemble import GradientBoostingClassifier, VotingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
gb = GradientBoostingClassifier(n_estimators=200, max_depth=5, random_state=42)
rf = RandomForestClassifier(n_estimators=200, n_jobs=-1, random_state=42)
lr = LogisticRegression(C=1.0, max_iter=1000, solver='lbfgs')
ensemble = VotingClassifier(
    estimators=[('gb', gb), ('rf', rf), ('lr', lr)],
    voting='soft'  # Average predicted probabilities — better than hard majority vote
)
ensemble.fit(X_train_tfidf, y_train)
Model Accuracy on the Test Set
- Logistic Regression: 98.67%
- Random Forest: 98.91%
- Gradient Boosting: 99.12%
- Soft Voting Ensemble: 99.48%
The ensemble outperforms any single model because the three learners make different errors. Combining them via soft voting averages out individual mistakes.
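The arithmetic behind soft voting is worth spelling out, because it can flip a decision that a hard majority vote would get wrong. The probabilities below are illustrative numbers of my own, not model outputs:

```python
import numpy as np

# Per-model predicted probabilities for [fake, real] on one hypothetical article
p_gb = np.array([0.55, 0.45])  # Gradient Boosting: slightly fake
p_rf = np.array([0.60, 0.40])  # Random Forest: slightly fake
p_lr = np.array([0.05, 0.95])  # Logistic Regression: confidently real

# Hard vote: 2 of 3 models say fake. Soft vote averages probabilities instead:
avg = (p_gb + p_rf + p_lr) / 3
print(avg)           # [0.4 0.6]
print(avg.argmax())  # 1 -> real
```

Because the logistic regression is far more confident than the two tree models, the averaged probability lands on "real" even though the hard vote would say "fake".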
Hyperparameter Optimisation
I used GridSearchCV with 5-fold stratified cross-validation to tune the Gradient Boosting model:
from sklearn.model_selection import GridSearchCV
param_grid = {
    'n_estimators': [100, 200],
    'max_depth': [3, 5],
    'learning_rate': [0.05, 0.1],
    'subsample': [0.8, 1.0],
}
gs = GridSearchCV(
    GradientBoostingClassifier(random_state=42),
    param_grid,
    cv=5,
    scoring='accuracy',
    n_jobs=-1
)
gs.fit(X_train_tfidf, y_train)
print(gs.best_params_)
# {'learning_rate': 0.1, 'max_depth': 5, 'n_estimators': 200, 'subsample': 0.8}
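One convenience worth knowing: with the default `refit=True`, `GridSearchCV` retrains the winning configuration on the full training set, so the tuned model can be pulled straight out of `best_estimator_`. A minimal sketch on synthetic data (the toy grid and sample size are mine):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the TF-IDF matrix, just to show the handoff
X, y = make_classification(n_samples=200, random_state=42)
gs = GridSearchCV(
    GradientBoostingClassifier(random_state=42),
    {'n_estimators': [50, 100]},
    cv=3,
    scoring='accuracy',
)
gs.fit(X, y)

gb_tuned = gs.best_estimator_  # Already refit on all of X (refit=True default)
print(gs.best_params_)
```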
Model Evaluation
from sklearn.metrics import classification_report, confusion_matrix
y_pred = ensemble.predict(X_test_tfidf)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
#            precision  recall  f1-score
# 0 (Fake)       0.994   0.996     0.995
# 1 (Real)       0.996   0.994     0.995
# accuracy                        0.9948
Both precision and recall exceed 99% for each class, confirming that performance is not skewed toward either class. The confusion matrix shows only 47 misclassifications across 8,980 test articles.
Key Lessons for ML Engineers
- Feature engineering matters more than model choice at this scale. The jump from a basic unigram TF-IDF to a bigram TF-IDF with sublinear_tf added 0.6% accuracy.
- Ensembles are almost always worth it when you have multiple uncorrelated learners.
- Always stratify your train/test split on the label column to ensure balanced representation.
- Classical ML still wins over neural networks for tabular/structured text problems when training data is limited and inference speed matters.
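For deployment, the vectorizer and classifier can be wrapped in a single `Pipeline` so that raw text goes in and a prediction comes out in one call. A toy sketch with made-up sentences (not the article's dataset or trained models):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Toy corpus, 1 = real, 0 = fake (illustrative only)
texts = [
    'officials confirm the report',
    'shocking secret they hide',
    'senate passes the budget bill',
    'miracle cure doctors hate',
]
labels = [1, 0, 1, 0]

clf = Pipeline([('tfidf', TfidfVectorizer()), ('lr', LogisticRegression())])
clf.fit(texts, labels)

# Raw text in, label out: no separate transform step at inference time
print(clf.predict(['officials confirm the budget']))
```

This single-object design is also what makes classical pipelines easy to serialize and serve: one pickle, no GPU.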
Conclusion
A well-engineered classical ML pipeline — TF-IDF features, a Gradient Boosting ensemble, and careful cross-validation — can achieve near-perfect accuracy on news classification without any neural networks or GPU compute. The full code is available on GitHub.