Misinformation spreads faster than corrections. My Fake News Detector project tackled this as an NLP classification problem: given the text of a news article, predict whether it is real or fake. In this article I walk through the complete pipeline — data preparation, TF-IDF feature extraction, model selection, ensemble construction, and hyperparameter optimisation — that produced 99.48% accuracy on a held-out test set.

Headline numbers: 99.48% test accuracy · 44K articles · 80K vocab size · <1s CPU inference.

The Dataset

I combined two publicly available datasets, the ISOT Fake News Dataset and the Kaggle Fake News corpus, for a total of 44,898 articles split roughly 50/50 between real and fake. This near-balance matters: on an unbalanced dataset a model can post artificially high accuracy while failing completely on the minority class.
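To make that concrete, here is a quick sketch of the majority-class baseline implied by the class counts reported below (23,481 real vs 21,417 fake): a classifier that always predicts the majority class scores barely above chance, so a high accuracy on this corpus cannot come from imbalance alone.

```python
# Majority-class baseline for the combined corpus (counts from the dataset).
real_count, fake_count = 23_481, 21_417
total = real_count + fake_count

# Always predicting the majority class yields only ~52.3% accuracy.
baseline = max(real_count, fake_count) / total
print(f"Majority-class baseline: {baseline:.4f}")  # 0.5230
```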

import pandas as pd

real = pd.read_csv('True.csv')
fake = pd.read_csv('Fake.csv')

real['label'] = 1
fake['label'] = 0

df = pd.concat([real, fake]).sample(frac=1, random_state=42).reset_index(drop=True)
df['text'] = df['title'] + ' ' + df['text']   # Combine title + body

print(df['label'].value_counts())
# 1    23481
# 0    21417

Text Preprocessing

Raw news text is noisy. I applied a lightweight cleaning pipeline before feature extraction:

import re

def clean_text(text):
    text = text.lower()
    text = re.sub(r'https?://\S+', '', text)    # Remove URLs
    text = re.sub(r'\[.*?\]', '', text)         # Remove bracketed text, e.g. "[via Reuters]"
    text = re.sub(r'[^a-z\s]', '', text)        # Keep letters and whitespace only
    text = re.sub(r'\s+', ' ', text).strip()    # Collapse runs of whitespace
    return text

df['text'] = df['text'].apply(clean_text)

# Quick sanity check:
print(clean_text('BREAKING: Read more at https://t.co/x [via Reuters]!'))
# breaking read more at

I deliberately did not apply stemming or stop-word removal. With a large enough TF-IDF vocabulary (80K features), the model learns which stop words are discriminative for fake news style (e.g., excessive use of "breaking", "shocking").

TF-IDF Feature Extraction

Term Frequency–Inverse Document Frequency (TF-IDF) represents each article as a sparse vector of weighted word frequencies. Words that appear frequently in one article but rarely across the corpus get high weight — they are the most discriminative features.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    df['text'], df['label'], test_size=0.2, random_state=42, stratify=df['label']
)

vectorizer = TfidfVectorizer(
    max_features=80_000,
    ngram_range=(1, 2),      # Unigrams + bigrams
    sublinear_tf=True,       # Log-scale TF — reduces impact of very frequent terms
    min_df=2,                # Ignore terms appearing in only 1 document
)

X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf  = vectorizer.transform(X_test)

Using ngram_range=(1,2) captures phrases like "breaking news" and "fake president" as single features — these bigrams are highly predictive of fake news.

Model Selection and Ensemble

I trained and evaluated three classifiers individually, then combined them via soft-voting:

from sklearn.ensemble import GradientBoostingClassifier, VotingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression

gb  = GradientBoostingClassifier(n_estimators=200, max_depth=5, random_state=42)
rf  = RandomForestClassifier(n_estimators=200, n_jobs=-1, random_state=42)
lr  = LogisticRegression(C=1.0, max_iter=1000, solver='lbfgs')

ensemble = VotingClassifier(
    estimators=[('gb', gb), ('rf', rf), ('lr', lr)],
    voting='soft'   # Average predicted probabilities — better than hard majority vote
)

ensemble.fit(X_train_tfidf, y_train)

[Chart: individual model accuracy for each of the three learners, before ensembling]

The ensemble outperforms any single model because the three learners make different errors. Combining them via soft voting averages out individual mistakes.
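Soft voting is nothing more than averaging the per-model class probabilities and taking the argmax. A sketch on synthetic data (not the news corpus), with the same three learners, confirms the manual average matches VotingClassifier's prediction:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import (GradientBoostingClassifier, RandomForestClassifier,
                              VotingClassifier)
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, random_state=42)

gb = GradientBoostingClassifier(n_estimators=20, random_state=42)
rf = RandomForestClassifier(n_estimators=20, random_state=42)
lr = LogisticRegression(max_iter=1000)

ensemble = VotingClassifier([('gb', gb), ('rf', rf), ('lr', lr)], voting='soft')
ensemble.fit(X, y)

# Soft voting = mean of predict_proba across the fitted models, then argmax.
probs = np.mean([m.predict_proba(X) for m in ensemble.estimators_], axis=0)
manual = np.argmax(probs, axis=1)

print((manual == ensemble.predict(X)).all())   # True
```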

Hyperparameter Optimisation

I used GridSearchCV with 5-fold stratified cross-validation to tune the Gradient Boosting model:

from sklearn.model_selection import GridSearchCV

param_grid = {
    'n_estimators': [100, 200],
    'max_depth': [3, 5],
    'learning_rate': [0.05, 0.1],
    'subsample': [0.8, 1.0]
}

gs = GridSearchCV(
    GradientBoostingClassifier(random_state=42),
    param_grid,
    cv=5,
    scoring='accuracy',
    n_jobs=-1
)
gs.fit(X_train_tfidf, y_train)
print(gs.best_params_)
# {'learning_rate': 0.1, 'max_depth': 5, 'n_estimators': 200, 'subsample': 0.8}
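After the search, `gs.best_estimator_` is already refit on the full training set (GridSearchCV's default `refit=True`), so it can be dropped straight into the ensemble in place of the hand-configured `gb`. A quick sketch on synthetic data, with a deliberately tiny grid so it runs fast:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=200, random_state=42)

# Deliberately tiny grid; the article's full grid works the same way.
grid = {'n_estimators': [10, 20], 'max_depth': [2, 3]}
gs = GridSearchCV(GradientBoostingClassifier(random_state=42), grid, cv=3)
gs.fit(X, y)

# best_estimator_ is refit on all of (X, y) with the winning parameters.
best_gb = gs.best_estimator_
print(gs.best_params_)
```

In the article's pipeline, `best_gb` would then be passed to `VotingClassifier` as the `'gb'` estimator.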

Model Evaluation

from sklearn.metrics import classification_report, confusion_matrix

y_pred = ensemble.predict(X_test_tfidf)
print(classification_report(y_test, y_pred))
#               precision    recall  f1-score
# 0 (Fake)       0.994       0.996     0.995
# 1 (Real)       0.996       0.994     0.995
# accuracy                             0.9948

print(confusion_matrix(y_test, y_pred))

Both precision and recall exceed 99% for each class, confirming the model is not trading one class off against the other. The confusion matrix shows only 47 misclassifications across 8,980 test articles.
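For completeness, serving a prediction is a two-step transform-then-predict call. The sketch below fits a toy stand-in pipeline (hypothetical texts, a plain LogisticRegression instead of the full ensemble) purely to show the inference shape; in the real system `vec` and `clf` would be the fitted `vectorizer` and `ensemble`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy stand-ins for the fitted vectoriser and model (labels: 1 = real, 0 = fake).
train_texts = ["official report economy growth", "shocking secret they hide truth"]
train_labels = [1, 0]

vec = TfidfVectorizer()
clf = LogisticRegression().fit(vec.fit_transform(train_texts), train_labels)

def predict_article(text: str) -> int:
    """Vectorise one cleaned article and return its predicted label."""
    return int(clf.predict(vec.transform([text]))[0])

print(predict_article("official report economy growth"))   # 1
```

Inference is a single sparse matrix-vector product plus the model's forward pass, which is why it runs in well under a second on CPU.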

Key Lessons for ML Engineers

- Balance the classes before trusting accuracy; a majority-class baseline tells you how much of the score is free.
- With a large TF-IDF vocabulary, skipping stop-word removal can help: stylistic function words carry signal for fake-news detection.
- Bigrams (ngram_range=(1, 2)) capture predictive phrases that unigrams miss.
- A soft-voting ensemble of diverse learners averages out their individual errors.
- Tune with stratified cross-validation so every fold preserves the class balance.

Conclusion

A well-engineered classical ML pipeline — TF-IDF features, a Gradient Boosting ensemble, and careful cross-validation — can achieve near-perfect accuracy on news classification without any neural networks or GPU compute. The full code is available on GitHub.