Misinformation spreads faster than corrections. My Fake News Detector project tackled this as an NLP classification problem: given the text of a news article, predict whether it is real or fake. In this article I walk through the complete pipeline — data preparation, TF-IDF feature extraction, model selection, ensemble construction, and hyperparameter optimisation — that produced 99.48% accuracy on a held-out test set.
The Dataset
I combined two publicly available datasets: ISOT Fake News Dataset and the Kaggle Fake News corpus, producing 44,898 articles — roughly 50/50 real/fake. This balanced split is important: an unbalanced dataset can yield artificially high accuracy while completely failing on the minority class.
import pandas as pd
real = pd.read_csv('True.csv')
fake = pd.read_csv('Fake.csv')
real['label'] = 1
fake['label'] = 0
df = pd.concat([real, fake]).sample(frac=1, random_state=42).reset_index(drop=True)
df['text'] = df['title'] + ' ' + df['text'] # Combine title + body
print(df['label'].value_counts())
# 1 23481
# 0 21417
Text Preprocessing
Raw news text is noisy. I applied a lightweight cleaning pipeline before feature extraction:
import re

def clean_text(text):
    text = text.lower()
    text = re.sub(r'https?://\S+', '', text)  # Remove URLs
    text = re.sub(r'\[.*?\]', '', text)       # Remove bracketed text
    text = re.sub(r'[^a-z\s]', '', text)      # Keep letters only
    text = re.sub(r'\s+', ' ', text).strip()
    return text

df['text'] = df['text'].apply(clean_text)
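As a quick sanity check, here is the cleaner applied to a made-up headline (the input string is my own example, not from the dataset):

```python
import re

# clean_text as defined above
def clean_text(text):
    text = text.lower()
    text = re.sub(r'https?://\S+', '', text)  # Remove URLs
    text = re.sub(r'\[.*?\]', '', text)       # Remove bracketed text
    text = re.sub(r'[^a-z\s]', '', text)      # Keep letters only
    return re.sub(r'\s+', ' ', text).strip()

print(clean_text('BREAKING: Read more at https://bit.ly/xyz [Reuters]!'))
# breaking read more at
```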
I deliberately did not apply stemming or stop-word removal. With a large enough TF-IDF vocabulary (80K features), the model learns which stop words are discriminative for fake news style (e.g., excessive use of "breaking", "shocking").
TF-IDF Feature Extraction
Term Frequency–Inverse Document Frequency (TF-IDF) represents each article as a sparse vector of weighted word frequencies. Words that appear frequently in one article but rarely across the corpus get high weight — they are the most discriminative features.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    df['text'], df['label'], test_size=0.2, random_state=42, stratify=df['label']
)
vectorizer = TfidfVectorizer(
    max_features=80_000,
    ngram_range=(1, 2),  # Unigrams + bigrams
    sublinear_tf=True,   # Log-scale TF — reduces impact of very frequent terms
    min_df=2,            # Ignore terms appearing in only 1 document
)
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)
Using ngram_range=(1,2) captures phrases like "breaking news" and "fake president" as single features — these bigrams are highly predictive of fake news.
Model Selection and Ensemble
I trained and evaluated three classifiers individually, then combined them via soft-voting:
from sklearn.ensemble import GradientBoostingClassifier, VotingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
gb = GradientBoostingClassifier(n_estimators=200, max_depth=5, random_state=42)
rf = RandomForestClassifier(n_estimators=200, n_jobs=-1, random_state=42)
lr = LogisticRegression(C=1.0, max_iter=1000, solver='lbfgs')
ensemble = VotingClassifier(
    estimators=[('gb', gb), ('rf', rf), ('lr', lr)],
    voting='soft'  # Average predicted probabilities — better than hard majority vote
)
ensemble.fit(X_train_tfidf, y_train)
Model Accuracy on the Test Set
- Logistic Regression: 98.67%
- Random Forest: 98.91%
- Gradient Boosting: 99.12%
- Soft Voting Ensemble: 99.48%
The ensemble outperforms any single model because the three learners make different errors. Combining them via soft voting averages out individual mistakes.
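The arithmetic behind soft voting is worth spelling out, because it can flip a decision that a hard majority vote would get wrong. The probabilities below are illustrative numbers of my own, not model outputs:

```python
import numpy as np

# Per-model predicted probabilities for [fake, real] on one hypothetical article
p_gb = np.array([0.55, 0.45])  # Gradient Boosting: slightly fake
p_rf = np.array([0.60, 0.40])  # Random Forest: slightly fake
p_lr = np.array([0.05, 0.95])  # Logistic Regression: confidently real

# Hard vote: 2 of 3 models say fake. Soft vote averages probabilities instead:
avg = (p_gb + p_rf + p_lr) / 3
print(avg)           # [0.4 0.6]
print(avg.argmax())  # 1 -> real
```

Because the logistic regression is far more confident than the two tree models, the averaged probability lands on "real" even though the hard vote would say "fake".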
Hyperparameter Optimisation
I used GridSearchCV with 5-fold stratified cross-validation to tune the Gradient Boosting model:
from sklearn.model_selection import GridSearchCV
param_grid = {
    'n_estimators': [100, 200],
    'max_depth': [3, 5],
    'learning_rate': [0.05, 0.1],
    'subsample': [0.8, 1.0],
}
gs = GridSearchCV(
    GradientBoostingClassifier(random_state=42),
    param_grid,
    cv=5,
    scoring='accuracy',
    n_jobs=-1
)
gs.fit(X_train_tfidf, y_train)
print(gs.best_params_)
# {'learning_rate': 0.1, 'max_depth': 5, 'n_estimators': 200, 'subsample': 0.8}
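One convenience worth knowing: with the default `refit=True`, `GridSearchCV` retrains the winning configuration on the full training set, so the tuned model can be pulled straight out of `best_estimator_`. A minimal sketch on synthetic data (the toy grid and sample size are mine):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the TF-IDF matrix, just to show the handoff
X, y = make_classification(n_samples=200, random_state=42)
gs = GridSearchCV(
    GradientBoostingClassifier(random_state=42),
    {'n_estimators': [50, 100]},
    cv=3,
    scoring='accuracy',
)
gs.fit(X, y)

gb_tuned = gs.best_estimator_  # Already refit on all of X (refit=True default)
print(gs.best_params_)
```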
Model Evaluation
from sklearn.metrics import classification_report, confusion_matrix
y_pred = ensemble.predict(X_test_tfidf)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
#            precision  recall  f1-score
# 0 (Fake)       0.994   0.996     0.995
# 1 (Real)       0.996   0.994     0.995
# accuracy                        0.9948
Both precision and recall exceed 99% for each class, confirming that performance is not skewed toward either class. The confusion matrix shows only 47 misclassifications across 8,980 test articles.
Key Lessons for ML Engineers
- Feature engineering matters more than model choice at this scale. The jump from a basic unigram TF-IDF to a bigram TF-IDF with sublinear_tf added 0.6% accuracy.
- Ensembles are almost always worth it when you have multiple uncorrelated learners.
- Always stratify your train/test split on the label column to ensure balanced representation.
- Classical ML still wins over neural networks for tabular/structured text problems when training data is limited and inference speed matters.
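For deployment, the vectorizer and classifier can be wrapped in a single `Pipeline` so that raw text goes in and a prediction comes out in one call. A toy sketch with made-up sentences (not the article's dataset or trained models):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Toy corpus, 1 = real, 0 = fake (illustrative only)
texts = [
    'officials confirm the report',
    'shocking secret they hide',
    'senate passes the budget bill',
    'miracle cure doctors hate',
]
labels = [1, 0, 1, 0]

clf = Pipeline([('tfidf', TfidfVectorizer()), ('lr', LogisticRegression())])
clf.fit(texts, labels)

# Raw text in, label out: no separate transform step at inference time
print(clf.predict(['officials confirm the budget']))
```

This single-object design is also what makes classical pipelines easy to serialize and serve: one pickle, no GPU.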
Conclusion
A well-engineered classical ML pipeline — TF-IDF features, a Gradient Boosting ensemble, and careful cross-validation — can achieve near-perfect accuracy on news classification without any neural networks or GPU compute. The full code is available on GitHub.