Naive Bayes - Multinomial Classification

!uv pip install -q \
    numpy==2.3.2 \
    matplotlib==3.10.6 \
    seaborn==0.13.2 \
    scikit-learn==1.7.1

import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import confusion_matrix
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

%matplotlib inline

# set_theme already applies the darkgrid style, so a separate set_style call is redundant
sns.set_theme(style="darkgrid")

data = fetch_20newsgroups()

data.target_names

['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc', 'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware', 'comp.windows.x', 'misc.forsale', 'rec.autos', 'rec.motorcycles', 'rec.sport.baseball', 'rec.sport.hockey', 'sci.crypt', 'sci.electronics', 'sci.med', 'sci.space', 'soc.religion.christian', 'talk.politics.guns', 'talk.politics.mideast', 'talk.politics.misc', 'talk.religion.misc']

categories = [
    "talk.religion.misc",
    "soc.religion.christian",
    "sci.space",
    "comp.graphics",
]

train = fetch_20newsgroups(subset="train", categories=categories)
test = fetch_20newsgroups(subset="test", categories=categories)

print(train.data[5])

From: dmcgee@uluhe.soest.hawaii.edu (Don McGee)
Subject: Federal Hearing
Originator: dmcgee@uluhe
Organization: School of Ocean and Earth Science and Technology
Distribution: usa
Lines: 10

Fact or rumor....? Madalyn Murray O'Hare an atheist who eliminated the
use of the bible reading and prayer in public schools 15 years ago is now
going to appear before the FCC with a petition to stop the reading of the
Gospel on the airways of America. And she is also campaigning to remove
Christmas programs, songs, etc from the public schools. If it is true
then mail to Federal Communications Commission 1919 H Street Washington DC
20054 expressing your opposition to her request. Reference Petition number
2493.

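To use this data for machine learning, we need to convert the content of each string into a vector of numbers. We do this with a TF-IDF vectorizer and pass the resulting vectors to a multinomial naive Bayes classifier, chaining the two steps in a single pipeline.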

model = make_pipeline(TfidfVectorizer(), MultinomialNB())

model.fit(train.data, train.target)

Pipeline(steps=[('tfidfvectorizer', TfidfVectorizer()), ('multinomialnb', MultinomialNB())])


labels = model.predict(test.data)
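Because the pipeline accepts raw strings, the same predict call also works on a single new piece of text. The sketch below shows this on a tiny made-up training set so that it runs without downloading the dataset; the texts, the two label names, and the variable names are assumptions for illustration only.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Invented training data: two documents per class
toy_texts = [
    "the rocket reached orbit",
    "nasa launched a new satellite",
    "the church held a sunday service",
    "the priest read from the bible",
]
toy_labels = ["space", "space", "religion", "religion"]

# Same pipeline shape as in the notebook: vectorize, then classify
toy_model = make_pipeline(TfidfVectorizer(), MultinomialNB())
toy_model.fit(toy_texts, toy_labels)

# predict expects an iterable of strings, so wrap a single string in a list
predictions = toy_model.predict(["satellite in orbit", "bible service"])
print(predictions)
```

Words that were never seen during fitting (such as "in" here) are simply ignored by the vectorizer at prediction time.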

cm = confusion_matrix(test.target, labels)

sns.heatmap(
    cm.T,
    square=True,
    annot=True,
    fmt="d",
    cbar=False,
    xticklabels=train.target_names,
    yticklabels=train.target_names,
)

plt.xlabel("True Label")
plt.ylabel("Predicted Label")
plt.show()
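The heatmap can be complemented with a few summary numbers. In scikit-learn, `confusion_matrix(y_true, y_pred)` puts true labels on the rows and predicted labels on the columns, which is why the plot above uses `cm.T` to get true labels on the x-axis. A minimal sketch of reading overall accuracy and per-class recall off such a matrix, using small made-up label arrays rather than the newsgroups results:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Invented labels for three classes (not results from the model above)
y_true = np.array([0, 0, 0, 1, 1, 2, 2, 2, 2])
y_pred = np.array([0, 0, 1, 1, 1, 2, 2, 0, 2])

toy_cm = confusion_matrix(y_true, y_pred)  # rows: true, columns: predicted

# Correct predictions sit on the diagonal
accuracy = np.trace(toy_cm) / toy_cm.sum()

# Recall per class: correct predictions divided by the row total (true count)
recall = np.diag(toy_cm) / toy_cm.sum(axis=1)

print(toy_cm)
print(accuracy)
print(recall)
```

Applied to the notebook's `cm`, the off-diagonal cells with the largest counts show which pairs of newsgroups the classifier confuses most often.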