In [1]:
import os
import re    # regular expressions
import string    # string manipulation
from collections import Counter
import copy
import numpy as np
import scipy
import pandas
import polars as pl
import matplotlib.pyplot as plt

# Multinomial Naive Bayes classification on text data

The goal of this exercise is to use a naive Bayes model to classify text files. Each text file is given a category ("business", "entertainment", "politics", "sport", "tech"), and the goal is to infer this category from an analysis of the text.

In [2]:
path_dataset = "bbc"
path_stopwords = "bbc/stopwords-en.txt"
list_classes = ["business", "entertainment", "politics", "sport", "tech"]
n_classes = len(list_classes)

**Preliminary question**

Open some of the text files and take a look at the data.

## Load the data

First, the data must be loaded. Start with the "stop-words": words that are so common in English that they are not expected to carry any crucial information about the theme of a text. Then load the text files themselves.

**Question 1**

Store the content of the stopwords file into a set of strings. Add to it the empty string "".

One can use the string method `rstrip` to remove the trailing blank characters.
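As an illustration of Question 1, here is a minimal sketch that assumes the stop-words file holds one word per line and is UTF-8 encoded. The function name `load_stopwords` and the demonstration file are hypothetical; the real file is the one referenced by `path_stopwords`.

```python
from pathlib import Path
import tempfile


def load_stopwords(path):
    """Return the stop-words found in `path` (one per line) as a set of strings."""
    with open(path, encoding="utf-8") as f:
        # rstrip() removes the trailing newline and any trailing blanks
        stopwords = {line.rstrip() for line in f}
    stopwords.add("")  # the empty string is treated as a stop-word too
    return stopwords


# Demonstration on a tiny made-up file (not the real stop-words list)
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "stopwords-demo.txt"
    demo_path.write_text("the\nand \nof\n", encoding="utf-8")
    demo_stopwords = load_stopwords(demo_path)
```

A set is convenient here because membership tests (`word in stopwords`) take constant time on average.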

**Question 2**

Write a function loading the dataset into a dictionary `dct_data`:
 * whose keys are the class names;
 * whose values are the lists of the contents of the text files.

For instance, `dct_data["business"]` returns a list of strings, where each string is the content of a file (in the folder "business").

One can use the function `os.path.join` to build paths to files, `os.listdir` to list all the files contained in a directory and the file method `read` to gather the entire content of a file.
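The loading step could look like the sketch below. It assumes one sub-folder per class directly under the dataset folder. The function name `load_dataset`, the `errors="replace"` choice (to tolerate files that are not valid UTF-8), and the demonstration folders are assumptions made for this example.

```python
import os
import tempfile


def load_dataset(path_dataset, list_classes):
    """Return {class name: list of file contents} for each class sub-folder."""
    dct_data = {}
    for cls in list_classes:
        folder = os.path.join(path_dataset, cls)
        texts = []
        # sorted() makes the order of the texts reproducible
        for filename in sorted(os.listdir(folder)):
            path_file = os.path.join(folder, filename)
            with open(path_file, encoding="utf-8", errors="replace") as f:
                texts.append(f.read())
        dct_data[cls] = texts
    return dct_data


# Demonstration on a tiny made-up dataset written to a temporary folder
demo_files = {
    "sport": {"001.txt": "match report", "002.txt": "transfer news"},
    "tech": {"001.txt": "new phone"},
}
with tempfile.TemporaryDirectory() as tmp:
    for cls, files in demo_files.items():
        os.makedirs(os.path.join(tmp, cls))
        for name, content in files.items():
            with open(os.path.join(tmp, cls, name), "w", encoding="utf-8") as f:
                f.write(content)
    demo_data = load_dataset(tmp, ["sport", "tech"])
```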

## Preprocess the data

We want to perform Multinomial Naive Bayes on this dataset. To do so, we first have to preprocess the data (clean the text and prepare the data for the naive Bayes method).

**Question 3**

Write a function to remove the punctuation from a string, a function to transform all characters of a string into lower-case characters, and a function that transforms a string into the list of words it contains.

One can use the string method `translate`, the static method `str.maketrans`, the built-in string `string.punctuation`, and `re.split`.
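The three helpers might be sketched as follows. The function names are hypothetical, and splitting on runs of whitespace is one possible choice among others.

```python
import re
import string

# Translation table that deletes every character of string.punctuation
_PUNCTUATION_TABLE = str.maketrans("", "", string.punctuation)


def remove_punctuation(text):
    """Delete all ASCII punctuation characters from `text`."""
    return text.translate(_PUNCTUATION_TABLE)


def to_lower(text):
    """Convert all characters of `text` to lower case."""
    return text.lower()


def split_words(text):
    """Split `text` on runs of whitespace (spaces, tabs, newlines)."""
    return re.split(r"\s+", text)


demo_words = split_words(to_lower(remove_punctuation("Hello, World! It's 2024.\n")))
```

Note that `re.split` yields an empty string when the text starts or ends with whitespace, which is one reason to include `""` in the set of stop-words. Deleting punctuation also merges hyphenated words ("well-known" becomes "wellknown"); replacing punctuation with spaces would be an alternative design.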

**Question 4**

Process the texts to transform them into lists of words.

**Question 5**

Create the set of all the different words present in the texts, excluding the stopwords.

**Question 6**

Create the array of features of the dataset. This array must be of size n * p, where n is the number of texts and p is the number of different words in the dataset. Row i must contain the number of occurrences of each word in the text n° i: the coefficient at (i, j) is the number of occurrences of the word n° j in the text n° i. Also create the array of targets: an array of size n, where the coefficient n° i is a number representing the class of the text n° i.

Note: one should give one number to each word of the set and to each class.

The object `collections.Counter` can be very useful in this situation.
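To make the numbering of words and classes concrete, here is a sketch on a made-up corpus that has already been split into words. The variable names (`dct_words`, `word_index`, `X`, `y`) and the toy data are assumptions for this example only.

```python
from collections import Counter

import numpy as np

# Hypothetical toy corpus: for each class, a list of texts, each a list of words
dct_words = {
    "sport": [["goal", "match", "goal"], ["match", "win"]],
    "tech": [["phone", "chip", "the"]],
}
list_classes = ["sport", "tech"]
stopwords = {"the", ""}

# Set of distinct words, stop-words excluded, then one index per word
all_words = {w for texts in dct_words.values() for words in texts for w in words}
vocabulary = sorted(all_words - stopwords)
word_index = {word: j for j, word in enumerate(vocabulary)}

n = sum(len(texts) for texts in dct_words.values())
p = len(vocabulary)
X = np.zeros((n, p), dtype=np.int64)  # X[i, j] = occurrences of word j in text i
y = np.zeros(n, dtype=np.int64)       # y[i] = index of the class of text i

i = 0
for c, cls in enumerate(list_classes):
    for words in dct_words[cls]:
        for word, count in Counter(words).items():
            j = word_index.get(word)  # None for stop-words
            if j is not None:
                X[i, j] = count
        y[i] = c
        i += 1
```

On the real dataset most entries of `X` are zero, so a sparse matrix from `scipy.sparse` could be considered if memory becomes a concern.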

## By-hand multinomial naive Bayes

**Question 7**

Split the dataset into a training set and a test set (we do not need a validation set since we do not perform model selection).

The first step of naive Bayes is to find the distribution that best fits the training data. Given a class $c$, modeled by the multinomial distribution $\mathcal{M}(\theta_1^c, \theta_2^c, \cdots, \theta_p^c, p)$, we want to find the parameters $\hat{\theta}_j^c$ maximizing the likelihood of the training data. We have:
$$
\hat{\theta}_j^c = \frac{n_j^c}{\sum_{k = 1}^p n_k^c} ,
$$
where $n_j^c$ is the number of occurrences of the word n° j in the texts with label c.

We will not use this formula, but a regularized version of it:
$$
\hat{\theta}_j^c = \frac{n_j^c + \alpha}{\sum_{k = 1}^p n_k^c + \alpha p} ,
$$
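where $\alpha > 0$ ensures that no estimate $\hat{\theta}_j^c$ is zero, even for a word that never occurs in class $c$ (a zero estimate would make the log-likelihood of any text containing that word infinite), and that the denominator is never zero. We can set $\alpha = 1$.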

Compute the $\hat{\theta}_j^c$ for each class c and each word j.
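The regularized estimate can be computed class by class, as in this sketch. The function name `estimate_theta` and the toy counts are hypothetical.

```python
import numpy as np


def estimate_theta(X, y, n_classes, alpha=1.0):
    """Return theta with theta[c, j] = (n_j^c + alpha) / (sum_k n_k^c + alpha * p)."""
    p = X.shape[1]
    theta = np.zeros((n_classes, p))
    for c in range(n_classes):
        counts = X[y == c].sum(axis=0)  # n_j^c for every word j
        theta[c] = (counts + alpha) / (counts.sum() + alpha * p)
    return theta


# Toy training set: 3 texts, 3 words, 2 classes
X_train = np.array([[2, 1, 0],
                    [1, 0, 0],
                    [0, 1, 3]])
y_train = np.array([0, 0, 1])
theta = estimate_theta(X_train, y_train, n_classes=2, alpha=1.0)
```

For class 0 the word counts are (3, 1, 0), so with $\alpha = 1$ and $p = 3$ the estimates are (4/7, 2/7, 1/7); each row of `theta` sums to 1.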

In [3]:
from sklearn.model_selection import train_test_split

**Question 8**

Predict the classes of the test dataset. From a Bayesian point of view, for each data point, the chosen class should maximize the posterior distribution.
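For a text with word counts $x_1, \dots, x_p$, the log-posterior of class $c$ is, up to a term that does not depend on $c$, $\log P(c) + \sum_{j=1}^p x_j \log \hat{\theta}_j^c$. A possible sketch of the prediction step follows, with the class prior estimated from class frequencies in the training set. The function names and toy data are hypothetical.

```python
import numpy as np


def fit_multinomial_nb(X, y, n_classes, alpha=1.0):
    """Return (log_theta, log_prior) estimated from the training data."""
    p = X.shape[1]
    counts = np.stack([X[y == c].sum(axis=0) for c in range(n_classes)])
    log_theta = np.log(counts + alpha) - np.log(counts.sum(axis=1, keepdims=True) + alpha * p)
    log_prior = np.log(np.bincount(y, minlength=n_classes) / len(y))
    return log_theta, log_prior


def predict_multinomial_nb(X, log_theta, log_prior):
    """Return, for each row of X, the class maximizing the log-posterior."""
    # scores[i, c] = log P(c) + sum_j X[i, j] * log theta[c, j]
    scores = X @ log_theta.T + log_prior
    return scores.argmax(axis=1)


X_train = np.array([[2, 1, 0],
                    [1, 0, 0],
                    [0, 1, 3]])
y_train = np.array([0, 0, 1])
X_test = np.array([[3, 0, 0],
                   [0, 0, 2]])

log_theta, log_prior = fit_multinomial_nb(X_train, y_train, n_classes=2)
y_pred = predict_multinomial_nb(X_test, log_theta, log_prior)
```

Working with logarithms avoids numerical underflow: the product of many small probabilities quickly rounds to zero in floating point.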

## Multinomial naive Bayes with sklearn

**Question 9**

Split the dataset into a training set and a test set (we do not need a validation set since we do not perform model selection).

Use `sklearn.naive_bayes.MultinomialNB`: train it on the training set and compute the score on the test set. What does the score represent? Comment on the effectiveness of naive Bayes in this situation.
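The sklearn workflow can be sketched on synthetic word counts, which are an assumption made here so that the example runs without the BBC files. For a classifier, `score` returns the mean accuracy, that is, the fraction of test points whose predicted class equals the true class.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Synthetic data: 2 classes, 3 words, 50 words drawn per text
rng = np.random.default_rng(0)
theta_true = np.array([[0.7, 0.2, 0.1],
                       [0.1, 0.2, 0.7]])
y = np.repeat([0, 1], 100)
X = np.stack([rng.multinomial(50, theta_true[c]) for c in y])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

clf = MultinomialNB(alpha=1.0)  # alpha is the smoothing parameter
clf.fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)
```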

In [4]:
from sklearn.naive_bayes import MultinomialNB



# Gaussian naive Bayes

We will perform Gaussian naive Bayes on the "iris" dataset of sklearn. There are 4 features and 3 classes.

## Load the dataset

**Question 1**

Load the "Iris" dataset and split it into a training set and a test set.
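One way to do this is sketched below. The 70/30 split, the fixed `random_state`, and the stratification are arbitrary choices for this example.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# return_X_y=True gives the feature array and the target array directly
X, y = load_iris(return_X_y=True)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)
```

Stratifying on `y` keeps the three classes in the same proportions in both subsets.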

In [5]:
from sklearn.datasets import load_iris



In [6]:
from sklearn.model_selection import train_test_split



## By-hand Gaussian naive Bayes

**Question 2**

For each class, for each feature, compute the optimal parameters of the Gaussian (used to model the current feature).
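The maximum-likelihood parameters of a Gaussian are the empirical mean and the (biased) empirical variance, computed here per class and per feature. The function name and the toy data are hypothetical.

```python
import numpy as np


def fit_gaussian_params(X, y, n_classes):
    """Return (means, variances), each of shape (n_classes, n_features)."""
    means = np.stack([X[y == c].mean(axis=0) for c in range(n_classes)])
    # np.var uses ddof=0 by default, which is the maximum-likelihood estimate
    variances = np.stack([X[y == c].var(axis=0) for c in range(n_classes)])
    return means, variances


# Toy data: 4 points, 2 features, 2 classes
X_train = np.array([[1.0, 10.0],
                    [3.0, 14.0],
                    [10.0, 0.0],
                    [12.0, 2.0]])
y_train = np.array([0, 0, 1, 1])
means, variances = fit_gaussian_params(X_train, y_train, n_classes=2)
```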

**Question 3**

Compute the prior over the classes based on the population inside each class.

Predict the class of the data points belonging to the test set.
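A sketch of the prior and of the prediction step, under the naive assumption that the features are independent given the class. The function names and toy data are hypothetical, and the code assumes every class has a strictly positive variance for every feature.

```python
import numpy as np


def fit_gaussian_nb(X, y, n_classes):
    """Return per-class means, variances and class priors."""
    means = np.stack([X[y == c].mean(axis=0) for c in range(n_classes)])
    variances = np.stack([X[y == c].var(axis=0) for c in range(n_classes)])
    priors = np.bincount(y, minlength=n_classes) / len(y)  # class frequencies
    return means, variances, priors


def predict_gaussian_nb(X, means, variances, priors):
    """Return, for each row of X, the class maximizing the log-posterior."""
    diff = X[:, None, :] - means[None, :, :]  # shape (n, n_classes, n_features)
    # Sum over features of the log-density of N(mean, variance)
    log_lik = -0.5 * (np.log(2 * np.pi * variances)[None, :, :]
                      + diff**2 / variances[None, :, :]).sum(axis=2)
    return (log_lik + np.log(priors)[None, :]).argmax(axis=1)


X_train = np.array([[1.0, 10.0],
                    [3.0, 14.0],
                    [10.0, 0.0],
                    [12.0, 2.0]])
y_train = np.array([0, 0, 1, 1])
X_test = np.array([[2.0, 11.0],
                   [11.0, 0.0]])

means, variances, priors = fit_gaussian_nb(X_train, y_train, n_classes=2)
y_pred = predict_gaussian_nb(X_test, means, variances, priors)
```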

## Gaussian naive Bayes with sklearn

**Question 4**

Perform the Gaussian naive Bayes with sklearn.

In [7]:
from sklearn.naive_bayes import GaussianNB

**Question 5**

Compare with the k-nearest-neighbors (kNN) classifier with $k = 5$ neighbors by using cross-validation.
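The comparison could be set up as follows. The choice of 5 folds is an assumption, since the text does not specify the number of folds.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# One accuracy value per fold for each classifier
scores_gnb = cross_val_score(GaussianNB(), X, y, cv=5)
scores_knn = cross_val_score(KNeighborsClassifier(n_neighbors=5), X, y, cv=5)

mean_gnb = scores_gnb.mean()
mean_knn = scores_knn.mean()
```

Comparing the mean scores together with their spread across folds (`scores.std()`) gives a fairer picture than a single train/test split, especially on a dataset of only 150 points.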

# Learning from multimodal features

We will process a predictive maintenance dataset. The goal is to predict the failure type of a machine as a function of physical measurements and the type of machine. So, we are in a classification setup with multimodal features (categorical features and numerical features).

In [8]:
df = pl.read_csv("predictive_maintenance.csv")
features = ['Type',
 'Air temperature [K]',
 'Process temperature [K]',
 'Rotational speed [rpm]',
 'Torque [Nm]',
 'Tool wear [min]',
 'Target',
 'Failure Type']
df

| UDI | Product ID | Type | Air temperature [K] | Process temperature [K] | Rotational speed [rpm] | Torque [Nm] | Tool wear [min] | Target | Failure Type |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| i64 | str | str | f64 | f64 | i64 | f64 | i64 | i64 | str |
| 1 | "M14860" | "M" | 298.1 | 308.6 | 1551 | 42.8 | 0 | 0 | "No Failure" |
| 2 | "L47181" | "L" | 298.2 | 308.7 | 1408 | 46.3 | 3 | 0 | "No Failure" |
| 3 | "L47182" | "L" | 298.1 | 308.5 | 1498 | 49.4 | 5 | 0 | "No Failure" |
| 4 | "L47183" | "L" | 298.2 | 308.6 | 1433 | 39.5 | 7 | 0 | "No Failure" |
| 5 | "L47184" | "L" | 298.2 | 308.7 | 1408 | 40.0 | 9 | 0 | "No Failure" |
| … | … | … | … | … | … | … | … | … | … |
| 9996 | "M24855" | "M" | 298.8 | 308.4 | 1604 | 29.5 | 14 | 0 | "No Failure" |
| 9997 | "H39410" | "H" | 298.9 | 308.4 | 1632 | 31.8 | 17 | 0 | "No Failure" |
| 9998 | "M24857" | "M" | 299.0 | 308.6 | 1645 | 33.4 | 22 | 0 | "No Failure" |
| 9999 | "H39412" | "H" | 299.0 | 308.7 | 1408 | 48.5 | 25 | 0 | "No Failure" |


**Question 1**

Select the features `features` in the dataset.

Replace the strings in the columns `Type` and `Failure Type` with integers (which will represent the classes). One can use the method `unique` of `Series` and the method `replace_strict` of `Expr`.

Split the columns of the dataset in 3:
1. categorical features;
2. numerical features;
3. targets.

**Question 2**

Split the dataset into a train set and a test set.

**Question 3**

Perform naive Bayes on the categorical features and on the numerical features.

**Question 4**

Combine the results and compute the accuracy when taking into account both the numerical features and the categorical features.

One can access the results of the preceding naive Bayes computations by looking at the attributes and methods of `CategoricalNB` and `GaussianNB` (`class_prior_`, `predict_log_proba`, etc.).
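Under the naive assumption, the two groups of features are independent given the class, so $P(c \mid x_{cat}, x_{num}) \propto P(c \mid x_{cat}) \, P(c \mid x_{num}) / P(c)$: the two log-posteriors can be added, and the log-prior, which would otherwise be counted twice, subtracted once. The sketch below illustrates this on synthetic data; the data and variable names are assumptions for this example.

```python
import numpy as np
from sklearn.naive_bayes import CategoricalNB, GaussianNB

# Synthetic data: one categorical and one numerical feature, both weakly informative
rng = np.random.default_rng(0)
n = 600
y = rng.integers(0, 2, size=n)
X_cat = np.where(rng.random(n) < 0.7, y, 1 - y).reshape(-1, 1)  # agrees with y 70% of the time
X_num = (y + rng.normal(0.0, 1.0, size=n)).reshape(-1, 1)

train, test = slice(0, 400), slice(400, None)

cat_nb = CategoricalNB().fit(X_cat[train], y[train])
num_nb = GaussianNB().fit(X_num[train], y[train])

# log P(c | x_cat) + log P(c | x_num) - log P(c), up to a constant independent of c
log_post = (cat_nb.predict_log_proba(X_cat[test])
            + num_nb.predict_log_proba(X_num[test])
            - np.log(num_nb.class_prior_))
y_pred = num_nb.classes_[log_post.argmax(axis=1)]

acc_cat = cat_nb.score(X_cat[test], y[test])
acc_num = num_nb.score(X_num[test], y[test])
acc_combined = np.mean(y_pred == y[test])
```

The combined scores are not normalized probabilities, but normalization does not change the argmax, so it is skipped here.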

Is the combined result better than the two individual results? Why?