
KMeans Class

Signature

model = sp.KMeans(
    k: int = 3,
    max_iter: int = 300,
    tol: float = 1e-4,
    mini_batch: bool = False,
    batch_size: int = 1000,
)

model.fit(x: list[list[float]]) -> None
model.fit_predict(x: list[list[float]]) -> list[int]
model.predict(x: list[list[float]]) -> list[int]
model.transform(x: list[list[float]]) -> list[list[float]]

model.labels_     -> list[int]
model.centroids_  -> list[list[float]]
model.inertia_    -> float
model.n_iter_     -> int
model.n_clusters  -> int

Description

High-performance K-Means class for N-dimensional data with a scikit-learn-compatible API. Automatically switches to mini-batch mode when n > 100,000 samples.


Constructor Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| k | int | 3 | Number of clusters |
| max_iter | int | 300 | Maximum EM iterations |
| tol | float | 1e-4 | Convergence tolerance on inertia delta |
| mini_batch | bool | False | Force mini-batch mode |
| batch_size | int | 1000 | Mini-batch sample size |

Methods

fit(x)

Runs K-Means on the N-D data. Populates labels_, centroids_, and inertia_.

| Argument | Type | Description |
|----------|------|-------------|
| x | list[list[float]] | Data matrix (rows = samples, cols = features) |

fit_predict(x) -> list[int]

Equivalent to fit(x) then returning labels_.

predict(x) -> list[int]

Assign new samples to the nearest centroid (does not refit).
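
Since predict does not refit, it only re-applies the nearest-centroid rule to new samples. A minimal plain-Python sketch of that rule (the helper name is ours for illustration, not part of the seraplot API):

```python
def nearest_centroid(sample, centroids):
    """Index of the centroid closest (in squared Euclidean distance) to sample."""
    return min(range(len(centroids)),
               key=lambda j: sum((a - b) ** 2 for a, b in zip(sample, centroids[j])))

# A point near the origin maps to centroid 0, a point near (5, 5) to centroid 1.
nearest_centroid([0.1, 0.2], [[0.0, 0.0], [5.0, 5.0]])  # -> 0
```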

transform(x) -> list[list[float]]

Return Euclidean distance from each sample to each centroid (shape: n_samples × k).


Attributes

| Attribute | Type | Description |
|-----------|------|-------------|
| labels_ | list[int] | Cluster index per point (0-based) |
| centroids_ | list[list[float]] | Final centroid coordinates (k × dims) |
| inertia_ | float | Sum of squared distances to assigned centroids |
| n_iter_ | int | Number of iterations run |
| n_clusters | int | Effective number of clusters found |
| k | int | Requested k |

Examples

Basic N-D clustering

import seraplot as sp
import random

random.seed(42)
centers = [(-2, -2, 0), (2, -2, 0), (0, 2, 1)]
data = [[cx + random.gauss(0, 0.4), cy + random.gauss(0, 0.4), cz + random.gauss(0, 0.3)]
        for cx, cy, cz in centers for _ in range(300)]

model = sp.KMeans(k=3)
labels = model.fit_predict(data)

print(f"Clusters: {model.n_clusters}")
print(f"Inertia: {model.inertia_:.2f}")
print(f"Centroids: {model.centroids_}")

Combine class + chart

import seraplot as sp
import random

random.seed(0)
pts = [(random.gauss(cx, 0.3), random.gauss(cy, 0.3))
       for cx, cy in [(0,0),(3,0),(1.5,2.5)] for _ in range(500)]
x, y = zip(*pts)

model = sp.KMeans(k=3)
labels = model.fit_predict([[xi, yi] for xi, yi in zip(x, y)])

# Build a chart of the same data with the kmeans chart helper
chart = sp.kmeans(
    title="K-Means Result",
    x_values=list(x),
    y_values=list(y),
    k=3,
)
chart.show()
print(f"Inertia: {model.inertia_:.4f}")

Distance transform

import seraplot as sp

data = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [0.0, 0.0]]

model = sp.KMeans(k=2)
model.fit(data)

distances = model.transform(data)
for i, row in enumerate(distances):
    print(f"Point {i}: distances to centroids = {[f'{d:.3f}' for d in row]}")

Algorithmic Functioning

K-Means minimises the total inertia — the sum of squared distances from each point to its assigned centroid:

$$J = \sum_{i=1}^{n} \|x_i - \mu_{c(x_i)}\|^2$$

K-Means++ initialisation

The first centroid $\mu_1$ is chosen uniformly at random. Each subsequent centroid $\mu_j$ is sampled with probability proportional to $D(x)^2$, the squared distance from $x$ to the nearest already-placed centroid. This guarantees that the expected inertia at convergence is within an $O(\log k)$ factor of the optimum.
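
The seeding scheme can be sketched in pure Python (illustrative only; kmeans_pp_init is a hypothetical helper, not part of seraplot):

```python
import random

def kmeans_pp_init(points, k, seed=0):
    """K-Means++ seeding: spread initial centroids via D(x)^2 sampling."""
    rng = random.Random(seed)
    centroids = [rng.choice(points)]  # first centroid: uniform random
    while len(centroids) < k:
        # Squared distance from each point to its nearest placed centroid.
        d2 = [min(sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids)
              for p in points]
        # Sample the next centroid with probability proportional to D(x)^2.
        r = rng.uniform(0.0, sum(d2))
        acc = 0.0
        for p, w in zip(points, d2):
            acc += w
            if acc >= r:
                centroids.append(p)
                break
    return centroids
```

Because already-placed centroids have $D(x)^2 = 0$, they are (almost) never re-drawn, and far-away clusters dominate the sampling weight.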

EM iterations

  1. Assignment: $c(x_i) = \underset{k}{\arg\min}\ \|x_i - \mu_k\|^2$
  2. Update: $\mu_k = \dfrac{1}{|C_k|}\displaystyle\sum_{x_i \in C_k} x_i$

Iterations run until the inertia delta falls below tol or max_iter is reached.
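
The two steps and the stopping rule can be sketched as a plain-Python Lloyd loop (a minimal illustration of the algorithm, not seraplot's implementation):

```python
def lloyd(points, centroids, max_iter=300, tol=1e-4):
    """Alternate assignment and update steps until the inertia delta stalls."""
    def sq_dist(p, c):
        return sum((a - b) ** 2 for a, b in zip(p, c))

    prev_inertia = float("inf")
    for n_iter in range(1, max_iter + 1):
        # 1. Assignment: label each point with its nearest centroid.
        labels = [min(range(len(centroids)), key=lambda j: sq_dist(p, centroids[j]))
                  for p in points]
        # 2. Update: move each centroid to the mean of its members.
        for j in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster empties out
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
        inertia = sum(sq_dist(p, centroids[lab]) for p, lab in zip(points, labels))
        if prev_inertia - inertia < tol:  # converged: inertia delta below tol
            break
        prev_inertia = inertia
    return labels, centroids, inertia, n_iter
```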

transform(x) returns the $n \times k$ matrix of Euclidean distances from each sample to each centroid, useful for soft-assignment and feature engineering.
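
For instance, one common soft-assignment recipe (not something seraplot provides itself) is a softmax over the negated distances from a transform(x) row:

```python
import math

def soft_assign(dist_row, temperature=1.0):
    """Turn one row of transform(x) into membership weights summing to 1."""
    # Softmax over negative distances: nearer centroids get larger weights.
    weights = [math.exp(-d / temperature) for d in dist_row]
    total = sum(weights)
    return [w / total for w in weights]
```

Lowering temperature sharpens the assignment toward the nearest centroid; raising it flattens the weights toward uniform.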
