k-means cluster filling


K-means clustering, also known as fast clustering, is a clustering method that requires the number of groups to be fixed in advance. It can be used to divide all samples into several groups. If a variable containing missing values is assumed to take different values in different groups, the mean of its non-missing values within each group can be used to fill in the missing values of that variable in the same group.
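As a minimal sketch of the filling rule only (not of the clustering itself), the Python snippet below assumes the group label of every sample is already known. The function name fill_by_group_mean and the toy ages are made up for illustration and are not part of SPL or of the Titanic data.

```python
from collections import defaultdict

def fill_by_group_mean(values, groups):
    """Replace None entries in `values` with the mean of the non-missing
    entries that carry the same group label."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for v, g in zip(values, groups):
        if v is not None:
            sums[g] += v
            counts[g] += 1
    means = {g: sums[g] / counts[g] for g in counts}
    # A group with no observed value has no mean, so its gaps stay None.
    return [means.get(g) if v is None else v for v, g in zip(values, groups)]

# Toy data (hypothetical): ages with gaps and one group label per sample.
ages = [22.0, None, 26.0, 54.0, None, 58.0]
groups = [0, 0, 0, 1, 1, 1]
filled = fill_by_group_mean(ages, groups)
print(filled)  # [22.0, 24.0, 26.0, 54.0, 56.0, 58.0]
```

The two gaps receive different values (24.0 and 56.0) because they fall in different groups, which is the point of filling by group rather than by the overall mean.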

SPL provides the kmeans() function for fast clustering.

For example, use k-means to fill in the "Age" variable in the Titanic data.

Because titanic.csv contains character variables, it cannot be used directly with the k-means algorithm. The categorical data were therefore processed in advance, and the processed data were saved as titanic_impute.csv.


















A1 Import titanic_impute.csv

A2-A3 Remove the variable Age

A4 Turn the sequence table with the Age variable removed into vector form

A5 Use kmeans() to build the model and predict, dividing the data samples into 2 groups.

A6 Add the classification results for each sample to the table A1

A7 Calculate the mean Age of each group.


A8 Use the mean of each group to fill in the missing values within that group. As shown in the figure, different groups of samples have different filling values.
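For readers without an SPL environment, steps A1 to A8 can be approximated in plain Python. This is a sketch under stated assumptions, not the SPL implementation: kmeans_labels is a hypothetical helper that runs basic Lloyd iterations seeded with the first k samples, it is not SPL's kmeans(), and the rows, including the choice of Pclass and Fare as features, are made-up stand-ins for titanic_impute.csv.

```python
import math

def kmeans_labels(points, k, max_iter=100):
    """Basic Lloyd's algorithm. The first k points seed the centroids,
    a simplification that keeps this sketch deterministic."""
    centroids = [list(p) for p in points[:k]]
    labels = None
    for _ in range(max_iter):
        # Assignment step: nearest centroid by Euclidean distance.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:  # assignments stable, so stop
            break
        labels = new_labels
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# A1: hypothetical rows standing in for the imported table.
rows = [
    {"Pclass": 3, "Fare": 7.0, "Age": 20.0},
    {"Pclass": 1, "Fare": 80.0, "Age": 50.0},
    {"Pclass": 3, "Fare": 8.0, "Age": None},
    {"Pclass": 1, "Fare": 90.0, "Age": None},
    {"Pclass": 3, "Fare": 9.0, "Age": 30.0},
    {"Pclass": 1, "Fare": 70.0, "Age": 40.0},
]

# A2-A4: drop Age and turn the remaining variables into vectors.
features = [[r["Pclass"], r["Fare"]] for r in rows]

# A5: divide the samples into 2 groups.
labels = kmeans_labels(features, k=2)

# A6: attach the group label to each sample.
for r, lab in zip(rows, labels):
    r["Cluster"] = lab

# A7: mean of the non-missing Age values in each group.
cluster_means = {}
for c in set(labels):
    known = [r["Age"] for r in rows if r["Cluster"] == c and r["Age"] is not None]
    if known:
        cluster_means[c] = sum(known) / len(known)

# A8: fill each gap with the mean of its own group.
for r in rows:
    if r["Age"] is None:
        r["Age"] = cluster_means.get(r["Cluster"])

ages = [r["Age"] for r in rows]
print(ages)  # [20.0, 50.0, 25.0, 45.0, 30.0, 40.0]
```

With real data the features would normally be put on a comparable scale before clustering, since Euclidean distance is otherwise dominated by the variable with the largest range (Fare in this toy table).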