Outlier Handling in Clustering: A Comparative Experiment of K-Means, Robust Trimmed K-Means, and K-Means Least Trimmed Squared

Estella, Tricia; Andrita Intan Ghayatrie, Nadzla; Wibowo, Antoni

doi:http://dx.doi.org/10.12785/ijcds/XXXXXX

Journal: International Journal of Computing and Digital Systems (Preprint)

ISSN: 2210-142X

Date: 2024-03-14

Abstract:

The presence of outliers in data often leads to unsatisfactory modeling outcomes, especially when clustering algorithms are used for population segmentation and behavioral analysis. Although various outlier-resilient techniques such as DBSCAN, LDOF, and t-SNE exist, k-Means, one of the most widely used clustering algorithms, still struggles to handle outliers effectively. This paper proposes an outlier-resilient variant of the k-Means algorithm that incorporates the Least Trimmed Squares (LTS) technique as a post-processing step, referred to as k-Means LTS. Outlier trimming takes place after the grouping step, so that trimming is performed within each cluster. The algorithm is compared with ordinary k-Means and Robust Trimmed k-Means (RTKM), both employing outlier trimming. The comparison considers performance metrics, clustering results, and running time. The contribution of this research is the k-Means LTS algorithm, which outperforms the other two algorithms across all comparison parameters. With this algorithm, the presence of outliers within each cluster can be explained more easily, and the running time is notably shorter than that of RTKM. Across ten datasets of varying types, the proposed k-Means LTS consistently performs better than ordinary k-Means and RTKM.
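
The abstract does not spell out the procedure, so the following is only a minimal sketch of the general idea of trimming within each cluster after k-Means has grouped the data. It is not the authors' implementation. The function names, the fixed `trim_fraction`, the caller-supplied initial centroids, and the single refit of each centroid on the retained points are all assumptions made for illustration.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def kmeans(points, init_centroids, iters=100):
    """Plain Lloyd-style k-Means starting from caller-supplied centroids."""
    centroids = list(init_centroids)
    k = len(centroids)
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point goes to its nearest centroid.
        labels = [min(range(k), key=lambda j: sq_dist(p, centroids[j]))
                  for p in points]
        # Update step: recompute each centroid; keep the old one if empty.
        updated = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            updated.append(mean(members) if members else centroids[j])
        if updated == centroids:
            break
        centroids = updated
    return labels, centroids


def trim_per_cluster(points, labels, centroids, trim_fraction=0.2):
    """Assumed LTS-style post-processing: within each cluster, flag the
    trim_fraction of points with the largest squared distance to the
    centroid, then refit the centroid on the points that were kept."""
    outlier = [False] * len(points)
    refit = list(centroids)
    for j in range(len(centroids)):
        idx = [i for i, lab in enumerate(labels) if lab == j]
        idx.sort(key=lambda i: sq_dist(points[i], centroids[j]))
        n_keep = len(idx) - int(trim_fraction * len(idx))
        for i in idx[n_keep:]:
            outlier[i] = True
        kept = [points[i] for i in idx[:n_keep]]
        if kept:
            refit[j] = mean(kept)
    return outlier, refit


# Toy data (hypothetical): two tight groups, each with one distant point.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (6, 6),
          (20, 20), (20, 21), (21, 20), (21, 21), (27, 27)]
labels, centroids = kmeans(points, init_centroids=[(0, 0), (20, 20)])
outlier, refit = trim_per_cluster(points, labels, centroids)
```

Because trimming happens after the grouping step, every flagged point remains attached to a cluster label, which is one way to read the abstract's remark that outliers can be explained per cluster. In this sketch the trimmed fraction is fixed per cluster; how the paper chooses it is not stated in the abstract.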
