Abstract:
In the Big Data era, there is an increasing demand for techniques that can process huge volumes of data and extract valuable information from them. Duplicate records can seriously degrade data processing and data mining, so a major challenge is finding as many duplicate records as possible. Data deduplication (or redundancy removal) eliminates redundant data and stores only one copy of each item, enabling single-instance storage. The central idea of this work is to use K-Means clustering for big data deduplication. K-Means is a local optimization approach and is therefore sensitive to the choice of initial cluster centers: if a poor center is chosen as the starting point, the algorithm produces more errors and worse clustering outcomes. The proposed deduplication solution converts the dataset to numeric form and pre-processes it to extract features, which Dynamic K-Means clustering (DKMEAN) then uses to group replicated chunks. The proposed system substantially improves dataset quality and ultimately reduces resource consumption. It outperformed traditional K-Means (TKMEAN) in the number of detected redundant chunks, accuracy, number of iterations, and efficiency.
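The pipeline described above (numeric conversion of chunks, clustering, then duplicate detection within each cluster) can be sketched as follows. This is a minimal illustrative sketch, not the paper's DKMEAN algorithm: the `vectorize` feature scheme is an assumption, the K-Means uses a naive first-k seeding (precisely the initialization sensitivity the abstract criticizes), and duplicates are found by exact matching within clusters so that comparisons are restricted to one cluster at a time.

```python
from collections import defaultdict

def vectorize(chunk: str):
    """Naive numeric conversion of a text chunk (illustrative, not the paper's scheme)."""
    return (float(len(chunk)), sum(ord(c) for c in chunk) / max(len(chunk), 1))

def kmeans(points, k, iters=50):
    """Plain Lloyd's K-Means with centers seeded from the first k points.

    DKMEAN, per the abstract, chooses centers dynamically instead of
    relying on a fixed starting point like this.
    """
    centers = [points[i] for i in range(k)]
    assign = [0] * len(points)
    for _ in range(iters):
        changed = False
        # Assignment step: nearest center by squared Euclidean distance.
        for idx, p in enumerate(points):
            best = min(range(k),
                       key=lambda j: (p[0] - centers[j][0]) ** 2
                                   + (p[1] - centers[j][1]) ** 2)
            if best != assign[idx]:
                assign[idx] = best
                changed = True
        # Update step: move each center to the mean of its members.
        for j in range(k):
            members = [points[i] for i in range(len(points)) if assign[i] == j]
            if members:
                centers[j] = (sum(m[0] for m in members) / len(members),
                              sum(m[1] for m in members) / len(members))
        if not changed:
            break
    return assign

def find_duplicates(chunks, k=2):
    """Cluster chunks, then compare chunks only within the same cluster."""
    assign = kmeans([vectorize(c) for c in chunks], k)
    buckets = defaultdict(list)
    for i, a in enumerate(assign):
        buckets[a].append(i)
    dups = []
    for ids in buckets.values():
        seen = {}
        for i in ids:
            if chunks[i] in seen:
                dups.append((seen[chunks[i]], i))  # (first occurrence, duplicate)
            else:
                seen[chunks[i]] = i
    return dups

chunks = ["alpha", "beta", "alpha", "gamma", "beta"]
print(find_duplicates(chunks))  # → [(0, 2), (1, 4)]
```

The clustering step does not detect duplicates by itself; its role is to limit pairwise comparisons to chunks that land in the same cluster, which is what reduces resource consumption on large datasets.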