I am working on deduplication of values in a dataset. I created the dataset with `make_blobs`, using the following parameters:

```python
n_samples = 1000
n_features = 4
centers = 2
cluster_std = 1
center_box = (-10, 10)
```
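For completeness, this is how the dataset creation might look, assuming scikit-learn's `make_blobs` with the parameters above (the `random_state` is my own addition for reproducibility):

```python
from sklearn.datasets import make_blobs

# Generate 1000 samples with 4 features spread across 2 clusters.
clustering_dataset, labels = make_blobs(
    n_samples=1000,
    n_features=4,
    centers=2,
    cluster_std=1,
    center_box=(-10, 10),
    random_state=42,  # assumed, for reproducibility
)
```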
```python
import numpy as np

def duplicate_data(data, percent, var_val):
    # number of rows to duplicate
    num_rows_to_duplicate = int(len(data) * percent / 100)
    # constant offset added to every feature of a duplicated row
    variation = np.full(data.shape[1], var_val)
    indices_to_duplicate = np.random.choice(
        data.shape[0], num_rows_to_duplicate, replace=False
    )
    data_ne = data[indices_to_duplicate] + variation
    return np.vstack([data, data_ne])

clustering_dataset_noisy_ne = duplicate_data(clustering_dataset, 50, 1.5)
clustering_dataset_noisy_ne = np.random.permutation(clustering_dataset_noisy_ne)
```
Now, for this dataset containing the duplicated (non-exact) rows, I am using `recordlinkage` to find those non-exact duplicates:
```python
import pandas as pd
import recordlinkage

noisey_ne_df = pd.DataFrame(clustering_dataset_noisy_ne,
                            columns=['f1', 'f2', 'f3', 'f4'])

indexer = recordlinkage.Index().full()
pairs = indexer.index(noisey_ne_df)

comp = recordlinkage.Compare()
comp.numeric('f1', 'f1')
comp.numeric('f2', 'f2')
comp.numeric('f3', 'f3')
comp.numeric('f4', 'f4')

abc = comp.compute(pairs, noisey_ne_df)
```
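As I understand it, `Compare.numeric` defaults to a "step" comparison with `offset=0`, so values that are 1.5 apart would score 0; I believe passing `offset=1.5` would mark them as matches, though I am not certain. The step logic itself can be sketched in plain NumPy (`step_similarity` is my own illustrative helper, not part of `recordlinkage`):

```python
import numpy as np

# Illustrative step comparison: a feature scores 1 when the two
# values differ by at most `offset`, mirroring the 1.5 variation
# that duplicate_data adds to every feature of a duplicated row.
def step_similarity(a, b, offset=1.5):
    return (np.abs(a - b) <= offset).astype(int)

row_a = np.array([1.0, 2.0, 3.0, 4.0])
row_b = row_a + 1.5  # a non-exact duplicate as generated above
print(step_similarity(row_a, row_b))
```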
Now I want to use the pair scores in `abc` to remove the rows that are similar, or whose values are 1.5 apart (the variation introduced above in `clustering_dataset_noisy_ne`), and end up with a de-duplicated dataset of roughly 1000 samples. I would appreciate the help, thank you.
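One approach I imagine (a sketch only, with a hand-made `abc` standing in for the real comparison output): treat a candidate pair as a duplicate when every feature matched, then drop the second row of each matched pair from the original DataFrame.

```python
import pandas as pd

# Hypothetical similarity table shaped like recordlinkage's output:
# a MultiIndex of candidate row pairs, one 0/1 score per feature.
abc = pd.DataFrame(
    {'f1': [1, 0], 'f2': [1, 0], 'f3': [1, 1], 'f4': [1, 0]},
    index=pd.MultiIndex.from_tuples([(0, 3), (1, 2)]),
)

# A pair counts as a (fuzzy) duplicate when all four features matched.
matches = abc[abc.sum(axis=1) == abc.shape[1]]

# Keep the first record of each matched pair, drop the second.
rows_to_drop = matches.index.get_level_values(1).unique()
# noisey_ne_df_dedup = noisey_ne_df.drop(index=rows_to_drop)
```

Whether this recovers roughly 1000 samples would depend on the comparison threshold actually catching the 1.5 offsets.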