Channel: Active questions tagged python - Stack Overflow

Isolating the distances received from record linkage


I am working on deduplicating values in a dataset. I created the dataset with make_blobs, using the following parameters:

n_samples = 1000, n_features = 4, centers = 2, cluster_std = 1, center_box = (-10, 10)

I then add non-exact duplicates to it:

    import numpy as np

    def duplicate_data(data, percent, var_val):
        # Number of rows to copy, as a percentage of the dataset
        num_rows_to_duplicate = int(len(data) * percent / 100)
        # Constant offset added to every feature of a copied row
        variation = np.full(data.shape[1], var_val)
        indices_to_duplicate = np.random.choice(data.shape[0], num_rows_to_duplicate, replace=False)
        data_ne = data[indices_to_duplicate] + variation
        # Append the shifted copies to the original rows
        return np.vstack([data, data_ne])

    clustering_dataset_noisy_ne = duplicate_data(clustering_dataset, 50, 1.5)
    clustering_dataset_noisy_ne = np.random.permutation(clustering_dataset_noisy_ne)

For this dataset with non-exact duplicates, I am using recordlinkage to find those duplicates:

    import pandas as pd
    import recordlinkage

    noisey_ne_df = pd.DataFrame(clustering_dataset_noisy_ne, columns=['f1', 'f2', 'f3', 'f4'])

    # Build every candidate pair of rows within the same DataFrame
    indexer = recordlinkage.Index().full()
    indexer = indexer.index(noisey_ne_df)

    # Compare each feature of the paired rows numerically
    comp = recordlinkage.Compare()
    comp.numeric('f1', 'f1')
    comp.numeric('f2', 'f2')
    comp.numeric('f3', 'f3')
    comp.numeric('f4', 'f4')
    abc = comp.compute(indexer, noisey_ne_df)

Now I want to use the comparison results in abc to remove the rows that are near-duplicates, i.e. the rows whose values are 1.5 apart as introduced in clustering_dataset_noisy_ne above, and end up with a de-duplicated dataset of around 1000 samples. I would appreciate any help. Thank you.
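For illustration, here is one possible way to get back to roughly 1000 rows. It is a minimal sketch that works on the NumPy array directly instead of on abc, because as far as I understand the numeric comparison in recordlinkage yields similarity scores rather than raw distances. The helper name drop_shifted_duplicates, the tolerance tol, and the uniform synthetic data standing in for clustering_dataset are all assumptions made for this example, not part of the original code or of recordlinkage.

```python
import numpy as np


def drop_shifted_duplicates(data, shift=1.5, tol=1e-6):
    """Return `data` without rows that equal another row plus `shift`.

    Hypothetical helper for illustration; not part of recordlinkage.
    """
    # diff[i, j, k] = data[i, k] - data[j, k] for every pair of rows
    diff = data[:, None, :] - data[None, :, :]
    # is_copy[i, j] is True when row i is row j shifted by `shift` in all features
    is_copy = np.all(np.abs(diff - shift) <= tol, axis=2)
    # Keep only rows that are not a shifted copy of any other row
    return data[~is_copy.any(axis=1)]


# Synthetic stand-in for clustering_dataset (assumption: uniform data, fixed seed)
rng = np.random.default_rng(0)
original = rng.uniform(-10, 10, size=(1000, 4))
picked = rng.choice(len(original), size=500, replace=False)
noisy = rng.permutation(np.vstack([original, original[picked] + 1.5]))

deduplicated = drop_shifted_duplicates(noisy)
print(noisy.shape, deduplicated.shape)
```

With this setup the array should shrink from 1500 rows back to 1000. The pairwise difference array grows quadratically with the number of rows, so this is only practical for datasets of about this size. It also relies on the offset being the same known constant in every feature, which matches how duplicate_data builds the copies.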

