String compression - compress each string to its shortest unique substring identifiers. Algorithm takes too long to run.
Hello everyone,
I am constructing a crosswalk between the string identifiers of two datasets. To do this, I am trying to "compress" each string in both datasets to the substring(s) with the lowest number of characters that still uniquely identify it.
As an example, consider the following dataset:
| Raw ID | Compressed ID |
|---|---|
| Apple | Apple |
| Appha | pph |
| pple | pple |
| Apps | pps |
| Alpha | lp,Al |
| Apples | es |
In this example, Apples is compressed to "es" because no other ID contains "es", and among the unique substrings of Apples, "es" has the lowest number of characters.
Additionally, Alpha is compressed to "lp" and "Al", separated by a comma (,), since both unique substrings have the same number of characters.
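To make the uniqueness criterion concrete, here is a quick sanity check on the example above (plain Python substring containment):

```python
dataset = ["Apple", "Appha", "pple", "Apps", "Alpha", "Apples"]

# "es" occurs in exactly one raw ID, so it uniquely identifies "Apples".
assert [s for s in dataset if "es" in s] == ["Apples"]

# "pp" occurs in several raw IDs, so it cannot identify any of them.
assert len([s for s in dataset if "pp" in s]) > 1
```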
The idea is to use these compressed strings and apply a containment criterion across both datasets to generate the crosswalk, as sketched below.
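To illustrate what I mean by a containment criterion (this is just a rough sketch of the matching step; the function name and data shapes are made up for this post):

```python
def crosswalk(compressed_a, raw_ids_b):
    """Match IDs across datasets.

    compressed_a maps each raw ID in dataset A to its comma-separated
    compressed substrings (the output format above); raw_ids_b is an
    iterable of raw IDs from dataset B. A pair is linked whenever one of
    A's compressed substrings is contained in B's raw ID.
    """
    links = []
    for id_a, compressed in compressed_a.items():
        for sub in compressed.split(","):
            for id_b in raw_ids_b:
                if sub and sub in id_b:
                    links.append((id_a, id_b))
    return links
```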
I have thought of the following simple algorithm to find these shortest unique substrings: for each string, generate every substring. For each of those substrings, check whether it also occurs as a substring of any other string; if it does not (i.e., it is unique), record its character length. Finally, for each string, keep the unique substring(s) with the lowest number of characters.
Here is the Python code I've written to implement the algorithm above:
```python
def get_substrings(string):
    """Generate all substrings of a string."""
    substrings = set()
    for i in range(len(string)):
        for j in range(i + 1, len(string) + 1):
            substrings.add(string[i:j])
    return substrings


def preprocess_dataset(dataset):
    """Generate all substrings for each string in the dataset."""
    all_substrings = {}
    for string in dataset:
        all_substrings[string] = get_substrings(string)
    return all_substrings


def get_minimal_unique_substrings(string, all_substrings):
    """Return the shortest substrings of `string` that occur in no other
    string, joined by commas."""
    min_length = len(string)
    minimal_substrings = []
    for substring in all_substrings[string]:
        # A substring is unique if it is absent from every other string's
        # substring set.
        if all(substring not in all_substrings[other_string]
               for other_string in all_substrings
               if other_string != string):
            if len(substring) < min_length:
                min_length = len(substring)
                minimal_substrings = [substring]
            elif len(substring) == min_length and substring not in minimal_substrings:
                minimal_substrings.append(substring)
    return ",".join(set(minimal_substrings))
```
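For completeness, this is how I call these functions on a toy dataset like the one above:

```python
dataset = ["Apple", "Appha", "pple", "Apps", "Alpha", "Apples"]
all_substrings = preprocess_dataset(dataset)
for s in dataset:
    print(s, "->", get_minimal_unique_substrings(s, all_substrings))
```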
However, this code takes far too long to run: my dataset consists of around 70 million observations, and the uniqueness test scans every other string's substring set for every substring of every string, so the runtime grows roughly quadratically with the number of strings (and just storing all the substring sets is memory-heavy). Does anyone have an alternative to the algorithm above, or suggestions on how to speed it up? I am also open to opinions on other string compression methods.
Thank you! Any hints or help would be appreciated. Best wishes.