String compression - compress each string to its shortest unique substring identifiers. Algorithm takes too long to run.
Hello everyone,
I am constructing a crosswalk between the string identifiers of two datasets. To do this, I am trying to "compress" each string in both datasets to the substring(s) with the lowest number of characters that still uniquely identify it.
As an example, consider the following dataset:
| Raw ID | Compressed ID |
|---|---|
| Apple | Apple |
| Appha | pph |
| pple | pple |
| Apps | pps |
| Alpha | lp,Al |
| Apples | es |
In this example, Apples is compressed to "es" because no other ID contains "es", and among the unique substrings of Apples, "es" has the lowest number of characters.
Additionally, Alpha is compressed to "lp" and "Al", separated by a comma (,), since both unique substrings have the same number of characters.
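To make the uniqueness criterion concrete, here is a quick sanity check on the example above (plain Python substring containment):

```python
dataset = ["Apple", "Appha", "pple", "Apps", "Alpha", "Apples"]

# "es" occurs in exactly one raw ID, so it uniquely identifies "Apples".
assert [s for s in dataset if "es" in s] == ["Apples"]

# "pp" occurs in several raw IDs, so it cannot identify any of them.
assert len([s for s in dataset if "pp" in s]) > 1
```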
The idea is to use these compressed strings and apply a containment criterion across both datasets to generate the crosswalk, as sketched below.
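To illustrate what I mean by a containment criterion (this is just a rough sketch of the matching step; the function name and data shapes are made up for this post):

```python
def crosswalk(compressed_a, raw_ids_b):
    """Match IDs across datasets.

    compressed_a maps each raw ID in dataset A to its comma-separated
    compressed substrings (the output format above); raw_ids_b is an
    iterable of raw IDs from dataset B. A pair is linked whenever one of
    A's compressed substrings is contained in B's raw ID.
    """
    links = []
    for id_a, compressed in compressed_a.items():
        for sub in compressed.split(","):
            for id_b in raw_ids_b:
                if sub and sub in id_b:
                    links.append((id_a, id_b))
    return links
```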
I have thought of the following simple algorithm to find these shortest unique substrings: for each string, generate every substring. For each of those substrings, check whether it also occurs as a substring of any other string; if it does not (i.e., it is unique), record its character length. Finally, for each string, keep the unique substring(s) with the lowest number of characters.
Here is the Python code I've written to implement the algorithm above:
```python
def get_substrings(string):
    """Generate all substrings of a string."""
    substrings = set()
    for i in range(len(string)):
        for j in range(i + 1, len(string) + 1):
            substrings.add(string[i:j])
    return substrings


def preprocess_dataset(dataset):
    """Generate all substrings for each string in the dataset."""
    all_substrings = {}
    for string in dataset:
        all_substrings[string] = get_substrings(string)
    return all_substrings


def get_minimal_unique_substrings(string, all_substrings):
    """Return the shortest substrings of `string` that occur in no other
    string, joined by commas."""
    min_length = len(string)
    minimal_substrings = []
    for substring in all_substrings[string]:
        # A substring is unique if it is absent from every other string's
        # substring set.
        if all(substring not in all_substrings[other_string]
               for other_string in all_substrings
               if other_string != string):
            if len(substring) < min_length:
                min_length = len(substring)
                minimal_substrings = [substring]
            elif len(substring) == min_length and substring not in minimal_substrings:
                minimal_substrings.append(substring)
    return ",".join(set(minimal_substrings))
```
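For completeness, this is how I call these functions on a toy dataset like the one above:

```python
dataset = ["Apple", "Appha", "pple", "Apps", "Alpha", "Apples"]
all_substrings = preprocess_dataset(dataset)
for s in dataset:
    print(s, "->", get_minimal_unique_substrings(s, all_substrings))
```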
However, this code takes far too long to run: my dataset consists of around 70 million observations, and the uniqueness test scans every other string's substring set for every substring of every string, so the runtime grows roughly quadratically with the number of strings (and just storing all the substring sets is memory-heavy). Does anyone have an alternative to the algorithm above, or suggestions on how to speed it up? I am also open to opinions on other string compression methods.
Thank you! Any hints or help would be appreciated. Best wishes.