I have a Python function that computes products over combinations of keys from a correlation matrix. It works well when the DataFrame has a small number of columns (e.g., fewer than 95), but its performance degrades significantly as the column count increases (e.g., more than 95 columns). Even for small datasets, I have trouble computing products for combinations of more than 4 keys. I suspect there is room for improvement in both time complexity and memory efficiency. Below is the code:
    import numpy as np
    import pandas as pd
    from itertools import combinations

    # Synthetic data
    # Set seed for reproducibility
    np.random.seed(42)

    # Generate random column names
    column_names = ['test_' + str(i) for i in range(1, 1195)]

    # Generate random row names
    row_names = [f'ROW_{i}' for i in range(0, 151)]

    # Create a DataFrame with random integers between 0 and 15
    data = np.random.randint(0, 16, size=(len(row_names), len(column_names)))
    df = pd.DataFrame(data, index=row_names, columns=column_names)
    correlation_matrix = df.corr()

    def compute_products(correlation_matrix):
        out = {}
        keys = correlation_matrix.index
        for r in range(2, 5):  # Compute products for 2, 3, 4 keys at a time
            for combo in combinations(keys, r):
                prod = 1
                for i in range(len(combo)):
                    for j in range(i + 1, len(combo)):
                        prod *= correlation_matrix.loc[combo[i], combo[j]] ** 2
                out[str(combo)] = {'names': list(combo), 'prod': prod}
        return out

    bb = compute_products(correlation_matrix)
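To give a sense of the scale involved, the number of combinations the loops have to visit can be counted without running them. This is a small sketch using `math.comb`; the helper name `count_combinations` is my own and is not part of the code above. It assumes the 1,194 columns produced by `range(1, 1195)` in the synthetic data.

```python
from math import comb

def count_combinations(n_keys, sizes=(2, 3, 4)):
    """Return how many r-key combinations exist for each r in sizes."""
    return {r: comb(n_keys, r) for r in sizes}

# 1194 keys, matching the synthetic column names test_1 .. test_1194
counts = count_combinations(1194)
for r, c in counts.items():
    print(f"r={r}: {c:,} combinations")
```

For r=4 alone this is in the tens of billions, and each combination also gets its own dictionary entry, which I assume is where both the run time and the memory use come from.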
Specific Questions:
What optimisations can be applied to improve the time complexity and memory efficiency of the code, especially the compute_products function?
Are there alternative approaches or algorithms for achieving the same results with better scalability?
Additional Information:
I am using Python with pandas and NumPy. I'm most comfortable with Python, but I don't mind answers in other languages.
The code and a brief explanation of its purpose are provided above.
The generated dataset is 151 rows by more than 100 columns, but the size varies depending on the problem I'm working on.
I would appreciate any insights, suggestions, or improvements that can be made to enhance the efficiency of this code.
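For reference, here is one direction I have considered, as a rough sketch rather than a solution. It squares the correlation matrix once as a NumPy array, indexes it by integer position instead of `.loc`, and yields results lazily instead of storing everything in one dictionary. The name `iter_products` and the `sizes` parameter are my own placeholders. It does not reduce the number of combinations, so I assume it only lowers the per-combination cost and the memory footprint.

```python
from itertools import combinations

import numpy as np
import pandas as pd

def iter_products(correlation_matrix, sizes=(2, 3, 4)):
    """Lazily yield (names, prod) for every combination of keys.

    prod is the product of squared correlations over all pairs
    inside the combination, as in compute_products.
    """
    names = list(correlation_matrix.index)
    sq = correlation_matrix.to_numpy() ** 2  # square once, up front
    for r in sizes:
        for idx in combinations(range(len(names)), r):
            prod = 1.0
            for a, b in combinations(idx, 2):  # all pairs a < b in the combo
                prod *= sq[a, b]
            yield [names[i] for i in idx], prod

# Small demonstration on a 5-column frame (hypothetical data)
rng = np.random.default_rng(0)
small_corr = pd.DataFrame(
    rng.integers(0, 16, size=(20, 5)), columns=list("abcde")
).corr()
results = list(iter_products(small_corr))
print(len(results))
```

Because it is a generator, the caller can filter or aggregate results as they arrive rather than holding every combination in memory at once.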