I have a very large numpy array with entries like:
```
[['0/1' '2/0']
 ['3/0' '1/4']]
```

I want to convert it to / get a 3-D array like

```
[[[0 1] [2 0]]
 [[3 0] [1 4]]]
```

The array is very wide, so a lot of columns, but not many rows, and there are around 100 or so possibilities for the string. These aren't actually fractions, just a demonstration of what is in the file (it's genomics data, given to me in this format).
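To pin down exactly what I mean, here is a tiny pure-Python reference conversion for the example above (far too slow for the real array, but it defines the mapping I want):

```python
import numpy as np

samples = np.array([['0/1', '2/0'],
                    ['3/0', '1/4']])

# Naive reference conversion: split each 'a/b' string into [a, b].
target = np.empty(samples.shape + (2,), dtype=np.int32)
for i in range(samples.shape[0]):
    for j in range(samples.shape[1]):
        a, b = samples[i, j].split('/')
        target[i, j] = (int(a), int(b))

print(target.shape)  # (2, 2, 2)
```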
I don't want to run in parallel, as I will be running this on a single CPU before moving to a single GPU, so the extra CPUs would be idle while the GPU kernel is running.

I have tried numba:
```python
import numpy as np
from numba import njit
import time

@njit
def index_with_numba(data, int_data, indices):
    # For each possible 'a/b' string, scan the whole array for matches.
    for pos in indices:
        str_match = str(pos[0]) + '/' + str(pos[1])
        for i in range(data.shape[0]):
            for j in range(data.shape[1]):
                if data[i, j] == str_match:
                    int_data[i, j] = pos
    return int_data

def generate_masks():
    masks = []
    def _2d_array(i, j):
        return np.asarray([i, j], dtype=np.int32)
    for i in range(10):
        for j in range(10):
            masks.append(_2d_array(i, j))
    return masks

rows = 100000
cols = 200
numerators = np.random.randint(0, 10, size=(rows, cols))
denominators = np.random.randint(0, 10, size=(rows, cols))
samples = np.array(
    [f"{numerator}/{denominator}"
     for numerator, denominator in zip(numerators.flatten(), denominators.flatten())],
    dtype=str,
).reshape(rows, cols)
samples_int = np.empty((samples.shape[0], samples.shape[1], 2), dtype=np.int32)

# Generate all possible masks
masks = generate_masks()

t0 = time.time()
samples_int = index_with_numba(samples, samples_int, masks)
t1 = time.time()
print(f"Time to index {t1-t0}")
```

But it is too slow to be feasible:
```
Time to index 182.0304057598114
```

The reason I want this is that I want to write a CUDA kernel to perform an operation based on the original values, so for '0/1' I need 0 and 1, etc., but I cannot handle the strings in the kernel. I had thought perhaps masks could be used, but they don't seem to be suitable.
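I also wondered whether deduplicating first could help, since there are only ~100 distinct strings: decode each unique string once, then broadcast the result back with the inverse index. A sketch along these lines (not benchmarked, and using a smaller array here for brevity):

```python
import numpy as np

rows, cols = 1000, 200
numerators = np.random.randint(0, 10, size=(rows, cols))
denominators = np.random.randint(0, 10, size=(rows, cols))
samples = np.char.add(np.char.add(numerators.astype(str), '/'),
                      denominators.astype(str))

# Decode each distinct string once...
uniq, inv = np.unique(samples.ravel(), return_inverse=True)
pairs = np.array([s.split('/') for s in uniq], dtype=np.int32)  # (n_uniq, 2)

# ...then fan the decoded pairs back out via the inverse index.
samples_int = pairs[inv].reshape(rows, cols, 2)
```

But I don't know whether this is a sensible direction, or whether there is a more standard way to do this kind of conversion.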
Any suggestions appreciated.