
Is there a better way to use multiprocessing within a loop?


I'm quite new to multiprocessing and I'm trying to figure out whether there is a better way to use it when it has to run inside a loop - or at least I think it has to be within a loop. Let me describe the general outline, and then a better solution may be obvious.

I'm processing large spatial datasets with lots of data points. I'm trying to calculate values for smaller sub-regions of the large dataset by splitting it up with a regular grid. Each grid bin can be calculated independently of the others, which makes this perfect for parallel processing - I just need to know what data to send to each bin. Typically there will be thousands of these smaller grid bins.
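
For reference, the binning step does something like the following (a simplified sketch, not my actual spatial_binning_function - the names here are just placeholders):

    import numpy as np

    def spatial_binning_sketch(x, y, num_bins_x=100, num_bins_y=100):
        # regular grid edges spanning the data extent
        x_edges = np.linspace(x.min(), x.max(), num_bins_x + 1)
        y_edges = np.linspace(y.min(), y.max(), num_bins_y + 1)
        # np.digitize returns 1-based indices; clip so the max value lands in the last bin
        ix = np.clip(np.digitize(x, x_edges) - 1, 0, num_bins_x - 1)
        iy = np.clip(np.digitize(y, y_edges) - 1, 0, num_bins_y - 1)
        # flat bin id for each data point
        return ix * num_bins_y + iy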

However, I'm trying to track these changes over time, so I then repeat the same process for each timestep of data. Typically there will be 10-500 timesteps to process. Each timestep's data is stored in several files on disk that I have to load when processing that timestep.

Currently my code looks a bit like the following pseudo-code, and it takes about 30-60s to process a timestep. Reading and binning probably only take a few seconds of that time; the rest is the multiprocessing of each bin.

    from multiprocessing import Pool

    for t in range(num_timesteps_to_process):
        # read data files for this timestep
        data_1 = read_file1_data(path, t)
        data_2 = read_file2_data(path, t)
        data_3 = read_file3_data(path, t)

        # find what data is in each grid bin - returns data and bin_id
        binned_data = spatial_binning_function(data_1, num_bins_x=100, num_bins_y=100)

        # build one argument tuple per bin
        bin_calc_arg_list = []
        for bin_indx, data_id in enumerate(binned_data):
            bin_calc_arg_list.append((bin_indx, data_id, data_1, data_2, data_3, other_settings))

        print("Running Multiprocessing Pool")

        # create the process pool - multiprocessing
        with Pool(processes=num_processors) as pool:
            # execute a task per bin; starmap blocks until all results are back
            results = pool.starmap(multiprocess_timesteps_bins, bin_calc_arg_list)
            # close the process pool
            pool.close()
            # wait for issued tasks to complete
            pool.join()

        # collate results here and return
        ...

That's the general flow of my code, and it's quite a bit quicker than any single-threaded implementation. However, I'm aware that opening and closing the process pool inside each timestep loop is probably a really bad idea. The reading and subdivision of data does need to happen per timestep, though, as I don't think having each of several processes read the same data a few thousand times would be any better.
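
The most obvious change I can think of is creating the pool once, outside the timestep loop, and reusing it. A rough sketch of what I mean, using the same placeholder function names as above:

    from multiprocessing import Pool

    # create the pool once and reuse the worker processes for every timestep
    with Pool(processes=num_processors) as pool:
        all_results = []
        for t in range(num_timesteps_to_process):
            data_1 = read_file1_data(path, t)
            data_2 = read_file2_data(path, t)
            data_3 = read_file3_data(path, t)
            binned_data = spatial_binning_function(data_1, num_bins_x=100, num_bins_y=100)
            bin_calc_arg_list = [(bin_indx, data_id, data_1, data_2, data_3, other_settings)
                                 for bin_indx, data_id in enumerate(binned_data)]
            # starmap blocks until this timestep's bins are all done,
            # but the worker processes stay alive between timesteps
            results = pool.starmap(multiprocess_timesteps_bins, bin_calc_arg_list)
            all_results.append(results)

I'm not sure whether that's all there is to it, though.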

From a few print statements, it seems that starting up the pool takes several seconds each timestep, and perhaps some more time is spent shutting it down.

As I said, I'm not overly familiar with multiprocessing yet, so I'm not sure of the best way to set up a problem like this. I'm assuming my implementation is not how it should be done, but it works and it's faster than single-threaded.

Should I instead be using some kind of persistent queue that I open at the start, push items into during each timestep, and then close once at the end? Something like the sketch below is what I have in mind.
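
To make that concrete, this is roughly the pattern I'm imagining (an untested sketch, with the same placeholder function names as above; on platforms that spawn processes this would also need an if __name__ == "__main__": guard):

    from multiprocessing import Process, Queue

    def worker(task_queue, result_queue):
        # long-lived worker: pull argument tuples until a None sentinel arrives
        for args in iter(task_queue.get, None):
            result_queue.put(multiprocess_timesteps_bins(*args))

    task_queue, result_queue = Queue(), Queue()
    workers = [Process(target=worker, args=(task_queue, result_queue))
               for _ in range(num_processors)]
    for w in workers:
        w.start()

    for t in range(num_timesteps_to_process):
        # ... read and bin this timestep's data as before ...
        for args in bin_calc_arg_list:
            task_queue.put(args)
        # collect this timestep's results before moving on
        results = [result_queue.get() for _ in bin_calc_arg_list]

    # one sentinel per worker shuts everything down at the very end
    for _ in workers:
        task_queue.put(None)
    for w in workers:
        w.join()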

Any pointers or suggestions welcome.
