Channel: Active questions tagged python - Stack Overflow

Can ProcessPoolExecutor work with yield generator in Python?


I have a Python script that processes a large file and writes the results to a new txt file. I simplified it as Code example 1.

Code example 1:

from concurrent.futures import ProcessPoolExecutor
import pandas as pd
from collections import OrderedDict
import os

def process(args):
    parm1, parm2, parm3, output_file = args
    # do something ...
    output_file.write("result")

def main(large_file_path, output_path, max_processes):
    # do something ...
    with open(large_file_path, 'rt') as large_file, open(output_path, 'w') as output_file:
        arg_list = []
        for line in large_file:
            # do something ...
            arg_list.append((parm1, parm2, parm3, output_file))
        with ProcessPoolExecutor(max_processes) as executor:
            executor.map(process, arg_list, chunksize=int(max_processes / 2))

if __name__ == "__main__":
    large_file_path = "/path/to/large_file"
    output_path = f"para_scores_SI_{block_size / 1000}k.txt"
    max_processes = int(os.cpu_count() / 2)  # Set the maximum number of processes
    main(large_file_path, output_path, max_processes)

I realized that arg_list might get quite large if the input file is very large, and I am not sure there is enough free memory for it. So I tried to use a generator (yield) instead of a plain Python list, as in Code example 2, which runs normally but does not generate any output.

Code example 2:

from concurrent.futures import ProcessPoolExecutor
import pandas as pd
from collections import OrderedDict
import os

def process(args):
    parm1, parm2, parm3, output_file = args
    # do something ...
    output_file.write("result")

def main(large_file_path, output_path, max_processes):
    # do something ...
    with open(large_file_path, 'rt') as large_file, open(output_path, 'w') as output_file:
        def arg_generator(large_file, output_file):
            for line in large_file:
                # do something ...
                yield (parm1, parm2, parm3, output_file)
        with ProcessPoolExecutor(max_processes) as executor:
            executor.map(process, arg_generator(large_file, output_file),
                         chunksize=int(max_processes / 2))

if __name__ == "__main__":
    large_file_path = "/path/to/large_file"
    output_path = f"para_scores_SI_{block_size / 1000}k.txt"
    max_processes = int(os.cpu_count() / 2)  # Set the maximum number of processes
    main(large_file_path, output_path, max_processes)

I ran the code on an Ubuntu 20.04.6 LTS server with Python 3.9.18.

So can ProcessPoolExecutor work with a yield generator in Python? Or is the use of executor.map problematic? What should I do to make it work?

