I have a Python script that processes a large file and writes the results to a new txt file. I simplified it as Code example 1.

Code example 1:
```python
from concurrent.futures import ProcessPoolExecutor
import pandas as pd
from collections import OrderedDict
import os

def process(args):
    parm1, parm2, parm3, output_file = args
    # do something ...
    output_file.write("result")

def main(large_file_path, output_path, max_processes):
    # do something ...
    with open(large_file_path, 'rt') as large_file, open(output_path, 'w') as output_file:
        arg_list = []
        for line in large_file:
            # do something ...
            arg_list.append((parm1, parm2, parm3, output_file))
        with ProcessPoolExecutor(max_processes) as executor:
            executor.map(process, arg_list, chunksize=int(max_processes/2))

if __name__ == "__main__":
    large_file_path = "/path/to/large_file"
    output_path = f"para_scores_SI_{block_size/1000}k.txt"
    max_processes = int(os.cpu_count()/2)  # Set the maximum number of processes
    main(large_file_path, output_path, max_processes)
```

I realized that `arg_list` might become quite large if the input file is very large, and I am not sure there is enough free memory for it. So I tried to replace the list with a generator, as in Code example 2, which runs without errors but does not produce any output.

Code example 2:
```python
from concurrent.futures import ProcessPoolExecutor
import pandas as pd
from collections import OrderedDict
import os

def process(args):
    parm1, parm2, parm3, output_file = args
    # do something ...
    output_file.write("result")

def main(large_file_path, output_path, max_processes):
    # do something ...
    with open(large_file_path, 'rt') as large_file, open(output_path, 'w') as output_file:
        def arg_generator(large_file, output_file):
            for line in large_file:
                # do something ...
                yield (parm1, parm2, parm3, output_file)
        with ProcessPoolExecutor(max_processes) as executor:
            executor.map(process, arg_generator(large_file, output_file), chunksize=int(max_processes/2))

if __name__ == "__main__":
    large_file_path = "/path/to/large_file"
    output_path = f"para_scores_SI_{block_size/1000}k.txt"
    max_processes = int(os.cpu_count()/2)  # Set the maximum number of processes
    main(large_file_path, output_path, max_processes)
```

I ran the code on an Ubuntu 20.04.6 LTS server, with Python 3.9.18.
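As a sanity check, I also tried `executor.map` with a generator in isolation, without any file handles involved. The `square` and `number_generator` names here are just placeholders for my real per-line work:

```python
from concurrent.futures import ProcessPoolExecutor

def square(x):
    # stand-in for the real per-line processing
    return x * x

def number_generator(n):
    # lazily yield inputs instead of building a full list in memory
    for i in range(n):
        yield i

if __name__ == "__main__":
    with ProcessPoolExecutor(2) as executor:
        results = list(executor.map(square, number_generator(5), chunksize=2))
    print(results)  # [0, 1, 4, 9, 16]
```

This simple case does return results, so the generator itself seems to be accepted by `executor.map`.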
So, can `ProcessPoolExecutor` work with a generator in Python? Or is my use of `executor.map` problematic? What should I do to make it work?