I'm using Python to analyze a large dataset that is split across several files of different sizes. The analysis works like a database search: I load the database and keep it in memory, then run the searches against the data in those files. To avoid running out of memory, I process the data files one by one, cache the results after processing a fixed number of records, and clear everything except the database with `del variable_names` and `gc.collect()`.
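To make the setup concrete, here is a minimal sketch of my processing loop. The helpers `load_file`, `search`, and `cache_results` are stand-ins for my real code, not the actual functions:

```python
import gc

# Hypothetical stand-ins for the real loader and search logic.
def load_file(path):
    return [f"{path}:{i}" for i in range(3)]

def search(database, record):
    return database.get(record, None)

def process_files(database, data_files, batch_size, cache_results):
    """Process files one by one; flush and clear every `batch_size` results."""
    results = []
    for path in data_files:
        records = load_file(path)
        results.extend(search(database, r) for r in records)
        del records                    # drop the raw file data
        if len(results) >= batch_size:
            cache_results(results)     # persist the batch, e.g. to disk
            del results                # drop references to the batch...
            gc.collect()               # ...and force a full collection
            results = []
    if results:
        cache_results(results)         # flush whatever is left
```

Only the database dictionary survives between batches; everything else is deleted and collected.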
The database occupies around 35% of my memory; usage peaks at about 50% when the largest data file is loaded, and at about 70% right before a batch of results is cached and cleared. I monitor memory with `psutil.virtual_memory()` and found that usage fluctuates between 40% and 70%.
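For reference, this is how I read the number (`psutil` is a third-party package, installed with `pip install psutil`):

```python
import psutil  # third-party: pip install psutil

def memory_percent():
    # System-wide percentage of physical RAM in use,
    # as reported by the OS.
    return psutil.virtual_memory().percent
```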
Everything ran fine for the first 2 days, then the speed suddenly dropped. I searched online and am not sure of the cause, but I suspect the OS swapped or compressed part of the memory, so Python has to decompress it or page the database back in, which takes a lot of time.
I actually ran a comparison. When the speed dropped, I kept the slow run going, and in parallel started a second Python instance with the same database and the data file that was being processed slowly. The new instance was much faster, about the same speed as 2 days earlier. So I'm fairly sure the slowdown comes from OS memory compression or swapping.
Is there a way in Python to "refresh" the memory, i.e., touch everything in memory so the OS keeps it active? Then I could call that function after running for, e.g., 1 day, instead of clearing everything, reloading the database, and reprocessing the unprocessed data files over and over, which is effectively restarting Python.