I have a lightweight Python app which should perform a very simple task, but it keeps crashing due to OOM.
What the app should do

- Load data from `.parquet` into a dataframe
- Calculate an indicator using the `stockstats` package
- Merge the freshly calculated data back into the original dataframe -> here it crashes
- Store the dataframe as `.parquet`
Where it crashes

```python
df = pd.merge(df, st, on=['datetime'])
```

Using

- Python 3.10
- pandas ~= 2.1.4
- stockstats ~= 0.4.1
- Kubernetes 1.28.2-do.0 (running on DigitalOcean)
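For reference, one failure mode that would match these symptoms is a duplicated merge key turning an innocent-looking `pd.merge` into a near cross join. pandas' `validate` argument (my addition, not in the original code) raises a `MergeError` instead of silently exploding the row count. A minimal sketch with illustrative toy data:

```python
import pandas as pd

# Toy frames mirroring the shape of the real data (values are illustrative)
df = pd.DataFrame({
    "datetime": pd.to_datetime(["2023-11-14 11:15:00", "2023-11-14 11:20:00"]),
    "close": [2.187, 2.184],
})
st = pd.DataFrame({
    "datetime": pd.to_datetime(["2023-11-14 11:15:00", "2023-11-14 11:20:00"]),
    "trend_orientation": [-1, -1],
})

# validate="one_to_one" makes pandas raise pd.errors.MergeError instead of
# silently producing a huge result when the key is duplicated on either side
merged = pd.merge(df, st, on=["datetime"], validate="one_to_one")
print(merged.shape)  # (2, 3)
```

If the real merge failed this validation, the OOM would be explained by a blown-up row count rather than by the container limits.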
Here is the strange thing: the dataframe is very small (`df.size` is 208446, file size is 1.00337 MB, memory usage is 1.85537 MB).
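Those static sizes don't capture the transient peak during the merge itself. The stdlib `tracemalloc` can, at least for Python-level allocations (a sketch with synthetic data, not my production code):

```python
import tracemalloc

import pandas as pd

tracemalloc.start()

# Synthetic frames standing in for df and st
left = pd.DataFrame({"key": range(100_000), "a": 1.0})
right = pd.DataFrame({"key": range(100_000), "b": 2.0})
merged = left.merge(right, on="key")

# peak = high-water mark of traced allocations since tracemalloc.start()
current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()

print(f"peak traced allocations: {peak / (1024 * 1024):.1f} MB")
```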
Measured with

```python
import os

file_stats = os.stat(filename)
file_size = file_stats.st_size / (1024 * 1024)  # 1.00337 MB

df_mem_usage = dataframe.memory_usage(deep=True)
df_mem_usage_print = round(df_mem_usage.sum() / (1024 * 1024), 6)  # 1.85537 MB

df_size = dataframe.size  # 208446
```

Deployment info
The app is deployed into Kubernetes using Helm with the following resources set:

```yaml
resources:
  limits:
    cpu: 1000m
    memory: 6000Mi
  requests:
    cpu: 1000m
    memory: 4000Mi
```

I am using nodes with 4 vCPU + 8 GB memory, and the node is not under performance pressure.
```
kubectl top node node-xxx
NAME       CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
node-xxx   750m         19%    1693Mi          25%
```

Pod info
```
kubectl describe pod xxx
...
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       OOMKilled
      Exit Code:    137
      Started:      Sun, 24 Mar 2024 16:08:56 +0000
      Finished:     Sun, 24 Mar 2024 16:09:06 +0000
...
```

Here is the CPU and memory consumption from Grafana. I am aware that very short memory or CPU spikes will be hard to see, but from a long-term perspective the app does not consume a lot of RAM. On the other hand, in my experience we run the same pandas operations in containers with less RAM on much bigger dataframes, with no problems.
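Since Grafana's sampling interval can miss a sub-second spike entirely, the kernel's OOMKilled verdict is more trustworthy than the dashboard. One thing the app itself could log at startup is the cgroup memory limit it actually sees inside the container (a sketch; the cgroup v1/v2 paths are assumptions about the node image):

```python
def cgroup_memory_limit_bytes():
    """Return the container's cgroup memory limit in bytes, or None if unlimited/unknown."""
    candidates = (
        "/sys/fs/cgroup/memory.max",                    # cgroup v2
        "/sys/fs/cgroup/memory/memory.limit_in_bytes",  # cgroup v1
    )
    for path in candidates:
        try:
            with open(path) as fh:
                raw = fh.read().strip()
        except OSError:
            continue  # path does not exist on this cgroup version
        if raw != "max":
            return int(raw)
    return None

limit = cgroup_memory_limit_bytes()
print(limit)
```

This would confirm whether the pod really gets the 6000Mi from the Helm values or something lower.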
How should I fix this? What else should I debug in order to prevent the OOM?
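One cheap datapoint toward the second question: logging peak RSS around the merge with the stdlib `resource` module, which survives even if a Python-level profiler misses native allocations (Unix-only sketch; on Linux `ru_maxrss` is KiB, on macOS it is bytes):

```python
import resource

def peak_rss_mb() -> float:
    """Peak resident set size of this process in MB (assumes Linux KiB units)."""
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024

# Log this right before and right after the merge to bracket the spike
print(f"peak RSS so far: {peak_rss_mb():.1f} MB")
```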
Data and code example
Original dataframe (named `df`):

```
             datetime   open   high    low  close        volume
0 2023-11-14 11:15:00  2.185  2.187  2.171  2.187  19897.847314
1 2023-11-14 11:20:00  2.186  2.191  2.183  2.184   8884.634728
2 2023-11-14 11:25:00  2.184  2.185  2.171  2.176  12106.153954
3 2023-11-14 11:30:00  2.176  2.176  2.158  2.171  22904.354082
4 2023-11-14 11:35:00  2.171  2.173  2.167  2.171   1691.211455
```

New dataframe (named `st`).
Note: if `trend_orientation` = 1 => `st_lower` = NaN; if -1 => `st_upper` = NaN

```
             datetime  supertrend_ub  supertrend_lb  trend_orientation  st_trend_segment
0 2023-11-14 11:15:00        0.21495            NaN                 -1                 1
1 2023-11-14 11:20:00        0.21495            NaN                 -1                 1
2 2023-11-14 11:25:00        0.21495            NaN                 -1                 1
3 2023-11-14 11:30:00        0.21495            NaN                 -1                 1
4 2023-11-14 11:35:00        0.21495            NaN                 -1                 1
```

Code example
```python
import pandas as pd
import multiprocessing
import numpy as np
import stockstats


def add_supertrend(market):
    try:
        # Read data from file
        df = pd.read_parquet(market, engine="fastparquet")
        # Extract date column
        date_column = df['datetime']
        # Convert to stockstats object
        st_a = stockstats.wrap(df.copy())
        # Generate supertrend
        st_a = st_a[['supertrend', 'supertrend_ub', 'supertrend_lb']]
        # Add back datetime column
        st_a.insert(0, "datetime", date_column)
        # Add trend orientation using conditional columns
        conditions = [
            st_a['supertrend_ub'] == st_a['supertrend'],
            st_a['supertrend_lb'] == st_a['supertrend']
        ]
        values = [-1, 1]
        st_a['trend_orientation'] = np.select(conditions, values)
        # Remove not required supertrend values
        st_a.loc[st_a['trend_orientation'] < 0, 'st_lower'] = np.NaN
        st_a.loc[st_a['trend_orientation'] > 0, 'st_upper'] = np.NaN
        # Unwrap back to dataframe
        st = stockstats.unwrap(st_a)
        # Ensure correct data types are used
        st = st.astype({
            'supertrend': 'float32',
            'supertrend_ub': 'float32',
            'supertrend_lb': 'float32',
            'trend_orientation': 'int8'
        })
        # Add trend segments
        st_to = st[['trend_orientation']]
        st['st_trend_segment'] = st_to.ne(st_to.shift()).cumsum()
        # Remove trend value
        st.drop(columns=['supertrend'], inplace=True)
        # Merge ST with DF
        df = pd.merge(df, st, on=['datetime'])
        # Write back to parquet
        df.to_parquet(market, compression=None)
    except Exception as e:
        # Using proper logger in real code
        print(e)
        pass


def main():
    # Using fixed market as example; in real code the market is fetched
    market = "BTCUSDT"
    # Using multiprocessing to free up memory after each iteration
    p = multiprocessing.Process(target=add_supertrend, args=(market,))
    p.start()
    p.join()


if __name__ == "__main__":
    main()
```

Dockerfile
```dockerfile
FROM python:3.10

ENV PYTHONFAULTHANDLER=1 \
    PYTHONHASHSEED=random \
    PYTHONUNBUFFERED=1 \
    PYTHONPATH=.

# Adding vim
RUN ["apt-get", "update"]

# Get dependencies
COPY requirements.txt .
RUN pip3 install -r requirements.txt

# Copy main app
ADD . .

CMD main.py
```
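Going back to the merge that crashes: a helper like this (hypothetical, not part of the app) could be logged just before `pd.merge` to show whether the join can explode. `inner_rows` is the exact row count an inner join on the key would produce, computed from the per-key counts on each side:

```python
import pandas as pd

def describe_merge_key(left: pd.DataFrame, right: pd.DataFrame, key: str) -> dict:
    """Summarize a join key on both sides of a merge (hypothetical debugging helper)."""
    lc = left[key].value_counts()
    rc = right[key].value_counts()
    return {
        "left_dtype": str(left[key].dtype),
        "right_dtype": str(right[key].dtype),
        "left_dupes": int(left[key].duplicated().sum()),
        "right_dupes": int(right[key].duplicated().sum()),
        # exact number of rows an inner join on `key` would produce:
        # sum over keys of count_left * count_right
        "inner_rows": int(lc.mul(rc, fill_value=0).sum()),
    }

# Tiny example with a duplicated key on each side
left = pd.DataFrame({"datetime": [1, 1, 2]})
right = pd.DataFrame({"datetime": [1, 2, 2]})
print(describe_merge_key(left, right, "datetime"))
```

A dtype mismatch between the two `datetime` columns, or a large `inner_rows`, would both point at the merge rather than at the cluster.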
