
Python app keeps OOM crashing on Pandas merge


I have a light Python app that should perform a very simple task, but it keeps crashing due to OOM.

What the app should do

  1. Load data from .parquet into a dataframe
  2. Calculate an indicator using the stockstats package
  3. Merge the freshly calculated data into the original dataframe -> here it crashes
  4. Store the dataframe as .parquet

Where it crashes

df = pd.merge(df, st, on=['datetime'])
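A common cause of a sudden OOM on this exact line is duplicate `datetime` keys in either frame: `pd.merge` then performs a many-to-many join whose row count grows multiplicatively. A minimal sketch with made-up data (not your real frames) showing the blow-up, and how the `validate` parameter catches it before it happens:

```python
import pandas as pd

# Two tiny frames that each repeat the same key 1000 times.
left = pd.DataFrame({"datetime": ["2023-11-14 11:15:00"] * 1000, "close": 2.187})
right = pd.DataFrame({"datetime": ["2023-11-14 11:15:00"] * 1000, "supertrend": 0.21495})

# A many-to-many merge multiplies the duplicates: 1000 x 1000 rows.
merged = pd.merge(left, right, on=["datetime"])
print(len(merged))  # 1000000

# validate="one_to_one" raises MergeError instead of silently exploding.
try:
    pd.merge(left, right, on=["datetime"], validate="one_to_one")
except pd.errors.MergeError as exc:
    print("duplicate merge keys detected:", exc)
```

If your real data can legitimately contain duplicate timestamps, de-duplicate (`df.drop_duplicates(subset=["datetime"])`) before merging.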

Using

  • Python 3.10
  • pandas~=2.1.4
  • stockstats~=0.4.1
  • Kubernetes 1.28.2-do.0 (running in Digital Ocean)

Here is the strange thing: the dataframe is very small (df.size is 208446, file size is 1.00337 MB, memory usage is 1.85537 MB).

Measured

import os

file_stats = os.stat(filename)
file_size = file_stats.st_size / (1024 * 1024)  # 1.00337 MB
df_mem_usage = dataframe.memory_usage(deep=True)
df_mem_usage_print = round(df_mem_usage.sum() / (1024 * 1024), 6)  # 1.85537 MB
df_size = dataframe.size  # 208446

Deployment info

App is deployed into Kubernetes using Helm with following resources set

resources:
  limits:
    cpu: 1000m
    memory: 6000Mi
  requests:
    cpu: 1000m
    memory: 4000Mi

I am using nodes with 4 vCPU + 8 GB memory, and the node is not under performance pressure.

kubectl top node node-xxx
NAME              CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
node-xxx          750m         19%    1693Mi          25%

Pod info

kubectl describe pod xxx
...
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       OOMKilled
      Exit Code:    137
      Started:      Sun, 24 Mar 2024 16:08:56 +0000
      Finished:     Sun, 24 Mar 2024 16:09:06 +0000
...

Here is the CPU and memory consumption from Grafana. I am aware that very short memory or CPU spikes will be hard to see, but from a long-term perspective the app does not consume a lot of RAM. On the other hand, in my experience we run the same pandas operations on containers with less RAM and with much bigger dataframes, with no problems.

Grafana stats

How should I fix this? What else should I debug in order to prevent the OOM?
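One way to narrow this down (a debugging sketch, not part of the original app; the list-comprehension allocation stands in for the suspect merge) is to log both the Python-level heap peak and the process RSS around the operation, so a short-lived spike that Grafana misses still shows up:

```python
import resource
import tracemalloc

tracemalloc.start()

# Stand-in for the suspect operation (replace with the pd.merge call).
data = [list(range(1000)) for _ in range(100)]

current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()

print(f"python heap peak: {peak / 1024 ** 2:.2f} MiB")
# Note: on Linux ru_maxrss is reported in KiB (on macOS it is bytes).
max_rss_mib = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024
print(f"process max RSS: {max_rss_mib:.2f} MiB")
```

Logging these two numbers immediately before and after the merge in the pod would show whether the merge itself allocates far more than the 1.85 MB the dataframe suggests.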

Data and code example

Original dataframe (named df)

              datetime   open   high    low  close        volume
0  2023-11-14 11:15:00  2.185  2.187  2.171  2.187  19897.847314
1  2023-11-14 11:20:00  2.186  2.191  2.183  2.184   8884.634728
2  2023-11-14 11:25:00  2.184  2.185  2.171  2.176  12106.153954
3  2023-11-14 11:30:00  2.176  2.176  2.158  2.171  22904.354082
4  2023-11-14 11:35:00  2.171  2.173  2.167  2.171   1691.211455

New dataframe (named st).
Note: If trend_orientation = 1 => st_lower = NaN, if -1 => st_upper = NaN

              datetime   supertrend_ub  supertrend_lb    trend_orientation    st_trend_segment
0  2023-11-14 11:15:00   0.21495        NaN              -1                   1
1  2023-11-14 11:20:00   0.21495        NaN              -10                  1
2  2023-11-14 11:25:00   0.21495        NaN              -11                  1
3  2023-11-14 11:30:00   0.21495        NaN              -12                  1
4  2023-11-14 11:35:00   0.21495        NaN              -13                  1
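Before the merge it is worth asserting that both frames have unique `datetime` keys of the same dtype; a silent `object` vs `datetime64[ns]` mismatch or a few duplicated timestamps are the usual suspects when a merge on small frames misbehaves. A hypothetical pre-merge check (the toy `df`/`st` here are stand-ins for your frames):

```python
import pandas as pd

df = pd.DataFrame({
    "datetime": pd.to_datetime(["2023-11-14 11:15:00", "2023-11-14 11:20:00"]),
    "close": [2.187, 2.184],
})
st = pd.DataFrame({
    "datetime": pd.to_datetime(["2023-11-14 11:15:00", "2023-11-14 11:20:00"]),
    "trend_orientation": [-1, -1],
})

# Report key dtype and duplicate count for each frame before merging.
for name, frame in {"df": df, "st": st}.items():
    dupes = int(frame["datetime"].duplicated().sum())
    print(f"{name}: dtype={frame['datetime'].dtype}, duplicates={dupes}")

# Mismatched key dtypes make merges slow and error-prone.
assert df["datetime"].dtype == st["datetime"].dtype
```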

Code example

import pandas as pd
import multiprocessing
import numpy as np
import stockstats


def add_supertrend(market):
    try:
        # Read data from file
        df = pd.read_parquet(market, engine="fastparquet")
        # Extract date column
        date_column = df['datetime']
        # Convert to stockstats object
        st_a = stockstats.wrap(df.copy())
        # Generate supertrend
        st_a = st_a[['supertrend', 'supertrend_ub', 'supertrend_lb']]
        # Add back datetime column
        st_a.insert(0, "datetime", date_column)
        # Add trend orientation using conditional columns
        conditions = [
            st_a['supertrend_ub'] == st_a['supertrend'],
            st_a['supertrend_lb'] == st_a['supertrend']
        ]
        values = [-1, 1]
        st_a['trend_orientation'] = np.select(conditions, values)
        # Remove not required supertrend values
        st_a.loc[st_a['trend_orientation'] < 0, 'st_lower'] = np.NaN
        st_a.loc[st_a['trend_orientation'] > 0, 'st_upper'] = np.NaN
        # Unwrap back to dataframe
        st = stockstats.unwrap(st_a)
        # Ensure correct data types are used
        st = st.astype({
            'supertrend': 'float32',
            'supertrend_ub': 'float32',
            'supertrend_lb': 'float32',
            'trend_orientation': 'int8'
        })
        # Add trend segments
        st_to = st[['trend_orientation']]
        st['st_trend_segment'] = st_to.ne(st_to.shift()).cumsum()
        # Remove trend value
        st.drop(columns=['supertrend'], inplace=True)
        # Merge ST with DF
        df = pd.merge(df, st, on=['datetime'])
        # Write back to parquet
        df.to_parquet(market, compression=None)
    except Exception as e:
        # Using proper logger in real code
        print(e)


def main():
    # Using fixed market as example, in real code market is fetched
    market = "BTCUSDT"
    # Using multiprocessing to free up memory after each iteration
    p = multiprocessing.Process(target=add_supertrend, args=(market,))
    p.start()
    p.join()


if __name__ == "__main__":
    main()
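Since `st` is derived from `df` row for row, one option worth considering is to skip the key-based merge entirely and attach the derived columns positionally. A sketch under that assumption (toy data; only valid if neither frame is reordered or filtered between the two steps):

```python
import pandas as pd

df = pd.DataFrame({
    "datetime": ["2023-11-14 11:15:00", "2023-11-14 11:20:00"],
    "close": [2.187, 2.184],
})
st = pd.DataFrame({
    "datetime": ["2023-11-14 11:15:00", "2023-11-14 11:20:00"],
    "trend_orientation": [-1, -1],
    "st_trend_segment": [1, 1],
})

# Attach derived columns by position instead of merging on 'datetime';
# with no key matching there is no chance of a many-to-many explosion.
out = df.assign(**{c: st[c].to_numpy() for c in st.columns if c != "datetime"})
print(out)
```

This avoids the hash join that `pd.merge` builds, at the cost of relying on row alignment, so it only applies if the row order is guaranteed.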

Dockerfile

FROM python:3.10

ENV PYTHONFAULTHANDLER=1 \
    PYTHONHASHSEED=random \
    PYTHONUNBUFFERED=1 \
    PYTHONPATH=.

# Adding vim
RUN ["apt-get", "update"]

# Get dependencies
COPY requirements.txt .
RUN pip3 install -r requirements.txt

# Copy main app
ADD . .

CMD main.py
