Memory leakage in with_columns and computation of statistics #20851

Open
AlexanderMerkel opened this issue Jan 22, 2025 · 0 comments
Labels: bug (Something isn't working), needs triage (Awaiting prioritization by a maintainer), python (Related to Python Polars)


Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

Description
When performing iterative operations on Polars DataFrames that involve complex transformations (e.g., grouped rolling statistics), memory usage grows over time and is not fully released even after explicitly deleting the DataFrame and invoking gc.collect(). The growth persists across thousands of iterations, so long-running processes steadily accumulate memory.

Steps to Reproduce

import gc
import os

import polars as pl
import psutil


def get_memory_usage():
    """Return the resident set size (RSS) of the current process in MB."""
    process = psutil.Process(os.getpid())
    return process.memory_info().rss / 1024 / 1024


print(f"Initial RAM usage: {get_memory_usage():.2f} MB")
for i in range(10000):
    # Rebuild a small 10-row frame from scratch on every iteration.
    train_data = pl.DataFrame([
        pl.Series('COL_A', ['USA'] * 10),
        pl.Series('COL_B', ['A'] * 10),
        pl.Series('COL_C', ['X'] * 10),
        pl.Series('num_sold', [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]),
    ])
    # Grouped rolling statistics over a window spanning the whole frame.
    window = train_data.height
    groups = ['COL_A', 'COL_B', 'COL_C']
    shifted = pl.col('num_sold').shift(1)
    train_data = train_data.with_columns([
        shifted.rolling_mean(window_size=window, min_periods=1).over(groups).alias('mean_sold'),
        shifted.rolling_var(window_size=window, min_periods=1).over(groups).alias('var_sold'),
        shifted.rolling_min(window_size=window, min_periods=1).over(groups).alias('min_sold'),
        shifted.rolling_max(window_size=window, min_periods=1).over(groups).alias('max_sold'),
        shifted.rolling_std(window_size=window, min_periods=1).over(groups).alias('std_sold'),
        shifted.rolling_median(window_size=window, min_periods=1).over(groups).alias('median_sold'),
        shifted.rolling_quantile(window_size=window, quantile=0.25, min_periods=1).over(groups).alias('q25_sold'),
        shifted.rolling_quantile(window_size=window, quantile=0.75, min_periods=1).over(groups).alias('q75_sold'),
        shifted.rolling_skew(window_size=window).over(groups).alias('skew_sold'),
    ])

# Measure once after the loop, then drop the frame and force a collection.
print(f"Iteration {i}: RAM usage after operations: {get_memory_usage():.2f} MB")
del train_data
gc.collect()
print(f"Iteration {i}: RAM usage after cleanup: {get_memory_usage():.2f} MB\n")

Log output

Initial RAM usage: 39.64 MB
Iteration 9999: RAM usage after operations: 62.79 MB
Iteration 9999: RAM usage after cleanup: 62.82 MB

Issue description

Memory usage increases gradually across iterations. Even after deleting the DataFrame and calling gc.collect(), memory is not released.
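As an illustrative diagnostic (a sketch, not part of the original measurements): Python's built-in tracemalloc only tracks allocations made through the Python allocator, so comparing its totals with the process RSS can indicate whether the retained memory lives on the native (Rust/Arrow) side, where gc.collect() cannot reach it.

import os
import tracemalloc

import psutil

tracemalloc.start()

# ... run the reproduction loop from "Steps to Reproduce" here ...

current, peak = tracemalloc.get_traced_memory()  # Python-heap allocations only
rss = psutil.Process(os.getpid()).memory_info().rss
print(f"Python heap: {current / 1024 / 1024:.2f} MB (peak {peak / 1024 / 1024:.2f} MB)")
print(f"Process RSS: {rss / 1024 / 1024:.2f} MB")
tracemalloc.stop()

# A flat Python heap alongside a growing RSS points at native-side
# allocations rather than leaked Python objects.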

So far I have not found a workaround; enabling or disabling string caching (pl.enable_string_cache() / pl.disable_string_cache()) does not seem to affect the outcome, as sketched below.
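For reference, the toggle can be applied globally or via a scoped context manager; a minimal sketch (note that the string cache only affects Categorical/Enum data, and the columns in the reproduction are plain strings, which may explain the lack of effect):

import polars as pl

# Global toggle (what the report refers to):
pl.enable_string_cache()
# ... run the reproduction loop ...
pl.disable_string_cache()

# Scoped alternative: the cache is active only inside the block.
with pl.StringCache():
    ...  # run the reproduction loop here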

Expected behavior

Memory usage should stabilize once the DataFrame is deleted and garbage collection has run; it should not keep growing when no new data is being held.
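One way to sharpen this claim is to sample RSS at intervals and check whether the growth is unbounded or levels off; a sketch (illustrative, not from the original report; a plateau would point at allocator caching or fragmentation rather than an ever-growing leak):

import gc
import os

import psutil


def rss_mb():
    """Resident set size of this process in MB."""
    return psutil.Process(os.getpid()).memory_info().rss / 1024 / 1024


samples = []
for i in range(10000):
    # ... build and transform train_data exactly as in the reproduction ...
    if i % 1000 == 0:
        gc.collect()
        samples.append((i, rss_mb()))

for it, mb in samples:
    print(f"iteration {it}: {mb:.2f} MB")
# Steadily increasing samples suggest unbounded growth; flattening
# samples suggest memory levels off after a warm-up phase.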

Installed versions

-------Version info---------
Polars:              1.20.0
Index type:          UInt32
Platform:            Windows-11-10.0.26120-SP0
Python:              3.12.8 (tags/v3.12.8:2dc476b, Dec  3 2024, 19:30:04) [MSC v.1942 64 bit (AMD64)]
LTS CPU:             True

----Optional dependencies----
Azure CLI            <not installed>
adbc_driver_manager  <not installed>
altair               <not installed>
azure.identity       <not installed>
boto3                <not installed>
cloudpickle          3.1.1
connectorx           <not installed>
deltalake            <not installed>
fastexcel            <not installed>
fsspec               2024.12.0
gevent               <not installed>
google.auth          <not installed>
great_tables         <not installed>
matplotlib           3.10.0
nest_asyncio         1.6.0
numpy                1.26.4
openpyxl             <not installed>
pandas               2.2.3
pyarrow              19.0.0
pydantic             <not installed>
pyiceberg            <not installed>
sqlalchemy           2.0.37
torch                2.5.1+cpu
xlsx2csv             <not installed>
xlsxwriter           <not installed>