Memory leakage in with_columns and computation of statistics #20851

Open
AlexanderMerkel opened this issue Jan 22, 2025 · 0 comments
Labels: bug (Something isn't working), needs triage (Awaiting prioritization by a maintainer), python (Related to Python Polars)


Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

Description
When performing iterative operations on Polars DataFrames that involve complex transformations (e.g., grouped rolling statistics), memory usage grows over time and is not fully released even after explicitly deleting the DataFrame and invoking gc.collect(). The growth persists across thousands of iterations, so long-running processes steadily accumulate memory.

Steps to Reproduce

import gc
import os

import polars as pl
import psutil


def get_memory_usage():
    """Return the resident set size (RSS) of the current process in MB."""
    process = psutil.Process(os.getpid())
    return process.memory_info().rss / 1024 / 1024


print(f"Initial RAM usage: {get_memory_usage():.2f} MB")
for i in range(10000):
    # Rebuild a small 10-row frame from scratch on every iteration.
    train_data = pl.DataFrame([
        pl.Series('COL_A', ['USA'] * 10),
        pl.Series('COL_B', ['A'] * 10),
        pl.Series('COL_C', ['X'] * 10),
        pl.Series('num_sold', [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]),
    ])
    # Grouped rolling statistics over a window spanning the whole frame.
    window = train_data.height
    groups = ['COL_A', 'COL_B', 'COL_C']
    shifted = pl.col('num_sold').shift(1)
    train_data = train_data.with_columns([
        shifted.rolling_mean(window_size=window, min_periods=1).over(groups).alias('mean_sold'),
        shifted.rolling_var(window_size=window, min_periods=1).over(groups).alias('var_sold'),
        shifted.rolling_min(window_size=window, min_periods=1).over(groups).alias('min_sold'),
        shifted.rolling_max(window_size=window, min_periods=1).over(groups).alias('max_sold'),
        shifted.rolling_std(window_size=window, min_periods=1).over(groups).alias('std_sold'),
        shifted.rolling_median(window_size=window, min_periods=1).over(groups).alias('median_sold'),
        shifted.rolling_quantile(window_size=window, quantile=0.25, min_periods=1).over(groups).alias('q25_sold'),
        shifted.rolling_quantile(window_size=window, quantile=0.75, min_periods=1).over(groups).alias('q75_sold'),
        shifted.rolling_skew(window_size=window).over(groups).alias('skew_sold'),
    ])

# Measure once after the loop, then drop the frame and force a collection.
print(f"Iteration {i}: RAM usage after operations: {get_memory_usage():.2f} MB")
del train_data
gc.collect()
print(f"Iteration {i}: RAM usage after cleanup: {get_memory_usage():.2f} MB\n")

Log output

Initial RAM usage: 39.64 MB
Iteration 9999: RAM usage after operations: 62.79 MB
Iteration 9999: RAM usage after cleanup: 62.82 MB

Issue description

Memory usage increases gradually across iterations. Even after deleting the DataFrame and calling gc.collect(), memory is not released.
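As an illustrative diagnostic (a sketch, not part of the original measurements): Python's built-in tracemalloc only tracks allocations made through the Python allocator, so comparing its totals with the process RSS can indicate whether the retained memory lives on the native (Rust/Arrow) side, where gc.collect() cannot reach it.

import os
import tracemalloc

import psutil

tracemalloc.start()

# ... run the reproduction loop from "Steps to Reproduce" here ...

current, peak = tracemalloc.get_traced_memory()  # Python-heap allocations only
rss = psutil.Process(os.getpid()).memory_info().rss
print(f"Python heap: {current / 1024 / 1024:.2f} MB (peak {peak / 1024 / 1024:.2f} MB)")
print(f"Process RSS: {rss / 1024 / 1024:.2f} MB")
tracemalloc.stop()

# A flat Python heap alongside a growing RSS points at native-side
# allocations rather than leaked Python objects.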

So far I have not found a workaround; enabling or disabling string caching (pl.enable_string_cache() / pl.disable_string_cache()) does not seem to affect the outcome, as sketched below.
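For reference, the toggle can be applied globally or via a scoped context manager; a minimal sketch (note that the string cache only affects Categorical/Enum data, and the columns in the reproduction are plain strings, which may explain the lack of effect):

import polars as pl

# Global toggle (what the report refers to):
pl.enable_string_cache()
# ... run the reproduction loop ...
pl.disable_string_cache()

# Scoped alternative: the cache is active only inside the block.
with pl.StringCache():
    ...  # run the reproduction loop here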

Expected behavior

Memory usage should stabilize once the DataFrame is deleted and garbage collection has run; it should not keep growing when no new data is being held.
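One way to sharpen this claim is to sample RSS at intervals and check whether the growth is unbounded or levels off; a sketch (illustrative, not from the original report; a plateau would point at allocator caching or fragmentation rather than an ever-growing leak):

import gc
import os

import psutil


def rss_mb():
    """Resident set size of this process in MB."""
    return psutil.Process(os.getpid()).memory_info().rss / 1024 / 1024


samples = []
for i in range(10000):
    # ... build and transform train_data exactly as in the reproduction ...
    if i % 1000 == 0:
        gc.collect()
        samples.append((i, rss_mb()))

for it, mb in samples:
    print(f"iteration {it}: {mb:.2f} MB")
# Steadily increasing samples suggest unbounded growth; flattening
# samples suggest memory levels off after a warm-up phase.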

Installed versions

-------Version info---------
Polars:              1.20.0
Index type:          UInt32
Platform:            Windows-11-10.0.26120-SP0
Python:              3.12.8 (tags/v3.12.8:2dc476b, Dec  3 2024, 19:30:04) [MSC v.1942 64 bit (AMD64)]
LTS CPU:             True

----Optional dependencies----
Azure CLI            <not installed>
adbc_driver_manager  <not installed>
altair               <not installed>
azure.identity       <not installed>
boto3                <not installed>
cloudpickle          3.1.1
connectorx           <not installed>
deltalake            <not installed>
fastexcel            <not installed>
fsspec               2024.12.0
gevent               <not installed>
google.auth          <not installed>
great_tables         <not installed>
matplotlib           3.10.0
nest_asyncio         1.6.0
numpy                1.26.4
openpyxl             <not installed>
pandas               2.2.3
pyarrow              19.0.0
pydantic             <not installed>
pyiceberg            <not installed>
sqlalchemy           2.0.37
torch                2.5.1+cpu
xlsx2csv             <not installed>
xlsxwriter           <not installed>