[Bug]: Memory leaks with `load_dataset_multi_txn` #330

Ramlaoui · 2024-12-24T08:48:51Z

What happened?

Hi, I've been using the new feature to load Hugging Face datasets into a table. However, for large datasets the call to `load_dataset_multi_txn` crashes after a while with an out-of-memory (OOM) error.
I've tried adjusting the batch size and `commit_every_n_batches`, but I still hit the same issue.

Is there any way to mitigate this issue, or at least a parameter that sets where in the dataset the upload should start (e.g. after 1000 batches)?
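To make the request concrete, here is a minimal Python sketch of "start from batch N" over a lazily consumed row source. It is not pgai's implementation or API: `batched`, `batches_from`, and the `start_batch` parameter are hypothetical names used only for illustration, and the sketch assumes the rows arrive as an ordinary Python iterable.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batched(rows: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Yield lists of at most batch_size rows without materializing the input."""
    it = iter(rows)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


def batches_from(
    rows: Iterable[T], batch_size: int, start_batch: int = 0
) -> Iterator[List[T]]:
    """Skip the first start_batch batches, then yield the remaining ones lazily.

    start_batch is a hypothetical parameter; it is not an existing pgai option.
    """
    it = iter(rows)
    # Consume the skipped rows one at a time so they are never held in memory.
    for _ in islice(it, start_batch * batch_size):
        pass
    return batched(it, batch_size)


if __name__ == "__main__":
    # Resume a 10-row source at the third batch (batch size 3).
    for batch in batches_from(range(10), batch_size=3, start_batch=2):
        print(batch)
```

With an option along these lines, a load that failed partway could be restarted near the last committed batch instead of from the beginning. The skipped rows are still read from the source, they are just not kept or inserted.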

pgai extension affected

0.6.0

pgai library affected

No response

PostgreSQL version used

17.1

What operating system did you use?

Ubuntu 24.04 32GB RAM

What installation method did you use?

Docker

What platform did you run on?

On prem/Self-hosted

Relevant log output and stack trace

No response

How can we reproduce the bug?

call ai.load_dataset_multi_txn('LeMaterial/LeMat-Bulk', 'compatible_pbe', table_name => 'lemat', if_table_exists => 'append', commit_every_n_batches => 100);

Are you going to work on the bugfix?

🆘 No, could someone else please work on the bugfix?

Ramlaoui added the bug, community, and pgai labels on Dec 24, 2024
