Feature Request: Adding dataset deduplication process #1946

Weyaxi · 2024-10-05T11:07:05Z

⚠️ Please check that this feature request hasn't been suggested before.

I searched previous Ideas in Discussions and didn't find any similar feature requests.
I searched previous Issues and didn't find any similar feature requests.

🔖 Feature description

A dataset deduplication process would be a useful feature for Axolotl. Many users supply their datasets in various formats and configurations, so a deduplication step that runs at the end, after all these datasets are merged, would be very beneficial for developers fine-tuning models.

✔️ Solution

In my use case, adding a 'dedup_datasets_in_end' variable (this name is only an example), along with the necessary parameters for the deduplication process, would be very beneficial.
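
To make the idea concrete, here is a rough sketch of exact-match deduplication gated by such a flag. It is not Axolotl code: `merge_datasets`, `row_fingerprint`, and the plain-dict `cfg` are hypothetical names used only for illustration, and datasets are modelled as lists of dicts.

```python
import hashlib
import json


def row_fingerprint(row: dict) -> str:
    """Return a SHA-256 fingerprint of a row that ignores key order."""
    canonical = json.dumps(row, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def merge_datasets(datasets, cfg):
    """Concatenate datasets, then optionally drop exact duplicate rows.

    `cfg` is a plain dict standing in for the training config; the key
    'dedup_datasets_in_end' is the example name from this request,
    not an existing option.
    """
    merged = [row for ds in datasets for row in ds]
    if not cfg.get("dedup_datasets_in_end", False):
        return merged

    seen = set()
    unique = []
    for row in merged:
        fp = row_fingerprint(row)
        if fp not in seen:  # keep only the first occurrence of each row
            seen.add(fp)
            unique.append(row)
    return unique


if __name__ == "__main__":
    a = [{"instruction": "hi", "output": "hello"}]
    b = [{"output": "hello", "instruction": "hi"},
         {"instruction": "x", "output": "y"}]
    print(len(merge_datasets([a, b], {"dedup_datasets_in_end": True})))
```

Hashing a canonical JSON form means two rows with the same fields in a different key order count as duplicates. Rows that differ only in whitespace or casing would still be kept, which is where fuzzy methods come in.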

❓ Alternatives

There are many algorithms, GitHub repositories, and tools for dataset deduplication. The main algorithm that comes to mind is MinHash, which is commonly used to detect near-duplicates. Support for such algorithms could be added incrementally, which would be very beneficial.
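
For reference, a minimal pure-Python sketch of MinHash-based near-duplicate filtering is below. It is only an illustration of the technique under simple assumptions (word 3-gram shingles, 64 salted BLAKE2b hashes, a greedy pairwise comparison that is quadratic in the number of texts). All function names and the 0.8 threshold are made up for this example; a real integration would more likely use an existing library with LSH indexing.

```python
import hashlib


def shingles(text: str, n: int = 3) -> set:
    """Split text into a set of lowercase word n-grams."""
    words = text.lower().split()
    if len(words) < n:
        return {" ".join(words)}
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def minhash_signature(text: str, num_perm: int = 64, n: int = 3) -> list:
    """Compute a MinHash signature: one minimum hash per salted hash function."""
    shs = shingles(text, n)
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")  # a different salt acts as a different hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for s in shs
        ))
    return signature


def estimate_jaccard(sig_a: list, sig_b: list) -> float:
    """Estimate Jaccard similarity as the fraction of matching signature slots."""
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)


def dedup_near_duplicates(texts: list, threshold: float = 0.8) -> list:
    """Greedily keep a text only if it is not too similar to any kept text."""
    kept, kept_sigs = [], []
    for text in texts:
        sig = minhash_signature(text)
        if all(estimate_jaccard(sig, other) < threshold for other in kept_sigs):
            kept.append(text)
            kept_sigs.append(sig)
    return kept
```

The threshold trades recall for precision: a lower value removes more paraphrased rows but risks dropping distinct examples, so it would need to be one of the user-configurable parameters.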

📝 Additional Context

No response

Acknowledgements

My issue title is concise, descriptive, and in title casing.
I have searched the existing issues to make sure this feature has not been requested yet.
I have provided enough information for the maintainers to understand and evaluate this request.

Weyaxi added the enhancement New feature or request label Oct 5, 2024
