⚠️ Please check that this feature request hasn't been suggested before.
I searched previous Ideas in Discussions and didn't find any similar feature requests.
I searched previous Issues and didn't find any similar feature requests.
🔖 Feature description
A dataset deduplication step would be useful for Axolotl. Many users combine several datasets in different formats and configurations, so running deduplication at the end, once all of these datasets have been merged, would be very beneficial for developers fine-tuning models.
✔️ Solution
In my use case, adding a 'dedup_datasets_in_end' variable (this name is only an example), along with the parameters the deduplication process needs, would be very beneficial.
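To make the request concrete, here is a minimal sketch of what an exact-match pass over the merged data could look like. It is not Axolotl code: `record_fingerprint` and `dedup_merged` are hypothetical names, and the sketch assumes each example is a plain JSON-serializable dict.

```python
import hashlib
import json
from typing import Iterable, Iterator


def record_fingerprint(record: dict) -> str:
    """Hash a record's content; key order does not affect the result."""
    canonical = json.dumps(record, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def dedup_merged(records: Iterable[dict]) -> Iterator[dict]:
    """Yield each distinct record once, keeping first-seen order."""
    seen = set()
    for record in records:
        fingerprint = record_fingerprint(record)
        if fingerprint not in seen:
            seen.add(fingerprint)
            yield record


# Two hypothetical source datasets that overlap after merging.
dataset_a = [{"instruction": "Add 2 and 3", "output": "5"}]
dataset_b = [
    {"output": "5", "instruction": "Add 2 and 3"},  # same content, different key order
    {"instruction": "Say hi", "output": "hi"},
]

merged = dataset_a + dataset_b
unique = list(dedup_merged(merged))
```

Running the pass after the merge, rather than per dataset, is what catches duplicates that only appear across sources. A flag such as the example name above would simply switch this step on or off.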
❓ Alternatives
There are many algorithms, GitHub repositories, and tools for dataset deduplication. For example, the main algorithm that comes to mind is MinHash. Incorporating such algorithms over time would be very beneficial.
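For reference, a rough pure-Python sketch of the MinHash idea for near-duplicate detection follows. The function names, the 3-word shingle size, and the 64 hash functions are all illustrative assumptions, and a real integration would more likely rely on an existing MinHash/LSH library than on this loop.

```python
import hashlib


def shingles(text: str, k: int = 3) -> set:
    """Return the set of k-word shingles of a lowercased text."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def minhash_signature(items: set, num_perm: int = 64) -> list:
    """For each seeded hash function, keep the minimum hash over all items."""
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(4, "big")  # a different salt acts as a different hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(item.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for item in items
        ))
    return signature


def estimated_jaccard(sig_a: list, sig_b: list) -> float:
    """The fraction of matching signature slots estimates Jaccard similarity."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)


text_a = "the quick brown fox jumps over the lazy dog near the river bank"
text_b = "the quick brown fox jumps over the lazy dog near the river bank today"
text_c = "fine tuning large language models requires clean and varied training data"

sig_a = minhash_signature(shingles(text_a))
sig_b = minhash_signature(shingles(text_b))
sig_c = minhash_signature(shingles(text_c))

near_duplicate_score = estimated_jaccard(sig_a, sig_b)
unrelated_score = estimated_jaccard(sig_a, sig_c)
```

A deduplication option could then drop any example whose score against an already kept example exceeds a user-chosen threshold. That threshold would be one of the "necessary parameters" mentioned in the Solution section.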
📝 Additional Context
No response
Acknowledgements
My issue title is concise, descriptive, and in title casing.
I have searched the existing issues to make sure this feature has not been requested yet.
I have provided enough information for the maintainers to understand and evaluate this request.