Count dataset tokens #1055

Draft: wants to merge 38 commits into base: develop
@plaguss plaguss commented Nov 8, 2024

Description

This PR adds a way to compute summary statistics for the distiset and adds a table with them to the final README card template.
The statistics are computed per leaf node, and one table is shown for each leaf node, like the following:


Dataset Statistics

  • Summary statistics: default
| | mean | std | min | max | sum |
|---|---|---|---|---|---|
| input_tokens_statistics_generation | 1881 | 1.41421 | 1880 | 1882 | 3762 |
| output_tokens_statistics_generation | 582 | 1.41421 | 581 | 583 | 1164 |
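
For readers who want to check the numbers, the table above can be reproduced with a minimal sketch like the one below. It is not the code in this PR: the `summarize` helper and the `rows` data are hypothetical, and the per-row counts are inferred from the table (sum divided by mean gives 2 rows, so min and max are the two values). The `std` column matches the sample standard deviation (n - 1 in the denominator), which is what `statistics.stdev` computes.

```python
import statistics


def summarize(values):
    """Return mean, sample std, min, max and sum for a list of numbers.

    Hypothetical helper for illustration; not part of distilabel.
    """
    return {
        "mean": statistics.mean(values),
        # stdev needs at least two data points; fall back to 0.0 otherwise.
        "std": statistics.stdev(values) if len(values) > 1 else 0.0,
        "min": min(values),
        "max": max(values),
        "sum": sum(values),
    }


# Assumed per-row token counts, inferred from the table in the description.
rows = [
    {"input_tokens": 1880, "output_tokens": 581},
    {"input_tokens": 1882, "output_tokens": 583},
]

# One summary per token column, mirroring the two rows of the table.
summary = {
    column: summarize([row[column] for row in rows])
    for column in ("input_tokens", "output_tokens")
}

for column, stats in summary.items():
    print(column, stats)
```

The choice of sample rather than population standard deviation is an assumption based only on the values shown; with two rows differing by 2, the population version would give 1.0 instead of 1.41421.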

Should be merged after #1034

Closes #1046

…dd new merge_dicts to help merging user-assistant messages in magpie
@plaguss plaguss added the enhancement New feature or request label Nov 8, 2024
@plaguss plaguss added this to the 1.5.0 milestone Nov 8, 2024
@plaguss plaguss self-assigned this Nov 8, 2024
@plaguss plaguss linked an issue Nov 8, 2024 that may be closed by this pull request

github-actions bot commented Nov 8, 2024

Documentation for this PR has been built. You can view it at: https://distilabel.argilla.io/pr-1055/


codspeed-hq bot commented Nov 8, 2024

CodSpeed Performance Report

Merging #1055 will not alter performance

Comparing count-dataset-tokens (5741cd1) with develop (e830e25)

Summary

✅ 1 untouched benchmarks

Development

Successfully merging this pull request may close these issues.

[FEATURE] Compute the input/output tokens of a dataset