Adding support for Classifiers and Search tools #219

Draft
wants to merge 292 commits into
base: main
d3efee4
Merge branch 'main' into soldni/backoff
soldni May 21, 2024
aee8ba6
adding support for batching
soldni May 22, 2024
ff1e496
better eval
soldni May 22, 2024
ed28a7a
fixed minor failure
soldni May 22, 2024
0953e80
merge
soldni May 22, 2024
3c4f17d
Merge branch 'soldni/backoff' of https://github.com/allenai/dolma int…
soldni May 22, 2024
2cc4084
small fix in math for processors
soldni May 22, 2024
eda41c3
using native types when possible
soldni May 22, 2024
d93a54f
indent
soldni May 22, 2024
29dca70
copyright
soldni May 22, 2024
dc2fa98
better string
soldni May 22, 2024
97b3bd2
comment
soldni May 22, 2024
845072e
progressbar
soldni May 23, 2024
35719fc
added support for old-style retries_on_error
soldni May 23, 2024
67b3bda
added support for retries_on_error
soldni May 23, 2024
155319c
data
soldni May 23, 2024
d8cb681
deps
soldni May 23, 2024
e6270dc
get_annotations not available
soldni May 23, 2024
75a5b0d
fixes
soldni May 23, 2024
86371d6
quoting type aliases
soldni May 23, 2024
73aad08
3.8 compatibility
soldni May 23, 2024
b9ec3eb
more style
soldni May 23, 2024
e42f9fc
pyi
soldni May 23, 2024
d9cbac0
Merge branch 'soldni/pbar2' into soldni/backoff
soldni May 23, 2024
09aa96a
progress
soldni May 24, 2024
be6c984
viz pbar
soldni May 24, 2024
5c90e9f
fixes
soldni May 24, 2024
f5c696c
fixing small regression in tests
soldni May 24, 2024
e941f05
order from user
soldni May 24, 2024
99264b0
same order
soldni May 24, 2024
8f86b62
tests
soldni May 24, 2024
88a9e55
Merge branch 'soldni/pbar2' into soldni/backoff
soldni May 24, 2024
18a3f71
older
soldni May 24, 2024
f0e8af4
note for common runtime
soldni May 24, 2024
36c18d2
removing attempts
soldni May 24, 2024
708affc
progressbar
soldni May 25, 2024
824e11e
progressbar
soldni May 25, 2024
bb054ca
adding linearizers
soldni May 28, 2024
c2315e7
Merge branch 'soldni/backoff' of https://github.com/allenai/dolma int…
soldni May 28, 2024
3cec13a
license script
soldni May 28, 2024
81c473a
script
soldni May 28, 2024
42e1514
science
soldni May 28, 2024
4336a46
added better tests
soldni May 28, 2024
e01b408
added tests
soldni May 28, 2024
46135ab
types
soldni May 28, 2024
a8803e8
sorting
soldni May 29, 2024
002611f
name
soldni May 29, 2024
b090d25
send
soldni May 29, 2024
486ef25
typo
soldni May 29, 2024
275eb95
spacing
soldni May 29, 2024
26b791b
skipping big tests
soldni May 29, 2024
c9f5888
optional tests w large download
soldni May 29, 2024
628dd14
corner case failure
soldni May 29, 2024
bc61e05
quantized
soldni May 29, 2024
36b4275
s3 destination
soldni May 29, 2024
d7d629f
commit
soldni May 29, 2024
47b00dc
style
soldni May 30, 2024
af2f820
owm
soldni May 30, 2024
19c13cf
science v2
soldni May 30, 2024
2678191
science v2
soldni May 30, 2024
851e767
science v1
soldni May 30, 2024
5401ae5
science v1
soldni May 30, 2024
7dc4ad9
edit
soldni May 31, 2024
5763cc3
config
soldni May 31, 2024
493a18f
fixes total
soldni May 31, 2024
c436c0d
minor fix
soldni Jun 1, 2024
537b7a7
linersizer
soldni Jun 1, 2024
7221a0b
fallback
soldni Jun 1, 2024
5d09315
added compression
soldni Jun 2, 2024
937ccc0
adding test data
soldni Jun 2, 2024
e93cc48
wip
soldni Jun 4, 2024
b7c5c59
tests
soldni Jun 4, 2024
f6bbf23
tests
soldni Jun 4, 2024
33da8dc
fixed tests
soldni Jun 4, 2024
fd041c1
fixes
soldni Jun 5, 2024
68d6b35
added flags in config
soldni Jun 5, 2024
84370ca
Merge branch 'soldni/zst' into soldni/backoff
soldni Jun 5, 2024
928b85a
better error handling
soldni Jun 7, 2024
f5767c2
Merge branch 'main' into soldni/backoff
soldni Jun 7, 2024
592b7a8
Merge branch 'soldni/mixer-fix' into soldni/backoff
soldni Jun 7, 2024
a3ddbd2
added files (to be removed)
soldni Jun 7, 2024
1c6a62a
new stuff
soldni Jun 7, 2024
16b8dc9
other names
soldni Jun 7, 2024
f182d0d
update
soldni Jun 7, 2024
3227a11
configs
soldni Jun 7, 2024
75d2938
small fix gopher tagger
soldni Jun 8, 2024
f8b771b
addding configs
soldni Jun 8, 2024
cea2d1e
wip
soldni Jun 8, 2024
426ef1a
optional
soldni Jun 8, 2024
ed9cf90
para
soldni Jun 9, 2024
8a4dbfc
new resolver
soldni Jun 9, 2024
5f5feeb
random delay
soldni Jun 9, 2024
2b20cd4
delay
soldni Jun 9, 2024
6e9fad2
jitter log
soldni Jun 9, 2024
76c4f6b
feix
soldni Jun 9, 2024
9a86e09
dedup
soldni Jun 9, 2024
a1fb0e2
test
soldni Jun 9, 2024
cd3f38d
new steps
soldni Jun 9, 2024
1ee4bed
fixes
soldni Jun 9, 2024
d818395
all
soldni Jun 9, 2024
ffd7ccd
all
soldni Jun 9, 2024
06ecd9c
indent
soldni Jun 9, 2024
6102ef8
keyword
soldni Jun 9, 2024
e42b7c3
wip
soldni Jun 9, 2024
5107c34
discarding fields
soldni Jun 9, 2024
eaca238
w
soldni Jun 9, 2024
ca29511
sizes
soldni Jun 10, 2024
965c053
lciense
soldni Jun 10, 2024
4af5f21
fixing paths
soldni Jun 10, 2024
1c27e33
scripts to get labels
soldni Jun 11, 2024
def2027
exp
soldni Jun 9, 2024
f1c877a
test, stats
soldni Jun 13, 2024
3e07e51
optional id
soldni Jun 21, 2024
38a3122
reverted
soldni Jun 21, 2024
b487104
ext
soldni Jun 21, 2024
7462aa8
missed configs
soldni Jun 24, 2024
c8e4d7c
added function to count top k tokens
soldni Jun 24, 2024
ba66c91
missing
soldni Jun 24, 2024
d7558db
count
soldni Jul 6, 2024
a6e74a7
added option for tokenizer to split on special tokens
soldni Jul 13, 2024
a576020
added configs
soldni Jul 13, 2024
f31662a
Merge branch 'soldni/tiktoken' into soldni/backoff
soldni Jul 13, 2024
c31fab3
encoding special tokens
soldni Jul 13, 2024
f405d47
more paths
soldni Jul 15, 2024
3498632
configs
soldni Jul 15, 2024
e560e99
Merge branch 'main' into soldni/backoff
soldni Aug 8, 2024
58a84d4
cc-news-new
soldni Aug 21, 2024
ab586a1
Merge branch 'main' into soldni/backoff
soldni Aug 21, 2024
d7998ff
version
soldni Aug 23, 2024
5dd7611
adding new lengths
soldni Aug 24, 2024
bd46c36
script
soldni Aug 24, 2024
04277c4
partitions
soldni Aug 25, 2024
1768ff0
small
soldni Aug 27, 2024
a50fcaa
100 chars
soldni Aug 27, 2024
de42c1a
datasets
soldni Aug 27, 2024
d34012e
reformatted
soldni Aug 27, 2024
15c3ca6
Merge branch 'soldni/backoff' of https://github.com/allenai/dolma int…
soldni Aug 27, 2024
e2337fe
adding acquisition script
soldni Sep 27, 2024
af13c63
.
soldni Oct 1, 2024
24428ee
added new source
soldni Oct 4, 2024
5c6aa3a
decon wip
soldni Oct 4, 2024
2782b19
Merge branch 'main' into solni/se
soldni Oct 4, 2024
86b71c0
Merge remote-tracking branch 'origin/soldni/se' into solni/se
soldni Oct 4, 2024
15e29a0
Update Cargo.lock
soldni Oct 4, 2024
7a3c254
new mathy sources
soldni Oct 5, 2024
5be1477
new math
soldni Oct 5, 2024
ccd5e12
fast forwarding
soldni Oct 5, 2024
3162f4d
alt tok
soldni Oct 5, 2024
2c4fd5a
wip
soldni Oct 5, 2024
5f26194
new ingestion
soldni Oct 7, 2024
1d31ed5
Merge branch 'solni/se' of https://github.com/allenai/dolma into soln…
soldni Oct 7, 2024
16e5483
changed script
soldni Oct 9, 2024
b0b9256
more tasks
soldni Oct 10, 2024
7229237
tests
soldni Oct 10, 2024
183dc36
search
soldni Oct 12, 2024
c34bf21
search wip
soldni Oct 15, 2024
8195171
search
soldni Oct 15, 2024
4470012
more search
soldni Oct 16, 2024
f4ab66f
wip
soldni Oct 16, 2024
c1d4cf4
just search
soldni Oct 22, 2024
58b9e88
removed unused files
soldni Oct 22, 2024
27cdffc
style
soldni Oct 22, 2024
bb1652c
added script to process eli5
soldni Oct 22, 2024
e3a460b
fixes, extra split
soldni Oct 22, 2024
123304b
processing
soldni Oct 22, 2024
346ceb0
current version
soldni Oct 22, 2024
c5d29aa
wip
soldni Oct 23, 2024
f6ab6d1
renamed
soldni Oct 23, 2024
731e5c7
gantry
soldni Oct 24, 2024
d2b016b
new secrets
soldni Oct 24, 2024
c5b9254
test
soldni Oct 24, 2024
ba721e2
files
soldni Oct 24, 2024
7366ad4
wandb
soldni Oct 24, 2024
13d3ebe
fwd main
soldni Oct 24, 2024
8bf43f2
fwd main 2
soldni Oct 24, 2024
a94f8ad
Merge branch 'main' into soldni/classify-search-sources
soldni Oct 24, 2024
fd9ea5d
spacing
soldni Oct 24, 2024
346a325
logging
soldni Oct 24, 2024
4e3fbf2
removed test
soldni Oct 24, 2024
a193813
style
soldni Oct 24, 2024
aa7b45c
off-process writing
soldni Oct 24, 2024
fec0770
improve?
soldni Oct 24, 2024
b29aed9
added compile flag
soldni Oct 24, 2024
93d3b6f
new configs
soldni Oct 24, 2024
42be092
deberta quality
soldni Oct 24, 2024
476d6d0
instructions
soldni Oct 24, 2024
ac0b839
more configs
soldni Oct 24, 2024
4d49ac4
few more
soldni Oct 24, 2024
4ec4b32
closing files
soldni Oct 24, 2024
7fc0db2
more nodes
soldni Oct 24, 2024
6872c31
added graceful skip
soldni Oct 24, 2024
4f0ab10
better closing
soldni Oct 24, 2024
a4de365
silly error
soldni Oct 24, 2024
7f3b2d7
defaultdict
soldni Oct 24, 2024
e14a36b
better stack
soldni Oct 24, 2024
8f22428
imports
soldni Oct 24, 2024
cd051ae
not returning item
soldni Oct 24, 2024
08a44f2
not returning item
soldni Oct 24, 2024
d89e24e
simplified writer
soldni Oct 24, 2024
eb8fc40
think i fixed it?
soldni Oct 24, 2024
e22cea5
fixed writing bugs
soldni Oct 24, 2024
10ae4f1
pipeline fix
soldni Oct 24, 2024
bb335e5
increased prefetch factor
soldni Oct 24, 2024
9428a75
increased workers
soldni Oct 24, 2024
21fcc20
increased workers
soldni Oct 24, 2024
b50456a
sources
soldni Oct 25, 2024
38cadda
ugly but fast
soldni Oct 25, 2024
c271273
style
soldni Oct 26, 2024
5d85400
fixes
soldni Oct 26, 2024
a80609f
smalfix
soldni Oct 26, 2024
9135994
full
soldni Oct 26, 2024
e1d2088
forgot to change paths
soldni Oct 27, 2024
973621f
clis
soldni Oct 29, 2024
39ca09a
configs
soldni Oct 29, 2024
e897c55
reduce ram
soldni Oct 29, 2024
b5271a6
tokens
soldni Oct 29, 2024
977377f
mixing
soldni Oct 29, 2024
d721002
config
soldni Oct 29, 2024
2d92936
accidentally deleted
soldni Oct 29, 2024
46e65a6
minor fixes
soldni Oct 29, 2024
2bfda88
sorting apis
soldni Oct 29, 2024
882d95f
wip
soldni Nov 1, 2024
3350209
importing script to verify tokenized data
soldni Nov 7, 2024
42f87f5
tokenization
soldni Nov 12, 2024
e836908
style
soldni Nov 14, 2024
41c753a
url tagger
soldni Nov 14, 2024
2feadc4
mmlu web
soldni Nov 17, 2024
d7ae39f
fortmatting
soldni Nov 18, 2024
70e87ec
adding new configs
soldni Nov 18, 2024
d22e973
more confs
soldni Nov 18, 2024
97998b5
.
soldni Nov 18, 2024
0b07b45
wips
soldni Dec 3, 2024
5b448f5
new tokenizer experiments
soldni Dec 4, 2024
25fe2a3
old tokenizer dolmino
soldni Dec 13, 2024
48424c1
Merge branch 'soldni/classify-search-sources' of https://github.com/a…
soldni Dec 13, 2024
6050bd9
dtype
soldni Dec 13, 2024
4cebe78
sampling
soldni Dec 13, 2024
c944b6a
news
soldni Dec 24, 2024
2b6b5f7
mix
soldni Dec 30, 2024
f5f1224
small tweaks
soldni Dec 30, 2024
3705a32
new langid
soldni Dec 30, 2024
4d839de
typo
soldni Dec 30, 2024
dadeabc
style
soldni Dec 30, 2024
5d56975
skipping
soldni Jan 1, 2025
df9fca7
skipping
soldni Jan 1, 2025
9d38f10
flags
soldni Jan 1, 2025
adca577
up
soldni Jan 2, 2025
79747b1
mixing langs
soldni Jan 2, 2025
fe1b554
configs
soldni Jan 2, 2025
59acbf7
Merge branch 'main' into soldni/classify-search-sources
soldni Jan 7, 2025
2 changes: 1 addition & 1 deletion .devcontainer/postInstall.sh
@@ -2,4 +2,4 @@

 PATH=/home/vscode/.cargo/bin:$PATH
 cd dolma
-source /home/vscode/miniforge3/bin/activate && pip install cmake "maturin[patchelf]>=1.1,<2.0"
+source /home/vscode/miniforge3/bin/activate && pip install cmake "maturin>=1.5,<2.0"
1 change: 1 addition & 0 deletions .github/workflows/CI.yml
@@ -19,6 +19,7 @@ permissions:
 env:
   DOLMA_TESTS_SKIP_AWS: ${{ secrets.AWS_ACCESS_KEY_ID == '' && 'true' || 'false' }}
   DOLMA_TEST_S3_PREFIX: s3://dolma-tests
+  DOLMA_TEST_SKIP_LARGE_MODELS: "true"
   RUST_CHANNEL: stable

 jobs:
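
Note on the new `DOLMA_TEST_SKIP_LARGE_MODELS` variable: it presumably gates the tests that download large models (compare the commits "skipping big tests" and "optional tests w large download"). How the test suite reads the flag is not shown in this diff. The sketch below is one way a test could honor it, assuming that only the case-insensitive string `true` enables skipping; the helper `skip_large_models` and the test class are hypothetical names, not code from this PR.

```python
import os
import unittest


def skip_large_models(environ=None):
    """Return True when DOLMA_TEST_SKIP_LARGE_MODELS is set to "true".

    Assumption: only the case-insensitive string "true" enables skipping.
    """
    env = os.environ if environ is None else environ
    value = env.get("DOLMA_TEST_SKIP_LARGE_MODELS", "false")
    return value.strip().lower() == "true"


class LargeModelTest(unittest.TestCase):
    # Hypothetical test case; the real suite may be organized differently.
    @unittest.skipIf(skip_large_models(), "large model download disabled")
    def test_classifier_loads(self):
        # Placeholder for a test that would download and load a model.
        self.assertTrue(True)
```

With the CI setting above, such a test would be reported as skipped rather than failing on a missing download.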
2 changes: 1 addition & 1 deletion Cargo.lock

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

30 changes: 30 additions & 0 deletions classifiers/README.md
@@ -0,0 +1,30 @@
# Dolma Classifiers


## Getting Started

From the repository root, install the package in editable mode:

```bash
pip install -e classifiers
```

## Examples

Run the [Hugging Face FineWeb-Edu classifier](https://huggingface.co/HuggingFaceFW/fineweb-edu-classifier) on S3 data:

```bash
python -m dolma_classifiers.inference \
-s 's3://ai2-llm/pretraining-data/sources/dclm/v0/documents/40b-split/20b-01/*zstd' \
-m HuggingFaceFW/fineweb-edu-classifier
```

Run [NVIDIA's DeBERTa quality classifier](https://huggingface.co/nvidia/quality-classifier-deberta) on S3 data with model compilation enabled:

```bash
python -m dolma_classifiers.inference \
-s 's3://ai2-llm/pretraining-data/sources/dclm/v0/documents/40b-split/*/*zstd' \
-m nvidia/quality-classifier-deberta \
--model-compile \
--max-length 1024
```
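
The patterns passed to `-s` are single-quoted so that the local shell hands the `*` to `dolma_classifiers.inference` unexpanded. As a hedged illustration (not part of the package), the same invocation can be assembled as an argument list in Python. The flags are the ones shown in the examples above; `build_inference_command` is a hypothetical helper.

```python
import shlex


def build_inference_command(source, model, extra=()):
    """Assemble the README invocation as an argv list (hypothetical helper)."""
    return [
        "python", "-m", "dolma_classifiers.inference",
        "-s", source,
        "-m", model,
        *extra,
    ]


argv = build_inference_command(
    "s3://ai2-llm/pretraining-data/sources/dclm/v0/documents/40b-split/*/*zstd",
    "nvidia/quality-classifier-deberta",
    extra=["--model-compile", "--max-length", "1024"],
)

# shlex.join quotes the glob so a shell would pass it through literally.
print(shlex.join(argv))
```

Passing `argv` to `subprocess.run(argv, check=True)` would bypass the shell entirely, so no quoting is needed in that case.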
107 changes: 107 additions & 0 deletions classifiers/pyproject.toml
@@ -0,0 +1,107 @@
[project]
name = "dolma-classifiers"
version = "0.1.0"
description = "Toolkit for easy classification of data in Dolma format."
authors = [
{name = "Luca Soldaini", email = "[email protected]" }
]
license = {text = "Apache-2.0"}
readme = "README.md"
requires-python = ">=3.10"
dependencies = [
"msgspec",
"fsspec[s3]",
"smart_open[s3]>=7.0.4",
"tqdm",
"torch",
"transformers",
"wandb",
"jq"
]

[project.urls]
"Homepage" = "https://github.com/allenai/dolma"
"Repository" = "https://github.com/allenai/dolma"
"Bug Tracker" = "https://github.com/allenai/dolma/issues"


[tool.setuptools.packages.find]
where = ["src"]

[tool.setuptools.package-data]
dolma_classifiers = ["py.typed", "*.pyi"]


[build-system]
build-backend = "setuptools.build_meta"
requires = [
"setuptools >= 61.0.0",
"wheel"
]

[project.optional-dependencies]
dev = [
"black>=22.6.0",
"isort>=5.10.1",
"mypy>=0.971",
"pytest>=5.2",
"ipython>=8.4.0",
"autopep8>=1.7.0",
"flake8>=5.0",
"ipdb>=0.13.0",
"flake8-pyi>=22.8.1",
"Flake8-pyproject>=1.1.0",
"pytest-asyncio>=0.15.1",
"pytest-cov>=2.12.1",
"aioresponses>=0.7.2",
]

[tool.black]
line-length = 115
include = '\.pyi?$'
exclude = '''
(
__pycache__
| \.git
| \.mypy_cache
| \.pytest_cache
| \.vscode
| \.venv
| \bdist\b
| \bdoc\b
)
'''

[tool.isort]
profile = "black"
line_length = 115
multi_line_output = 3

[tool.autopep8]
max_line_length = 115
in-place = true
recursive = true
aggressive = 3

[tool.mypy]
python_version = "3.10"
ignore_missing_imports = true
no_site_packages = true
allow_redefinition = false
warn_unused_configs = true
warn_unused_ignores = true
warn_no_return = true
warn_return_any = false
warn_unreachable = true
show_error_codes = true
pretty = true

[tool.mypy-tests]
strict_optional = false

[tool.flake8]
per-file-ignores = [
'__init__.py:F401',
'*.pyi:E302,E305',
'*.py:E203'
]
45 changes: 45 additions & 0 deletions classifiers/scripts/fineweb_100b.sh
@@ -0,0 +1,45 @@
#! /bin/bash

DOCUMENTS='s3://ai2-llm/pretraining-data/sources/dclm/v0/documents/100*/*.jsonl.zstd'

NUM_NODES=2
MODEL_NAME="HuggingFaceFW/fineweb-edu-classifier"
CLUSTER="ai2/jupiter*"
BATCH_SIZE=1024
PRIORITY="high"

# Generate a hash for the run name by combining model name and documents
RUN_HASH=$(echo -n "${MODEL_NAME}${DOCUMENTS}" | md5sum | awk '{print $1}')
RUN_NAME="fineweb_classifier_${RUN_HASH:0:8}"

# Set the run name as an environment variable
export BEAKER_EXPERIMENT_NAME="${RUN_NAME}"


gantry run \
--task-name "${RUN_NAME}" \
--description "Score ${DOCUMENTS} with ${MODEL_NAME}" \
--allow-dirty \
--workspace ai2/davidw-oe-annealing \
--beaker-image 'petew/olmo-torch23-gantry' \
--timeout -1 \
--show-logs \
--host-networking \
--venv 'base' \
--priority "${PRIORITY}" \
--leader-selection \
--gpus 8 \
--replicas ${NUM_NODES} \
--preemptible \
--cluster "${CLUSTER}" \
--budget ai2/oe-data \
--env LOG_FILTER_TYPE=local_rank0_only \
--env OMP_NUM_THREADS=8 \
--env BEAKER_USER_ID=$(beaker account whoami --format json | jq '.[0].name' -cr) \
--env-secret AWS_ACCESS_KEY_ID=lucas-AWS_ACCESS_KEY_ID \
--env-secret AWS_SECRET_ACCESS_KEY=lucas-AWS_SECRET_ACCESS_KEY \
--env-secret WANDB_API_KEY=lucas-WANDB_API_KEY \
--shared-memory 10GiB \
--install "pip install -e classifiers/" \
--yes \
-- /bin/bash -c "huggingface-cli download ${MODEL_NAME} && torchrun --nnodes "${NUM_NODES}:${NUM_NODES}" --nproc-per-node 8 --rdzv_id 12347 --rdzv_backend static --rdzv_endpoint "\${BEAKER_LEADER_REPLICA_HOSTNAME}:29400" --node_rank "\${BEAKER_REPLICA_RANK}" --rdzv_conf 'read_timeout=3600' -m dolma_classifiers.inference --source-prefix ${DOCUMENTS} --batch-size ${BATCH_SIZE} --use-wandb --wandb-project 'dolma-classifiers' --wandb-entity ai2-llm --model-name ${MODEL_NAME} --num-workers 8 --prefetch-factor 8"
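
Note on the run name: it is the first eight hex characters of the MD5 digest of the model name concatenated with the documents pattern (`echo -n` omits the trailing newline, so only those two strings are hashed). For readers who want to reproduce a run name outside the shell, a minimal Python equivalent follows, assuming both strings are UTF-8 encoded; `run_name` is a hypothetical helper, not part of this PR.

```python
import hashlib


def run_name(model_name, documents, prefix="fineweb_classifier"):
    """Mirror RUN_HASH/RUN_NAME from the script: md5(model + documents)[:8]."""
    payload = f"{model_name}{documents}".encode("utf-8")
    digest = hashlib.md5(payload).hexdigest()
    return f"{prefix}_{digest[:8]}"


print(run_name(
    "HuggingFaceFW/fineweb-edu-classifier",
    "s3://ai2-llm/pretraining-data/sources/dclm/v0/documents/100*/*.jsonl.zstd",
))
```

Because the name depends only on the model and the documents pattern, relaunching the same script yields the same run name.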
45 changes: 45 additions & 0 deletions classifiers/scripts/fineweb_40b.sh
@@ -0,0 +1,45 @@
#! /bin/bash

DOCUMENTS='s3://ai2-llm/pretraining-data/sources/dclm/v0/documents/40b-split/*/*zstd'
NUM_NODES=1
BATCH_SIZE=1024
CLUSTER="ai2/neptune*"
PRIORITY="high"
MODEL_NAME="HuggingFaceFW/fineweb-edu-classifier"


# Generate a hash for the run name by combining model name and documents
RUN_HASH=$(echo -n "${MODEL_NAME}${DOCUMENTS}" | md5sum | awk '{print $1}')
RUN_NAME="fineweb_classifier_${RUN_HASH:0:8}"

# Set the run name as an environment variable
export BEAKER_EXPERIMENT_NAME="${RUN_NAME}"


gantry run \
--task-name "${RUN_NAME}" \
--description "Score ${DOCUMENTS} with ${MODEL_NAME}" \
--allow-dirty \
--workspace ai2/davidw-oe-annealing \
--beaker-image 'petew/olmo-torch23-gantry' \
--timeout -1 \
--show-logs \
--host-networking \
--venv 'base' \
--priority "${PRIORITY}" \
--leader-selection \
--gpus 8 \
--replicas ${NUM_NODES} \
--preemptible \
--cluster "${CLUSTER}" \
--budget ai2/oe-data \
--env LOG_FILTER_TYPE=local_rank0_only \
--env OMP_NUM_THREADS=8 \
--env BEAKER_USER_ID=$(beaker account whoami --format json | jq '.[0].name' -cr) \
--env-secret AWS_ACCESS_KEY_ID=lucas-AWS_ACCESS_KEY_ID \
--env-secret AWS_SECRET_ACCESS_KEY=lucas-AWS_SECRET_ACCESS_KEY \
--env-secret WANDB_API_KEY=lucas-WANDB_API_KEY \
--shared-memory 10GiB \
--install "pip install -e classifiers/" \
--yes \
-- /bin/bash -c "huggingface-cli download ${MODEL_NAME} && torchrun --nnodes "${NUM_NODES}:${NUM_NODES}" --nproc-per-node 8 --rdzv_id 12347 --rdzv_backend static --rdzv_endpoint "\${BEAKER_LEADER_REPLICA_HOSTNAME}:29400" --node_rank "\${BEAKER_REPLICA_RANK}" --rdzv_conf 'read_timeout=3600' -m dolma_classifiers.inference --source-prefix ${DOCUMENTS} --batch-size ${BATCH_SIZE} --use-wandb --wandb-project 'dolma-classifiers' --wandb-entity ai2-llm --model-name ${MODEL_NAME} --num-workers 4 --prefetch-factor 8"
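
Note on the launch line: `--nnodes N:N` sets the minimum and maximum node count to the same value, and `--nproc-per-node 8` matches the 8 GPUs requested with `--gpus 8`, so the job runs `NUM_NODES * 8` worker processes in total. The sketch below restates that arithmetic and the `torchrun` argument list in Python. Flag names and values are copied from the script; the function names and the example hostname are hypothetical.

```python
def world_size(num_nodes, procs_per_node=8):
    """Total number of worker processes across all replicas."""
    return num_nodes * procs_per_node


def torchrun_args(num_nodes, leader_host, node_rank, procs_per_node=8, port=29400):
    """Build the torchrun portion of the command as an argv list."""
    return [
        "torchrun",
        "--nnodes", f"{num_nodes}:{num_nodes}",  # fixed-size job: min == max
        "--nproc-per-node", str(procs_per_node),
        "--rdzv_id", "12347",
        "--rdzv_backend", "static",
        "--rdzv_endpoint", f"{leader_host}:{port}",
        "--node_rank", str(node_rank),
        "--rdzv_conf", "read_timeout=3600",
        "-m", "dolma_classifiers.inference",
    ]


# Example: the single-node configuration used in this script.
print(world_size(1), torchrun_args(1, "leader.example", 0))
```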
45 changes: 45 additions & 0 deletions classifiers/scripts/fineweb_50b_extra.sh
@@ -0,0 +1,45 @@
#! /bin/bash

DOCUMENTS='s3://ai2-llm/pretraining-data/sources/dclm/v0/documents/20240909-50b/*zstd'
NUM_NODES=1
MODEL_NAME="HuggingFaceFW/fineweb-edu-classifier"
CLUSTER="ai2/jupiter*"
BATCH_SIZE=1024
PRIORITY="high"


# Generate a hash for the run name by combining model name and documents
RUN_HASH=$(echo -n "${MODEL_NAME}${DOCUMENTS}" | md5sum | awk '{print $1}')
RUN_NAME="fineweb_classifier_${RUN_HASH:0:8}"

# Set the run name as an environment variable
export BEAKER_EXPERIMENT_NAME="${RUN_NAME}"


gantry run \
--task-name "${RUN_NAME}" \
--description "Score ${DOCUMENTS} with ${MODEL_NAME}" \
--allow-dirty \
--workspace ai2/davidw-oe-annealing \
--beaker-image 'petew/olmo-torch23-gantry' \
--timeout -1 \
--show-logs \
--host-networking \
--venv 'base' \
--priority "${PRIORITY}" \
--leader-selection \
--gpus 8 \
--replicas ${NUM_NODES} \
--preemptible \
--cluster "${CLUSTER}" \
--budget ai2/oe-data \
--env LOG_FILTER_TYPE=local_rank0_only \
--env OMP_NUM_THREADS=8 \
--env BEAKER_USER_ID=$(beaker account whoami --format json | jq '.[0].name' -cr) \
--env-secret AWS_ACCESS_KEY_ID=lucas-AWS_ACCESS_KEY_ID \
--env-secret AWS_SECRET_ACCESS_KEY=lucas-AWS_SECRET_ACCESS_KEY \
--env-secret WANDB_API_KEY=lucas-WANDB_API_KEY \
--shared-memory 10GiB \
--install "pip install -e classifiers/" \
--yes \
-- /bin/bash -c "huggingface-cli download ${MODEL_NAME} && torchrun --nnodes "${NUM_NODES}:${NUM_NODES}" --nproc-per-node 8 --rdzv_id 12347 --rdzv_backend static --rdzv_endpoint "\${BEAKER_LEADER_REPLICA_HOSTNAME}:29400" --node_rank "\${BEAKER_REPLICA_RANK}" --rdzv_conf 'read_timeout=3600' -m dolma_classifiers.inference --source-prefix ${DOCUMENTS} --batch-size ${BATCH_SIZE} --use-wandb --wandb-project 'dolma-classifiers' --wandb-entity ai2-llm --model-name ${MODEL_NAME} --num-workers 8 --prefetch-factor 8"
45 changes: 45 additions & 0 deletions classifiers/scripts/fineweb_automath_arxiv.sh
@@ -0,0 +1,45 @@
#! /bin/bash

DOCUMENTS='s3://ai2-llm/pretraining-data/sources/math-ai_AutoMathText/v0/documents/arxiv/*/*.gz'

NUM_NODES=1
MODEL_NAME="HuggingFaceFW/fineweb-edu-classifier"
CLUSTER="ai2/jupiter*"
BATCH_SIZE=1024
PRIORITY="urgent"

# Generate a hash for the run name by combining model name and documents
RUN_HASH=$(echo -n "${MODEL_NAME}${DOCUMENTS}" | md5sum | awk '{print $1}')
RUN_NAME="fineweb_classifier_${RUN_HASH:0:8}"

# Set the run name as an environment variable
export BEAKER_EXPERIMENT_NAME="${RUN_NAME}"


gantry run \
--task-name "${RUN_NAME}" \
--description "Score ${DOCUMENTS} with ${MODEL_NAME}" \
--allow-dirty \
--workspace ai2/davidw-oe-annealing \
--beaker-image 'petew/olmo-torch23-gantry' \
--timeout -1 \
--show-logs \
--host-networking \
--venv 'base' \
--priority "${PRIORITY}" \
--leader-selection \
--gpus 8 \
--replicas ${NUM_NODES} \
--preemptible \
--cluster "${CLUSTER}" \
--budget ai2/oe-data \
--env LOG_FILTER_TYPE=local_rank0_only \
--env OMP_NUM_THREADS=8 \
--env BEAKER_USER_ID=$(beaker account whoami --format json | jq '.[0].name' -cr) \
--env-secret AWS_ACCESS_KEY_ID=lucas-AWS_ACCESS_KEY_ID \
--env-secret AWS_SECRET_ACCESS_KEY=lucas-AWS_SECRET_ACCESS_KEY \
--env-secret WANDB_API_KEY=lucas-WANDB_API_KEY \
--shared-memory 10GiB \
--install "pip install -e classifiers/" \
--yes \
-- /bin/bash -c "huggingface-cli download ${MODEL_NAME} && torchrun --nnodes "${NUM_NODES}:${NUM_NODES}" --nproc-per-node 8 --rdzv_id 12347 --rdzv_backend static --rdzv_endpoint "\${BEAKER_LEADER_REPLICA_HOSTNAME}:29400" --node_rank "\${BEAKER_REPLICA_RANK}" --rdzv_conf 'read_timeout=3600' -m dolma_classifiers.inference --source-prefix ${DOCUMENTS} --batch-size ${BATCH_SIZE} --use-wandb --wandb-project 'dolma-classifiers' --wandb-entity ai2-llm --model-name ${MODEL_NAME} --num-workers 8 --prefetch-factor 8"
45 changes: 45 additions & 0 deletions classifiers/scripts/fineweb_automath_code.sh
@@ -0,0 +1,45 @@
#! /bin/bash

DOCUMENTS='s3://ai2-llm/pretraining-data/sources/math-ai_AutoMathText/v0/documents/code/*/*.gz'

NUM_NODES=1
MODEL_NAME="HuggingFaceFW/fineweb-edu-classifier"
CLUSTER="ai2/jupiter*"
BATCH_SIZE=1024
PRIORITY="urgent"

# Generate a hash for the run name by combining model name and documents
RUN_HASH=$(echo -n "${MODEL_NAME}${DOCUMENTS}" | md5sum | awk '{print $1}')
RUN_NAME="fineweb_classifier_${RUN_HASH:0:8}"

# Set the run name as an environment variable
export BEAKER_EXPERIMENT_NAME="${RUN_NAME}"


gantry run \
--task-name "${RUN_NAME}" \
--description "Score ${DOCUMENTS} with ${MODEL_NAME}" \
--allow-dirty \
--workspace ai2/davidw-oe-annealing \
--beaker-image 'petew/olmo-torch23-gantry' \
--timeout -1 \
--show-logs \
--host-networking \
--venv 'base' \
--priority "${PRIORITY}" \
--leader-selection \
--gpus 8 \
--replicas ${NUM_NODES} \
--preemptible \
--cluster "${CLUSTER}" \
--budget ai2/oe-data \
--env LOG_FILTER_TYPE=local_rank0_only \
--env OMP_NUM_THREADS=8 \
--env BEAKER_USER_ID=$(beaker account whoami --format json | jq '.[0].name' -cr) \
--env-secret AWS_ACCESS_KEY_ID=lucas-AWS_ACCESS_KEY_ID \
--env-secret AWS_SECRET_ACCESS_KEY=lucas-AWS_SECRET_ACCESS_KEY \
--env-secret WANDB_API_KEY=lucas-WANDB_API_KEY \
--shared-memory 10GiB \
--install "pip install -e classifiers/" \
--yes \
-- /bin/bash -c "huggingface-cli download ${MODEL_NAME} && torchrun --nnodes "${NUM_NODES}:${NUM_NODES}" --nproc-per-node 8 --rdzv_id 12347 --rdzv_backend static --rdzv_endpoint "\${BEAKER_LEADER_REPLICA_HOSTNAME}:29400" --node_rank "\${BEAKER_REPLICA_RANK}" --rdzv_conf 'read_timeout=3600' -m dolma_classifiers.inference --source-prefix ${DOCUMENTS} --batch-size ${BATCH_SIZE} --use-wandb --wandb-project 'dolma-classifiers' --wandb-entity ai2-llm --model-name ${MODEL_NAME} --num-workers 8 --prefetch-factor 8"
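
Note on the five launch scripts in this diff: they share everything except the documents pattern, node count, cluster, priority, and `--num-workers`. The table below restates those differences as data. It is only a reading aid with values copied from the scripts above; `LaunchConfig` and `CONFIGS` are hypothetical names, not part of this PR.

```python
from dataclasses import dataclass

S3_ROOT = "s3://ai2-llm/pretraining-data/sources"


@dataclass(frozen=True)
class LaunchConfig:
    documents: str
    num_nodes: int
    cluster: str
    priority: str
    num_workers: int


# One entry per script, keyed by the script's file name without ".sh".
CONFIGS = {
    "fineweb_100b": LaunchConfig(
        f"{S3_ROOT}/dclm/v0/documents/100*/*.jsonl.zstd", 2, "ai2/jupiter*", "high", 8),
    "fineweb_40b": LaunchConfig(
        f"{S3_ROOT}/dclm/v0/documents/40b-split/*/*zstd", 1, "ai2/neptune*", "high", 4),
    "fineweb_50b_extra": LaunchConfig(
        f"{S3_ROOT}/dclm/v0/documents/20240909-50b/*zstd", 1, "ai2/jupiter*", "high", 8),
    "fineweb_automath_arxiv": LaunchConfig(
        f"{S3_ROOT}/math-ai_AutoMathText/v0/documents/arxiv/*/*.gz", 1, "ai2/jupiter*", "urgent", 8),
    "fineweb_automath_code": LaunchConfig(
        f"{S3_ROOT}/math-ai_AutoMathText/v0/documents/code/*/*.gz", 1, "ai2/jupiter*", "urgent", 8),
}

for name, cfg in CONFIGS.items():
    print(name, cfg.num_nodes * 8, "processes on", cfg.cluster)
```

Keeping the differences in one table like this would make it possible to generate the launch commands from a single template instead of five near-identical files.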