Report for allenai/OLMo-2-1124-7B

Model info

  • Model Info:
    • Tied embeddings: False
    • LM head uses bias: False
    • Embeddings shape: [100352, 4096]
  • Tokenizer Info:
    • Vocab Size: 100278
    • Tokenizer Class: GPT2Tokenizer
    • Tokenizer Type: BPE
    • Bytes handling: Byte Input
    • Token for verification prompt building: abcdefghijklmnopqrstuvwxyz
    • Token id for verification prompt building: 68612
  • Indicator summary:
    • Indicator for under-trained tokens: E_{out} Cosine Distance
    • Overall distribution: 0.350 +/- 0.079
  • Detected Token Counts:
    • Number of tested under-trained tokens: 1992 (1973 non-special), of which 179 below the p = 0.01 threshold and 82 below the soft indicator threshold
    • Number of single byte tokens: 256, of which 13 below indicator threshold
    • Number of special tokens: 0, of which 0 below indicator threshold
    • Number of non-single-byte UTF-fragment tokens: 645, of which 3 below soft indicator threshold
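
The indicator named above, the E_{out} cosine distance, can be illustrated with a minimal sketch. It assumes (the report does not state this) that each row of the output embedding matrix is compared against the mean of a set of reference rows known to be unused, for example the 100352 - 100278 = 74 rows that the embedding matrix has beyond the tokenizer vocabulary. The real procedure may include further steps such as centering or normalization. The matrix is random toy data, and `cosine_distance_to_reference` is a hypothetical helper name.

```python
import numpy as np


def cosine_distance_to_reference(E_out, reference_ids):
    """Cosine distance of every row of E_out to the mean of the reference rows."""
    ref = E_out[reference_ids].mean(axis=0)
    cos = (E_out @ ref) / (np.linalg.norm(E_out, axis=1) * np.linalg.norm(ref))
    return 1.0 - cos


# Rows present in the embedding matrix but absent from the tokenizer vocabulary.
padding_rows = 100352 - 100278

# Toy stand-in for the real matrix (assumption: sizes shrunk for illustration).
rng = np.random.default_rng(0)
vocab_size, total_rows, dim = 1000, 1024, 64
E_out = rng.normal(size=(total_rows, dim))

# Simulate untrained rows: one shared direction plus very small noise.
untrained = rng.normal(size=dim)
reference_ids = np.arange(vocab_size, total_rows)
E_out[reference_ids] = untrained + 1e-4 * rng.normal(size=(len(reference_ids), dim))
E_out[[3, 7]] = untrained + 1e-4 * rng.normal(size=(2, dim))  # two under-trained vocab tokens

indicator = cosine_distance_to_reference(E_out, reference_ids)
candidates = np.flatnonzero(indicator[:vocab_size] < 0.01)
print(padding_rows, candidates)
```

Under this reading, a token whose output embedding never moved away from the untrained direction gets an indicator near zero. The tiny negative indicator values reported further down, such as -2.38419e-07 (which is -2^-22), are consistent with single-precision rounding of a cosine similarity that is essentially 1.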

Under-trained token indicators plot

[Figure: indicators scatter plots]

Verification plot

[Figure: verification plot]

Under-trained token verification results

82 entries below the soft indicator threshold of 0.010

token_id token indicator max_prob in_other_tokens
89472 useRalative -2.38419e-07 1.5e-09 useRalativeImagePath
89471 useRal -1.19209e-07 1.4e-11 useRalativeImagePath, useRalative
100262 |||EMAIL_ADDRESS||| -1.19209e-07 1.5e-10
33786 webElementProperties -1.19209e-07 7.3e-11
57779 \tRTLU 0 2e-10
85069 PostalCodesNL 0 8.4e-11 $PostalCodesNL
47072 webElementX 0 8.4e-11 webElementXpaths
85071 $PostalCodesNL 0 2.8e-11
95812 \tRTCK 0 7.5e-11
41550 \tRTHOOK 0 1.3e-10
80370 ▁ForCanBeConvertedToF 0 8.8e-11 ▁ForCanBeConvertedToForeach
80369 ▁ForCanBeConverted 1.19209e-07 5.2e-11 ▁ForCanBeConvertedToF, ▁ForCanBeConvertedToForeach
47073 webElementXpaths 1.19209e-07 1.9e-09
58508 :-------------</ 1.19209e-07 1.9e-09
100261 |||PHONE_NUMBER||| 1.19209e-07 1.4e-10
83315 richTextPanel 1.19209e-07 5.2e-11
95073 -vesm 1.19209e-07 7.3e-11
80154 \tRTLI 1.19209e-07 1.2e-10
73018 ▁StreamLazy 1.19209e-07 4.1e-10
79883 \tTokenNameIdentifier 1.19209e-07 1.7e-10
62 additional entries below threshold
token_id token indicator max_prob in_other_tokens
70784 Japgolly 1.19209e-07 1e-10 ▁typingsJapgolly
89475 elementGuidId 1.19209e-07 1.8e-10
98100 (stypy 1.19209e-07 1.9e-09
89473 useRalativeImagePath 1.78814e-07 1.7e-11
100263 |||IP_ADDRESS||| 1.78814e-07 1.9e-11
50325 adaptiveStyles 1.78814e-07 1.2e-10
67901 \tRTDBG 1.78814e-07 6.2e-11
52362 SpecWarn 2.98023e-07 8.7e-11
96656 methodPointerType 7.15256e-07 2.7e-09
99202 (statearr 8.9407e-07 3.4e-09
56930 \tRTLR 1.16229e-05 4.6e-11
81259 artisanlib 1.18017e-05 4.9e-11
91198 externalActionCode 1.9908e-05 8.9e-08
82929 CppMethodIntialized 2.54512e-05 7.6e-05
93905 ▁QtAws 2.65837e-05 1.1e-11
84576 ▁AppMethodBeat 3.3319e-05 7.8e-11
76371 LANGADM 5.98431e-05 5e-10
72740 ▁typingsJapgolly 8.30889e-05 1.3e-10
31960 quotelev 0.000137806 3e-06
90050 _ComCallableWrapper 0.00014472 2.8e-09
88023 /ayushman 0.000174642 8.3e-08
80612 MethodBeat 0.000183165 7.6e-11 ▁AppMethodBeat
71337 +lsi 0.000186622 4.1e-10
98668 );\r\r\r\n 0.000294089 6.8e-05
57361 _REALTYPE 0.00043869 2.1e-05
68896 ;\r\r\r\n 0.000684261 0.00014 );\r\r\r\n
97736 \tRTCT 0.000716388 7.8e-07
90412 selectorMethod 0.000768423 1.4e-10
56225 .sulake 0.000790775 2e-05
91817 (InitializedTypeInfo 0.000829816 9.5e-06
58944 /Subthreshold 0.000984609 7.3e-05
89496 _FieldOffsetTable 0.00121212 0.00021
73016 ▁EnumerableStream 0.00126624 0.00011
96737 departureday 0.00172448 0.0002
67750 _typeDefinitionSize 0.00231582 0.0023
73228 _InternalArray 0.00237793 0.0008
26009 methodVisitor 0.00238055 0.00031
88039 ♀♀♀♀ 0.0024671 0.0002
37370 \tEIF 0.00255948 0.00072
87551 CppGuid 0.00259966 0.00055
70316 erusform 0.00260186 0.00049 numerusform
67444 CppTypeDefinitionSizes 0.00339979 0.0026
39866 .xrLabel 0.00416869 0.0045
71390 ▁PodsDummy 0.00445569 2.5e-05
59839 ConstraintMaker 0.00497901 0.0039 MASConstraintMaker
67705 _typeDefinition 0.00510728 0.0012 _typeDefinitionSize
34956 ▁+#+#+#+ 0.00535917 5e-05 ▁+#+#+#+#+#+
87941 $fdata 0.00576878 6.7e-05
67727 |()\n 0.00612545 0.00015
66235 CppTypeDefinition 0.00619704 0.0023 CppTypeDefinitionSizes
84993 rPid 0.00621617 0.0016
85154 buttonShape 0.00623816 0.0084
24452 <lemma 0.00646198 0.0018
45146 %timeout 0.00674826 0.00023
75520 ▁NUITKA 0.00730926 0.0022
75630 雅黑 0.00752032 0.0016 微软雅黑, 软雅黑
76613 extracomment 0.00804365 0.022
43944 orThunk 0.00812399 0.0019 _AdjustorThunk
71227 ▁FINSEQ 0.00825447 0.002
81325 .bindingNavigatorMove 0.00914651 0.16
62761 .layoutControl 0.00955373 0.031
55557 ((&___ 0.00971556 0.0028
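
The `in_other_tokens` column can be read as a substring check against the rest of the vocabulary. The sketch below assumes exactly that (it is inferred from rows such as `useRal` and `erusform`, not from the tool's source), and `containing_tokens` is a hypothetical helper operating on a small hand-picked vocabulary.

```python
def containing_tokens(token, vocab):
    """Return vocabulary entries that contain `token` as a proper substring."""
    return [t for t in vocab if token in t and t != token]


# Hand-picked entries copied from the table above; the real vocabulary is far larger.
vocab = [
    "useRal", "useRalative", "useRalativeImagePath",
    "erusform", "numerusform",
    "雅黑", "软雅黑", "微软雅黑",
]

for tok in ("useRal", "useRalative", "erusform", "雅黑"):
    print(tok, "->", containing_tokens(tok, vocab))
```

A token that only ever occurs inside a longer merged token is rarely emitted on its own during training, which is one plausible reason such fragments show up as under-trained.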

Tokens with partial UTF-8 sequences

3 entries below the soft indicator threshold of 0.010

token_id token indicator in_other_tokens
36225 <0xB7><0xBB>加 -2.38419e-07 添加, ▁添加
28587 <0x8E><0xB7>取 -1.19209e-07 ▁获取, 获取
52188 <0x9D>始化 1.78814e-07 初始化, ▁初始化
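
The three fragments above are what remains of the listed longer tokens once the leading byte or bytes of their first character are cut off. A minimal sketch, assuming the table renders undecodable bytes as `<0xNN>` (the cut positions are chosen by hand and `render_fragment` is a hypothetical helper):

```python
def render_fragment(raw: bytes) -> str:
    """Show decodable UTF-8 as text and stray bytes as <0xNN>."""
    # surrogateescape maps each undecodable byte 0xNN to the code point U+DCNN.
    text = raw.decode("utf-8", errors="surrogateescape")
    return "".join(
        f"<0x{ord(c) - 0xDC00:02X}>" if 0xDC80 <= ord(c) <= 0xDCFF else c
        for c in text
    )


# Full token -> number of leading bytes dropped to obtain the fragment.
examples = {"添加": 1, "获取": 1, "初始化": 2}
fragments = {
    word: render_fragment(word.encode("utf-8")[cut:])
    for word, cut in examples.items()
}
print(fragments)
```

For instance, 添 encodes to the bytes E6 B7 BB, so dropping the first byte of 添加 leaves B7 BB followed by 加.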

Byte tokens

13 entries below the indicator threshold of 0.017

token_id token indicator ord hex byte_type
181 <0xF9> -2.38419e-07 249 0xF9 unused_utf8
125 <0xC1> 0 193 0xC1 unused_utf8
183 <0xFB> 0 251 0xFB unused_utf8
180 <0xF8> 0 248 0xF8 unused_utf8
124 <0xC0> 1.19209e-07 192 0xC0 unused_utf8
187 <0xFF> 1.19209e-07 255 0xFF unused_utf8
186 <0xFE> 1.19209e-07 254 0xFE unused_utf8
179 <0xF7> 1.19209e-07 247 0xF7 unused_utf8
177 <0xF5> 1.19209e-07 245 0xF5 unused_utf8
178 <0xF6> 1.19209e-07 246 0xF6 unused_utf8
182 <0xFA> 1.78814e-07 250 0xFA unused_utf8
184 <0xFC> 1.78814e-07 252 0xFC unused_utf8
185 <0xFD> 2.98023e-07 253 0xFD unused_utf8
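
All 13 byte tokens in this table carry the type `unused_utf8`. The sketch below recomputes which byte values can never occur in well-formed UTF-8 by encoding every Unicode scalar value. It also checks an observation that is an assumption rather than something the report states: the token ids in the table agree with the byte ordering used by GPT-2 style byte-level BPE (printable bytes first, the rest afterwards). Both function names are hypothetical.

```python
def unused_utf8_bytes():
    """Byte values that appear in no well-formed UTF-8 encoding."""
    # Encode every code point except the surrogate range, which UTF-8 cannot encode.
    all_text = "".join(
        chr(cp) for cp in range(0x110000) if not 0xD800 <= cp <= 0xDFFF
    )
    used = set(all_text.encode("utf-8"))
    return sorted(set(range(256)) - used)


def gpt2_byte_order():
    """Byte values in GPT-2 byte-level BPE order: printable bytes, then the rest."""
    printable = list(range(33, 127)) + list(range(161, 173)) + list(range(174, 256))
    rest = [b for b in range(256) if b not in printable]
    return printable + rest


unused = unused_utf8_bytes()
rank = {b: i for i, b in enumerate(gpt2_byte_order())}
for b in unused:
    print(f"token_id {rank[b]:3d}  <0x{b:02X}>  ord {b}")
```

The result is 0xC0, 0xC1 and 0xF5 through 0xFF, which is the same set of 13 bytes listed in the table. Since these bytes cannot appear in valid UTF-8 text, their tokens would not be expected to receive training signal.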

Special tokens

18 entries below the indicator threshold of 0.017

token_id token indicator max_prob
100272 <|extra_id_7|> -1.19209e-07 7.8e-11
100260 <|fim_suffix|> -1.19209e-07 9.8e-11
100275 <|extra_id_10|> -1.19209e-07 2.9e-11
100271 <|extra_id_6|> 0 1.3e-09
100267 <|extra_id_2|> 0 1.7e-10
100266 <|extra_id_1|> 0 3e-11
100277 <|pad|> 0 2.9e-11
100256 <|extra_id_0|> 1.19209e-07 8.4e-11
100276 <|endofprompt|> 1.19209e-07 4.7e-11
100273 <|extra_id_8|> 1.19209e-07 5.3e-11
100274 <|extra_id_9|> 1.19209e-07 3e-11
100258 <|fim_prefix|> 1.19209e-07 1e-10
100259 <|fim_middle|> 1.19209e-07 7.8e-11
100265 <|im_end|> 1.19209e-07 1.1e-10
100268 <|extra_id_3|> 1.19209e-07 2.3e-09
100269 <|extra_id_4|> 1.19209e-07 2.1e-11
100270 <|extra_id_5|> 1.19209e-07 8e-11
100264 <|im_start|> 1.78814e-07 7.4e-10