Report for chuxin-llm/Chuxin-1.6B-Base

Model info

  • Model Info:
    • Tied embeddings: False
    • LM head uses bias: False
    • Embeddings shape: [102400, 2048]
  • Tokenizer Info:
    • Vocab Size: 100015
    • Tokenizer Class: LlamaTokenizerFast
    • Tokenizer Type: BPE
    • Bytes handling: Byte Input
    • Token for verification prompt building: IllegalArgumentException
    • Token id for verification prompt building: 91253
  • Indicator summary:
    • Indicator for under-trained tokens: E_{in} L2 Norm
    • Overall distribution: 0.597 +/- 0.128
  • Detected Token Counts:
    • Number of tested under-trained tokens: 1990 (1983 non-special), of which 886 are below the p = 0.01 threshold and 38 are below the soft indicator threshold
    • Number of single byte tokens: 256, of which 4 below indicator threshold
    • Number of special tokens: 32, of which 20 below indicator threshold
    • Number of non-single-byte unreachable tokens: 32, of which 20 below indicator threshold
    • Number of non-single-byte UTF-fragment tokens: 438, of which 0 below soft indicator threshold
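As a rough illustration of the indicator named above, the following sketch computes the L2 norm of each input-embedding row and lists the rows that fall under a threshold. It is a minimal sketch under stated assumptions, not the code that produced this report: the embedding values are made up, the helper names `l2_norms` and `below_threshold` are hypothetical, and reading "0.597 +/- 0.128" as mean and standard deviation is an assumption. The threshold 0.121 is taken from the report; how it was derived is not stated here.

```python
import math
import statistics

def l2_norms(embeddings):
    """Return the L2 norm of each embedding row."""
    return [math.sqrt(sum(x * x for x in row)) for row in embeddings]

def below_threshold(norms, threshold):
    """Return (token_id, norm) pairs under the threshold, lowest norm first."""
    hits = [(i, n) for i, n in enumerate(norms) if n < threshold]
    return sorted(hits, key=lambda pair: pair[1])

# Toy 4-token, 2-dimensional embedding matrix (made-up values, not model data).
toy_embeddings = [
    [0.6, 0.0],
    [0.0, 0.5],
    [0.03, 0.04],  # much smaller norm than the other rows
    [0.3, 0.4],
]

norms = l2_norms(toy_embeddings)
mean = statistics.mean(norms)   # assumed meaning of the first number in "a +/- b"
std = statistics.stdev(norms)   # assumed meaning of the second number (sample std)
candidates = below_threshold(norms, threshold=0.121)

for token_id, norm in candidates:
    print(token_id, round(norm, 4))
```

With a real model the rows would come from the input embedding matrix (shape [102400, 2048] for this model) rather than a hand-written list.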

Under-trained token indicators plot

[Figure: indicators scatter plots]

Verification plot

[Figure: verification plot]

Under-trained token verification results

38 entries below threshold of 0.121

token_id token indicator max_prob in_other_tokens
87662 日内与新浪看点 0.0218784 3.5e-05 日内与新浪看点联系
87661 不代表新浪看点 0.0236809 2.6e-08 不代表新浪看点观点或立场
97672 基督教基督教基督教 0.0301265 0.0011
91136 controlcap 0.0321056 3.3e-06
74777 orangehilldev 0.0355834 5.6e-05
16238 кедония 0.0391545 3.2e-06 ▁Македония, Македония
90785 посолство 0.04666 0.0019
13009 lemanya 0.0497387 3.3e-05 ▁alemanya, ▁Alemanya, Alemanya
99639 亿亿亿亿亿亿亿亿亿亿亿亿亿亿亿亿 0.0535675 0.099
81096 ▁EDIPU 0.0570441 1.2e-05 ▁EDIPUCRS
84405 RecordedVote 0.0587952 0.0064
59771 基督教基督教 0.0625054 0.00024 基督教基督教基督教
71563 亿亿亿次 0.063932 0.042
60623 odeciclismo 0.0639408 3e-05 ▁sitiodeciclismo, iodeciclismo
50113 memItem 0.0682783 0.11 memItemRight, memItemLeft
5758 ългар 0.0703906 9.3e-05 ▁българ, ▁българите, Българ, ▁българския, България, ...
70532 \xa0veg 0.075737 0.0011 \xa0vegades
97018 Supamiu 0.0764349 0.0016
78552 ▁Междусъюз 0.0769008 0.00012 ▁Междусъюзническата
49112 iberament 0.0797149 5.8e-05 Alliberament, alliberament
18 additional entries below threshold
token_id token indicator max_prob in_other_tokens
49293 ▁lampister 0.0844076 3.7e-05 ▁lampisteria, ▁lampisteries
58888 亿亿亿亿亿亿亿亿 0.0848685 0.027 亿亿亿亿亿亿亿亿亿亿亿亿亿亿亿亿
49918 magatzem 0.0849909 0.00047 emmagatzematge, emmagatzem, ▁emmagatzem
51244 ecesito 0.0872311 0.00013 ▁Necesito, Necesito
86826 солство 0.0923582 0.00054 посолство
9710 ▁espany 0.0926319 0.00025 ▁espanyola, ▁espanyol, ▁espanyols, ▁espanyoles
73129 жентина 0.0972762 0.0018 ▁Аржентина
85684 товче 0.10107 4.8e-05 ▁братовче
52246 ▁опъл 0.106657 0.00019 ▁опълчение, ▁опълчен, ▁опълченец
7388 точници 0.107464 4e-05 Източници, ▁източници
51641 мъния 0.112711 0.001 ▁Румъния
9630 ългария 0.114074 0.0019 България, ▁България
93494 atrals 0.114798 0.0085 ▁teatrals
41580 ▁експе 0.115079 7.1e-05 ▁експери, ▁експеди, ▁експедиция
74713 photonui 0.115498 0.92
90292 битава 0.116288 6.6e-05 Обитава
72767 elrte 0.118752 0.42
24543 wlwifi 0.12032 0.51 ▁iwlwifi, iwlwifi
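The in_other_tokens column appears to list vocabulary entries that contain the token as a substring, which helps explain why a fragment such as кедония is rarely produced on its own. The sketch below shows one way such a column could be computed. The function name `in_other_tokens`, the toy vocabulary, and the cut-off of five entries followed by "..." are assumptions inferred from the rows above, not the report's actual implementation.

```python
def in_other_tokens(token, vocab, limit=5):
    """List other vocabulary entries that contain `token` as a substring."""
    matches = [t for t in vocab if token in t and t != token]
    if len(matches) > limit:
        # Assumed display rule: show the first `limit` matches, then an ellipsis.
        return matches[:limit] + ["..."]
    return matches

# Toy vocabulary built from a few tokens that appear in the tables above.
toy_vocab = [
    "кедония", "▁Македония", "Македония",
    "посолство", "солство", "▁Аржентина", "жентина",
]

print(in_other_tokens("кедония", toy_vocab))    # ['▁Македония', 'Македония']
print(in_other_tokens("солство", toy_vocab))    # ['посолство']
print(in_other_tokens("посолство", toy_vocab))  # []
```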

Tokens with partial UTF-8 sequences

0 entries below threshold of 0.121

Byte tokens

4 entries below threshold of 0.257

token_id token indicator ord hex byte_type
185 \n 0.188525 10 0x0A ascii
11 , 0.233417 44 0x2C ascii
207 (space) 0.237499 32 0x20 ascii
13 . 0.252735 46 0x2E ascii

Special tokens

1 entry below threshold of 0.257

token_id token indicator max_prob
100000 ¿<|begin▁of▁sentence|>? 2.22104e-08 1.4e-05

Unreachable tokens

20 entries below threshold of 0.257

token_id token indicator reencoded
95036 ▁ö 2.15243e-08 207: , 100003: <0xF6>
48308 Á 2.15282e-08 100012: <0xC1>
53854 ▁ü 2.16154e-08 207: , 100014: <0xFC>
15302 ▁À 2.17436e-08 207: , 100010: <0xC0>
1612 ú 2.17913e-08 100004: <0xFA>
77883 çõ 2.18156e-08 1337: ç, 100006: <0xF5>
31186 üí 2.19835e-08 100014: <0xFC>, 656: í
60840 ▁þ 2.20105e-08 207: , 100013: <0xFE>
15776 üè 2.21308e-08 100014: <0xFC>, 724: è
27924 ▁Á 2.2146e-08 207: , 100012: <0xC1>
56124 û 2.22393e-08 100008: <0xFB>
25486 ø 2.22396e-08 100002: <0xF8>
45981 õ 2.22878e-08 100006: <0xF5>
7962 ö 2.23045e-08 100003: <0xF6>
6496 ▁ú 2.23787e-08 207: , 100004: <0xFA>
84896 þ 2.24477e-08 100013: <0xFE>
12759 À 2.24954e-08 100010: <0xC0>
52272 ù 2.25231e-08 100011: <0xF9>
5021 ý 2.25858e-08 100009: <0xFD>
2874 ü 2.28658e-08 100014: <0xFC>
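The reencoded column suggests a round-trip check: decode a single token id to text, encode that text again, and compare. A token whose text never encodes back to its own id is unreachable. The sketch below illustrates that idea with a hypothetical toy tokenizer (`TOY_VOCAB`, `TOY_MERGES`, `toy_encode`, and `toy_decode` are all invented for this example). It does not reproduce the model's LlamaTokenizerFast or the report's own code.

```python
def is_reachable(token_id, encode, decode):
    """A token is reachable if its decoded text encodes back to exactly itself."""
    return encode(decode([token_id])) == [token_id]

# Hypothetical toy tokenizer: ids 2 and 3 are multi-character tokens,
# but only "ab" has a merge rule, so "ba" can never be produced by encoding.
TOY_VOCAB = {0: "a", 1: "b", 2: "ab", 3: "ba"}
CHAR_IDS = {"a": 0, "b": 1}
TOY_MERGES = {(0, 1): 2}

def toy_decode(ids):
    """Concatenate the text of each token id."""
    return "".join(TOY_VOCAB[i] for i in ids)

def toy_encode(text):
    """Map characters to ids, then apply merges in one left-to-right pass."""
    ids = [CHAR_IDS[ch] for ch in text]
    merged = []
    i = 0
    while i < len(ids):
        pair = tuple(ids[i:i + 2])
        if pair in TOY_MERGES:
            merged.append(TOY_MERGES[pair])
            i += 2
        else:
            merged.append(ids[i])
            i += 1
    return merged

unreachable = [t for t in TOY_VOCAB if not is_reachable(t, toy_encode, toy_decode)]
reencoded = {t: toy_encode(toy_decode([t])) for t in unreachable}
print(unreachable)  # [3]
print(reencoded)    # {3: [1, 0]}
```

In the table above the same pattern shows up with real tokens: for example, id 2874 (ü) re-encodes to the byte-fallback token 100014 (<0xFC>) instead of itself.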