Question about Encoder Hash n-gram Embeddings #29

Open · MalchuL opened this issue Jan 19, 2025 · 2 comments

MalchuL commented Jan 19, 2025

The paper says the n-gram window size is selected from values 3 to 8, but in the code and in debug.yaml the size is simply 4, and the enumeration is instead over the prime numbers used for the different hash functions.

See https://github.com/facebookresearch/blt/blob/main/bytelatent/model/blt.py#L756 and the corresponding Section 3.2.1, Encoder Hash n-gram Embeddings, in the paper.

Params in debug.yaml:
encoder_hash_byte_group_nb_functions: 3
encoder_hash_byte_group_size: [4]

Am I right that, according to the paper, it should be:
encoder_hash_byte_group_nb_functions: 1
encoder_hash_byte_group_size: [3,4,5,6,7,8]

Is this a mistake because the correct config wasn't shared, or does it train better this way?
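
For context, here is a minimal sketch of how I currently read Section 3.2.1 (my own illustration, not the repo's code; PRIME, VOCAB, and DIM are placeholder values):

```python
# Sketch of hash n-gram embeddings per Section 3.2.1 (illustration only;
# PRIME/VOCAB/DIM are placeholders, not the values used in bytelatent).
import torch

PRIME = 1000000007  # placeholder prime for the polynomial rolling hash
VOCAB = 500_002     # hash-table size, per the paper's setting
DIM = 512           # placeholder embedding dimension

def ngram_hash(byte_seq, n):
    """Polynomial hash of the last n bytes, bucketed into VOCAB entries."""
    h = 0
    for b in byte_seq[-n:]:
        h = (h * PRIME + b) % VOCAB
    return h

# One embedding table per n-gram size, n = 3..8, with a single hash function.
tables = torch.nn.ModuleDict(
    {str(n): torch.nn.Embedding(VOCAB, DIM) for n in range(3, 9)}
)

def hash_ngram_embedding(bytes_so_far):
    """Sum of hashed n-gram embeddings added to the current byte's representation."""
    out = torch.zeros(DIM)
    for n in range(3, 9):
        if len(bytes_so_far) >= n:
            idx = torch.tensor(ngram_hash(bytes_so_far, n))
            out = out + tables[str(n)](idx)
    return out

print(hash_ngram_embedding([72, 101, 108, 108, 111]).shape)  # torch.Size([512])
```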

Vectorrent (Contributor) commented Jan 20, 2025

I recently had this question as well. The paper seems clear to me:

"We use 500,000 hashes with a single hash function, with n-gram sizes ranging from 3 to 8, for all BLT models."

Indeed, my initial tests with encoder_hash_byte_group_nb_functions = 3 were terrible. Only after reducing this value to 1 did performance (and speed) improve.

I am curious: how do we arrive at those 500k hashes from the paper? Take, for example:

encoder_hash_byte_group_nb_functions=3,
encoder_hash_byte_group_size=[3, 4, 5],
encoder_hash_byte_group_vocab=100_000,

Is the correct formula like this?

encoder_hash_byte_group_nb_functions * len(encoder_hash_byte_group_size) * encoder_hash_byte_group_vocab = 3 * 3 * 100_000 = 900_000

What does the encoder_hash_byte_group_size part do? Is this how we define the allowed range of patch sizes?

EntilZha (Contributor) commented

The debug config doesn't match the paper since it's just a debug config. We're still working on making the OSS code more complete and able to reproduce training, which includes providing configs that correspond to the paper, but it's a WIP (e.g., I'm currently working on a script for entropy-model training). For now, I'd say the paper is definitive.

That said, you are correct about the settings: encoder_hash_byte_group_nb_functions=1 and encoder_hash_byte_group_size=[3,4,5,6,7,8]. We also set encoder_hash_byte_group_vocab to 500,002 to make up for the offset values.

@Vectorrent as for the vocab sizes, we did hyperparameter sweeps for both hash n-grams and lookup n-grams (each alone and in combination with the other), sweeping from 100K to 500K (much above this and you start OOMing, certainly by 1M). I believe we swept every 50K or 100K in values. I'm not exactly certain what you're asking, but I think you're asking how many total embeddings there are? If so, then yes, it is number of hash functions * number of group sizes * the vocab size. E.g., for your example, there should be 3 * 3 * 100K = 900K embedding entries.
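
Spelled out (a quick back-of-the-envelope check; the variable names just mirror the config keys):

```python
# Total hash-embedding entries = hash functions * n-gram group sizes * vocab.
nb_functions = 3
group_sizes = [3, 4, 5]
vocab = 100_000
print(nb_functions * len(group_sizes) * vocab)  # 900000

# Paper-style setting: 1 function, sizes 3..8, vocab 500_002.
print(1 * len(range(3, 9)) * 500_002)  # 3000012 entries across all sizes
```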

The original idea behind using multiple hash functions was to better handle cases where two patches hash to the same value. Although that is possible, it would be very unlikely for two patches to hash to the same value twice (or more) under two different hash functions. It seems that didn't turn out to help, hence it being set to 1. My hypothesis is that since, say, a patch of length 3 is encompassed in the patch of length 4, even if there is a collision for the hash of one of those patches, it's unlikely that the length-3 and the length-4 patches both collide, so in the end the model still gets the "correct"/non-collision patch representation.
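
To put rough numbers on that (an idealized back-of-the-envelope, assuming uniform and independent hash functions):

```python
# A single collision has probability ~1/vocab; colliding under a second,
# independent hash function as well roughly squares that.
vocab = 500_002
p_single = 1 / vocab        # two distinct n-grams share one hash bucket
p_double = p_single ** 2    # ...and also share the second function's bucket
print(f"single: {p_single:.1e}, double: {p_double:.1e}")
# single: 2.0e-06, double: 4.0e-12
```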
