Question about Encoder Hash n-gram Embeddings #29

Open · MalchuL opened this issue Jan 19, 2025 · 2 comments

MalchuL commented Jan 19, 2025

The paper says the n-gram window size is selected from values 3 to 8, but in the code and in debug.yaml the size is simply 4, and the enumeration is instead over the prime numbers used for the different hash functions.

See https://github.com/facebookresearch/blt/blob/main/bytelatent/model/blt.py#L756 and the corresponding Section 3.2.1, Encoder Hash n-gram Embeddings, in the paper.

Params in debug.yaml:
encoder_hash_byte_group_nb_functions: 3
encoder_hash_byte_group_size: [4]

Am I right that, according to the paper, it should be:
encoder_hash_byte_group_nb_functions: 1
encoder_hash_byte_group_size: [3,4,5,6,7,8]

Is this a mistake because the correct config wasn't shared, or does it train better this way?
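
For context, here is a minimal sketch of how I currently read Section 3.2.1 (my own illustration, not the repo's code; PRIME, VOCAB, and DIM are placeholder values):

```python
# Sketch of hash n-gram embeddings per Section 3.2.1 (illustration only;
# PRIME/VOCAB/DIM are placeholders, not the values used in bytelatent).
import torch

PRIME = 1000000007  # placeholder prime for the polynomial rolling hash
VOCAB = 500_002     # hash-table size, per the paper's setting
DIM = 512           # placeholder embedding dimension

def ngram_hash(byte_seq, n):
    """Polynomial hash of the last n bytes, bucketed into VOCAB entries."""
    h = 0
    for b in byte_seq[-n:]:
        h = (h * PRIME + b) % VOCAB
    return h

# One embedding table per n-gram size, n = 3..8, with a single hash function.
tables = torch.nn.ModuleDict(
    {str(n): torch.nn.Embedding(VOCAB, DIM) for n in range(3, 9)}
)

def hash_ngram_embedding(bytes_so_far):
    """Sum of hashed n-gram embeddings added to the current byte's representation."""
    out = torch.zeros(DIM)
    for n in range(3, 9):
        if len(bytes_so_far) >= n:
            idx = torch.tensor(ngram_hash(bytes_so_far, n))
            out = out + tables[str(n)](idx)
    return out

print(hash_ngram_embedding([72, 101, 108, 108, 111]).shape)  # torch.Size([512])
```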

Vectorrent (Contributor) commented Jan 20, 2025

I recently had this question as well. The paper seems clear to me:

"We use 500,000 hashes with a single hash function, with n-gram sizes ranging from 3 to 8, for all BLT models."

Indeed, my initial tests with encoder_hash_byte_group_nb_functions = 3 were terrible. Only after reducing this value to 1 did performance (and speed) improve.

I am curious: how do we arrive at those 500k hashes from the paper? Take, for example:

encoder_hash_byte_group_nb_functions=3,
encoder_hash_byte_group_size=[3, 4, 5],
encoder_hash_byte_group_vocab=100_000,

Is the correct formula like this?

encoder_hash_byte_group_nb_functions * len(encoder_hash_byte_group_size) * encoder_hash_byte_group_vocab = 3 * 3 * 100_000 = 900_000

What does the encoder_hash_byte_group_size part do? Is this how we define the allowed range of patch sizes?

EntilZha (Contributor) commented

The debug config doesn't match the paper since it's just a debug config. We're still working on making the OSS code more complete and able to reproduce training, which includes providing configs that correspond to the paper, but it's a WIP (e.g., I'm currently working on a script for entropy-model training). For now, I'd say the paper is definitive.

That said, you are correct about the settings: encoder_hash_byte_group_nb_functions=1 and encoder_hash_byte_group_size=[3,4,5,6,7,8]. We also set encoder_hash_byte_group_vocab to 500,002 to make up for the offset values.

@Vectorrent as for the vocab sizes, we did hyperparameter sweeps for both hash n-grams and lookup n-grams (each alone and in combination with the other), sweeping from 100K to 500K (much above this and you start OOMing, certainly by 1M). I believe we swept every 50K or 100K in values. I'm not exactly certain what you're asking, but I think you're asking how many total embeddings there are? If so, then yes, it is number of hash functions * number of group sizes * the vocab size. E.g., for your example, there should be 3 * 3 * 100K = 900K embedding entries.
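
Spelled out (a quick back-of-the-envelope check; the variable names just mirror the config keys):

```python
# Total hash-embedding entries = hash functions * n-gram group sizes * vocab.
nb_functions = 3
group_sizes = [3, 4, 5]
vocab = 100_000
print(nb_functions * len(group_sizes) * vocab)  # 900000

# Paper-style setting: 1 function, sizes 3..8, vocab 500_002.
print(1 * len(range(3, 9)) * 500_002)  # 3000012 entries across all sizes
```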

The original idea behind using multiple hash functions was to better handle cases where two patches hash to the same value. Although that is possible, it would be very unlikely for two patches to hash to the same value twice (or more) under two different hash functions. It seems that didn't turn out to help, hence it being set to 1. My hypothesis is that since, say, a patch of length 3 is encompassed in the patch of length 4, even if there is a collision for the hash of one of those patches, it's unlikely that the length-3 and the length-4 patches both collide, so in the end the model still gets the "correct"/non-collision patch representation.
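
To put rough numbers on that (an idealized back-of-the-envelope, assuming uniform and independent hash functions):

```python
# A single collision has probability ~1/vocab; colliding under a second,
# independent hash function as well roughly squares that.
vocab = 500_002
p_single = 1 / vocab        # two distinct n-grams share one hash bucket
p_double = p_single ** 2    # ...and also share the second function's bucket
print(f"single: {p_single:.1e}, double: {p_double:.1e}")
# single: 2.0e-06, double: 4.0e-12
```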
