Question about Encoder Hash n-gram Embeddings #29
I recently had this question as well. The paper seems clear to me: "We use 500,000 hashes with a single hash function, with n-gram sizes ranging from 3 to 8, for all BLT models." Indeed, my initial tests with […]

I am curious: how do we calculate those 500k hashes from the paper? Take for example:

encoder_hash_byte_group_nb_functions=3,
encoder_hash_byte_group_size=[3, 4, 5],
encoder_hash_byte_group_vocab=100_000

Is the correct formula like this?

encoder_hash_byte_group_nb_functions * len(encoder_hash_byte_group_size) * encoder_hash_byte_group_vocab = 3 * 3 * 100_000 = 900_000

What does the […]
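A quick way to check that count in plain Python (variable names mirror the config keys above):

```python
encoder_hash_byte_group_nb_functions = 3
encoder_hash_byte_group_size = [3, 4, 5]
encoder_hash_byte_group_vocab = 100_000

# One embedding table per (hash function, n-gram size) pair,
# each with `vocab` rows.
total_embeddings = (
    encoder_hash_byte_group_nb_functions
    * len(encoder_hash_byte_group_size)
    * encoder_hash_byte_group_vocab
)
print(total_embeddings)  # 900000
```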
The debug config doesn't match the paper since it's just a debug config. We're still working on making the OSS code more complete and able to reproduce training, which includes providing configs that correspond to the paper, but it's a WIP (e.g., I'm currently working on a script for entropy model training). For now, I'd say that the paper is definitive. That said, you are correct about the settings.

@Vectorrent, as for the vocab sizes, we did hyperparameter sweeps for both hash n-grams and lookup n-grams (each alone and in combination with each other), sweeping from 100K to 500K (much above this and you start OOMing, certainly by 1M). I believe we swept in steps of 50K or 100K.

I'm not exactly certain what you're asking, but I think you're asking how many total embeddings there are. If so, then yes, it is the number of hash functions * the number of group sizes * the vocab size. E.g., for your example, there should be 3 * 3 * 100K = 900K embedding entries.

The original idea behind using multiple hash functions was to better handle cases where two n-grams hash to the same value: while a single collision is possible, it would be very unlikely for two n-grams to collide under two (or more) different hash functions. That didn't turn out to help, hence the setting of 1. My hypothesis is that since, say, an n-gram of length 3 is encompassed in the n-gram of length 4, even if the hash of one of them collides, it's unlikely that both the length-3 and the length-4 n-grams collide, so in the end the model still gets a "correct"/collision-free representation.
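A minimal sketch of that bookkeeping (illustrative only, not the repo's code; Python's builtin `hash` stands in for the real rolling hash, and `DIM` is a made-up embedding dimension):

```python
import torch
import torch.nn as nn

N_FUNCS = 3        # encoder_hash_byte_group_nb_functions
SIZES = [3, 4, 5]  # encoder_hash_byte_group_size
VOCAB = 100_000    # encoder_hash_byte_group_vocab
DIM = 16           # hypothetical embedding dimension, for the demo only

# One table per (hash function, n-gram size) pair:
# N_FUNCS * len(SIZES) * VOCAB = 3 * 3 * 100_000 = 900_000 rows in total.
tables = nn.ModuleDict({
    f"f{f}_n{n}": nn.Embedding(VOCAB, DIM) for f in range(N_FUNCS) for n in SIZES
})

def hash_ngram_embedding(byte_seq: bytes, pos: int) -> torch.Tensor:
    # Sum the embeddings of every n-gram ending at byte position `pos`,
    # across all sizes and all hash functions.
    out = torch.zeros(DIM)
    for f in range(N_FUNCS):
        for n in SIZES:
            if pos + 1 >= n:
                gram = byte_seq[pos + 1 - n : pos + 1]
                # Builtin `hash` as a stand-in for the rolling polynomial hash.
                idx = hash((f, gram)) % VOCAB
                out = out + tables[f"f{f}_n{n}"](torch.tensor(idx))
    return out

vec = hash_ngram_embedding(b"byte latent transformer", pos=10)  # shape: (DIM,)
```

Note how the length-3 gram ending at `pos` is a suffix of the length-4 gram at the same position: under the summation, a collision in any single table only perturbs one term, which is the hypothesis above.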
The paper says the n-gram sizes are selected from 3 to 8, but in the code and in debug.yaml there is simply 4, and the hash functions are enumerated by prime numbers.

https://github.com/facebookresearch/blt/blob/main/bytelatent/model/blt.py#L756 and the corresponding Section 3.2.1, Encoder Hash n-gram Embeddings, in the paper.
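My reading of the prime-indexed scheme is roughly the following (a sketch under that assumption; the prime values here are placeholders, not the ones in the repo):

```python
# Sketch of a prime-indexed rolling polynomial hash (placeholder primes).
PRIMES = [1000000007, 5915587277, 1500450271]

def rolling_polynomial_hash(ngram: bytes, hash_func_nb: int, vocab: int) -> int:
    prime = PRIMES[hash_func_nb]  # each hash-function index selects a different prime
    h = 0
    for b in ngram:
        h = (h * prime + b) % vocab  # fold into the table size as we go
    return h

idx = rolling_polynomial_hash(b"abcd", hash_func_nb=0, vocab=500_000)
```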
Params in debug.yaml:
encoder_hash_byte_group_nb_functions: 3
encoder_hash_byte_group_size: [4]
Am I right that, according to the paper, it should be:
encoder_hash_byte_group_nb_functions: 1
encoder_hash_byte_group_size: [3,4,5,6,7,8]
Is this a mistake because you did not share the correct config, or does it train better this way?