
Support for a mask during autoregressive generation with Key-Value Caching #292

Open
Oufattole opened this issue Nov 12, 2024 · 10 comments

@Oufattole

Why isn't a mask supported when key-value caching is enabled here?

@lucidrains
Owner

@Oufattole in what scenario would you need masking when doing autoregressive decoding?

@Oufattole
Author

Oufattole commented Nov 12, 2024

I'm trying to do sliding-window inference, but my initial prompts have different lengths, so I think I should mask out the padding, since that's what we do during autoregressive pretraining.

I'm applying transformers to medical trajectories as part of this open-source project, which provides ML tooling for modeling patient time-series data (you tokenize a patient's irregularly sampled time-series observations, such as medications, diagnoses, procedures, etc.). I'm interested in generating future trajectories and evaluating them. Here is the relevant code I'm currently using to generate trajectories. Right now I simply don't cache key-value pairs so that I can apply masks, but that is prohibitively slow.
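
For concreteness, a minimal sketch of the slow path described above -- a plain decode loop that re-runs the whole padded sequence with a mask at every step instead of caching key/values. `model` and `pad_id` are placeholders here, not the project's actual code:

```python
import torch

@torch.no_grad()
def generate_no_cache(model, prompts, pad_id, num_steps):
    # prompts: (batch, prompt_len), padded with pad_id where prompt lengths differ
    seq = prompts
    for _ in range(num_steps):
        mask = seq != pad_id                        # padding mask, recomputed every step
        logits = model(seq, mask = mask)            # full forward pass each step, no kv cache
        next_tok = logits[:, -1].argmax(dim = -1, keepdim = True)
        seq = torch.cat((seq, next_tok), dim = -1)
    return seq
```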

@lucidrains
Owner

lucidrains commented Nov 12, 2024

@Oufattole yes I see, so you are off the beaten path

sliding windows aren't supported here yet

@lucidrains
Owner

@Oufattole you can do away with masking by slicing the cached key values before passing them back in
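
In other words, something like the following rough sketch, assuming a cache laid out as per-layer `(key, value)` tensors of shape `(batch, heads, seq, dim_head)` -- the library's actual cache object may differ:

```python
def drop_padded_prefix(cache, num_pad):
    # with left padding (a single sequence, or uniform padding), the first num_pad cached
    # positions correspond to pad tokens; slicing them off means attention never
    # sees them, so no explicit mask is needed when the cache is passed back in
    return [(k[..., num_pad:, :], v[..., num_pad:, :]) for k, v in cache]
```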

@Oufattole
Author

Ahhh I see, thank you, I'll try that! With medical data, unlike in NLP and CV, many patient trajectories are very short and you don't need a long sequence length at all. For example, in my dataset 80% of patients fall below the 512 max sequence length, but a small subset push past 30k (and that's after extreme reductions in the vocabulary -- i.e. which time-series variables we model -- before which some of these patients hit over 300k).

I'm naively trying to use sliding windows, but if there's a better approach you'd recommend for handling such extreme variation in sequence length, I'd be happy to try it.

@Oufattole
Author

Wait, actually, I think you do support masking the left-padded tokens with the seq_start_pos arg here, @lucidrains.
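
For reference, the mask that `seq_start_pos` amounts to can be sketched outside the library like this (the library builds an equivalent mask internally; this only illustrates the semantics):

```python
import torch

def mask_from_seq_start_pos(seq_start_pos, seq_len):
    # seq_start_pos: (batch,) index of each sequence's first real (non-pad) token
    positions = torch.arange(seq_len)                        # (seq_len,)
    return positions[None, :] >= seq_start_pos[:, None]      # (batch, seq_len), True = attend

mask = mask_from_seq_start_pos(torch.tensor([100, 0]), seq_len = 512)
```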

@lucidrains
Owner

lucidrains commented Nov 13, 2024

@Oufattole so that hyperparameter was actually built for variable prompt lengths iirc. i'll have to take a closer look to really know if it can be repurposed for what you are doing

during sliding-window decoding, you'll have to slice the cached key values as you decode past the window length
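
A rough sketch of that slicing, assuming a generic decode loop and a cache laid out as per-layer `(key, value)` tensors of shape `(batch, heads, seq, dim_head)` -- not the library's actual generate method:

```python
import torch

@torch.no_grad()
def sliding_window_decode(model, prompt, window_len, num_steps):
    out, cache = prompt, None
    for _ in range(num_steps):
        inp = out if cache is None else out[:, -1:]           # only the new token once cached
        logits, cache = model(inp, cache = cache)             # assumed to return the updated cache
        next_tok = logits[:, -1].argmax(dim = -1, keepdim = True)
        out = torch.cat((out, next_tok), dim = -1)

        # drop the oldest cached positions so attention only ever sees the last window_len tokens
        cache = [(k[..., -window_len:, :], v[..., -window_len:, :]) for k, v in cache]
    return out
```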

@lucidrains
Owner

@Oufattole what specialty is this and what exactly are you trending in the EMR that hits 300k in length?

@Oufattole
Author

Oufattole commented Nov 13, 2024

Yes, I think you already do this kv-cache slicing during generation here when restricting to the max_seq_length (i.e. in the sliding window setting). Am I correct about this?

I'll send you an email regarding the broader EHR modeling question, which I realize may be out of scope for this GitHub issue.

@lucidrains
Owner

@Oufattole it has been a while, let me review it tomorrow morning and see if it can be made to work for your issue
