
feat: Add support for jinja based template rendering of the dataset #438

Open · wants to merge 2 commits into main
Conversation

Abhishek-TAMU
Collaborator

@Abhishek-TAMU Abhishek-TAMU commented Jan 15, 2025

Description of the change

Added a handler `apply_custom_data_formatting_jinja_template` that performs Jinja-based template rendering of the dataset.

Handling of edge case:
Example template: "### Input: {{Tweet text}} \n\n ### Response: {{text_label}}"
By default, Jinja2 does not support placeholder variable names that contain spaces (e.g., {{Tweet text}}) and raises an error for them.

Hence an additional preprocessing step (function: `transform_placeholders`) checks whether a placeholder variable name contains a space and, if so, rewrites the placeholder accordingly (e.g., to {{element["Tweet text"]}}); see the sketch below.
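
For reference, a minimal sketch of what such a preprocessing step could look like. This is only illustrative; the regex and the `element` context-variable name are assumptions, not necessarily the exact implementation in this PR:

```python
import re


def transform_placeholders(template: str) -> str:
    """Rewrite Jinja placeholders whose variable names contain spaces.

    "{{Tweet text}}" is not valid Jinja2 syntax, so it is rewritten to
    '{{ element["Tweet text"] }}', which renders once the dataset row is
    exposed to the template as ``element``.
    """
    pattern = re.compile(r"{{\s*(.+?)\s*}}")

    def _rewrite(match: re.Match) -> str:
        name = match.group(1)
        # Only rewrite names with spaces that are not already element[...] lookups.
        if " " in name and not name.startswith("element["):
            return '{{ element["' + name + '"] }}'
        return match.group(0)

    return pattern.sub(_rewrite, template)
```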

Related issue number

Issue: https://github.ibm.com/ai-foundation/watson-fm-stack-tracker/issues/1470

How to verify the PR

Verify added test cases.

Was the PR tested

  • I have added >=1 unit test(s) for every new method I have added.
  • I have ensured all unit tests pass


Thanks for making a pull request! 😃
One of the maintainers will review and advise on the next steps.

@github-actions github-actions bot added the feat label Jan 15, 2025
@Abhishek-TAMU Abhishek-TAMU marked this pull request as ready for review January 21, 2025 14:33
    except Exception as e:
        raise KeyError(f"Dataset does not contain field in template. {e}") from e

    rendered_text += tokenizer.eos_token
Collaborator Author

1. @ashokponkumar Wanted to confirm the removal of eos_token from dataset samples in this handler. In other handlers we append eos_token and don't expect users to add it. So in this handler, where the user passes a Jinja template, are we expecting the user to include eos_token in the template as well? For a non-pretokenized dataset, omitting eos_token when using DataCollatorForCompletionOnlyLM might affect the F1 score of tuned models.

2. @dushyantbehl How could Jinja templating be used with a pre-tokenized dataset (one that has input_ids and labels as columns)?

Contributor

  1. I think we need proper documentation for now, plus a patch that lets users choose whether they want an eos_token with the data handlers via one argument, e.g. a kwarg on the data handlers like add_eos_token. That way they can choose what they want inside a data config.
     For a data config we should not assume what users want to do.
     For our data args we can append the eos_token inside our code at the last data handler, whichever we choose, so that our data args use cases remain the same.

     If you're up for it, can you take this up in this patch and add the eos_token kwarg to clean up the interface with users? Otherwise we can park it for a follow-up patch. (A rough sketch of what such a kwarg could look like is shown after this comment.)

  2. For pre-tokenized datasets we can ignore the Jinja template; IMO this handler should only be applied to non-tokenized datasets.
     We can add all of this to the documentation, and I request that you please add documentation with this patch.
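
For illustration only, a minimal sketch of how an add_eos_token kwarg might look on this handler, assuming the signature and rendering flow suggested by the diff above. The parameter names, defaults, and error message here are assumptions, not the PR's final interface:

```python
from jinja2 import Environment, StrictUndefined


def apply_custom_data_formatting_jinja_template(
    element: dict,
    tokenizer,
    dataset_text_field: str,
    template: str,
    add_eos_token: bool = True,  # hypothetical kwarg discussed above
    **kwargs,
) -> dict:
    # Render the (pre-transformed) template against the dataset row; the row
    # is also exposed as `element` so placeholders like element["Tweet text"] work.
    jinja_template = Environment(undefined=StrictUndefined).from_string(template)
    try:
        rendered_text = jinja_template.render({**element, "element": element})
    except Exception as e:
        raise KeyError(f"Dataset does not contain field in template. {e}") from e

    # Let users opt out of the EOS token, e.g. when their template already adds it.
    if add_eos_token:
        rendered_text += tokenizer.eos_token
    return {dataset_text_field: rendered_text}
```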

    return {dataset_text_field: rendered_text}


def transform_placeholders(template: str) -> str:
Collaborator Author

@dushyantbehl @ashokponkumar Are we handling the nested dataset use case as well? Every other handler I see expects the dataset element to be Dict[str, str], not Dict[str, Dict].

Contributor

I think we were only handling non-nested datasets, apart from chat templates. Can we test with this patch whether our code works for nested datasets, and if it does, change the argument type here accordingly? (An example of the nested shape being discussed is sketched below.)
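
For concreteness, a hypothetical nested dataset row and a Jinja template that reaches into it; this only illustrates the Dict[str, Dict] shape in question and is not code from the PR:

```python
from jinja2 import Environment, StrictUndefined

# Hypothetical Dict[str, Dict] row, i.e. a nested dataset element.
element = {
    "input": {"Tweet text": "the product arrived broken"},
    "output": {"text_label": "complaint"},
}

# A template that indexes into the nested structure via `element`.
template = (
    '### Input: {{ element["input"]["Tweet text"] }}\n\n'
    '### Response: {{ element["output"]["text_label"] }}'
)

rendered = (
    Environment(undefined=StrictUndefined)
    .from_string(template)
    .render(element=element)
)
print(rendered)
# ### Input: the product arrived broken
#
# ### Response: complaint
```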

Contributor

@dushyantbehl dushyantbehl Jan 28, 2025

Also, can you move this to utils as we discussed on our last call? Thanks.

template = "### Input: {{not found}} \n\n ### Response: {{text_label}}"
formatted_dataset_field = "formatted_data_field"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
with pytest.raises((KeyError, TemplateSyntaxError)):
Contributor

Can we catch this error inside our code and surface a simpler, user-friendly error message? (One possible shape is sketched below.)
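
As a rough sketch only, one way the rendering could be wrapped so that Jinja failures become short, readable errors; the wrapper name and exact messages are assumptions, not the PR's implementation:

```python
from jinja2 import Environment, StrictUndefined
from jinja2.exceptions import TemplateSyntaxError, UndefinedError


def render_template_or_raise(template: str, element: dict) -> str:
    """Render a Jinja template, converting Jinja failures into plain-text errors."""
    try:
        jinja_template = Environment(undefined=StrictUndefined).from_string(template)
    except TemplateSyntaxError as e:
        raise ValueError(f"Invalid Jinja template: {e}") from e

    try:
        return jinja_template.render({**element, "element": element})
    except (UndefinedError, KeyError) as e:
        raise KeyError(f"Dataset does not contain field used in the template: {e}") from e
```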
