Issues: apartresearch/Interpreting-Learned-Feedback-Patterns

7 Open 0 Closed

Issues list

Enhance: Support training on the difference of activations between base and reward models.

#15 opened Apr 7, 2024 by amirabdullah19852020

More data: Create two datasets optimized for VADER, and upload to datasets [data]

#14 opened Apr 7, 2024 by amirabdullah19852020

More reward models: Train a DPO model and autoencoders for unalignment dataset. [model]

#13 opened Apr 7, 2024 by amirabdullah19852020

More reward models: Create contrastive pairs dataset for unalignment dataset. [data]

#12 opened Apr 7, 2024 by amirabdullah19852020

More reward models: Create a contrastive pairs dataset for helpful / harmless data

#11 opened Apr 7, 2024 by amirabdullah19852020

More reward models: Train a DPO model and autoencoders for hh-rlhf [model]

#10 opened Apr 7, 2024 by amirabdullah19852020

Enhance: Integrate with SAELens for the sparse autoencoder training. [enhancement]

#9 opened Apr 7, 2024 by amirabdullah19852020
