Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Get reformulator working #7

Open
wants to merge 10 commits into
base: public
Choose a base branch
from

Conversation

Grazfather
Copy link
Contributor

This is my attempt to clean up the readme and code to make sense.

Has anyone gotten this to work? There are a bunch of problem, which I've tried to address, but have a few questions.

If you're willing/able to point me in the right direction, I can fix the code and update the readme.

Main issues:

  • Creating the loggers is broken (fixed)
  • It expects a database, but the schema isn't well defined. Did you start with a db? Maybe a method should be added to create an empty database.
  • It probably should not sync the database when things don't work, that's a good way to trash your database.
  • I think it expects custom note types with specific fields? If this is true, it should be documented.

Copy link
Owner

@thiswillbeyourgithub thiswillbeyourgithub left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thank you very much for taking the time. This really means a lot to me!

This is my attempt to clean up the readme and code to make sense.

Has anyone gotten this to work? There are a bunch of problem, which I've tried to address, but have a few questions.

As I said I might have broken a few things here and there during the publishing rush. Sorry about that. In general the illustrator should be the most recent code and have the least technical debt I believe.

If you're willing/able to point me in the right direction, I can fix the code and update the readme.

That's really great to hear and the help I was desperately looking for! I can find the time to point you in the good direction for sure but I'm stretched very thin otherwise so rather not code most things myself. You are a godsend :)

I did a review of your PR and left a few comments.

I just noticed that unfortunately my TODO parser I'm using to sync my README with logseq was missing quite a few bulletpoints so the outline in the roadmap was not clear. I fixed that, so now what do you think about the way forward to sensify the code? I believe the main thing is that those scripts should instead inherit from a common class and share many methods. That would make it both easier to add more tools if ever needed and more urgently to understand / maintain / unclutter and in the end make usable.

reformulator.py Outdated Show resolved Hide resolved
reformulator.py Outdated Show resolved Hide resolved
reformulator.py Outdated Show resolved Hide resolved
reformulator.py Outdated Show resolved Hide resolved
reformulator.py Show resolved Hide resolved
utils/llm.py Outdated Show resolved Hide resolved
utils/logger.py Show resolved Hide resolved
@Grazfather
Copy link
Contributor Author

Why is it that the example dataset uses clozes? Is that how you make all of yours cards? I found that the reformulator (using my cards, your dataset) will turn them into clozes, but that doesn't work unless the note type is cloze.

reformulator.py Outdated Show resolved Hide resolved
except AssertionError as e:
red(e)
except Exception as e:
red(e)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you sure about this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a personal thing, we can change it.

Basically, I make it so that if we fail an assert, it's probably an issue with the invocation, and we log it. If it's another type of exception, then we probably don't expect it, so we log it, but we also re-raise the exception to print the stack trace. I can remove it, but it's helping me with debugging.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Whatever helps you helps all of us here so no biggy

utils/llm.py Outdated Show resolved Hide resolved
@thiswillbeyourgithub
Copy link
Owner

Why is it that the example dataset uses clozes? Is that how you make all of yours cards? I found that the reformulator (using my cards, your dataset) will turn them into clozes, but that doesn't work unless the note type is cloze.

See my reply here

@thiswillbeyourgithub
Copy link
Owner

Also note that my datasets were intially written in french and I used claude sonnet 3.5 to translate them to english hastily. I have not had the time to proofread them but the idea should be there still.

@thiswillbeyourgithub
Copy link
Owner

I think a good idea would be to use platformdirs to find a good location for the dbs, ideally with an abilitty to overload it using an env var

@Grazfather Grazfather changed the title wip: Make sense of everything Get reformulator working Jan 9, 2025
@@ -670,6 +678,7 @@ def apply_reformulate(self, log: Dict) -> None:
nid,
fields={
self.field_name: log["note_field_formattednewcontent"],
# TODO: Might be nice to not require this

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Iirc it was necessary to be rock solid sure that we can rollback easily. But yeah maybe we could just store the previous version and use only the db to handle rollbacks?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To clarify: here I was refering to the fact that where have many strings lile "note_field_*" and would be nice if we didn't

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not entirely sure of the purpose of the db. Does it reflect changes that haven't been committed, or the latest version? having the db is nice for persistence, but it also opens up to some weird state where the notes don't match the database, so generally I prefer having a single source of truth.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The purpose of the database is only to act as a kind of very reliable and easy to parse logfile. It does not matter if there are inconsistencies because, for example if the user modified itself since the last time they ran the script. But it ensures complete reliability because it allows to roll back. To me, working with LLM's, it's very important to be able to rollback. Say in six months there is a shiny new LLM that is very cheap and possibly very good. Well, there is only one way to find out if it's good enough to handle real world notes. And then it's only after a few hundred reviews that you can actually judge if it does not make some weird edge case mistakes.

I don't know, just to give an example, at some point I realized that some of my cards related to hours were wrongly parsed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool, yeah I see that it's only a log and is outside of anything with the anki reformulator field.

utils/llm.py Outdated Show resolved Hide resolved
self.db_content = self.load_db()
if not self.db_content:
red("Empty database. If you have already ran anki_reformulator "
"before then something went wrong!")
whi("Trying to create a new database")
whi("Creating a empty database")

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo, should be "an"

if buffer:
try:
_ = rtoml.loads("".join(buffer + [line]))
# TODO: What are you trying to do here? Just check that adding the line keeps valid toml?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think so. Irrc the thing with toml was to have a human readable way to see what happened using the addon. As a lot of log is packed into it I also added code to try rolling back if --reset was used and in case the db failed to recover. I can be fine with only storing data to the db, but also with storing all metadata of all scripts into a single field.

Btw rtoml was better in some aspects than toml but I remember a bit having much trouble in some situation (lile dumping then loading resulting in different values especially when None are involved but can't remember more specifically.)

dictionary = json.loads(zlib.decompress(row[0]))
dictionaries.append(dictionary)
return dictionaries
# TODO: Why do you compress? This just makes it more difficult to debug
Copy link
Owner

@thiswillbeyourgithub thiswillbeyourgithub Jan 15, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Intially I just dumped json but the size got out of hand surprisingly quickly so I compressed it with zlib and found out I might as well use sqlite.

This was totally amateurish, if I had to do it again I would use sqlite only and enable the built in compressions of course. But still being technically an amateur I'm open to any suggestion of course

@thiswillbeyourgithub
Copy link
Owner

I think a good idea would be to use platformdirs to find a good location for the dbs, ideally with an abilitty to overload it using an env var

This would also make it easy to switch to storing the db into the addon package in anki. Well actually I don't know if that's the best way, I have faint memories of some addon storing their own data in their own table of the anki db file. That might be fine if text but storing media like images seems unreasonnably costly to the ankidev.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants