How do I only pull messages matching a label? #236

Open
listx opened this issue Mar 29, 2023 · 14 comments
Comments

@listx

listx commented Mar 29, 2023

I want to use lieer to sync only those messages that Gmail has already filtered into a particular label. E.g., I have a git label, which is the only group of emails I'm interested in syncing with lieer. Is this feasible?

@gauteh
Owner

gauteh commented Mar 29, 2023

Not at the moment, but something could be built by changing the query at https://github.com/gauteh/lieer/blob/master/lieer/remote.py#L85 and probably the query at https://github.com/gauteh/lieer/blob/master/lieer/gmailieer.py#L333 too. But there's a danger that gmailieer somewhere will think that all those messages that don't match the query are deleted locally. Maybe that won't be an issue.
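For readers unfamiliar with the remote side: the Gmail API's users.messages.list call accepts a labelIds list and a free-form q search string, so a label-restricted listing is expressible in the request itself. The following is only a minimal sketch of building such request parameters. The helper name build_list_params and the label ID "Label_123" are made up for illustration, and this is not lieer's actual code.

```python
from __future__ import annotations

from typing import Any, Optional


def build_list_params(label_ids: Optional[list[str]] = None,
                      query: Optional[str] = None,
                      page_token: Optional[str] = None) -> dict[str, Any]:
    """Build keyword arguments for a Gmail users.messages.list request.

    Hypothetical helper for illustration; lieer does not define it.
    """
    params: dict[str, Any] = {"userId": "me"}
    if label_ids:
        # labelIds takes label IDs (e.g. "INBOX" or "Label_123"), not names.
        params["labelIds"] = list(label_ids)
    if query:
        # q uses the same syntax as the Gmail search box, e.g. "label:git".
        params["q"] = query
    if page_token:
        params["pageToken"] = page_token
    return params


params = build_list_params(label_ids=["Label_123"])
# With google-api-python-client this would be used roughly as:
#   service.users().messages().list(**params).execute()
print(params)
```

Note that this only covers listing messages. It says nothing about how the rest of the sync logic would treat messages that fall outside the filter, which is the concern raised above.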

@listx
Author

listx commented Mar 29, 2023

but there's a danger that gmailieer somewhere will think that all those messages that don't match the query are deleted locally. Maybe that won't be an issue.

Correct --- I would only be pulling those emails that match the git remote label, so I wouldn't care if any non-matching local messages are deleted.

@aspiers
Contributor

aspiers commented Sep 22, 2024

I need this badly too. I'm trying to sync a gmail account which has something like 18 years of email (~1.5m messages) and it is taking multiple days to do the initial pull. This problem is made worse by the fact that I ran into the same symptoms as in #63 (probably due to #229 and #277), so I have 1.5m messages many of which have the unread and inbox tags even though they shouldn't. The urgent task is to get tags accurate for any email received in the last year or so. The older stuff can wait for a slow sync to finish.

@aspiers
Contributor

aspiers commented Sep 22, 2024

I just saw that #129 was an attempt at a solution to a similar problem which got closed without merging, and #165 is a different approach which may also have solved it but seems to have stalled back in 2020. What's needed to make progress here?

@gauteh
Owner

gauteh commented Sep 22, 2024

The difficulty is making a remote query (remote.py) that is stable. That is probably possible, but it requires testing. Any change to this query will cause an initialized repository to become massively out of sync.

@aspiers
Contributor

aspiers commented Sep 22, 2024

Thanks for the quick reply! Why would it cause a repo to become out of sync, assuming that this kind of filtered pull would automatically not touch any messages or labels of messages which were not retrieved by the sync?

@gauteh
Owner

gauteh commented Sep 22, 2024

It would happen if someone changes the query, or if you make a query like since: last month, whose results are always changing. Maybe it works, but it would be good to test.

@aspiers
Contributor

aspiers commented Sep 22, 2024

Probably my understanding of how lieer works is woefully inadequate, so please pardon the stupid questions. Why would it be a problem if the query changed or returned different results? I would expect that to only change which emails get pulled down and their local labels updated, without touching ones which are already there.

@gauteh
Owner

gauteh commented Sep 22, 2024

That would probably result in deleting the ones that are already there. Plus you need to do a full sync to be sure, otherwise you need to replicate the behavior through the incremental sync. And if you change the query in more radical ways this could result in a huge number of emails to require deleting or syncing.
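To make the deletion concern concrete, here is a small sketch of a naive full-sync reconciliation in which any message present locally but absent from the remote listing is treated as a deletion candidate. The function name and message IDs are invented for illustration, and this is an assumption about the general shape of the problem, not a description of lieer's actual implementation.

```python
from __future__ import annotations


def missing_on_remote(local_ids: set[str], remote_ids: set[str]) -> set[str]:
    """Messages known locally that the remote listing did not return.

    A naive full sync would treat these as deleted on the remote.
    """
    return local_ids - remote_ids


local = {"m1", "m2", "m3", "m4"}
remote_unfiltered = {"m1", "m2", "m3", "m4"}
remote_filtered = {"m2", "m4"}  # only the messages carrying the wanted label

print(sorted(missing_on_remote(local, remote_unfiltered)))  # []
print(sorted(missing_on_remote(local, remote_filtered)))    # ['m1', 'm3']
```

With the filtered listing, m1 and m3 look exactly like messages that were deleted remotely, even though they merely fall outside the query.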

@aspiers
Contributor

aspiers commented Sep 22, 2024

@gauteh commented on September 22, 2024 5:21 PM:

That would probably result in deleting the ones that are already there.

That's why I wrote this:

assuming that this kind of filtered pull would automatically not touch any messages or labels of messages which were not retrieved by the sync.

Surely it would be possible to avoid any local deletions when only doing a partial pull?

Plus you need to do a full sync to be sure, otherwise you need to replicate the behavior through the incremental sync.

By full sync you mean a full sync of just the messages returned by the label filter, right? Couldn't this be enforced when pulling with a label filter active?

And if you change the query in more radical ways this could result in a huge number of emails to require deleting

Why would they require deleting? As per above, I wouldn't want a single deletion to ever happen in this filtered pull mode.

or syncing.

I don't see a problem with that. If I want to pull all messages matching label A, and then all matching label B, it should be up to me to assess how many messages are associated with both and decide whether syncing just that label makes sense. That's precisely the flexibility I need. (Actually not just with labels but ideally general search filters, but labels would be a great start.)

@gauteh
Owner

gauteh commented Sep 22, 2024

@gauteh commented on September 22, 2024 5:21 PM:

That would probably result in deleting the ones that are already there.

That's why I wrote this:

assuming that this kind of filtered pull would automatically not touch any messages or labels of messages which were not retrieved by the sync.

Surely it would be possible to avoid any local deletions when only doing a partial pull?

Yes, but then partial pull no longer matches a full sync. It should always match, otherwise the behavior will be unpredictable to the user, and it is also an assumption in lieer so there may be other side-effects (what if you add a label on a message which is no longer synced, but exists locally).

Plus you need to do a full sync to be sure, otherwise you need to replicate the behavior through the incremental sync.

By full sync you mean a full sync of just the messages returned by the label filter, right? Couldn't this be enforced when pulling with a label filter active?

No. If you have a query with a limit on the number of messages or all the messages in the last month you have to implement that logic in lieer: that means that you have to implement any logic in lieer which you allow to be in the query, and if it mismatches things are going to be out of sync. This makes it more difficult to deal with any changes back and forth. Maybe it is possible to do (or avoid), but I'm not sure how.

And if you change the query in more radical ways this could result in a huge number of emails to require deleting

Why would they require deleting? As per above, I wouldn't want a single deletion to ever happen in this filtered pull mode.

or syncing.

Then you do not match partial sync and full sync, and you are able to modify messages in notmuch that are not in your query. Expect weird side-effects or infinite-sync-cycles. You would have to have another index of messages which are now actively synced, but for that I think you need a full-sync.

I don't see a problem with that. If I want to pull all messages matching label A, and then all matching label B, it should be up to me to assess how many messages are associated with both and decide whether syncing just that label makes sense. That's precisely the flexibility I need. (Actually not just with labels but ideally general search filters, but labels would be a great start.)

Doing full syncs is a huge chore and a high threshold for getting started with lieer (as you have noticed here). I'm just saying that you should expect to have to do a full sync whenever you tweak your query.


By the way, I would also like this feature, syncing only the last month or so would be useful. It would also be helpful to sync and index everything and then be able to delete most of the messages on disk (so that they can be searched with notmuch, but opened in gmail).

@aspiers
Contributor

aspiers commented Sep 22, 2024

@gauteh commented on September 22, 2024 6:34 PM:

@gauteh commented on September 22, 2024 5:21 PM:
Surely it would be possible to avoid any local deletions when only doing a partial pull?

Yes, but then partial pull no longer matches a full sync. It should always match, otherwise the behavior will be unpredictable to the user

Sorry, I don't follow. What wouldn't match, and what would be unpredictable? If the user chooses to pull only mails matching a label, then they know that only mails matching that label will be up to date, and others may be out of date. That doesn't strike me as problematic. Is there some other problem I'm missing?

and it is also an assumption in lieer so there may be other side-effects (what if you add a label on a message which is no longer synced, but exists locally).

What do you mean by "no longer synced" here? If you mean adding a label remotely on gmail to a mail which is not pulled because it doesn't match the label filter for the pull, then it would simply remain out of date until the filter is relaxed. As per above I don't see an issue with that.

Plus you need to do a full sync to be sure, otherwise you need to replicate the behavior through the incremental sync.

By full sync you mean a full sync of just the messages returned by the label filter, right? Couldn't this be enforced when pulling with a label filter active?

No. If you have a query with a limit on the number of messages or all the messages in the last month you have to implement that logic in lieer

Why? Isn't the filtering simply in the request to the gmail API? i.e. "give me all messages matching label X".

Perhaps I should check my understanding - when you refer to "query", I'm assuming you're talking about one/some of the queries which lieer sends to the gmail API to retrieve the remote mails - right?

that means that you have to implement any logic in lieer which you allow to be in the query, and if it mismatches things are going to be out of sync. This makes it more difficult to deal with any changes back and forth. Maybe it is possible to do (or avoid), but I'm not sure how.

I don't follow the point here :-( Feels like maybe I'm missing something fundamental about how lieer works.

And if you change the query in more radical ways this could result in a huge number of emails to require deleting

Why would they require deleting? As per above, I wouldn't want a single deletion to ever happen in this filtered pull mode.

or syncing.

Then you do not match partial sync and full sync

By "partial sync" do you mean the partial sync proposed by this feature request, or the incremental sync performed by default when the -f option is not used with gmi pull?

Either way, what doesn't match and why does this matter? Sorry if I'm asking a load of stupid questions, but something's just not clicking in my brain so I think I'm missing the crux of your point!

and you are able to modify messages in notmuch that are not in your query.

I don't get why that would need to happen. When the query is limited to only emails with a remote label, then surely any other emails already locally should just be not touched at all? It doesn't matter whether they're modified locally or remotely; either way lieer wouldn't look at them at all.

Expect weird side-effects or infinite-sync-cycles. You would have to have another index of messages which are now actively synced, but for that I think you need a full-sync.

I'm hopelessly lost by this but hopefully I've written enough that you can pinpoint where my understanding is failing me 😅

I don't see a problem with that. If I want to pull all messages matching label A, and then all matching label B, it should be up to me to assess how many messages are associated with both and decide whether syncing just that label makes sense. That's precisely the flexibility I need. (Actually not just with labels but ideally general search filters, but labels would be a great start.)

Doing full syncs is a huge chore and a high threshold for getting started with lieer (as you have noticed here). I'm just saying that you should expect to have to do a full sync whenever you tweak your query.

I still don't understand why this would be necessary. BTW, when you say "tweak your query", are you imagining that the query would be configured in the config? For the OP's use case this would make sense, but for my use case of trying to incrementally grab the most important (active, non-historical) areas of my Gmail account, I would more likely want some ephemeral filters, e.g.

gmi pull --label recent-work

to temporarily restrict the pull to a label just for a single run.
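As a rough illustration of what such an ephemeral filter could look like on the command line, here is a minimal argparse sketch. The --label option is the proposal from this comment and is not claimed to exist in gmi today. The program name gmi-sketch is likewise made up.

```python
import argparse


def make_parser() -> argparse.ArgumentParser:
    """Parser for a hypothetical 'pull' subcommand with a per-run label filter."""
    parser = argparse.ArgumentParser(prog="gmi-sketch")
    sub = parser.add_subparsers(dest="command", required=True)
    pull = sub.add_parser("pull", help="pull messages from the remote")
    # Hypothetical option: may be given several times, applies to this run only.
    pull.add_argument("--label", action="append", default=[],
                      help="restrict this pull to messages with the given label")
    return parser


args = make_parser().parse_args(["pull", "--label", "recent-work"])
print(args.command, args.label)
```

Because the filter lives only in the parsed arguments and not in the stored configuration, the next plain pull would run unrestricted again.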

By the way, I would also like this feature, syncing only the last month or so would be useful.

You mean a query filter which could also be based on dates rather than just labels? Yes that would be great.

It would also be helpful to sync and index everything and then be able to delete most of the messages on disk (so that they can be searched with notmuch, but opened in gmail).

Cool idea! When you say "opened in gmail", how would that work from a UX perspective?

@gauteh
Owner

gauteh commented Sep 22, 2024

Maybe it would be more useful to try and play around with the code to understand it. If you change the line in remote.py referenced above, you should be able to achieve what you want, and possibly start to see some of the interesting side-effects :) It might mess up some of your labels though.


It would also be helpful to sync and index everything and then be able to delete most of the messages on disk (so that they can be searched with notmuch, but opened in gmail).

Cool idea! When you say "opened in gmail", how would that work from a UX perspective?

  1. fetch the message content on-demand
  2. open the regular gmail UI in a browser with that message opened.

@aspiers
Contributor

aspiers commented Oct 6, 2024

Having got to know lieer a bit better, I think I'm understanding the above better now:

So the API doesn't support filtering of partial sync via a query, therefore that filtering would have to be done separately, which might not even be possible via the API. E.g. if you wanted to synchronize only messages matching a label foo, then during a partial sync when iterating through the history of changes since the last sync, the following cases need to be handled:

  • If label foo is added to a mail which wasn't previously synced, you need to be able to figure out how to include it in this and future syncs.
  • If label foo is removed from a mail which was previously synced, you need to be able to figure out how to remove it from this and future syncs.
  • If the history of changes both adds and removes this label, these need to be handled in chronological order so that the final change is the one which wins.
  • If another change is made to a mail, you need to be able to figure out whether that mail matches the query and consequently whether that change is relevant to the sync or can be ignored.

This seems more awkward than the per-mail lastmod versions which notmuch tracks.
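The cases above can be sketched as a replay of history records in chronological order. This is a simplified model under stated assumptions: the Change record is a made-up stand-in rather than the Gmail API history schema, and the set of previously synced messages is assumed to be exactly the set that matched label foo at the last sync.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Change:
    """Hypothetical, simplified history record for one message."""
    msg_id: str
    added: frozenset = frozenset()    # labels added by this change
    removed: frozenset = frozenset()  # labels removed by this change


def replay(history: list[Change], synced: set[str], wanted: str):
    """Replay changes oldest-first; return (to_fetch, to_drop, to_update)."""
    matching = set(synced)  # assumed: synced == messages matching at last sync
    touched = set()
    for change in history:  # order matters: the last add/remove wins
        if wanted in change.added:
            matching.add(change.msg_id)
        if wanted in change.removed:
            matching.discard(change.msg_id)
        touched.add(change.msg_id)
    to_fetch = matching - synced              # newly labelled, not yet local
    to_drop = synced - matching               # label removed since last sync
    to_update = touched & matching & synced   # other changes to synced mail
    return to_fetch, to_drop, to_update


history = [
    Change("c", added=frozenset({"foo"})),
    Change("b", removed=frozenset({"foo"})),
    Change("d", added=frozenset({"foo"})),
    Change("d", removed=frozenset({"foo"})),
    Change("a", added=frozenset({"UNREAD"})),
    Change("e", added=frozenset({"UNREAD"})),  # never matched: ignored
]
to_fetch, to_drop, to_update = replay(history, synced={"a", "b"}, wanted="foo")
print(to_fetch, to_drop, to_update)
```

The sketch only works because membership is decided by a single label that shows up in the history itself. For a general search query (dates, senders, message counts) there is no such signal in the history, which is the difficulty described earlier in the thread.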

Furthermore, if you change the query used for syncing, then as noted in previous comments above:

  • that would require downloading any new mails matching the new query
  • any mail existing locally matching the old query but not the new query would no longer be synced and could start to become out of date, so the user would need to be clear about that (and maybe given the option to prune the local copy of that mail).
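The two consequences of changing the query can be expressed as set differences. This is a minimal sketch with invented names and message IDs, assuming the ID sets matching the old and new queries are both known:

```python
from __future__ import annotations


def plan_query_change(local_ids: set[str],
                      old_matches: set[str],
                      new_matches: set[str]) -> tuple[set[str], set[str]]:
    """Return (to_download, no_longer_synced) after switching queries."""
    # New matches that are not on disk yet must be downloaded.
    to_download = new_matches - local_ids
    # Local mail that matched the old query but not the new one goes stale.
    no_longer_synced = (local_ids & old_matches) - new_matches
    return to_download, no_longer_synced


local = {"m1", "m2", "m3"}
old = {"m1", "m2", "m3"}
new = {"m2", "m3", "m4", "m5"}
to_download, no_longer_synced = plan_query_change(local, old, new)
print(sorted(to_download), sorted(no_longer_synced))
```

Here no_longer_synced is the set a user would need to be warned about, or offered the option to prune.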

I hope I got that right but corrections very welcome of course :)


3 participants