
Issues with List API reliability #156

Open
storema3 opened this issue Nov 27, 2024 · 2 comments

Comments

@storema3

The POST /api/v1/list API is the main method for getting message data out of a Kiwi News node. Having worked with it over a longer time span, I see more and more issues that make it harder to work with.

The API allows retrieving the messages pagewise, by specifying a start index and the number of messages per page. To retrieve all messages of a node, the method must be called repeatedly, as in the sketch below.
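
For reference, here is a minimal client sketch of that loop in TypeScript. The `from` parameter name appears later in this issue; the `amount` request field, the `data` response field, and the node URL are assumptions, so adjust them to the real API:

```typescript
// Minimal sketch of paging through POST /api/v1/list.
// Assumptions: the request body takes { from, amount } (the `from`
// name is used later in this issue; `amount` and the `data` response
// field are guesses -- adjust to the real API).
const ENDPOINT = "http://localhost:8000/api/v1/list"; // hypothetical node URL
const PAGE_SIZE = 50;

const sleep = (ms: number) =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

async function listPage(from: number, amount: number): Promise<unknown[]> {
  const res = await fetch(ENDPOINT, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ from, amount }),
  });
  if (!res.ok) throw new Error(`HTTP ${res.status}`);
  const body = await res.json();
  return body.data ?? [];
}

async function fetchAllMessages(): Promise<unknown[]> {
  const messages: unknown[] = [];
  for (;;) {
    const page = await listPage(messages.length, PAGE_SIZE);
    if (page.length === 0) break; // no data returned anymore: done
    messages.push(...page);
    // Pause between calls; on a small node (4 GB CAX11) the node
    // might otherwise hang or crash (see the limitations below).
    await sleep(20_000);
  }
  return messages;
}
```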

Known limitations of the API

The API walks the trie and is expensive in terms of memory and computation. With smaller nodes (4 GB RAM; see the Hetzner CAX11 for reference) it is necessary to pause 15–30 seconds between invocations of the API; otherwise the node might hang or crash.

The API seems to be highly dependent on the state/load of the system. Sometimes clients get responses without any data, although the next invocation returns data. On other occasions, one client repeatedly gets HTTP 500 responses about missing leaf nodes, while another client happily retrieves the very data the first one asked for.

The sequence of data items retrieved is not always guaranteed. Therefore, after node crashes or re-initialization of a node, a complete re-download of all messages is often necessary.

While the re-download was no problem when there were only a few thousand messages, with increasing message counts this is becoming a problem for systems relying on the data.

ETL process to retrieve message data and missing data

The ETL process to retrieve data consists of two steps:

  1. Get all current messages by repeatedly invoking the list API until no more data is returned.
  2. Periodically check for new messages. The ETL process knows how many messages it has retrieved and uses this count as the from (start index) parameter (see the sketch below).
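
A minimal sketch of that bookkeeping, reusing `listPage` and `PAGE_SIZE` from the sketch above:

```typescript
// Sketch of step (2): persist how many messages have been exported
// and use that count as the `from` start index on the next poll.
async function pollForNewMessages(
  state: { exported: number },
): Promise<unknown[]> {
  const page = await listPage(state.exported, PAGE_SIZE);
  state.exported += page.length;
  return page; // should contain only new messages -- but see below
}
```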

Step (1) normally works; occasional server errors can be corrected by repeating the API calls, e.g. with the retry sketch below. On a CAX11 VM this process can take an hour.
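
A hedged sketch of such a repeat-on-error loop, reusing `listPage` and `sleep` from the first sketch; the retry count and delay are arbitrary:

```typescript
// Retry the occasional HTTP 500 ("missing leaf node") responses a
// few times before giving up.
async function listPageWithRetry(
  from: number,
  amount: number,
  retries = 5,
): Promise<unknown[]> {
  for (let attempt = 1; ; attempt++) {
    try {
      return await listPage(from, amount);
    } catch (err) {
      if (attempt >= retries) throw err;
      await sleep(30_000); // give the node time to recover
    }
  }
}
```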

However, it has happened that the number of messages retrieved after a complete re-initialization of a node is lower than the previous message count. A system that had previously exported 25112 messages from a node got only 25048 after the node data had been deleted and re-synched.

Step (2) is problematic. It does get all new messages from the node, but over time it also receives duplicate messages that should not be there:

[Image: kn-import-duplicate, chart of new messages (green) and duplicates (yellow) per ETL call]

The image shows the number of new messages the ETL process exports per call (green line). The yellow line is the number of duplicates the process received. Here, at 10:00, the ETL process received a message with an already existing message index, although it should have gotten only new items! Has new data appeared somewhere that changed the sequence?

These duplicates never go away, and their number increases over time (1–4 messages per occasion). After 37 days of operation, there were 10 duplicates.

This behavior affects the download of amplify and comment messages.
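
For reference, duplicates like these can be counted with a simple seen-set on the client. This sketch assumes each message can be keyed by hashing its serialized content; any stable identifier (e.g. the message index) would work as well:

```typescript
import { createHash } from "node:crypto";

// Count duplicates by keying each message on a digest of its
// serialized content (an assumption -- any stable identifier works).
const seen = new Set<string>();
let duplicates = 0;

function trackMessage(message: unknown): boolean {
  const key = createHash("sha256")
    .update(JSON.stringify(message))
    .digest("hex");
  if (seen.has(key)) {
    duplicates += 1;
    return false; // already exported earlier
  }
  seen.add(key);
  return true; // genuinely new message
}
```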

@TimDaub
Member

TimDaub commented Nov 28, 2024

thanks for reporting! This is great.

It might be that the /list endpoint has also gotten less reliable over time because we've done all sorts of caching and optimization to the store.posts function. But I've done this optimization to improve startup times and frontend response times, so I've given the /list endpoint little attention, admittedly!
E.g., for the last 1–2 months store.posts has used multithreading to validate the signatures of the messages, but I imagine this can create all sorts of indeterminism that probably didn't exist before that change. It could also affect the sequentiality of the returned messages!

Since other users have also asked for better APIs to work with Kiwi News, I was wondering: should we make the /list API better, or do we need much better APIs in general? @storema3 it seems that for you the list endpoint serves its purpose.
What do you generally use it for? Initially, we didn't intend that endpoint to allow downloading the entire database of messages, but I see that this has indeed become a use case.
But maybe, in general, it's the wrong approach? E.g. Farcaster allows someone to spin up a sidecar Postgres database from which users can read. I know that Ethereum users love Erigon, and from what I know it also exposes the database as an interface to third-party developers. Is this also something we should have? I also wonder how other systems work. What's the most reliable way to download all messages when e.g. running a Kiwi News node in reconcile mode?

Also: do you actually need the server to send you the signatures and the identities of who signed the messages? I imagine yes, right? Because that is among the computationally most expensive tasks. In general, here's why the /list endpoint is bad:

  • It uses store.posts. store.posts uses store.leaves, which does a depth-first traversal over the entire trie to get the requested portion of the leaves (slow). For the last leaves, e.g., this means it may have to traverse most of the trie first.
  • The Kiwi News messages in the trie don't yet contain the address that was used to sign them. They only contain a signature field, which must be converted to an address via EIP-712 ecrecover, a CPU-bound task that will scale worse and worse the more messages we store :(
  • The store.posts function is only minimally cached; e.g., the signatures are cached permanently once they've been requested (a rough sketch of that pattern follows below).
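
To make the second and third points concrete, here is a rough sketch of that recover-and-cache pattern using ethers' verifyTypedData. The EIP-712 domain and message types below are placeholders, not Kiwi News' actual schema:

```typescript
import { verifyTypedData, TypedDataDomain, TypedDataField } from "ethers";

// Placeholder EIP-712 schema -- not the actual Kiwi News definitions.
const domain: TypedDataDomain = { name: "kiwinews", version: "1.0.0" };
const types: Record<string, TypedDataField[]> = {
  Message: [
    { name: "title", type: "string" },
    { name: "href", type: "string" },
    { name: "type", type: "string" },
    { name: "timestamp", type: "uint256" },
  ],
};

// Permanent per-signature cache, mirroring the caching described
// above: each signature only pays the ecrecover cost once.
const addressCache = new Map<string, string>();

function recoverSigner(
  message: Record<string, unknown>,
  signature: string,
): string {
  const cached = addressCache.get(signature);
  if (cached) return cached;
  // CPU-bound ecrecover -- the expensive part of store.posts.
  const address = verifyTypedData(domain, types, message, signature);
  addressCache.set(signature, address);
  return address;
}
```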

@rvolz
Contributor

rvolz commented Nov 28, 2024

Thanks for the prompt response. I see two areas for improvement.

For everybody spinning up a new node, or re-synchronizing after a failure:

  1. A sync method that terminates automatically when it reaches a defined condition (a message timestamp), e.g. the start of the sync request; a client-side approximation is sketched below. Currently, we have to watch npm sync for activity.
  2. Some progress indication during the sync would be nice to have.
  3. A statistics API that shows the number of messages currently held in a node, so the sync duration could be estimated; this might also be useful for debugging sync problems.
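
For item (1), a client-side approximation of such a stop criterion might look like the following. It assumes messages carry a numeric timestamp, treats the `latestTimestamp` hook as hypothetical (e.g. reading the newest message via the list API), and reuses `sleep` from the first sketch:

```typescript
// Poll until the node holds a message from after the moment the
// sync was requested, then consider the sync complete.
async function waitForSync(
  latestTimestamp: () => Promise<number>, // hypothetical hook
  pollMs = 60_000,
): Promise<void> {
  const syncStart = Math.floor(Date.now() / 1000);
  for (;;) {
    const newest = await latestTimestamp();
    if (newest >= syncStart) return; // stop criterion reached
    // crude progress indication while we wait
    console.log(`sync progress: newest message at ${newest}`);
    await sleep(pollMs);
  }
}
```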

As for the extraction of node data:

The push model of sending simplified data into a separate DB would be a nice alternative, if that DB could be remote and could be updated continually and reliably at a shortish interval (5–15 minutes). This approach would make sense if the list API stays as expensive as it is: the load would occur only once per update period, not for every client request.

Opening up the internal data structures might make sense for clients like Erigon, which deal with well-known data structures that change relatively slowly (the chains). And even they offer Otterscan as an interface (and of course the RPCs). For a protocol like KN, which is constantly looking for new opportunities, it might be a hindrance.

The main requirements for ETL from a reconciliation node would be:

  1. a quick sync of the database with an explicit stop criterion, probably a timestamp
  2. a periodic, time-based update facility
  3. the ability to check whether the ETL has gotten all the messages, or what is missing

About the identity in a message:

We export the identity per message, but not the signature etc. The identity is used for statistics (activity per user, ...) and maybe also as a future search criterion. We export only the identity because we trust that we could verify the message sender via the node if necessary.
