Consider adding url-level receive_time to the blockchain payload #36
Comments
I agree that it might be good to transition to a format that allows including other attributes about the feed. I would suggest using a list of objects instead. We currently have some cases where the same feed URL will show up in the block multiple times.
In this format, maybe we change the top-level |
Works for me - under what circumstances would the same feed URL show up in the same block batch? Two quick pings from the same source (source in this case means Buzzsprout, etc.), or two pings from different sources? If the former, we should probably just coalesce it within a single block. Perhaps name the collection |
The pings are coming from the hosts. From what it looks like, hosts (e.g. Buzzsprout) send a ping any time a change is saved. If 2 saves are done close together, they can end up in the same block when published to Hive. In a thread the other day on podcastindex.social, we were discussing these duplicate pings and what to do with them. Right now, I don't think any filtering is being applied and all pings are written to the blocks. The app or service that is watching the index could decide to ignore a ping if it was received within a close time frame. Here is that thread.
This all sounds fine to me. If we make the change now, future metadata additions will be easy.
I'm very much in favor of allowing for more metadata in a ping, but I don't think received time is the one to go for. We're talking here about under 30 seconds of resolution for an item which, for the vast majority of podcasts, has a resolution of a week or perhaps a day. We could solve the double-post problem by delaying things for a minute or two and de-duplicating, but it really seems better to just let almost everything through. I certainly could de-duplicate the URL list in any given custom_json. Right now the writer accepts URLs for up to 3s, or until it reaches 130 URLs; I've tested that this keeps us below the maximum size of a custom_json, which is 8192 bytes. There's no point measuring exactly; 130 keeps us comfortably below and is huge!
3s is exactly the Hive block time, and I haven't noticed us sending multiple pings in the same block (which isn't really a problem anyway and will happen later when multiple entities are writing directly to Hive). I've just added a de-duplicating step within that 3s window; we could push that to 5s or 10s, which would catch more, but Dave is already buffering things on his side. If we had a received time, that would mean Dave sending it all the way up the chain through his queue, which means he sends two pieces of info per ping instead of one, and hive-writer has to maintain both. However, in order to preserve the simplest list format when we're just pinging, I would prefer something like the following: two separate lists, one with just the URLs and a parallel list for the associated metadata.
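The parallel-lists idea above, combined with the in-batch de-duplication just described, could be sketched roughly as follows. The field names `urls` and `receive_times` are illustrative assumptions, not a published Podping schema; the 8192-byte limit is the custom_json maximum cited above.

```python
import json

MAX_CUSTOM_JSON_BYTES = 8192  # Hive custom_json size limit mentioned above

def build_payload(pings):
    """Build a parallel-lists payload from (url, receive_time) pairs,
    de-duplicating repeated URLs within the batch window (first ping wins)."""
    seen = set()
    urls, times = [], []
    for url, received in pings:
        if url in seen:  # duplicate ping within the same batch: drop it
            continue
        seen.add(url)
        urls.append(url)
        times.append(received)
    payload = {"urls": urls, "receive_times": times}
    encoded = json.dumps(payload, separators=(",", ":"))
    assert len(encoded.encode()) <= MAX_CUSTOM_JSON_BYTES
    return payload

batch = [
    ("https://example.com/feed.xml", "2021-05-10T12:00:01Z"),
    ("https://example.com/feed.xml", "2021-05-10T12:00:02Z"),  # duplicate ping
    ("https://example.org/other.xml", "2021-05-10T12:00:03Z"),
]
print(build_payload(batch)["urls"])
```

Keeping the two lists parallel (same index, same ping) preserves the simple flat URL list for consumers that don't care about the metadata.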
Philosophical point: we are writing to an immutable database. I'm trying to be mindful all the time of what we are doing and its impact on the whole chain. Podping isn't going to break Hive or even trouble it significantly, and we will bring much more value to Hive, in terms of proving its utility, than we take out. However, I'm trying to be as careful as I can with extra data. I'm even looking at the version number now and wondering whether it really gives us much when it is repeated 28,000 times per day, because maybe it will change in a week or two! I don't think anyone is making use of that version number right now, so I'm thinking to remove both version and reason.
I really wouldn’t remove version and reason. It feels better to just shorten them and document the schema here in the repo. For instance, just “v” for version, and make it an integer. That’s incredibly tiny. And for “reason”, just use “r” with an integer value, with the mapping of numbers to reason codes here in the repo. I already have a use for reason that I discussed with someone just yesterday. We’re only talking about 10 bytes of extra JSON data per block if we pare it down like that.
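A quick way to sanity-check that byte estimate: build the payload both with and without the shortened keys and compare the encoded sizes. The reason-code mapping here is invented for illustration; the actual codes would live in the repo as suggested above.

```python
import json

# Illustrative reason-code mapping (hypothetical, not the real schema)
REASONS = {"feed_update": 1, "new_feed": 2}

def compact_payload(urls, reason, version=1):
    """Payload with shortened single-letter metadata keys."""
    return {"v": version, "r": REASONS[reason], "urls": urls}

with_meta = json.dumps(
    compact_payload(["https://example.com/feed.xml"], "feed_update"),
    separators=(",", ":"))
without = json.dumps({"urls": ["https://example.com/feed.xml"]},
                     separators=(",", ":"))
print(len(with_meta) - len(without))  # prints 12: '"v":1,' + '"r":1,'
```

Twelve bytes per custom_json, not per URL, so the overhead stays constant no matter how many URLs a block carries.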
Is the full current JSON schema even documented here yet? Maybe I missed it.
I'll work on this. And documenting the custom_json properly.
Just have to remember that, as additional metadata is added, the current MAX_URL_PER_CUSTOM_JSON may need to be reduced to accommodate it.
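The trade-off between metadata and URL count can be sketched as a simple size budget. The 8192-byte limit and 130-URL cap come from the discussion above; the average URL length and envelope overhead are assumptions for illustration only.

```python
MAX_CUSTOM_JSON_BYTES = 8192  # custom_json limit cited above

def max_urls(avg_url_len, metadata_overhead=0, envelope=50):
    """Rough estimate of how many URLs fit in one custom_json.

    avg_url_len: assumed average feed URL length (illustrative).
    metadata_overhead: bytes consumed by per-URL or top-level metadata.
    envelope: bytes for braces, keys, and other fixed JSON structure.
    """
    per_url = avg_url_len + 3  # URL plus quotes and a separating comma
    return (MAX_CUSTOM_JSON_BYTES - envelope - metadata_overhead) // per_url

print(max_urls(avg_url_len=58))                            # prints 133, near the 130 cap
print(max_urls(avg_url_len=58, metadata_overhead=2000))    # prints 100: metadata eats URL slots
```

With these assumed numbers the current 130-URL cap sits just under the estimate, which matches the "comfortably below" margin described earlier; adding metadata shrinks that headroom directly.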
When a ping comes in from the host, it would be great to capture the request time along with the associated URL when later writing to the block; this info is currently lost due to the queuing done before writing to the blockchain.
It's the truest origin time we have on which to base all downstream delay measurements.
This would require changing the payload to support url-level attributes, which could be useful for other attributes in the future.
e.g.
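The original example was lost in extraction, but a list-of-objects payload carrying a per-URL receive_time might look something like the following sketch. All field names and values here are illustrative assumptions, not a published schema.

```python
import json

# Hypothetical url-level payload: each entry pairs a feed URL with the
# time the ping was received, instead of a flat list of URL strings.
payload = {
    "version": "0.3",          # illustrative version value
    "reason": "feed_update",   # illustrative reason value
    "urls": [
        {"url": "https://example.com/feed.xml",
         "receive_time": "2021-05-10T12:00:01Z"},
        {"url": "https://example.org/other.xml",
         "receive_time": "2021-05-10T12:00:03Z"},
    ],
}
print(json.dumps(payload, indent=2))
```

A downstream consumer could then compute its own delay as the difference between its processing time and each entry's `receive_time`.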