Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Backend usage metrics (offline entities) for 2024.2 #1198

Merged
merged 8 commits into from
Sep 25, 2024

Conversation

ktuite
Copy link
Member

@ktuite ktuite commented Sep 24, 2024

Closes getodk/central#653

Example of new metrics:

    num_offline_entity_branches: 1,
    num_offline_entity_interrupted_branches: 1,
    num_offline_entity_submissions_reprocessed: 1,
    max_entity_submission_delay: 0.022,
    avg_entity_submission_delay: 0.015

This covers these metrics:

  • 1. Number of offline branches involving more than one update
  • 2. How many branches were interrupted in the middle of being processed?
  • 3. How many submissions were put in the backlog just long enough for updates to be processed in the correct order? In other words, how many submission.reprocess events have there been?
  • 6. How long does it take for an entity change to be processed? What is the max amount of time between when the submission was created and the entity version was created?
    • Matt: "What I'm personally interested in is some sense of how often there are long delays in entity processing due to a large number of unprocessed events in the audit log. How often are the workers working through the audit log overworked?"

It doesn't include

TODO:

  • tackle this one: (planning to leave til after Log when a create/update entity submission is force applied central#720)
    5. How long does it take for an entire branch to be processed? Specifically: what is the max amount of time between when the first submission for a branch is received and when the last submission is processed?
  • Hook up the rest of the analytics (send output to odk form, update analytics form, etc.)

What has been done to verify that this works as intended?

Why is this the best possible solution? Were any other approaches considered?

How does this change affect users? Describe intentional changes to behavior and behavior that could have accidentally been affected by code changes. In other words, what are the regression risks?

Does this change require updates to the API documentation? If so, please update docs/api.yaml as part of this PR.

Before submitting this PR, please make sure you have:

  • run make test and confirmed all checks still pass OR confirm CircleCI build passes
  • verified that any code from external sources are properly credited in comments or that everything is internally sourced

Copy link
Member

@matthew-white matthew-white left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good so far, just leaving some initial thoughts!

lib/model/query/analytics.js Outdated Show resolved Hide resolved
lib/model/query/analytics.js Show resolved Hide resolved
lib/model/query/analytics.js Outdated Show resolved Hide resolved
// submission create loggedAt timestamp (when sub was created) to when
// the event was processed, which will be after the entity version was
// created.
const measureEntityProcessingTime = () => ({ one }) => one(sql`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could this exclude offline updates that are force-processed from the backlog? I think those would skew the metrics.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe this is another place where getodk/central#720 would be helpful.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Based on the discussion below, I think this isn't an issue after all. Even if an offline update ends up in the backlog, that's different from how quickly the submission.create event is processed.

// submission create loggedAt timestamp (when sub was created) to when
// the event was processed, which will be after the entity version was
// created.
const measureEntityProcessingTime = () => ({ one }) => one(sql`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the same reason, it'd be great if we could exclude submission.create events that were processed when the approvalRequired flag was toggled, as we expect a delay in that case. I'm not really sure how to identify that though. 🤔 If it helped, I think it'd be OK to exclude entity creations (entity defs whose version number is 1), since those are the only entity defs that the approvalRequired flag will affect.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This query is kinda indirect in a way that I think is convenient for the metric we are getting at.

It's looking at the timestamps on the submission.create audit event which only gets updated by the jobs worker, the one that runs the first time the event comes in. The other things that process submission events to make entities (like the batch processing when toggling the approvalRequired flag and batch-processing the backlog) use the event to look up information, but they don't update anything about the underlying event.

So I think it's safe to to look at this time difference like this. It's not every submission create event, it's only the ones that ultimately turn into entity versions. It doesn't capture how long it takes overall for a submission to become an entity (potentially factoring in these other sources of delay), but it does capture how long a submission would be sitting around waiting for any initial entity processing (even if the result of that was it getting skipped to be reexamined later).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe what you were expecting was a more explicit (entity def creation date - source submission def creation date). I could add a query for that, too, that could include these other ways in which entity processing can get delayed. Then we could see if there's a noticeable difference... which would mean these delay-related mechanism are activated.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's looking at the timestamps on the submission.create audit event which only gets updated by the jobs worker, the one that runs the first time the event comes in.

Ah that's interesting, I hadn't put that together. This sounds like a good way to approach the metric then. 👍 I don't feel like we need anything more explicit: it makes sense to me to start with this.

@matthew-white
Copy link
Member

I think metric 5 is also one that would benefit from getodk/central#720.

WHERE action = 'submission.create'
`);

const measureElapsedEntityTime = () => ({ one }) => one(sql`
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added this query for fun but decided not to include it in the actual report. Partly because it feels slower than some of the other queries and the overall set of queries is already sooo slow.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, and I think as we discussed above, we would expect these metrics to be larger than those from measureEntityProcessingTime(), as these metrics will include the 5-day entity backlog wait and also the time between submission creation and submission approval (if submission approval is required).

@ktuite ktuite marked this pull request as ready for review September 25, 2024 21:05
@ktuite ktuite changed the title Offline entity usage metrics for 2024.2 Backend usage metrics (offline entities) for 2024.2 Sep 25, 2024

// Create form, set dataset people to approvalRequired = true
await createTestForm(service, container, testData.forms.simpleEntity, 1);
await approvalRequired(service, 1, 'people');
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah sorry, I just removed this function in #1196, as it wasn't used anymore and the linter was complaining. I think merging now would break the master branch, but you could merge master into this PR, then add approvalRequired() back. Or if it's not used elsewhere, you could just send the request to toggle the flag without using approvalRequired(). Sorry about that!

@ktuite ktuite force-pushed the ktuite/usage_metrics branch from ebeb0ba to 16e33b5 Compare September 25, 2024 23:06
@ktuite ktuite merged commit 3e190d5 into master Sep 25, 2024
6 checks passed
@ktuite ktuite deleted the ktuite/usage_metrics branch September 25, 2024 23:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Update usage information metrics for v2024.2
2 participants