-
@padovan @spbnick I think this is the right moment to address some topics so we can make progress on the regression analysis side of things. I started this discussion to outline the ideas and issues that I believe should be tackled first, and to brainstorm and plan for the next months. Let me know what you think and contribute to it in any way you want.
-
Thanks for starting this.
-
Thank you for the great summary, and a good description of the problem, @hardboprobot! I'll try to answer everything, but tell me if I missed something.
The plan for KCIDB is to support both options: detection in the CI system and in KCIDB, but I've probably said that a thousand times already, and that alone doesn't get us closer. The idea is to both have enough schema to accept regressions from the outside and to find them ourselves. We can start with accepting them from the KernelCI API, and that's what I've actively worked towards so far (only targeting CKI and Syzbot), as the minimal first step bringing value. However, I'm on @padovan's side, and would say we need to reach for the approach with the most impact and excitement, that is, processing KCIDB data (with all the schema changes that might require). In either case, we really need to define what a "regression" is from your POV before we can be sure we can accommodate it in the schema. I'll give it a shot at the end here.
Results can arrive at KCIDB in any order, even from the same CI system; checkouts, builds, and tests all carry timestamps, however. I'm continuing my work on a PoC for representing/querying large DAGs in PostgreSQL, and I hope we'll eventually have something workable that would let us order revisions properly, but with its current state, and the time I can spare on it, it's unknown when it will be usable. I'm pretty sure it's possible to do, though. However, if we want KCIDB to direct bisections, or simply process bisection results submitted later, ordering by timestamps breaks down, because those results will come with later timestamps. This makes me think maybe I should raise the priority on the DAG PoC and reach for the simplest and quickest solution. If we have that, we can e.g. add a field containing a list of parent revisions to checkouts, so that contributing CI systems could submit relevant revision history, and augment that with a background scanner of known trees running on the KCIDB side, submitting that data in the same way.
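To make the ordering idea a bit more concrete, here's a rough sketch (not KCIDB code; the parent-list data is assumed to come from the hypothetical per-checkout field mentioned above) of how parent-revision lists could be used to decide whether one revision precedes another:

```python
from collections import deque

# Hypothetical adjacency data: commit hash -> list of parent commit hashes,
# assembled from the proposed (not yet existing) parent-revisions field on checkouts.
parents = {
    "c3": ["c2"],
    "c2": ["c1"],
    "c1": [],
    "m1": ["c2", "x1"],  # a merge commit with two parents
    "x1": [],
}

def is_ancestor(ancestor: str, descendant: str) -> bool:
    """Return True if `ancestor` is reachable from `descendant` via parent links."""
    queue = deque([descendant])
    seen = set()
    while queue:
        commit = queue.popleft()
        if commit == ancestor:
            return True
        if commit in seen:
            continue
        seen.add(commit)
        queue.extend(parents.get(commit, []))
    return False

assert is_ancestor("c1", "c3")      # c1 precedes c3
assert not is_ancestor("c3", "c1")  # but not the other way around
```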
This I will need your help with; see below. That said, the KCIDB schema was designed to be quickly changeable while keeping backward compatibility, so we absolutely should be changing it as much as we need. The only criterion I have for what goes into the main KCIDB DB is that it shouldn't be transient data, something that can change completely, but only data that can increase in precision over time. Hope that makes sense. Information about regressions found in test/build results is the increasing-precision kind; the state of the triager itself is perhaps the transient kind. OTOH, the current KCIDB schema allows us to say things like "this test result is being triaged for this issue", so the boundary is a bit fuzzy.
OK, here's where I'll try to guess what you mean by a "regression"; please correct me as necessary. Conceptually, a "regression" is when something was working and then broke. A falling edge, in digital logic terms, if you will. However, in the context of CI and "regression tracking" we should also consider the event of restoring the regressed functionality. Let's call it "recovery" (another possible term could be "restoration"). That would be the rising edge. I guess that by "regression" you mean that falling edge. Right? And your concern is that KCIDB doesn't support recording those edges (although it has rudimentary support for triggering notifications on them). Right? Let's see what we can do about that. But first... there are multiple dimensions to the problem of finding regressions/recoveries:
There's another half-dimension: determining the regression/recovery scope, that is, which targets, peripherals, and environments it reproduces in. Strictly speaking it's not required, if you can re-run tests exactly on the device we first noticed the problem on. But if you can't (which e.g. is a bit of a problem in CKI), an expanded scope would help by giving more hardware to test on, and by admitting results from other CI systems into the equation. And of course maintainers and developers would appreciate it if we gave them at least a hint of the scope.

KCIDB provides an abstraction of an "issue", and its definition could be quite wide. Say, "a description of a problem", whatever that problem is. It could be simply "a specific test is failing". But it could also be "a specific test is failing in a particular way on this board". Or "a specific build is failing". Or "a test is passing when it shouldn't". Or even "a CI system messed up here". And then there's the "incident", which is simply a link between an issue and a build and/or test result, saying "this issue occurred here" or "this issue didn't occur here". Now, for simplicity's sake, imagine we had all these types of issues recorded in the database, along with all their incidents, including "this test has failed" and "this test has not failed", all kinds. Then we could estimate the confidence of each issue occurrence/absence based on historic data, request more testing from CI systems to raise the confidence and fill in the holes in the revision DAG, and finally find a regression/recovery edge. In reality, we would not create issues and incidents for every failing test; we could e.g. implement "virtual" issues and incidents derived automatically from statuses, but nevertheless participating fully in the analysis. See the sketch below.
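To make the issue/incident idea a bit more tangible, here's a minimal Python sketch of what such records could look like conceptually; the field names are illustrative and not necessarily the actual KCIDB schema:

```python
# Illustrative only: field names are made up for the example and may not
# match the real KCIDB schema.
issue = {
    "id": "example:kernel_panic_on_boot",
    "version": 1,
    "comment": "baseline.login fails with a kernel panic on this board",
}

incidents = [
    # "this issue occurred here"
    {"issue_id": issue["id"], "test_id": "example:test_123", "present": True},
    # "this issue did NOT occur here"
    {"issue_id": issue["id"], "test_id": "example:test_124", "present": False},
]

# A "virtual" incident could be derived automatically from a plain FAIL status,
# so it participates in the analysis without anyone creating it by hand.
def virtual_incident(test_result: dict) -> dict:
    return {
        "issue_id": "virtual:" + test_result["path"] + ":failing",
        "test_id": test_result["id"],
        "present": test_result["status"] == "FAIL",
    }

print(virtual_incident({"id": "example:test_125", "path": "baseline.login",
                        "status": "FAIL"}))
```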
To me, this edge information seems to fall neatly into the "data with increasing precision" category. It's redundant, since it's derived purely from the data already in the database, and so could be inconsistent with test results. However, what matters is that it's eventually consistent, nudged there by the continuous and recursive triaging process. E.g. the triager could decide to triage a particular issue less often if the database says it was fixed a while ago, and more often if its nearest edge is falling (we had a regression, but no recovery yet). We could also have other analysis results, including non-binary ones, e.g. an increase or decrease in failures of a particular flaky test, or a change in a performance metric (not supported by KCIDB yet). I still feel icky about the idea of redundant data and I'll think more about this, but I think storing the analysis results in the main KCIDB database could be really convenient. And don't worry about the database size: it's basically an aggregation of a lot of other data, so it will be considerably smaller than the rest of what we keep there. I don't have a fully-formed idea of how to best implement a schema for this data yet, but I'll keep thinking about it. It's just that it's quite late here already 😂 Hope this helps.
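As a purely illustrative aside (not KCIDB code, and assuming results are already ordered by revision history, which is exactly the hard part), classifying those edges could look something like this:

```python
def find_edges(ordered_statuses):
    """Yield ("regression"|"recovery", index) for each falling/rising edge
    in a revision-ordered sequence of "PASS"/"FAIL" statuses."""
    for i in range(1, len(ordered_statuses)):
        prev, curr = ordered_statuses[i - 1], ordered_statuses[i]
        if prev == "PASS" and curr == "FAIL":
            yield ("regression", i)   # falling edge
        elif prev == "FAIL" and curr == "PASS":
            yield ("recovery", i)     # rising edge

print(list(find_edges(["PASS", "PASS", "FAIL", "FAIL", "PASS"])))
# [('regression', 2), ('recovery', 4)]
```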
-
Thanks for pitching in and giving such detailed insights. Answers below:
I agree, and IMO those two things are compatible. To me, they are at two different abstraction levels: we need regressions in order to do the processing on top. I don't care too much about where the regressions are detected, as long as they end up in KCIDB. The CI systems can do that work themselves and submit the detected regressions, or they can submit only the test results and let KCIDB inspect them to detect the regressions. The base end result (bare, low-level regression objects) should ideally be the same.
As I commented previously, I think these timestamps could be misleading when detecting regressions. Or maybe they can be used as complementary data instead of as the primary ordering key. The reason that comes to mind is that what we're searching for is a breaking point in a test result that's introduced by a commit in a repo. The fact that the commits are ordered in time doesn't mean that the respective test runs will be, so the run timestamps may not reflect the order of the commits.
About regressions and issues
Yes, that definition of a regression that you wrote sounds good to me. Wrt the "dimensions" of the problem as you described them, I like to think in simple terms. All those ideas and dimensions can be implemented iteratively on top of more primitive data, so after we have a general idea of the high-level goal (and I think we do), I'd start from the most basic element and then build the abstractions bottom-up until we get to that goal. That's what we sketched last year in the simple regression tracker that we wrote. This is how I look at the problem and how I'd approach it:
So we don't have to tackle the whole problem as one big chunk; there are too many unknowns. But this looks like a reasonable plan to me, and we can always move in a different direction if the goals change. About the concept of "issues", to me they sound like they're connected to step 4 above: how to profile a test failure. That is, understand what the problem looks like, give it a name and a signature, and be able to compare and match it to other test failures. Is that more or less what you had in mind? This is also a problem that seems orthogonal to the core regression detection: an "issue", if defined like this, depends on a test failure. And then a particular regression can be associated with an "issue" if it's detected to match it (in the end, a regression points to a specific test failure), providing more information to an initial regression object: a "shape" and a known identifier, so we can tell if it happened already, where and when it was fixed previously, etc. (see the sketch of a bare regression object below). The problem is I wouldn't even know where to start with that. I can give you some ideas about how to define the regressions schema, based on the very basic definition that we have right now in KernelCI, although my intention is to do this in a CI-agnostic way. I'll follow up on this conversation during the week.
Other considerations
About storage and other practical considerations, I don't want to go there at this point. I'd rather keep experimenting and coming up with design ideas, and we can figure out those details once things start to take shape. I know next to nothing about production deployments anyway.
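For illustration only, here's a minimal sketch of what such a bare, CI-agnostic regression object might look like; every field name is hypothetical and not an existing KernelCI or KCIDB schema:

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical, illustrative structure; not the actual KernelCI or KCIDB schema.
@dataclass
class Regression:
    test_path: str                  # which test regressed, e.g. "baseline.login"
    last_pass_checkout: str         # last known-good revision
    first_fail_checkout: str        # first known-bad revision
    environment: str                # platform/board the edge was observed on
    issue_id: Optional[str] = None  # filled in later, if it matches a known issue

reg = Regression(
    test_path="baseline.login",
    last_pass_checkout="c1",
    first_fail_checkout="c3",
    environment="some-arm64-board",
)
```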
-
Yes, although having regression detection in KCIDB frees CI systems from implementing it themselves, and makes it possible to leverage cross-CI data.
Yes, that's what I tried to point out as well. As such, it looks like "regression detection" in KCIDB will need to wait until we can traverse the revision history DAG there. What we can add to KCIDB already is "known issue detection", that is, extracting more information from build/test results than just a simple PASS/FAIL and linking particular identified issues to them. In a way, that would simply be more precise than a PASS/FAIL. And so "regression detection" can be applied on top of that, and would be orthogonal to "known issue detection". Basically what you said above:
That's what we've been honing at CKI, and what e.g. Syzbot succeeded at. And it's what the current KCIDB schema is trying to accommodate with its "issues" and "incidents". Now, however, we would have to think about how to accommodate regressions/recoveries as well, so that KernelCI can submit the results of its regression detection. And I would love to hear what schema you have already implemented; please post it! And yes, I absolutely agree that we need to start simple and basic. It was just one more thing I didn't have time to write above 🙈 At the same time I think it's good to lay down a solid theoretical foundation from the start, so we can head in a good direction, even if we start very small. Totally understand about not wanting to think about deployment requirements 😁
-
Yesterday I poked at my DAG FDW PoC some more, and thought once again that perhaps a full extension (rather than an FDW) would be better for this, as we could use native facilities for storage and transaction management. So I went onto PGXN looking for references to see how hard that would be, and rediscovered Apache AGE ("A Graph Extension" for PostgreSQL). It's once again a general-purpose graph database (extension), so performance for our entire dataset is likely going to be abysmal, but perhaps we could stuff something smaller in there. I'll research that. I'll start with the six months of data we already have, fill in the missing commits between them, and see how it fares on my laptop, storing and querying just the connectivity information and leaving the rest in regular tables. If it's still too slow, we could try cutting it down further; one month would already be pretty good, I think. I tried to reach out to the developers to ask how hard it would be to implement DAG optimizations there, but haven't heard from them yet; I'll try again later. Meanwhile, I think we could accept and store revision connectivity data in the regular checkout entries as we go (see the sketch below), so we could recover it later if Apache AGE works but we want to move to another solution, and/or we become able to host more than six months of data. Using it also unfortunately means that we would have to maintain our own PostgreSQL instance, as it's not among the officially-supported extensions on Google Cloud. But that would be the case for our custom FDW as well. @VinceHillier, do you think you could help us implement deployment of our own PostgreSQL instance to Google Cloud, using the current deployment script?
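To illustrate the "store connectivity in the regular checkout entries" idea, a checkout submission could carry its parent hashes; the extra field below is hypothetical and not part of the current schema:

```python
# Hypothetical example: "parent_commit_hashes" does not exist in the current
# KCIDB checkout schema; it only illustrates submitting revision connectivity
# alongside regular checkout data, so the DAG could be rebuilt later in whatever
# storage ends up being used (Apache AGE or something else).
checkout = {
    "id": "example:checkout_1",
    "origin": "example",
    "git_commit_hash": "3a5c0000000000000000000000000000000000aa",
    "parent_commit_hashes": ["9f2e0000000000000000000000000000000000bb"],
}
```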
-
Goals and current status
Now that KernelCI is getting closer to production, most of the design and infrastructure for an initial production spin is stabilizing, and the initial KernelCI - KCIDB interoperability efforts are in place, it's a good moment to start driving the plans to provide higher-level data layers based on the primitive test result data produced by CI systems and submitted to KCIDB.
The goal is to take the initial data (test results, builds, etc.) from a source DB and enable additional processes to extract further information from it (post-processing and analysis), as well as to incorporate information from external sources that enriches this data, and to provide more elaborate results that can help us automate certain actions (triaging, classification) and help users get to the root causes of test failures and keep track of them.
Key topics
Where to detect regressions
Regression detection is based on test results, so a "regression" is, essentially, a higher-level view of a test status where a breaking point was detected, and the first abstraction defined on top of tests.
Currently, there's no convention on how to detect them, which criteria to use, or how they should be defined, so each CI system is probably doing it its own way. Considering that KCIDB will ultimately be the sink for all the test results generated by multiple kernel CI systems, I see two options wrt where to detect regressions:
Option A: detect them in the source CI system
Each CI system detects when a regression has happened in a particular test by checking when a test instance has gone from PASS to FAIL, and fleshes out a "regression" instance that will ultimately be submitted to KCIDB.
Pros:
Cons:
Option B: detect them in KCIDB data
CI systems publish test results to KCIDB, and a process on top of KCIDB analyzes all the collected results to detect regressions.
Pros:
Cons:
Time-ordering and linearity of results in KCIDB
It looks like KCIDB results aren't meant to be processed or interpreted as ordered in time. From a conversation on Slack:
In theory, if we can identify a PASS and a FAIL result for the same test instance (setup, environment, platform, etc) such that the FAIL happened after the PASS, then we can affirm there was a regression somewhere between them.
The question is: how do we identify which test run happened before or after the other? If we're focusing only on regressions caused by kernel patches, then the kernel version / commit date should be the key to use for ordering them.
In order to do this we'd need to extract information from each repo, to be able to order two commits in time.
An advantage of this is that it makes the test run timestamp irrelevant, as it should be in this case: we only care about the result of the test run on a particular kernel version, regardless of when the test ran.
On the other hand, there's a problem: this ignores the rest of the moving parts involved in a test run, namely the kernel configuration and fragments (to fix this we'd have to make sure we're comparing test runs with the same kbuild "signature") and the test code version (a change in the test code may cause an unexpected change in its results under certain circumstances).
Another shortcoming is that this would only be feasible on branches that aren't rebased.
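As a rough, purely illustrative sketch of this approach (the field names and the commit-ordering input are assumptions; the ordering itself would have to come from repo history as described above):

```python
# Illustrative only: field names and the commit ordering are assumptions,
# not KCIDB schema or code.
def signature(result):
    """Group results that are comparable: same test, platform and kbuild config."""
    return (result["test_path"], result["platform"], result["config_name"])

def find_regressions(results, commit_order):
    """commit_order maps a commit hash to its position in (non-rebased) history."""
    by_sig = {}
    for r in results:
        by_sig.setdefault(signature(r), []).append(r)

    regressions = []
    for runs in by_sig.values():
        runs.sort(key=lambda r: commit_order[r["commit"]])
        for prev, curr in zip(runs, runs[1:]):
            if prev["status"] == "PASS" and curr["status"] == "FAIL":
                regressions.append((signature(curr), prev["commit"], curr["commit"]))
    return regressions

results = [
    {"test_path": "t", "platform": "p", "config_name": "defconfig",
     "commit": "c1", "status": "PASS"},
    {"test_path": "t", "platform": "p", "config_name": "defconfig",
     "commit": "c2", "status": "FAIL"},
]
print(find_regressions(results, {"c1": 0, "c2": 1}))
# [(('t', 'p', 'defconfig'), 'c1', 'c2')]
```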
How to model regressions in KCIDB, if necessary
Currently, KCIDB models "builds", "checkouts", "incidents", "issues" and "tests", but AFAICT none of these cover the concept of a regression.
Depending on which direction we decide to go, we may need to model regressions on KCIDB for:
a) detecting regressions based on KCIDB test result data and storing them in KCIDB
or
b) accepting regressions as source data from CI systems
Alternatively, if we're set on detecting regressions based on the test result data in KCIDB but we don't want to alter the KCIDB models, we can define and store the regression info separately, although this may not be as useful, and could bring other problems (keeping this data synced with KCIDB, reduced data cohesion).
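To make options (a) and (b) concrete, here's a hedged sketch of what a new "regression" object could look like; the shape is hypothetical and not a schema proposal:

```python
# Hypothetical shape only, to make the discussion concrete.
regression = {
    "id": "kernelci:regression_42",
    "origin": "kernelci",            # case (b): submitted by the CI that detected it
    "test_path": "baseline.login",
    "last_pass_test_id": "kernelci:test_100",   # references to existing test objects
    "first_fail_test_id": "kernelci:test_101",
    "comment": "first seen on arm64 defconfig",
}
# In case (a), the same object would be produced by a process running over KCIDB
# data instead of being submitted from the outside; the shape could stay the same.
```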
How to extract "issues" from regressions and test failures
This applies both to source CIs, if they are supposed to submit issues, and to KCIDB, if we're going to analyze and process the data from KCIDB itself.
How to "profile" what an issue is from a test failure or a regression? Any current ideas or guidelines?
Where to keep higher-level regression info
Regardless of the approach we follow with regard to regression analysis, where should we integrate the analysis data?
a) In the source DB (source CI system or KCIDB)
Pros:
Cons:
b) In a separate DB
Pros:
Cons: