-
@padovan @spbnick I think this is the right moment to address some topics so we can make progress on the regression analysis side of things. I started this discussion to outline the ideas and issues that I believe should be tackled first, and to brainstorm and plan for the next months. Let me know what you think and contribute to it in any way you want.
-
Thanks for starting this.
-
Thank you for the great summary, and a good description of the problem, @hardboprobot! I'll try to answer everything, but tell me if I missed something.
The plan for KCIDB is to support both options: detection in the CI system and in KCIDB, but I've probably said that a thousand times already, and that alone doesn't get us closer. The idea is to both have enough schema to accept regressions from the outside and to find them ourselves. We can start with accepting them from the KernelCI API, and that's what I've actively worked towards so far (only targeting CKI and Syzbot), as the minimal first step bringing value. However, I'm on @padovan's side, and would say we need to reach for the approach with the most impact and excitement, that is, processing KCIDB data (with all the schema changes that might require). In either case, we really need to define what a "regression" is from your POV before we can be sure we can accommodate it in the schema. I'll give it a shot at the end here.
Results can arrive at KCIDB in any order, even from the same CI system; checkouts, builds, and tests all carry timestamps, however. I'm continuing my work on a PoC for representing/querying large DAGs in PostgreSQL, and I hope we'll eventually have something workable that would let us order revisions properly, but with its current state, and the time I can spare on it, it's unknown when it will be usable. I'm pretty sure it's possible to do, though. However, if we want KCIDB to direct bisections, or simply process bisection results submitted later, ordering by timestamps breaks down, because those results will come with later timestamps. This makes me think maybe I should raise the priority on the DAG PoC and reach for the simplest and quickest solution. If we have that, we can e.g. add a field containing a list of parent revisions to checkouts, so that contributing CI systems could submit relevant revision history, and augment that with a background scanner of known trees running on the KCIDB side, submitting that data in the same way.
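To make the ordering idea a bit more concrete, here's a rough sketch (not KCIDB code; the parent-list data is assumed to come from the hypothetical per-checkout field mentioned above) of how parent-revision lists could be used to decide whether one revision precedes another:

```python
from collections import deque

# Hypothetical adjacency data: commit hash -> list of parent commit hashes,
# assembled from the proposed (not yet existing) parent-revisions field on checkouts.
parents = {
    "c3": ["c2"],
    "c2": ["c1"],
    "c1": [],
    "m1": ["c2", "x1"],  # a merge commit with two parents
    "x1": [],
}

def is_ancestor(ancestor: str, descendant: str) -> bool:
    """Return True if `ancestor` is reachable from `descendant` via parent links."""
    queue = deque([descendant])
    seen = set()
    while queue:
        commit = queue.popleft()
        if commit == ancestor:
            return True
        if commit in seen:
            continue
        seen.add(commit)
        queue.extend(parents.get(commit, []))
    return False

assert is_ancestor("c1", "c3")      # c1 precedes c3
assert not is_ancestor("c3", "c1")  # but not the other way around
```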
This I will need your help with; see below. That said, the KCIDB schema was designed to be quickly changeable while keeping backward compatibility, so we absolutely should be changing it as much as we need. The only criterion I have for what goes into the main KCIDB DB is that it shouldn't be transient data, something that can change completely, but only data that can increase in precision over time. Hope that makes sense. Information about regressions found in test/build results is the increasing-precision kind; the state of the triager itself is perhaps the transient kind. OTOH, the current KCIDB schema allows us to say things like "this test result is being triaged for this issue", so the boundary is a bit fuzzy.
OK, here's where I'll try to guess what you mean by a "regression"; please correct me as necessary. Conceptually, a "regression" is when something was working and then broke. A falling edge, in digital logic terms, if you will. However, in the context of CI and "regression tracking" we should also consider the event of restoring the regressed functionality. Let's call it "recovery" (another possible term could be "restoration"). That would be the rising edge. I guess that by "regression" you mean that falling edge. Right? And your concern is that KCIDB doesn't support recording those edges (although it has rudimentary support for triggering notifications on them). Right? Let's see what we can do about that. But first... there are multiple dimensions to the problem of finding regressions/recoveries:
There's another half-dimension: determining the regression/recovery scope, that is, which targets, peripherals, and environments it reproduces in. Strictly speaking it's not required, if you can re-run tests exactly on the device we first noticed the problem on. But if you can't (which e.g. is a bit of a problem in CKI), an expanded scope would help by giving more hardware to test on, and by admitting results from other CI systems into the equation. And of course maintainers and developers would appreciate it if we gave them at least a hint of the scope.

KCIDB provides an abstraction of an "issue", and its definition could be quite wide. Say, "a description of a problem", whatever that problem is. It could be simply "a specific test is failing". But it could also be "a specific test is failing in a particular way on this board". Or "a specific build is failing". Or "a test is passing when it shouldn't". Or even "a CI system messed up here". And then there's the "incident", which is simply a link between an issue and a build and/or test result, saying "this issue occurred here" or "this issue didn't occur here". Now, for simplicity's sake, imagine we had all these types of issues recorded in the database, along with all their incidents, including "this test has failed" and "this test has not failed", all kinds. Then we could estimate the confidence of each issue occurrence/absence based on historic data, request more testing from CI systems to raise the confidence and fill in the holes in the revision DAG, and finally find a regression/recovery edge. In reality, we would not create issues and incidents for every failing test; we could e.g. implement "virtual" issues and incidents derived automatically from statuses, but nevertheless participating fully in the analysis. See the sketch below.
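To make the issue/incident idea a bit more tangible, here's a minimal Python sketch of what such records could look like conceptually; the field names are illustrative and not necessarily the actual KCIDB schema:

```python
# Illustrative only: field names are made up for the example and may not
# match the real KCIDB schema.
issue = {
    "id": "example:kernel_panic_on_boot",
    "version": 1,
    "comment": "baseline.login fails with a kernel panic on this board",
}

incidents = [
    # "this issue occurred here"
    {"issue_id": issue["id"], "test_id": "example:test_123", "present": True},
    # "this issue did NOT occur here"
    {"issue_id": issue["id"], "test_id": "example:test_124", "present": False},
]

# A "virtual" incident could be derived automatically from a plain FAIL status,
# so it participates in the analysis without anyone creating it by hand.
def virtual_incident(test_result: dict) -> dict:
    return {
        "issue_id": "virtual:" + test_result["path"] + ":failing",
        "test_id": test_result["id"],
        "present": test_result["status"] == "FAIL",
    }

print(virtual_incident({"id": "example:test_125", "path": "baseline.login",
                        "status": "FAIL"}))
```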
To me, this edge information seems to fall neatly into the "data with increasing precision" category. It's redundant, since it's derived purely from the data already in the database, and so could be inconsistent with test results. However, what matters is that it's eventually consistent, nudged there by the continuous and recursive triaging process. E.g. the triager could decide to triage a particular issue less often if the database says it was fixed a while ago, and more often if its nearest edge is falling (we had a regression, but no recovery yet). We could also have other analysis results, including non-binary ones, e.g. an increase or decrease in failures of a particular flaky test, or a change in a performance metric (not supported by KCIDB yet). I still feel icky about the idea of redundant data and I'll think more about this, but I think storing the analysis results in the main KCIDB database could be really convenient. And don't worry about the database size: it's basically an aggregation of a lot of other data, so it will be considerably smaller than the rest of what we keep there. I don't have a fully-formed idea of how to best implement a schema for this data yet, but I'll keep thinking about it. It's just that it's quite late here already 😂 Hope this helps.
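As a purely illustrative aside (not KCIDB code, and assuming results are already ordered by revision history, which is exactly the hard part), classifying those edges could look something like this:

```python
def find_edges(ordered_statuses):
    """Yield ("regression"|"recovery", index) for each falling/rising edge
    in a revision-ordered sequence of "PASS"/"FAIL" statuses."""
    for i in range(1, len(ordered_statuses)):
        prev, curr = ordered_statuses[i - 1], ordered_statuses[i]
        if prev == "PASS" and curr == "FAIL":
            yield ("regression", i)   # falling edge
        elif prev == "FAIL" and curr == "PASS":
            yield ("recovery", i)     # rising edge

print(list(find_edges(["PASS", "PASS", "FAIL", "FAIL", "PASS"])))
# [('regression', 2), ('recovery', 4)]
```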
-
Thanks for pitching in and giving such detailed insights. Answers below:
I agree, and IMO those two things are compatible. To me, they are at two different abstraction levels: we need regressions in order to do the processing on top. I don't care too much about where the regressions are detected, as long as they end up in KCIDB. The CI systems can do that work themselves and submit the detected regressions, or they can submit only the test results and let KCIDB inspect them to detect the regressions. The base end result (bare, low-level regression objects) should ideally be the same.
As I commented previously, I think these timestamps could be misleading when detecting regressions. Or maybe they can be used as complementary data instead of as the primary ordering key. The reason that comes to mind is that what we're searching for is a breaking point in a test result that's introduced by a commit in a repo. The fact that the commits are ordered in time doesn't mean that the respective test runs will be, so the run timestamps may not reflect the order of the commits.
About regressions and issues
Yes, that definition of a regression that you wrote sounds good to me. Wrt the "dimensions" of the problem as you described them, I like to think in simple terms. All those ideas and dimensions can be implemented iteratively on top of more primitive data, so after we have a general idea of the high-level goal (and I think we do), I'd start from the most basic element and then build the abstractions bottom-up until we get to that goal. That's what we sketched last year in the simple regression tracker that we wrote. This is how I look at the problem and how I'd approach it:
So we don't have to tackle the whole problem as one big chunk; there are too many unknowns. But this looks like a reasonable plan to me, and we can always move in a different direction if the goals change. About the concept of "issues", to me they sound like they're connected to step 4 above: how to profile a test failure. That is, understand what the problem looks like, give it a name and a signature, and be able to compare and match it to other test failures. Is that more or less what you had in mind? This is also a problem that seems orthogonal to the core regression detection: an "issue", if defined like this, depends on a test failure. And then a particular regression can be associated with an "issue" if it's detected to match it (in the end, a regression points to a specific test failure), providing more information to an initial regression object: a "shape" and a known identifier, so we can tell if it happened already, where and when it was fixed previously, etc. (see the sketch of a bare regression object below). The problem is I wouldn't even know where to start with that. I can give you some ideas about how to define the regressions schema, based on the very basic definition that we have right now in KernelCI, although my intention is to do this in a CI-agnostic way. I'll follow up on this conversation during the week.
Other considerations
About storage and other practical considerations, I don't want to go there at this point. I'd rather keep experimenting and coming up with design ideas, and we can figure out those details once things start to take shape. I know next to nothing about production deployments anyway.
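For illustration only, here's a minimal sketch of what such a bare, CI-agnostic regression object might look like; every field name is hypothetical and not an existing KernelCI or KCIDB schema:

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical, illustrative structure; not the actual KernelCI or KCIDB schema.
@dataclass
class Regression:
    test_path: str                  # which test regressed, e.g. "baseline.login"
    last_pass_checkout: str         # last known-good revision
    first_fail_checkout: str        # first known-bad revision
    environment: str                # platform/board the edge was observed on
    issue_id: Optional[str] = None  # filled in later, if it matches a known issue

reg = Regression(
    test_path="baseline.login",
    last_pass_checkout="c1",
    first_fail_checkout="c3",
    environment="some-arm64-board",
)
```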
-
Yes, although having regression detection in KCIDB frees CI systems from implementing it themselves, and makes it possible to leverage cross-CI data.
Yes, that's what I tried to point out as well. As such, it looks like "regression detection" in KCIDB will need to wait until we can traverse the revision history DAG there. What we can add to KCIDB already is "known issue detection", that is, extracting more information from build/test results than just a simple PASS/FAIL and linking particular identified issues to them. In a way, that would simply be more precise than a PASS/FAIL. And so "regression detection" can be applied on top of that, and would be orthogonal to "known issue detection". Basically what you said above:
That's what we've been honing at CKI, and what e.g. Syzbot succeeded at. And it's what the current KCIDB schema is trying to accommodate with its "issues" and "incidents". Now, however, we would have to think about how to accommodate regressions/recoveries as well, so that KernelCI can submit the results of its regression detection. And I would love to hear what schema you have already implemented; please post it! And yes, I absolutely agree that we need to start simple and basic. It was just one more thing I didn't have time to write above 🙈 At the same time I think it's good to lay down a solid theoretical foundation from the start, so we can head in a good direction, even if we start very small. Totally understand about not wanting to think about deployment requirements 😁
-
Yesterday I poked at my DAG FDW PoC some more, and thought once again that perhaps a full extension (rather than an FDW) would be better for this, as we could use native facilities for storage and transaction management. So I went onto PGXN looking for references to see how hard that would be, and rediscovered Apache AGE ("A Graph Extension" for PostgreSQL). It's once again a general-purpose graph database (extension), so performance for our entire dataset is likely going to be abysmal, but perhaps we could stuff something smaller in there. I'll research that. I'll start with the six months of data we already have, fill in the missing commits between them, and see how it fares on my laptop, storing and querying just the connectivity information and leaving the rest in regular tables. If it's still too slow, we could try cutting it down further; one month would already be pretty good, I think. I tried to reach out to the developers to ask how hard it would be to implement DAG optimizations there, but haven't heard from them yet; I'll try again later. Meanwhile, I think we could accept and store revision connectivity data in the regular checkout entries as we go (see the sketch below), so we could recover it later if Apache AGE works but we want to move to another solution, and/or we become able to host more than six months of data. Using it also unfortunately means that we would have to maintain our own PostgreSQL instance, as it's not among the officially-supported extensions on Google Cloud. But that would be the case for our custom FDW as well. @VinceHillier, do you think you could help us implement deployment of our own PostgreSQL instance to Google Cloud, using the current deployment script?
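To illustrate the "store connectivity in the regular checkout entries" idea, a checkout submission could carry its parent hashes; the extra field below is hypothetical and not part of the current schema:

```python
# Hypothetical example: "parent_commit_hashes" does not exist in the current
# KCIDB checkout schema; it only illustrates submitting revision connectivity
# alongside regular checkout data, so the DAG could be rebuilt later in whatever
# storage ends up being used (Apache AGE or something else).
checkout = {
    "id": "example:checkout_1",
    "origin": "example",
    "git_commit_hash": "3a5c0000000000000000000000000000000000aa",
    "parent_commit_hashes": ["9f2e0000000000000000000000000000000000bb"],
}
```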
-
Goals and current status
Now that KernelCI is getting closer to production, most of the design and infrastructure for an initial production spin is stabilizing, and the initial KernelCI - KCIDB interoperability efforts are in place, it's a good moment to start driving the plans to provide higher-level data layers based on the primitive test result data produced by CI systems and submitted to KCIDB.
The goal is to take the initial data (test results, builds, etc.) from a source DB and enable additional processes to extract further information from it (post-processing and analysis), as well as to incorporate information from external sources that enriches this data, and to provide more elaborate results that can help us automate certain actions (triaging, classification) and help users get to the root causes of test failures and keep track of them.
Key topics
Where to detect regressions
Regression detection is based on test results, so a "regression" is, essentially, a higher-level view of a test status where a breaking point was detected, and the first abstraction defined on top of tests.
Currently, there's no convention on how to detect them, which criteria to use, or how they should be defined, so each CI system is probably doing it its own way. Considering that KCIDB will ultimately be the sink for all the test results generated by multiple kernel CI systems, I see two options wrt where to detect regressions:
Option A: detect them in the source CI system
Each CI system detects when a regression has happened in a particular test by checking when a test instance has gone from PASS to FAIL, and fleshes out a "regression" instance that will ultimately be submitted to KCIDB.
Pros:
Cons:
Option B: detect them in KCIDB data
CI systems publish test results to KCIDB, and a process on top of KCIDB analyzes all the collected results to detect regressions.
Pros:
Cons:
Time-ordering and linearity of results in KCIDB
It looks like KCIDB results aren't meant to be processed or interpreted as ordered in time. From a conversation on Slack:
In theory, if we can identify a PASS and a FAIL result for the same test instance (setup, environment, platform, etc) such that the FAIL happened after the PASS, then we can affirm there was a regression somewhere between them.
The question is: how do we identify which test run happened before or after the other? If we're focusing only on regressions caused by kernel patches, then the kernel version / commit date should be the key to use for ordering them.
In order to do this we'd need to extract information from each repo, to be able to order two commits in time.
An advantage of this is that it makes the test run timestamp irrelevant, as it should be in this case: we only care about the result of the test run on a particular kernel version, regardless of when the test ran.
On the other hand, there's a problem: this ignores the rest of the moving parts involved in a test run, namely the kernel configuration and fragments (to fix this we'd have to make sure we're comparing test runs with the same kbuild "signature") and the test code version (a change in the test code may cause an unexpected change in its results under certain circumstances).
Another shortcoming is that this would only be feasible on branches that aren't rebased.
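As a rough, purely illustrative sketch of this approach (the field names and the commit-ordering input are assumptions; the ordering itself would have to come from repo history as described above):

```python
# Illustrative only: field names and the commit ordering are assumptions,
# not KCIDB schema or code.
def signature(result):
    """Group results that are comparable: same test, platform and kbuild config."""
    return (result["test_path"], result["platform"], result["config_name"])

def find_regressions(results, commit_order):
    """commit_order maps a commit hash to its position in (non-rebased) history."""
    by_sig = {}
    for r in results:
        by_sig.setdefault(signature(r), []).append(r)

    regressions = []
    for runs in by_sig.values():
        runs.sort(key=lambda r: commit_order[r["commit"]])
        for prev, curr in zip(runs, runs[1:]):
            if prev["status"] == "PASS" and curr["status"] == "FAIL":
                regressions.append((signature(curr), prev["commit"], curr["commit"]))
    return regressions

results = [
    {"test_path": "t", "platform": "p", "config_name": "defconfig",
     "commit": "c1", "status": "PASS"},
    {"test_path": "t", "platform": "p", "config_name": "defconfig",
     "commit": "c2", "status": "FAIL"},
]
print(find_regressions(results, {"c1": 0, "c2": 1}))
# [(('t', 'p', 'defconfig'), 'c1', 'c2')]
```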
How to model regressions in KCIDB, if necessary
Currently, KCIDB models "builds", "checkouts", "incidents", "issues" and "tests", but AFAICT none of these cover the concept of a regression.
Depending on which direction we decide to go, we may need to model regressions on KCIDB for:
a) detecting regressions based on KCIDB test result data and storing them in KCIDB
or
b) accepting regressions as source data from CI systems
Alternatively, if we're set on detecting regressions based on the test result data in KCIDB but we don't want to alter the KCIDB models, we can define and store the regression info separately, although this may not be as useful, and could bring other problems (keeping this data synced with KCIDB, reduced data cohesion).
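To make options (a) and (b) concrete, here's a hedged sketch of what a new "regression" object could look like; the shape is hypothetical and not a schema proposal:

```python
# Hypothetical shape only, to make the discussion concrete.
regression = {
    "id": "kernelci:regression_42",
    "origin": "kernelci",            # case (b): submitted by the CI that detected it
    "test_path": "baseline.login",
    "last_pass_test_id": "kernelci:test_100",   # references to existing test objects
    "first_fail_test_id": "kernelci:test_101",
    "comment": "first seen on arm64 defconfig",
}
# In case (a), the same object would be produced by a process running over KCIDB
# data instead of being submitted from the outside; the shape could stay the same.
```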
How to extract "issues" from regressions and test failures
This applies both to source CIs, if they are supposed to submit issues, and to KCIDB, if we're going to analyze and process the data from KCIDB itself.
How to "profile" what an issue is from a test failure or a regression? Any current ideas or guidelines?
Where to keep higher-level regression info
Regardless of the approach we follow with regard to regression analysis, where should we integrate the analysis data?
a) In the source DB (source CI system or KCIDB)
Pros:
Cons:
b) In a separate DB
Pros:
Cons: