CI runtimes are increasingly becoming a bottleneck for development in RAPIDS. There are numerous reasons for this, including (but not limited to):
An increasingly large matrix as we aim to support more platforms, installation mechanisms, etc.
New libraries that require additional test frameworks.
More tests being added.
In the past, our primary focus has been on reducing the load on our GPU runners because those are in the shortest supply, which in turn has meant more carefully pruning the test matrix (since only test jobs require GPU runners). While this has helped alleviate pressure in the short term, it is clear that we need to take more expansive steps to address the problem comprehensively. Some notes that should guide our thinking:
When thinking through solutions to this problem, we need to consider both the throughput of CI for a single PR and global throughput across all jobs running at all times. Historically we have been reluctant to consider any solution that slows down a single PR (such as reducing the number of parallel jobs running) even if it would reduce global load. I strongly think we need to reconsider this notion: with the current approach, global load means a given PR often ends up waiting for test runners anyway, so by maximizing parallelism on each PR we have in fact slowed down every PR.
While the focus on test jobs makes sense globally because we are almost never bottlenecked on spinning up CPU runners, on a per-PR basis build jobs are also worth accelerating because they substantially affect the end-to-end runtime of each PR's CI pipeline.
This meta-issue aims to catalog a number of the different efforts we could undertake going forward. I have organized solutions into a few different classes.
Tooling
These improvements have a cost to implement, but once implemented will have only positive impacts since they do not involve making any compromises in testing coverage or frequency.
So far we have considered this work blocked on the NVKS migration. Given the difficulties with that migration, perhaps we should look for ways to get it working sooner.
More judicious selection of jobs
These improvements have an implementation cost and a nonzero ongoing maintenance cost to ensure that test coverage remains correct. Done well, they incur no loss in coverage, but getting the selection logic right will require some care (a sketch of one possible approach follows).
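As a concrete illustration only (not an existing RAPIDS mechanism), job selection could key off the files a PR touches, e.g. skipping GPU test jobs for docs-only changes. The path patterns, BASE_REF variable, and RUN_GPU_TESTS flag below are hypothetical:

```shell
# Hypothetical sketch: decide whether a PR needs GPU test jobs at all by
# inspecting which files it changes. Patterns and outputs are illustrative.
BASE_REF="${BASE_REF:-origin/main}"   # assumed to be provided by the CI system
if git diff --name-only "${BASE_REF}...HEAD" | grep -qvE '\.(md|rst|png)$|^docs/'; then
  echo "RUN_GPU_TESTS=true"    # at least one non-docs file changed
else
  echo "RUN_GPU_TESTS=false"   # docs-only change; GPU test jobs can be skipped
fi
```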
Running more jobs only in nightlies
These are easy to implement, but without careful monitoring of nightly results they could have significant costs if issues are only uncovered later.
Only test one Python version in PRs
Only test one architecture (arm or x86) per PR
Given that we have both arm and x86 runners, we could consider a round-robin approach in PRs to get both better coverage and better utilization of available GPU resources (a minimal sketch follows this list).
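A minimal sketch of such a round-robin, assuming the CI environment exposes the PR number; the PR_NUMBER variable and architecture labels are illustrative, not an existing RAPIDS convention:

```shell
# Hypothetical round-robin: alternate the tested architecture by PR number so
# that, across many PRs, both amd64 and arm64 still see regular coverage.
# PR_NUMBER is assumed to be set by the CI system; the name is illustrative.
if [ $(( PR_NUMBER % 2 )) -eq 0 ]; then
  TEST_ARCH="amd64"
else
  TEST_ARCH="arm64"
fi
echo "Selecting ${TEST_ARCH} runners for this PR's test jobs"
```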
Other
Miscellaneous other improvements that will help without being directly focused on improving build times.
Working with library teams to reduce compile times and build sizes. This is going to be an important ongoing task, though likely one the build team should be dedicating little to no solo effort to.
We recently discovered that pytest's traceback handling for xfailed tests is quite expensive; switching to the native traceback mode with --tb=native shaves substantial time (10-20%) off total test suite runs since many of our repos have a large number of xfailed tests (Switch to using native traceback cudf#16851). Similarly, in the past we've observed significant improvements by switching the mode by which pytest-xdist distributes tests to avoid idle workers. There may be other similar optimizations in our pytest usage worth considering.
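For illustration, an invocation combining both tweaks might look like the following; the work-stealing scheduler requires pytest-xdist 3.2 or newer, and the exact flags each repo benefits from may differ:

```shell
# Use pytest's cheaper native traceback formatting and pytest-xdist's
# work-stealing scheduler so workers don't sit idle behind long-running tests.
pytest --tb=native -n auto --dist=worksteal
```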