[META] Make the OpenSearch Release Process more Efficient as well as Accessible and Executable by External Contributors #5171

Open
gaiksaya opened this issue Nov 7, 2024 · 13 comments
@gaiksaya
Member

gaiksaya commented Nov 7, 2024

Overview

As of the 2.18.0 release of OpenSearch and OpenSearch-Dashboards, most release processes are automated individually. This covers almost all major release steps, such as building, assembling, testing, signing and promoting artifacts. We also have a one-step release that publishes the artifacts to all platforms with one click.
However, what we currently lack is the ability to link these automations end-to-end, to streamline communication between stakeholders throughout the process, and to let an external, non-Amazonian community member act as release manager.

This issue describes the current process, its gaps, and an approach to make the release process more efficient and hands-free. It also argues why a 1-click release process is not a feasible option for OpenSearch distribution releases.

What is a 1-click release

The 1-click release terminology comes from a universal release process that was introduced for standalone components in the opensearch-project organization. Check the GitHub issue and the onboarding guide for more details.
TL;DR: When a component is ready to be released, the maintainer of the component repository initiates a release by pushing a tag to the repository. This triggers a two-person review of the release using a GitHub issue. Once approved, a draft release is created on GitHub with the release artifacts attached to it. The draft release triggers the component's Jenkins workflow, which signs and publishes the release to the right platform.

Analysis

OpenSearch and OpenSearch-Dashboards consist of multiple components: core plus plugins. As of 2.18.0, we have 24 backend plugins and 15 front-end plugins that are bundled together to form various distributions.
Compared to the existing 1-click release process, which only involves publishing to the right platform after the product has been tested and validated, the release process for OpenSearch is fairly complicated.
The overall release process involves a series of steps, including building, assembling, thorough testing, and meeting various criteria across multiple components. A fully automated, one-click release process may be challenging to achieve, but a more realistic and valuable goal would be to create a streamlined release process that any external community member can manage with minimal effort. This approach ensures accessibility and efficiency while still maintaining control and quality.

What we have

Below is the list of all the automation we have w.r.t release:

  • A robust build and assembly system for OpenSearch and OpenSearch-Dashboards that supports incremental and continuous builds for the given version, platform, architecture and distribution.
  • A signing system that supports PGP, Windows and macOS signing.
  • End to end integration test framework that tests all (most of) the components bundled in the distributions.
  • Workflow that checks the release notes status for all components and posts the status in the release issue as a comment.
  • Release branch creation workflow
  • Release manifest locking workflow
  • Release tag creation workflow
  • 1-click central release promotion workflow that publishes all the artifacts to all the platforms.
  • A build and test failure notification system via GitHub issue.
  • Thorough documentation of the release process, with each step explained in detail.

What we lack

As per the entrance and exit criteria:

Entrance Criteria

| Entrance Criteria | Automation Status |
|---|---|
| Each component release issue has an assigned owner | Can be tracked in metrics dashboard but release manager needs to go and get the status and update it manually. |
| Documentation draft PRs are up and in tech review for all component changes | We do not have an automated mechanism to check this today. Need to go and check manually. |
| Sanity testing is done for all components | We rely on component teams |
| Code coverage has not decreased (all new code has tests) | Added recently to metrics opensearch-project/opensearch-metrics#90 but lacks comparison as of now |
| Release notes are ready and available for all components | We do have a release notes checker automation in place but a release manager needs to run the workflow and monitor the status via GH comment. |
| Roadmap is up-to-date (information is available to create release highlights) | Has to be manual. |
| Release ticket is cut, and there's a forum post announcing the start of the window | Release ticket cut is automated. Verification of the same is manual via metrics portal. Any kind of communication, forum or otherwise is manual. |
| Any necessary security reviews are complete | Has to be manual. |

Exit Criteria

| Exit Criteria | Automation Status |
|---|---|
| Performance tests are run, results are posted to the release ticket and there are no unexpected regressions | Release manager has to get the performance data from the data-store cluster and post it in the release issue. |
| Documentation has been fully reviewed and signed off by the documentation community | Manual check required |
| All integration tests are passing | We have automated the testing end to end. However, a release manager needs to go and check for all-green status in the metrics dashboards |
| Release blog is ready | @jhmcintyre is looking into the automation for the same |

Overall gaps in the current process:

  • Updating release page on website with release manager and release dates
  • Automatic status updates for above exit and entrance criteria.
  • Automatically merging version increment PRs.
  • BWC testing is not tracked and needs to be fixed or onboarded by multiple components.
  • Automate posting performance test results on the release issue.
  • Re-running flaky integration tests until they pass (keeping some kind of threshold for the number of runs).
  • Automatic way of notifying the developers or community members about the new RC build. This includes commenting on the GH issue, slack notifications (internal as well as external), forum posts if required.
  • Release Candidate needs to be validated. We are skipping this step as of now.
  • Access to run jenkins workflows by external community member.
  • Communication in public channels.
  • PR to update the opensearch-metrics release dashboards ("Close the 2.18.0 release", opensearch-metrics#101) as well as the project website ("Onboard 3.0.0 Release", project-website#3443)
  • PR to update nightly playground
  • Infra flakiness (agent nodes and set up)
  • PGP Signature verification from website
  • Native plugin installation
  • Distribution smoke testing
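
To make the "re-run flaky integration tests until they pass, with a threshold" gap above more concrete, here is a minimal Python sketch. It is only an illustration: `run_with_retries`, `run_suite` and `max_runs` are hypothetical names, and the real integration test workflow would be invoked by whatever mechanism the CI system provides.

```python
from typing import Callable


def run_with_retries(run_suite: Callable[[], bool], max_runs: int = 3) -> tuple[bool, int]:
    """Run a test suite until it passes or the run threshold is reached.

    run_suite is a hypothetical callable that triggers one test run and
    returns True when the run is green. Returns (passed, runs_used).
    """
    if max_runs < 1:
        raise ValueError("max_runs must be at least 1")
    for attempt in range(1, max_runs + 1):
        if run_suite():
            return True, attempt
    # Threshold reached without a green run: report failure so a human can look.
    return False, max_runs


if __name__ == "__main__":
    # Simulated flaky suite: fails twice, then passes.
    outcomes = iter([False, False, True])
    passed, runs = run_with_retries(lambda: next(outcomes), max_runs=3)
    print(f"passed={passed} after {runs} run(s)")
```

Keeping the threshold explicit means a test that never passes is still surfaced as a failure instead of being retried indefinitely.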

Approach

[Diagram: os-release (UML gist)]

The above diagram represents an overview of the approach. By closing the gaps in the current process, the OpenSearch release process can be made more hands-free and efficient. At a high level, the changes can be made in the following phases:

Phase 1: Access to CI system

To facilitate the release process smoothly, the release manager needs to be able to view, run and debug the release workflows. This phase takes care of providing fine-grained access control to release-specific workflows.

  1. Switch Jenkins OIDC login from Amazon federate to GitHub (recommended) or basic auth (not recommended).
  2. Release manager should have access to only release specific workflows. Currently, it is all (admin) or nothing access.
  3. Predefined access timeline. Enable access at the start of the release cycle and revoke release manager access once the release is out.
  4. Detailed SOPs for possible workflow and process failures. There are a lot of "gotchas" in the release process, and handling them relies heavily on experience-based knowledge. We do have documentation explaining each step along the way in the release wiki. What we also need is the same knowledge in the form of actionable SOPs.
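
Points 2 and 3 above (release-specific access within a predefined time window) could be modelled roughly as below. This is a sketch under assumptions, not how Jenkins authorization is actually configured: `ReleaseAccessGrant`, the workflow names and the dates are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class ReleaseAccessGrant:
    """Hypothetical time-boxed grant for one release manager."""
    user: str
    workflows: frozenset[str]  # only release-specific workflows, not admin access
    start: date                # start of the release cycle
    end: date                  # access is revoked after this date

    def allows(self, user: str, workflow: str, on: date) -> bool:
        # Access requires the right user, an allowed workflow, and a date inside the window.
        return (
            user == self.user
            and workflow in self.workflows
            and self.start <= on <= self.end
        )


if __name__ == "__main__":
    # Illustrative values only; workflow names and dates are made up.
    grant = ReleaseAccessGrant(
        user="external-rm",
        workflows=frozenset({"distribution-build", "integ-test"}),
        start=date(2025, 1, 6),
        end=date(2025, 1, 20),
    )
    print(grant.allows("external-rm", "integ-test", date(2025, 1, 10)))   # inside window
    print(grant.allows("external-rm", "admin-config", date(2025, 1, 10))) # not a release workflow
    print(grant.allows("external-rm", "integ-test", date(2025, 2, 1)))    # window closed
```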

Phase 2: Automate existing manual steps

This includes even the smallest steps, such as creating a pull request to update the release manager on the website. The current release documentation is very detailed but contains a lot of information that can easily be missed. Many steps in the release process are minor but important. By automating them, the risk of skipping those steps due to human error can be minimized.

Phase 3: Link the automations end-to-end

With even the smallest automations in place, we need an orchestrator to link them together. It would coordinate and manage the various components and workflows so that they work together in a structured and automated manner. At a high level, this would include coordination between workflows/tasks, dependency management, error handling and recovery, monitoring and reporting, etc.
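
As a rough illustration of the dependency management and error handling mentioned above, the sketch below runs release steps in dependency order and skips anything downstream of a failure. It is not a proposed design: `orchestrate`, the step names and the status strings are hypothetical, and only Python's standard `graphlib` module is used.

```python
from graphlib import TopologicalSorter
from typing import Callable


def orchestrate(
    steps: dict[str, Callable[[], bool]],
    depends_on: dict[str, set[str]],
) -> dict[str, str]:
    """Run steps in dependency order and report a status per step.

    A step is skipped when any of its dependencies did not pass.
    An exception inside a step is recorded as a failure.
    """
    for name, deps in depends_on.items():
        unknown = ({name} | deps) - steps.keys()
        if unknown:
            raise ValueError(f"unknown steps in dependencies: {sorted(unknown)}")

    graph = {name: depends_on.get(name, set()) for name in steps}
    status: dict[str, str] = {}
    for name in TopologicalSorter(graph).static_order():
        if any(status[dep] != "passed" for dep in graph[name]):
            status[name] = "skipped"
            continue
        try:
            status[name] = "passed" if steps[name]() else "failed"
        except Exception:
            status[name] = "failed"
    return status


if __name__ == "__main__":
    # Hypothetical release steps; the test step fails, so signing and promotion are skipped.
    result = orchestrate(
        steps={
            "build": lambda: True,
            "test": lambda: False,
            "sign": lambda: True,
            "promote": lambda: True,
        },
        depends_on={"test": {"build"}, "sign": {"test"}, "promote": {"sign"}},
    )
    for step, state in result.items():
        print(step, state)
```

The returned status map is also a natural input for the monitoring and reporting part, for example by posting it as a comment on the release issue.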

Conclusion

Given the comprehensive nature of the release process, which spans over two weeks, it isn't feasible to have a fully automated, one-click solution from start to finish for OpenSearch and OpenSearch Dashboards. The distribution release process has always been facilitated by the maintainers of this repo. The process can be enhanced by closing the gaps listed above (and more). Keeping in mind OpenSearch's move to the Linux Foundation, it should be possible for any LF member or OpenSearch maintainer to release OpenSearch and OpenSearch-Dashboards in the future.

Next steps

  • Get feedback for the mentioned approach from the community and members of OpenSearch
  • Analyze if we have better solutions than the one mentioned above
  • Forge a detailed implementation plan for each phase with estimates

Edit:

Execution Plan

| Milestones | Features | Known Dependencies | Effort in points | Priority |
|---|---|---|---|---|
| [M1] Access to CI system | Switch Jenkins OIDC from Amazon federate to GitHub | - | 3 | P0 |
| M1 | Fine-grained access control for release workflows | - | 5 | P0 |
| M1 | Short-term CI privileges for non-admin release managers | Release orchestrator | 3 | P1 |
| M1 | Detailed actionable SOPs for possible workflow or process failures | Subject to change as process evolves in next phase | 5 | P0 |
| [M2] Automate the existing manual steps | Update release manager and release issue on the project website as soon as release manager is assigned to the release issue | - | 1 | P0 |
| M2 | Version increments in OpenSearch repo | - | 3 | P0 |
| M2 | Version increments in OpenSearch-Dashboards repo | - | 3 | P0 |
| M2 | Auto-merge version increment PRs in all repos | Automation app integration | - | P0 |
| M2 | Validate release candidate | - | 1 | P0 |
| M2 | Run distribution smoke tests | - | - | P0 |
| M2 | Run BWC tests for all components | Onboarding all components to BWC framework | 1 | P0 |
| M2 | Re-running flaky integration tests until they pass | - | 2 | P0 |
| M2 | Single source of release communication. All comments on the GH release issue should be propagated to the Slack channel so that the RM does not have to post in multiple places | - | 3 | P0 |
| M2 | Auto update entrance criteria status | - | 3 | P1 |
| M2 | Auto update exit criteria status | - | 3 | P1 |
| M2 | Timely notify release manager about release status, entrance/exit criteria | - | 2 | P0 |
| M2 | [Entrance Criteria] Auto assign the component release owner if one is missing | - | 1 | P0 |
| M2 | [Entrance Criteria] Automate checking whether documentation draft PRs are up and in tech review for all component changes. Notify the issue owner if not. | - | 3 | P0 |
| M2 | [Entrance Criteria] Code coverage has not decreased (all new code has tests) | - | 3 | P0 |
| M2 | [Entrance Criteria] Automate checking release notes and notify release owners about missing ones | - | 2 | P0 |
| M2 | [Exit Criteria] Notify maintainers about failing and flaky tests | - | 1 | P0 |
| M2 | [Exit Criteria] Post performance test results on release issue when ready | - | 2 | P0 |
| M2 | [Exit Criteria] Release blog readiness | Release blog automation | - | P0 |
| M2 | [Post] Validate native plugins installation | - | 3 | P0 |
| M2 | [Post] PGP Signature verification and validation from website | - | 2 | P0 |
| M2 | [Post] Update opensearch metrics | - | 1 | P1 |
| M2 | [Post] Update nightly playgrounds | - | 1 | P1 |
| [M3] Link the automations end-to-end | Orchestrator that connects all automations together | - | 5 | P0 |
| M3 | Monitoring and reporting of release workflows | - | 3 | P0 |

@gaiksaya
Member Author

gaiksaya commented Nov 7, 2024

Adding @getsaurabh02 @peterzhuamazon @prudhvigodithi @rishabh6788 @dblock @reta @andrross to get some input.
Thank you!

@dblock
Member

dblock commented Nov 7, 2024

Thank you for the very thorough analysis of the release workflow!

I think we may be mixing two problems:

  1. The release can be managed by a non-Amazon, community member.
  2. The release process lacks automation.

Of course, if you do (2), then (1) becomes easier, but you have shown that the gap is fairly wide. Would it be possible to add a short list of the must-haves for (1)? For example, "Access to run jenkins workflows by external community member." would be required, but "Sanity testing is done for all components" can continue being done the same way as today. In my opinion those must-haves should be Phase 1, with a clear goal of having a community member manage an X.Y version release.

@reta
Contributor

reta commented Nov 7, 2024

Thanks a lot for this very detailed description of the current release process. The suggestions to improve it are sound (and the subjects that @dblock brought up are super relevant). It definitely makes sense to me to start with automating the existing manual steps (we could at least start with OpenSearch Core / Dashboards). There are too many manual approvals (PRs, etc.) at the moment, which definitely contribute to the time and friction of the release.

@prudhvigodithi prudhvigodithi removed the untriaged Issues that have not yet been triaged label Nov 7, 2024
@getsaurabh02
Member

Thanks @gaiksaya, this is a super deep analysis of the release workflow! Excited to see this coming soon.

@rishabh6788
Collaborator

Thank you @gaiksaya for the detailed analysis. I believe it will be super helpful to have a runbook that details which jobs are relevant to releasing an artifact and at which stage they are useful. It should also list the steps and gotchas we have for building, testing and releasing the artifact.
Until we have achieved maximum automation, this will be handy for any community contributor to make sense of our release process and avoid back-and-forth during the release.

@gaiksaya
Member Author

gaiksaya commented Nov 7, 2024

Thank you for the feedback everyone.

@dblock

Would it be possible to add the short list of what the must have's are for "The release can be managed by a non-Amazon, community member."

The must-haves at this point are:

  1. Switch Jenkins OIDC login from Amazon federate to GitHub (recommended) or basic auth (not recommended).
  2. Release manager should have access to only release specific workflows. Currently, it is all (admin) or nothing access.
  3. Predefined access timeline. Enable access at the start of the release cycle and revoke release manager access once the release is out.
  4. Detailed SOPs for possible workflow and process failures. As @rishabh6788 suggested, these should cover the "gotchas" of the release process, which rely heavily on experience-based knowledge. We do have documentation explaining each step along the way in the release wiki. What we also need is the same knowledge in the form of actionable SOPs.
    Keeping in mind the upcoming automations, point 4 is something we would keep iterating on.

If everyone agrees, I can add these must-haves to the Access Control phase and make it Phase 1 as per the suggestion.

Thanks for the reminder @reta about merging PRs. We do have an open issue for auto-merging PRs, opensearch-project/automation-app#37, that needs to be addressed. For code complete and feature freeze, I believe it still depends on the repo maintainers what needs to get in before the mentioned dates. Tracking issues/PRs based on version label can be done, but I am not sure how much overhead that would put on the release manager compared to trusting the maintainers to get the changes in. Regarding opensearch-project/automation-app#37, @peterzhuamazon, do you think this should be moved to automation-app, or should we think of an alternative approach such as automatic-merges workflows?

@reta
Contributor

reta commented Nov 8, 2024

Thanks @gaiksaya !

For code complete and feature freeze I believe they still depend on the repo maintainers what needs to get in before the mentioned dates.

I think we could focus on OpenSearch core first, since in many regards it is a blocker for every non-core component. The way I would envision that (for core only):

  • cut the release branch from 2.x (add manifests, (auto) merge all PRs)
  • bump 2.x version, add BWC, the released one, etc (update manifests, (auto) merge all PRs)
  • update main, add BWC, the released one, etc (update manifests if needed, (auto) merge all PRs)

With core out of the way, all external components are unblocked (at least with respect to dependencies on core). That would tremendously reduce the amount of manual work we do now.

@gaiksaya
Member Author

gaiksaya commented Nov 8, 2024

Thanks @reta. I am trying to avoid going into implementation details, but I believe we do have the mentioned steps semi-automated and just need to link them together. Could you link the current workflows that create those PRs in the core repo?
I will add linking them together as a gap in the issue body.

@peterzhuamazon
Member

I agree with @rishabh6788 that a runbook or SOP would definitely help.
I also hope the steps for external contributors would use existing GitHub mechanics as much as possible.

Thanks.

@peterzhuamazon
Member

peterzhuamazon commented Nov 9, 2024

Thanks for the reminder @reta about merging PRs. We do have an open issue for auto-merging PRs, opensearch-project/automation-app#37, that needs to be addressed. For code complete and feature freeze, I believe it still depends on the repo maintainers what needs to get in before the mentioned dates. Tracking issues/PRs based on version label can be done, but I am not sure how much overhead that would put on the release manager compared to trusting the maintainers to get the changes in. Regarding opensearch-project/automation-app#37, @peterzhuamazon, do you think this should be moved to automation-app, or should we think of an alternative approach such as automatic-merges workflows?

It could be moved to automation-app as a centralized approach.
Running automatic-merges still depends heavily on GitHub Actions in each individual repo.
I would suggest we move to automation-app and do some analysis before starting implementation. Thanks.

@reta
Contributor

reta commented Nov 11, 2024

Thanks @reta. Trying to avoid going in implementation details but I believe we do have the mentioned steps semi-automated

Exactly @gaiksaya, they are semi-automated, but it needs a lot of work to finish this semi-automation. Let me give you just a few examples of where it falls short:

  • when we cut any 2.x release, the newly cut release branch and 2.x keep the same version (but 2.x should be moved to the next one)
  • it takes time for the build manifests for the new release to get merged (even with PRs created), so plugins are blocked
  • plus, the new release needs to be propagated to 2.x and main (BWC). Pull requests are created for all these branches, but 2.x has to get this change first and main can only be handled afterwards, so the checks fail and need to be manually retriggered

The semi-automation definitely helps, but the last mile still has to be done manually. AFAIK, this is managed by opensearch-trigger-bot, not repo workflows.

@prudhvigodithi
Member

  • when we cut any 2.x release, the newly cut release branch and 2.x keep the same version (but 2.x should be moved to the next one)

Currently, plugin version updates for the .x branch, based on the core repository, are automated via the Gradle task updateVersion. We could establish a similar automated approach for the core repositories' BWC updates and version increments. Additionally, we could leverage the Metrics Datastore to manage this process (see ReleaseInputs.java). Once the core repos' version increment is accurate and up to date, today's automation creates the plugin version increment PRs accordingly.

@gaiksaya
Member Author

gaiksaya commented Nov 12, 2024

Thank you! I added an execution plan in the issue body above. Let me know if it makes sense. Will start creating issues for Milestone 1 for now in respective repositories.
cc: @getsaurabh02

@gaiksaya gaiksaya changed the title Make the OpenSearch Release Process Accessible and Executable by External Contributors [META] Make the OpenSearch Release Process Accessible and Executable by External Contributors Nov 13, 2024
@gaiksaya gaiksaya added the Meta label Nov 13, 2024
@gaiksaya gaiksaya changed the title [META] Make the OpenSearch Release Process Accessible and Executable by External Contributors [META] Make the OpenSearch Release Process more Efficient as well as Accessible and Executable by External Contributors Nov 13, 2024