[APM][Hosts] Rollout Default APM to remaining IDAs #573

Closed
16 of 18 tasks
timmc-edx opened this issue Mar 13, 2024 · 1 comment

@timmc-edx
Member

timmc-edx commented Mar 13, 2024

The goal is to be able to turn on instrumentation for all services at once for a limited time period, and then disable it. This is to gather APM host data for the contract.

AC:

  • Ensure as many APM hosts as possible are enabled in DD for an agreed-upon window (TBD with Shannon).
    • Confirm APM Hosts seen in DD.
  • Use New Relic to find data for missing services.
  • Leave some documentation trail

Some implementation steps:

  • Deploy ddtrace-run to Kubernetes IDAs
  • Deploy ddtrace-run to EC2 IDAs
  • See additional todo items down below.

Notes:

  • Communications:
    • Expect a perf hit
    • Disable pymongo integration
  • Goal is to get a coordinated view of all traffic at some point.
  • Need a list of services and infrastructure type.
  • Docs can go under https://2u-internal.atlassian.net/wiki/spaces/SRE/pages/861372422/New+Relic+-+Datadog+Migration
  • Need a planned date for turning on/off.
  • We will do what we can for teams on this to keep this unblocked.
  • It would be great to know how we want to tag the services appropriately.
  • Do we need a separate ticket for Ruby instrumentation? (Completed by SRE.)

Deployment process for Kubernetes web applications:

  • Add ddtrace to the internal dockerfiles (https://github.com/edx/internal-dockerfiles/pull/102 and https://github.com/edx/internal-dockerfiles/pull/103)
    • pip install ddtrace
    • Ensure Dockerfile does USER root before and USER app afterwards, at least if the base Dockerfile is using USER app. Otherwise there will be a permissions error. (The existing pip install newrelic only "works" because newrelic is already installed.)
  • Update the service's Helm chart to use django-ida version 0.8.27 or higher and set datadog.enabled=true in the configs, or use 0.9.0 to enable it by default.
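
The version rule in the last step can be sketched as a small check, which may be handy when filling in the tracking spreadsheet. This is only an illustration: the function and constant names are hypothetical, chart versions are assumed to be plain dotted integers, and it assumes that an explicit datadog.enabled value overrides the 0.9.0 default (the issue does not say so).

```python
def parse_version(version):
    """Turn '0.8.27' into (0, 8, 27) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


# Thresholds taken from the deployment notes above.
MIN_OPT_IN = parse_version("0.8.27")  # first django-ida version that supports datadog.enabled
DEFAULT_ON = parse_version("0.9.0")   # first django-ida version that enables Datadog by default


def datadog_apm_active(chart_version, datadog_enabled=None):
    """Guess whether Datadog APM is on for a service (hypothetical helper).

    chart_version: the django-ida chart version the service uses.
    datadog_enabled: the explicit datadog.enabled value, or None if unset.
    """
    version = parse_version(chart_version)
    if version < MIN_OPT_IN:
        # Chart is too old to know about the datadog.enabled setting.
        return False
    if datadog_enabled is not None:
        # Assumption: an explicit setting wins over the chart default.
        return datadog_enabled
    return version >= DEFAULT_ON


print(datadog_apm_active("0.8.27", datadog_enabled=True))
print(datadog_apm_active("0.9.0"))
```

Comparing parsed tuples rather than raw strings matters here, since a string comparison would sort "0.10.0" before "0.9.0".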

Additional Notes:

  • See and update Tracking spreadsheet
  • Use New Relic list of services to check if migration spreadsheet is complete: See https://onenr.io/0dQeorq8NRe
  • Here is a link for NR hosts per service: https://onenr.io/0BR6Z5bg1RO. You may use this to help prioritize/filter services with low host counts.
  • Find Tim’s past Slack threads communicating changes for example communications, including potential impact to performance, etc.

Containers

  • [Diana] Confirm helm version upgrade status.
  • Confirm auto deployment to Stage, and whether we see data in DD.
  • Add DD links to spreadsheet. Do we need a different link per environment?
  • Coordinate with owners for GoCD deployment to Prod per service.
  • Crons and workers? Where did this go and how far did we get? See Tim’s Slack messages like this one.
  • Ensure Kafka consumer workers have APM. (Note that Kafka Spans will be ticketed separately. This is just about APM Host counts.)

EC2

  • Check status of all EC2 services
  • Confirm DD and get links in spreadsheet per service.

Skipped hosts:

  • For services we are skipping, find relative usage from New Relic and add to spreadsheet.
  • Mark in spreadsheet what will not be enabled
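
One possible way to express "relative usage" for the spreadsheet is each service's share of the total New Relic host count. The sketch below assumes that interpretation; the function name and the service names and counts are made up for illustration and are not real data.

```python
def relative_host_usage(host_counts):
    """Return each service's share of total hosts, as a percentage.

    host_counts maps service name -> host count (e.g. copied from New Relic).
    """
    total = sum(host_counts.values())
    if total == 0:
        # Avoid dividing by zero when no hosts were reported at all.
        return {name: 0.0 for name in host_counts}
    return {
        name: round(100 * count / total, 1)
        for name, count in host_counts.items()
    }


# Hypothetical numbers, not taken from New Relic.
example_counts = {"service-a": 30, "service-b": 15, "service-c": 5}
for name, share in relative_host_usage(example_counts).items():
    print(f"{name}: {share}% of hosts")
```

Services with a small share are the ones where skipping APM would affect the overall host count least.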

Notes for Skipped Hosts:

@timmc-edx timmc-edx converted this from a draft issue Mar 13, 2024
@robrap robrap changed the title Rollout plan for moving remaining IDAs to Datadog Rollout Datadog APM to remaining IDAs Mar 29, 2024
@robrap robrap moved this to Groomed in Arch-BOM Mar 29, 2024
@robrap robrap changed the title Rollout Datadog APM to remaining IDAs Rollout Datadog APM capabiity to remaining IDAs Mar 29, 2024
@robrap robrap changed the title Rollout Datadog APM capabiity to remaining IDAs Rollout Datadog APM to remaining IDAs Mar 29, 2024
@timmc-edx timmc-edx self-assigned this Apr 2, 2024
@timmc-edx timmc-edx moved this from Groomed to In Progress in Arch-BOM Apr 2, 2024
@jristau1984 jristau1984 changed the title Rollout Datadog APM to remaining IDAs Rollout Default APM to remaining IDAs Apr 2, 2024
timmc-edx added a commit to openedx-unsupported/configuration that referenced this issue Apr 3, 2024
This expands the changes in edxapp to other Django services.

See edx/edx-arch-experiments#573
timmc-edx added a commit to openedx-unsupported/configuration that referenced this issue Apr 5, 2024
This expands the changes in edxapp to other Django services.

See edx/edx-arch-experiments#573
@robrap robrap moved this from In Progress to Prioritized in Arch-BOM Apr 9, 2024
@robrap robrap changed the title Rollout Default APM to remaining IDAs [APM][Hosts] Rollout Default APM to remaining IDAs Apr 9, 2024
@robrap robrap assigned dianakhuang and timmc-edx and unassigned timmc-edx Apr 9, 2024
@dianakhuang dianakhuang moved this from Prioritized to In Progress in Arch-BOM Apr 16, 2024
@robrap
Contributor

robrap commented Apr 26, 2024

Done as much as needed for this ticket. Thanks team!

@robrap robrap closed this as completed Apr 26, 2024
@github-project-automation github-project-automation bot moved this from In Progress to Done in Arch-BOM Apr 26, 2024
@jristau1984 jristau1984 moved this from Done to Done - Long Term Storage in Arch-BOM Sep 30, 2024
3 participants