Database status notifications to Slack #9

Open
schradert opened this issue May 24, 2024 · 0 comments
Labels: feature (Implementation of a feature)

Motivation

Continuing the work in #4052: api.couchers.org/status is now live, but it would be more useful to the team if we were notified when the API or database goes down and when they come back up.

Implementation

We can run a containerized Python script alongside our other network tools that periodically queries this URL to check database connectivity. The script itself must be robust: it should catch and log its own errors rather than crash, so that a broken monitor is never mistaken for API downtime.
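
As a rough illustration of the "can't fail" requirement, the polling loop could wrap each check in a broad `try`/`except` and keep going. This is only a sketch: `run_forever`, `interval_seconds`, `max_iterations`, and the injectable `sleep` are hypothetical names introduced here, not part of any existing Couchers tooling.

```python
import logging
import time
from typing import Callable, Optional

logger = logging.getLogger(__name__)


def run_forever(
    check: Callable[[], None],
    interval_seconds: float = 60.0,
    max_iterations: Optional[int] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call `check` repeatedly, logging and swallowing any exception it raises.

    Returns the number of iterations in which `check` raised.
    `max_iterations` and `sleep` are hypothetical knobs that exist only so
    the loop can be exercised in tests; in production both keep their defaults.
    """
    failures = 0
    iteration = 0
    while max_iterations is None or iteration < max_iterations:
        iteration += 1
        try:
            check()
        except Exception:
            # A bug in the checker must not take down the monitor itself.
            failures += 1
            logger.exception("Status check crashed; continuing")
        sleep(interval_seconds)
    return failures
```

With something like this in place, the container entrypoint might simply call `run_forever(run)`, where `run` is a single-shot check such as the one in the script under Inspiration. The container's restart policy would still be worth setting as a second line of defence.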

Inspiration

@aapeliv currently uses the following notification script to detect infrastructure issues, which we can adapt and improve upon:

import logging
from datetime import datetime, timedelta, timezone
from traceback import format_exception

import requests

logger = logging.getLogger(__name__)


FAIL_NOTIFICATION_INTERVAL = timedelta(minutes=5)

RATE_LIMIT_PERIOD = timedelta(minutes=120)
RATE_LIMIT = 12


def now():
    return datetime.now(timezone.utc)


def now_stamp():
    # %Y gives a four-digit year (the lowercase %y would print e.g. "24-05-24").
    return now().strftime("%Y-%m-%d %H:%M:%S UTC")


def gen_alert():
    try:
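        r = requests.get("https://api.couchers.org/status", timeout=5)
        # Treat HTTP error responses as failures before trying to parse the body.
        r.raise_for_status()
        j = r.json()
        # Sanity check: a healthy database should report more than 10k couchers.
        if int(j["coucherCount"]) <= 10_000:
            raise Exception("coucherCount does not exceed 10k")
        logger.info("Couchers API seems OK")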
    except Exception as e:
        traceback = "".join(format_exception(type(e), e, e.__traceback__))
        return f"Couchers API seems down as of {now_stamp()}, traceback:\n\n{traceback}"


notifications = []


def should_rate_limit_notifs():
    global notifications
    cutoff = now() - RATE_LIMIT_PERIOD
    notifications = [n for n in notifications if n >= cutoff]
    # whether to rate limit, whether to send rate limit notif
    return len(notifications) >= RATE_LIMIT, len(notifications) == RATE_LIMIT


def alert(message):
    global notifications

    def _send(msg):
        # TODO: send notification
        pass

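    should_limit, should_inform = should_rate_limit_notifs()
    if not should_limit or should_inform:
        _send(message)
        # Record the send; without this the list stays empty and the
        # rate limit can never trigger.
        notifications.append(now())
        if should_inform:
            _send("Rate limiting notifications.")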


fail_start = None
last_notify = None


def run():
    global last_notify, fail_start
    alert_msg = gen_alert()
    if alert_msg:
        if not fail_start:
            fail_start = now()
        logger.error(alert_msg)
        if not last_notify or now() - last_notify > FAIL_NOTIFICATION_INTERVAL:
            alert(alert_msg)
            last_notify = now()
        else:
            logger.info(f"Last sent message at {last_notify}, so not re-sending yet")
    else:
        if fail_start:
            alert("Couchers API back up")
            last_notify = None
        fail_start = None
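
The `_send` placeholder in the script above still needs an implementation. One possible approach, assuming notifications go through a Slack incoming webhook (which accepts a JSON body with a `text` field), is sketched below using only the standard library. The function names `build_slack_request` and `send_to_slack` are hypothetical, and the webhook URL would have to come from configuration such as an environment variable.

```python
import json
import urllib.request


def build_slack_request(webhook_url: str, text: str) -> urllib.request.Request:
    """Build a POST request carrying a Slack incoming-webhook payload."""
    payload = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        webhook_url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_to_slack(webhook_url: str, text: str, timeout: float = 5.0) -> bool:
    """Return True if the webhook answered 200; never raise.

    Swallowing every exception is deliberate here: a failed notification
    should not crash the monitoring loop.
    """
    try:
        request = build_slack_request(webhook_url, text)
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status == 200
    except Exception:
        return False
```

Inside `alert`, `_send(msg)` could then delegate to `send_to_slack(webhook_url, msg)` and log when it returns `False`, so that delivery failures are visible without stopping the checks.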