Feat/automate stop words list pg #294

Merged
merged 25 commits into from
Jan 14, 2025
3c61774
wip
polomarcus Nov 28, 2024
9e93c89
wip
polomarcus Dec 2, 2024
b427f77
Merge branch 'main' into feat/automate-stop-words-list-pg
polomarcus Dec 2, 2024
38252c7
wip
polomarcus Dec 2, 2024
7b479c4
Merge branch 'main' into feat/automate-stop-words-list-pg
polomarcus Dec 19, 2024
9ca24eb
ci/cd: docker
polomarcus Dec 19, 2024
a006e3d
wip: pytest -k test_get_top_keywords_by_channel
polomarcus Dec 19, 2024
e79b32c
wip
polomarcus Jan 6, 2025
9bbb953
Merge branch 'main' into feat/automate-stop-words-list-pg
polomarcus Jan 6, 2025
7695ba9
wip
polomarcus Jan 7, 2025
9b035e1
Merge branch 'main' into feat/automate-stop-words-list-pg
polomarcus Jan 7, 2025
a6f3394
Merge branch 'main' into feat/automate-stop-words-list-pg
polomarcus Jan 7, 2025
c827652
wip
polomarcus Jan 7, 2025
ce9943f
wip: test_get_top_keywords_by_channel done
polomarcus Jan 7, 2025
9cfb184
wip: context - utf8 issue
polomarcus Jan 8, 2025
71cd581
wip: fix utf8 issue - need to improve substring case
polomarcus Jan 8, 2025
a620bbd
wip: slq queries done, need to code upsert
polomarcus Jan 9, 2025
0a7e2d3
wip: get context and save it - need to tests edge cases
polomarcus Jan 9, 2025
92a5a32
ci: test stop words separately
polomarcus Jan 13, 2025
876cd26
ci: test stop words separately
polomarcus Jan 13, 2025
8f976d4
test: api import drop tables to separate test with stop words
polomarcus Jan 13, 2025
74a45cd
feat: data quality: add keyword id, start date
polomarcus Jan 14, 2025
221bc41
fix: test ci
polomarcus Jan 14, 2025
bf3e903
doc: unvalidated a stop word
polomarcus Jan 14, 2025
850f561
cd: stop words typo
polomarcus Jan 14, 2025
33 changes: 23 additions & 10 deletions .github/workflows/deploy-main.yml
@@ -55,33 +55,46 @@ jobs:
- name: Push mediatree_import Image
run: docker push --all-tags ${{ secrets.CONTAINER_REGISTRY_ENDPOINT }}/mediatree_import

# Not used anymore
# - name: Build ingest_to_db image
# run: docker build -f Dockerfile_ingest . -t ${{ secrets.CONTAINER_REGISTRY_ENDPOINT }}/ingest_to_db
# - name: Push ingest_to_db Image
# run: docker push ${{ secrets.CONTAINER_REGISTRY_ENDPOINT }}/ingest_to_db

- name: update scaleway job definition with version mediatree_import
uses: jawher/[email protected]
env:
SCW_ACCESS_KEY: ${{ secrets.SCW_ACCESS_KEY }}
SCW_SECRET_KEY: ${{ secrets.SCW_SECRET_KEY }}
SCW_ORGANIZATION_ID: ${{ secrets.SCW_ORGANIZATION_ID }}
SCW_ZONE: ${{ secrets.SCW_ZONE }}
with:
args: jobs definition update ${{ secrets.SCALEWAY_JOB_IMPORT_ID }} image-uri=${{ secrets.CONTAINER_REGISTRY_ENDPOINT }}/mediatree_import:${{ env.PROJECT_VERSION }}

- name: Build s3 image
run: docker build -f Dockerfile_api_to_s3 . -t ${{ secrets.CONTAINER_REGISTRY_ENDPOINT }}/s3:${{ env.PROJECT_VERSION }}
- name: Tag s3 latest image
run: docker tag ${{ secrets.CONTAINER_REGISTRY_ENDPOINT }}/s3:${{ env.PROJECT_VERSION }} ${{ secrets.CONTAINER_REGISTRY_ENDPOINT }}/s3:latest
- name: Push s3 Image
run: docker push ${{ secrets.CONTAINER_REGISTRY_ENDPOINT }}/s3:${{ env.PROJECT_VERSION }}
- name: update scaleway job definition with version mediatree_import

- name: update scaleway job definition with version s3
uses: jawher/[email protected]
env:
SCW_ACCESS_KEY: ${{ secrets.SCW_ACCESS_KEY }}
SCW_SECRET_KEY: ${{ secrets.SCW_SECRET_KEY }}
SCW_ORGANIZATION_ID: ${{ secrets.SCW_ORGANIZATION_ID }}
SCW_ZONE: ${{ secrets.SCW_ZONE }}
with:
args: jobs definition update ${{ secrets.SCALEWAY_JOB_IMPORT_ID }} image-uri=${{ secrets.CONTAINER_REGISTRY_ENDPOINT }}/mediatree_import:${{ env.PROJECT_VERSION }}
- name: update scaleway job definition with version s3
args: jobs definition update ${{ secrets.SCALEWAY_JOB_S3_ID }} image-uri=${{ secrets.CONTAINER_REGISTRY_ENDPOINT }}/s3:${{ env.PROJECT_VERSION }}

- name: Build stop_word image
run: docker build -f Dockerfile_stop_word . -t ${{ secrets.CONTAINER_REGISTRY_ENDPOINT }}/stop_word:${{ env.PROJECT_VERSION }}
- name: Tag stop_word latest image
run: docker tag ${{ secrets.CONTAINER_REGISTRY_ENDPOINT }}/stop_word:${{ env.PROJECT_VERSION }} ${{ secrets.CONTAINER_REGISTRY_ENDPOINT }}/stop_word:latest
- name: Push stop_word Image
run: docker push ${{ secrets.CONTAINER_REGISTRY_ENDPOINT }}/stop_word:${{ env.PROJECT_VERSION }}

- name: update scaleway job definition with version stopwords
uses: jawher/[email protected]
env:
SCW_ACCESS_KEY: ${{ secrets.SCW_ACCESS_KEY }}
SCW_SECRET_KEY: ${{ secrets.SCW_SECRET_KEY }}
SCW_ORGANIZATION_ID: ${{ secrets.SCW_ORGANIZATION_ID }}
SCW_ZONE: ${{ secrets.SCW_ZONE }}
with:
args: jobs definition update ${{ secrets.SCALEWAY_JOB_S3_ID }} image-uri=${{ secrets.CONTAINER_REGISTRY_ENDPOINT }}/s3:${{ env.PROJECT_VERSION }}
args: jobs definition update ${{ secrets.SCALEWAY_STOP_WORDS_ID }} image-uri=${{ secrets.CONTAINER_REGISTRY_ENDPOINT }}/stop_word:${{ env.PROJECT_VERSION }}
57 changes: 55 additions & 2 deletions .github/workflows/test.yml
@@ -90,6 +90,59 @@ jobs:
POSTGRES_PORT: 5432
COMPARE_DURATION: "true"

stop_word:
needs: build
runs-on: ubuntu-latest
services:
postgres:
image: postgres:15
env:
POSTGRES_PASSWORD: postgres
POSTGRES_USER: user
POSTGRES_DB: postgres
options: >-
--health-cmd pg_isready
--health-interval 10s
--health-timeout 5s
--health-retries 5
ports:
- 5432:5432
steps:
- uses: actions/checkout@v4
- uses: actions/setup-python@v5
id: setup-python
with:
python-version: ${{ env.PYTHON_VERSION }}
- name: Install Poetry
uses: snok/install-poetry@v1
with:
version: ${{ env.POETRY_VERSION }}
virtualenvs-create: true
virtualenvs-in-project: true
virtualenvs-path: .venv
installer-parallel: true
- name: Load cached venv
id: cached-poetry-dependencies
uses: actions/cache@v4
with:
path: .venv
key: venv-${{ runner.os }}-${{ steps.setup-python.outputs.python-version }}-${{ hashFiles('**/poetry.lock') }}

- name: pytest run stop_word
run: |
set -o pipefail
source .venv/bin/activate
poetry run pytest -k 'stop_word'
env:
ENV: dev
POSTGRES_USER: user
POSTGRES_DB: postgres
POSTGRES_PASSWORD: postgres
POSTGRES_HOST: localhost
POSTGRES_PORT: 5432
COMPARE_DURATION: "true"


test_everything_else:
needs: build
runs-on: ubuntu-latest
@@ -134,7 +187,7 @@ jobs:
run: |
set -o pipefail
source .venv/bin/activate
poetry run pytest -k 'not test_update_pg_keywords' --junitxml=pytest.xml \
poetry run pytest -k 'not test_update_pg_keywords and not stop_word' --junitxml=pytest.xml \
--cov-report=term-missing:skip-covered \
--cov=quotaclimat --cov=postgres test/ | \
tee pytest-coverage.txt
@@ -184,7 +237,7 @@ jobs:
run: poetry check --lock

coverage_and_reporting:
needs: [test_first_update_keywords, test_everything_else, verify_poetry_lock]
needs: [test_first_update_keywords, test_everything_else, stop_word, verify_poetry_lock]
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
46 changes: 46 additions & 0 deletions Dockerfile_stop_word
@@ -0,0 +1,46 @@
#from https://medium.com/@albertazzir/blazing-fast-python-docker-builds-with-poetry-a78a66f5aed0
FROM python:3.12.7 as builder

ENV VIRTUAL_ENV=/app/.venv

ENV POETRY_NO_INTERACTION=1 \
POETRY_VIRTUALENVS_IN_PROJECT=1 \
POETRY_VIRTUALENVS_CREATE=1 \
POETRY_CACHE_DIR=/tmp/poetry_cache

WORKDIR /app

COPY pyproject.toml poetry.lock ./

RUN pip install poetry==1.8.3

RUN poetry install

# The runtime image, used to just run the code provided its virtual environment
FROM python:3.12.7-slim as runtime

WORKDIR /app

ENV VIRTUAL_ENV=/app/.venv
ENV PATH="/app/.venv/bin:$PATH"
ENV PATH="$PYENV_ROOT/bin:$PATH"
ENV PYTHONPATH=/app

COPY --from=builder ${VIRTUAL_ENV} ${VIRTUAL_ENV}

# App code is included with docker-compose as well
COPY quotaclimat ./quotaclimat
COPY postgres ./postgres
COPY pyproject.toml pyproject.toml
COPY alembic/ ./alembic
COPY alembic.ini ./alembic.ini
COPY transform_program.py ./transform_program.py

# healthcheck
EXPOSE 5050

# Use a separate script to handle migrations and start the application
COPY docker-entrypoint_stop_word.sh ./docker-entrypoint_stop_word.sh
RUN chmod +x ./docker-entrypoint_stop_word.sh

ENTRYPOINT ["./docker-entrypoint_stop_word.sh"]
23 changes: 22 additions & 1 deletion README.md
@@ -297,7 +297,7 @@ We should use env variable `UPDATE` like in docker compose (should be set to "t

In order to see actual change in the local DB, run the test first `docker compose up test` and then these commands :
```
docker exec -ti quotaclimat-postgres_db-1 bash
docker exec -ti quotaclimat-postgres_db-1 bash # or docker compose exec postgres_db bash
psql -h localhost --port 5432 -d barometre -U user
--> enter password : password
UPDATE keywords set number_of_keywords=1000 WHERE id = '71b8126a50c1ed2e5cb1eab00e4481c33587db478472c2c0e74325abb872bef6';
@@ -395,6 +395,27 @@ Env variable used :
* BUCKET_SECRET : Scaleway Secret key
* BUCKET_NAME

# Stop words
To keep advertising keywords from inflating the statistics, we remove stop words based on how many times a keyword is said in the same context.

The result is saved in the PostgreSQL table `stop_word`.

The "mediatree" service reads this table and removes the stop words from the "plaintext" field so that they are not counted.
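
As a rough illustration of that last step, the sketch below strips known stop-word contexts from a `plaintext` string. The helper name `remove_stop_words` and the idea that each stop word is stored as a plain substring are assumptions made for this example; the actual mediatree code may work differently.

```python
import re


def remove_stop_words(plaintext: str, stop_words: list[str]) -> str:
    """Return plaintext with every stop-word context removed (hypothetical helper)."""
    # Remove longer contexts first so a short context cannot break up a longer one.
    for context in sorted(stop_words, key=len, reverse=True):
        plaintext = plaintext.replace(context, " ")
    # Collapse the whitespace left behind by the removals.
    return re.sub(r"\s+", " ", plaintext).strip()


text = "le climat change avec notre partenaire climat energie ce soir"
print(remove_stop_words(text, ["avec notre partenaire climat energie"]))
```

Once the advertising context is removed, the keyword "climat" would be counted once here instead of twice.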

Env variables used :
* START_DATE (integer): Unix timestamp, as in the mediatree service
* NUMBER_OF_PREVIOUS_DAYS (integer): default 7 days
* MIN_REPETITION (integer): default 15, the minimum number of repetitions for a stop word
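
A minimal sketch of how these variables could drive the detection, assuming a context becomes a stop-word candidate once it is repeated at least `MIN_REPETITION` times. The function names `load_settings` and `find_stop_word_candidates` are hypothetical and only the defaults (7 and 15) come from the list above.

```python
import os
from collections import Counter


def load_settings(env=os.environ) -> dict:
    """Read the three variables, falling back to the documented defaults."""
    start_date = env.get("START_DATE")  # unset means "no explicit start date" in this sketch
    return {
        "start_date": int(start_date) if start_date else None,
        "number_of_previous_days": int(env.get("NUMBER_OF_PREVIOUS_DAYS", 7)),
        "min_repetition": int(env.get("MIN_REPETITION", 15)),
    }


def find_stop_word_candidates(contexts: list[str], min_repetition: int) -> list[str]:
    """Return contexts seen at least min_repetition times (assumed selection rule)."""
    counts = Counter(contexts)
    return sorted(c for c, n in counts.items() if n >= min_repetition)


settings = load_settings({"MIN_REPETITION": "3"})
contexts = ["offre climat energie"] * 3 + ["le climat change"]
print(find_stop_word_candidates(contexts, settings["min_repetition"]))
```

Passing a plain dict instead of `os.environ` keeps the example easy to try without exporting variables.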

## Remove a stop word
To remove a false positive, set its `validated` attribute to false:
```
docker exec -ti quotaclimat-postgres_db-1 bash # or docker compose exec postgres_db bash
psql -h localhost --port 5432 -d barometre -U user
--> enter password : password
UPDATE stop_word set validated=false WHERE id = 'MY_ID';
```
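
The same update can be scripted. The sketch below uses an in-memory SQLite database as a stand-in for the PostgreSQL `stop_word` table so that it runs anywhere; the `context` column and the helper name `unvalidate_stop_word` are assumptions made for the example, not the project's schema or API.

```python
import sqlite3

# In-memory SQLite stands in for the PostgreSQL `stop_word` table (assumption for the demo).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stop_word (id TEXT PRIMARY KEY, context TEXT, validated BOOLEAN)")
conn.execute("INSERT INTO stop_word VALUES ('MY_ID', 'offre climat energie', 1)")


def unvalidate_stop_word(conn, stop_word_id: str) -> int:
    """Set validated to false for one row and return the number of rows matched."""
    cursor = conn.execute(
        "UPDATE stop_word SET validated = 0 WHERE id = ?", (stop_word_id,)
    )
    conn.commit()
    return cursor.rowcount


print(unvalidate_stop_word(conn, "MY_ID"))
```

Using a bound parameter for the id avoids pasting the value into the SQL string by hand.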

## Production monitoring
* Use scaleway
* Use [Ray dashboard] on port 8265
35 changes: 34 additions & 1 deletion docker-compose.yml
@@ -75,6 +75,7 @@ services:
postgres_db:
condition: service_healthy


nginxtest: # to test locally webpages
container_name: nginxtest
image: nginx:latest
@@ -117,6 +118,38 @@ services:
postgres_db:
condition: service_healthy


stop_word:
ports:
- 5002:5002
build:
context: ./
dockerfile: Dockerfile_stop_word
#entrypoint: ["sleep", "1200"] # use to debug the container if needed
environment:
ENV: docker # change me to prod for real cases
LOGLEVEL: INFO # change me (debug, info, warning, error); higher levels produce fewer logs
PYTHONPATH: /app
POSTGRES_USER: user
POSTGRES_DB: barometre
POSTGRES_PASSWORD: password
POSTGRES_HOST: postgres_db
POSTGRES_PORT: 5432
PORT: 5000
HEALTHCHECK_SERVER: "0.0.0.0"
# NUMBER_OF_PREVIOUS_DAYS: 30
# MIN_REPETITION: 15
# START_DATE: 1731683697
volumes:
- ./quotaclimat/:/app/quotaclimat/
- ./postgres/:/app/postgres/
- ./test/:/app/test/
depends_on:
nginxtest:
condition: service_healthy
postgres_db:
condition: service_healthy

postgres_db:
image: postgres:15
ports:
@@ -259,4 +292,4 @@ secrets: # https://docs.docker.com/compose/use-secrets/
bucket:
file: secrets/scw_bucket.txt
bucket_secret:
file: secrets/scw_bucket_secret.txt
file: secrets/scw_bucket_secret.txt
14 changes: 14 additions & 0 deletions docker-entrypoint_stop_word.sh
@@ -0,0 +1,14 @@
#!/bin/bash

# Run migrations before starting the application
echo "Running migrations with alembic if exists"
poetry run alembic upgrade head

if [[ $? -eq 0 ]]; then
echo "Command succeeded"
else
echo "Command failed"
fi

echo "starting stop_word import app"
python quotaclimat/data_processing/mediatree/stop_word/main.py
16 changes: 15 additions & 1 deletion poetry.lock

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

3 changes: 2 additions & 1 deletion postgres/insert_data.py
@@ -5,7 +5,7 @@
from sqlalchemy import DateTime
from sqlalchemy.dialects.postgresql import insert
from sqlalchemy import JSON
from postgres.schemas.models import sitemap_table, Keywords
from postgres.schemas.models import sitemap_table, Keywords, Stop_Word

def clean_data(df: pd.DataFrame):
df = df.drop_duplicates(subset="id")
@@ -37,6 +37,7 @@ def show_sitemaps_dataframe(df: pd.DataFrame):
except Exception as err:
logging.warning("Could show sitemap before saving : \n %s \n %s" % (err, df.head(1).to_string()))


def save_to_pg(df, table, conn):
number_of_elements = len(df)
logging.info(f"Saving {number_of_elements} elements to PG table '{table}'")