CODiS Crawler is an open source web crawler. This package can help you to download the data from CODiS platform. You may download the data via selenium
-based or httpx
-based approach.
pip install -r requirements.txt
- Editable mode
pip install -e .
- Install to system
pip install build
python -m build
python -m pip install dist/codis_crawler-0.1.0-py3-none-any.whl
Currently, we provide a command to install the chrome web driver.
python -m codis_crawler install-driver
Due to the network latency and other issues, sometime the element will not appears as expected. If there is an error occur, try to execute the script again.
When using this approach, the script will interact with CODiS via web driver and selenium. The script will input the informations, open the popup dashboard, then click the download button. A csv
file will be downlaoded if the script executed without error.
- current date
python -m codis_crawler selenium-crawler --station 467490
- Given a specific date
python -m codis_crawler selenium-crawler --station 467490 --date 2023-10-01
python -m codis_crawler selenium-crawler --station 467490 --headless
When using httpx-based approach, target date must be provided.
python -m codis_crawler httpx-crawler --station 467490 --date <target date>
The script will download a json from the CODiS API. The JSON file will be transformed in to a dataframe and export as a CSV file through a parse function.
We provide an alternative way for selenium-based approch: execute via remote chrome executor. To use this method, you must install docker and run a selenium/standalone-chrome
container as a remote executor.
To get your file, you must mount a folder into the container. Create a folder and change its permission before start a container:
mkdir /home/ubuntu/files
chown 1200:1201 /home/ubuntu/files
Then start a container using the following command:
docker run -d -p 4444:4444 --shm-size="2g" -v /home/ubuntu/files:/home/seluser/Downloads selenium/standalone-chrome:4.18.1-20240224
Please make sure you change the permission, otherwise the file cannot be writed into the correct place.
Run the script with the following command for remote executor:
python -m codis_crawler selenium-crawler --station 467490 --mode remote --url http://127.0.0.1/wd/hub
For more information, please visit the selenium repository.