Biggie is a tool dedicated to:
- quickly getting data from (external) APIs into either a Mongo or Postgres database
- exposing it and making it searchable via a dedicated API
- investigating it from a Jupyter Lab server
It is a Docker Compose setup including:
- an orchestrator container based on Celery (including Beat) and asyncio
- a Spark cluster made of a master and 2 worker containers
- an API container based on FastAPI
- 2 containers for the MongoDB and PostgreSQL databases
- 3 containers for Flower, Mongo-Express and DBeaver, for monitoring purposes
- a Jupyter Lab container to play around with all this
It is currently set up to fetch data from the GitHub Events API, stream it into Mongo and expose/analyse it via several dedicated endpoints. It was fetching data from the Marvel API until version 0.4.0.
The repository itself is based on the 'Papel' repository.
- Installation
- Run
- Teleport and Terraform
- Nginx deployment
- Local URLs
- Default Ports
- Development
- Contribute
You have to create the `.env` environment file and use/create a GitHub token to enable the test build sequence.
[wip] Optionally tweak the schedule parameter for the cleaning task (see the "Data streaming" section below).
If you plan to use the same GitHub Actions CI file, you need to create the same secrets as in the `jobs > env` section of `.github/workflows/docker-ci.yaml` (see line 31).
NB:
- For every file embedding secrets, you'll find a matching `<file>.example` ready to adapt (see the sketch below).
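As a minimal sketch of this setup, assuming the environment example file is named `.env.example` and that you use the GitHub CLI for the Actions secrets (the secret name below is a placeholder; use the names your workflow file actually expects):

```bash
# Copy the provided example file and fill in your values (incl. the GitHub token)
cp .env.example .env

# Create the CI secrets listed under jobs > env in docker-ci.yaml
# GH_TOKEN is a placeholder name; use the secret names from the workflow file
gh secret set GH_TOKEN --body "<your-github-token>"
```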
Run the test containers with either:

```bash
docker compose up api-test orchestrator-test
```

or

```bash
docker compose --profile test up
```
Start the production orchestrator with:

```bash
docker compose up orchestrator-prod
```

This command will spin up the Orchestrator container and:
- download all required data and save it as local files
- read these files and load Mongo with the relevant data
- delete all local files once their data is successfully in Mongo
These tasks are scheduled every minute with a crontab setting, and a custom parameter is implemented to schedule the cleaning step separately while keeping it in sync with the rest of the chain. See `kwargs={"wait_minutes": 30}` in the `github_events_stream` schedule in `.../tasks/schedules.py`.
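To watch the chain run, you can follow the orchestrator logs (the service name comes from the command above):

```bash
# Tail the logs to see the scheduled tasks fire every minute
docker compose logs -f orchestrator-prod
```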
For data acquisition only:

```bash
docker compose \
  -f compose.yaml \
  --profile prod_acquisition \
  up
```
To include the monitoring containers, just pass the monitoring configuration with any profile or container command: `-f compose.monitoring.yaml`. For data acquisition, this will spin up both the Mongo-Express and Flower containers alongside the production containers.
Monitoring containers only:

```bash
docker compose -f compose.yaml -f compose.monitoring.yaml --profile monitoring up
```

Data acquisition with monitoring:

```bash
docker compose \
  -f compose.yaml \
  -f compose.monitoring.yaml \
  --profile prod_acquisition \
  up
```
To have just the FastAPI container up:

```bash
docker compose \
  -f compose.yaml \
  -f compose.monitoring.yaml \
  --profile prod_analytics \
  up
```
Spin up the whole shebang with:

```bash
docker compose \
  -f compose.yaml \
  -f compose.monitoring.yaml \
  --profile prod_full \
  up
```
The repository includes a Terraform configuration, currently tuned to work with Scaleway. It prepares and checks the plan, then connects to the server via Teleport (which must be installed and set up on the host machine) and deploys the repository with the secrets stored in GitHub Actions. This is all handled via the CD pipeline.
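A rough sketch of what the pipeline automates, shown as manual commands and assuming a standard Teleport and Terraform CLI setup (the proxy address is a placeholder; the real values live in the CD configuration):

```bash
# Authenticate against the Teleport proxy (placeholder address)
tsh login --proxy=teleport.example.com

# Prepare and check the plan, then deploy
terraform init
terraform plan -out=tfplan
terraform apply tfplan
```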
In this configuration, you need to have the necessary sub-domains set up on your domain provider's side. You also need:
- Nginx installed on the host machine
- a certificate generated by `certbot` without any changes to the Nginx configuration (see the certbot documentation):

```bash
sudo certbot certonly --nginx   # example command for Ubuntu 24
```
Then create the required files and change the `volumes` paths accordingly in the compose files.
The Nginx configuration files are:
- `nginx/certificate.conf`
- `nginx/containers.conf`
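Once these files are in place, a standard check-and-reload sequence applies (generic Nginx commands, not specific to this repository):

```bash
# Validate the configuration, then reload Nginx to pick it up
sudo nginx -t
sudo systemctl reload nginx
```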
The available endpoints are:
- List of PR event counts per repo; default limit of 50 repos
- PR event count for a given repo
- Total event counts for all gathered repos, with an offset in minutes from now; default offset is 0 minutes
- Total event counts for a given repo, with an offset; same default as above
- Average PR delta for a given repository
- Timeline of PR deltas for a given repository (dataviz); default size is 3
- Dashboard (list of counts for the 50 most active repos)
- Detailed UI for a given repo, based on several of the endpoints and dataviz above
NB: The repository name parameter takes the full repository name including the actor name, such as `pierrz/biggie`.
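As a usage sketch only: the routes below are hypothetical placeholders (check the FastAPI app, or its `/docs` page, for the real ones); only the default port 8100 and the full-name format `pierrz/biggie` come from this README:

```bash
# Hypothetical route names; see the FastAPI /docs page for the real ones
curl "http://localhost:8100/events/pr-count/pierrz/biggie"
curl "http://localhost:8100/events/counts?offset=30"
```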
Monitoring
For all the services, these are the default ports (backend/UI):
- Mongo: 27017
- Postgres: 5432
- JupyterLab: 8888
- FastAPI: 8100
- Celery: 5678
- RabbitMQ: 5672/15672
- Spark:
  - master: 7077/8080
  - worker: 8081
- Flower: 5555
- Mongo-Express: 8081
- DBeaver: 8978
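As a quick sanity check once the stack is up, FastAPI serves its interactive documentation at `/docs` by default:

```bash
# Confirm the API container answers on its default port
curl -s http://localhost:8100/docs
```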
If you want to make some changes in this repo while following the same environment tooling, you can run the following commands from the root directory:

```bash
poetry config virtualenvs.in-project true
poetry install && poetry shell
pre-commit install
```
To change the code of the core containers, you need to `cd` to the related directory and either:
- run `poetry update` to simply install the required dependencies (see the example below)
- run the previous commands to create a dedicated virtualenv
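For example (`orchestrator` is a hypothetical directory name; use the directory of the container you are changing):

```bash
cd orchestrator   # hypothetical: pick the core container you are working on
poetry update     # install the required dependencies
```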
You can always propose a PR based on a TODO; just don't forget to update the release version, which you can find in `ci.yaml` and all the `pyproject.toml` files.