biggie

Jupyter, (Py)Spark and FastAPI toolkit

Biggie is a tool dedicated to:

  • quickly get data from (external) APIs into either a MongoDB or PostgreSQL database
  • expose that data and make it searchable via a dedicated API
  • investigate it from a Jupyter Lab server

It is a Docker compose setup including:

  • an orchestrator container based on Celery (including Beat) and asyncio
  • a Spark cluster made of a master and 2 workers containers
  • an API container based on FastAPI
  • 2 containers for the MongoDB and PostgreSQL databases
  • 3 containers for Flower, Mongo-Express and DBeaver, for monitoring purposes
  • a Jupyter Lab container to play around with all this

It is currently set up to fetch data from the GitHub Events API, stream it into Mongo and expose it for analysis via several dedicated API endpoints. It fetched data from the Marvel API until version 0.4.0.
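
Conceptually, the acquisition chain boils down to the sketch below. It is heavily simplified: the real implementation runs as scheduled Celery tasks with asyncio, and the database, collection and file names used here are illustrative, not the repository's actual identifiers.

# Illustrative sketch of the acquisition chain; the real code runs as
# scheduled Celery tasks, and all names below are assumptions.
import json
from pathlib import Path

import requests
from pymongo import MongoClient

data_dir = Path("data")    # hypothetical local staging directory
data_dir.mkdir(exist_ok=True)

# 1. download the data and save it as a local file
events = requests.get("https://api.github.com/events", timeout=10).json()
events_file = data_dir / "events.json"
events_file.write_text(json.dumps(events))

# 2. read the file and load the relevant data into Mongo
collection = MongoClient("mongodb://localhost:27017")["biggie"]["github_events"]
result = collection.insert_many(json.loads(events_file.read_text()))

# 3. delete the local file once its data is successfully in Mongo
if result.acknowledged:
    events_file.unlink()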

The repository itself is based on the 'Papel' repository.


Installation

Environment

You have to create the .env environment file and create or reuse a GitHub token to enable the test build sequence.

[wip] Optionally tweak the schedule parameter for the cleaning task (see the "Data acquisition" section below).

If you plan to use the same GitHub Actions CI file, you need to create the same secrets as in the jobs > env section of .github/workflows/docker-ci.yaml (see line 31).

NB:

  • For every file that embeds secrets, you'll find a <file>.example counterpart ready to adapt.

Test

docker compose up api-test orchestrator-test
OR
docker compose --profile test up

Run

Data acquisition

docker compose up orchestrator-prod

This command will spin up the Orchestrator container and:

  • download all required data and save them as files locally
  • read these files and load the relevant data into Mongo
  • delete all local files once their data is successfully in Mongo

These tasks are scheduled every minute with a crontab setting, and a custom parameter makes it possible to schedule the cleaning step separately while keeping it in sync with the rest of the chain.

See kwargs={"wait_minutes": 30} in the github_events_stream schedule in .../tasks/schedules.py.
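
As a rough illustration, such a beat entry could look like the sketch below. The dotted task path is an assumption; only the every-minute crontab and the wait_minutes kwarg come from this repository's description.

# Illustrative beat schedule entry; the actual one lives in
# .../tasks/schedules.py and the task path used here is an assumption.
from celery.schedules import crontab

beat_schedule = {
    "github_events_stream": {
        "task": "orchestrator.tasks.github_events_stream",  # assumed dotted path
        "schedule": crontab(),  # a bare crontab() fires every minute
        # custom parameter used to schedule the cleaning step separately,
        # while keeping it in sync with the rest of the chain
        "kwargs": {"wait_minutes": 30},
    },
}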

The same chain can also be started via its dedicated profile:

docker compose \
  -f compose.yaml \
  --profile prod_acquisition \
  up

Data acquisition with monitoring

You can pass the monitoring configuration along with any profile or container command to include the corresponding containers: -f compose.monitoring.yaml

For Data acquisition, this will spin up both the Mongo-Express and Flower containers alongside the production containers.

docker compose -f compose.yaml -f compose.monitoring.yaml --profile monitoring up
docker compose \
  -f compose.yaml \
  -f compose.monitoring.yaml \
  --profile prod_acquisition \
  up

API container with monitoring

To spin up just the FastAPI container:

docker compose \
  -f compose.yaml \
  -f compose.monitoring.yaml \
  --profile prod_analytics \
  up

Acquisition, analytics/API and Jupyter containers with monitoring

Spin up the whole shebang with:

docker compose \
  -f compose.yaml \
  -f compose.monitoring.yaml \
  --profile prod_full \
  up

Teleport and Terraform

The repository includes a Terraform configuration, currently tuned to work with Scaleway. It prepares and checks the plan, then connects to the server via Teleport (which must be installed and set up on the host machine) and deploys the repository with the secrets stored in GitHub Actions. This is all handled via the CD pipeline.


Nginx deployment (only exposing the API, Jupyter and monitoring containers)

In this configuration, you need to have the necessary sub-domains set up on your domain provider's side. You also need:

  • Nginx installed on the host machine
  • a certificate generated by certbot, without any changes to the nginx configuration (see the certbot documentation)
    sudo certbot certonly --nginx    # example command for Ubuntu 24

Then create the required files and adjust the volume paths accordingly in the compose files. The nginx configuration files are:

  • nginx/certificate.conf
  • nginx/containers.conf


Local URLs

API docs

NB: The repository name parameter takes the full repository name, including the actor name, such as pierrz/biggie.
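
For instance, assuming the default FastAPI port listed below (8100) and an illustrative route name, a query could look like this; the interactive documentation served by FastAPI (/docs) lists the actual routes and parameters.

# Hypothetical query; route and parameter names are illustrative, check
# the interactive docs (e.g. http://localhost:8100/docs) for the real ones.
import requests

response = requests.get(
    "http://localhost:8100/github_events",   # illustrative route
    params={"repo_name": "pierrz/biggie"},   # full name, including the actor
    timeout=10,
)
response.raise_for_status()
print(response.json())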

Monitoring


Default Ports

For all the services, these are the default ports (backend/UI):

  • Mongo: 27017
  • Postgres: 5432
  • Jupyterlab: 8888
  • FastAPI: 8100
  • Celery: 5678
  • RabbitMQ: 5672/15672
  • Spark
    • master: 7077/8080
    • worker: 8081
  • Flower: 5555
  • Mongo-Express: 8081
  • DBeaver: 8978
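
As an example, with the stack running locally the two databases can be reached on their default ports as in this sketch; the credentials and database name are placeholders to be replaced with the values from your .env file.

# Quick connectivity check against the default ports; credentials and
# database names are placeholders, use the values from your .env file.
from pymongo import MongoClient
import psycopg2

mongo = MongoClient("mongodb://user:password@localhost:27017")
print(mongo.server_info()["version"])    # MongoDB backend on 27017

postgres = psycopg2.connect(
    host="localhost", port=5432, user="user", password="password", dbname="biggie"
)
print(postgres.server_version)           # PostgreSQL backend on 5432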

Development

If you want to make some changes in this repo while following the same environment tooling, you can run the following commands from the root directory:

poetry config virtualenvs.in-project true
poetry install && poetry shell
pre-commit install

To change the code of the core containers, you need to cd into the related directory and either:

  • run poetry update to simply install the required dependencies
  • run the previous commands to create a dedicated virtualenv

Contribute

You can always propose a PR based on a TODO; just don't forget to update the release version, which you can find in ci.yaml and all the pyproject.toml files.