biggie

Jupyter, (Py)Spark and FastAPI toolkit

Biggie is a tool dedicated to:

  • quickly get data from (external) APIs into either a MongoDB or PostgreSQL database
  • expose that data and make it searchable via a dedicated API
  • investigate it from a Jupyter Lab server

It is a Docker compose setup including:

  • an orchestrator container based on Celery (including Beat) and asyncio
  • a Spark cluster made of a master and 2 workers containers
  • an API container based on FastAPI
  • 2 containers for the MongoDB and PostgreSQL databases
  • 3 containers for Flower, Mongo-Express and DBeaver, for monitoring purposes
  • a Jupyter Lab container to play around with all this

It is currently set up to fetch data from the GitHub Events API, stream it into Mongo and expose it for analysis via several dedicated API endpoints. It fetched data from the Marvel API until version 0.4.0.
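
Conceptually, the acquisition chain boils down to the sketch below. It is heavily simplified: the real implementation runs as scheduled Celery tasks with asyncio, and the database, collection and file names used here are illustrative, not the repository's actual identifiers.

# Illustrative sketch of the acquisition chain; the real code runs as
# scheduled Celery tasks, and all names below are assumptions.
import json
from pathlib import Path

import requests
from pymongo import MongoClient

data_dir = Path("data")    # hypothetical local staging directory
data_dir.mkdir(exist_ok=True)

# 1. download the data and save it as a local file
events = requests.get("https://api.github.com/events", timeout=10).json()
events_file = data_dir / "events.json"
events_file.write_text(json.dumps(events))

# 2. read the file and load the relevant data into Mongo
collection = MongoClient("mongodb://localhost:27017")["biggie"]["github_events"]
result = collection.insert_many(json.loads(events_file.read_text()))

# 3. delete the local file once its data is successfully in Mongo
if result.acknowledged:
    events_file.unlink()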

The repository itself is based on the 'Papel' repository.


Installation

Environment

You have to create the .env environment file and create or reuse a GitHub token to enable the test build sequence.

[wip] Optionally tweak the schedule parameter for the cleaning task (see the "Data acquisition" section below).

If you plan to use the same GitHub Actions CI file, you need to create the same secrets as in the jobs > env section of .github/workflows/docker-ci.yaml (see line 31).

NB:

  • For every file that embeds secrets, you'll find a <file>.example counterpart ready to adapt.

Test

docker compose up api-test orchestrator-test
OR
docker compose --profile test up

Run

Data acquisition

docker compose up orchestrator-prod

This command will spin up the Orchestrator container and:

  • download all required data and save them as files locally
  • read these files and load the relevant data into Mongo
  • delete all local files once their data is successfully in Mongo

These tasks are scheduled every minute with a crontab setting, and a custom parameter makes it possible to schedule the cleaning step separately while keeping it in sync with the rest of the chain.

See kwargs={"wait_minutes": 30} in the github_events_stream schedule in .../tasks/schedules.py.
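
As a rough illustration, such a beat entry could look like the sketch below. The dotted task path is an assumption; only the every-minute crontab and the wait_minutes kwarg come from this repository's description.

# Illustrative beat schedule entry; the actual one lives in
# .../tasks/schedules.py and the task path used here is an assumption.
from celery.schedules import crontab

beat_schedule = {
    "github_events_stream": {
        "task": "orchestrator.tasks.github_events_stream",  # assumed dotted path
        "schedule": crontab(),  # a bare crontab() fires every minute
        # custom parameter used to schedule the cleaning step separately,
        # while keeping it in sync with the rest of the chain
        "kwargs": {"wait_minutes": 30},
    },
}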

The same chain can also be started via its dedicated profile:

docker compose \
  -f compose.yaml \
  --profile prod_acquisition \
  up

Data acquisition with monitoring

You can pass the monitoring configuration along with any profile or container command to include the corresponding containers: -f compose.monitoring.yaml

For Data acquisition, this will spin up both the Mongo-Express and Flower containers alongside the production containers.

docker compose -f compose.yaml -f compose.monitoring.yaml --profile monitoring up
docker compose \
  -f compose.yaml \
  -f compose.monitoring.yaml \
  --profile prod_acquisition \
  up

API container with monitoring

To spin up just the FastAPI container:

docker compose \
  -f compose.yaml \
  -f compose.monitoring.yaml \
  --profile prod_analytics \
  up

Acquisition, analytics/API and Jupyter containers with monitoring

Spin up the whole shebang with:

docker compose \
  -f compose.yaml \
  -f compose.monitoring.yaml \
  --profile prod_full \
  up

Teleport and Terraform

The repository includes a Terraform configuration, currently tuned to work with Scaleway. It prepares and checks the plan, then connects to the server via Teleport (which must be installed and set up on the host machine) and deploys the repository with the secrets stored in GitHub Actions. This is all handled via the CD pipeline.


Nginx deployment (only exposing the API, Jupyter and monitoring containers)

In this configuration, you need to have the necessary sub-domains set up on your domain provider's side. You also need:

  • Nginx installed on the host machine
  • a certificate generated by certbot, without any changes to the nginx configuration (see the certbot documentation)
    sudo certbot certonly --nginx    # example command for Ubuntu 24

Then create the required files and adjust the volume paths accordingly in the compose files. The nginx configuration files are:

  • nginx/certificate.conf
  • nginx/containers.conf


Local URLs

API docs

NB: The repository name parameter takes the full repository name, including the actor name, such as pierrz/biggie.
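
For instance, assuming the default FastAPI port listed below (8100) and an illustrative route name, a query could look like this; the interactive documentation served by FastAPI (/docs) lists the actual routes and parameters.

# Hypothetical query; route and parameter names are illustrative, check
# the interactive docs (e.g. http://localhost:8100/docs) for the real ones.
import requests

response = requests.get(
    "http://localhost:8100/github_events",   # illustrative route
    params={"repo_name": "pierrz/biggie"},   # full name, including the actor
    timeout=10,
)
response.raise_for_status()
print(response.json())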

Monitoring


Default Ports

For all the services, these are the default ports (backend/UI):

  • Mongo: 27017
  • Postgres: 5432
  • Jupyterlab: 8888
  • FastAPI: 8100
  • Celery: 5678
  • RabbitMQ: 5672/15672
  • Spark
    • master: 7077/8080
    • worker: 8081
  • Flower: 5555
  • Mongo-Express: 8081
  • DBeaver: 8978
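
As an example, with the stack running locally the two databases can be reached on their default ports as in this sketch; the credentials and database name are placeholders to be replaced with the values from your .env file.

# Quick connectivity check against the default ports; credentials and
# database names are placeholders, use the values from your .env file.
from pymongo import MongoClient
import psycopg2

mongo = MongoClient("mongodb://user:password@localhost:27017")
print(mongo.server_info()["version"])    # MongoDB backend on 27017

postgres = psycopg2.connect(
    host="localhost", port=5432, user="user", password="password", dbname="biggie"
)
print(postgres.server_version)           # PostgreSQL backend on 5432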

Development

If you want to make some changes in this repo while following the same environment tooling, you can run the following commands from the root directory:

poetry config virtualenvs.in-project true
poetry install && poetry shell
pre-commit install

To change the code of the core containers, you need to cd into the related directory and either:

  • run poetry update to simply install the required dependencies
  • run the previous commands to create a dedicated virtualenv

Contribute

You can always propose a PR based on a TODO; just don't forget to update the release version, which you can find in ci.yaml and all the pyproject.toml files.