feat: integrate BERTopic for topic modeling
Added dependencies for BERTopic, llvmlite, numba, and Neo4j.
Implemented incremental topic modeling with BERTopic in main.py, including model initialization, data loading, fitting, saving, and updating topics in Neo4j.
Added FastAPI and ConnectRPC.
Septimus4 committed Oct 30, 2024
1 parent fd9cb88 commit 365bd77
Showing 20 changed files with 2,646 additions and 228 deletions.
28 changes: 25 additions & 3 deletions .github/workflows/ci.yml
@@ -1,15 +1,28 @@
name: CI

on:
  push:
    branches: [ main ]
  pull_request:
    types: [ opened, synchronize, reopened, ready_for_review ]
    branches: [ main ]

jobs:
  build:
    name: Build and Test
    runs-on: ubuntu-latest
    if: ${{ !github.event.pull_request.draft }}
    services:
      neo4j:
        image: neo4j:5
        ports:
          - 7687:7687 # Bolt port
          - 7474:7474 # HTTP port
        env:
          NEO4J_AUTH: ${{ secrets.NEO4J_AUTH }}
        options: >-
          --health-cmd="curl -f http://localhost:7474 || exit 1"
          --health-interval=10s
          --health-timeout=5s
          --health-retries=5
    strategy:
      matrix:
@@ -23,13 +36,22 @@ jobs:
        uses: actions/setup-python@v5
        with:
          python-version: ${{ matrix.python-version }}
      - uses: Gr1N/setup-poetry@v9

      - name: Install Poetry
        uses: Gr1N/setup-poetry@v9

      - name: Install dependencies
        run: poetry install

      - name: Run pre-commit hooks
        run: poetry run pre-commit run --all-files --show-diff-on-failure

      - name: Create .env file
        run: |
          echo "NEO4J_AUTH=${{ secrets.NEO4J_AUTH }}" > .env
          echo "DATABASE_URL=${{ secrets.DATABASE_URL }}" >> .env
      - name: Run tests
        env:
          DATABASE_URL: ${{ secrets.DATABASE_URL }}
        run: poetry run pytest
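The workflow hands `NEO4J_AUTH` and `DATABASE_URL` to the job through secrets and a generated `.env` file. The following is a minimal Python sketch of how application or test code might read those values. It assumes `NEO4J_AUTH` uses the `user/password` form accepted by the Neo4j Docker image and does not handle the special value `none`. `parse_neo4j_auth` and `load_env_file` are hypothetical helpers, not part of this commit.

```python
def parse_neo4j_auth(value: str) -> tuple[str, str]:
    """Split a NEO4J_AUTH value of the form 'user/password'."""
    # Split on the first slash only, so passwords may contain '/'.
    user, sep, password = value.partition("/")
    if not sep or not user or not password:
        raise ValueError("NEO4J_AUTH must look like 'user/password'")
    return user, password


def load_env_file(text: str) -> dict[str, str]:
    """Parse KEY=VALUE lines such as those the workflow writes to .env."""
    env: dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and comments.
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep:
            env[key.strip()] = value.strip()
    return env
```

For example, `parse_neo4j_auth(load_env_file(open(".env").read())["NEO4J_AUTH"])` would yield the username and password as a tuple, provided the file exists and contains that key.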
6 changes: 2 additions & 4 deletions .github/workflows/codeql.yml
@@ -1,17 +1,15 @@
name: "CodeQL"

on:
  push:
    branches: [ main ]
  pull_request:
    types: [ opened, synchronize, reopened, ready_for_review ]
    branches: [ main ]
  schedule:
    - cron: '35 6 * * 3' # Runs at 06:35 UTC every Wednesday

jobs:
  analyze:
    name: Analyze
    runs-on: ubuntu-latest
    if: ${{ !github.event.pull_request.draft }}

    permissions:
      actions: read
1 change: 1 addition & 0 deletions .gitignore
@@ -164,3 +164,4 @@ cython_debug/
.idea/Concord.iml
.idea/modules.xml
.idea/vcs.xml
/.idea/developer-tools.xml
4 changes: 1 addition & 3 deletions .idea/misc.xml

15 changes: 8 additions & 7 deletions .pre-commit-config.yaml
@@ -1,19 +1,20 @@
 repos:
   - repo: local
     hooks:
-      - id: black
-        name: black
-        entry: poetry run black
+      - id: yapf
+        name: yapf
+        entry: poetry run yapf
         language: system
-        types: [ python ]
-        args: [ --check, --diff ]
+        pass_filenames: false
+        args: [ "-i", "-r", "concord/", "tests/" ]
       - id: flake8
         name: flake8
         entry: poetry run flake8
         language: system
-        types: [ python ]
+        pass_filenames: false
+        args: [ "concord/", "tests/" ]
       - id: pytest
         name: pytest
         entry: poetry run pytest
         language: system
-        pass_filenames: false
+        pass_filenames: false
4 changes: 2 additions & 2 deletions CONTRIBUTING.md
@@ -93,10 +93,10 @@ We use **GitHub Actions** to automatically run tests and linters on all pull req
 poetry run flake8 concord/
 ```

-- **Black**: For automatic code formatting.
+- **YAPF**: For automatic code formatting.

 ```bash
-poetry run black concord/
+poetry run yapf concord/
 ```

 - **Pre-Commit Hooks**: Set up pre-commit hooks to automate linting and testing before each commit.
142 changes: 142 additions & 0 deletions README.md
@@ -1 +1,143 @@
# Concord

Concord is a Python project that leverages FastAPI, Neo4j, and BERTopic for advanced text analysis. It provides a
platform for analyzing and visualizing text data using state-of-the-art machine learning techniques.

## Table of Contents

- [Prerequisites](#prerequisites)
- [Installation](#installation)
  - [Clone the Repository](#clone-the-repository)
  - [Set Up Dependencies](#set-up-dependencies)
    - [Debian-based Systems](#debian-based-systems)
    - [Windows](#windows)
- [Running the Application](#running-the-application)
  - [Start Docker Containers](#start-docker-containers)
  - [Run Pre-commit Hooks and Tests](#run-pre-commit-hooks-and-tests)

## Prerequisites

- **Python 3.12+**
- **Poetry** for dependency management
- **Docker** and **Docker Compose**
- **Git**
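
A quick way to confirm the Python requirement before installing anything else is a small version check. This is only a sketch; `meets_minimum` is a hypothetical helper and not part of the project.

```python
import sys

MINIMUM = (3, 12)  # taken from the prerequisites list


def meets_minimum(version_info=sys.version_info, minimum=MINIMUM) -> bool:
    """Return True if the (major, minor) version is at least `minimum`."""
    return tuple(version_info[:2]) >= minimum


if __name__ == "__main__":
    if meets_minimum():
        print("Python version OK")
    else:
        print(f"Python {MINIMUM[0]}.{MINIMUM[1]}+ is required")
```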

## Installation

### Clone the Repository

```bash
git clone https://github.com/yourusername/concord.git
cd concord
```

### Set Up Dependencies

#### Debian-based Systems

1. **Update Package Lists**

```bash
sudo apt update
```

2. **Install Required Packages**

```bash
sudo apt install -y software-properties-common curl git
```

3. **Install Python 3.12**

Add the Deadsnakes PPA and install Python 3.12:

```bash
sudo add-apt-repository ppa:deadsnakes/ppa
sudo apt update
sudo apt install -y python3.12 python3.12-venv python3.12-dev
```

4. **Install Poetry**

```bash
curl -sSL https://install.python-poetry.org | python3 -
echo 'export PATH="$HOME/.local/bin:$PATH"' >> ~/.bashrc
source ~/.bashrc
```

5. **Install Docker and Docker Compose**

```bash
sudo apt install -y docker.io docker-compose
sudo systemctl start docker
sudo systemctl enable docker
sudo usermod -aG docker $USER
```

Log out and log back in for the group changes to take effect.

6. **Install Project Dependencies**

```bash
poetry install
poetry run pre-commit install
```

#### Windows

1. **Install Python 3.12**

Download and install Python 3.12 from the [official website](https://www.python.org/downloads/windows/). During
installation, make sure to check the box **"Add Python to PATH"**.

2. **Install Git**

Download and install Git from the [official website](https://git-scm.com/download/win).

3. **Install Poetry**

Open Command Prompt or PowerShell and run:

```powershell
(Invoke-WebRequest -Uri https://install.python-poetry.org -UseBasicParsing).Content | python -
```

Add Poetry to your PATH by adding the following line to your PowerShell profile:

```powershell
$env:Path += ";$env:APPDATA\Python\Scripts"
```

4. **Install Docker Desktop**

Download and install Docker Desktop from the [official website](https://www.docker.com/products/docker-desktop).
Ensure that it is running before proceeding.

5. **Install Project Dependencies**

```powershell
poetry install
poetry run pre-commit install
```

## Running the Application

### Start Docker Containers

Set up a temporary Neo4j database:

```bash
docker-compose up -d
```
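
Once the containers are up, it can help to confirm that Neo4j is accepting connections before running anything against it. The sketch below only checks that a TCP port is reachable. It assumes the Compose file publishes the Bolt port on `localhost:7687`, as the CI workflow does; `docker-compose.yml` is not shown here, so adjust the host and port if it differs. `is_port_open` is a hypothetical helper.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout`."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unresolved hosts.
        return False


if __name__ == "__main__":
    # 7687 is assumed to be the published Bolt port.
    print("Bolt port reachable:", is_port_open("localhost", 7687))
```

A reachable port does not guarantee that the database has finished starting, so a real readiness check would also attempt an authenticated query.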

> **Note:** On Windows, ensure Docker Desktop is running and has sufficient resources allocated.

### Run Pre-commit Hooks and Tests

```bash
poetry run pre-commit run -a
```

## License

[![License: GPL v3](https://img.shields.io/badge/License-GPLv3-blue.svg)](LICENSE.md)
Empty file added concord/bert/__init__.py
Empty file.
44 changes: 44 additions & 0 deletions concord/bert/bert.py
@@ -0,0 +1,44 @@
# bert.py

import os

import joblib
from bertopic import BERTopic
from sentence_transformers import SentenceTransformer


def initialize_model():
    """
    Initialize the BERTopic model.
    You can customize the model with different parameters as needed.
    """
    # Use an explicit sentence-transformers embedding model rather than the BERTopic default
    embedding_model = SentenceTransformer("all-mpnet-base-v2")

    topic_model = BERTopic(
        embedding_model=embedding_model,
        verbose=True,
        # You can add more parameters here
    )
    return topic_model


def save_model(model, path):
    """
    Save the BERTopic model to disk.
    """
    joblib.dump(model, path)
    print(f"Model saved to {path}")


def load_model(path):
    """
    Load the BERTopic model from disk.
    Returns None if no file exists at `path`; the caller must then create a new model.
    """
    if os.path.exists(path):
        model = joblib.load(path)
        print(f"Model loaded from {path}")
        return model
    else:
        print(f"No existing model found at {path}. Initializing a new model.")
        return None
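
Because `load_model` returns `None` when nothing has been saved yet, the caller has to fall back to creating a model itself. A minimal sketch of that load-or-initialize pattern follows. To stay runnable without the ML dependencies it uses `pickle` and a plain dict as stand-ins for `joblib` and a BERTopic instance, and `load_or_initialize` is a hypothetical helper that does not appear in this commit.

```python
import os
import pickle


def save_model(model, path):
    """Serialize the model to `path` (pickle stands in for joblib here)."""
    with open(path, "wb") as f:
        pickle.dump(model, f)


def load_model(path):
    """Return the saved model, or None if nothing exists at `path`."""
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    return None


def load_or_initialize(path, factory):
    """Return the saved model if present, otherwise build one with `factory`."""
    model = load_model(path)
    if model is None:
        model = factory()
    return model
```

With the real module, the equivalent call would pass `initialize_model` as the factory, for example `load_or_initialize("model.joblib", initialize_model)`, where the file name is only an illustration.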