feat: integrate BERTopic for topic modeling
Added dependencies for BERTopic, llvmlite, numba, and Neo4j.
Implemented incremental topic modeling with BERTopic in main.py,
including model initialization, data loading, fitting, saving,
and updating topics in Neo4j.
Added FastAPI and ConnectRPC.
Septimus4 committed Oct 30, 2024
1 parent fd9cb88 commit fc269a5
Showing 19 changed files with 2,641 additions and 224 deletions.
21 changes: 19 additions & 2 deletions .github/workflows/ci.yml
@@ -1,15 +1,28 @@
name: CI

on:
  push:
    branches: [ main ]
  pull_request:
    types: [ opened, synchronize, reopened, ready_for_review ]
    branches: [ main ]

jobs:
  build:
    name: Build and Test
    runs-on: ubuntu-latest
    if: ${{ !github.event.pull_request.draft }}
    services:
      neo4j:
        image: neo4j:5
        ports:
          - 7687:7687 # Bolt port
          - 7474:7474 # HTTP port
        env:
          NEO4J_AUTH: neo4j/${{ secrets.NEO4J_PASSWORD }} # the Neo4j image expects user/password separated by a slash
        options: >-
          --health-cmd="curl -f http://localhost:7474 || exit 1"
          --health-interval=10s
          --health-timeout=5s
          --health-retries=5
    strategy:
      matrix:
@@ -32,4 +45,8 @@ jobs:
        run: poetry run pre-commit run --all-files --show-diff-on-failure

      - name: Run tests
        env:
          NEO4J_URI: bolt://localhost:7687
          NEO4J_USER: neo4j
          NEO4J_PASSWORD: ${{ secrets.NEO4J_PASSWORD }}
        run: poetry run pytest
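The env block exported to the test step implies the suite reads these variables to reach the Neo4j service container. A minimal sketch of how a test fixture might consume them — hypothetical, since the repository's actual test wiring is not shown in this diff:

```python
# conftest.py — hypothetical fixture; the name and scope are assumptions
import os

import pytest
from neo4j import GraphDatabase


@pytest.fixture(scope="session")
def neo4j_driver():
    # Read the connection details that ci.yml exports to the test step
    uri = os.environ.get("NEO4J_URI", "bolt://localhost:7687")
    auth = (os.environ.get("NEO4J_USER", "neo4j"),
            os.environ["NEO4J_PASSWORD"])
    driver = GraphDatabase.driver(uri, auth=auth)
    yield driver
    driver.close()
```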
5 changes: 1 addition & 4 deletions .github/workflows/codeql.yml
@@ -1,17 +1,14 @@
name: "CodeQL"

on:
push:
branches: [ main ]
pull_request:
branches: [ main ]
schedule:
- cron: '35 6 * * 3' # Runs at 06:35 every Wednesday

jobs:
analyze:
name: Analyze
runs-on: ubuntu-latest
if: ${{ !github.event.pull_request.draft }}

permissions:
actions: read
1 change: 1 addition & 0 deletions .gitignore
@@ -164,3 +164,4 @@ cython_debug/
.idea/Concord.iml
.idea/modules.xml
.idea/vcs.xml
/.idea/developer-tools.xml
4 changes: 1 addition & 3 deletions .idea/misc.xml

Some generated files are not rendered by default.

15 changes: 8 additions & 7 deletions .pre-commit-config.yaml
@@ -1,19 +1,20 @@
 repos:
   - repo: local
     hooks:
-      - id: black
-        name: black
-        entry: poetry run black
+      - id: yapf
+        name: yapf
+        entry: poetry run yapf
         language: system
         types: [ python ]
-        args: [ --check, --diff ]
         pass_filenames: false
+        args: [ "-i", "-r", "concord/", "tests/" ]
       - id: flake8
         name: flake8
         entry: poetry run flake8
         language: system
         types: [ python ]
         pass_filenames: false
+        args: [ "concord/", "tests/" ]
       - id: pytest
         name: pytest
         entry: poetry run pytest
         language: system
         pass_filenames: false
148 changes: 148 additions & 0 deletions README.md
@@ -1 +1,149 @@
# Concord

Concord is a Python project that leverages FastAPI, Neo4j, and BERTopic for advanced text analysis. It provides a
platform for analyzing and visualizing text data using state-of-the-art machine learning techniques.

## Table of Contents

- [Prerequisites](#prerequisites)
- [Installation](#installation)
- [Clone the Repository](#clone-the-repository)
- [Set Up Dependencies](#set-up-dependencies)
- [Debian-based Systems](#debian-based-systems)
- [Windows](#windows)
- [Running the Application](#running-the-application)
- [Start Docker Containers](#start-docker-containers)
- [Run Pre-commit Hooks and Tests](#run-pre-commit-hooks-and-tests)
- [Contributing](#contributing)
- [License](#license)

## Prerequisites

- **Python 3.12+**
- **Poetry** for dependency management
- **Docker** and **Docker Compose**
- **Git**

## Installation

### Clone the Repository

```bash
git clone https://github.com/yourusername/concord.git
cd concord
```

### Set Up Dependencies

#### Debian-based Systems

1. **Update Package Lists**

   ```bash
   sudo apt update
   ```

2. **Install Required Packages**

   ```bash
   sudo apt install -y software-properties-common curl git
   ```

3. **Install Python 3.12**

   Add the Deadsnakes PPA and install Python 3.12:

   ```bash
   sudo add-apt-repository ppa:deadsnakes/ppa
   sudo apt update
   sudo apt install -y python3.12 python3.12-venv python3.12-dev
   ```

4. **Install Poetry**

   ```bash
   curl -sSL https://install.python-poetry.org | python3 -
   echo 'export PATH="$HOME/.local/bin:$PATH"' >> ~/.bashrc
   source ~/.bashrc
   ```

5. **Install Docker and Docker Compose**

   ```bash
   sudo apt install -y docker.io docker-compose
   sudo systemctl start docker
   sudo systemctl enable docker
   sudo usermod -aG docker $USER
   ```

   Log out and log back in for the group changes to take effect.

6. **Install Project Dependencies**

   ```bash
   poetry install
   poetry run pre-commit install
   ```

#### Windows

1. **Install Python 3.12**

   Download and install Python 3.12 from the [official website](https://www.python.org/downloads/windows/). During
   installation, make sure to check the box **"Add Python to PATH"**.

2. **Install Git**

   Download and install Git from the [official website](https://git-scm.com/download/win).

3. **Install Poetry**

   Open Command Prompt or PowerShell and run:

   ```powershell
   (Invoke-WebRequest -Uri https://install.python-poetry.org -UseBasicParsing).Content | python -
   ```

   Add Poetry to your PATH by adding the following line to your PowerShell profile:

   ```powershell
   $env:Path += ";$env:APPDATA\Python\Scripts"
   ```

4. **Install Docker Desktop**

   Download and install Docker Desktop from the [official website](https://www.docker.com/products/docker-desktop).
   Ensure that it is running before proceeding.

5. **Install Project Dependencies**

   ```powershell
   poetry install
   poetry run pre-commit install
   ```

## Running the Application

### Start Docker Containers

Set up a temporary Neo4j database:

```bash
docker-compose up -d
```

> **Note:** On Windows, ensure Docker Desktop is running and has sufficient resources allocated.
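
The repository's `docker-compose.yml` is not rendered in this diff; a minimal file along these lines would provide the temporary database (the image tag, ports, and credentials here are illustrative assumptions, not the repository's actual file):

```yaml
# docker-compose.yml — illustrative sketch only
services:
  neo4j:
    image: neo4j:5
    ports:
      - "7474:7474" # HTTP
      - "7687:7687" # Bolt
    environment:
      NEO4J_AUTH: neo4j/change-me # placeholder password
```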

### Run Pre-commit Hooks and Tests

```bash
poetry run pre-commit run -a
```
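
Because a `pytest` hook is registered in `.pre-commit-config.yaml`, this command runs the test suite as well; to run the tests on their own, use `poetry run pytest`.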

## Contributing

Contributions are welcome! Please open an issue or submit a pull request for any improvements or suggestions.

## License

This project is licensed under the [MIT License](LICENSE).
Empty file added concord/bert/__init__.py
44 changes: 44 additions & 0 deletions concord/bert/bert.py
@@ -0,0 +1,44 @@
# bert.py

import os

import joblib
from bertopic import BERTopic
from sentence_transformers import SentenceTransformer


def initialize_model():
    """
    Initialize the BERTopic model.
    You can customize the model with different parameters as needed.
    """
    # Using a specific embedding model for better performance
    embedding_model = SentenceTransformer("all-mpnet-base-v2")

    topic_model = BERTopic(
        embedding_model=embedding_model,
        verbose=True,
        # You can add more parameters here
    )
    return topic_model


def save_model(model, path):
    """
    Save the BERTopic model to disk.
    """
    joblib.dump(model, path)
    print(f"Model saved to {path}")


def load_model(path):
    """
    Load the BERTopic model from disk.
    """
    if os.path.exists(path):
        model = joblib.load(path)
        print(f"Model loaded from {path}")
        return model
    else:
        print(f"No existing model found at {path}. Initializing a new model.")
        return None
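A minimal round trip through these helpers might look like the sketch below (assuming the module is importable as `bert.bert`, as `concord.py` does; the corpus loader is hypothetical — BERTopic needs a reasonably large document set to fit):

```python
# Illustrative usage of the helpers above
from bert.bert import initialize_model, load_model, save_model

model = load_model("bertopic_model.pkl")
if model is None:
    model = initialize_model()

docs = load_corpus_somehow()  # hypothetical loader; supply your own documents
topics, probs = model.fit_transform(docs)
save_model(model, "bertopic_model.pkl")
```

Note that BERTopic also ships its own `save`/`load` methods, which may be preferable for large models since they can handle the underlying embedding model separately from the topic representation.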
84 changes: 84 additions & 0 deletions concord/concord.py
@@ -0,0 +1,84 @@
# concord.py

from bertopic import BERTopic
from sklearn.datasets import fetch_20newsgroups

from bert.bert import load_model, save_model

# Configuration
MODEL_SAVE_PATH = "bertopic_model.pkl"
INITIAL_BATCH_SIZE = 100  # Number of documents in the initial batch
BATCH_SIZE = 50  # Number of documents per new batch
TOTAL_BATCHES = 10  # Total number of batches to simulate


def load_data(batch_number, batch_size):
    """
    Simulate loading a batch of data.
    For demonstration, we're using the 20 Newsgroups dataset.
    In a real-world scenario, replace this with your data loading mechanism.
    """
    newsgroups = fetch_20newsgroups(subset="all",
                                    remove=("headers", "footers", "quotes"))
    start = batch_number * batch_size
    end = start + batch_size
    if start >= len(newsgroups.data):
        return []
    return newsgroups.data[start:end]


def setup():
    # Load existing model or initialize a new one
    topic_model = load_model(MODEL_SAVE_PATH)

    # Initialize data storage
    all_documents = []

    if topic_model is None:
        # Initial batch
        print("Processing initial batch...")
        initial_batch = load_data(batch_number=0,
                                  batch_size=INITIAL_BATCH_SIZE)
        all_documents.extend(initial_batch)

        # Fit the model with the initial batch
        topic_model = BERTopic()
        topics, probs = topic_model.fit_transform(all_documents)

        # Save the model
        save_model(topic_model, MODEL_SAVE_PATH)
    else:
        # If the model exists, load existing documents if stored.
        # For simplicity, we're not storing all_documents; in a real
        # application, consider persisting them.
        print("Existing model loaded. Starting incremental updates...")
        # Optionally, load existing documents from a file or database
        # all_documents = load_existing_documents()

    # Simulate incremental updates with new batches
    for batch_num in range(1, TOTAL_BATCHES + 1):
        print(f"\nProcessing batch {batch_num}...")
        new_batch = load_data(batch_number=batch_num, batch_size=BATCH_SIZE)

        if not new_batch:
            print("No more data to process.")
            break

        all_documents.extend(new_batch)

        # Refit the model with all documents (existing + new)
        print("Refitting the BERTopic model with the updated dataset...")
        topic_model = BERTopic()
        topics, probs = topic_model.fit_transform(all_documents)

        # Optionally, analyze topics
        print(
            f"Number of topics after batch {batch_num}: {len(set(topics)) - (1 if -1 in topics else 0)}"
        )

        # Save the updated model
        save_model(topic_model, MODEL_SAVE_PATH)

        # Optional: Integrate with Neo4j to store/update topics
        # store_topics_in_neo4j(topic_model, batch_num)

    print("\nTopic modeling completed.")
10 changes: 10 additions & 0 deletions concord/fast_api.py
@@ -0,0 +1,10 @@
# fast_api.py

from fastapi import FastAPI

app = FastAPI(title="Concord API", version="1.0")


@app.get("/")
def read_root():
    return {"message": "Welcome to Concord API"}
Empty file added concord/graph/__init__.py
