feat: integrate BERTopic for topic modeling
Added dependencies for BERTopic, llvmlite, numba, and Neo4j. Implemented incremental topic modeling with BERTopic in main.py, including model initialization, data loading, fitting, saving, and updating topics in Neo4j. Added FastAPI and ConnectRPC
Showing 20 changed files with 2,646 additions and 228 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -164,3 +164,4 @@ cython_debug/ | |
.idea/Concord.iml | ||
.idea/modules.xml | ||
.idea/vcs.xml | ||
/.idea/developer-tools.xml |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,19 +1,20 @@ | ||
repos: | ||
- repo: local | ||
hooks: | ||
- id: black | ||
name: black | ||
entry: poetry run black | ||
- id: yapf | ||
name: yapf | ||
entry: poetry run yapf | ||
language: system | ||
types: [ python ] | ||
args: [ --check, --diff ] | ||
pass_filenames: false | ||
args: [ "-i", "-r", "concord/", "tests/" ] | ||
- id: flake8 | ||
name: flake8 | ||
entry: poetry run flake8 | ||
language: system | ||
types: [ python ] | ||
pass_filenames: false | ||
args: [ "concord/", "tests/" ] | ||
- id: pytest | ||
name: pytest | ||
entry: poetry run pytest | ||
language: system | ||
pass_filenames: false | ||
pass_filenames: false |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1 +1,143 @@ | ||
# Concord | ||
|
||
Concord is a Python project that leverages FastAPI, Neo4j, and BERTopic for advanced text analysis. It provides a | ||
platform for analyzing and visualizing text data using state-of-the-art machine learning techniques. | ||
|
||
## Table of Contents | ||
|
||
- [Prerequisites](#prerequisites) | ||
- [Installation](#installation) | ||
  - [Clone the Repository](#clone-the-repository) | ||
  - [Set Up Dependencies](#set-up-dependencies) | ||
    - [Debian-based Systems](#debian-based-systems) | ||
    - [Windows](#windows) | ||
- [Running the Application](#running-the-application) | ||
  - [Start Docker Containers](#start-docker-containers) | ||
  - [Run Pre-commit Hooks and Tests](#run-pre-commit-hooks-and-tests) | ||
|
||
## Prerequisites | ||
|
||
- **Python 3.12+** | ||
- **Poetry** for dependency management | ||
- **Docker** and **Docker Compose** | ||
- **Git** | ||
|
||
## Installation | ||
|
||
### Clone the Repository | ||
|
||
```bash | ||
git clone https://github.com/yourusername/concord.git | ||
cd concord | ||
``` | ||
|
||
### Set Up Dependencies | ||
|
||
#### Debian-based Systems | ||
|
||
1. **Update Package Lists** | ||
|
||
```bash | ||
sudo apt update | ||
``` | ||
|
||
2. **Install Required Packages** | ||
|
||
```bash | ||
sudo apt install -y software-properties-common curl git | ||
``` | ||
|
||
3. **Install Python 3.12** | ||
|
||
Add the Deadsnakes PPA and install Python 3.12: | ||
|
||
```bash | ||
sudo add-apt-repository ppa:deadsnakes/ppa | ||
sudo apt update | ||
sudo apt install -y python3.12 python3.12-venv python3.12-dev | ||
``` | ||
|
||
4. **Install Poetry** | ||
|
||
```bash | ||
curl -sSL https://install.python-poetry.org | python3 - | ||
echo 'export PATH="$HOME/.local/bin:$PATH"' >> ~/.bashrc | ||
source ~/.bashrc | ||
``` | ||
|
||
5. **Install Docker and Docker Compose** | ||
|
||
```bash | ||
sudo apt install -y docker.io docker-compose | ||
sudo systemctl start docker | ||
sudo systemctl enable docker | ||
sudo usermod -aG docker $USER | ||
``` | ||
|
||
Log out and log back in for the group changes to take effect. | ||
|
||
6. **Install Project Dependencies** | ||
|
||
```bash | ||
poetry install | ||
poetry run pre-commit install | ||
``` | ||
|
||
#### Windows | ||
|
||
1. **Install Python 3.12** | ||
|
||
Download and install Python 3.12 from the [official website](https://www.python.org/downloads/windows/). During | ||
installation, make sure to check the box **"Add Python to PATH"**. | ||
|
||
2. **Install Git** | ||
|
||
Download and install Git from the [official website](https://git-scm.com/download/win). | ||
|
||
3. **Install Poetry** | ||
|
||
Open PowerShell and run: | ||
|
||
```powershell | ||
(Invoke-WebRequest -Uri https://install.python-poetry.org -UseBasicParsing).Content | python - | ||
``` | ||
|
||
Add Poetry to your PATH by adding the following line to your PowerShell profile: | ||
|
||
```powershell | ||
$env:Path += ";$env:APPDATA\Python\Scripts" | ||
``` | ||
|
||
4. **Install Docker Desktop** | ||
|
||
Download and install Docker Desktop from the [official website](https://www.docker.com/products/docker-desktop). | ||
Ensure that it is running before proceeding. | ||
|
||
5. **Install Project Dependencies** | ||
|
||
```powershell | ||
poetry install | ||
poetry run pre-commit install | ||
``` | ||
|
||
## Running the Application | ||
|
||
### Start Docker Containers | ||
|
||
Set up a temporary Neo4j database: | ||
|
||
```bash | ||
docker-compose up -d | ||
``` | ||
|
||
> **Note:** On Windows, ensure Docker Desktop is running and has sufficient resources allocated. | ||
### Run Pre-commit Hooks and Tests | ||
|
||
```bash | ||
poetry run pre-commit run -a | ||
``` | ||
|
||
### License | ||
|
||
[![License: GPL v3](https://img.shields.io/badge/License-GPLv3-blue.svg)](LICENSE.md) |
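
The README's `docker-compose up -d` step returns as soon as the containers are started, which can be before Neo4j accepts connections. A minimal sketch of a readiness check follows. The helper name `wait_for_port` is hypothetical and not part of this commit, and the example assumes the compose file publishes Neo4j's default Bolt port 7687 on localhost.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 30.0,
                  interval: float = 0.5) -> bool:
    """Return True once a TCP connection to (host, port) succeeds.

    Returns False if no connection succeeds before `timeout` seconds pass.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # A successful connect only shows that the port is open; it does
            # not prove the database has finished starting up.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Connection refused or timed out: wait briefly and retry.
            time.sleep(interval)
    return False
```

Under those assumptions, a call such as `wait_for_port("localhost", 7687)` could run after `docker-compose up -d` and before the tests that need the database.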
Empty file.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,44 @@ | ||
# bert.py | ||
|
||
import os | ||
|
||
import joblib | ||
from bertopic import BERTopic | ||
from sentence_transformers import SentenceTransformer | ||
|
||
|
||
def initialize_model(): | ||
    """ | ||
    Initialize the BERTopic model. | ||
    You can customize the model with different parameters as needed. | ||
    """ | ||
    # Using a specific embedding model for better performance | ||
    embedding_model = SentenceTransformer("all-mpnet-base-v2") | ||
|
||
    topic_model = BERTopic( | ||
        embedding_model=embedding_model, | ||
        verbose=True, | ||
        # You can add more parameters here | ||
    ) | ||
    return topic_model | ||
|
||
|
||
def save_model(model, path): | ||
    """ | ||
    Save the BERTopic model to disk. | ||
    """ | ||
    joblib.dump(model, path) | ||
    print(f"Model saved to {path}") | ||
|
||
|
||
def load_model(path): | ||
    """ | ||
    Load the BERTopic model from disk. | ||
    """ | ||
    if os.path.exists(path): | ||
        model = joblib.load(path) | ||
        print(f"Model loaded from {path}") | ||
        return model | ||
    else: | ||
        print(f"No existing model found at {path}. Initializing a new model.") | ||
        return None |
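
Because `load_model` returns `None` when no file exists at `path`, the caller has to fall back to `initialize_model()` itself. The sketch below illustrates that load-or-initialize pattern in isolation. It uses the standard library's `pickle` as a stand-in for `joblib` and a plain dictionary in place of a BERTopic model. The names `load_or_initialize` and `save_object` are hypothetical and do not appear in this commit.

```python
import os
import pickle
from typing import Any, Callable


def save_object(obj: Any, path: str) -> None:
    """Serialize obj to path (pickle stands in for joblib here)."""
    with open(path, "wb") as fh:
        pickle.dump(obj, fh)


def load_or_initialize(path: str, factory: Callable[[], Any]) -> Any:
    """Return the object stored at path, or a fresh one from factory().

    Assumption: factory plays the role of initialize_model() and is only
    called when nothing has been saved yet.
    """
    if os.path.exists(path):
        with open(path, "rb") as fh:
            return pickle.load(fh)
    return factory()
```

With the functions from `bert.py`, the equivalent caller-side logic would be to call `load_model(path)` and, if the result is `None`, call `initialize_model()`.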