diff --git a/docs/db/schema.md b/docs/db/schema.md
index 6b44d8a..9bf0d0a 100644
--- a/docs/db/schema.md
+++ b/docs/db/schema.md
@@ -1,9 +1,3 @@
-Here’s an updated markdown version with explanations for `SemanticVector` and `score_decay_rate`:
-
----
-
-# Graph Representation
-
## Nodes
### Channel
@@ -24,9 +18,9 @@ Here’s an updated markdown version with explanations for `SemanticVector` and
- **topic_id**: Unique identifier
- **name**: Summary of the topic
-- **keywords**: List of key terms with scores
-- **overall_score**: Average or cumulative score
+- **keywords**: List of key terms with associated weights (e.g., `[{"term": "AI", "weight": 0.35}, {"term": "neural networks", "weight": 0.28}]`)
- **bertopic_metadata**: BerTopic metadata
+- **topic_embedding**: Embedding vector representing the topic in semantic space
- **updated_at**: Last updated timestamp
---
@@ -45,14 +39,10 @@ Here’s an updated markdown version with explanations for `SemanticVector` and
### SemanticVector
- **vector_id**: Unique identifier
-- **semantic_vector**: Aggregated representation of recent message semantics in a channel. This vector captures the
- summarized, anonymized essence of new content without storing individual messages, aligning with privacy requirements.
+- **semantic_vector**: Aggregated representation of recent message semantics in a channel, preserving privacy by summarizing content instead of storing individual messages.
- **created_at**: Creation date
-> **Explanation**: The `SemanticVector` represents the semantic profile of recent messages in a channel, allowing
-> Concord to adjust topic relevance without storing each message. Each vector aggregates the semantics of recent content
-> into a general representation, which can influence the `channel_score` in `ASSOCIATED_WITH` relationships between
-> channels and topics. This approach maintains user privacy while updating topic relevance dynamically.
+> **Explanation**: The SemanticVector node represents a general semantic profile of recent messages in a channel, supporting dynamic topic relevance without storing each message individually. This approach aligns with privacy requirements while allowing for the adjustment of topic relevance.
---
@@ -60,43 +50,20 @@ Here’s an updated markdown version with explanations for `SemanticVector` and
### ASSOCIATED_WITH (Channel → Topic)
-- **channel_score**: Cumulative or weighted score representing a topic’s importance or relevance to the channel
-- **keywords_weights**: Channel-specific keywords and their weights, reflecting the unique relationship between the
- channel and topic
+- **topic_score**: Cumulative or weighted score representing a topic’s importance or relevance to the channel
+- **keywords_weights**: Channel-specific keywords and their weights, highlighting the unique relationship between the channel and topic
- **message_count**: Number of messages analyzed in relation to the topic
- **last_updated**: Timestamp of the last update
-- **score_decay_rate**: Rate at which `channel_score` decreases over time if no new relevant messages are analyzed. This
- decay rate allows topic scores to adjust gradually, so less active or outdated topics diminish in relevance without
- active content.
- **trend**: Indicator of topic trend over time within the channel
-> **Explanation**: `score_decay_rate` ensures that topics associated with a channel decrease in relevance if no new
-> messages support their ongoing importance. This helps maintain an accurate and current reflection of active discussions
-> in a channel, giving more weight to trending or frequently discussed topics while allowing older or less relevant topics
-> to fade naturally.
+> **Explanation**: This relationship captures the importance of each topic to specific channels, with channel-specific keyword weights providing additional insight into unique topic-channel dynamics. `trend` enables tracking how each topic's relevance changes over time within the channel.
---
### RELATED_TO (Topic ↔ Topic)
- **similarity_score**: Degree of similarity between two topics
-- **temporal_similarity**: Time-based similarity metric to track changing topic relationships over time
-- **co-occurrence_rate**: Frequency with which two topics are discussed together across channels
+- **temporal_similarity**: Metric to track similarity over time
+- **co-occurrence_rate**: Frequency of concurrent discussion of topics across channels
- **common_channels**: Number of shared channels discussing both topics
-- **topic_trend_similarity**: Similarity in trends or changes in relevance for each topic
-
-```mermaid
-graph TD
-%% Nodes
- Channel["Channel
-------------------------
-channel_id: Unique identifier
-platform: Platform (e.g., Telegram)
-name: Name of the channel
-description: Description of the channel
-created_at: Creation date
-active_members_count: Number of active members
-language: Language of the channel
-region: Geographical region
-activity_score: Posting activity score"]
- Topic["Topic
-------------------------
-topic_id: Unique identifier
-name: Summary of the topic
-keywords: List of key terms with scores
-overall_score: Average or cumulative score
-bertopic_metadata: BerTopic metadata
-updated_at: Last updated timestamp"]
- TopicUpdate["TopicUpdate
-------------------------
-update_id: Unique identifier
-channel_id: Associated channel
-topic_id: Associated topic
-keywords: Keywords from the update
-score_delta: Change in topic score
-timestamp: Update time"]
- SemanticVector["SemanticVector
-------------------------
-vector_id: Unique identifier
-semantic_vector: Aggregated semantics
-created_at: Creation date"]
-%% Relationships
- Channel -.-> ASSOCIATED_WITH["ASSOCIATED_WITH Relationship
-------------------------
-channel_score: Cumulative or weighted score
-keywords_weights: Channel-specific keywords and weights
-message_count: Number of messages analyzed
-last_updated: Timestamp of last update
-score_decay_rate: Rate of score decay
-trend: Topic trend over time"] --> Topic
- Topic -.-> RELATED_TO["RELATED_TO Relationship
-------------------------
-similarity_score: Degree of similarity
-temporal_similarity: Time-based similarity
-co-occurrence_rate: Co-occurrence of keywords
-common_channels: Number of shared channels
-topic_trend_similarity: Trend alignment"] --> Topic
- TopicUpdate --> Topic
- SemanticVector --> Channel
-```
-
----
\ No newline at end of file
+- **topic_trend_similarity**: Measure of similarity in topic trends across channels
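The weighted-keyword structure documented above can be exercised with a small sketch; `cumulative_topic_score` is a hypothetical helper for illustration, not part of the schema:

```python
# Sketch: derive a cumulative score from a topic's keyword weights,
# matching the [{"term": ..., "weight": ...}] shape documented above.
# The helper name is illustrative, not part of the codebase.
def cumulative_topic_score(keywords):
    """Sum the per-term weights of a topic's keyword list."""
    return sum(entry["weight"] for entry in keywords)


keywords = [
    {"term": "AI", "weight": 0.35},
    {"term": "neural networks", "weight": 0.28},
]
print(cumulative_topic_score(keywords))
```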
diff --git a/src/bert/concord.py b/src/bert/concord.py
index b674629..2de4bc1 100644
--- a/src/bert/concord.py
+++ b/src/bert/concord.py
@@ -1,9 +1,13 @@
# concord.py
from bert.pre_process import preprocess_documents
+from graph.schema import Topic
-def concord(topic_model, documents):
+def concord(
+ topic_model,
+ documents,
+):
# Load the dataset and limit to 100 documents
print(f"Loaded {len(documents)} documents.")
@@ -40,4 +44,4 @@ def concord(topic_model, documents):
print(f" {word_score_str}")
print("\nTopic modeling completed.")
- return len(documents), None
+ return len(documents), Topic.create_topic()
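Since topics now carry a `topic_embedding`, one plausible way to populate `similarity_score` on `RELATED_TO` edges is cosine similarity between embeddings. A minimal pure-Python sketch, assuming equal-length float vectors (the helper is not existing code in this repo):

```python
import math


# Sketch: cosine similarity between two topic embeddings, a plausible
# source for similarity_score on RELATED_TO edges. Hypothetical helper.
def cosine_similarity(a, b):
    """Cosine similarity of two equal-length float vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)
```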
diff --git a/src/graph/schema.py b/src/graph/schema.py
index 3798c42..c4412ac 100644
--- a/src/graph/schema.py
+++ b/src/graph/schema.py
@@ -10,11 +10,10 @@
# Relationship Models
class AssociatedWithRel(StructuredRel):
- channel_score = FloatProperty()
+ topic_score = FloatProperty()
keywords_weights = ArrayProperty()
message_count = IntegerProperty()
last_updated = DateTimeProperty()
- score_decay_rate = FloatProperty()
trend = StringProperty()
@@ -60,14 +59,13 @@ def create_channel(cls, platform: str, name: str, description: str,
def associate_with_topic(self, topic: 'Topic', channel_score: float,
keywords_weights: List[str], message_count: int,
- score_decay_rate: float, trend: str) -> None:
+ trend: str) -> None:
self.topics.connect(
topic, {
-'channel_score': channel_score,
+'topic_score': channel_score,
'keywords_weights': keywords_weights,
'message_count': message_count,
'last_updated': datetime.utcnow(),
- 'score_decay_rate': score_decay_rate,
'trend': trend
})
@@ -83,8 +81,8 @@ class Topic(StructuredNode):
topic_id = UniqueIdProperty()
name = StringProperty()
keywords = ArrayProperty()
- overall_score = FloatProperty()
bertopic_metadata = JSONProperty()
+ topic_embedding = ArrayProperty()
updated_at = DateTimeProperty(default_now=True)
# Relationships
@@ -96,17 +94,24 @@ class Topic(StructuredNode):
# Wrapper Functions
@classmethod
- def create_topic(cls, name: str, keywords: List[str], overall_score: float,
+ def create_topic(cls, name: str, keywords: List[str],
bertopic_metadata: Dict[str, Any]) -> 'Topic':
+ """
+ Create a new topic node with the given properties.
+ """
return cls(name=name,
keywords=keywords,
- overall_score=overall_score,
bertopic_metadata=bertopic_metadata).save()
def relate_to_topic(self, other_topic: 'Topic', similarity_score: float,
temporal_similarity: float, co_occurrence_rate: float,
common_channels: int,
topic_trend_similarity: float) -> None:
+ """
+ Create a relationship to another topic with various similarity metrics.
+ """
+ if not isinstance(other_topic, Topic):
+ raise ValueError("The related entity must be a Topic instance.")
self.related_topics.connect(
other_topic, {
'similarity_score': similarity_score,
@@ -118,10 +123,22 @@ def relate_to_topic(self, other_topic: 'Topic', similarity_score: float,
def add_update(self, update_keywords: List[str],
score_delta: float) -> 'TopicUpdate':
+ """
+ Add an update to the topic with keyword changes and score delta.
+ """
update = TopicUpdate.create_topic_update(update_keywords, score_delta)
update.topic.connect(self)
return update
+ def set_topic_embedding(self, embedding: List[float]) -> None:
+ """
+ Set the topic embedding vector, ensuring all values are floats.
+ """
+ if not all(isinstance(val, float) for val in embedding):
+ raise ValueError("All elements in topic_embedding must be floats.")
+ self.topic_embedding = embedding
+ self.save()
+
class TopicUpdate(StructuredNode):
update_id = UniqueIdProperty()
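The float check added in `set_topic_embedding` can be tested in isolation; this standalone sketch mirrors that validation without the neomodel dependency:

```python
# Standalone mirror of the validation in Topic.set_topic_embedding:
# every element of the embedding must be a float, else raise.
def validate_embedding(embedding):
    """Return the embedding unchanged if all elements are floats."""
    if not all(isinstance(val, float) for val in embedding):
        raise ValueError("All elements in topic_embedding must be floats.")
    return embedding
```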