diff --git a/docs/db/schema.md b/docs/db/schema.md
index 6b44d8a..9bf0d0a 100644
--- a/docs/db/schema.md
+++ b/docs/db/schema.md
@@ -1,9 +1,3 @@
-Here’s an updated markdown version with explanations for `SemanticVector` and `score_decay_rate`:
-
----
-
-# Graph Representation
-
 ## Nodes
 
 ### Channel
@@ -24,9 +18,9 @@ Here’s an updated markdown version with explanations for `SemanticVector` and
 - **topic_id**: Unique identifier
 - **name**: Summary of the topic
-- **keywords**: List of key terms with scores
-- **overall_score**: Average or cumulative score
+- **keywords**: List of key terms with associated weights (e.g., `[{"term": "AI", "weight": 0.35}, {"term": "neural networks", "weight": 0.28}]`)
 - **bertopic_metadata**: BerTopic metadata
+- **topic_embedding**: Topic embedding
 - **updated_at**: Last updated timestamp
 
 ---
@@ -45,14 +39,10 @@ Here’s an updated markdown version with explanations for `SemanticVector` and
 ### SemanticVector
 
 - **vector_id**: Unique identifier
-- **semantic_vector**: Aggregated representation of recent message semantics in a channel. This vector captures the
-  summarized, anonymized essence of new content without storing individual messages, aligning with privacy requirements.
+- **semantic_vector**: Aggregated representation of recent message semantics in a channel, preserving privacy by summarizing content instead of storing individual messages.
 - **created_at**: Creation date
 
-> **Explanation**: The `SemanticVector` represents the semantic profile of recent messages in a channel, allowing
-> Concord to adjust topic relevance without storing each message. Each vector aggregates the semantics of recent content
-> into a general representation, which can influence the `channel_score` in `ASSOCIATED_WITH` relationships between
-> channels and topics. This approach maintains user privacy while updating topic relevance dynamically.
+> **Explanation**: The SemanticVector node represents a general semantic profile of recent messages in a channel, supporting dynamic topic relevance without storing each message individually. This approach aligns with privacy requirements while allowing for the adjustment of topic relevance.
 
 ---
 
@@ -60,43 +50,20 @@ Here’s an updated markdown version with explanations for `SemanticVector` and
 ### ASSOCIATED_WITH (Channel → Topic)
 
-- **channel_score**: Cumulative or weighted score representing a topic’s importance or relevance to the channel
-- **keywords_weights**: Channel-specific keywords and their weights, reflecting the unique relationship between the
-  channel and topic
+- **topic_score**: Cumulative or weighted score representing a topic’s importance or relevance to the channel
+- **keywords_weights**: Channel-specific keywords and their weights, highlighting the unique relationship between the channel and topic
 - **message_count**: Number of messages analyzed in relation to the topic
 - **last_updated**: Timestamp of the last update
-- **score_decay_rate**: Rate at which `channel_score` decreases over time if no new relevant messages are analyzed. This
-  decay rate allows topic scores to adjust gradually, so less active or outdated topics diminish in relevance without
-  active content.
 - **trend**: Indicator of topic trend over time within the channel
 
-> **Explanation**: `score_decay_rate` ensures that topics associated with a channel decrease in relevance if no new
-> messages support their ongoing importance. This helps maintain an accurate and current reflection of active discussions
-> in a channel, giving more weight to trending or frequently discussed topics while allowing older or less relevant topics
-> to fade naturally.
+> **Explanation**: This relationship captures the importance of each topic to specific channels, with channel-specific keyword weights providing additional insight into unique topic-channel dynamics.
+> `trend` enables tracking how each topic's relevance changes over time within the channel.
 
 ---
 
 ### RELATED_TO (Topic ↔ Topic)
 
 - **similarity_score**: Degree of similarity between two topics
-- **temporal_similarity**: Time-based similarity metric to track changing topic relationships over time
-- **co-occurrence_rate**: Frequency with which two topics are discussed together across channels
+- **temporal_similarity**: Metric to track similarity over time
+- **co-occurrence_rate**: Frequency of concurrent discussion of topics across channels
 - **common_channels**: Number of shared channels discussing both topics
-- **topic_trend_similarity**: Similarity in trends or changes in relevance for each topic
-
-```mermaid
-graph TD
-%% Nodes
-    Channel["Channel
-    -------------------------
-    channel_id: Unique identifier
-    platform: Platform (e.g., Telegram)
-    name: Name of the channel
-    description: Description of the channel
-    created_at: Creation date
-    active_members_count: Number of active members
-    language: Language of the channel
-    region: Geographical region
-    activity_score: Posting activity score"]
-    Topic["Topic
-    -------------------------
-    topic_id: Unique identifier
-    name: Summary of the topic
-    keywords: List of key terms with scores
-    overall_score: Average or cumulative score
-    bertopic_metadata: BerTopic metadata
-    updated_at: Last updated timestamp"]
-    TopicUpdate["TopicUpdate
-    -------------------------
-    update_id: Unique identifier
-    channel_id: Associated channel
-    topic_id: Associated topic
-    keywords: Keywords from the update
-    score_delta: Change in topic score
-    timestamp: Update time"]
-    SemanticVector["SemanticVector
-    -------------------------
-    vector_id: Unique identifier
-    semantic_vector: Aggregated semantics
-    created_at: Creation date"]
-%% Relationships
-    Channel -.-> ASSOCIATED_WITH["ASSOCIATED_WITH Relationship
-    -------------------------
-    channel_score: Cumulative or weighted score
-    keywords_weights: Channel-specific keywords and weights
-    message_count: Number of messages analyzed
-    last_updated: Timestamp of last update
-    score_decay_rate: Rate of score decay
-    trend: Topic trend over time"] --> Topic
-    Topic -.-> RELATED_TO["RELATED_TO Relationship
-    -------------------------
-    similarity_score: Degree of similarity
-    temporal_similarity: Time-based similarity
-    co-occurrence_rate: Co-occurrence of keywords
-    common_channels: Number of shared channels
-    topic_trend_similarity: Trend alignment"] --> Topic
-    TopicUpdate --> Topic
-    SemanticVector --> Channel
-```
-
----
\ No newline at end of file
+- **topic_trend_similarity**: Measure of similarity in topic trends across channels
diff --git a/src/bert/concord.py b/src/bert/concord.py
index b674629..2de4bc1 100644
--- a/src/bert/concord.py
+++ b/src/bert/concord.py
@@ -1,9 +1,13 @@
 # concord.py
 
 from bert.pre_process import preprocess_documents
+from graph.schema import Topic
 
 
-def concord(topic_model, documents):
+def concord(
+    topic_model,
+    documents,
+):
 
     # Load the dataset and limit to 100 documents
     print(f"Loaded {len(documents)} documents.")
@@ -40,4 +44,4 @@ def concord(topic_model, documents):
     print(f" {word_score_str}")
 
     print("\nTopic modeling completed.")
-    return len(documents), None
+    return len(documents), Topic.create_topic()
diff --git a/src/graph/schema.py b/src/graph/schema.py
index 3798c42..c4412ac 100644
--- a/src/graph/schema.py
+++ b/src/graph/schema.py
@@ -10,11 +10,10 @@
 
 # Relationship Models
 class AssociatedWithRel(StructuredRel):
-    channel_score = FloatProperty()
+    topic_score = FloatProperty()
     keywords_weights = ArrayProperty()
     message_count = IntegerProperty()
     last_updated = DateTimeProperty()
-    score_decay_rate = FloatProperty()
     trend = StringProperty()
 
@@ -60,14 +59,13 @@ def create_channel(cls, platform: str, name: str, description: str,
-    def associate_with_topic(self, topic: 'Topic', channel_score: float,
+    def associate_with_topic(self, topic: 'Topic', topic_score: float,
                              keywords_weights: List[str], message_count: int,
-                             score_decay_rate: float, trend: str) -> None:
+                             trend: str) -> None:
         self.topics.connect(
             topic, {
-                'channel_score': channel_score,
+                'topic_score': topic_score,
                 'keywords_weights': keywords_weights,
                 'message_count': message_count,
                 'last_updated': datetime.utcnow(),
-                'score_decay_rate': score_decay_rate,
                 'trend': trend
             })
@@ -83,8 +81,8 @@ class Topic(StructuredNode):
     topic_id = UniqueIdProperty()
     name = StringProperty()
     keywords = ArrayProperty()
-    overall_score = FloatProperty()
     bertopic_metadata = JSONProperty()
+    topic_embedding = ArrayProperty()
     updated_at = DateTimeProperty(default_now=True)
 
     # Relationships
@@ -96,17 +94,24 @@ class Topic(StructuredNode):
 
     # Wrapper Functions
     @classmethod
-    def create_topic(cls, name: str, keywords: List[str], overall_score: float,
+    def create_topic(cls, name: str, keywords: List[str],
                      bertopic_metadata: Dict[str, Any]) -> 'Topic':
+        """
+        Create a new topic node with the given properties.
+        """
         return cls(name=name,
                    keywords=keywords,
-                   overall_score=overall_score,
                    bertopic_metadata=bertopic_metadata).save()
 
     def relate_to_topic(self, other_topic: 'Topic', similarity_score: float,
                         temporal_similarity: float, co_occurrence_rate: float,
                         common_channels: int,
                         topic_trend_similarity: float) -> None:
+        """
+        Create a relationship to another topic with various similarity metrics.
+        """
+        if not isinstance(other_topic, Topic):
+            raise ValueError("The related entity must be a Topic instance.")
         self.related_topics.connect(
             other_topic, {
                 'similarity_score': similarity_score,
@@ -118,10 +123,22 @@ def relate_to_topic(self, other_topic: 'Topic', similarity_score: float,
 
     def add_update(self, update_keywords: List[str],
                    score_delta: float) -> 'TopicUpdate':
+        """
+        Add an update to the topic with keyword changes and score delta.
+        """
         update = TopicUpdate.create_topic_update(update_keywords, score_delta)
         update.topic.connect(self)
         return update
 
+    def set_topic_embedding(self, embedding: List[float]) -> None:
+        """
+        Set the topic embedding vector, ensuring all values are floats.
+        """
+        if not all(isinstance(val, float) for val in embedding):
+            raise ValueError("All elements in topic_embedding must be floats.")
+        self.topic_embedding = embedding
+        self.save()
+
 
 class TopicUpdate(StructuredNode):
     update_id = UniqueIdProperty()
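The float-only guard introduced in `set_topic_embedding` can be sanity-checked without a live Neo4j connection. A minimal sketch of the same check, assuming nothing beyond this diff; `validate_embedding` is an illustrative stand-alone helper, not part of the change:

```python
# Standalone sketch of the float-only check added in Topic.set_topic_embedding.
# `validate_embedding` is a hypothetical helper mirroring that guard.
def validate_embedding(embedding):
    # Every element must be a float; anything else raises, matching the PR.
    if not all(isinstance(val, float) for val in embedding):
        raise ValueError("All elements in topic_embedding must be floats.")
    return embedding

print(validate_embedding([0.12, 0.57, 0.31]))  # → [0.12, 0.57, 0.31]

try:
    validate_embedding([0.12, 1, 0.31])  # the int 1 fails the isinstance check
except ValueError as exc:
    print(f"rejected: {exc}")
```

Note that because `isinstance(1, float)` is `False`, integer-valued components are rejected as well; callers holding mixed numeric lists may want to cast with `[float(v) for v in values]` before calling `set_topic_embedding`.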