refactor: remove score_decay_rate and update formatting #36

Merged · 1 commit · Nov 7, 2024
53 changes: 10 additions & 43 deletions docs/db/schema.md
@@ -1,9 +1,3 @@
Here’s an updated markdown version with explanations for `SemanticVector` and `score_decay_rate`:

---

# Graph Representation

## Nodes

### Channel
@@ -24,9 +18,9 @@ Here’s an updated markdown version with explanations for `SemanticVector` and

- **topic_id**: Unique identifier
- **name**: Summary of the topic
- **keywords**: List of key terms with scores
- **overall_score**: Average or cumulative score
- **keywords**: List of key terms with associated weights (e.g., `[{"term": "AI", "weight": 0.35}, {"term": "neural networks", "weight": 0.28}]`)
- **bertopic_metadata**: BERTopic metadata
- **topic_embedding**: Topic embedding vector
- **updated_at**: Last updated timestamp

---
@@ -45,58 +39,31 @@ Here’s an updated markdown version with explanations for `SemanticVector` and
### SemanticVector

- **vector_id**: Unique identifier
- **semantic_vector**: Aggregated representation of recent message semantics in a channel. This vector captures the
summarized, anonymized essence of new content without storing individual messages, aligning with privacy requirements.
- **semantic_vector**: Aggregated representation of recent message semantics in a channel, preserving privacy by summarizing content instead of storing individual messages.
- **created_at**: Creation date

> **Explanation**: The `SemanticVector` represents the semantic profile of recent messages in a channel, allowing
> Concord to adjust topic relevance without storing each message. Each vector aggregates the semantics of recent content
> into a general representation, which can influence the `channel_score` in `ASSOCIATED_WITH` relationships between
> channels and topics. This approach maintains user privacy while updating topic relevance dynamically.
> **Explanation**: The SemanticVector node represents a general semantic profile of recent messages in a channel, supporting dynamic topic relevance without storing each message individually. This approach aligns with privacy requirements while allowing for the adjustment of topic relevance.
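One way to realize the aggregation described above is mean pooling over per-message embeddings. The helper below is a hypothetical sketch, not part of this PR; it assumes messages have already been embedded elsewhere:

```python
import numpy as np


def aggregate_semantic_vector(message_embeddings: list[list[float]]) -> list[float]:
    """Mean-pool per-message embeddings into a single channel-level
    semantic vector, so no individual message content is retained."""
    if not message_embeddings:
        raise ValueError("Need at least one message embedding.")
    return np.asarray(message_embeddings, dtype=float).mean(axis=0).tolist()
```

Mean pooling is only one option; a recency-weighted average would favor newer messages, but either way the stored vector never maps back to a single message.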

---

## Relationships

### ASSOCIATED_WITH (Channel → Topic)

- **channel_score**: Cumulative or weighted score representing a topic’s importance or relevance to the channel
- **keywords_weights**: Channel-specific keywords and their weights, reflecting the unique relationship between the
channel and topic
- **topic_score**: Cumulative or weighted score representing a topic’s importance or relevance to the channel
- **keywords_weights**: Channel-specific keywords and their weights, highlighting the unique relationship between the channel and topic
- **message_count**: Number of messages analyzed in relation to the topic
- **last_updated**: Timestamp of the last update
- **score_decay_rate**: Rate at which `channel_score` decreases over time if no new relevant messages are analyzed. This
decay rate allows topic scores to adjust gradually, so less active or outdated topics diminish in relevance without
active content.
- **trend**: Indicator of topic trend over time within the channel

> **Explanation**: `score_decay_rate` ensures that topics associated with a channel decrease in relevance if no new
> messages support their ongoing importance. This helps maintain an accurate and current reflection of active discussions
> in a channel, giving more weight to trending or frequently discussed topics while allowing older or less relevant topics
> to fade naturally.
> **Explanation**: This relationship captures the importance of each topic to specific channels, with channel-specific keyword weights providing additional insight into unique topic-channel dynamics. `trend` enables tracking how each topic's relevance changes over time within the channel.
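With `score_decay_rate` removed, `trend` is the remaining temporal signal on this relationship. A minimal sketch of how a trend label might be derived from message counts over two consecutive windows; the heuristic and threshold are assumptions, not part of this PR:

```python
def classify_trend(prev_count: int, curr_count: int, threshold: float = 0.1) -> str:
    """Label a topic's direction in a channel by comparing message
    counts across two consecutive time windows (hypothetical heuristic)."""
    if prev_count == 0:
        return "rising" if curr_count > 0 else "flat"
    change = (curr_count - prev_count) / prev_count
    if change > threshold:
        return "rising"
    if change < -threshold:
        return "falling"
    return "flat"
```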

---

### RELATED_TO (Topic ↔ Topic)

- **similarity_score**: Degree of similarity between two topics
- **temporal_similarity**: Time-based similarity metric to track changing topic relationships over time
- **co-occurrence_rate**: Frequency with which two topics are discussed together across channels
- **temporal_similarity**: Metric to track similarity over time
- **co-occurrence_rate**: Frequency of concurrent discussion of topics across channels
- **common_channels**: Number of shared channels discussing both topics
- **topic_trend_similarity**: Similarity in trends or changes in relevance for each topic

```mermaid
graph TD
%% Nodes
Channel["Channel<br>-------------------------<br>channel_id: Unique identifier<br>platform: Platform (e.g., Telegram)<br>name: Name of the channel<br>description: Description of the channel<br>created_at: Creation date<br>active_members_count: Number of active members<br>language: Language of the channel<br>region: Geographical region<br>activity_score: Posting activity score"]
Topic["Topic<br>-------------------------<br>topic_id: Unique identifier<br>name: Summary of the topic<br>keywords: List of key terms with scores<br>overall_score: Average or cumulative score<br>bertopic_metadata: BerTopic metadata<br>updated_at: Last updated timestamp"]
TopicUpdate["TopicUpdate<br>-------------------------<br>update_id: Unique identifier<br>channel_id: Associated channel<br>topic_id: Associated topic<br>keywords: Keywords from the update<br>score_delta: Change in topic score<br>timestamp: Update time"]
SemanticVector["SemanticVector<br>-------------------------<br>vector_id: Unique identifier<br>semantic_vector: Aggregated semantics<br>created_at: Creation date"]
%% Relationships
Channel -.-> ASSOCIATED_WITH["ASSOCIATED_WITH Relationship<br>-------------------------<br>channel_score: Cumulative or weighted score<br>keywords_weights: Channel-specific keywords and weights<br>message_count: Number of messages analyzed<br>last_updated: Timestamp of last update<br>score_decay_rate: Rate of score decay<br>trend: Topic trend over time"] --> Topic
Topic -.-> RELATED_TO["RELATED_TO Relationship<br>-------------------------<br>similarity_score: Degree of similarity<br>temporal_similarity: Time-based similarity<br>co-occurrence_rate: Co-occurrence of keywords<br>common_channels: Number of shared channels<br>topic_trend_similarity: Trend alignment"] --> Topic
TopicUpdate --> Topic
SemanticVector --> Channel
```

---
- **topic_trend_similarity**: Measure of similarity in topic trends across channels
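For illustration, the `similarity_score` and `co-occurrence_rate` metrics above could be computed along these lines. Both helpers are hypothetical sketches, assuming topic embeddings and per-topic channel sets are available:

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two topic embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def co_occurrence_rate(channels_a: set[str], channels_b: set[str]) -> float:
    """Jaccard-style overlap: fraction of all channels discussing either
    topic that discuss both."""
    union = channels_a | channels_b
    return len(channels_a & channels_b) / len(union) if union else 0.0
```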
8 changes: 6 additions & 2 deletions src/bert/concord.py
@@ -1,9 +1,13 @@
# concord.py

from bert.pre_process import preprocess_documents
from graph.schema import Topic


def concord(topic_model, documents):
def concord(
    topic_model,
    documents,
):
    # Load the dataset and limit to 100 documents
    print(f"Loaded {len(documents)} documents.")

@@ -40,4 +44,4 @@ def concord(topic_model, documents):
        print(f" {word_score_str}")

    print("\nTopic modeling completed.")
    return len(documents), None
    return len(documents), Topic.create_topic()
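Note that `create_topic` (per the schema change below) still expects `name`, `keywords`, and `bertopic_metadata`, so a caller would need to shape BERTopic's `(word, score)` pairs into keyword records first. A hypothetical helper; the record shape follows the schema doc's `{"term", "weight"}` example:

```python
def keywords_from_bertopic(word_scores: list[tuple[str, float]],
                           top_n: int = 5) -> list[dict]:
    """Convert BERTopic-style (word, score) pairs into the keyword/weight
    records the Topic node's `keywords` property expects (assumed shape)."""
    return [{"term": word, "weight": round(score, 4)}
            for word, score in word_scores[:top_n]]
```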
31 changes: 24 additions & 7 deletions src/graph/schema.py
@@ -10,11 +10,10 @@

# Relationship Models
class AssociatedWithRel(StructuredRel):
    channel_score = FloatProperty()
    topic_score = FloatProperty()
    keywords_weights = ArrayProperty()
    message_count = IntegerProperty()
    last_updated = DateTimeProperty()
    score_decay_rate = FloatProperty()
    trend = StringProperty()


@@ -60,14 +59,13 @@ def create_channel(cls, platform: str, name: str, description: str,

    def associate_with_topic(self, topic: 'Topic', channel_score: float,
                             keywords_weights: List[str], message_count: int,
                             score_decay_rate: float, trend: str) -> None:
                             trend: str) -> None:
        self.topics.connect(
            topic, {
                'channel_score': channel_score,
                'keywords_weights': keywords_weights,
                'message_count': message_count,
                'last_updated': datetime.utcnow(),
                'score_decay_rate': score_decay_rate,
                'trend': trend
            })

@@ -83,8 +81,8 @@ class Topic(StructuredNode):
    topic_id = UniqueIdProperty()
    name = StringProperty()
    keywords = ArrayProperty()
    overall_score = FloatProperty()
    bertopic_metadata = JSONProperty()
    topic_embedding = ArrayProperty()
    updated_at = DateTimeProperty(default_now=True)

    # Relationships
@@ -96,17 +94,24 @@ class Topic(StructuredNode):

    # Wrapper Functions
    @classmethod
    def create_topic(cls, name: str, keywords: List[str], overall_score: float,
    def create_topic(cls, name: str, keywords: List[str],
                     bertopic_metadata: Dict[str, Any]) -> 'Topic':
        """
        Create a new topic node with the given properties.
        """
        return cls(name=name,
                   keywords=keywords,
                   overall_score=overall_score,
                   bertopic_metadata=bertopic_metadata).save()

    def relate_to_topic(self, other_topic: 'Topic', similarity_score: float,
                        temporal_similarity: float, co_occurrence_rate: float,
                        common_channels: int,
                        topic_trend_similarity: float) -> None:
        """
        Create a relationship to another topic with various similarity metrics.
        """
        if not isinstance(other_topic, Topic):
            raise ValueError("The related entity must be a Topic instance.")
        self.related_topics.connect(
            other_topic, {
                'similarity_score': similarity_score,
@@ -118,10 +123,22 @@ def relate_to_topic(self, other_topic: 'Topic', similarity_score: float,

    def add_update(self, update_keywords: List[str],
                   score_delta: float) -> 'TopicUpdate':
        """
        Add an update to the topic with keyword changes and score delta.
        """
        update = TopicUpdate.create_topic_update(update_keywords, score_delta)
        update.topic.connect(self)
        return update

    def set_topic_embedding(self, embedding: List[float]) -> None:
        """
        Set the topic embedding vector, ensuring all values are floats.
        """
        if not all(isinstance(val, float) for val in embedding):
            raise ValueError("All elements in topic_embedding must be floats.")
        self.topic_embedding = embedding
        self.save()


class TopicUpdate(StructuredNode):
    update_id = UniqueIdProperty()
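The float guard added in `set_topic_embedding` can be exercised in isolation. The standalone mirror below is illustrative only; note that integer values such as `1` fail the `isinstance(val, float)` check, so callers may want to coerce elements with `float()` before saving:

```python
def validate_embedding(embedding: list) -> list:
    """Standalone mirror of the guard in Topic.set_topic_embedding:
    every element must be a float before the vector is persisted."""
    if not all(isinstance(val, float) for val in embedding):
        raise ValueError("All elements in topic_embedding must be floats.")
    return embedding
```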