
Synthetic data generates on demand rather than on schedule #63

Closed
wants to merge 6 commits into from
Conversation


@yaroslavyaroslav yaroslavyaroslav commented Oct 13, 2024

PR Overview

TLDR

This PR is intended to provide on-demand synthetic data generation, where "on demand" means in addition to each request received, rather than on a fixed schedule.

Implementation details

This PR consists of two parts:

  • The main part: the caching logic for heavy computational tasks, plus the scheduling logic changed from a schedule loop to enhancing the data of any given task received.
  • The workaround part, which is required to make this behavior feasible in an isolated environment (with nodes, miners, and all the related machinery disabled).

Main part of PR

Main part consists of the following changes:

  • refresh_sythetic_data.py has been (well, not completely yet, but it is supposed to be) rewritten to contain nothing but a cache warmup function.
  • schedule_synthetic_queries.py is where most of the changes have been made:
    • _fetch_tasks_in_chunks reads a chunk of new organic tasks to process, if any.
    • _enhance_task enhances the initial raw task with unique data. It is currently implemented only for chat and text-to-image tasks; since it is nothing more than a switch on task kind, it can be extended effortlessly.
    • _push_to_query_node pushes enhanced tasks to a query_node.
    • warmup_function implements the service warmup logic; it is a separate function for the sake of clarifying code intent.
    • schedule_synthetics_until_done has been rewritten to reflect the new approach: enhance any given amount of tasks with synthetic data.
  • execution_cycle.py — a warmup_function call has been added on service start.
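The flow described above can be sketched roughly as follows. This is a minimal runnable sketch, not the PR's actual code: the task shape and the enhancement bodies are assumptions, and only the function names (`_enhance_task`, `schedule_synthetics_until_done`) come from the PR description.

```python
import asyncio
import random
from dataclasses import dataclass
from typing import Any

# Task kinds mentioned in the PR; constants are illustrative.
CHAT = "chat"
TEXT_TO_IMAGE = "text-to-image"


@dataclass
class RawTask:
    # Hypothetical task shape for illustration only.
    kind: str
    payload: dict[str, Any]


def _enhance_task(task: RawTask) -> RawTask:
    """Attach unique synthetic data to a raw task.

    A switch on task kind, as the PR describes; supporting a new
    kind just means adding another branch.
    """
    if task.kind == CHAT:
        task.payload["prompt"] = f"synthetic-prompt-{random.randint(0, 10**6)}"
    elif task.kind == TEXT_TO_IMAGE:
        task.payload["seed"] = random.randint(0, 2**32 - 1)
    else:
        raise ValueError(f"unsupported task kind: {task.kind}")
    return task


async def schedule_synthetics_until_done(tasks: list[RawTask]) -> list[RawTask]:
    """Enhance every received task with synthetic data (no schedule loop)."""
    return [_enhance_task(t) for t in tasks]
```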

Work around part of PR

  • cli.py — a CLI tool that allows a developer to manually send an arbitrary amount of tasks of any kind to the control_node, to check the whole -> control_node -> query_node pipeline.
  • redis_constants.py — the CONTROL_NODE_QUEUE_KEY and CONTROL_TASK_CHUNK_SIZE constants have been added; they are used in cli.py and in the queue that the modified schedule_synthetics_until_done function reads raw tasks from.
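The queue mechanics described above can be sketched as follows. To keep the sketch runnable without a Redis server, an in-memory deque stands in for the Redis list (in the PR, cli.py would presumably push to Redis under CONTROL_NODE_QUEUE_KEY instead); the task payload shape and constant values are assumptions.

```python
import json
from collections import deque

# Assumed values mirroring the constants the PR adds to redis_constants.py.
CONTROL_NODE_QUEUE_KEY = "control_node_queue"
CONTROL_TASK_CHUNK_SIZE = 2

# In-memory stand-in for the Redis list keyed by CONTROL_NODE_QUEUE_KEY.
queues: dict[str, deque[str]] = {CONTROL_NODE_QUEUE_KEY: deque()}


def push_tasks(kind: str, amount: int) -> None:
    """What cli.py does conceptually: enqueue `amount` raw tasks of `kind`."""
    for i in range(amount):
        queues[CONTROL_NODE_QUEUE_KEY].append(json.dumps({"kind": kind, "id": i}))


def fetch_tasks_in_chunks() -> list[dict]:
    """Pop up to CONTROL_TASK_CHUNK_SIZE raw tasks from the queue, if any."""
    chunk: list[dict] = []
    q = queues[CONTROL_NODE_QUEUE_KEY]
    while q and len(chunk) < CONTROL_TASK_CHUNK_SIZE:
        chunk.append(json.loads(q.popleft()))
    return chunk
```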

Closes: #55

@yaroslavyaroslav yaroslavyaroslav marked this pull request as ready for review October 13, 2024 18:53
Collaborator

@namoray namoray left a comment

it's really hard to review this PR because a massive amount of code has been refactored which (a) did not need to be, and (b) is not relevant to the changes we're trying to make here

I'm also not sure what the PR description means, or how this solution solves the problem. I believe it changes the functionality of the whole system too much

@@ -43,25 +44,25 @@ async def _post_vali_stats(config: Config):
)


async def get_nodes_and_contenders(config: Config) -> list[Contender] | None:
async def get_nodes_and_contenders(config: Config) -> List[Contender] | None:
Collaborator

Why refactor the type hints? Prefer the lowercase
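For context on the reviewer's preference: since Python 3.9 (PEP 585), the built-in `list` can be subscripted directly in annotations, making `typing.List` redundant. A quick illustration:

```python
from typing import List  # pre-3.9 style, now unnecessary


def old_style(xs: List[int]) -> List[int]:
    # Works, but requires importing from typing.
    return xs


def new_style(xs: list[int]) -> list[int]:
    # Since Python 3.9 the builtin type itself is generic.
    return xs
```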

logger.info("Starting cycle...")
if config.refresh_nodes:
logger.info("First refreshing metagraph and storing the nodes")
Collaborator

Why is all this refactored?

@@ -70,29 +71,25 @@ async def main(config: Config) -> None:
time_to_sleep_if_no_contenders = 20
contenders = await get_nodes_and_contenders(config)

# date_to_delete = datetime(2024, 10, 13, 30)
Collaborator

Why is this refactored?

# await delete_task_data_older_than_date(connection, date_to_delete)
# await delete_contender_history_older_than(connection, date_to_delete)
# await delete_reward_data_older_than(connection, date_to_delete)
await warmup_function(config=config)
Collaborator

'warmup_function' is a bad name -> what is it doing?

remaining_requests: int

def __lt__(self, other: "TaskScheduleInfo"):
Collaborator

Why is this refactored?
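For context, a `__lt__` like the one in the diff above is typically what lets instances be ordered in a heap-based scheduler, since `heapq` compares items with `<`. A minimal sketch with assumed field names, not the project's actual class:

```python
import heapq
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class TaskScheduleInfo:
    # Field names are illustrative; only next_schedule_time drives ordering.
    next_schedule_time: datetime
    remaining_requests: int = 0

    def __lt__(self, other: "TaskScheduleInfo") -> bool:
        # heapq uses `<`, so the earliest-scheduled task surfaces first.
        return self.next_schedule_time < other.next_schedule_time


now = datetime.now()
heap: list[TaskScheduleInfo] = []
heapq.heappush(heap, TaskScheduleInfo(now + timedelta(seconds=5)))
heapq.heappush(heap, TaskScheduleInfo(now))
soonest = heapq.heappop(heap)
```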

break

if i % 100 == 0:
Collaborator

Why has all this logic gone? We've completely changed how we schedule synthetic queries now, right? All this logic must remain
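For context, an `if i % 100 == 0` guard like the one in the diff above is a common pattern for running periodic housekeeping (logging progress, refreshing cached state) inside a long scheduling loop without paying that cost on every iteration. A generic sketch of the pattern, not the project's code:

```python
def run_schedule_loop(total_iterations: int) -> int:
    """Count how often periodic housekeeping fires during a long loop."""
    housekeeping_runs = 0
    for i in range(total_iterations):
        if i % 100 == 0:
            # e.g. log progress or refresh cached scheduling state here
            housekeeping_runs += 1
        # ... schedule one synthetic query per iteration ...
    return housekeeping_runs
```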
