Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FIX] Passing FS param to handle writes for metadata #158

Open
wants to merge 16 commits into
base: main
Choose a base branch
from

Conversation

harini-venkataraman
Copy link
Contributor

What

PR to fix write errors while metadata is writted to remote storage.
...

Why

Metadata is needed for processing highlight data
...

How

Passing the fs instance argument to method processed for writting when hightlihgt is enabled.
...

Relevant Docs

Not applicable.
...

Related Issues or PRs

Not applicable.
...

Dependencies Versions / Env Variables

Not applicable.
...

Notes on Testing

Create a project, upload a file and index the file.
metadata file is to be present in remote storage.
...

Screenshots

image

...

Checklist

I have read and understood the Contribution Guidelines.

metadata_path = base_path / ToolExecKey.METADATA_FILE
with metadata_path.open("w", encoding="utf-8") as f:
f.write(ToolUtils.json_to_str(metadata))
base_path = self.execution_dir
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@harini-venkataraman are _get_data_dir() and execution_dir equivalent?

Copy link
Contributor Author

@harini-venkataraman harini-venkataraman Jan 31, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gaya3-zipstack Just pushed a commit fixing the same. Both paths are not same.

@@ -268,9 +271,15 @@ def _update_exec_metadata(self) -> None:
else:
self._exec_metadata[MetadataKey.TOOL_META].append(tool_metadata)

self._write_exec_metadata(metadata=self._exec_metadata)
self._write_exec_metadata(
metadata=self._exec_metadata, fs=self.workflow_filestorage
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will self.workflow_filestorage work here? Or should we have fs passed in here also?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It will work here, since we set it in the tool itself. The extract_text method is directly called from classifer, and we are setting these values in spinning. Ref - https://github.com/Zipstack/unstract/blob/10f48e098afb051e5bd12372ea84f3db7546de1d/tools/classifier/src/main.py#L63

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • We are using fs in some places and self.workflow_filestorage in others—are these different, or should they be the same
  • Would it make sense to set fs as a class property (self.fs) for consistency?

Copy link
Contributor

@gaya3-zipstack gaya3-zipstack Jan 31, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ideally passing around the filestorage would have been ideal. But for tools, we have workflow_filestorage. Everywhere else, it is fs as those are all APIs which the services call directly.. @muhammad-ali-e Havingm an instance variable for filestorage for each tool does make sense as this use case is slighly different from other APIs. But other than that, you can point out why we decided to do it this way.

pyproject.toml Outdated
Comment on lines 84 to 85
"gcsfs",
"s3fs"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gaya3-zipstack @harini-venkataraman I think we should pin it to a major version while allowing the minor and patch version selection. Not mentioning any version would not be a good idea.

cc: @hari-kuriakose @chandrasekharan-zipstack

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed @ritwik-g

with metadata_path.open("w", encoding="utf-8") as f:
f.write(ToolUtils.json_to_str(metadata))
base_path = self.execution_dir
if fs and fs.provider.value == "gcs":
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will not suggest checks for the provider here to get the logic implemented.
Had this comment - https://github.com/Zipstack/unstract/pull/1106/files#r1937038323
OR
In _get_data_dir, we are looking for env var TOOL_DATA_DIR which is what non remote storage uses. In case of remote storage, we only use EXECTION_DATA_DIR. Rigth @muhammad-ali-e ?

Can we use this to understand how to get the base_path depending on which var is available?
All said and done, these pieces can be removed once integration stabilises... So....

@muhammad-ali-e We will need your expertise here to avoid re-work ....

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants