-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[FIX] Passing FS param to handle writes for metadata #158
base: main
Are you sure you want to change the base?
Conversation
metadata_path = base_path / ToolExecKey.METADATA_FILE | ||
with metadata_path.open("w", encoding="utf-8") as f: | ||
f.write(ToolUtils.json_to_str(metadata)) | ||
base_path = self.execution_dir |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@harini-venkataraman are _get_data_dir() and execution_dir equivalent?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@gaya3-zipstack Just pushed a commit fixing the same. Both paths are not same.
@@ -268,9 +271,15 @@ def _update_exec_metadata(self) -> None: | |||
else: | |||
self._exec_metadata[MetadataKey.TOOL_META].append(tool_metadata) | |||
|
|||
self._write_exec_metadata(metadata=self._exec_metadata) | |||
self._write_exec_metadata( | |||
metadata=self._exec_metadata, fs=self.workflow_filestorage |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will self.workflow_filestorage work here? Or should we have fs passed in here also?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It will work here, since we set it in the tool itself. The extract_text method is directly called from classifer, and we are setting these values in spinning. Ref - https://github.com/Zipstack/unstract/blob/10f48e098afb051e5bd12372ea84f3db7546de1d/tools/classifier/src/main.py#L63
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- We are using fs in some places and self.workflow_filestorage in others—are these different, or should they be the same
- Would it make sense to set fs as a class property (self.fs) for consistency?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ideally passing around the filestorage would have been ideal. But for tools, we have workflow_filestorage. Everywhere else, it is fs as those are all APIs which the services call directly.. @muhammad-ali-e Havingm an instance variable for filestorage for each tool does make sense as this use case is slighly different from other APIs. But other than that, you can point out why we decided to do it this way.
pyproject.toml
Outdated
"gcsfs", | ||
"s3fs" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@gaya3-zipstack @harini-venkataraman I think we should pin it to a major version while allowing the minor and patch version selection. Not mentioning any version would not be a good idea.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Addressed @ritwik-g
with metadata_path.open("w", encoding="utf-8") as f: | ||
f.write(ToolUtils.json_to_str(metadata)) | ||
base_path = self.execution_dir | ||
if fs and fs.provider.value == "gcs": |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will not suggest checks for the provider here to get the logic implemented.
Had this comment - https://github.com/Zipstack/unstract/pull/1106/files#r1937038323
OR
In _get_data_dir, we are looking for env var TOOL_DATA_DIR which is what non remote storage uses. In case of remote storage, we only use EXECTION_DATA_DIR. Rigth @muhammad-ali-e ?
Can we use this to understand how to get the base_path depending on which var is available?
All said and done, these pieces can be removed once integration stabilises... So....
@muhammad-ali-e We will need your expertise here to avoid re-work ....
What
PR to fix write errors while metadata is writted to remote storage.
...
Why
Metadata is needed for processing highlight data
...
How
Passing the fs instance argument to method processed for writting when hightlihgt is enabled.
...
Relevant Docs
Not applicable.
...
Related Issues or PRs
Not applicable.
...
Dependencies Versions / Env Variables
Not applicable.
...
Notes on Testing
Create a project, upload a file and index the file.
metadata file is to be present in remote storage.
...
Screenshots
...
Checklist
I have read and understood the Contribution Guidelines.