Investigate the possibility of removing dataframe field from Document #8627

Open
anakin87 opened this issue Dec 11, 2024 · 1 comment

@anakin87
Member

I've been thinking about dropping the `dataframe` field from the Haystack `Document` dataclass for a few reasons:

  • Users are already using text representations (CSV, Markdown), which LLMs handle well, even for data that was originally tabular.
  • Pandas DataFrames create serialization headaches (e.g. in Hayhooks).
  • Pandas is a heavy dependency that complicates deployment, especially in serverless environments like Lambda. We could make it optional.
  • Supporting dataframes across different Document Stores requires complex workarounds.
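
To illustrate the first two points, here is a minimal sketch of storing tabular data as plain text. It uses only the standard library. `TextDocument` is a hypothetical stand-in, not the actual Haystack `Document` class, and the helper names and sample rows are made up for illustration.

```python
import csv
import io
import json
from dataclasses import asdict, dataclass, field


@dataclass
class TextDocument:
    """Hypothetical text-only document (NOT the Haystack Document class)."""
    content: str
    meta: dict = field(default_factory=dict)


def rows_to_csv(header, rows):
    """Render tabular data as CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


def rows_to_markdown(header, rows):
    """Render tabular data as a Markdown table."""
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    for row in rows:
        lines.append("| " + " | ".join(str(cell) for cell in row) + " |")
    return "\n".join(lines)


# Made-up sample data, for illustration only.
header = ["name", "score"]
rows = [["a", 1], ["b", 2]]

csv_doc = TextDocument(content=rows_to_csv(header, rows), meta={"format": "csv"})
md_doc = TextDocument(content=rows_to_markdown(header, rows), meta={"format": "markdown"})

# A text-only document is a plain dict of strings, so JSON serialization
# needs no custom encoder.
serialized = json.dumps(asdict(csv_doc))
```

With this shape, the tabular content travels as an ordinary string, so no pandas import is needed to store or serialize the document.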

I will reach out to internal and external users to validate my assumptions.
We should also investigate how impactful this change would be.

@anakin87
Member Author

@EdAbati, the author of dataframes-haystack, confirmed that this idea makes sense to him.
@sjrl agrees too.

@julian-risch julian-risch added the P2 Medium priority, add to the next sprint if no P1 available label Dec 12, 2024
@anakin87 anakin87 self-assigned this Jan 2, 2025