
feat(ux): include basename of path in generated table names in read_*() #10522

Open

NickCrews wants to merge 5 commits into main from path-in-read

Conversation

@NickCrews NickCrews commented Nov 22, 2024

I often read in many files, and then when I do con.tables, it is not obvious which is which.
This should make it a little more user-friendly.

I don't think we need to be perfect here; I was just aiming for slightly better. For example, if you pass a collection of paths, we arbitrarily pick the first one as representative. Same with globs: the generated name isn't always the prettiest. If the user wants more precise semantics, they can name the table themselves.

I also simplified the polars.read_parquet() logic while I was in there; that was needed in order to make this change.

I considered a different API of def gen_name_from_paths(*paths: str | Path, file_type: str) -> str:, but decided that choosing the first file should be the caller's responsibility. IDK, I can switch to this if you like.

@github-actions github-actions bot added clickhouse The ClickHouse backend pyspark The Apache PySpark backend datafusion The Apache DataFusion backend duckdb The DuckDB backend polars The polars backend snowflake The Snowflake backend flink Issues or PRs related to Flink labels Nov 22, 2024
@NickCrews NickCrews force-pushed the path-in-read branch 2 times, most recently from 23b7669 to 1fc10ca Compare December 3, 2024 04:30
ibis/util.py — outdated review threads (resolved)
@github-actions github-actions bot added the tests Issues or PRs related to tests label Dec 5, 2024
basename = re.sub(r"[^a-zA-Z0-9_]", "_", basename)
# MySQL has a limit of 64 characters for table names.
# Let's not give users runtime errors because of this.
basename = basename[-25:]
NickCrews (Contributor, Author):
do we want to push this logic down into gen_name() so we never create a too-long name?

This is an unstable API and shouldn't be relied on.
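For context, a plausible sketch of the helper under review here. The function name matches the one added in the diff below, but the exact suffixing scheme and the `ibis_read_` prefix are assumptions for illustration, not the PR's actual code:

```python
from __future__ import annotations

import re
import uuid
from pathlib import Path


def gen_name_from_path(path: str | Path) -> str:
    """Derive a readable, collision-free table name from a file path (sketch)."""
    stem = Path(path).stem  # e.g. "sales-2024" from "/data/sales-2024.csv"
    # replace anything that isn't a safe identifier character
    stem = re.sub(r"[^a-zA-Z0-9_]", "_", stem)
    # MySQL caps table names at 64 characters; keep only the tail of the stem
    stem = stem[-25:]
    # append a short unique suffix so repeated reads of the same file don't collide
    return f"ibis_read_{stem}_{uuid.uuid4().hex[:8]}"


name = gen_name_from_path("/data/sales-2024.csv")
```

The truncation-then-suffix order matters: truncating after appending the unique suffix could chop the suffix off and reintroduce collisions.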
NickCrews (Contributor, Author):

Failing tests are revealing that our old tests were buggy: we currently verify that register() worked by doing assert "my_table" in con.list_tables(). But some test cases reuse the same name ("my_table") and never clean up after themselves, so even if register() doesn't work as intended, the test still passes because that table is still left around from the previous test case.

I think this means we should be more careful about cleaning up after ourselves. I started adding a helper class/context manager for this, but it was getting too complex and wasn't related to this PR, so I think it should happen in a follow-up PR. So that the idea is recorded somewhere, here is what that helper class could look like:

class BackendJanitor:
    """Cleans up tables created during a test.

    Examples
    --------
    >>> con = ibis.sqlite.connect(":memory:")
    >>> con.create_table("foo", schema=dict(x="int64"))
    >>> with BackendJanitor(con) as janitor:
    ...     con.create_table("bar", schema=dict(y="int64"))
    >>> con.list_tables()
    ['foo']
    >>> janitor.tables_final
    {'bar', 'foo'}
    """

    def __init__(self, con: ibis.BaseBackend):
        self.con = con
        self.tables_initial = set(con.list_tables())

    @property
    def tables_now(self) -> set[str]:
        return set(self.con.list_tables())

    @property
    def tables_final(self) -> set[str]:
        """The set of tables that existed at the point of the last call to `clean_up`."""
        if not hasattr(self, "_tables_final"):
            raise ValueError("need to call `clean_up` first")
        return self._tables_final

    def clean_up(self):
        """Remove all tables created since initialization."""
        self._tables_final = self.tables_now
        for table_name in self.tables_final - self.tables_initial:
            self.con.drop_table(table_name)

    def __enter__(self):
        return self

    def __exit__(self, *args):
        self.clean_up()

NickCrews (Contributor, Author):

ehh, man, I really don't love how I implemented the ensured-cleanup of the created tables. I will do that properly in another PR before/after this if you want.

gforsyth (Member) commented Dec 6, 2024

> ehh, man, I really don't love how I implemented the ensured-cleanup of the created tables. I will do that properly in another PR before/after this if you want.

Yeah, I would suggest just fixing the tests manually for now and then we can implement something more automatic later. To some degree, any solution here requires humans to do the right thing -- I think this is probably a good job for a context manager that generates table names, so you have something like:

with test_table(con) as table_name:
    con.create_table(table_name, some_obj)

and then we clean up at __exit__. It's also possible I've already written this elsewhere in the code base...

ah, similar but slightly different:

@contextlib.contextmanager
def create_and_destroy_db(con):
    con.create_database(dbname := gen_name("db"))
    try:
        yield dbname
    finally:
        con.drop_database(dbname)

backend = table._find_backend()
name = table.get_name()
try:
    backend.drop_table(name)
NickCrews (Contributor, Author) commented Dec 6, 2024:

This is one example use case that may inform the API we design in #8382. For example, a singular con.drop(table_or_view) would be nice.

cpcloud (Member) left a comment:

I really do not like this idea.

@@ -522,6 +522,26 @@ def gen_name(namespace: str) -> str:
return f"ibis_{namespace}_{uid}"


def gen_name_from_path(path: str | Path) -> str:
Member comment:
I think the idea of baking any part of the path into an identifier is a very bad idea.

Lots of software has weird limitations around paths that we should not bring into Ibis, like the MySQL table-name limit you've already discovered.

In particular, paths can be arbitrary byte sequences, they're not required to be UTF-8, Latin-1, anything in between, or anything in particular except a sequence of bytes.

Is this implementation prepared to handle that? If not, then we shouldn't do it, because it's adding a new set of unnecessary assumptions. If you need to keep track of what tables map to what files, DuckDB has an argument for that, or if you're not using DuckDB then it's better done upstream in the system that does the reading.
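The "paths are arbitrary bytes" point is visible directly in Python: on POSIX, undecodable bytes survive os.fsdecode() as lone surrogates (PEP 383), so any name-sanitizing step has to cope with them. A sketch of what that looks like, not the PR's code:

```python
import os
import re

# On POSIX, os.fsdecode(b"/data/sales\xffreport.parquet") yields a str in
# which the undecodable byte 0xff survives as the lone surrogate \udcff:
path = "/data/sales\udcffreport.parquet"

stem = os.path.splitext(os.path.basename(path))[0]  # "sales\udcffreport"

# a lone surrogate is not in [a-zA-Z0-9_], so the same sanitization pass
# replaces it too, leaving a plain ASCII identifier
safe = re.sub(r"[^a-zA-Z0-9_]", "_", stem)
```

Note that `safe` can be encoded and printed normally, while the original `path` cannot even be encoded as UTF-8 without an error handler, which is the crux of the objection above.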

NickCrews (Contributor, Author):
I can write some code to handle the "random sequence of bytes" case. I think it would just replace those bytes with an empty string, or an underscore, or something. I don't think the exact behavior matters much, as long as it doesn't error, because this case should be rare. That also means the implementation should be fairly simple and maintainable.

If I come up with something not terrible would you consider it @cpcloud? I strongly still like the main idea of this PR.

cpcloud (Member) commented Dec 16, 2024:

I really don't like the idea of replacing sequences of bytes with anything.

What are we gaining by doing any of this, that makes it a worthwhile maintenance tradeoff? In other words, why should we bake in new assumptions about paths?

Right now, I don't see an improvement from all the effort here that justifies the maintenance burden.
