
feat(ux): include basename of path in generated table names in read_*() #10522

Open

NickCrews wants to merge 5 commits into main from path-in-read

Conversation

@NickCrews NickCrews commented Nov 22, 2024

I often read in many files, and then when I do con.tables, it is not obvious which is which.
This should make it a little more user-friendly.

I don't think we need to be perfect here; I was just aiming for slightly better. For example, if you pass a collection of paths, we arbitrarily pick the first one as representative. Same with globs: the generated name isn't always the prettiest. If the user wants more precise semantics, they can name the table themselves.

I also simplified the polars.read_parquet() logic while I was in there; that was needed in order to make this change.

I considered a different API of def gen_name_from_paths(*paths: str | Path, file_type: str) -> str:, but decided that choosing the first file should be the caller's responsibility. IDK, I can switch to this if you like.

@github-actions github-actions bot added clickhouse The ClickHouse backend pyspark The Apache PySpark backend datafusion The Apache DataFusion backend duckdb The DuckDB backend polars The polars backend snowflake The Snowflake backend flink Issues or PRs related to Flink labels Nov 22, 2024
@NickCrews NickCrews force-pushed the path-in-read branch 2 times, most recently from 23b7669 to 1fc10ca Compare December 3, 2024 04:30
ibis/util.py — outdated review threads (resolved)
@github-actions github-actions bot added the tests Issues or PRs related to tests label Dec 5, 2024
basename = re.sub(r"[^a-zA-Z0-9_]", "_", basename)
# MySQL has a limit of 64 characters for table names.
# Let's not give users runtime errors because of this.
basename = basename[-25:]
NickCrews (Contributor, Author):
do we want to push this logic down into gen_name() so we never create a too-long name?

This is an unstable API and shouldn't be relied on.
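For context, a plausible sketch of the helper under review here. The function name matches the one added in the diff below, but the exact suffixing scheme and the `ibis_read_` prefix are assumptions for illustration, not the PR's actual code:

```python
from __future__ import annotations

import re
import uuid
from pathlib import Path


def gen_name_from_path(path: str | Path) -> str:
    """Derive a readable, collision-free table name from a file path (sketch)."""
    stem = Path(path).stem  # e.g. "sales-2024" from "/data/sales-2024.csv"
    # replace anything that isn't a safe identifier character
    stem = re.sub(r"[^a-zA-Z0-9_]", "_", stem)
    # MySQL caps table names at 64 characters; keep only the tail of the stem
    stem = stem[-25:]
    # append a short unique suffix so repeated reads of the same file don't collide
    return f"ibis_read_{stem}_{uuid.uuid4().hex[:8]}"


name = gen_name_from_path("/data/sales-2024.csv")
```

The truncation-then-suffix order matters: truncating after appending the unique suffix could chop the suffix off and reintroduce collisions.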
NickCrews (Contributor, Author):

Failing tests are revealing that our old tests were buggy: we currently verify that register() worked by doing assert "my_table" in con.list_tables(). But some test cases reuse the same name ("my_table") and never clean up after themselves, so even if register() doesn't work as intended, the test still passes because that table is still left around from the previous test case.

I think this means we should be more careful about cleaning up after ourselves. I started adding a helper class/context manager for this, but it was getting too complex and wasn't related to this PR, so I think it should happen in a follow-up PR. So that the idea is recorded somewhere, here is what that helper class could look like:

class BackendJanitor:
    """Cleans up tables created during a test.

    Examples
    --------
    >>> con = ibis.sqlite.connect(":memory:")
    >>> con.create_table("foo", schema=dict(x="int64"))
    >>> with BackendJanitor(con) as janitor:
    ...     con.create_table("bar", schema=dict(y="int64"))
    >>> con.list_tables()
    ['foo']
    >>> janitor.tables_final
    {'bar', 'foo'}
    """

    def __init__(self, con: ibis.BaseBackend):
        self.con = con
        self.tables_initial = set(con.list_tables())

    @property
    def tables_now(self) -> set[str]:
        return set(self.con.list_tables())

    @property
    def tables_final(self) -> set[str]:
        """The set of tables that existed at the point of the last call to `clean_up`."""
        if not hasattr(self, "_tables_final"):
            raise ValueError("need to call `clean_up` first")
        return self._tables_final

    def clean_up(self):
        """Remove all tables created since initialization."""
        self._tables_final = self.tables_now
        for table_name in self.tables_final - self.tables_initial:
            self.con.drop_table(table_name)

    def __enter__(self):
        return self

    def __exit__(self, *args):
        self.clean_up()

NickCrews (Contributor, Author):

ehh, man, I really don't love how I implemented the ensured-cleanup of the created tables. I will do that properly in another PR before/after this if you want.

gforsyth (Member) commented Dec 6, 2024

> ehh, man, I really don't love how I implemented the ensured-cleanup of the created tables. I will do that properly in another PR before/after this if you want.

Yeah, I would suggest just fixing the tests manually for now and then we can implement something more automatic later. To some degree, any solution here requires humans to do the right thing -- I think this is probably a good job for a context manager that generates table names, so you have something like:

with test_table(con) as table_name:
    con.create_table(table_name, some_obj)

and then we clean up at __exit__. It's also possible I've already written this elsewhere in the code base...

ah, similar but slightly different:

@contextlib.contextmanager
def create_and_destroy_db(con):
    con.create_database(dbname := gen_name("db"))
    try:
        yield dbname
    finally:
        con.drop_database(dbname)

backend = table._find_backend()
name = table.get_name()
try:
    backend.drop_table(name)
NickCrews (Contributor, Author) commented Dec 6, 2024:

This is one example use case that may inform the API we design in #8382. For example, a singular con.drop(table_or_view) would be nice.

cpcloud (Member) left a comment:

I really do not like this idea.

@@ -522,6 +522,26 @@ def gen_name(namespace: str) -> str:
return f"ibis_{namespace}_{uid}"


def gen_name_from_path(path: str | Path) -> str:
Member comment:
I think the idea of baking any part of the path into an identifier is a very bad idea.

Lots of software has weird limitations around paths that we should not bring into Ibis, like the MySQL table-name limit you've already discovered.

In particular, paths can be arbitrary byte sequences, they're not required to be UTF-8, Latin-1, anything in between, or anything in particular except a sequence of bytes.

Is this implementation prepared to handle that? If not, then we shouldn't do it, because it's adding a new set of unnecessary assumptions. If you need to keep track of what tables map to what files, DuckDB has an argument for that, or if you're not using DuckDB then it's better done upstream in the system that does the reading.
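The "paths are arbitrary bytes" point is visible directly in Python: on POSIX, undecodable bytes survive os.fsdecode() as lone surrogates (PEP 383), so any name-sanitizing step has to cope with them. A sketch of what that looks like, not the PR's code:

```python
import os
import re

# On POSIX, os.fsdecode(b"/data/sales\xffreport.parquet") yields a str in
# which the undecodable byte 0xff survives as the lone surrogate \udcff:
path = "/data/sales\udcffreport.parquet"

stem = os.path.splitext(os.path.basename(path))[0]  # "sales\udcffreport"

# a lone surrogate is not in [a-zA-Z0-9_], so the same sanitization pass
# replaces it too, leaving a plain ASCII identifier
safe = re.sub(r"[^a-zA-Z0-9_]", "_", stem)
```

Note that `safe` can be encoded and printed normally, while the original `path` cannot even be encoded as UTF-8 without an error handler, which is the crux of the objection above.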

NickCrews (Contributor, Author):
I can write some code to handle the "random sequence of bytes" case. I think it would just replace those bytes with an empty string, or an underscore, or something. I don't think the exact behavior matters much, as long as it doesn't error, because this case should be rare. That also means the implementation should be fairly simple and maintainable.

If I come up with something not terrible would you consider it @cpcloud? I strongly still like the main idea of this PR.

cpcloud (Member) commented Dec 16, 2024:

I really don't like the idea of replacing sequences of bytes with anything.

What are we gaining by doing any of this, that makes it a worthwhile maintenance tradeoff? In other words, why should we bake in new assumptions about paths?

Right now, I don't see an improvement from all the effort here that justifies the maintenance burden.
