brackets for table and column quoting instead of backticks #3483

snth · 2023-09-07T22:48:05Z

snth
Sep 7, 2023
Maintainer

One great use case for PRQL is in working with data in the terminal; see for example prql-query and my recent tweet on turning UK Census data csv files into parquet files. The code from that is (slightly edited):

census() {
wget https://nomisweb.co.uk/output/census/2021/$1.zip; unzip -d $1 $1.zip
q=$(printf '|append (from `%s`)' $1/$1-*.csv)
echo "${q/#|append/}" | prqlc compile | echo -e "COPY (`cat`\n) TO '$1.parquet'" | duckdb
}
for f in census2021-ts{001..079}; do census $f; done

The problem is that when table and column names have to be quoted, like `table`, the backticks have a special meaning in the shell (command substitution). To avoid that behaviour you have to wrap your PRQL code in single quotes '', which in turn means you miss out on many of the string interpolation features that make shells such great tools for interactive work.

So what alternative quoting characters could we look at? The single-quotes and double-quotes are already heavily used and there are reasons why we don't use them for this so we don't need to revisit that. This Modern SQL Style Guide reminded me that SQL Server uses brackets for this like [table].[column]. I've never been much of a fan of this but after exploring some thoughts around this, I've come to the conclusion that it's actually a very good option.

Your first concern about this might be that we want to use brackets for arrays/lists so won't this create an ambiguity/inconsistency? I have two responses to this:

1. The cases I'm proposing this for would be single-element arrays, and they would be of unquoted identifiers vs arrays of strings, so would there really be an ambiguity? Compare for example the following two: [my column] vs ["my string"].
2. Even if there is no ambiguity at the parser/lexical level, doesn't it create a logical inconsistency in that it makes things less clear for the reader? After some consideration I would argue that we might actually come to see this as a feature. As part of the type system discussions we have already discussed ideas around columns really being arrays of values (they are lists of homogeneously typed values). Therefore an expression like [column] would actually aid in expressing this visually.

Take for example the first example from the aggregation tutorial page:

from invoices
aggregate { grand_total = sum total }

with the proposed syntax this could be written as:

from invoices
aggregate { grand_total = sum [total] }

which I think actually quite nicely communicates that you are summing an array of values.

Of course the bracket quoting wouldn't be necessary in the example above and would only be required when there is something like a space in the column name like:

from invoices
aggregate { grand_total = sum [invoice total] }

This would also resolve the CLI usage problem with the backticks.

One question that arose for me is what would we use on the LHS of assignments? Would we use the same? Some languages have different rules for this but I think it would make sense to stay consistent.

from invoices
aggregate { [grand total] = sum [invoice total] }
derive [double total] = 2 * [grand total]

I still want to test this out in a few more scenarios to see how it interacts with other parts of the language, but so far it is looking very promising to me.

snth · 2023-09-07T22:57:11Z

snth
Sep 7, 2023
Maintainer Author

One concern I had was around array literals but I think they would actually be ok.

So instead of

from [
  {`small number`=1e-10, `large number`=1e10},
  {`small number`=2e-10, `large number`=2e10},
]
select {`small number`, `large number`}

we would write:

from [
  {[small number]=1e-10, [large number]=1e10},
  {[small number]=2e-10, [large number]=2e10},
]
select {[small number], [large number]}

The brackets in the from aren't great but they're not bad either. However in the select this proposed syntax actually makes a lot of sense since to me it says that here you have a tuple of two arrays which is exactly what you have!

0 replies

eitsupi · 2023-09-07T22:59:41Z

eitsupi
Sep 7, 2023
Maintainer

> you miss out on many of the string interpolation features that make shells such great tools for interactive work.

What does this mean?

6 replies

eitsupi Sep 8, 2023
Maintainer

Thanks.

But I believe single quotes are the way to go for this type of use.
For example, if we don't use single quotes, we can't use a literal dollar sign $ in the string, right?

$ echo "$foo"

$ echo '$foo'
$foo

Similarly, I think it is safe to use single quotes in your usage: just close and reopen the single quotes around the parts you want Bash to interpret.

$ for f in file1.csv file2.csv file3.csv; do echo "from `$f`"; done
bash: file1.csv: command not found
from
bash: file2.csv: command not found
from
bash: file3.csv: command not found
from
$ for f in file1.csv file2.csv file3.csv; do echo 'from `$f`'; done
from `$f`
from `$f`
from `$f`
$ for f in file1.csv file2.csv file3.csv; do echo 'from `'$f'`'; done
from `file1.csv`
from `file2.csv`
from `file3.csv`
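
The quoting pattern above can be cross-checked with Python's standard shlex module. This is only a sketch: the query strings are made-up examples, shell_literal is a hypothetical helper name, and shlex.quote is used here just to show what a POSIX shell needs in order to receive the text unchanged.

```python
import shlex


def shell_literal(text: str) -> str:
    """Return `text` quoted so that a POSIX shell passes it through unchanged."""
    return shlex.quote(text)


# Hypothetical PRQL snippets, used only for illustration.
query = "from `file1.csv`"
tricky = "from `it's $f.csv`"  # backticks, a dollar sign and a single quote

for q in (query, tricky):
    quoted = shell_literal(q)
    # shlex.split parses the quoted form back using POSIX shell quoting rules.
    assert shlex.split(quoted) == [q]
    print(quoted)
```

Single quotes protect both the backticks and the $; an embedded single quote is the one character that needs the close-and-reopen treatment.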

In short, this is a matter of how to write shell scripts, and the PRQL syntax should not be changed for this reason alone, I think.

(Again, would we also change the use of $? Even jq, a very popular command line tool, uses $.)

And, backtick substitution is generally not a recommended way of writing.
Please check the ShellCheck rule, for example. https://www.shellcheck.net/wiki/SC2006

snth Sep 8, 2023
Maintainer Author

Thanks @eitsupi but I don't quite understand your comment.

What I want is the following:

❯ for f in file{1..2}.csv; do prql "from \`$f\`"; done
SELECT
  *
FROM
  "file1.csv"

-- Generated by PRQL compiler version:0.9.4 (https://prql-lang.org)
SELECT
  *
FROM
  "file2.csv"

-- Generated by PRQL compiler version:0.9.4 (https://prql-lang.org)

Single quotes won't work because that gives you:

❯ for f in file{1..2}.csv; do prql 'from `$f`'; done
SELECT
  *
FROM
  $ f

-- Generated by PRQL compiler version:0.9.4 (https://prql-lang.org)
SELECT
  *
FROM
  $ f

-- Generated by PRQL compiler version:0.9.4 (https://prql-lang.org)

I was going to say that the backticks are required because otherwise DuckDB doesn't apply the read_csv_auto correctly, but to my surprise the following worked so maybe something changed?

❯ for f in file{1..2}.csv; do prql "from $f" | duckdb; done
┌───────┬───────┐
│   a   │   b   │
│ int64 │ int64 │
├───────┼───────┤
│     1 │     1 │
└───────┴───────┘
┌───────┬───────┐
│   a   │   b   │
│ int64 │ int64 │
├───────┼───────┤
│     1 │     2 │
└───────┴───────┘

snth Sep 8, 2023
Maintainer Author

> And, backtick substitution is generally not a recommended way of writing.

I agree with that. I usually use $(...). I used backticks in that tweet because it's one character less and I kept hitting the character limit.

The point though is that even if I don't want to use the backticks in bash, they get interpreted anyway, at least by default. Is there a way to turn that off?

max-sixty Sep 8, 2023
Maintainer

How about this approach? (not saying it's elegant, but is it correct?)

$ for f in file1.csv file2.csv file3.csv; do echo 'from `'$f'`'; done
from `file1.csv`
from `file2.csv`
from `file3.csv`

eitsupi Sep 8, 2023
Maintainer

> The point though is that even if I don't want to use the backticks in bash, they get interpreted anyway, at least by default. Is there a way to turn that off?

My understanding is that the rule is to always use single quotes for strings passed to another language.

I am convinced that it would be less user-friendly to make double and single quotes non-interchangeable here (as @max-sixty says in #3483 (comment)).
This is because single quotes have a special meaning inside a single-quoted string.

max-sixty · 2023-09-07T23:02:14Z

max-sixty
Sep 7, 2023
Maintainer

That seems like the parser is going to have to be very loose to allow that — it's not going to know whether foo in [bar] is referring to an array [bar] or a column bar until late in the compilation process, unless we specifically disallow arrays containing a single column.

We don't use arrays much yet, and I guess there aren't many times we have an array of columns, so I agree it's possible to shoehorn at the moment.

But mixing syntax like this arguably makes a language quite difficult to understand for people as well as the compiler. "What do brackets do?" "They're for arrays, unless there's a single item, in which case they can escape column names." I would argue that's not simple, for a very basic question.

I do agree backticks have a disadvantage in shells or markdown. But if it really were a big issue, I would much sooner repurpose single or double quotes than make ambiguity with arrays.
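
To make the parsing concern above concrete, here is a toy classifier in Python. It is not PRQL's parser and every rule in it is an assumption made for illustration: brackets can always be read as an array literal, and they can additionally be read as a quoted name whenever they contain only bare words. The function name possible_readings is hypothetical.

```python
import re

# Toy rule: a "quoted name" is one or more bare words, with no commas or quotes.
BARE_WORDS = re.compile(r"\w+(?:\s+\w+)*")


def possible_readings(source: str) -> list:
    """List the ways a bracketed snippet could be read under the proposal."""
    text = source.strip()
    if not (text.startswith("[") and text.endswith("]")):
        raise ValueError("expected a bracketed snippet")
    inner = text[1:-1].strip()
    readings = ["array"]  # assumption: brackets can always be an array literal
    if BARE_WORDS.fullmatch(inner):
        readings.append("quoted name")
    return readings


for snippet in ["[bar]", "[func arg]", '["my string"]', "[1, 2, 3]", "[bar,]"]:
    print(snippet, "->", possible_readings(snippet))
```

Under these toy rules, [bar] and [func arg] each have two readings, while a string literal, a comma-separated list, or a trailing comma leaves only the array reading.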

8 replies

max-sixty Sep 8, 2023
Maintainer

It is very easy and almost no problem to have backquotes present in a sentence in Markdown. Like `.

Just to ensure I'm presenting my full view even though I disagree with the proposal — I still see having backticks in a language as imperfect, because when pasting a line into markdown, it can mess up the quoting.

For example, if we "open backtick, paste, close backtick" on this query:

from `my tracks` | select `my artist`

Then we get:

Hey I have a query from my tracks| selectmy artist`` and I'm getting this result...

max-sixty Sep 8, 2023
Maintainer

...so if we feel strongly about this, we could use single-quotes to escape, and double quotes for literals. I would be -0.1 on making that change, and so fairly open to it

snth Sep 8, 2023
Maintainer Author

I'm also claiming that even if something is possible to parse, that's not sufficient to make it good — it should parse simply without much context.

I agree with that. I was suggesting though that the bracket notation might be more than just a syntactic convenience or a lack of alternatives, but actually rather an expression of a deeper equivalence in the data. I will address that in a separate thread because it's a bigger topic. First let me address the potential ambiguity in syntax, because that's potentially simpler.

In order to be able to express single element arrays with the proposed syntax, you could have something similar to how Python treats parentheses and tuples, in that a trailing comma is always required for a single element tuple, e.g.

>>> (5) == 5
True
>>> (5,) == tuple([5])
True

So in PRQL we could have:

let five=5

from tbl
select { value_col=[my col], array_col=[five,] }

I just want to point out that if tbl has a column named five then the ambiguity here has nothing to do with arrays and exists already in:

let five=5

from [{five=5}]
select { ten=2*five, doubled_five=2*this.five }

This currently throws the following error in the Playground:

Error: 
   ╭─[:4:16]
   │
 4 │ select { ten=2*five, doubled_five=2*this.five }
   │                ──┬─  
   │                  ╰─── Ambiguous name
   │ 
   │ Help: could be any of: five, this._literal_194.five
───╯

I'll address the possibly more interesting semantic point in a separate subthread.

eitsupi Sep 8, 2023
Maintainer

Just to ensure I'm presenting my full view even though I disagree with the proposal — I still see having backticks in a language as imperfect, because when pasting a line into markdown, it can mess up the quoting.

Isn't it simply a lack of knowledge of Markdown grammar and anyone who sees the rendered result can fix it right away?

For example, if we want to show a Markdown fenced code block inside a code block, we need to surround it with a greater number of backticks. Markdown code blocks may not display well when pasted into Markdown either, so is the Markdown syntax itself not ideal?

Correct one:

````md
```sh
echo "hello"
```

This is a code block.
````

Rendered as:

```sh
echo "hello"
```

This is a code block.

Mistakes we see often:

```md
```sh
echo "hello"
```

This is a code block.
```

Rendered as:

```sh
echo "hello"

This is a code block.

max-sixty Sep 8, 2023
Maintainer

Yes, it's just less friendly...

eitsupi · 2023-09-08T10:51:21Z

eitsupi
Sep 8, 2023
Maintainer

Is this related to the question "A scalar is equivalent to an array of length 1?"
I think I saw a similar issue with a conversion between R where there is no scalar and JSON where there is a scalar.

2 replies

max-sixty Sep 8, 2023
Maintainer

V interesting!

It is related in a way — if scalars and one-item-arrays were equivalent, then the problems with the proposal maybe mostly go away?

Could they be equivalent? It's a nice property of array languages, but it can become difficult: a function like in can no longer depend on the type of its argument. For example, we couldn't have both "ab" in "abc" (a substring test) and "ab" in ["aa", "ab", "ba"] (a membership test), because "ab" in ["abc"] would then be ambiguous...
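
Python happens to overload in by type in exactly this way, which makes the conflict easy to see. A small illustration (this shows Python's semantics, not PRQL's, and contains is just a hypothetical wrapper name):

```python
def contains(needle, haystack):
    """Python's `in`: substring test for a str, membership test for a list."""
    return needle in haystack


print(contains("ab", "abc"))               # True: substring test
print(contains("ab", ["aa", "ab", "ba"]))  # True: membership test
print(contains("ab", ["abc"]))             # False: membership in a one-item list
```

If "abc" and ["abc"] were the same value, the first and third calls would have to agree, and they do not.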

(I also think this is a big question, and even if it were true, I would still try to find a different way of achieving this particular goal)

aljazerzen Sep 10, 2023
Maintainer

I have a strong opinion on this matter: they should not be equivalent.

That's because in some cases you cannot statically confirm that an array will have a single element.
Thus you cannot statically determine the type of the variable, which breaks a lot of current semantics.

This equivalence might solve a few problems, but I think it introduces a few, which are much harder to solve, or even provably impossible.

snth · 2023-09-08T21:59:51Z

snth
Sep 8, 2023
Maintainer Author

Tables can be thought of as lists of tuples/records (the traditional OLTP model) or as tuples of arrays/columns (the OLAP model).

Take for example from_text "a,b\n1,2", in Pandas we could construct this as:

>>> d1 = pd.DataFrame.from_records( [ {'a':1, 'b':2 } ] )
>>> d2 = pd.DataFrame.from_dict( { 'a':[1], 'b':[2] } )
>>> d1 == d2
      a     b
0  True  True
>>> d1
   a  b
0  1  2

Hopefully you can already see hints of where I want to go with this. If we ignore the constructor methods and just look at the arguments, we can see that we roughly have

[ {'a':1, 'b':2 } ] ~ { 'a':[1], 'b':[2] }
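
That rough equivalence can be spelled out without pandas. A minimal sketch in plain Python (the helper names are made up for illustration, and it assumes every record has the same keys) that converts between the two layouts:

```python
def records_to_columns(records):
    """Turn a list of row dicts into a dict of column lists."""
    columns = {}
    for row in records:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


def columns_to_records(columns):
    """Turn a dict of equal-length column lists into a list of row dicts."""
    names = list(columns)
    return [dict(zip(names, values)) for values in zip(*columns.values())]


rows = [{"a": 1, "b": 2}]
cols = records_to_columns(rows)
print(cols)                       # {'a': [1], 'b': [2]}
print(columns_to_records(cols))   # [{'a': 1, 'b': 2}]
```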

In PRQL we already have the array literal syntax, so you can do

from [{a=1, b=2}]
select {a, b}       # or select {`a`, `b`}

With the proposal above, you could instead also write

from [{a=1, b=2}]
select {[a], [b]}

My larger point is that this actually clearly communicates the structure of the data as a tuple of column arrays. I think this is quite elegant!

I think this also ties in with the discussion with @aljazerzen in #2723 around how to think of the arguments to aggregation and window functions, especially this comment #2723 (comment).

I think the proposed notation makes some of these notions actually much clearer:

let tax_rate = 0.15

from invoices
derive [tax amount] = tax_rate * [invoice amount]
aggregate { 
    [invoice total] = sum [invoice amount],
    [tax total] = sum [tax amount],
}

Let's look at the following line in detail to explain how I suggest one should think about this:

derive [tax amount] = tax_rate * [invoice amount]

I read that as the [tax amount] column array is the product of the [invoice amount] column array with the tax_rate. tax_rate happens to be a scalar so standard numpy-like broadcasting rules apply. I'm not suggesting that we implement the numpy-like broadcasting rules (in fact there shouldn't be any implementation changes other than the quoting behaviour), rather I'm saying that SQL behaves like this already and the above syntax just makes this clearer. tax_rate is a constant that gets broadcast down the whole column while [invoice amount] is a column array that has a different value in each row.
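
As a rough illustration of that reading, here is a sketch with plain Python lists standing in for columns. The invoice values are made up, broadcast_mul is a hypothetical helper name, and this is not how PRQL or SQL implement anything; it only mirrors the scalar-times-column interpretation.

```python
def broadcast_mul(scalar, column):
    """Multiply every element of `column` by the same scalar."""
    return [scalar * value for value in column]


tax_rate = 0.15                          # a scalar constant
invoice_amount = [100.0, 250.0, 40.0]    # made-up column values

# derive: one output value per row, the scalar is reused for each row
tax_amount = broadcast_mul(tax_rate, invoice_amount)

# aggregate: each column array collapses to a single value
invoice_total = sum(invoice_amount)
tax_total = sum(tax_amount)

print(len(tax_amount), invoice_total)
```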

After seeing this invoices example I'm becoming even more convinced. I think this looks really neat! WDYT?

EDIT: I changed [tax rate] to tax_rate from an earlier version because that was introducing an unnecessary complication orthogonal to the proposal.

4 replies

max-sixty Sep 8, 2023
Maintainer

How would we distinguish between func arg and a column named func arg?

I really don't get the inclination to use brackets here! They generally mean something completely different. We can change literal strings to only use one sort of quote while using the other for column quoting if we don't want to use backticks for that.

(I also think that the "are scalars the same as single-item arrays" is a v interesting question, though even if they were, would we use brackets for this? I think that question might be easier to answer than the scalar/array question...)

snth Sep 9, 2023
Maintainer Author

> How would we distinguish between func arg and a column named func arg?

Does the following help?

derive {
    func_arg_col = [func arg],
    func_arg_eval = func arg,
    str_arr_col = ['a', 'b', 'c'],
    int_arr_col = [1, 2, 3],
    arr_col = [col1, [col2], func_arg_eval==func arg, func_arg_col==[func arg]],
    # the last two elements of the array in the column above are always true (NULLs aside),
    single_elem_str_arr_col = ['a',],
    single_elem_int_arr_col = [1,],
    col_named_1_duplicate = [1],
    always_true_column = [1] == col_named_1_duplicate,
}

max-sixty Sep 9, 2023
Maintainer

(deleted previous question)

It is possible to do the Python tuple thing and force commas for single-item lists. But it's awkward, and it removes the "trailing commas are optional" quality that PRQL currently has...

aljazerzen Sep 10, 2023
Maintainer

I think this syntax would make the problem from #2723 worse.

When I first saw this, I was reading it as:

derive [tax amount] = tax_rate * [invoice amount]

derive an array of values named "tax amount" as tax_rate value times an array of values named "invoice amount"

This is ok, but this is not:

aggregate { 
    [invoice total] = sum amount,
}

aggregate an array of values named "invoice total" as the sum of value amount

Problem: the types don't add up. sum should take an array and return a single value, but here it reads just the other way around.

aljazerzen · 2023-09-10T18:58:45Z

aljazerzen
Sep 10, 2023
Maintainer

To put it bluntly: I'm not in favor of this proposal, -1.

We already have quite a few ways to quote strings.
Single, double, raw, multiple single, multiple double.
I was actually thinking about removing single quotes, to adhere to the "one way of doing things" principle.
I do see the need for this feature when embedding PRQL into Bash strings.
But if we add lexical features for Bash,
it might also make sense to add a lexical feature that would ease embedding into Java or Python.
This is a classical "slippery slope" argument, but I really feel this way;
Overloading a language so it can be easily embedded into many other languages will yield a bloated language.
As Max points out, this might be possible to parse, but at what cost.
If I were implementing this, it would not be at the parser level, but in a pass immediately after.

0 replies
