Support LuaMetaTeX #436

Open
1 of 2 tasks
andreiborisov opened this issue Apr 21, 2024 · 7 comments · May be fixed by #551
Labels
context Related to the ConTeXt interface and implementation feature request
Comments

andreiborisov commented Apr 21, 2024

As discussed in #402, there are a couple of blockers to supporting LuaMetaTeX:

  • expl3 no longer works with LuaMetaTeX: latex3/latex3#1518
  • The Selene Unicode package is unavailable in LuaMetaTeX

Witiko commented Nov 21, 2024

Pull request #529 has removed the files markdownthemewitiko_tilde.tex, markdownthemewitiko_dot.sty, and markdownthemewitiko_graphicx_http.sty from the Markdown package; their code is now inlined in the files markdown.tex and markdown.sty. This should simplify the distribution and installation of the Markdown package in ConTeXt standalone.


Witiko commented Jan 20, 2025

As discussed in #402, there are a couple of blockers to supporting LuaMetaTeX:

@andreiborisov: LuaMetaTeX should now be compatible with expl3: latex3/latex3#1518.

andreiborisov commented:

@andreiborisov: LuaMetaTeX should now be compatible with expl3: latex3/latex3#1518.

Wow, this is fantastic news!

Can we do something with Selene Unicode too?


Witiko commented Jan 24, 2025

Can we do something with Selene Unicode too?

There are three functions from Selene Unicode that we use: unicode.utf8.char(), lower(), and match().

$ grep -F unicode.utf8 markdown.dtx
  return unicode.utf8.char(n)
  return unicode.utf8.char(n)
  return unicode.utf8.char(n)
      table.insert(char_table, unicode.utf8.char(code_point))
        local is_letter = unicode.utf8.match(char, "%a")
      if not unicode.utf8.match(char, "[%w_%-%.%s]") then
      if unicode.utf8.match(char, "[%s\n]") then
        char = unicode.utf8.lower(char)
        local is_letter = unicode.utf8.match(char, "%S")
      if not unicode.utf8.match(char, "[%w_%-%s]") then
      if unicode.utf8.match(char, "[%s\n]") then
        char = unicode.utf8.lower(char)
      local code = unicode.utf8.char(tonumber(codepoint, 16))
        if (c ~= nil) and (unicode.utf8.match(c, chartype)) then

The function unicode.utf8.char() encodes a Unicode codepoint in UTF8. We can easily replace it with the built-in function utf8.char() that is also available in LuaMetaTeX.
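As a minimal sketch of this replacement (the wrapper name below is mine, not the package's):

```lua
-- The built-in `utf8` library of Lua >= 5.3 provides `utf8.char()`,
-- which encodes a Unicode code point as a UTF8 string, just like
-- Selene Unicode's `unicode.utf8.char()`.
local function encode_codepoint(n)  -- hypothetical wrapper name
  return utf8.char(n)
end

print(encode_codepoint(0x00E9))  -- U+00E9, encoded as the two bytes C3 A9
```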

The function unicode.utf8.lower() lower-cases a UTF8-encoded string and can be replaced by the method uni_algos.case.casefold() from the Lua module lua-uni-algos, which we already use. Unlike Selene Unicode, lua-uni-algos is written in Lua, not in C, and can be installed together with the Markdown package without extending LuaMetaTeX itself.

The function unicode.utf8.match() matches Lua string patterns against a UTF8-encoded string and has no drop-in replacement. Currently, the file markdown-unicode-data.lua contains LPEG parsers for Unicode punctuation, generated from the file UnicodeData.txt. We would need to generate parsers for the other classes of Unicode characters that we wish to recognize.
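To illustrate the shape of such a replacement, here is a hypothetical, hand-written stand-in for a generated parser. The real parsers in markdown-unicode-data.lua are generated from UnicodeData.txt; the code point set below is deliberately incomplete and the function names are mine:

```lua
-- Hypothetical sketch: recognizing a Unicode character class by code
-- point, using only the built-in `utf8` library. In the package, a
-- generated table (or LPEG pattern) would replace this hard-coded set.
local whitespace = {
  [0x0009] = true, [0x000A] = true, [0x000C] = true, [0x000D] = true,
  [0x0020] = true, [0x00A0] = true, [0x2000] = true, [0x2028] = true,
  [0x3000] = true,  -- further Unicode whitespace code points would be generated
}

-- Check whether the character starting at byte index i of s is whitespace.
local function is_unicode_whitespace(s, i)
  -- pcall guards against i pointing past the end of s or into the
  -- middle of a multi-byte character, where utf8.codepoint() errors.
  local ok, code_point = pcall(utf8.codepoint, s, i)
  return (ok and whitespace[code_point]) or false
end
```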


I can replace the functions unicode.utf8.char() and lower() within an hour. Replacing the function unicode.utf8.match() seems more difficult and will take me up to ~4 hours. I won't have time in January, but I will schedule it for February.

@Witiko Witiko added this to the 3.11.0 milestone Jan 24, 2025
@Witiko Witiko linked a pull request Feb 10, 2025 that will close this issue
4 tasks

Witiko commented Feb 10, 2025

@lostenderman: In #551, I am rewriting parts of the code that work with Unicode, and I am taking this as an opportunity to better understand some parts of the code related to CommonMark. Currently, I am looking at the function check_unicode_type(s, i, start_pos, end_pos, chartype), and I think there is either an undiscovered bug there or I misunderstand the code. Either way, I would appreciate your feedback.

When we use check_preceding_unicode_whitespace(s, i), then that expands to check_unicode_type(s, i, -4, -1, "%s") and the values of char_length will be 4, 3, 2, 1. This makes sense if we want to check whether there is a whitespace character ending just before the index i:

                | i - 4 | i - 3 | i - 2 | i - 1 |  i   |
POSSIBILITY #1  | XXXXX | XXXXX | XXXXX | XXXXX |      |
POSSIBILITY #2  |       | XXXXX | XXXXX | XXXXX |      |
POSSIBILITY #3  |       |       | XXXXX | XXXXX |      |
POSSIBILITY #4  |       |       |       | XXXXX |      |

Fig 1: How `check_preceding_unicode_whitespace(s, i)` currently works.
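The Fig 1 behavior can be sketched in pure Lua as follows. This is not the package's actual code; the whitespace classifier is a deliberately minimal stand-in, and the function names are mine:

```lua
-- Sketch of Fig 1: scan candidate lengths 4, 3, 2, 1 and test whether
-- the bytes s[i - len .. i - 1] form a single UTF8-encoded whitespace
-- character ending just before byte index i.
local function is_whitespace_codepoint(code_point)
  -- minimal stand-in for a real Unicode whitespace classifier
  return code_point == 0x20 or code_point == 0x09 or code_point == 0x0A
      or code_point == 0x00A0 or code_point == 0x3000
end

local function check_preceding_whitespace(s, i)
  for len = 4, 1, -1 do
    local from = i - len
    if from >= 1 then
      -- pcall guards against `from` landing mid-character
      local ok, code_point = pcall(utf8.codepoint, s, from)
      -- accept only if the character at `from` spans exactly `len`
      -- bytes, i.e. it ends just before index i
      if ok and utf8.char(code_point) == s:sub(from, i - 1) then
        return is_whitespace_codepoint(code_point)
      end
    end
  end
  return false
end
```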

However, when we use check_following_unicode_whitespace(s, i), then that expands to check_unicode_type(s, i, 0, 3, "%s") and the values of char_length will be 1, 2, 3, 4. This makes much less sense to me, since the checked whitespace characters are not aligned on any index:

                |   i   | i + 1 | i + 2 | i + 3 | i + 4 | i + 5 | i + 6 |
POSSIBILITY #1  | XXXXX |       |       |       |       |       |       |
POSSIBILITY #2  |       | XXXXX | XXXXX |       |       |       |       |
POSSIBILITY #3  |       |       | XXXXX | XXXXX | XXXXX |       |       |
POSSIBILITY #4  |       |       |       | XXXXX | XXXXX | XXXXX | XXXXX |

Fig 2: How `check_following_unicode_whitespace(s, i)` currently works.

I assume we want to either 1) have the values of char_length be 4, 3, 2, 1, which would be achieved by using char_length = end_pos - pos + 1. This makes sense if we want to check whether there is a whitespace character ending just before the index i + 4:

                |   i   | i + 1 | i + 2 | i + 3 | i + 4 |
POSSIBILITY #1  | XXXXX | XXXXX | XXXXX | XXXXX |       |
POSSIBILITY #2  |       | XXXXX | XXXXX | XXXXX |       |
POSSIBILITY #3  |       |       | XXXXX | XXXXX |       |
POSSIBILITY #4  |       |       |       | XXXXX |       |

Fig 3: How I think `check_following_unicode_whitespace(s, i)` should work (option 1).

However, this seems to be robust against i being in the middle of another Unicode character in a way that check_preceding_unicode_whitespace(s, i) isn't, so I am skeptical that this is what we want.

Alternatively, we may want to 2) expand to 1, 2, 3, 4 but not change the starting index in the call to lpeg.match() during the iteration. This makes sense if we want to check whether there is a whitespace character starting at the index i:

                |   i   | i + 1 | i + 2 | i + 3 | i + 4 |
POSSIBILITY #1  | XXXXX |       |       |       |       |
POSSIBILITY #2  | XXXXX | XXXXX |       |       |       |
POSSIBILITY #3  | XXXXX | XXXXX | XXXXX |       |       |
POSSIBILITY #4  | XXXXX | XXXXX | XXXXX | XXXXX |       |

Fig 4: How I think `check_following_unicode_whitespace(s, i)` should work (option 2).

This seems much more likely to me. In this case, there seems to be little reason for using a for loop. Instead, we could collapse the check into a single PEG pattern that would check for the presence of a whitespace character of any UTF8-encoded length.
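As a pure-Lua sketch of the Fig 4 semantics (in the package itself, an LPEG pattern over generated Unicode data would play this role; the whitespace set and function name below are my deliberately minimal stand-ins):

```lua
-- Sketch of option 2 / Fig 4: instead of iterating over candidate
-- lengths, decode the single character that starts at byte index i
-- and classify it.
local function check_following_whitespace(s, i)
  -- pcall guards against i pointing past the end of s or into the
  -- middle of a multi-byte character
  local ok, code_point = pcall(utf8.codepoint, s, i)
  if not ok then return false end
  return code_point == 0x20 or code_point == 0x09 or code_point == 0x0A
      or code_point == 0x00A0 or code_point == 0x3000
end
```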

Can you please help out, @lostenderman? Which one is the expected behavior? Fig 2, 3, 4, or something else?

lostenderman commented:

Hmm, Fig 2 indeed does not look quite right. Fig 4 looks more like what I would expect the function to check.

I am surprised none of my tests caught this at the time.


Witiko commented Feb 17, 2025

Fig 4 looks more like what I would expect the function to check.

Thanks for the feedback, Fig 4 was my hunch too! I suppose I can then replace all the check_following_...() functions with a simple PEG pattern that recognizes a Unicode character of any UTF8-encoded length. This will be faster, and we also need such patterns elsewhere for checking character categories, so that we can drop the dependency on Selene Unicode's unicode.utf8.match().

I am surprised none of my tests caught this at the time.

I will probably start off by creating a test file that breaks the current code. The breakage should show up in the handling of left-flanking and right-flanking delimiter runs; see also the code.
