Support LuaMetaTeX #436

Open
1 of 2 tasks
andreiborisov opened this issue Apr 21, 2024 · 7 comments · May be fixed by #551
Labels
context Related to the ConTeXt interface and implementation feature request
Comments

andreiborisov commented Apr 21, 2024

As discussed in #402, there are a couple of blockers to supporting LuaMetaTeX:

  • expl3 no longer works with LuaMetaTeX: latex3/latex3#1518
  • The Selene Unicode package is unavailable in LuaMetaTeX

Witiko commented Nov 21, 2024

Pull request #529 has removed the files markdownthemewitiko_tilde.tex, markdownthemewitiko_dot.sty, and markdownthemewitiko_graphicx_http.sty from the Markdown package; their code is now inlined in the files markdown.tex and markdown.sty. This should simplify the distribution and installation of the Markdown package in ConTeXt standalone.


Witiko commented Jan 20, 2025

As discussed in #402, there are a couple of blockers to supporting LuaMetaTeX:

@andreiborisov: LuaMetaTeX should now be compatible with expl3: latex3/latex3#1518.

andreiborisov commented:

@andreiborisov: LuaMetaTeX should now be compatible with expl3: latex3/latex3#1518.

Wow, this is fantastic news!

Can we do something with Selene Unicode too?


Witiko commented Jan 24, 2025

Can we do something with Selene Unicode too?

There are three functions from Selene Unicode that we use: unicode.utf8.char(), lower(), and match().

$ grep -F unicode.utf8 markdown.dtx
  return unicode.utf8.char(n)
  return unicode.utf8.char(n)
  return unicode.utf8.char(n)
      table.insert(char_table, unicode.utf8.char(code_point))
        local is_letter = unicode.utf8.match(char, "%a")
      if not unicode.utf8.match(char, "[%w_%-%.%s]") then
      if unicode.utf8.match(char, "[%s\n]") then
        char = unicode.utf8.lower(char)
        local is_letter = unicode.utf8.match(char, "%S")
      if not unicode.utf8.match(char, "[%w_%-%s]") then
      if unicode.utf8.match(char, "[%s\n]") then
        char = unicode.utf8.lower(char)
      local code = unicode.utf8.char(tonumber(codepoint, 16))
        if (c ~= nil) and (unicode.utf8.match(c, chartype)) then

The function unicode.utf8.char() encodes a Unicode codepoint in UTF8. We can easily replace it with the built-in function utf8.char() that is also available in LuaMetaTeX.
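As a minimal sketch of this replacement (the wrapper name below is mine, not the package's):

```lua
-- The built-in `utf8` library of Lua >= 5.3 provides `utf8.char()`,
-- which encodes a Unicode code point as a UTF8 string, just like
-- Selene Unicode's `unicode.utf8.char()`.
local function encode_codepoint(n)  -- hypothetical wrapper name
  return utf8.char(n)
end

print(encode_codepoint(0x00E9))  -- U+00E9, encoded as the two bytes C3 A9
```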

The function unicode.utf8.lower() lower-cases a UTF8-encoded string and can be replaced by the method uni_algos.case.casefold() from the Lua module lua-uni-algos, which we already use. Unlike Selene Unicode, lua-uni-algos is written in Lua, not in C, and can be installed together with the Markdown package without extending LuaMetaTeX itself.

The function unicode.utf8.match() matches Lua string patterns against a UTF8-encoded string and has no drop-in replacement. Currently, the file markdown-unicode-data.lua contains LPEG parsers for Unicode punctuation, generated from the file UnicodeData.txt. We would need to generate parsers for the other classes of Unicode characters that we wish to recognize.
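To illustrate the shape of such a replacement, here is a hypothetical, hand-written stand-in for a generated parser. The real parsers in markdown-unicode-data.lua are generated from UnicodeData.txt; the code point set below is deliberately incomplete and the function names are mine:

```lua
-- Hypothetical sketch: recognizing a Unicode character class by code
-- point, using only the built-in `utf8` library. In the package, a
-- generated table (or LPEG pattern) would replace this hard-coded set.
local whitespace = {
  [0x0009] = true, [0x000A] = true, [0x000C] = true, [0x000D] = true,
  [0x0020] = true, [0x00A0] = true, [0x2000] = true, [0x2028] = true,
  [0x3000] = true,  -- further Unicode whitespace code points would be generated
}

-- Check whether the character starting at byte index i of s is whitespace.
local function is_unicode_whitespace(s, i)
  -- pcall guards against i pointing past the end of s or into the
  -- middle of a multi-byte character, where utf8.codepoint() errors.
  local ok, code_point = pcall(utf8.codepoint, s, i)
  return (ok and whitespace[code_point]) or false
end
```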


I can replace the functions unicode.utf8.char() and lower() within an hour. Replacing the function unicode.utf8.match() seems more difficult and will take me up to ~4 hours. I won't have time in January, but I will schedule it for February.

@Witiko Witiko added this to the 3.11.0 milestone Jan 24, 2025
@Witiko Witiko linked a pull request Feb 10, 2025 that will close this issue
4 tasks

Witiko commented Feb 10, 2025

@lostenderman: In #551, I am rewriting parts of the code that work with Unicode, and I am taking this as an opportunity to better understand some parts of the code related to CommonMark. Currently, I am looking at the function check_unicode_type(s, i, start_pos, end_pos, chartype), and I think there is either an undiscovered bug there or I misunderstand the code. Either way, I would appreciate your feedback.

When we use check_preceding_unicode_whitespace(s, i), then that expands to check_unicode_type(s, i, -4, -1, "%s") and the values of char_length will be 4, 3, 2, 1. This makes sense if we want to check whether there is a whitespace character ending just before the index i:

                | i - 4 | i - 3 | i - 2 | i - 1 |  i   |
POSSIBILITY #1  | XXXXX | XXXXX | XXXXX | XXXXX |      |
POSSIBILITY #2  |       | XXXXX | XXXXX | XXXXX |      |
POSSIBILITY #3  |       |       | XXXXX | XXXXX |      |
POSSIBILITY #4  |       |       |       | XXXXX |      |

Fig 1: How `check_preceding_unicode_whitespace(s, i)` currently works.
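The Fig 1 behavior can be sketched in pure Lua as follows. This is not the package's actual code; the whitespace classifier is a deliberately minimal stand-in, and the function names are mine:

```lua
-- Sketch of Fig 1: scan candidate lengths 4, 3, 2, 1 and test whether
-- the bytes s[i - len .. i - 1] form a single UTF8-encoded whitespace
-- character ending just before byte index i.
local function is_whitespace_codepoint(code_point)
  -- minimal stand-in for a real Unicode whitespace classifier
  return code_point == 0x20 or code_point == 0x09 or code_point == 0x0A
      or code_point == 0x00A0 or code_point == 0x3000
end

local function check_preceding_whitespace(s, i)
  for len = 4, 1, -1 do
    local from = i - len
    if from >= 1 then
      -- pcall guards against `from` landing mid-character
      local ok, code_point = pcall(utf8.codepoint, s, from)
      -- accept only if the character at `from` spans exactly `len`
      -- bytes, i.e. it ends just before index i
      if ok and utf8.char(code_point) == s:sub(from, i - 1) then
        return is_whitespace_codepoint(code_point)
      end
    end
  end
  return false
end
```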

However, when we use check_following_unicode_whitespace(s, i), then that expands to check_unicode_type(s, i, 0, 3, "%s") and the values of char_length will be 1, 2, 3, 4. This makes much less sense to me, since the checked whitespace characters are not aligned on any index:

                |   i   | i + 1 | i + 2 | i + 3 | i + 4 | i + 5 | i + 6 |
POSSIBILITY #1  | XXXXX |       |       |       |       |       |       |
POSSIBILITY #2  |       | XXXXX | XXXXX |       |       |       |       |
POSSIBILITY #3  |       |       | XXXXX | XXXXX | XXXXX |       |       |
POSSIBILITY #4  |       |       |       | XXXXX | XXXXX | XXXXX | XXXXX |

Fig 2: How `check_following_unicode_whitespace(s, i)` currently works.

I assume we want to either 1) have the values of char_length be 4, 3, 2, 1, which would be achieved by using char_length = end_pos - pos + 1. This makes sense if we want to check whether there is a whitespace character ending just before the index i + 4:

                |   i   | i + 1 | i + 2 | i + 3 | i + 4 |
POSSIBILITY #1  | XXXXX | XXXXX | XXXXX | XXXXX |       |
POSSIBILITY #2  |       | XXXXX | XXXXX | XXXXX |       |
POSSIBILITY #3  |       |       | XXXXX | XXXXX |       |
POSSIBILITY #4  |       |       |       | XXXXX |       |

Fig 3: How I think `check_following_unicode_whitespace(s, i)` should work (option 1).

However, this seems to be robust against i being in the middle of another Unicode character in a way that check_preceding_unicode_whitespace(s, i) isn't, so I am skeptical that this is what we want.

Alternatively, we may want to 2) expand to 1, 2, 3, 4 but not change the starting index in the call to lpeg.match() during the iteration. This makes sense if we want to check whether there is a whitespace character starting at the index i:

                |   i   | i + 1 | i + 2 | i + 3 | i + 4 |
POSSIBILITY #1  | XXXXX |       |       |       |       |
POSSIBILITY #2  | XXXXX | XXXXX |       |       |       |
POSSIBILITY #3  | XXXXX | XXXXX | XXXXX |       |       |
POSSIBILITY #4  | XXXXX | XXXXX | XXXXX | XXXXX |       |

Fig 4: How I think `check_following_unicode_whitespace(s, i)` should work (option 2).

This seems much more likely to me. In this case, there seems to be little reason for using a for loop. Instead, we could collapse the check into a single PEG pattern that would check for the presence of a whitespace character of any UTF8-encoded length.
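As a pure-Lua sketch of the Fig 4 semantics (in the package itself, an LPEG pattern over generated Unicode data would play this role; the whitespace set and function name below are my deliberately minimal stand-ins):

```lua
-- Sketch of option 2 / Fig 4: instead of iterating over candidate
-- lengths, decode the single character that starts at byte index i
-- and classify it.
local function check_following_whitespace(s, i)
  -- pcall guards against i pointing past the end of s or into the
  -- middle of a multi-byte character
  local ok, code_point = pcall(utf8.codepoint, s, i)
  if not ok then return false end
  return code_point == 0x20 or code_point == 0x09 or code_point == 0x0A
      or code_point == 0x00A0 or code_point == 0x3000
end
```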

Can you please help out, @lostenderman? Which one is the expected behavior? Fig 2, 3, 4, or something else?

lostenderman commented:

Hmm, Fig 2 indeed does not look quite right. Fig 4 looks more like what I would expect the function to check.

I am surprised none of my tests caught this at the time.


Witiko commented Feb 17, 2025

Fig 4 looks more like what I would expect the function to check.

Thanks for the feedback, Fig 4 was my hunch too! I suppose I can then replace all the check_following_...() functions with a simple PEG pattern that recognizes a Unicode character of any UTF8-encoded length. This will be faster, and we also need such patterns elsewhere for checking character categories, so that we can drop the dependency on Selene Unicode's unicode.utf8.match().

I am surprised none of my tests caught this at the time.

I will probably start off by creating a test file that breaks the current code. The breakage should show up in the handling of left-flanking and right-flanking delimiter runs; see also the code.
