Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Special characters breaking project pages #1143

Closed
demenech opened this issue Apr 20, 2023 · 12 comments
Closed

Special characters breaking project pages #1143

demenech opened this issue Apr 20, 2023 · 12 comments
Assignees
Labels

Comments

@demenech
Copy link
Member

demenech commented Apr 20, 2023

We are facing some difficulties with READMEs that contain certain special characters, specially "<" and ">".

The issue is that our MDX parser interprets that as a special syntax and causes 500 errors on the project page. See the example on the notes.

To fix this issue, we could escape special characters on README, so instead of:

>

Have:

{">"}

The problems with this approach are that:

  • README becomes less readable when not being read on datahub-next
  • We would have to fix all individual situations in all the README files

A better approach would be to find a way to make our parser ignore errors like this OR automatically escape these characters since we don't really use this syntax anywhere.

Notes

Example:

Take a look at:

https://datahub.io/core/world-religion-projections

If we access this dataset on a dev environment, the following error can be found:

image

Note that what's breaking the page is this part of the README:

The data is sourced from [PEW RESEARCH CENTER ](https://www.pewforum.org/2015/04/02/religious-projection-table/2010/percent/all/). In original dataset the number and the percentage share of followers for some religions as "<10000" and "<10%". Because of technical limitations of the visualization tool, these values had to be changed into "10000".
@rufuspollock
Copy link
Member

This issue is still happening on https://datahub.io/core/world-religion-projections - i think we probably want a way to fix this.

@rufuspollock rufuspollock transferred this issue from another repository May 25, 2024
@github-project-automation github-project-automation bot moved this to 🆕 New in DataHub May 25, 2024
@rufuspollock rufuspollock moved this from 🆕 New to 📋 Backlog in DataHub Jun 27, 2024
@LuisVCSilva
Copy link

LuisVCSilva commented Jul 3, 2024

Shaping

Problem:

MDX parser interprets special characters "<" and ">" and alikes in README files as special syntax, causing 500 errors on the project page.

Appetite:

~ 1 week to implement a robust solution.

Rabbit-holes:

  • N/A

No Gos:

  • Manually fixing all individual instances of special characters in all README files.
  • Introducing a solution that significantly impacts the readability of README files outside of datahub-next.
  • Implementing a fix that is not scalable or maintainable in the long term.

Solution:

Write a regex to escape characters properly

@olayway
Copy link
Member

olayway commented Jul 3, 2024

@LuisVCSilva where is the "Solution" part? :)

The "Rabbit holes" should be any potential facets of the solution you're sketching that you're either not sure of or that have a potential of draining lots of resources on resolving them. Those can also be any questions that might have come up to you and you don't have answers to them yet.

No-goes are usually rabbit-holes that you can safely discard - you think they are not worth working on.

@LuisVCSilva
Copy link

@olayway Edits made, thanks!

@olayway
Copy link
Member

olayway commented Jul 5, 2024

~ 1 week to implement a robust solution.

It shouldn't take more than 1 day

@rufuspollock
Copy link
Member

@LuisVCSilva also IMO the solution should be worked out in much more detail b4 implementation here e.g. where in the code will you likely make these changes? Will you write tests? If so where ...

Is this just < and > that causes issues?

And what about things like => (common in obsidian for "implies") - will you replace just the > or will you replace the whole => with implies unicode character etc?

Do you actually need to fix this or do we just need to report to users what the error relates to so they can fix it (getting into regexes to fix stuff is a potentially complex business - or maybe it is very simple ...)

Have you done any research online e.g. googling about this issue.

PS: does this relate to #1206?

@rufuspollock rufuspollock moved this from 📋 Backlog to 🏗 In progress in DataHub Jul 9, 2024
@gradedSystem
Copy link
Member

gradedSystem commented Aug 8, 2024

Situation

Facing difficulties with READMEs that contain certain special characters, specially "<" and ">".

Problem

MDX throws an error whenever special characters like < or > appears in the .md file
It could also mean problem with other special characters that might also give us an MDX parsing error.

Solution

  • Currently content from README.md files are parsed through remark plugins in the lib/markdown.ts file, thus making sure that before passing the content we can replace special characters to an HTML entities such > = &gt; or < = %lt;.
  • It would also make sense to write unit-tests for all possible different cases, and it would also make sense replace maybe these ones as well:
    • ' = &apos;
    • & = &amp;
    • " = &quot;
  • Making sure everything documented somewhere how the parsing of the .md file works could be a good feature to add thoughts? @olayway

Appetite

Implementation should be pretty quick(15-30 mins), also writing different test cases might take some time to test(1-2h).

Rabbit-holes

No-goes

  • Hardcoding solution only for specific characters, meaning focusing only < or > this might be not a good approach

Appendix

cc @olayway @rufuspollock

@olayway
Copy link
Member

olayway commented Aug 9, 2024

@gradedSystem Ok, let's just do this. I think we're making it more complicated than it actually is at this point. For now let's just check >, <, =>, <=, <=> maybe.

Hint: we you want to escape these characters before this line: https://github.com/datopian/datahub-next/blob/34c53cdc9d9a1dbd96e6ce08f219c9e064e698b1/lib/markdown.ts#L30
Creating a small remark plugin will not work, as the error is thrown already by gray-matter.

Also, I advise starting by writing tests and coding the solution second.

@olayway olayway moved this from 🏗 In progress to 📋 Backlog in DataHub Aug 9, 2024
@gradedSystem
Copy link
Member

I just did the shaping and thought to include above too😅, ok noted will start working on tests before solution

@gradedSystem gradedSystem moved this from 📋 Backlog to 🏗 In progress in DataHub Aug 12, 2024
@olayway
Copy link
Member

olayway commented Aug 16, 2024

FIXED

@olayway olayway closed this as completed Aug 16, 2024
@github-project-automation github-project-automation bot moved this from 🏗 In progress to ✅ Done in DataHub Aug 16, 2024
@olayway olayway reopened this Aug 19, 2024
@olayway
Copy link
Member

olayway commented Aug 19, 2024

Reopening as we haven't thought about some cases and the current solution breaks e.g. callouts and mermaid diagrams.

@olayway
Copy link
Member

olayway commented Aug 19, 2024

WONTFIX

Closing as wont-fix as there are too many edge cases and supporting this would require quite a lot of work. For now any standalone > or < that are causing errors need to be escaped with expressions: {">"} .

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
Status: ✅ Done
Development

No branches or pull requests

5 participants