docs: publish as ebook/plain html/pdf or other formats #35

Open
gonewest818 opened this issue Apr 18, 2020 · 25 comments

@gonewest818

gonewest818 commented Apr 18, 2020

As discussed on discord, would the team please consider generating ebook versions of the documentation as an additional artifact of the site build?

I've tested manual conversion with pandoc which looks promising, but obviously the output needs tweaking and the process is not nice. e.g. I had to manually parse sidebar.json to get the correct ordering of the articles.

Whereas this might get closer to the right thing:
https://www.gatsbyjs.org/packages/gatsby-plugin-ebook/#gatsby-plugin-ebook

thanks all. stay safe-

UPDATE: Jump to #35

@jorgeorpinel

jorgeorpinel commented Apr 29, 2020

So. This need has recently surfaced again as a relatively easy way to start keeping an archive of versions of the docs that match different major DVC releases. So either a PDF eBook or a simple standalone static HTML website of dvc.org/doc would be ideal, if that's something we can achieve easily with Gatsby.

Thoughts @shcheklein @fabiosantoscode @iAdramelk ? Cc @dmpetrov and @rogermparent

Thanks!

@rogermparent
Contributor

rogermparent commented Apr 29, 2020

With the Models PR separating Doc nodes from others, something like this should be pretty painless to implement as long as there's a way to generate the required formats in Node.
A simple HTML output is obviously the easiest route because it can just be done as another page, but for non-HTML formats we can take the same approach gatsby-plugin-sitemap does and output a file from within the onPostBuild hook using data sourced from GraphQL.

There are also different ways such a page could be formatted: keeping the sidebar, using another more page-friendly form of index, or skipping the index altogether. I can also see the need for some slight schema changes to get every page accessible in sidebar order, but that wouldn't be a big deal for me to implement.

I'm going to look into gatsby-plugin-ebook to see if it suits our needs- it probably provides an easy way to use the onPostBuild approach.

@rogermparent rogermparent self-assigned this Apr 29, 2020
@fabiosantoscode

Since the website is already a set of static files, keeping an archive of HTML shouldn't be too hard to accomplish.

Remember that most of what's good for epub, is also good for PDF and print. A lot of it also applies to AMP. So we can deal with all of those at once if need be.

@fabiosantoscode

Of course, there's the matter of joining all the pages together, which depends on how epub works (I don't know anything about it!). But if we use a print-to-PDF tool, we can control page breaks with CSS and handle the rest (removing the sidebar and top bar) with print CSS.

@jorgeorpinel

jorgeorpinel commented Apr 30, 2020

Thanks for the answers guys, sounds promising! But I'm wondering how to keep this as simple as possible. We have all the content in Markdown, so in theory this should not be a tough problem. Let's not even force it to be done via Gatsby if it's too involved.

A simple HTML output is obviously the easiest route

Let's focus on this format for now. What I'm imagining is:

  • A special build process that produces an archive e.g. a ZIP or TAR file.
  • The archive contains a directory that basically matches the content/docs/ dir tree, but with .html files instead of .md.
  • The index.html file has the content of https://dvc.org/doc (docs home), and so on. You just open this or any file from file explorer to browse the docs archive (i.e. file:// protocol in browser).
  • All web pages use the same layout (and basic CSS) as the actual site, including the navigation sidebar, yes — it can be repeated in every single HTML or as an <iframe> or similar.
  • The nav bar is generated from sidebar.json (same as in the actual site).
  • In this kind of "build" (let's call it plain doc static site), there's no JS, so links are actual <a href=... tags to the other HTML files, so we need a way to resolve these paths.

Possible approaches

a) I haven't studied Gatsby yet unfortunately, but since it's an SSG these goals should pretty much be its most basic behavior anyway? Except we have so many other layers of complexity, pages, blog, etc... Is there a way to create a special build that ignores everything except the content/docs/ dir?

the website is already a set of static files...
...joining all the pages together, which depends on how epub works

@fabiosantoscode sounds like you're saying we basically already have the archive, but there are some server-side elements that don't let us just release current builds as a plain site, right? Again, there'll be no server (i.e. no gatsby serve) for plain site builds.

epub, is also good for PDF and print... So we can deal with all of those at once

You may be right, but let's keep this super simple and think only about HTML for now.

b) And if that's complicated, what about a completely custom script (not Gatsby) that uses some other tool to build the plain site explained above? I.e. a hand-written HTML layout (and CSS) and a Node.js script that uses some library to "compile" MD to HTML. It could be run manually or as part of the CI/CD (even list the archive as an asset in https://github.com/iterative/dvc.org/releases).

@jorgeorpinel jorgeorpinel changed the title publish documentation in ebook format(s) docs: publish as ebook/plain html/pdf or other formats Apr 30, 2020
@gonewest818
Author

For whatever it's worth, I really did mean epub or similar in my request. So I'm all for you getting your static html archive issue solved because it seems like a pressing need, but also if you could keep epub format in your sights I would appreciate it.

@jorgeorpinel

Gotcha. Yes, in a second iteration on this, formats like epub and pdf could be addressed. We'll keep this issue open. Thanks

@fabiosantoscode

@jorgeorpinel I'm 99% sure we can achieve what you want pretty easily with Gatsby.

We're probably going to need to post-process the resulting HTML to:

  • remove script tags
  • make URLs relative to the current file (you can't link to /foo/bar in file://)
  • make links to other pages point to actual HTML files (currently they point to pretty paths without the /index.html suffix).

To process HTML we can use something like cheerio which is easy to use.
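
The three post-processing steps listed above can be sketched without any dependencies. This is only an illustration in Python using the standard library (not cheerio, and not anything that exists in the repo): the function names are hypothetical, the regular expressions assume double-quoted `href`/`src` attributes as typically emitted by a static build, and any path whose last segment has no dot is assumed to be a "pretty" page URL.

```python
import posixpath
import re

# Assumption: scripts are plain <script>...</script> pairs in the built HTML.
SCRIPT_RE = re.compile(r"<script\b[^>]*>.*?</script\s*>", re.IGNORECASE | re.DOTALL)
# Site-absolute URLs only ("/foo"), not external ("https://") or "//cdn" ones.
ATTR_RE = re.compile(r'\b(href|src)="(/[^"/][^"]*|/)"')


def to_file_target(url_path: str) -> str:
    """Map a site-absolute URL path to a file inside the build directory."""
    path = url_path.lstrip("/")
    if "." not in posixpath.basename(path):
        # Pretty path such as /doc/install -> doc/install/index.html
        path = posixpath.join(path, "index.html")
    return path


def relativize(url: str, current_file: str) -> str:
    """Rewrite a site-absolute URL so it resolves from current_file over file://."""
    # Split off any "?query" or "#fragment" so it can be re-attached unchanged.
    path, suffix = re.match(r"([^?#]*)(.*)", url).groups()
    start = posixpath.dirname(current_file) or "."
    return posixpath.relpath(to_file_target(path), start) + suffix


def postprocess(html: str, current_file: str) -> str:
    """Strip script tags, then make every site-absolute href/src relative."""
    html = SCRIPT_RE.sub("", html)
    return ATTR_RE.sub(
        lambda m: f'{m.group(1)}="{relativize(m.group(2), current_file)}"', html
    )
```

Here `current_file` is the page's own path relative to the build root (for example `doc/index.html`), so a link to `/doc/install` on that page becomes `install/index.html`. A real implementation would more likely use an HTML parser than regular expressions.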

Besides the above, correct me if I'm wrong @rogermparent but I think we should be able to create a new createPages function which renders the same that's under /doc, but places it in a different prefix (say /doc-static) and adds some React context so the underlying components can render the subtle differences @jorgeorpinel wants us to have.

For local development, one can visit /doc-static in their browser and have hot reload while they adjust the differences we need. In production, this folder is removed from the build result, its HTML files post-processed and placed into a zip file which we can then host from anywhere we like.

@fabiosantoscode

This comment was marked as outdated.

@iAdramelk

@jorgeorpinel @fabiosantoscode @rogermparent Sorry for late answer. Technically generating either epub, pdf or single html file should be relatively easy. But we also have another problem to solve here: How to maintain and index docs from different versions of dvc.

If I understand the basic problem correctly, we don't just want to create a static version: we also want to store the generated versions of previous docs and probably have some way to access them in the site's UI. So we need to either store all versions in md explicitly and rebuild all of them every time, or have some way to save generated artifacts between builds and add their results to the next build.

There are different ways to solve this problem. The simplest is probably just to add a command to the CLI that generates the folder/file and to add it to git manually, but there are other, automated ways to do it.

@rogermparent
Contributor

rogermparent commented May 1, 2020

gatsby-plugin-ebook is a little outdated and doesn't have great customization options, but the logic is quite simple and it makes epubs so I'm going to make a quick fork that can be shaped to our current schema.
I don't have anything to show yet, I just wanted to confirm epub generation is at least possible. I also need to make it so docs are accessible in guaranteed sidebar order.

The version archive is a separate, but interesting issue. At first brush I'd suggest explicitly storing each version. The docs build much faster than the blog so I don't see adding more docs pages being a big issue. It's certainly something I'd like to solve from within Gatsby, but I'll have to hammer out the exact implementation later.

My primary goal right now is to get docs pages accessible in sidebar order through a custom resolver, as that is required before we do anything with alternate doc formats. Once I do that, stuffing the data into an epub generator is practically minutes away barring any unforeseen issues.

@jorgeorpinel

jorgeorpinel commented May 2, 2020

@iAdramelk

Technically generating either epub, pdf or single html file should be relatively easy.

Any hints on things to try for this effect would be greatly appreciated.

How to maintain and index docs from different versions of dvc

Git tags that match those in https://github.com/iterative/dvc/tags? The archives could be set up as artifacts in the GitHub release history, same as old versions of DVC itself.
It's a secondary problem, but important yes.

@iAdramelk

Any hints on things to try for this effect would be greatly appreciated.

@jorgeorpinel we have a similar task for our internal handbook. We solved it using pandoc. In the simplest case you can generate an epub or pdf from a folder of markdown files with one CLI command; see the examples at https://pandoc.org/demos.html or the ebook tutorial at https://pandoc.org/epub.html

We can add syntax highlighting, title page, our own css, etc.

But AFAIR it will not work with remark plugins for markdown and we still need to read file order and titles from sidebar.json.

If we want to use plugins and sidebar.json, we will need to generate a correct html page with all docs and meta and then convert it to pdf or epub with pandoc.

We can do it two ways:

  1. Use gatsby to build single static page with all docs.

PROS:

  • Fast, around 2-3h to create html.
  • All our installed and custom plugins for markdown would work out of the box.

CONS:

  • We have a lot of unneeded js in the html file so we can't just distribute it as is. We will need to remove unneeded stuff with rehype or some html sanitizer.
  • We don't really want ALL plugins to work. For example, our linker plugin that replaces dvc commands with links should be either updated to local urls or removed, etc.

  2. Write a simple Node.js script that generates a static html page using remark and simple string concatenation. Optionally use the webpack html template plugin to simplify things a little.

PROS:

  • We control everything about html and only include stuff we need, so we can also distribute html version as well as epub or pdf.
  • We can use remark plugins.

CONS:

  • Gatsby plugins for remark have a somewhat different wrapper from normal remark plugins. For our own plugins we can write a simple wrapper in 5-10 loc. But for plugins we use from npm we will need to look for remark analogues.

TLDR:

  • If we don't need plugins and only need sidebar.json for ordering: write a script that generates a list of pathnames from sidebar.json and send it to pandoc with a custom css path -> get epub or pdf.
  • If we want titles from sidebar.json and/or some simple remark plugins: write a simple Node.js script that generates static html and send it to pandoc -> get epub or pdf.
  • If we want to use some custom gatsby plugins for remark and can't find alternatives for them in remark -> generate html with gatsby, sanitize it, send it to pandoc -> get epub or pdf.
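
The first TLDR option (ordered pathnames from sidebar.json, handed to pandoc) might look roughly like the sketch below. It is written in Python for brevity and rests on stated assumptions: the sidebar schema (entries that are either a slug string or an object with `slug` and optional `children`), the rule that a parent maps to `<path>/index.md` and a leaf to `<path>.md`, and both function names are hypothetical and should be checked against the real sidebar.json. The only pandoc options used are `-o` and `--css`.

```python
from typing import Iterator, List, Optional, Union

Entry = Union[str, dict]


def ordered_sources(entries: List[Entry], prefix: str = "") -> Iterator[str]:
    """Yield Markdown paths depth-first, in sidebar order."""
    for entry in entries:
        if isinstance(entry, str):
            entry = {"slug": entry}
        path = f"{prefix}/{entry['slug']}" if prefix else entry["slug"]
        children = entry.get("children", [])
        # Assumed convention: parent -> <path>/index.md, leaf -> <path>.md
        yield f"{path}/index.md" if children else f"{path}.md"
        yield from ordered_sources(children, path)


def pandoc_command(sources: List[str], output: str, css: Optional[str] = None) -> List[str]:
    """Build an argv list suitable for subprocess.run(); nothing is executed here."""
    cmd = ["pandoc", *sources, "-o", output]
    if css:
        cmd.append(f"--css={css}")
    return cmd


# Made-up sidebar fragment, only to show the shape of the data.
sidebar = ["install", {"slug": "start", "children": ["data-versioning"]}]
sources = list(ordered_sources(sidebar))
command = pandoc_command(sources, "docs.epub", css="epub.css")
```

Passing `command` to `subprocess.run(command, check=True, cwd=<docs dir>)` would then produce the ebook, assuming pandoc is installed and the paths exist.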

@jorgeorpinel

jorgeorpinel commented May 5, 2020

Pandoc looks great if we go for a custom shell script that just generates HTML files from the Markdown ones, thanks for the tip! No React plugins or navigation.

Use gatsby to build single static page with all docs.
around 2-3h to create html

2-3 hours? Takes a few minutes 🙂

Yeah guys actually I ran build and looked at the HTML files in public/doc/ and they're basically almost ready for this! Images and links are all broken but they look perfect:

image

Other than the more complex processes outlined above by some of you, can we pass this directory through a custom script that will just fix links and images? I think most images are in public/img/


Also, for the record, I think the first PR for DVC 1.0 is in iterative/dvc.org#1215 (comment) so we could tag the commit before that as the last 0.9x in order to produce the last pre-1.0 docs archive.

p.s. 0.94.0 is out with all the optimizations and bug fixes so we would probably want to apply those to the pre 1.0 docs as well...

@rogermparent
Contributor

I assigned myself to this issue because I'm working with the sidebar, but if someone wants to try getting a working pandoc script going I'd be happy to let them take it.

I already have a branch that integrates sidebar.json into the GraphQL API and allows more control over the sidebar in general, but my main stopper now is how to take that data and turn it into an ebook. epub-gen from the Gatsby plugin accepts one layer of "chapters", but doesn't seem to support the kind of multi-level tree hierarchy we have.

I'm thinking I could use rehype to make each top-level category a chapter that contains all its children, then implement the custom HTML ToC option from epub-gen.

@jorgeorpinel

jorgeorpinel commented May 8, 2020

my main stopper now is how to take that data and turn it into an ebook. epub-gen from the Gatsby plugin accepts one layer of "chapters"

How about just a simple html web site like we mentioned @rogermparent? Would that be a better first step? It would cover our first use case here which is having an archive of older doc versions.

@rogermparent
Contributor

rogermparent commented May 8, 2020

@jorgeorpinel kind of, but it runs into the same problems. Epub is very close to HTML, after all.

Because of specifically how our content is structured hierarchically but written as individual pages, making any single page that contains a doc page with its children becomes an issue that requires some sort of transformation of the content to make nesting visible.

The simplest viable transform that I can think of in this regard is something that prepends a sort of chapter ID before the title of each page.
These IDs would be generated in a way that keeps both order and hierarchy, like the following example:

( { "intro": ["tut1", "tut2"] } => { "1: intro": ["1.1: tut1", "1.2: tut2"] })

(Not our actual content structure, but you get the point)
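
As a rough illustration of the title-prepend transform, the sketch below numbers a nested outline so that both order and hierarchy survive flattening. The data shape (a list whose items are either a title string or a one-key dict mapping a title to its children) mirrors the toy example above and is an assumption, as is the function name.

```python
from typing import Dict, List, Union

Node = Union[str, Dict[str, list]]


def number_nodes(nodes: List[Node], prefix: str = "") -> List[Node]:
    """Prepend a dotted chapter ID ("1", "1.2", ...) to every title."""
    numbered: List[Node] = []
    for index, node in enumerate(nodes, start=1):
        chapter = f"{prefix}.{index}" if prefix else str(index)
        if isinstance(node, str):
            numbered.append(f"{chapter}: {node}")
        else:
            # A one-key dict {title: children} stands for a page with child pages.
            title, children = next(iter(node.items()))
            numbered.append({f"{chapter}: {title}": number_nodes(children, chapter)})
    return numbered


outline = number_nodes([{"intro": ["tut1", "tut2"]}, "faq"])
```

Because the ID is derived from the position in the tree, the numbering stays stable as long as sidebar order does, and it nests to any depth.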

There are also other ways to go about it, like plugins for Rehype that shift all headings of a document by a level so child headings will "nest", but that opens up questions like "how deep do our docs go with headings without this?" and "what do we do with a heading that gets pushed past h6?".
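
For comparison, a heading shift can be sketched in a few lines. This version is plain Python with a regular expression rather than a Rehype plugin, the function name is hypothetical, and clamping at h6 is just one possible answer to the "pushed past h6" question, not a decision anyone in this thread has made.

```python
import re

# Matches the start of an opening or closing h1..h6 tag.
HEADING_RE = re.compile(r"<(/?)h([1-6])\b", re.IGNORECASE)


def shift_headings(html: str, levels: int) -> str:
    """Demote every heading by `levels`, clamping at h6."""
    def demote(match: "re.Match[str]") -> str:
        new_level = min(int(match.group(2)) + levels, 6)
        return f"<{match.group(1)}h{new_level}"

    return HEADING_RE.sub(demote, html)
```

Clamping keeps the markup valid but flattens the deepest levels together, which is the visible cost of this approach on deeply nested pages.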

I'd suggest we go for the title prepend solution first as it's simplest, then open ourselves up to other, possibly better-looking changes in the book formatting afterward. There's much less drawback to changing a new feature that's as inside baseball as this one compared to something like the docs home, where each change brings up SEO questions.

@jorgeorpinel

Epub is very close to HTML, after all.

OK but is epub easier than HTML? We already have HTML to work with, so the priority here would be HTML. Most people, I assume, will be more comfortable with this format too: you can simply extract the archive and browse the local docs with your browser.

making any single page that contains a doc page with its children

I wasn't suggesting boiling everything down to a single page. Just replace the links so you can navigate the static site locally over file:// as explained in #35 🙂

I like your ideas but unless I'm getting something wrong, I think we're overengineering this. Have you run yarn build and checked out the contents of public/doc/ @rogermparent?

I'd suggest we go for the title prepend solution first as it's simplest

If you think the build process can add some metadata to these html files so we can easily transform them with a custom script, that sounds good to me.

Alternatively, have you checked out pandoc? See #35

Thanks

@gonewest818
Author

I was really thinking as simple as a script that does this:

pandoc index.md dvc-files-and-directories.md dvc-file-format.md dvcignore.md [...] -o dvc-v0.94.0-user-guide.epub

More sophistication isn't strictly needed as a first MVP, and if you decide to do something nicer later you can always go back and regenerate the legacy docs.

What does need to be fixed are paths to image resources, and whatever you're using to handle the "Expand to learn about ..." sections in your tutorials also doesn't convert properly.

For example if you look at the AWS documentation in html and pdf/mobi formats side by side, you can see they're not sweating over the details very much at all...

@jorgeorpinel

jorgeorpinel commented May 8, 2020

Agreed, and thanks for bringing up those expandable sections! They should probably just get fixed open (expanded) somehow before transforming to the final html/epub
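
One way to fix the sections open, assuming they are rendered as HTML `<details>` elements (an assumption about the site's markup, not something confirmed in this thread), is to add the `open` attribute before conversion. A minimal Python sketch with a hypothetical function name:

```python
import re

# Opening <details> tags that do not already carry an "open" attribute.
# Caveat: the check is textual, so a class name containing the word "open"
# would also cause the tag to be skipped.
DETAILS_RE = re.compile(r"<details\b(?![^>]*\bopen\b)", re.IGNORECASE)


def expand_details(html: str) -> str:
    """Force every <details> element to render expanded."""
    return DETAILS_RE.sub("<details open", html)
```

Whether an ebook reader honors `open` depends on the reader, so replacing the element with a plain heading and block may turn out to be the more robust transform.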

Ideally we should also generate a glossary page and link the glossary terms to it (since the tooltips don't work) but this is more advanced, for another iteration of this.

@jorgeorpinel

Would be nice to prioritize this if possible cc @efiop (per iterative/dvc.org#593 (comment)) and @shcheklein

BTW is this kind of a duplicate of iterative/dvc.org#593?

@efiop

efiop commented Jun 3, 2020

@jorgeorpinel Doesn't seem like a duplicate to me.

@rogermparent
Contributor

@jorgeorpinel I think these should be separate issues. This one's initial concept is a legitimate concern that got informally usurped in the comments by what iterative/dvc.org#593 addresses.

The two may have some overlap depending on how either is implemented, but at their core the issues are totally different. I think the best course going forward is to discuss and prioritize docs versioning at iterative/dvc.org#593 and leave this issue to be specifically alternate formats like ebooks.

Sorry @gonewest818, lots of other stuff has gotten in the way of this issue! I'm going to unassign myself from this issue so others may be more inclined to take it and/or iterative/dvc.org#593 on.

@rogermparent rogermparent removed their assignment Jun 3, 2020
@jorgeorpinel

OK but I think this is much easier to implement and covers the basic need of iterative/dvc.org#593: "having docs for different DVC versions".

@jorgeorpinel

@rogermparent please decide whether this is still a need, prioritize, etc. Transferring to the engine repo...

@jorgeorpinel jorgeorpinel transferred this issue from iterative/dvc.org Jun 29, 2022
@jorgeorpinel jorgeorpinel removed the A: docs Area: documentation label Aug 25, 2022
8 participants