docs: publish as ebook/plain html/pdf or other formats #35
Comments
So. This need has recently surfaced again as a relatively easy way to start keeping an archive of versions of the docs that match different major DVC releases. So either a PDF eBook or a simple standalone static HTML website of dvc.com/doc would be ideal, if that's something we can achieve easily with Gatsby. Thoughts @shcheklein @fabiosantoscode @iAdramelk ? Cc @dmpetrov and @rogermparent Thanks! |
With the Models PR separating Doc nodes from others, something like this should be pretty painless to implement as long as there's a way to generate the required formats in Node. There's also the different ways such a page could be formatted like choosing if we keep the sidebar, use another more page-friendly form of index, or skip the index altogether. I can also see the need for some slight schema changes to get every page accessible in sidebar order, but that wouldn't be a big deal for me to implement. I'm going to look into |
Since the website is already a set of static files, keeping an archive of HTML shouldn't be too hard to accomplish. Remember that most of what's good for epub, is also good for PDF and print. A lot of it also applies to AMP. So we can deal with all of those at once if need be. |
Of course there's the joining all the pages together, which depends on how epub works (I don't know anything about it!), but if we use a print-to-PDF tool we can control page breaks with CSS, and the rest (removing sidebars, top bar) with print CSS. |
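A rough sketch of the print-CSS idea above, assuming the built pages have a normal head element. The selectors .sidebar and .topbar and the article element are hypothetical; the real class names on the docs site are not given in this thread.

```python
# Hypothetical selectors: the real markup of the docs pages is not shown in the thread.
PRINT_CSS = """
@media print {
  .sidebar, .topbar { display: none; }  /* hide navigation chrome when printing */
  article { break-before: page; }       /* start each doc page on a new sheet */
}
"""


def inject_print_css(html: str, css: str = PRINT_CSS) -> str:
    """Insert a <style> block just before </head>.

    Returns the input unchanged when no </head> marker is found.
    """
    marker = "</head>"
    index = html.find(marker)
    if index == -1:
        return html
    return html[:index] + "<style>" + css + "</style>" + html[index:]
```

Any print-to-PDF tool that honors print media rules should then pick these styles up, with no need for a separate template.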
Thanks for the answers guys, sounds promising! But I'm wondering how to keep this as simple as possible. We have all the content in Markdown so in theory this should not be a tough problem; let's not even force it to be done via Gatsby if it's too involved.
Let's focus on this format for now. What I'm imagining is:
Possible approaches: a) I haven't studied Gatsby yet unfortunately, but since it's an SSG these goals should pretty much be its most basic behavior anyway? Except we have so many other layers of complexity, pages, blog, etc... Is there a way to create a special build that ignores everything except the
@fabiosantoscode sounds like you're saying we basically already have the archive, but there are some server-side elements that don't let us just release current builds as a plain site, right? Again, there'll be no server (i.e.
You may be right but let's keep this super simple and think only about HTML for now. b) And if that's complicated, what about a completely custom script (not Gatsby) that uses some other tool to build the plain site explained above? I.e. a hand-written HTML layout (and CSS) and a Node.js script that uses some library to "compile" MD to HTML. It could be run manually or as part of the CI/CD (even list the archive as an asset in https://github.com/iterative/dvc.org/releases). |
For whatever it's worth, I really did mean epub or similar in my request. So I'm all for you getting your static html archive issue solved because it seems like a pressing need, but also if you could keep epub format in your sights I would appreciate it. |
Gotcha. Yes, in a second iteration on this, formats like epub and pdf could be addressed. We'll keep this issue open. Thanks |
@jorgeorpinel I'm 99% sure we can achieve what you want pretty easily with Gatsby. We're probably going to need to post-process the resulting HTML to:
To process HTML we can use something like cheerio which is easy to use. Besides the above, correct me if I'm wrong @rogermparent but I think we should be able to create a new For local development, one can visit |
@jorgeorpinel @fabiosantoscode @rogermparent Sorry for late answer. Technically generating either epub, pdf or single html file should be relatively easy. But we also have another problem to solve here: How to maintain and index docs from different versions of dvc. If I understand the basic problem correctly we don't just want to create static version, we also want to store generated versions for previous docs and probably have some way to access them in the site's UI. So we need to either store all versions in md explicitly and rebuild all of them every time, or to have some way to save generated artifacts between builds and add their results to the next build. There are different ways to solve this problem. Simplest is probably just to add command to cli that generate folder/file and to add it to git manually, but there are other automated ways to do it. |
The version archive is a separate, but interesting issue. At first brush I'd suggest explicitly storing each version. The docs build much faster than the blog so I don't see adding more docs pages being a big issue. It's certainly something I'd like to solve from within Gatsby, but I'll have to hammer out the exact implementation later. My primary goal right now is to get docs pages accessible in sidebar order through a custom resolver, as that is required before we do anything with alternate doc formats. Once I do that, stuffing the data into an epub generator is practically minutes away barring any unforeseen issues. |
Any hints on things to try for this effect would be greatly appreciated.
Git tags that match those in https://github.com/iterative/dvc/tags? The archives could be set up as artifacts in the GitHub release history, same as old versions of DVC itself. |
@jorgeorpinel we have a similar task for our internal handbook. We solved it using pandoc. In the simplest case you can generate an epub or pdf from a folder of markdown files with one cli command, see examples at: https://pandoc.org/demos.html or the ebook tutorial here https://pandoc.org/epub.html We can add syntax highlighting, a title page, our own css, etc. But AFAIR it will not work with remark plugins for markdown and we still need to read the file order and titles from sidebar.json. If we want to use plugins and sidebar.json we will need to generate a correct html page with all docs and meta and then convert it to pdf or epub with pandoc. We can do it two ways:
PROS:
CONS:
PROS:
CONS:
TLDR:
|
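For what it's worth, the "read file order from sidebar.json, then hand the files to pandoc" step could look roughly like the sketch below. The sidebar.json shape (bare slug strings, or objects with slug and children), the content/docs directory, and the one-file-per-slug layout are all assumptions made for illustration; the real file may differ. The pandoc options used (--toc, -o) are standard ones.

```python
import json


def flatten_sidebar(items, prefix=""):
    """Return doc paths in sidebar order (depth-first, parents before children)."""
    paths = []
    for item in items:
        # Assumed shape: a bare slug string, or {"slug": ..., "children": [...]}.
        if isinstance(item, str):
            item = {"slug": item}
        path = f"{prefix}/{item['slug']}" if prefix else item["slug"]
        paths.append(path)
        paths.extend(flatten_sidebar(item.get("children", []), path))
    return paths


def pandoc_command(sidebar, content_dir="content/docs", output="docs.epub"):
    """Build the argument list for a single pandoc run over all docs, in order."""
    # Assumption: every slug maps to exactly one Markdown file.
    files = [f"{content_dir}/{path}.md" for path in flatten_sidebar(sidebar)]
    return ["pandoc", "--toc", "-o", output, *files]


# Hypothetical sidebar content, not the real one.
sidebar = json.loads('["install", {"slug": "start", "children": ["data", "model"]}]')
command = pandoc_command(sidebar)
```

The resulting list can be passed to subprocess.run(command, check=True) on a machine with pandoc installed. Changing the output name to end in .pdf would select PDF output instead, provided a PDF engine is available.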
Pandoc looks great if we go for a custom shell script that just generates HTML files from the Markdown ones, thanks for the tip! No React plugins or navigation.
2-3 hours? Takes a few minutes 🙂 Yeah guys actually I ran build and looked at the HTML files in public/doc/ and they're basically almost ready for this! Images and links are all broken but they look perfect: Other than the more complex processes outlined above by some of you, can we pass this directory through a custom script that will just fix links and images? I think most images are in public/img/. Also, for the record, I think the first PR for DVC 1.0 is in iterative/dvc.org#1215 (comment) so we could tag the commit before that as the last 0.9x in order to produce the last pre-1.0 docs archive. p.s. 0.94.0 is out with all the optimizations and bug fixes so we would probably want to apply those to the pre-1.0 docs as well... |
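A minimal sketch of such a link-fixing script, assuming the broken links and images are root-absolute URLs (like /img/a.png) inside href and src attributes, and that page paths are given relative to the build directory. It is regex-based for brevity, so it only handles double-quoted attributes. The function name and sample paths are made up for illustration.

```python
import posixpath
import re

# Matches href="/..." or src="/..." but not protocol-relative "//host/..." URLs.
ATTR = re.compile(r'(?<![\w-])(?P<attr>href|src)="(?P<url>/[^"/][^"]*)"')


def relativize(html: str, page_path: str) -> str:
    """Rewrite root-absolute URLs so they resolve relative to page_path.

    page_path is the page's location inside the build directory,
    e.g. "doc/start/index.html".
    """
    page_dir = posixpath.dirname(page_path) or "."

    def replace(match):
        target = match.group("url").lstrip("/")
        relative = posixpath.relpath(target, start=page_dir)
        return f'{match.group("attr")}="{relative}"'

    return ATTR.sub(replace, html)
```

Run over every file under public/doc/, this would let the archive be browsed over file://. Links that point at a directory rather than a file would still need index.html appended, which this sketch does not attempt.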
I assigned myself to this issue because I'm working with the sidebar, but if someone wants to try getting a working pandoc script going I'd be happy to let them take it. I already have a branch that integrates I'm thinking I could use rehype to make each top-level category a chapter that contains all its children, then implement the custom HTML ToC option from |
How about just a simple html web site like we mentioned @rogermparent? Would that be a better first step? It would cover our first use case here which is having an archive of older doc versions. |
@jorgeorpinel kind of, but it runs into the same problems. Epub is very close to HTML, after all. Because of specifically how our content is structured hierarchically but written as individual pages, making any single page that contains a doc page with its children becomes an issue that requires some sort of transformation of the content to make nesting visible. The simplest viable transform that I can think of in this regard is something that prepends a sort of chapter ID before the title of each page.
(Not our actual content structure, but you get the point) There's also other ways to go about it, like plugins for Rehype to shift all headings of a document in a level so child headings will "nest", but that opens up questions like "how deep do our docs go with headings without this?" and "what do we do with a heading that gets pushed past h6?". I'd suggest we go for the title prepend solution first as it's simplest, then open ourselves up to other possibly better-looking changes in the book formatting afterward because there's much less drawback to change on a new feature that's as inside baseball as this one compared to something like the docs home where each change brings up SEO questions. |
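To make the title-prepend idea concrete, here is a small sketch assuming the docs are available as a nested list of pages in sidebar order. The title and children keys and the sample titles are hypothetical, not the real content structure.

```python
def number_titles(pages, prefix=()):
    """Yield titles prefixed with a dotted chapter ID, in document order."""
    for index, page in enumerate(pages, start=1):
        numbers = prefix + (index,)
        chapter_id = ".".join(str(n) for n in numbers)
        yield f"{chapter_id} {page['title']}"
        # Children inherit the parent's numbers, so nesting stays visible in a flat list.
        yield from number_titles(page.get("children", []), numbers)


# Hypothetical tree, for illustration only.
tree = [
    {"title": "Page A"},
    {"title": "Page B", "children": [{"title": "Page B1"}, {"title": "Page B2"}]},
]
titles = list(number_titles(tree))
# titles == ["1 Page A", "2 Page B", "2.1 Page B1", "2.2 Page B2"]
```

Because the prefix carries the hierarchy, heading levels inside each page can stay untouched, which sidesteps the "pushed past h6" question for a first version.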
OK but is epub easier than HTML? We already have HTML to work with, so the priority here would be HTML. Most people, I assume, will be more comfortable with this format too: you can simply extract the archive and browse the local docs with your browser.
I wasn't suggesting boiling everything down to a single page. Just replace the links so you can navigate the static site locally over file:// as explained in #35 🙂 I like your ideas but unless I'm getting something wrong, I think we're overengineering this. Have you run
If you think the build process can add some metadata to these html files so we can easily transform them with a custom script, that sounds good to me. Alternatively, have you checked out pandoc? See #35 Thanks |
I was really thinking as simple as a script that does this:
More sophistication isn't strictly needed as a first MVP, and if you decide to do something nicer later you can always go back and regenerate the legacy docs. What does need to be fixed are paths to image resources, and whatever you're using to handle the "Expand to learn about ..." sections in your tutorials also doesn't convert properly. For example if you look at the AWS documentation in html and pdf/mobi formats side by side, you can see they're not sweating over the details very much at all... |
Agreed, and thanks for bringing up those expandable sections! They should probably just get fixed open (expanded) somehow before transforming to the final html/epub
|
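One possible way to fix the expandable sections open, assuming they end up in the built HTML as details elements. That is a guess, since the thread does not say how they are rendered; if they are driven by JavaScript instead, this sketch would not apply.

```python
import re

DETAILS_TAG = re.compile(r"<details\b([^>]*)>", re.IGNORECASE)
# A standalone "open" attribute, not the word "open" inside another attribute's value.
OPEN_ATTR = re.compile(r"(?:^|\s)open(?:\s|=|$)", re.IGNORECASE)


def expand_details(html: str) -> str:
    """Add the open attribute to every <details> tag that lacks it."""

    def replace(match):
        attrs = match.group(1)
        if OPEN_ATTR.search(attrs):
            return match.group(0)  # already expanded, leave as-is
        return f"<details{attrs} open>"

    return DETAILS_TAG.sub(replace, html)
```

Applied before the html/epub conversion, this keeps the collapsed content visible in formats that have no way to toggle it.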
Would be nice to prioritize this if possible cc @efiop (per iterative/dvc.org#593 (comment)) and @shcheklein BTW is this kind of a duplicate of iterative/dvc.org#593? |
@jorgeorpinel Doesn't seem like a duplicate to me. |
@jorgeorpinel I think these should be separate issues. This one's initial concept is a legitimate concern that got informally usurped in the comments by what iterative/dvc.org#593 addresses. The two may have some overlap depending on how either is implemented, but at their core the issues are totally different. I think the best course going forward is to discuss and prioritize docs versioning at iterative/dvc.org#593 and leave this issue to be specifically about alternate formats like ebooks. Sorry @gonewest818, lots of other stuff has gotten in the way of this issue! I'm going to unassign myself from this issue so others may be more inclined to take it and/or iterative/dvc.org#593 on. |
OK but I think this is much easier to implement and covers the basic need of iterative/dvc.org#593: "having docs for different DVC versions". |
@rogermparent please decide whether this is still a need, prioritize, etc. Transferring to the engine repo... |
As discussed on discord, would the team please consider generating ebook versions of the documentation as an additional artifact of the site build?
I've tested manual conversion with pandoc, which looks promising, but obviously the output needs tweaking and the process is not nice, e.g. I had to manually parse sidebar.json to get the correct ordering of the articles. Whereas this might get closer to the right thing:
https://www.gatsbyjs.org/packages/gatsby-plugin-ebook/#gatsby-plugin-ebook
thanks all. stay safe-
UPDATE: Jump to #35