layering: FG only vs. FG/BG vs. something even more complicated #11
Comments
Without having thought things through, I'm guessing an image-as-background layer might be easier to implement for DomTerm or other terminals that aren't tile/cell-based. Mixing text and images may be easier if the image layer (canvas) is separate from the text layer. However, resource management (including the issue of when to clear and release an image) would be different: possibly easier, possibly harder. Just mentioning the point; not arguing for any particular approach. |
Thinking out loud: Before walking into "more complex image processing" territory (partial updates, etc.), I would suggest we at least keep an eye on the selective erases (DEC and/or ISO style) and DEC rectangle functions. Granted they are not implemented by every vte ( A clarification: My point is, FG actually has other advantages that are usually overlooked because not every vte implements some really handy functions or features that later models of the DEC VT series came equipped with. These functions can let us easily delegate a lot of stuff to applications and possibly shrink the draft. OTOH, their adoption is another big issue... |
Hey, with this draft spec I was very closely following Egmont's post on a future Good image protocol, trying to formalize it and add some semantics/syntax to it (hence, this repo name :) ) as I also stand behind the statements he has made in there. I think one gets the best success by being simple, easily understandable, and therefore quickly adoptable by TUI/VTE devs. The drawbacks of "image data replaces text" you mention I do not see as a drawback but as a strength, as this way you guarantee to be best fitting into the already existing VT sequences. Anything else would make it harder to stay compatible with existing VT sequences (IMHO). This is why this draft spec has also adopted it that way. Things like blending I would definitely recommend not to include, as it would just reduce simplicity and potentially hurt adaptability / acceptance. Egmont's argument that such things belong to the client side I am fully standing behind, too. As to the idea of using that @ismail-yilmaz I actually plan to implement the rectangular and selective VT sequences in my next milestone release as I think it shouldn't be too hard and certainly fun to use on the client side, too. :) |
I had implemented them for our TerminalCtrl (a.k.a. Ultimate++ terminal widget) and for its stripped-down and specialized, commercial sibling (about which I am not free to disclose more info), and all I can say is that these functions are efficiently used in the wild by some client software to do interesting stuff with inline images, thanks to FG rendering, without requiring a new image protocol. But then again, this is just one instance I am pointing out, not the rule. P.S. They are not hard to implement, but selective erases have one or two little quirks you may want to know about, so if you need help, let me know. |
This brings up another question that has been on my mind for some time, related (albeit loosely) to layering: Who is the target audience? Desktop vte users, web browser users? I mean, this feature sounds nice but the examples are from (AFAIK) desktop terminals with relatively high requirements. What about vtes on low end devices (!= old), e.g. Raspberry Pi boards, or some other similar device utilizing linux fb and the CPU to draw stuff? How would layering scale down to low end devices? The layering is not a problem in itself, IMO, but it being a part of the core draft might lead to some conflicts. |
Hmm guess I shaped the issue too broad, and it should have been two issues:
Following the original idea "image data is treated as FG content" strict cell affinity is kinda implied, with all pros and cons. This only works with a tiling model, where an image gets spread across text cells, and tiles move along with them. All text manipulation sequences are meant to still work as expected, even the rectangular one (if implemented). The really hard part is the text reflow thingy here, image tile reflow is just not the right thing to do. We need an answer to that, if we keep going with the text cell tiling model. Layering on the other hand is more about which layer we want to address, where to put it relative to existing content layers like FG/BG in a sense of visual representation (and thus in terms of how transparency rules will apply). Do we want multiple layers? Or are we good with just one as FG replacement layer? |
We sidestepped into this topic in #8, so I reply in this thread instead: I am generally against the z-axis with the fact that currently (in Kitty for example) you can magically blend images over text and text over images. - how many images can one cell then contain? What about commands like
In the spec I currently even mention that as a potential future improvement. If I'm reading it once more, I'll go straight to How it could be added. Let's not tamper with z-axis ideas. Just reuse the existing command for uploading the image and then add a new command that puts that image into the background, to be precise: as background image filling the full screen (p.s. @jerch, this might then be bad to provide rows-x-cols during upload, as you might potentially down-scale an image, losing information, that then gets up-scaled again, just to display it as fullscreen background image, any thoughts?) So apart from having the ability to programmatically set the terminal's screen background (that stays the same when the user scrolls up/down in the history!), I wonder if there is any real use-case for the z-axis and the wish to be able to draw above the text or anywhere else below the text. |
I agree, a fine-grained z-index thingy gets a "no" from me. This is way over what terminals should be capable of doing, imho. There is one exception though - I still think one composition/blending step from old to new image at cell XY might be a good idea. Could be done like that in a drawing sequence:
As I wrote in the xterm image PR this opens the door for apps to do partial updates by follow-up images, where unchanged areas are simply set to full transparency. With PNG this allows quite nice pack rates, effectively only transmitting the changed pixels. For the blending itself I am not interested in complicated blending modes either (we don't write an advanced image manipulation lib, just go with additive to get the full transparency mapping from old data done).
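A minimal sketch of the partial-update idea above, under loud assumptions: images are plain 2D lists of RGBA tuples, and only fully transparent pixels (alpha 0) let the old content show through. The name `compose_update` is hypothetical and not part of any draft spec; a real blend rule for partial alpha is not settled in this thread.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int, int]  # (r, g, b, a), each in 0..255
Image = List[List[Pixel]]


def compose_update(old: Image, new: Image) -> Image:
    """Overlay `new` onto `old` for one partial update.

    Assumption: a pixel with alpha == 0 in `new` means "unchanged",
    so the old pixel is kept. Every other pixel replaces the old one.
    Both images are assumed to have the same dimensions.
    """
    result: Image = []
    for old_row, new_row in zip(old, new):
        result.append([
            old_px if new_px[3] == 0 else new_px
            for old_px, new_px in zip(old_row, new_row)
        ])
    return result


RED: Pixel = (255, 0, 0, 255)
BLUE: Pixel = (0, 0, 255, 255)
CLEAR: Pixel = (0, 0, 0, 0)

old_image: Image = [[RED, RED], [RED, RED]]
# Follow-up image: only the top-right pixel changed.
update: Image = [[CLEAR, BLUE], [CLEAR, CLEAR]]
merged = compose_update(old_image, update)
```

With an encoding like PNG, the large runs of `CLEAR` pixels in such an update should compress well, which is the bandwidth argument made above.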
Yes thats true. Even when I wrote things above I kinda ended up thinking, that a viewport background picture sequence might be beyond this spec. The problem I see here - I currently tend to think that we dont want any other layers beside FG at all. If we all agree on that, fine - lets skip the viewport background thingy here or do it by other means. But if we want to be able to address several layers (e.g. FG + BG + whatever else) - then we need another param in the sequence to denote the layer. There we could add "viewport background" as a target layer as well, and special case that type of sequence without any grid size params. |
the importance of the kitty z-axis for me has not been its correspondence to the integers, but rather three distinct values: -1, 0, and 1. without those, you can't flexibly integrate text and graphics into the same cell. sixel only lets you do graphics-over-text, not text-over-graphics, whereas with kitty i get both. as pointed out by @christianparpart , the ability to blend images and such eliminates much of the use case for other values. |
Guess we are now in the real layer needs discussion. @dankamongmen Does "graphics-over-text, not text-over-graphics" mean, you need both? Because for that we would need 3 FG-related layers: How about BG layers? Could FG-1 be treated as a single BG layer (effectively erasing other BG pixels), or do we need multiple layers here as well? |
Hey, many thanks for responding to that matter. My big question though is: what is the use case, or is that a purely academic feature that has no real use? I am concerned about this since especially in the TWG forum there seemed to be the consensus that a cell either displays text or an arbitrary image. I do not intend to have a spec that is so dead simple that there is no usefulness anymore, but I would prefer to avoid tending to either extreme and instead have a set of features that are necessary to sufficiently suit the app developers' needs without exploding on the other side (for really rarely useful features especially). We also need to think about TEs such as gnome-terminal/vte, which seem to really care about how many bits are used per cell. The z-axis approach would make it very unlikely to be adopted by vte/Gnome-Terminal. I would like to avoid that. :) |
ideally i would like three layers, yes (and three layers ought be sufficient): text, graphic, text. to fully match the Notcurses abstraction, i would theoretically need N layers (text, graphics, and arbitrarily many more text+graphics atop that), but i think that's a bit much to ask from terminal authors (though kitty does provide it). but let me at least get my three. that allows me to have a text background with graphic sprites over it, with labels on those sprites. with that--especially with graphics which don't wipe one another out (transparent pixels will suffice)--i can cover the vast majority of use cases. i.e., let me get these two, and i'm very happy:
|
basically, notcurses gives you piles -- rendering contexts -- each consisting of a totally ordered set of planes. those planes can have text or bitmaps on them (though not currently both). glyphs do not stack, but bitmaps theoretically do (and this is a reality on kitty). so someone could very well have, say:
and what we would expect to see (since notcurses uses the painter's algorithm from top to bottom of a pile) would be:
i can draw a diagram if that didn't make sense. so if you give me only three planes, i can't stack arbitrary graphics, even if i lay them down temporally, because i need to draw those P2 glyphs and have them obstruct P1, but not P3. but if you don't want to do this, i get it! i can tell users, "sorry, but glyphs do not stack properly when separated by text in all terminals." but if it's just as easy to do it......i wouldn't mind it =] |
so my work thus far has been heavy on the eye-catching sprite side of things, but real uses:
i mean, "what are the uses" -- consider any GUI app that lets you do this, that you'd like to be able to use remotely. i don't know about you, but X windows has never worked smoothly for me outside of a LAN. the greatly reduced bandwidth of a TUI (even with some graphics, especially if they're tightly encoded) makes it possible. i'm counting on this, for instance, for an SDR TUI i'm writing. my SDR samples gigabits per second, far more than i want to move across my network to render locally. likewise, X forwarding across that network sucks. but i can draw the top line of my waterfall plot in my TUI with pixels, and boom, now i have working, responsive SDR across the network. as a library author, i have found time and time again that creative users with problems you've never considered will use all the features you make available, often in delightful and unexpected ways. if it's not much more difficult to do N planes than 1 plane, and especially if it's a zero-cost-when-unused abstraction, give the people their planes, and see wonderful things down the road. if not, screw the people. if they want more planes, they can by god write their own terminal emulators, with anime and marching bands and built-in prolog interpreters. your calls, good sirs. |
Good morning @dankamongmen. I will keep it short for now and have details later. I certainly do not want to scare developers like you away from implementing this, so I definitely respect you taking the time in here :) I know vte/Gnome-Terminal is really bit-savvy and alacritty refuses to put in support for something that would increase cell size. The rendering part should be easy for any OpenGL based terminal like kitty (for me this would be the least problem too). I simply assume that a DECIC would still cut that rectangular area (containing three layers of images and text) in half. So other VT sequence semantics should not be altered at least. |
yep, understood. perhaps it would be a natural decay mode to accept a z-index parameter on bijection with 𝒁, and they can collapse to -1/0/1, and they could further collapse to -1/0 or 0/1, and further collapse to 0? so you can pass a z, but there's no guarantee you're going to get a fully faithful implementation. |
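The "decay mode" described above could be sketched as a simple clamp. This is only an illustration: the function name `collapse_z` and the `below`/`above` parameters (how many image layers a terminal supports under and over the text) are assumptions, not anything a spec or terminal defines.

```python
def collapse_z(z: int, below: int, above: int) -> int:
    """Collapse a requested z-index to the nearest supported layer.

    `below` and `above` are the (hypothetical) numbers of layers a
    terminal supports beneath and on top of layer 0. Any integer z is
    accepted; unsupported values decay toward 0.
    """
    return max(-below, min(above, z))


# Terminal with one layer on each side of the text: -1 / 0 / 1
three_layer = [collapse_z(z, below=1, above=1) for z in (-7, -1, 0, 1, 42)]
# Terminal with only "on top of text": 0 / 1
top_only = [collapse_z(z, below=0, above=1) for z in (-7, -1, 0, 1, 42)]
# Terminal with a single layer: everything lands on 0
single = [collapse_z(z, below=0, above=0) for z in (-7, -1, 0, 1, 42)]
```

An application can still pass any z it likes; it just gets no guarantee that distinct values stay distinct.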
From my perspective in implementing images in wezterm (both sixel and iterm2), and thinking only about images, rather than text+images, modeling 3 vs. arbitrary layers isn't especially different (fixed size array vs. a runtime sized vector), but rendering is where it gets more tricky with eg: OpenGL, and may require additional (arbitrary) draw passes, which is a bit at odds with the desire to minimize the number of draw passes for a high-performing renderer. That said, in thinking about how I might efficiently implement 3 layers in wezterm (eg: track a layer bitset per line and OR that for the viewport to understand which layers are present), it's not much of a stretch to apply that same technique to arbitrary layers.
I think this sounds reasonable, so long as the TE can indicate to the client app the number of supported layers, so that the app can make an educated choice about what it's going to use.
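The per-line bitset idea can be sketched roughly as follows. This is not wezterm's code; the helper names and the choice of -1 as the lowest layer are assumptions made for illustration.

```python
from typing import List

MIN_LAYER = -1  # assumed lowest layer; bit 0 of a mask represents it


def layer_bit(layer: int) -> int:
    """Return the bitmask with only the bit for `layer` set."""
    return 1 << (layer - MIN_LAYER)


def viewport_layers(line_masks: List[int], top: int, height: int) -> int:
    """OR the per-line masks of the visible lines into one mask."""
    mask = 0
    for line_mask in line_masks[top:top + height]:
        mask |= line_mask
    return mask


def layers_present(mask: int) -> List[int]:
    """Decode a mask back into a sorted list of layer numbers."""
    layers = []
    bit = 0
    while mask >> bit:
        if mask & (1 << bit):
            layers.append(bit + MIN_LAYER)
        bit += 1
    return layers


# Four lines of scrollback, each recording which layers hold content.
line_masks = [
    layer_bit(0),
    layer_bit(0) | layer_bit(1),
    layer_bit(-1),
    layer_bit(0),
]
upper_half = layers_present(viewport_layers(line_masks, top=0, height=2))
lower_half = layers_present(viewport_layers(line_masks, top=2, height=2))
```

A renderer could then skip draw passes for layers that are absent from the current viewport, and nothing in the sketch limits the mask to three layers.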
@dankamongmen: say layer If the former, then there is really just a single text cell with some number of image attachments that either render before or after the text, and that feels like an easy delta from my current implementation of images. If the text needs to logically render in layer 1 rather than 0 then there could be an attribute on the cell to specify its layer number. If the text is intended to blend over other background text, that feels much more complex to model! |
nope, never this. one glyph per cell; that is The Way. each glyph, however, might be at a different "height" relative to the various graphics. the only place where there is more than one glyph per cell is internally to the toolkit, which is providing the z-axis / pile abstraction. (in Notcurses, i speak of rendering -- taking a pile of totally ordered planes, and solving for each cell in terms of glyph, styling, background color, and foreground color -- and rasterizing -- turning this into an optimized series of outputs with respect to what's already on the screen -- and writing -- uhhh, writ(2)ing. by the end of rendering, you've projected N planes of text onto a single matrix. each cell of the matrix has a z-order, relative to the various graphic planes. i render from top to bottom, then rasterize from bottom to top, if that makes sense) so a cell has a glyph on it, but only one. that glyph might have an opaque background, or might not. if the background is opaque, nothing is rendered underneath it. if the background is not opaque, graphics can be rendered underneath it. graphics can always be rendered atop it. but never glyph-on-glyph. that's madness. |
Holy cow - ok, gonna need a bit to get through all. Plz let the ideas flow, I think it is good to have all maybe-things on the table first. Whether we really want to create rainbow ponies later on (guess I don't), we should discuss in a consolidation phase (sorting out technically challenging or useless stuff etc). |
OK, so my mental model for this is that each cell has a textual z-axis layer number (default 0) and that it can have essentially a list of attachments; each attachment is comprised of a slice (a reference to a region within an uploaded image) and its layer number. The layer number for the image is a render property (the same uploaded image could be rendered in multiple locations and different layers). Today wezterm effectively has a hardcoded textual layer of 0 and supports a maximum of one image slice at layer 0 (images trump text in its implementation today). There needs to be some guidance on resolving conflicting images with the same layer number, similarly for images that have the same layer number as text. Something like the above doesn't sound terribly difficult to implement. |
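The mental model above might look like this as data structures. All names are hypothetical, and the tie-breaking rule (text is drawn before an image on the same layer, so the image ends up on top; images on the same layer draw in attachment order) is an arbitrary assumption picked to mirror the "images trump text" behaviour mentioned above, since the thread leaves that guidance open.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass(frozen=True)
class ImageRegion:
    """A reference to a rectangular region within an uploaded image."""
    image_id: int
    x: int
    y: int
    width: int
    height: int


@dataclass
class Attachment:
    region: ImageRegion
    layer: int  # render property; the same upload may appear on several layers


@dataclass
class Cell:
    text: str = " "
    text_layer: int = 0
    attachments: List[Attachment] = field(default_factory=list)


def draw_order(cell: Cell) -> List[Tuple[int, str]]:
    """List what to draw for a cell, bottom-most item first."""
    items = [(cell.text_layer, 0, "text")]
    for index, attachment in enumerate(cell.attachments):
        label = f"image {attachment.region.image_id}"
        items.append((attachment.layer, 1 + index, label))
    items.sort(key=lambda item: (item[0], item[1]))
    return [(layer, label) for layer, _, label in items]


cell = Cell(
    text="a",
    attachments=[
        Attachment(ImageRegion(7, 0, 0, 8, 16), layer=1),
        Attachment(ImageRegion(3, 0, 0, 8, 16), layer=-1),
        Attachment(ImageRegion(9, 0, 0, 8, 16), layer=0),
    ],
)
order = draw_order(cell)
```

Note that the cell only stores small references; no bitmap is materialized per cell or per layer in this sketch.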
I get the feeling that we kinda mixed up rendering/output layers and buffer storage needs way too early. Ofc giving every layer its own persistence/buffer would impose bigger issues, as pointed out by @christianparpart: If I remember right, VTE uses around 16 bytes per cell, in xterm.js we currently alloc 12 bytes per se, but with an optional additional storage for very rarely used things like different underline styles and colors. I also added the image tile information there. Thus I think we should not carry the multiple layer idea over to the buffer yet, furthermore the persistence of all layer information is not even needed. Instead we can think of layers as composition rules, what needs to be drawn when to get certain output states (just like you would paint on a canvas in multiple passes). To tackle things maybe these steps are useful:
To 1.
but taking a closer look at those images, there you never have stacked glyphs. In fact you emphasize this further, that stacked glyphs are not needed (and I second that). Sweet, we still only need one "persistence layer" for text - FG. Let's think about how we can compose that output: image 1 - schematically:
Now lets reduce that step by step:
What I get from this - things would work here, if the image would not destroy former FG/BG data, but gets attached to the cells above FG. Then a renderer can always redraw things from 1 to 5. image 2 - schematically:
How to solve 3? How about that: So we basically would have these layers:
To 2. And to mark the bonus point here - this is still compatible with HTML engines, where BG/FG box-printing is hard tagged to a char styling (we cannot print between BG and FG there). The stitching through with BG(transparent) works there too. (Maybe this answers your question @christianparpart). To 3. @dankamongmen Would something like that work for your use cases? I know that it does not reflect every aspect of your wishlist, but should give capable composition tools at your hand with some multipass writing from appside. Btw I think both image layers should get the blending cap I proposed in #11 (comment). |
i need to focus on my poor satellites for a bit, and also get some sleep, but i really like where this is going, and i'm inclined to say yes. i'd like to read over it more closely, but your notion of "composable buffers" vs persistence within the emulator seems dead-on. i'm suddenly very excited about this entire effort. |
@jerch are you the primary xterm.js author? would you mind shooting me a mail at [email protected] so i might ask you a question or two? |
@dankamongmen You prolly got mail 😸 |
I also need new oil for my poor satellites first. But a few things for now:
Compilers I think is the least likely item that will pick it up. But a VIM-LSP plugin in order to improve visual of tiny tooltips would probably be appealing to the users. |
What do you suggest to discuss instead first? To me this thread feels pretty much on point, as it may shape fundamental supported/unsupported inner mechanics of the spec. Imho we cannot completely avoid the layering/composition discussion, it needs to be solved even for just one image layer to clarify transparency handling.
If we keep the output composition from the layers easy (no complicated blending), anyone could do that "shader work" on 2d arrays on the CPU with reasonable speed. Basically all TEs already have to do that for FG drawing on top of the BG color. And if the spec is any good, they gonna adopt it.
I totally agree on the perf aspect with you, typically I strive to squeeze things even for the last 5%. But lets not step into the premature optimization trap that early and even stall the discussion about fundamentals. Ofc with 4-pass rendering (FG-1, BG, FG, FG+1) we add runtime (roughly doubling, prolly more, if image has to be fetched from some "colder" storage), but thats only the case if both image layers hold any data. Your TE can still render as fast as before for BG/FG only (beside some additional conditions).
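As a rough illustration of the 4-pass order (FG-1, BG, FG, FG+1), here is a per-pixel painter's-algorithm sketch. It assumes each layer either contributes an opaque colour or nothing (`None`) at a given pixel; the function name and the string "colours" are made up for the example.

```python
from typing import Dict, Optional, Tuple

PASS_ORDER = ("FG-1", "BG", "FG", "FG+1")  # bottom-most pass first


def compose_pixel(layers: Dict[str, Optional[str]],
                  default: str) -> Tuple[str, int]:
    """Resolve one output pixel and count the passes that touched it.

    A layer that is missing or None is transparent at this pixel and
    costs no pass, which is the "as fast as before" case for plain
    BG/FG content.
    """
    colour = default
    passes = 0
    for name in PASS_ORDER:
        pixel = layers.get(name)
        if pixel is None:
            continue  # transparent or empty layer: nothing to draw
        colour = pixel
        passes += 1
    return colour, passes


plain_text = compose_pixel({"BG": "black", "FG": "green"}, "default")
glyph_over_image = compose_pixel({"FG-1": "octopus", "FG": "green"}, "default")
image_under_clear_bg = compose_pixel({"FG-1": "octopus"}, "default")
image_hidden_by_bg = compose_pixel({"FG-1": "octopus", "BG": "black"}, "default")
```

The last case shows why a transparent BG is needed for the lower image layer to "stitch through": an opaque BG simply paints over FG-1.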
Eww, thats a strong argument for a single color ASCII only TE. Will have a very small memory footprint, and is fast as hell. Lets do that. 🚀
Yay, lets get Clippy into vim, that would be huge. And we need sound, it should say "HEELLLOO" on startup 😸 |
FWIW, I don't see that additional layers means massively increasing the storage requirements. The spec talks about uploading an image, and then separately placing it in the model. That placement operation is, in my mind, setting That tuple isn't especially large and doesn't require materializing an additional bitmap per cell, or for each layer of each cell. No bitmaps are needed until those cells are actually rendered, and then the bitmap size need only be as large as the viewport. |
Nah. That is absolutely fine! I am actually getting happy that we are seriously progressing. I have two more days of stress and then i can work on some spec text to at least care about @wez 's alt-text idea. I will attempt to carefully follow in the meantime :-) |
This week I implemented translucent windows -- thanks to @dankamongmen 's really big clue :-) -- and got images too. Some notes on that here: https://gitlab.com/klamonte/jexer/-/issues/88 I strongly support the notion of a +/- Z axis in Good Image Protocol, i.e. be possible to do the green-text-"a"-over-octopus shot in #11 (comment) . That would enable an application to continue using the TE's font(s) rather than guess and render its own glyph(s) in the stack of image layers (which would also force text into images, ouch on performance!). How many more layers than one each below/above is less important IMHO. Perhaps GIP could say, "At least one image layer below and one above, but the TE is free to implement more. An application can detect the number of layers supported via {...an escape sequence or procedure...}". You can see a bit of the weirdness when layering images and text must always destroy the image here: The rectangles in the color wheel picture have the background rectangles of the text behind it because I blitted the image cells over the background of the underlying layer. But I cannot truly mix image and text in the same cell, or else I would have dropped the background rectangle but kept the text. A proper +/- Z would let me put the girl image, then text (foreground glyph only), then color wheel, and it would look as people are used to in proper GUI compositing window managers. |
Hey @klamonte Nice work! I'm also almost done with my refactors'n'cleanups so that I can resume work on the more fun topics (GIP) soon.
That is more or less what I'll also update the spec with. While it is nice for simplicity to have just one image layer that effectively replaces the text cell, we then found out (have proven, whatever), that the DEC VT 340 actually paints on top (not replacing) and even handles transparent pixels. So I've adapted my code base accordingly too.
While the last one is debatable, I think there's not really need for more image layers than that. Because the TE isn't an image editing program, and an app can still do so client-side (notcurses? @dankamongmen? impressions?) Have a nice weekend, EDIT: I just noticed your screenshot is a little more complicated. I'd still however prefer not to introduce more z-planes than really necessary (would more hinder adoption rate than adding usefulness) |
On a VT340 the drawing model is basically just a "draw on top"-canvas thingy. Which makes me think, that with a proper buffer abstraction the layering in the TE would not be needed at all, as the application could just do the layering passes itself. For that to work, the TE only would have to preserve merged pixel information from previous content layering. I have no clue yet, how that can be done efficiently with large scrollback & reflow & font-resizing, as here the pixel information would have to mutate. Maybe this would work in terminal buffer:
Now the layering by the application:
On a render cycle in the TE, do this:
This simplified model does not work yet with font-size changes, if they happen while a cell is in that "tile canvas" mode. For that to work, the TE would have to store all steps of the layering separately, and relayer things with the resized glyph. |
Font size changes IMHO should just cause pixels in cells to rescale (potentially only once, if there is a single user-facing head) to the new aspect ratio, and with anti-aliasing enabled if the TE supports that. The "canvas" stretches or shrinks and the picture stretches/shrinks with it. On the "images over text", how my cells render including to multihead . A more narrative flow figuring things out: https://jexer.sourceforge.io/evolution.html#year2021 (Pretty much all of the last two months were inspired by notcurses. It's really really cool and it's been fun working to catch up. :-) Well, trying to catch up: notcurses is so much faster, it feels like trying to follow a Ford GT40 in a Datsun 520.) |
Yes that would work as a workaround. It still might lead to poor output experience (glyph pixels getting stretched, losing their "sharpness"). But that would only happen on the tiled cells, which is maybe not a biggy. Stitching artefacts also might show up, if the canvas placing on the final output buffer is not pixel exact (can be achieved with some floor/ceil logic though). |
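One way the floor logic mentioned above could work, as a sketch: derive every cell's pixel start from the same floored formula, so neighbouring tiles share an exact boundary even when the cell size is fractional after a font resize. The function name `cell_span` is an assumption for illustration.

```python
import math
from typing import Tuple


def cell_span(index: int, cell_size: float) -> Tuple[int, int]:
    """Return (start_pixel, length_in_pixels) for a cell along one axis.

    Both edges are floored from the same formula, so the end of cell i
    is exactly the start of cell i + 1: no gaps and no overlaps.
    Cell lengths may differ by one pixel when cell_size is fractional.
    """
    start = math.floor(index * cell_size)
    end = math.floor((index + 1) * cell_size)
    return start, end - start


# Cells that are 8.5 pixels wide after a font-size change.
spans = [cell_span(i, 8.5) for i in range(4)]
```

Each tile canvas would then be scaled to its own span rather than to the nominal fractional size, which avoids seams between tiles at the cost of one-pixel size differences.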
oh yeah, I have definitely rejected as an axiom the idea of drawing my own glyphs since the very beginning. at that point, why the hell are you using a terminal (ok, maybe the networking argument)? |
Thanks! And awesome, would love to see the fun topics continue. :)
I like making images-replace-text an option, and would go even further to say it should be the default layer for images, i.e. default Z = 0. This is a departure from sixel, but would make it very explicit how images and text interact. Right now if you put an image up on xterm, and then place the cursor over that image, the text that was underneath will bleed through, showing that images are NOT replacing text. Example: It was much harder for me on the multiplexer/application end to make images and text coexist due to this. I resorted to drawing all images first -- being careful to stay within text cell boundaries -- and then text afterwards, because while text can destroy images (in most protocols), images could not destroy text. Worse, if you put text over image on several terminals (including xterm) you might corrupt other cells of that image. Giving the option of making images and text mutually destructive makes things like "thumbnails in ranger that work over ssh" much easier for ranger to do. (History: This is also why I abandoned the Kitty protocol in late 2018 / early 2019 after sixel was working. I would have needed two wildly different rendering models which looked to be a lot more invasive than I wanted -- and all just to support a single terminal, for a use case (multiplexer) that its team was uninterested in supporting. At that time only xterm, RLogin, and yaft could handle my sixel output without crashing, so it was more important to me to see more terminals implement _anything_ without crashing.)
Agreed. A large part of my motivation is demonstrating that we can get the effects we want with much smaller investments from the TE. I side with the worse is better approach: get something that works for the 90% case, and then use that knowledge to refine the solution for the rest. |
unifying kitty and sixel (and linux fbcon) ended up being less complex than i initially thought, or indeed implemented. there are several sections which need distinct implementations, but you can take those chunks and pretty entirely call them from the same code path. linux framebuffer is IMHO more difficult to unite with sixel+kitty than either is with one another. the chunks i needed split up were (this can be taken pretty entirely from
with those chunks, i have a very simple (5 state) machine and two common functions ( |
Indeed, it would probably be a lot easier now. At that time I did not have my sixel encoder factored out for independent testing, nor "encoders" (base64 wrappers) for iTerm2/Jexer, so it was internally a lot more spaghetti with wire encoding and cells-to-images all mushed together. (But now that I have invested more time in sixel, I find it going much farther than I thought it could, which reduces the desire for more complex protocols.) If I were going to do a protocol that treated text and images as wholly separate, I would probably redo my composition model along the way too, internally look more like a GPU-based TE (I'm learning some cool things from wezterm. ;-) ), and think more like a proper game engine.
lol. You hate it, yet you've also got the best encoder out here! |
Imho we should have a discussion, how image data shall be related to FG (foreground), BG (background) or any other more advanced layering ideas.
Background
Back in terminal-wg discussion I strongly voted for "image data is treated as FG content", therefore replaces existing FG content or gets replaced by later FG updates (be it text or another image). Most ppl in the thread agreed to that. This simplified approach has several advantages:
Problem
Ofc this simplicity has its drawbacks - image is FG content, period. No text-over-image, no image-image layering (blending), whatsoever. But we kinda already have some demand for more capable image handling:
I like to discuss, if those things above are not even worth to be considered in a TE env, or if they are real issues and if so, what are possible solutions to that. Maybe we find some neat solutions, maybe we keep concluding: