layering: FG only vs. FG/BG vs. something even more complicated #11

Open
jerch opened this issue Apr 22, 2021 · 37 comments

@jerch

jerch commented Apr 22, 2021

Imho we should have a discussion, how image data shall be related to FG (foreground), BG (background) or any other more advanced layering ideas.

Background
Back in terminal-wg discussion I strongly voted for "image data is treated as FG content", therefore replaces existing FG content or gets replaced by later FG updates (be it text or another image). Most ppl in the thread agreed to that. This simplified approach has several advantages:

  • easy to implement on terminal buffer level (low code entry barrier)
  • easy to comprehend - there is either nothing, some text char or some image tile in a cell
  • no need for composition / blending modes or image layer juggling, new content always replaces older stuff
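A minimal sketch of that "FG only" model, assuming a plain cell grid. `ImageTile`, `Grid` and the method names are made up for illustration and do not come from any terminal's code base:

```python
from dataclasses import dataclass
from typing import List, Optional, Union


@dataclass(frozen=True)
class ImageTile:
    """One cell-sized piece of an image (hypothetical structure)."""
    image_id: int  # handle of an uploaded image, made up for this sketch
    tile_x: int    # tile column within the image
    tile_y: int    # tile row within the image


# A cell holds nothing (None), one text char, or one image tile.
CellContent = Optional[Union[str, ImageTile]]


class Grid:
    def __init__(self, cols: int, rows: int) -> None:
        self.cols, self.rows = cols, rows
        self.cells: List[List[CellContent]] = [[None] * cols for _ in range(rows)]

    def put_text(self, col: int, row: int, text: str) -> None:
        # Text replaces whatever was in the cell, including image tiles.
        for i, ch in enumerate(text):
            if col + i < self.cols:
                self.cells[row][col + i] = ch

    def put_image(self, col: int, row: int, image_id: int,
                  w_cells: int, h_cells: int) -> None:
        # An image is spread over cells as tiles and replaces older content.
        for ty in range(h_cells):
            for tx in range(w_cells):
                if col + tx < self.cols and row + ty < self.rows:
                    self.cells[row + ty][col + tx] = ImageTile(image_id, tx, ty)
```

Newer content always wins here, so there is no composition step at all: printing a char over a tile simply evicts that tile.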

Problem
Ofc this simplicity has its drawbacks - image is FG content, period. No text-over-image, no image-image layering (blending), whatsoever. But we kinda already have some demand for more capable image handling:

  1. Send an image to the background. As a matter of fact, ppl love TEs for that (imho kitty and windows terminal support it by other means).
  2. SGR already knows a transparent mode for colors in FG/BG. I for myself dont know yet, what to make out of it, esp. thinking of a transparent FG seems awkward to me (makes only sense if TEs enter really complex image processing realms, meh). Imho transparent BG is already used by some TEs to let some lower layer shine through (e.g. some static background image).
  3. Blending might be a neat way to reduce bandwidth needs, when an app updates the TE screen by partial images. This is already possible with old SIXEL and its single transparency (but currently not used by any app I have tested). With the simplified idea above that's not possible, as newer image content would always replace older content.

I like to discuss, if those things above are not even worth to be considered in a TE env, or if they are real issues and if so, what are possible solutions to that. Maybe we find some neat solutions, maybe we keep concluding:

KISS - just keep it that darn simple, easy to comprehend, easy to implement. If you want image layering tricks - use an image processor. If you want fancy text-over-image stuff - use a browser.

@PerBothner

Without having thought things through, I'm guessing an image-as-background layer might be easier to implement for DomTerm or other terminals that aren't tile/cell-based. Mixing text and images may be easier if the image layer (canvas) is separate from the text layer.

However, resource management (including the issue of when to clear and release an image) would be different - possibly easier, possibly harder.

Just mentioning the point; not arguing for any particular approach.

@ismail-yilmaz

ismail-yilmaz commented Apr 23, 2021

@jerch

Thinking out loud: Before walking into "more complex image processing" territory (partial updates, etc.), I would suggest we at least keep an eye on the selective erases (DEC and/or ISO style) and DEC rectangle functions. Granted, they are not implemented by every vte (xterm implements them, for one, and it is arguably the de facto standard), but neither are inline images. Yet these functions have a well-defined structure that can be used with FG image cells (after all, we will be substituting characters with image parts). They may come in handy.

A clarification: My point is, FG actually has other advantages that are usually overlooked because not every vte implements some really handy functions or features that later models of DEC VT series came equipped with. These functions can let us easily delegate a lot of stuff to applications and possibly shrink the draft. OTOH, their adoption is another big issue...

@christianparpart
Member

christianparpart commented Apr 23, 2021

Hey,

with this draft spec I was very closely following Egmont's post on a future Good image protocol, trying to formalize it and add some semantics/syntax to it (hence, this repo name :) ) as I also stand behind his statements he has made in there.

I think one gets the best success by being simple, easily understandable, and therefore quickly adoptable by TUI/VTE devs.

The drawbacks of "image data replaces text" you mention I do not see as a drawback but as a strength, as this way you guarantee to be best fitting into the already existing VT sequences. Anything else would make it harder to stay compatible with existing VT sequences (IMHO). This is why this draft spec has also adopted it that way.

Things like blending I would definitely recommend to not include as it would just reduce simplicity and potentially hurt adaptability / acceptance. Egmont's argument that such things belong to the client side I am fully standing behind, too.

As to the idea of using that GIP to also set a static background image, I actually do not see a problem. This doesn't however have to be part of an initial draft release but could be easily part of a follow-up revision of the draft spec.
Specializing the VT sequence to denote that this image isn't to be rendered in the text grid but as page background (I think) would just be a matter of different parameters, or alternatively an additional sequence that still can make use of a prior upload sequence and share the other semantics (such as resize/alignment parameters).
So with regards to background images, that topic could be discussed in its own dedicated thread (maybe not as top prio to not lose focus on the core intentions).

@ismail-yilmaz I actually plan to implement the rectangular and selective VT sequences in my next milestone release as I think it shouldn't be too hard and certainly fun to use on the client side, too. :)

@ismail-yilmaz

@christianparpart ,

I actually plan to implement the rectangular and selective VT sequences in my next milestone release as I think it shouldn't be too hard and certainly fun to use on the client side, too. :)

I had implemented them for our TerminalCtrl (a.k.a. Ultimate++ terminal widget) and for its stripped-down and specialized, commercial sibling (about which I am not free to disclose more info), and all I can say is that these functions are efficiently used in the wild by some client software to do interesting stuff with inline images, thanks to FG rendering, without requiring a new image protocol. But then again, this is just one instance I am pointing out, not the rule.

P.S. They are not hard to implement, but selective erases have one or two little quirks you may want to know about, so if you need help, let me know.

@ismail-yilmaz

ismail-yilmaz commented Apr 23, 2021

Send an image to the background. As a matter of fact, ppl love TEs for that (imho kitty and windows terminal support it by other means).

This brings another question that was on my mind for some time related -albeit loosely- to layering: Who is the target audience? Desktop vte users, web browser users? I mean, this feature sounds nice but the examples are from -AFAIK- desktop terminals with relatively high requirements.

What about vtes on low end devices (!= old) e.g. rb pi boards, or some other similar device utilizing linux fb and CPU to draw stuff? How would layering scale down to low end devices. The layering is not a problem in itself, IMO, but it being a part of core draft might lead to some conflicts.

@jerch
Author

jerch commented Apr 25, 2021

Hmm guess I shaped the issue too broad, and it should have been two issues:

  • layering and transparency handling in general (dimension: where to put layer)
  • cell affinity (dimension: behavior of the layer, static vs. text cell bound)

Following the original idea "image data is treated as FG content" strict cell affinity is kinda implied, with all pros and cons. This only works with a tiling model, where an image gets spread across text cells, and tiles move along with them. All text manipulation sequences are meant to still work as expected, even the rectangular one (if implemented). The really hard part is the text reflow thingy here, image tile reflow is just not the right thing to do. We need an answer to that, if we keep going with the text cell tiling model.

Layering on the other hand is more about which layer we want to address, where to put it relative to existing content layers like FG/BG in a sense of visual representation (and thus in terms of how transparency rules will apply). Do we want multiple layers? Or are we good with just one as FG replacement layer?

@christianparpart
Member

christianparpart commented Jun 12, 2021

We sidestepped into this topic in #8, so I reply in this thread instead:

I am generally against the z-axis, given that currently (in Kitty for example) you can magically blend images over text and text over images: how many images can one cell then contain? What about commands like DECIC, SD, SU, IL, ICH, etc. then? This (IMHO) results in a spec nightmare or tons of undefined behaviors, as every TE might potentially behave differently.

Send an image to the background. As a matter of fact, ppl love TEs for that (imho kitty and windows terminal support it by other means).

In the spec I currently even mention that as a potential future improvement. If I'm reading it once more, I'll go straight to vim and add that. :-D

How it could be added: Let's not tamper with z-axis ideas. Just reuse the existing command for uploading the image and then add a new command that puts that image into the background, to be precise: as background image filling the full screen (p.s. @jerch, this might then be bad to provide rows-x-cols during upload, as you might potentially down-scale an image, losing information, that then gets up-scaled again, just to display it as fullscreen background image, any thoughts?)

So apart from having the ability to programmatically set the terminal's screen background (that stays the same when the user scrolls up/down in the history!), I wonder if there is any real use-case for the z-axis and the wish to be able to draw above the text or anywhere else below the text.
Maybe @dankamongmen can chime in on that matter, as he's been talking about this in alacritty/alacritty#910 (comment) - that would be helpful :)

@jerch
Author

jerch commented Jun 12, 2021

I agree, a fine-grained z-index thingy gets a "no" from me. This is way over what terminals should be capable of doing, imho. There is one exception though - I still think one composition/blending step from old to new image at cell XY might be a good idea.

Could be done like that in a drawing sequence:

  • param blend: true/false
    On false (default) any image drawing will erase older image data first. On true, transparent image parts will blend with older image data from previous drawing (additive).

As I wrote in the xterm image PR this opens the door for apps to do partial updates by follow-up images, where unchanged areas are simply set to full transparency. With PNG this allows quite a nice pack rate, effectively only transmitting the changed pixels. For the blending itself I am not interested in complicated blending modes either (we dont write an advanced image manipulation lib, just go with additive to get the full transparency mapping from old data done).
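A rough sketch of how the `blend` param could behave for a single cell tile. The function name and the pixel representation are assumptions for illustration only, and the sketch only handles fully transparent pixels (alpha 0); how partially transparent pixels should be treated is left open here:

```python
from typing import List, Optional, Tuple

Pixel = Tuple[int, int, int, int]  # (r, g, b, a), a == 0 means fully transparent


def draw_tile(old: Optional[List[Pixel]], new: List[Pixel],
              blend: bool = False) -> List[Pixel]:
    """Return the pixel data stored for a cell after drawing `new` onto it."""
    if not blend or old is None:
        # blend=false (default): older image data is erased first.
        return list(new)
    # blend=true: fully transparent pixels of the new image keep the old pixel,
    # everything else is taken from the new image.
    return [o if n[3] == 0 else n for o, n in zip(old, new)]
```

With this, a partial update is just a follow-up image whose unchanged pixels are fully transparent:

```python
old = [(1, 1, 1, 255), (2, 2, 2, 255)]
update = [(0, 0, 0, 0), (9, 9, 9, 255)]  # only the second pixel changed
merged = draw_tile(old, update, blend=True)
```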

(p.s. @jerch, this might then be bad to provide rows-x-cols during upload, as you might potentially down-scale an image, losing information, that then gets up-scaled again, just to display it as fullscreen background image, any thoughts?)

Yes thats true. Even when I wrote things above I kinda ended up thinking, that a viewport background picture sequence might be beyond this spec. The problem I see here - I currently tend to think that we dont want any other layers beside FG at all. If we all agree on that, fine - lets skip the viewport background thingy here or do it by other means. But if we want to be able to address several layers (e.g. FG + BG + whatever else) - then we need another param in the sequence to denote the layer. There we could add "viewport background" as a target layer as well, and special case that type of sequence without any grid size params.

@dankamongmen

the importance of the kitty z-axis for me has not been its correspondence to the integers, but rather three distinct values: -1, 0, and 1. without those, you can't flexibly integrate text and graphics into the same cell. sixel only lets you do graphics-over-text, not text-over-graphics, whereas with kitty i get both. as pointed out by @christianparpart , the ability to blend images and such eliminates much of the use case for other values.

@jerch
Author

jerch commented Jun 12, 2021

Guess we are now in the real layer needs discussion.

@dankamongmen Does "graphics-over-text, not text-over-graphics" mean, you need both? Because for that we would need 3 FG-related layers: [FG-1, FG, FG+1]. My current thinking regarding the FG layer - it is destructive for any FG content (erasing text).

How about BG layers? Could FG-1 be treated as a single BG layer (effectively erasing other BG pixels), or do we need multiple layers here as well?

@christianparpart
Member

the importance of the kitty z-axis for me has not been its correspondence to the integers, but rather three distinct values: -1, 0, and 1. without those, you can't flexibly integrate text and graphics into the same cell. sixel only lets you do graphics-over-text, not text-over-graphics, whereas with kitty i get both. as pointed out by @christianparpart , the ability to blend images and such eliminates much of the use case for other values.

Hey, many thanks for responding to that matter. My big question though is: what is the use case, or is that a purely academic feature that has no real use? I am concerned about this since especially in the TWG forum there seemed to be the consensus that a cell either displays text or an arbitrary image.

I do not intend to have a spec that is so dead simple that there is no usefulness anymore but i would prefer to avoid tending to either extreme, but have a set of features that are necessary to sufficiently suit the app developers needs without exploding on the other side (for really rarely useful features especially)

We also need to think about TEs such as gnome-terminal/vte, which seem to really care about how many bits are used per cell. The z-axis approach would make it very unlikely to be adopted by vte/Gnome-Terminal. I would like to avoid that. :)

@dankamongmen

@dankamongmen Does "graphics-over-text, not text-over-graphics" mean, you need both? Because for that we would need 3 FG-related layers: [FG-1, FG, FG+1]. My current thinking regarding the FG layer - it is destructive for any FG content (erasing text).

ideally i would like three layers, yes (and three layers ought be sufficient): text, graphic, text. to fully match the Notcurses abstraction, i would theoretically need N layers (text, graphics, and arbitrarily many more text+graphics atop that), but i think that's a bit much to ask from terminal authors (though kitty does provide it).

but let me at least get my three. that allows me to have a text background with graphic sprites over it, with labels on those sprites. with that--especially with graphics which don't wipe one another out (transparent pixels will suffice)--i can cover the vast majority of use cases.

i.e., let me get these two, and i'm very happy:

[image: Octopus-transparent-bg-fg]

[image: Octopus-transparent-kitty]

How about BG layers? Could FG-1 be treated as a single BG layer (effectively erasing other BG pixels), or do we need multiple layers here as well?

@dankamongmen

ideally i would like three layers, yes (and three layers ought be sufficient): text, graphic, text. to fully match the Notcurses abstraction, i would theoretically need N layers (text, graphics, and arbitrarily many more text+graphics atop that), but i think that's a bit much to ask from terminal authors (though kitty does provide it).

basically, notcurses gives you piles -- rendering contexts -- each consisting of a totally ordered set of planes. those planes can have text or bitmaps on them (though not currently both). glyphs do not stack, but bitmaps theoretically do (and this is a reality on kitty). so someone could very well have, say:

  • partially-transparent graphic plane P1
  • text plane P2 with transparent background (NCALPHA_TRANSPARENT)
  • partially-transparent graphic plane P3
  • smaller text plane in the center P4 with transparent background (NCALPHA_TRANSPARENT), surrounded by
  • a hollow rectangle of glyphs P5 with a solid background

and what we would expect to see (since notcurses uses the painter's algorithm from top to bottom of a pile) would be:

  • all text from P5 (with solid background of one color)
  • all text from P4 (just the glyphs, ma'am)
  • all pixels from P3 not obscured by glyphs from P4 or cells of P5
  • all text (glyph pixels) from P2 save that occupying a cell likewise occupied by P4/P5 (glyphs don't stack), and that obscured by pixels from P3
  • all pixels from P1 not obscured by pixels from P4, P3, or P2, or cells from P5
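A per-cell sketch of that top-to-bottom resolution, as a simplified model only. This is not Notcurses code; `PlaneCell` and `resolve_cell` are hypothetical names, and the sketch ignores per-pixel coverage (it only records which graphics sit above or below the winning glyph):

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class PlaneCell:
    """What one plane contributes to a single cell (hypothetical structure)."""
    name: str
    kind: str                    # "text" or "graphic"
    glyph: Optional[str] = None  # text planes only
    opaque_bg: bool = False      # text planes only


def resolve_cell(pile_top_to_bottom: List[PlaneCell]
                 ) -> Tuple[Optional[str], List[str], List[str]]:
    """Return (glyph, graphics above the glyph, graphics below the glyph)."""
    glyph: Optional[str] = None
    above: List[str] = []
    below: List[str] = []
    for plane in pile_top_to_bottom:
        if plane.kind == "graphic":
            # Graphics stack; they end up above or below the winning glyph.
            (above if glyph is None else below).append(plane.name)
            continue
        if glyph is None and plane.glyph is not None:
            glyph = plane.glyph  # glyphs don't stack: the topmost one wins
        if plane.opaque_bg:
            break                # nothing is rendered underneath a solid BG
    return glyph, above, below
```

For a cell covered only by P1, P2 and P3 this yields the P2 glyph with P3 above and P1 below it; under P4 it yields the P4 glyph with both graphics below; under P5 everything else is cut off.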

i can draw a diagram if that didn't make sense. so if you give me only three planes, i can't stack arbitrary graphics, even if i lay them down temporally, because i need to draw those P2 glyphs and have them obstruct P1, but not P3

but if you don't want to do this, i get it! i can tell users, "sorry, but glyphs do not stack properly when separated by text in all terminals." but if it's just as easy to do it......i wouldn't mind it =]

@dankamongmen

Hey, many thanks for responding to that matter. My big question though is: what is the Usecase, or is that a purely academical feature that has no real use. I am concerned about this since especially in the TWG forum there seemed to be the consensus that a cell either displays text or an arbitrary image.

so my work thus far has been heavy on the eye-catching sprite side of things, but real uses:

  • all that powerline-style stuff currently done with font+unicode extensions could be handled this way
  • plots of pixels can be much more dense, and if you can label them with text with a transparent background, you don't lose data
  • i've got a compiler. it's spitting out diagnostics. i want to wrap those diagnostics in a nice bubble and then put text on them
  • anything that i want to do that would be unusable over an X forward, but could use a few graphics for information pres
  • i have an aptitude-like package manager. i want to throw screenshots in a corner using pixels. i'm downloading a progressive png to display there. i want to show the % downloaded atop the rendering PNG

i mean, "what are the uses" -- consider any GUI app that lets you do this, that you'd like to be able to use remotely. i don't know about you, but X windows has never worked smoothly for me outside of a LAN. the greatly reduced bandwidth of a TUI (even with some graphics, especially if they're tightly encoded) makes it possible. i'm counting on this, for instance, for an SDR TUI i'm writing. my SDR samples gigabits per second, far more than i want to move across my network to render locally. likewise, X forwarding across that network sucks. but i can draw the top line of my waterfall plot in my TUI with pixels, and boom, now i have working, responsive SDR across the network.

as a library author, i have found time and time again that creative users with problems you've never considered will use all the features you make available, often in delightful and unexpected ways. if it's not much more difficult to do N planes than 1 plane, and especially if it's a zero-cost-when-unused abstraction, give the people their planes, and see wonderful things down the road. if not, screw the people. if they want more planes, they can by god write their own terminal emulators, with anime and marching bands and built-in prolog interpreters. your calls, good sirs.

@christianparpart
Member

Good morning @dankamongmen. I will keep it short for now and have details later.

I certainly do not want to scare developers like you away from implementing this, so I definitely respect you taking the time in here :)
But whilst I find your examples interesting (I didn't think I would need that, you are right), I definitely fear some bigger terminal emulators would not want to support that behavior.
Maybe I am getting my thinking wrong (@jerch how would you implement this?)
But I think you would then need to have at least 3 times the storage (per cell!) to support this behavior.

I know vte/Gnome-Terminal is really bit-savvy and alacritty refuses to put in support for something that would increase cell size.

The rendering part should be easy for any OpenGL based terminal like kitty (for me this would be the least problem too).
But storage could be a no-go for a few.

I simply assume that a DECIC would still cut that rectangular area (containing three layers of images and text) in half. So other VT sequence semantics should not be altered at least.
I think we would need some feedback on that matter by other terminal developers that
might be willing to support this hopefully future good image protocol.

@dankamongmen

yep, understood.

perhaps it would be a natural decay mode to accept a z-index parameter on bijection with 𝒁, and they can collapse to -1/0/1, and they could further collapse to -1/0 or 0/1, and further collapse to 0? so you can pass a z, but there's no guarantee you're going to get a fully faithful implementation.
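One way to read that decay rule, as a sketch. The function name and the idea of passing the terminal's supported layers as a tuple are assumptions for illustration, not part of any spec:

```python
from typing import Tuple


def collapse_z(z: int, supported: Tuple[int, ...] = (-1, 0, 1)) -> int:
    """Map an arbitrary integer z onto the layers a terminal supports.

    `supported` is assumed to be one of (-1, 0, 1), (-1, 0), (0, 1) or (0,).
    """
    sign = (z > 0) - (z < 0)  # collapse any integer to -1, 0 or 1
    return sign if sign in supported else 0  # unsupported side decays to 0


# A terminal with only "below text" and "text level" layers:
layer = collapse_z(42, supported=(-1, 0))  # decays to 0
```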

@wez

wez commented Jun 14, 2021

From my perspective in implementing images in wezterm (both sixel and iterm2), and thinking only about images, rather than text+images, modeling 3 vs. arbitrary layers isn't especially different (fixed size array vs. a runtime sized vector), but rendering is where it gets more tricky with eg: OpenGL, and may require additional (arbitrary) draw passes, which is a bit at odds with the desire to minimize the number of draw passes for a high-performing renderer.

That said, in thinking about how I might efficiently implement 3 layers in wezterm (eg: track a layer bitset per line and OR that for the viewport to understand which layers are present), it's not much of a stretch to apply that same technique to arbitrary layers.
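The bitset idea could look roughly like this. This is a sketch, not wezterm code; `Line`, `MIN_Z` and `MAX_Z` are hypothetical names and the layer range is an assumption:

```python
from typing import Iterable, List

MIN_Z = -1  # lowest supported layer (assumption for this sketch)
MAX_Z = 1   # highest supported layer (assumption for this sketch)


def layer_bit(z: int) -> int:
    """Bit used to mark layer z in a bitset."""
    return 1 << (z - MIN_Z)


class Line:
    def __init__(self) -> None:
        self.layer_bits = 0  # which image layers occur anywhere on this line

    def attach_image(self, z: int) -> None:
        self.layer_bits |= layer_bit(z)


def viewport_layers(lines: Iterable[Line]) -> List[int]:
    """OR the per-line bitsets to learn which layers need a draw pass."""
    bits = 0
    for line in lines:
        bits |= line.layer_bits
    return [z for z in range(MIN_Z, MAX_Z + 1) if bits & layer_bit(z)]
```

A renderer could then skip the draw pass for every layer that is absent from the viewport, and widening `MIN_Z`/`MAX_Z` is all it takes to cover more layers.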

perhaps it would be a natural decay mode to accept a z-index parameter on bijection with 𝒁, and they can collapse to -1/0/1, and they could further collapse to -1/0 or 0/1, and further collapse to 0? so you can pass a z, but there's no guarantee you're going to get a fully faithful implementation.

I think this sounds reasonable, so long as the TE can indicate to the client app the number of supported layers, so that the app can make an educated choice about what it's going to use.

ideally i would like three layers, yes (and three layers ought be sufficient): text, graphic, text

@dankamongmen: say layer -1 has W text and layer 1 has A text in the same cell, would that render as A (with W being overwritten completely) or as A blended over W?

If the former, then there is really just a single text cell with some number of image attachments that either render before or after the text, and that feels like an easy delta from my current implementation of images. If the text needs to logically render in layer 1 rather than 0 then there could be an attribute on the cell to specify its layer number.

If the text is intended to blend over other background text, that feels much more complex to model!

@dankamongmen

If the text is intended to blend over other background text, that feels much more complex to model!

nope, never this. one glyph per cell; that is The Way. each glyph, however, might be at a different "height" relative to the various graphics. the only place where there is more than one glyph per cell is internally to the toolkit, which is providing the z-axis / pile abstraction.

(in Notcurses, i speak of rendering -- taking a pile of totally ordered planes, and solving for each cell in terms of glyph, styling, background color, and foreground color -- and rasterizing -- turning this into an optimized series of outputs with respect to what's already on the screen -- and writing -- uhhh, writ(2)ing. by the end of rendering, you've projected N planes of text onto a single matrix. each cell of the matrix has a z-order, relative to the various graphic planes. i render from top to bottom, then rasterize from bottom to top, if that makes sense)

so a cell has a glyph on it, but only one. that glyph might have an opaque background, or might not. if the background is opaque, nothing is rendered underneath it. if the background is not opaque, graphics can be rendered underneath it. graphics can always be rendered atop it. but never glyph-on-glyph. that's madness.

@jerch
Author

jerch commented Jun 14, 2021

Holy cow - ok, gonna need a bit to get through all. Plz let the ideas flow, I think it is good to have all maybe-things on the table first. Whether we really want to create rainbow ponies later on (guess I dont), we should discuss in a consolidation phase (sorting out technically challenging or useless stuff etc).

@wez

wez commented Jun 14, 2021

OK, so my mental model for this is that each cell has a textual z-axis layer number (default 0) and that it can have essentially a list of attachments; each attachment is comprised of a slice (a reference to a region within an uploaded image) and its layer number. The layer number for the image is a render property (the same uploaded image could be rendered in multiple locations and different layers).
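That mental model, written down as a sketch. All names are hypothetical and this is not wezterm's data model. How to resolve ties between an image and text on the same layer is an open question in this thread; the sketch arbitrarily draws the image after the text, and same-layer images in attach order:

```python
from dataclasses import dataclass, field
from typing import List, Tuple, Union


@dataclass(frozen=True)
class Region:
    """A slice: reference to a region within an uploaded image."""
    image_id: int
    x: int
    y: int
    width: int
    height: int


@dataclass(frozen=True)
class Attachment:
    region: Region
    layer: int  # render property, not a property of the uploaded image


@dataclass
class Cell:
    text: str = " "
    text_layer: int = 0  # textual z-axis layer number, default 0
    attachments: List[Attachment] = field(default_factory=list)

    def draw_order(self) -> List[Tuple[str, Union[str, Attachment]]]:
        """Bottom-to-top draw order for this cell."""
        items = [(self.text_layer, 0, "text", self.text)]
        items += [(a.layer, 1, "image", a) for a in self.attachments]
        # Sort by layer; on a tie text comes first (assumption, see above).
        items.sort(key=lambda it: (it[0], it[1]))
        return [(kind, payload) for _, _, kind, payload in items]
```

Because the layer lives on the attachment, the same uploaded image can be referenced from several cells with different layer numbers.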

Today wezterm effectively has a hardcoded textual layer of 0 and supports a maximum of one image slice at layer 0 (images trump text in its implementation today).

There needs to be some guidance on resolving conflicting images with the same layer number, similarly for images that have the same layer number as text.

Something like the above doesn't sound terribly difficult to implement.

@jerch
Author

jerch commented Jun 14, 2021

I get the feeling, that we kinda mixed up rendering/output layers and buffer storage needs way too early. Ofc giving every layer its own persistence/buffer would impose bigger issues, as pointed out by @christianparpart:

If I remember right, VTE uses around 16 bytes per cell, in xterm.js we currently alloc 12 bytes per se, but with an optional additional storage for very rarely used things like different underline styles and colors. I also added the image tile information there.
With image data for every cell things get much worse - most desktop systems have a font size between 12 - 16px, with full RGBA decompressed pixel data thats like 288 - 512 bytes on top. Or 25 - 44 times the memory of the current usage. And thats only for one full image layer.
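The arithmetic behind those numbers, as a small sketch. It assumes a cell is half as wide as the font size is tall (6x12 and 8x16 px), which is what reproduces the 288 and 512 byte figures, and it compares against the 12 bytes per cell mentioned for xterm.js:

```python
BYTES_PER_PIXEL = 4  # RGBA, 8 bits per channel


def tile_bytes(cell_w: int, cell_h: int) -> int:
    """Decompressed pixel data needed to cover one cell."""
    return cell_w * cell_h * BYTES_PER_PIXEL


def memory_factor(cell_bytes: int, cell_w: int, cell_h: int, layers: int = 1) -> float:
    """Total memory per cell relative to the plain text cell."""
    return (cell_bytes + layers * tile_bytes(cell_w, cell_h)) / cell_bytes


small = tile_bytes(6, 12)               # 288 bytes at 12px font
large = tile_bytes(8, 16)               # 512 bytes at 16px font
factor_small = memory_factor(12, 6, 12)  # 25.0
factor_large = memory_factor(12, 8, 16)  # about 43.7
```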

Thus I think we should not carry the multiple layer idea over to the buffer yet, furthermore the persistence of all layer information is not even needed. Instead we can think of layers as composition rules, what needs to be drawn when to get certain output states (just like you would paint on a canvas in multiple passes). To tackle things maybe these steps are useful:

  1. identify which output states are wanted, what shall shine through, what cover or evict things, then relate that to BG/FG, as those are the only normative "layers" in TEs we have --> basic layering model
  2. the model from step 1. needs to be checked against different output models, in fact it should be possible to be adopted by TEs with very different drawing caps (from full GPU/OpenGL access down to very reduced early consoles, bonus - make it work under predefined layout engines like HTML)
  3. storage buffer needs - once we have something shiny from 1. and 2. it is time to narrow things down into a solid persistence model (prolly with some re-iterations over 1. & 2.)

To 1.
Imho @dankamongmen gives us some ideas here with the images and the listing above. You say:

ideally i would like three layers, yes (and three layers ought be sufficient): text, graphic, text.

but taking a closer look at those images, there you never have stacked glyphs. In fact you emphasize this further, that stacked glyphs are not needed (and I second that). Sweet, we still only need one "persistence layer" for text - FG. Lets think about how we can compose that output:

image 1 - schematically:

  #drawing layer        #screen         #classification

1 purple background    0000000000000    fill color
2 red 'a's             a   a   a   a    glyph
3 octopus                542318         pixel data
4 black BG                    111       fill color
5 frumpy                       f        glyph

Now lets reduce that step by step:

  • 1 + 2 is what all TEs already can do with FG + BG settings
  • +3 places image data on top not destroying previous data (thus blending for transparent parts) - Questions: What shall happen to the FG content here? Are the 'a's still in the cells? Only for transparent image parts? What about partially covered cells?
  • +(4 + 5) again easy, FG + BG print on top, now destroying former FG/BG information

What I get from this - things would work here, if the image would not destroy former FG/BG data, but gets attached to the cells above FG. Then a renderer can always redraw things from 1 to 5.

image 2 - schematically:

  #drawing layer        #screen         #classification

1 blue background      0000000000000    fill color
2 octopus                542318         pixel data
3 green 'a's           a   a   a   a    glyph
4 blue BG                     111       fill color
5 frumpy-ass                   f        glyph

  • 1 can be done as ED2 with a BG color set
  • 2 prints on top, taking lessons from the first image, we simply tag tiles to cells not destroying FG/BG content (BG color from 1 stays in place)
  • 3 is tricky, we have no semantics yet for that
  • +(4 + 5) again easy, FG + BG print on top, now destroying former FG/BG information

How to solve 3? How about that:
We add a second image layer as FG-1 (underneath FG), together with the conclusion from image 1 we end up with 2 image layers [FG-1, FG+1]. None of those will ever destroy any FG/BG content, furthermore FG content does not delete image information on FG-1 (maybe only in conjunction with BG transparent).

So we basically would have these layers:

  • FG-1: image data layer below FG
  • FG/BG: normal text layer, BG-transparent will let shine through FG-1, any other BG will erase FG-1 content (yes we need to free resources somehow)
  • FG+1: default image data layer above FG, will cover any FG content respecting transparency, any FG content will destroy data on FG+1
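The eviction rules of these three layers can be sketched per cell like this. Names are hypothetical, tiles are just opaque handles, and `bg=None` stands for BG-transparent:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Cell:
    glyph: str = " "
    bg: Optional[str] = None     # None means BG-transparent
    below: Optional[str] = None  # image tile on FG-1
    above: Optional[str] = None  # image tile on FG+1


def write_text(cell: Cell, glyph: str, bg: Optional[str] = None) -> None:
    cell.glyph, cell.bg = glyph, bg
    cell.above = None      # any FG content destroys data on FG+1
    if bg is not None:
        cell.below = None  # a non-transparent BG erases FG-1 content


def write_image(cell: Cell, tile: str, layer: int = 1) -> None:
    # Image data never destroys FG/BG content, it only gets attached.
    if layer == 1:
        cell.above = tile
    elif layer == -1:
        cell.below = tile
    else:
        raise ValueError("only FG-1 and FG+1 exist in this model")
```

Text with a transparent BG keeps an FG-1 tile alive, while any text write drops the FG+1 tile, which is also what frees image resources over time.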

To 2.
During rendering this would be straightforward, and mostly a simple layering of default/erase background + FG-1 + BG(transparent)/FG + FG+1. Thats also doable for very reduced setups (no GPU accel, kernel console and such). What's a bit annoying is the fact, that FG-1 and FG+1 data dont interfere at all, thus both have to be printed over and over if present (well this could be partially optimized by cutting off 100% coverage regions, but this needs some image analyzing effort during adding).
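As a sketch, the per-cell paint order could be listed like this. The function and the string labels are made up for illustration; a real renderer would issue draw calls instead of returning strings:

```python
from typing import List, Optional


def paint_passes(glyph: str, bg: Optional[str],
                 below_tile: Optional[str], above_tile: Optional[str]) -> List[str]:
    """Bottom-to-top paint steps for one cell; bg=None means BG-transparent."""
    passes = ["default background"]
    if below_tile is not None and bg is None:
        passes.append(f"FG-1 image {below_tile}")  # shines through a transparent BG
    if bg is not None:
        passes.append(f"BG fill {bg}")
    if glyph.strip():
        passes.append(f"glyph {glyph!r}")
    if above_tile is not None:
        passes.append(f"FG+1 image {above_tile}")  # covers FG, respecting transparency
    return passes
```

None of these steps needs more than "paint on top of what is there", which is why it should also work without GPU acceleration.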

And to mark the bonus point here - this is still compatible to HTML engines, where BG/FG box-printing is hard tagged to a char styling (we cannot print between BG and FG there). The stitching through with BG(transparent) works there too. (Maybe this answers your question @christianparpart).

To 3.
Buffer needs are not so nice, in a naive approach we basically doubled the needed space with the second layer (1024 bytes/cell at 16 px font now). I still dont think thats a biggy, as it will only rarely occur, and TEs will have to implement extra image storage trickery anyway (no one wants to go with 512 bytes/cell right from the beginning even for the first image layer).
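One reading that reproduces these figures is an 8x16 pixel cell stored as 4-byte RGBA; the cell width and the pixel format are assumptions here, the post only states the font size and the byte counts.

```python
def tile_bytes(cell_w: int, cell_h: int, bytes_per_pixel: int = 4, layers: int = 1) -> int:
    """Naive per-cell storage when every image layer keeps a full pixel tile."""
    return cell_w * cell_h * bytes_per_pixel * layers


print(tile_bytes(8, 16))             # one image layer
print(tile_bytes(8, 16, layers=2))   # FG-1 plus FG+1
```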
Btw the buffer is the reason why it took me this long to answer. I kinda was stuck in thinking, how to get the two layers merged into just one by clever stitching of the image data, so a "morphing" FG glyph could be painted in between. Well that is possible somewhat with just one additional bit per pixel indicating whether it gets painted on the first pass (FG-1) or the third (FG+1, FG being the 2nd one). But note that this condensed pixel representation would have to be recalculated on every FG change, to refill from FG-1. No clue if thats feasible for non-GPU driven TEs. Well, atm thinking about that is premature optimization.
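A possible shape of that one-bit-per-pixel merge, restricted to fully opaque or fully absent pixels to keep it short. The function names and the `(pixel, third_pass)` pairing are made up for the example.

```python
from typing import List, Optional, Tuple

Pixel = Optional[str]  # color name, or None for "no pixel here"


def merge_layers(below: List[Pixel], above: List[Pixel]) -> List[Tuple[Pixel, bool]]:
    """Condense FG-1 and FG+1 into one buffer of (pixel, paint_in_third_pass)."""
    merged = []
    for b, a in zip(below, above):
        if a is not None:
            merged.append((a, True))    # FG+1 pixel wins, painted after the glyph
        else:
            merged.append((b, False))   # FG-1 pixel, painted before the glyph
    return merged


def render(merged: List[Tuple[Pixel, bool]], glyph: List[Pixel]) -> List[Pixel]:
    out = []
    for (px, third_pass), g in zip(merged, glyph):
        value = None if third_pass else px   # pass 1: FG-1
        if g is not None:                    # pass 2: FG glyph
            value = g
        if third_pass:                       # pass 3: FG+1
            value = px
        out.append(value)
    return out


merged = merge_layers(["red", "red", "red", None], [None, "blue", None, None])
print(render(merged, ["green", "green", None, None]))
```

In this sketch the FG-1 pixel hidden under the blue FG+1 pixel is no longer in the merged buffer, which is why a later FG write (which destroys FG+1) would need the original FG-1 data to refill from.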

@dankamongmen Would something like that work for your use cases? I know that it does not reflect every aspect of your wishlist, but should give capable composition tools at your hand with some multipass writing from appside. Btw I think both image layers should get the blending cap I proposed in #11 (comment).

@dankamongmen

dankamongmen commented Jun 14, 2021

i need to focus on my poor satellites for a bit, and also get some sleep, but i really like where this is going, and i'm inclined to say yes. i'd like to read over it more closely, but your notion of "composable buffers" vs persistence within the emulator seems dead-on. i'm suddenly very excited about this entire effort.

@dankamongmen

@jerch are you the primary xterm.js author? would you mind shooting me a mail at [email protected] so i might ask you a question or two?

@jerch
Author

jerch commented Jun 14, 2021

@dankamongmen You prolly got mail 😸

@christianparpart
Member

I also need new oil for my poor satellites first. But a few things for now:

  • looking at this thread clearly shows that just that one single z-axis topic lets instantly explode the whole GIP discussion. wow. (i'm not against it, better talk than not, I just think maybe... first focus on the other semantics, or do you all agree on those? - semantics, not syntax)
  • My personal goal is to especially also get gnome-terminal/vte (not sure whom to @ here, if at all). I'm not sure they'll like the layered approach, and while I think layering in GL is trivial, it may not be in GTK/software.
  • Alacritty, well, I think they'll never merge anything labeled "feature", so yeah, while performance clearly matters to me, I still would like to make sure to have a solution that leans towards the performance-visioned TE devs without having an exploding complexity just for the z-axis functionality.
  • Don't forget people sometimes having rather huge scrollback buffers. When scrolling back you still want to see your images (if not evicted due to resource constraints), so memory is always of concern. :)

i've got a compiler. it's spitting out diagnostics. i want to wrap those diagnostics in a nice bubble and then put text on them

Compilers I think is the least likely item that will pick it up. But a VIM-LSP plugin in order to improve visual of tiny tooltips would probably be appealing to the users.

@jerch
Author

jerch commented Jun 15, 2021

I also need new oil for my poor satellites first. But a few things for now:

  • looking at this thread clearly shows that just that one single z-axis topic lets instantly explode the whole GIP discussion. wow. (i'm not against it, better talk than not, I just think maybe... first focus on the other semantics, or do you all agree on those? - semantics, not syntax)

What do you suggest to discuss instead first? To me this thread feels pretty much on point, as it may shape fundamental supported/unsupported inner mechanics of the spec. Imho we cannot completely avoid the layering/composition discussion, it needs to be solved even for just one image layer to clarify transparency handling.
About one, two or multiple layers I am not settled yet, while I find the idea above somewhat intriguing, it also still has many rough edges (esp. for the FG-1 mechanics and the buffer needs). But maybe we can work that out, maybe not.

  • My personal goal is to especially also get gnome-terminal/vte (not sure whom to @ here, if at all). I'm not sure they'll like the layered approach, and while I think layering in GL is trivial, it may not be in GTK/software.

If we keep the output composition from the layers easy (no complicated blending), anyone could do that "shader work" on 2d arrays on the CPU with reasonable speed. Basically all TEs already have to do that for FG drawing on top of the BG color. And if the spec is any good, they gonna adopt it.

  • Alacritty, well, I think they'll never merge anything labeled "feature", so yeah, while performance clearly matters to me, I still would like to make sure to have a solution that leans towards the performance-visioned TE devs without having an exploding complexity just for the z-axis functionality.

I totally agree on the perf aspect with you, typically I strive to squeeze things even for the last 5%. But lets not step into the premature optimization trap that early and even stall the discussion about fundamentals. Ofc with 4-pass rendering (FG-1, BG, FG, FG+1) we add runtime (roughly doubling, prolly more, if image has to be fetched from some "colder" storage), but thats only the case if both image layers hold any data. Your TE can still render as fast as before for BG/FG only (beside some additional conditions).

  • Don't forget people sometimes having rather huge scrollback buffers. When scrolling back you still want to see your images (if not evicted due to resource constraints), so memory is always of concern. :)

Eww, thats a strong argument for a single color ASCII only TE. Will have a very small memory footprint, and is fast as hell. Lets do that. 🚀
More seriously - yes that is a problem, and needs to be addressed by some caching/offloading strategies. But look at the numbers - from BG/FG only to 1 image layer memory jumps 22-45 times up, from 1 layer to 2 layers it only doubles. So the "main damage" in terms of space complexity happened by introducing images, not by adding another layer.

i've got a compiler. it's spitting out diagnostics. i want to wrap those diagnostics in a nice bubble and then put text on them

Compilers I think is the least likely item that will pick it up. But a VIM-LSP plugin in order to improve visual of tiny tooltips would probably be appealing to the users.

Yay, lets get Clippy into vim, that would be huge. And we need sound, it should say "HEELLLOO" on startup 😸

@wez

wez commented Jun 15, 2021

FWIW, I don't see that additional layers means massively increasing the storage requirements. The spec talks about uploading an image, and then separately placing it in the model. That placement operation is, in my mind, setting (image_id, texture_coords, layer_number) in the respective cells.

That tuple isn't especially large and doesn't require materializing an additional bitmap per cell, or for each layer of each cell. No bitmaps are needed until those cells are actually rendered, and then the bitmap size need only be as large as the viewport.
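A minimal sketch of what such a placement could look like in a cell grid. The `Placement` class, the dict-based grid, the rectangle form of the texture coordinates and the use of a negative layer number for "below text" are all assumptions for illustration; the spec draft may define these differently.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class Placement:
    image_id: int
    texture_coords: Tuple[int, int, int, int]  # x, y, w, h inside the uploaded image
    layer_number: int


Grid = Dict[Tuple[int, int], Dict[int, Placement]]  # (col, row) -> layer -> placement


def place_image(cells: Grid, image_id: int, img_w: int, img_h: int,
                col: int, row: int, cell_w: int, cell_h: int, layer: int) -> None:
    """Tag every covered cell with a small tuple instead of copying pixels."""
    cols = -(-img_w // cell_w)   # ceiling division
    rows = -(-img_h // cell_h)
    for r in range(rows):
        for c in range(cols):
            x, y = c * cell_w, r * cell_h
            w, h = min(cell_w, img_w - x), min(cell_h, img_h - y)
            cells.setdefault((col + c, row + r), {})[layer] = Placement(
                image_id, (x, y, w, h), layer)


cells: Grid = {}
place_image(cells, image_id=7, img_w=20, img_h=16, col=2, row=0,
            cell_w=8, cell_h=16, layer=-1)
print(sorted(cells))
```

Pixel data lives once per uploaded image; the per-cell cost is the tuple, and a bitmap is only needed for the cells that are actually on screen.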

@christianparpart
Member

christianparpart commented Jun 15, 2021

What do you suggest to discuss instead first?

Nah. That is absolutely fine! I am actually getting happy that we are seriously progressing. I have two more days of stress and then i can work on some spec text to at least care about @wez 's alt-text idea. I will attempt to carefully follow in the meantime :-)

@ghost

ghost commented Jan 15, 2022

This week I implemented translucent windows -- thanks to @dankamongmen 's really big clue :-) -- and got images too. Some notes on that here: https://gitlab.com/klamonte/jexer/-/issues/88

I strongly support the notion of a +/- Z axis in Good Image Protocol, i.e. it should be possible to do the green-text-"a"-over-octopus shot in #11 (comment) . That would enable an application to continue using the TE's font(s) rather than guess and render its own glyph(s) in the stack of image layers (which would also force text into images, ouch on performance!).

How many more layers than one each below/above is less important IMHO. Perhaps GIP could say, "At least one image layer below and one above, but the TE is free to implement more. An application can detect the number of layers supported via {...an escape sequence or procedure...}".

You can see a bit of the weirdness when layering images and text must always destroy the image here:

xterm_layered

The rectangles in the color wheel picture have the background rectangles of the text behind it because I blitted the image cells over the background of the underlying layer. But I cannot truly mix image and text in the same cell, or else I would have dropped the background rectangle but kept the text. A proper +/- Z would let me put the girl image, then text (foreground glyph only), then color wheel, and it would look as people are used to in proper GUI compositing window managers.

@christianparpart
Member

christianparpart commented Jan 15, 2022

Hey @klamonte

Nice work! I'm also almost done with my refactors'n'cleanups so that I can resume work on the more fun topics (GIP) soon.

At least one image layer below and one above, but the TE is free to implement more

That is more or less what I'll also update the spec with. While it is nice for simplicity to have just one image layer that effectively replaces the text cell, we then found out (have proven, whatever), that the DEC VT340 actually paints on top (not replacing) and even handles transparent pixels. So I've adapted my code base accordingly too.
My idea is to support:

  • images above text
  • images below text
  • images that replace the text (in other words, live in the same Z-plane). <-- debatable.

While the last one is debatable, I think there's not really a need for more image layers than that. Because the TE isn't an image editing program, and an app can still do so client-side (notcurses? @dankamongmen? impressions?)

Have a nice weekend,
Christian.

EDIT: I just noticed your screenshot is a little more complicated. I'd still however prefer not to introduce more z-planes than really necessary (would more hinder adoption rate than adding usefulness)

@jerch
Author

jerch commented Jan 15, 2022

On a VT340 the drawing model is basically just a "draw on top"-canvas thingy. Which makes me think, that with a proper buffer abstraction the layering in the TE would not be needed at all, as the application could just do the layering passes itself. For that to work, the TE only would have to preserve merged pixel information from previous content layering.

I have no clue yet, how that can be done efficiently with large scrollback & reflow & font-resizing, as here the pixel information would have to mutate. Maybe this would work in terminal buffer:

  • save cells, that have no image pixels, the traditional FG/BG data way
  • a cell, that overlaps with image pixels, diverges from that model:
    • gets a cell tile canvas
    • pull previous cell data on that canvas (e.g. draw BG/FG)
    • draw image pixel
      • alpha 1 simply overpaints BG/FG pixels
      • alpha <1 composes with previous BG/FG pixels
    • store that tile canvas in terminal buffer

Now the layering by the application:

  • app prints whitespace: stored in terminal buffer with some BG/FG attributes, normal BG/FG render logic applied
  • app prints "A": stored in terminal buffer with some BG/FG attributes, normal BG/FG render logic applied
  • app prints an image to the cell containing "A":
    • TE fetches cells settings from buffer and renders those to a tile canvas
    • TE places image pixels on top, applying composition rules
    • tile canvas gets linked to the cell in terminal buffer (effectively invalidating the FG/BG stuff held there)
  • app prints "C" in that cell:
    • if BG is not set to transparent, it will erase any custom pixel information, e.g. delete tile canvas, and store BG/FG data in terminal buffer normally
    • if BG is set to transparent, pull the tile canvas data, draw new char glyph on top

On a render cycle in the TE, do this:

  • if cell has tile canvas:
    • pull pixels from tile canvas and place into output buffer
  • else:
    • normal render logic (paint BG + FG from terminal buffer FG content and SGR attribs)
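The buffer, write and render steps above can be put together in a toy model. It is a sketch under simplifying assumptions: pixels are either opaque color names or `None` (so the alpha < 1 case is left out), a cell is four pixels wide, and `stub_raster` is a stand-in for a real glyph rasterizer. All names are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

Pixel = Optional[str]                        # color name, or None for transparent
Raster = Callable[[str, str], List[Pixel]]   # (char, fg) -> glyph pixels


@dataclass
class Cell:
    char: str = " "
    fg: str = "white"
    bg: Optional[str] = "black"              # None = transparent BG
    canvas: Optional[List[Pixel]] = None     # set once image pixels touch the cell


def paint(dst: List[Pixel], src: List[Pixel]) -> List[Pixel]:
    """Overpaint dst with the non-transparent pixels of src."""
    return [s if s is not None else d for d, s in zip(dst, src)]


def render(cell: Cell, raster: Raster, default_bg: str = "black") -> List[Pixel]:
    if cell.canvas is not None:              # tile canvas: just pull its pixels
        return list(cell.canvas)
    glyph = raster(cell.char, cell.fg)       # else: normal BG + FG logic
    fill = cell.bg if cell.bg is not None else default_bg
    return paint([fill] * len(glyph), glyph)


def print_image(cell: Cell, tile: List[Pixel], raster: Raster) -> None:
    # Render the current cell content to a canvas, then put image pixels on top.
    cell.canvas = paint(render(cell, raster), tile)


def print_text(cell: Cell, char: str, fg: str, bg: Optional[str], raster: Raster) -> None:
    if bg is None and cell.canvas is not None:
        cell.canvas = paint(cell.canvas, raster(char, fg))   # draw glyph on top
    else:
        cell.canvas = None                                   # erase custom pixels
    cell.char, cell.fg, cell.bg = char, fg, bg


def stub_raster(char: str, fg: str) -> List[Pixel]:
    """Stand-in rasterizer: a non-blank glyph covers only the first pixel."""
    return [fg if char != " " else None, None, None, None]


cell = Cell()
print_text(cell, "A", "white", "black", stub_raster)
print_image(cell, [None, "red", "red", None], stub_raster)
print_text(cell, "C", "green", None, stub_raster)
print(render(cell, stub_raster))
```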

This simplified model does not work yet with font-size changes, if they happen while a cell is in that "tile canvas" mode. For that to work, the TE would have to store all steps of the layering separated, and relayer things with the resized glyph.

@ghost

ghost commented Jan 15, 2022

This simplified model does not work yet with font-size changes, if they happen while a cell is in that "tile canvas" mode. For that to work, the TE would have to store all steps of the layering separated, and relayer things with the resized glyph.

Font size changes IMHO should just cause pixels in cells to rescale (potentially only once, if there is a single user-facing head) to the new aspect ratio, and with anti-aliasing enabled if the TE supports that. The "canvas" stretches or shrinks and the picture stretches/shrinks with it.

On the "images over text": how my cells render, including to multihead. A more narrative flow figuring things out: https://jexer.sourceforge.io/evolution.html#year2021

(Pretty much all of the last two months were inspired by notcurses. It's really really cool and it's been fun working to catch up. :-) Well, trying to catch up: notcurses is so much faster, it feels like trying to follow a Ford GT40 in a Datsun 520.)

@jerch
Author

jerch commented Jan 15, 2022

Font size changes IMHO should just cause pixels in cells to rescale (potentially only once, if there is a single user-facing head) to the new aspect ratio, and with anti-aliasing enabled if the TE supports that. The "canvas" stretches or shrinks and the picture stretches/shrinks with it.

Yes that would work as a workaround. It still might lead to poor output experience (glyph pixels getting stretched, losing their "sharpness"). But that would only happen on the tiled cells, which is maybe not a biggy. Stitching artefacts also might show up, if the canvas placing on the final output buffer is not pixel exact (can be achieved with some floor/ceil logic though).
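One variant of such floor logic, as a sketch: derive both edges of every tile from the same rounded expression, so that neighbouring tiles share their boundary even when the cell size is fractional after a font change. The function name and the choice of `floor` on both edges are assumptions for the example.

```python
import math
from typing import Tuple


def tile_span(index: int, cell_size: float) -> Tuple[int, int]:
    """Pixel span [start, end) of cell `index` along one axis."""
    start = math.floor(index * cell_size)
    end = math.floor((index + 1) * cell_size)
    return start, end


# With a fractional cell width the tiles alternate in size but never
# leave a gap or overlap, because end(i) and start(i + 1) are the same value.
spans = [tile_span(i, 9.6) for i in range(4)]
print(spans)
```

Each rescaled tile canvas would then be stretched to its own `end - start` width rather than to a fixed rounded cell width.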

@dankamongmen

oh yeah, I have definitely rejected as an axiom the idea of drawing my own glyphs since the very beginning. at that point, why the hell are you using a terminal (ok, maybe the networking argument)?

@ghost

ghost commented Jan 17, 2022

@christianparpart

Nice work! I'm also almost done with my refactors'n'cleanups so that I can resume work on the more fun topics (GIP) soon.

Thanks! And awesome, would love to see the fun topics continue. :)

  • images above text
  • images below text
  • images that replace the text (in other words, live in the same Z-plane). <-- debatable.

I like making images-replace-text an option, and would go even further to say it should be the default layer for images, i.e. default Z = 0. This is a departure from sixel, but would make it very explicit how images and text interact. Right now if you put an image up on xterm, and then place the cursor over that image, the text that was underneath will bleed through, showing that images are NOT replacing text. Example:

xterm_text_under_image

It was much harder for me on the multiplexer/application end to make images and text coexist due to this. I resorted to drawing all images first -- being careful to stay within text cell boundaries -- and then text afterwards, because while text can destroy images (in most protocols), images could not destroy text. Worse, if you put text over image on several terminals (including xterm) you might corrupt other cells of that image. Giving the option of making images and text mutually destructive makes things like "thumbnails in ranger that work over ssh" much easier for ranger to do.

History... (History: This is also why I abandoned the Kitty protocol in late 2018 / early 2019 after sixel was working. I would have needed two wildly different rendering models which looked to be a lot more invasive than I wanted -- and all just to support a single terminal, for a use case (multiplexer) that its team was uninterested in supporting. At that time only xterm, RLogin, and yaft could handle my sixel output without crashing, so it was more important to me to see more terminals implement _anything_ without crashing.)

EDIT: I just noticed your screenshot is a little more complicated. I'd still however prefer not to introduce more z-planes than really necessary (would more hinder adoption rate than adding usefulness)

Agreed. A large part of my motivation is demonstrating that we can get the effects we want with much smaller investments from the TE. I side with the worse is better approach: get something that works for the 90% case, and then use that knowledge to refine the solution for the rest.

@dankamongmen

dankamongmen commented Jan 19, 2022

(History: This is also why I abandoned the Kitty protocol in late 2018 / early 2019 after sixel was working. I would have needed two wildly different rendering models which looked to be a lot more invasive than I wanted

unifying kitty and sixel (and linux fbcon) ended up being less complex than i initially thought, or indeed implemented. there are several sections which need distinct implementations, but you can take those chunks and pretty entirely call them from the same code path. linux framebuffer is IMHO more difficult to unite with sixel+kitty than either is with one another.

the chunks i needed split up were (this can be taken pretty entirely from setup_kitty_bitmaps() vs setup_sixel_bitmaps()):

  • scrubbing, aka the removal of a cell's worth of content from an encoded image. you could get away with keeping this the same if you scrubbed the original RGBA, but then you'd have to reencode. that's slow, and could potentially change quantization, so i went the harder road. scrubbing writes to per-pixel "auxiliary vectors" having implementation-dependent size and content (implementation == kitty v sixel); management of the auxvecs is backend-independent.
  • refreshing, the inverse of scrubbing. same deal. reads and destroys the appropriate auxiliary vectors.
  • initial encoding, obviously
  • moves. sixel has no concept of moves, because it is a garbage protocol intended for printers
  • some absolutely dreadful geometry hooks deriving from sixel's mandatory 6-pixel alignment, which a year later i'm yet to feel certain i've correctly implemented, oh how i hate sixel
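To illustrate just the scrub/refresh pairing, here is a conceptual sketch on raw pixels. Note that this is the simpler variant the comment says was not chosen (it operates on unencoded pixels, not on kitty or sixel payloads), and it is not notcurses code; the function names and the flat list used as the auxiliary vector are assumptions.

```python
from typing import List, Optional

Pixel = Optional[str]  # color name, or None for transparent


def scrub(image: List[List[Pixel]], x0: int, y0: int, w: int, h: int) -> List[Pixel]:
    """Blank one cell's worth of pixels and return them as an auxiliary vector."""
    aux = []
    for y in range(y0, y0 + h):
        for x in range(x0, x0 + w):
            aux.append(image[y][x])
            image[y][x] = None
    return aux


def refresh(image: List[List[Pixel]], x0: int, y0: int, w: int, h: int,
            aux: List[Pixel]) -> None:
    """Inverse of scrub: write the saved pixels back; the aux vector is consumed."""
    saved = iter(aux)
    for y in range(y0, y0 + h):
        for x in range(x0, x0 + w):
            image[y][x] = next(saved)
    aux.clear()


image = [["r", "g", "b", "w"], ["w", "b", "g", "r"]]
aux = scrub(image, 1, 0, 2, 2)     # a glyph now occupies this cell
print(image)
refresh(image, 1, 0, 2, 2, aux)    # the glyph went away again
print(image)
```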

with those chunks, i have a very simple (5 state) machine and two common functions (clean_sprixels() and rasterize_sprixels(), both of which are very misleading names), called in phase 1 and 3 (respectively) of my 4-phase rasterizer. it ain't so bad.

@ghost

ghost commented Jan 19, 2022

Indeed, it would probably be a lot easier now. At that time I did not have my sixel encoder factored out for independent testing, nor "encoders" (base64 wrappers) for iTerm2/Jexer, so it was internally a lot more spaghetti with wire encoding and cells-to-images all mushed together. (But now that I have invested more time in sixel, I find it going much farther than I thought it could, which reduces the desire for more complex protocols.)

If I were going to do a protocol that treated text and images as wholly separate, I would probably redo my composition model along the way too, internally look more like a GPU-based TE (I'm learning some cool things from wezterm. ;-) ), and think more like a proper game engine.

some absolutely dreadful geometry hooks deriving from sixel's mandatory 6-pixel alignment, which a year later i'm yet to feel certain i've correctly implemented, oh how i hate sixel

lol. You hate it, yet you've also got the best encoder out here!
