-
Notifications
You must be signed in to change notification settings - Fork 35
16-bit instruction to shift by a register value #165
Comments
Hi, thanks @jalobaba, I appreciate your input. To encode 16-bit shift by register value, I think it would need to go into the following reserved spaces: 101 010 xxxx 10 both of which overlap with C.FSDSP The simple approach is to have encodings for CM.SLL, CM.SRL, C.SRA, which would take up both of those spaces, and each require two register operands. For example: CM.SLL rsd', rsd', rs2' rsd' = rsd' << rs2 out of range shifts return zero. is this what you are thinking? @abukharmeh how hard is this to add into the analysis script? If it's easy then I'm wondering if we should try and get them in now - given that it's much more efficient to add all the (non-EABI) 16-bit encodings at once. I think it's worth assessing them anyway to see if they help the benchmark suite. It would be more efficient if you ran it, but it would certainly help if we could get more people involved to assess new instructions. Tariq |
It should be fairly easy if we know what patterns we are trying to optimise, is this trying to achieve the same thing as #141 ? |
It's different, that one has an immediate shift distance and I think different source and destination registers. It's worth nothing that these wouldn't expand into an existing 32 bit encoding which might be a problem, bits it's certainly worth knowing how valuable they are |
Tariq, you asked if these were the instructions I was proposing: Yes, exactly this with 2 registers and destructive shift. As for out of range shifts (i.e. rs2' > 31), I'd match the behavior of the 32-bit RISC-V shift instructions. I just checked the spec and the 32-bit shift instructions just use the bottom 5-bit of rs2 so I would match this for the 16-bit versions and then there is no special behavior for out of range. |
yes - you're right - CM.SLL would be the 16-bit encoding of SLL, and similarly for SRL/CM.SRL, SRA/CM.SRA give that there doesn't seem to be much value in these in our existing benchmark suit @jalobaba do you have an open source application which sees a benefit from them? |
No, I don't have an open-source application that benefits from 16-bit shift by register value. We have several groups at Qualcomm with RISC-V code for their microcontroller subsystems. I'm starting to reach out to them and get an effort started to evaluate Zc on those code bases. Some use GCC and some use LLVM so we can use your preliminary versions of those to evaluate Zc. We can also try Ibrahim's script. Is that useful to you to help determine the benefits of 16-bit shift by register? What is the best way to analyze our internal code to determine the benefit of 16-bit shift by register? I can always grep through the disassembly but is there a better way? How did you determine that the benchmarks you are using to evaluate Zc don't have much benefit for 16-bit shift by register? I'm hoping I could use the same approach. |
The script doesn't support that currently, but simplistically you can grep
for the 32-bit destructive forms and compare the saving from shrinking
those by 2 bytes with the size of the entire text section. That should be
simple enough.
…On Wed, 15 Jun 2022, 17:31 jalobaba, ***@***.***> wrote:
No, I don't have an open-source application that benefits from 16-bit
shift by register value.
We have several groups at Qualcomm with RISC-V code for their
microcontroller subsystems. I'm starting to reach out to them and get an
effort started to evaluate Zc on those code bases. Some use GCC and some
use LLVM so we can use your preliminary versions of those to evaluate Zc.
We can also try Ibrahim's script. Is that useful to you to help determine
the benefits of 16-bit shift by register?
What is the best way to analyze our internal code to determine the benefit
of 16-bit shift by register? I can always grep through the disassembly but
is there a better way? How did you determine that the benchmarks you are
using to evaluate Zc don't have much benefit for 16-bit shift by register?
I'm hoping I could use the same approach.
—
Reply to this email directly, view it on GitHub
<#165 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AOCTJAEYZD5I235CEHCPBI3VPIAPBANCNFSM5YLXEXPQ>
.
You are receiving this because you commented.Message ID:
***@***.***>
|
I think it would be easier to add that to the script, you just need to add two lines similar to the following snippet, and then you can get the exact savings |
Thanks Ibrahim. I'll be working on this next week. |
I would be happy if you submit a PR for it :) Even if it didn't up in the spec, it would help people try it on their applications etc. So what you need to do is
Then just use the script
|
Hi, as promised earlier, here are the complete results from our benchmark suite I doubt we can justify the encoding space for these instructions with these results ! Are we happy to close this now @jalobaba @tariqkurd-repo |
0.09% and 0.05% for |
Yes, you are right, they would be 2^(3+3)=64 points instructions as C.MUL which has similar savings ! |
Hi @abukharmeh are we also considering CM.SRA? Are there any results for that one? I'm interested to see CM.SRA if possible. At the moment I feel that we need to put more thought into the very small amount of remaining encoding space - we can't allocate all if it to Zc, and so that we should proceed without these instructions. that doesn't mean that we can't add them in the next revision of Zc if they can be justified against other options at that point. ...otherwise we also risk failing to ratify anything this year... and that would certainly be worse for RISC-V in general than proceeding without them. what do people think? |
|
c.mul peaks in pure dsp code, many of instances being multiply accumulate that can be handled by P ext as a single 32bit instruction. Otherwise it's average 0.2x% or lower bottomed down by debian packages. In terms of code point efficiency |
I see one significant issue with "adding zce insrcutions later"; fragmentation. And those are likely to end up not used by A profile packages at all eg. Zcev1 goes into profile with mandatory vector, such profile is very likely to become defacto lowest common denominator for targeting software (and optionally RV64GC as "legacy" one). If Zcev2 brings extra 0.5% code size reduction - nobody would like to break with LCD as well as provide yet another binary for just that 0.5%. zcm* situation is a bit better, but not entirely. |
In our opinion it is definitely not worth putting ratification this year at risk by adding new instructions. Given that Zc* already consists of 7 groups and that likely/hopefully other ideas we will pop up, these instructions (and others) can be added later (if needed in a new sub-extension). |
yes exactly - and delaying Zc for several months for minimal extra code-size improvement will only make the fragmentation worse. |
agreed - so no change now - and we can consider these in the future. |
In current zce spec, i see only 4 (+1 for RVE eabi) "fragmentation" groups, 3 (+1) of them being embedded only.
|
there will only be a v0.80 if the architecture review committee request it |
@tariqkurd-repo Can we keep the issue open as its future idea ? |
I recently was reviewing the Zc spec and noticed it was missing 16-bit shift instructions by a register value. Zc only currently has 16-bit shift instructions that shift by an immediate value.
@tariqkurd told me these instructions haven't been thought of yet. Can we explore the code size benefit they provide and consider them for Zc?
Note that Arm Thumb-2 includes 16-bit shift by register instructions as do some other ISAs.
BTW, I'm new to this group. I've been watching for a while but plan to get more involved. I work for Qualcomm in San Diego and am a processor architect & designer (mostly embedded processors).
The text was updated successfully, but these errors were encountered: