Make requested revisions to Paolo's upstream patches.
- IN PROGRESS.
- This work is now being handled by Craig Blackmore.
- We have a reworked version of the first patch, which addresses the issue of whether an exception during the load/store could lead to an invalid value in the
vstart
CSR (see this mailing list post). Benchmarking shows an improvement in performance for small vectors. The patch has been submitted (see this mailing list post).- Max has replied suggesting this still does not resolve the
vstart
issue (see this mailing list post). We are investigating this. We are calling the correct atomic functions, but it seems we are not setting the flag to specify the atomicity.
- Max has replied suggesting this still does not resolve the
- For the second patch, we are exploring a possible solution to ensure that interrupts are not taken in the middle of an element during
memcpy
(see this mailing list post). We have identified the functions we will be using to resolve this.
SiFive benchmarks.
- COMPLETE.
Generate TCG Ops for vector load/store.
- IN PROGRESS.
- We have a working implementation of the TCG operations generation for the whole register loads/stores. This still needs to be made safe for atomicity and needs further testing to ensure its full correctness and maximum efficiency with different element size and
VLEN
. The current implementation is only for whole vector loads and stores (which do not form part of the SiFive benchmark suite). The implementation uses the generic TCG vector op generation. We are now working on generating x86 vector specific operations to allow us to handle up to 256 bits at a time and handling generic vector loads and stores.
Improve first-fault handling for vector load/store helper functions.
- IN PROGRESS.
- No new work to report this week.
Improve strided load/store helper functions.
- IN PROGRESS.
- No new work to report this week.
RISC-V code (hand-modified vmemcpy):
IN: vmemcpy
0x00010a6e: 02858007 vl1re8.v v0,(a1)
0x00010a72: 9596 add a1,a1,t0
0x00010a74: 40560633 sub a2,a2,t0
0x00010a78: 02868027 vs1r.v v0,(a3)
0x00010a7c: 9696 add a3,a3,t0
0x00010a7e: f675 bnez a2,-20 # 0x10a6a
TCG-Ops for vl1re8.v
mov_i64 loc3,x11/a1
mov_i64 loc6,x11/a1
qemu_ld_a64_i128 loc4,loc5,loc6,noat+un+leo,0
st_i64 loc4,env,$0x200
st_i64 loc5,env,$0x208
x86 code for vl1re8.v
780a20c0: 8b 5d f8 mov -0x8(%ebp),%ebx
780a20c3: 85 db test %ebx,%ebx
780a20c5: 0f 8c 98 00 00 00 jl 0x780a2163
780a20cb: c6 45 fc 00 movb $0x0,-0x4(%ebp)
780a20cf: 48 dec %eax
780a20d0: 8b 5d 58 mov 0x58(%ebp),%ebx
780a20d3: 4c dec %esp
780a20d4: 8b 23 mov (%ebx),%esp
780a20d6: 4c dec %esp
780a20d7: 8b 6b 08 mov 0x8(%ebx),%ebp
780a20da: 4c dec %esp
780a20db: 89 a5 00 02 00 00 mov %esp,0x200(%ebp)
780a20e1: 4c dec %esp
780a20e2: 89 ad 08 02 00 00 mov %ebp,0x208(%ebp)
- Baseline commit: 248f9209ed
- Patch commit: c55fb34cbf
- See report-2024-12-03-21-47-51.pdf.
The benefit cannot be seen on the graphs, because it is all in the small sizes, but is clear in the detailed results. For memcpy
with VLEN=128
, we see more than 20% improvement on small data and for VLEN=1024
, more than 25% improvement.
We note that the speedup is very size dependent. Why is there a substantial speedup with memcpy
for all data sizes 1-8 bytes except 7 bytes. Consistently across runs. We'll do some single instruction benchmarking to try to characterize this.
2024-11-13:
- Craig Blackmore (transferred from Paolo Savini) to investigate whether patch #1 may cause exceptions to be taken at the wrong byte address, thereby causing a failure on resumption.
- ONGOING.
- Updated patch benchmarked and submitted. Revision to specify the atomicity for the atomic functions under development.
- Craig Blackmore (transferred from Paolo Savini) to investigate whether patch #2 can use larger
memcpy
than Max and therefore may offer further improvements.- ONGOING.
- Functions to use identified.
Next meeting 11 December 2024.