RISE RP005 QEMU weekly report 2024-12-04

Milestone 2

Make requested revisions to Paolo's upstream patches.

IN PROGRESS.
This work is now being handled by Craig Blackmore.
We have a reworked version of the first patch, which addresses the issue of whether an exception during the load/store could lead to an invalid value in the vstart CSR (see this mailing list post). Benchmarking shows an improvement in performance for small vectors. The patch has been submitted (see this mailing list post).
- Max has replied suggesting this still does not resolve the vstart issue (see this mailing list post). We are investigating this. We are calling the correct atomic functions, but it seems we are not setting the flag to specify the atomicity.
For the second patch, we are exploring a possible solution to ensure that interrupts are not taken in the middle of an element during memcpy (see this mailing list post). We have identified the functions we will be using to resolve this.

SiFive benchmarks.

COMPLETE.

Milestone 3

Generate TCG Ops for vector load/store.

IN PROGRESS.
We have a working implementation of the TCG operations generation for the whole register loads/stores. This still needs to be made safe for atomicity and needs further testing to ensure its full correctness and maximum efficiency with different element size and VLEN. The current implementation is only for whole vector loads and stores (which do not form part of the SiFive benchmark suite). The implementation uses the generic TCG vector op generation. We are now working on generating x86 vector specific operations to allow us to handle up to 256 bits at a time and handling generic vector loads and stores.

Improve first-fault handling for vector load/store helper functions.

IN PROGRESS.
No new work to report this week.

Improve strided load/store helper functions.

IN PROGRESS.
No new work to report this week.

TCGOp exxample expansion from `memcpy`

RISC-V code (hand-modified vmemcpy):

IN: vmemcpy
0x00010a6e:  02858007          vl1re8.v                v0,(a1)
0x00010a72:  9596              add                     a1,a1,t0
0x00010a74:  40560633          sub                     a2,a2,t0
0x00010a78:  02868027          vs1r.v                  v0,(a3)
0x00010a7c:  9696              add                     a3,a3,t0
0x00010a7e:  f675              bnez                    a2,-20                  # 0x10a6a

TCG-Ops for vl1re8.v

 mov_i64 loc3,x11/a1
 mov_i64 loc6,x11/a1
 qemu_ld_a64_i128 loc4,loc5,loc6,noat+un+leo,0
 st_i64 loc4,env,$0x200
 st_i64 loc5,env,$0x208

x86 code for vl1re8.v

780a20c0:       8b 5d f8                mov    -0x8(%ebp),%ebx
780a20c3:       85 db                   test   %ebx,%ebx
780a20c5:       0f 8c 98 00 00 00       jl     0x780a2163
780a20cb:       c6 45 fc 00             movb   $0x0,-0x4(%ebp)
780a20cf:       48                      dec    %eax
780a20d0:       8b 5d 58                mov    0x58(%ebp),%ebx
780a20d3:       4c                      dec    %esp
780a20d4:       8b 23                   mov    (%ebx),%esp
780a20d6:       4c                      dec    %esp
780a20d7:       8b 6b 08                mov    0x8(%ebx),%ebp
780a20da:       4c                      dec    %esp
780a20db:       89 a5 00 02 00 00       mov    %esp,0x200(%ebp)
780a20e1:       4c                      dec    %esp
780a20e2:       89 ad 08 02 00 00       mov    %ebp,0x208(%ebp)

Statistics

SiFive benchmarks: Patch 1 comparison

Baseline commit: 248f9209ed
Patch commit: c55fb34cbf
See report-2024-12-03-21-47-51.pdf.

The benefit cannot be seen on the graphs, because it is all in the small sizes, but is clear in the detailed results. For memcpy with VLEN=128, we see more than 20% improvement on small data and for VLEN=1024, more than 25% improvement.

We note that the speedup is very size dependent. Why is there a substantial speedup with memcpy for all data sizes 1-8 bytes except 7 bytes. Consistently across runs. We'll do some single instruction benchmarking to try to characterize this.

Actions

2024-11-13:

Craig Blackmore (transferred from Paolo Savini) to investigate whether patch #1 may cause exceptions to be taken at the wrong byte address, thereby causing a failure on resumption.
- ONGOING.
- Updated patch benchmarked and submitted. Revision to specify the atomicity for the atomic functions under development.
Craig Blackmore (transferred from Paolo Savini) to investigate whether patch #2 can use larger memcpy than Max and therefore may offer further improvements.
- ONGOING.
- Functions to use identified.

Other

Next meeting 11 December 2024.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

20241204.md

20241204.md

RISE RP005 QEMU weekly report 2024-12-04

Milestone 2

Milestone 3

TCGOp exxample expansion from `memcpy`

Statistics

SiFive benchmarks: Patch 1 comparison

Actions

Other

Files

20241204.md

Latest commit

History

20241204.md

File metadata and controls

RISE RP005 QEMU weekly report 2024-12-04

Milestone 2

Milestone 3

TCGOp exxample expansion from memcpy

Statistics

SiFive benchmarks: Patch 1 comparison

Actions

Other

TCGOp exxample expansion from `memcpy`