<rwmjones>
this is qemu 8.1.1, qemu-system-riscv64, running a 16 vCPU guest, where the guest is doing a big parallel compile
<rwmjones>
if you click in where it says CPU/0 TCG (bottom left corner)
<rwmjones>
then you can see what one vCPU is doing in the qemu TCG emulation code
<rwmjones>
(all the vCPUs are basically the same)
<rwmjones>
so two things stand out ...
<rwmjones>
do_ld8_mmu does the page table walk (guts of it in riscv_cpu_tlb_fill) which is fairly slow, but mainly seems like the cost of reading guest RAM
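To make the "cost of reading guest RAM" point concrete, here is a minimal sketch of a three-level Sv39 page table walk as laid out in the RISC-V privileged spec. It is not QEMU's code: `ToyGuest`, `toy_read_pte` and `toy_sv39_walk` are hypothetical names, Sv39 is an assumption about the guest's paging mode, and permission, A/D-bit and PMP checks are left out.

```c
#include <stdint.h>

#define PTE_V      0x1u   /* valid */
#define PTE_R      0x2u   /* readable */
#define PTE_X      0x8u   /* executable */
#define PAGE_SHIFT 12
#define SV39_LEVELS 3

/* Hypothetical flat guest RAM starting at guest physical address 0. */
typedef struct {
    uint64_t *ram;    /* guest RAM viewed as 64-bit words */
    uint64_t  size;   /* size in bytes */
    unsigned  reads;  /* how many PTE reads the walks have issued */
} ToyGuest;

/* Read one 8-byte PTE from guest RAM; returns -1 if out of range. */
static int toy_read_pte(ToyGuest *g, uint64_t pa, uint64_t *pte)
{
    if (pa + 8 > g->size)
        return -1;
    g->reads++;
    *pte = g->ram[pa / 8];
    return 0;
}

/* Translate va to *pa. Returns 0 on success, -1 on a (toy) page fault. */
static int toy_sv39_walk(ToyGuest *g, uint64_t root_pa, uint64_t va,
                         uint64_t *pa)
{
    uint64_t table = root_pa;

    for (int level = SV39_LEVELS - 1; level >= 0; level--) {
        /* Each level consumes 9 bits of the virtual page number. */
        uint64_t vpn = (va >> (PAGE_SHIFT + 9 * level)) & 0x1ff;
        uint64_t pte;

        if (toy_read_pte(g, table + vpn * 8, &pte) < 0 || !(pte & PTE_V))
            return -1;

        uint64_t ppn = (pte >> 10) & ((1ull << 44) - 1);

        if (pte & (PTE_R | PTE_X)) {
            /* Leaf PTE; at level > 0 this is a superpage mapping. */
            uint64_t off_mask = (1ull << (PAGE_SHIFT + 9 * level)) - 1;
            if ((ppn << PAGE_SHIFT) & off_mask)
                return -1;              /* misaligned superpage */
            *pa = (ppn << PAGE_SHIFT) | (va & off_mask);
            return 0;
        }
        table = ppn << PAGE_SHIFT;      /* pointer to next-level table */
    }
    return -1;                          /* no leaf found */
}
```

For a 4 KiB mapping the loop issues three dependent reads of guest RAM, one per level, before the actual load can even start.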
<rwmjones>
but the big one is helper_lookup_tb_ptr
<rwmjones>
this is called when TCG code needs to jump to another translation block
<rwmjones>
the translation block (TB) pointers are held in a big hash table
<rwmjones>
hash of (current CPU state, physical address) -> TB
<rwmjones>
and both calculating the current CPU state & doing the lookup are really slow
<rwmjones>
(IIRC there is a one entry cache before we hit the hash table)
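The lookup described above can be sketched roughly as follows: a one-entry cache per vCPU is checked first, then a shared hash table keyed on (CPU state, physical address). This is a toy model, not QEMU's implementation. All `Toy*` and `toy_*` names, the table size and the hash function are made up for illustration, and the CPU state is collapsed into a single `flags` word.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_TABLE_SIZE 1024u            /* must be a power of two */
#define TOY_TABLE_MASK (TOY_TABLE_SIZE - 1)

/* Hypothetical stand-in for a translation block. */
typedef struct {
    uint64_t phys_pc;   /* guest physical address the block starts at */
    uint32_t flags;     /* CPU state the block was translated for */
    void    *host_code; /* where the generated host code would live */
} ToyTB;

/* Shared table: (flags, phys_pc) -> TB, open addressing. */
typedef struct {
    ToyTB *slot[TOY_TABLE_SIZE];
} ToyTBTable;

/* Per-vCPU state: the one-entry cache plus hit counters. */
typedef struct {
    ToyTB        *last;
    unsigned long fast_hits;
    unsigned long slow_lookups;
} ToyVCPU;

static uint32_t toy_hash(uint64_t phys_pc, uint32_t flags)
{
    uint64_t h = phys_pc * 0x9E3779B97F4A7C15ull;
    h ^= (uint64_t)flags * 0xC2B2AE3D27D4EB4Full;
    return (uint32_t)(h >> 32) & TOY_TABLE_MASK;
}

static void toy_tb_insert(ToyTBTable *t, ToyTB *tb)
{
    uint32_t i = toy_hash(tb->phys_pc, tb->flags);
    for (size_t n = 0; n < TOY_TABLE_SIZE; n++, i = (i + 1) & TOY_TABLE_MASK) {
        if (t->slot[i] == NULL) {
            t->slot[i] = tb;
            return;
        }
    }
    assert(0 && "toy table full");
}

/* Returns the matching TB, or NULL if one would have to be translated. */
static ToyTB *toy_tb_lookup(ToyVCPU *cpu, ToyTBTable *t,
                            uint64_t phys_pc, uint32_t flags)
{
    ToyTB *tb = cpu->last;

    /* Fast path: same block as the previous lookup on this vCPU. */
    if (tb && tb->phys_pc == phys_pc && tb->flags == flags) {
        cpu->fast_hits++;
        return tb;
    }

    /* Slow path: hash, then linear probing in the shared table. */
    cpu->slow_lookups++;
    uint32_t i = toy_hash(phys_pc, flags);
    for (size_t n = 0; n < TOY_TABLE_SIZE; n++, i = (i + 1) & TOY_TABLE_MASK) {
        tb = t->slot[i];
        if (tb == NULL)
            return NULL;
        if (tb->phys_pc == phys_pc && tb->flags == flags) {
            cpu->last = tb;             /* refill the one-entry cache */
            return tb;
        }
    }
    return NULL;
}
```

Note that the same physical address with different `flags` maps to a different block, which is why the CPU state has to be computed before every lookup.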
<rwmjones>
if we could eliminate the overhead of helper_lookup_tb_ptr by magic somehow we'd double emulation speed
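One way to read that estimate is through Amdahl's law. Suppose helper_lookup_tb_ptr (including everything it calls) accounts for a fraction f of total emulation time and the rest is unchanged. Removing it entirely would give a speedup S of:

```latex
S = \frac{1}{1 - f}, \qquad S = 2 \iff f = \tfrac{1}{2}
```

So "double emulation speed" corresponds to the helper taking roughly half of the run time. The value of f is inferred from that claim here; it is not a figure quoted in the chat.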
<rwmjones>
get_physical_address = 12.8% total time