khem has quit [Quit: Bridge terminating on SIGTERM]
charlesap[m] has quit [Quit: Bridge terminating on SIGTERM]
psydroid has quit [Quit: Bridge terminating on SIGTERM]
kaji has quit [Quit: Bridge terminating on SIGTERM]
CarlosEDP has quit [Quit: Bridge terminating on SIGTERM]
EmanuelLoos[m] has quit [Quit: Bridge terminating on SIGTERM]
pierce has quit [Quit: Bridge terminating on SIGTERM]
zjason has joined #riscv
kaji has joined #riscv
pierce has joined #riscv
CarlosEDP has joined #riscv
EmanuelLoos[m] has joined #riscv
charlesap[m] has joined #riscv
khem has joined #riscv
davidlt has joined #riscv
iorem has joined #riscv
jwillikers has joined #riscv
BOKALDO has quit [Quit: Leaving]
cousteau has joined #riscv
<cousteau>
Hi
psydroid has joined #riscv
<cousteau>
Why are compressed instructions so important? I see many simple implementations including the C extension, but I don't really see the point of it
<cousteau>
Like, won't adding a block to decode compressed instructions add more latency and area to the design? Or is it minimal compared to a benefit I still haven't quite understood?
<cousteau>
Oh, cache
<rjek>
If it's like ARM Thumb, the "decompression" is essentially free as it's just a bit-mapping process
<rjek>
ie, there is a single, obvious, simple mapping of each bit in a compressed instruction to each bit in the full-width instruction
<cousteau>
I was thinking just memory usage, which didn't sound as important (who cares if the .text section is 4 or 5 KB?), but I see how caches would play an important role
<cousteau>
rjek: but there really isn't; it may be quite straightforward but some decoding is still needed
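(A rough sketch of the kind of decoding involved — illustrative only, not any particular core's decoder: expanding the RVC C.ADDI parcel into its 32-bit ADDI equivalent. Fields are only shuffled and sign-extended, so it's simple, but not a pure 1:1 bit copy.)

    #include <stdint.h>

    /* Sketch only, assuming the standard RVC CI format for C.ADDI:
     * [15:13]=000, [12]=imm[5], [11:7]=rd, [6:2]=imm[4:0], [1:0]=01.
     * Expands to ADDI rd, rd, imm in the 32-bit I format.
     * (rd=0 / imm=0 encode C.NOP and hints; the sketch ignores that.) */
    static uint32_t expand_c_addi(uint16_t c)
    {
        uint32_t rd  = (c >> 7) & 0x1f;
        int32_t  imm = ((c >> 2) & 0x1f) | (((c >> 12) & 1) << 5);
        if (imm & 0x20)
            imm -= 64;                              /* sign-extend imm[5:0]  */
        return ((uint32_t)(imm & 0xfff) << 20)      /* imm[11:0]             */
             | (rd << 15)                           /* rs1 = rd              */
             | (0u << 12)                           /* funct3 = 000 (ADDI)   */
             | (rd << 7)                            /* rd                    */
             | 0x13u;                               /* OP-IMM opcode         */
    }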
<rjek>
Well, I don't know how they've implemented it: but it could just be a *different* instruction decoder being used to configure the ALU etc
<rjek>
Rather than a discrete step
<cousteau>
Plus many C instructions are like "do this if rd=0, do that if rd≠0"
<cousteau>
Well yeah I guess that would avoid the timing issues I mentioned
<cousteau>
But no longer "free"
<rjek>
Early implementations of ARM Thumb had terrible performance (they would do a 16-bit load for each individual instruction, decode it into ARM, then decode the ARM), but on modern parts there's no measurable performance penalty IIRC
<cousteau>
I mean, I don't know how it's internally implemented either but it is possible to see how the instructions are encoded
<cousteau>
(I suspect nobody does; they just type the behavioral code and let the synthesis tool figure out how to map that to gates and flip flops)
<cousteau>
rjek: hmm... The spec says that RVC "fetches 25-30% fewer instruction bits" (OK, seems reasonable), "which reduces I$ misses by 20-25%" (I don't quite get why not 25-30% but I guess I'd need to do the math), "or roughly the same performance impact as DOUBLING the I$ size" (...wait, what??)
<cousteau>
I don't get how being able to store 33-43% more instructions (due to the reduction of 25-30%) can be equivalent to increasing the cache size by 100%
<rjek>
Diminishing returns from adding more cache. Doubling it doesn't double performance
hendursa1 has quit [Quit: hendursa1]
hendursaga has joined #riscv
<cousteau>
I guess that explains how it goes from 25-30% fewer bytes to 20-25% fewer misses, but not the equivalence to 2x cache
BOKALDO has joined #riscv
<cousteau>
If I understand correctly, if you somehow managed to make all instructions half as large, you'd be able to fit twice as many in the I$, so it would be equivalent to having 2x as much I$
<cousteau>
So if it were a 50% reduction, I'd understand it being equivalent to a 2x increase
<rjek>
If you could halve the size of every instruction, you wouldn't need the wider representation :)
<cousteau>
I mean, halving every instruction would be equivalent to not halving any instruction and instead doubling the cache, unless I'm missing your point
<cousteau>
Lol, I'm reading the reference where they apparently state this 2x increase equivalence, and it's an MSc thesis from 2011. The instruction structure is so "wrong"! (just outdated, I guess)
<cousteau>
It has the rd on the MSbits of the instruction, followed by rs1 and rs2 to its right. It's funny to see how much the instruction set has evolved since its inception.
<cousteau>
It already includes the R4 instruction type though (btw, I'd have called it F rather than R4; I don't like there being a 2-char type where every other type uses only 1)
<cousteau>
You know, F as in Four registers, Fused multiply-add, or Floating point registers. Dunno, I think it's a cool mnemonic.
compscipunk has joined #riscv
<jrtc27>
rule of thumb is that cache miss rate scales with the square root of the cache size
<jrtc27>
well, inverse square root
<jrtc27>
1/sqrt(2) is 0.7, ie about a 30% decrease in miss rate
<jrtc27>
which matches your quote
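(For a concrete feel of that rule of thumb, a throwaway calculation — the scaling factors are just illustrative, taking miss rate as proportional to 1/sqrt(cache size).)

    #include <math.h>
    #include <stdio.h>

    /* Illustrative numbers only: apply miss_rate ~ 1/sqrt(cache_size)
     * to a few effective I$ scaling factors. */
    int main(void)
    {
        double scale[] = { 1.33, 1.43, 2.0 };   /* e.g. 25%/30% code shrink, or a 2x I$ */
        for (int i = 0; i < 3; i++) {
            double miss = 1.0 / sqrt(scale[i]);
            printf("I$ x%.2f -> relative miss rate %.2f (about -%.0f%%)\n",
                   scale[i], miss, (1.0 - miss) * 100.0);
        }
        return 0;
    }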
<jrtc27>
as for decompression adding latency, not really, it's just a bit of combinatorial logic you shove very early in decode
<jrtc27>
it's not normally on a critical path for riscv
cousteau has quit [Ping timeout: 252 seconds]
cousteau has joined #riscv
cousteau has quit [Client Quit]
cousteau has joined #riscv
<xentrac>
thanks, I'd been wondering about that a lot, jrtc27
<cousteau>
Did anyone say anything after my last message? Did anyone *see* my last message? ("OK about the latency...")
<xentrac>
yes
<xentrac>
well no
<xentrac>
I mean your last message was "...cool mnemonic."
<cousteau>
OK about the latency. If it's not on a critical path then I guess there's nothing to fear, and if the area overhead is negligible, as seems to be the case, then there's nothing to lose with a C extension, I guess
<xentrac>
and then jrtc27 explained why doubling the effective instruction cache size would drop the cache miss rate by 30%
<cousteau>
Anyway. From jrtc27's formula, I get a 1.56-1.78x "bigger" cache, which is close to 2x
<jrtc27>
the bigger issue with C is having to make sure your uarch can handle unaligned ifetch for 32-bit instructions
<jrtc27>
it's not much logic but it can be a bit fiddly to get the edge cases right
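(A toy software model of that edge case — purely hypothetical, not any real fetch unit: with C, a 32-bit instruction can start on the last halfword of a fetch bundle, so fetch has to pull in the next parcel. The RISC-V length rule is that low bits 0b11 mean a 32-bit instruction.)

    #include <stddef.h>
    #include <stdint.h>

    /* Toy model: instruction memory viewed as 16-bit parcels.  A parcel
     * whose low two bits are not 0b11 is a complete 16-bit instruction;
     * otherwise the next parcel is also needed, possibly from the next
     * 32-bit fetch word or even the next cache line. */
    static uint32_t fetch_next(const uint16_t *parcels, size_t *pc_hw)
    {
        uint16_t lo = parcels[*pc_hw];
        if ((lo & 0x3) != 0x3) {             /* compressed instruction  */
            *pc_hw += 1;
            return lo;
        }
        uint16_t hi = parcels[*pc_hw + 1];   /* second half, maybe from */
        *pc_hw += 2;                         /* the next fetch bundle   */
        return ((uint32_t)hi << 16) | lo;
    }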
<cousteau>
But then I don't get how reducing the instruction size by 25-30% reduces the cache miss rate by that much, and not 13-16% or so
<jrtc27>
why 13-16?
<cousteau>
sqrt(1-0.25~0.30) ≈ 1-0.13~0.16
<cousteau>
Because I still think that making instructions 2x smaller is equivalent to making the cache 2x bigger
<jrtc27>
it's equivalent to making the cache *lines* 2x bigger
<jrtc27>
which isn't the same thing
<xentrac>
maybe "fetches 25-30% fewer instruction bits" refers to how much traffic results from I$ misses, not how many instruction bits the fetch unit fetches from the I$
<jrtc27>
no it'll be the latter
<xentrac>
hmm, okay
<cousteau>
(or, in this case, making the instructions 0.70~0.75 times as big...
<cousteau>
Ooooh the instruction LINES, I think I see where this is going
<cousteau>
Was gonna ask if it was 2x bigger in terms of line size, number of lines, or number of ways
<jrtc27>
line size is hard to make bigger, byte select logic gets too deep
<cousteau>
Hmm
<jrtc27>
and also has implications for reservation granularity in some arches, plus cache management ops, plus people trying to avoid false sharing
<cousteau>
Wouldn't doubling the line size merely add 1 level to that?
<jrtc27>
so in practice needs to be architectural
<jrtc27>
yes but you have double the amount of logic
<jrtc27>
so it's both deeper and more congested
<cousteau>
And wouldn't I$ byte select actually need to be word select? You don't need to access individual bytes of the I$
<jrtc27>
either word or half-word depending on uarch if you support compressed
<cousteau>
And hence, what you gain by not making the line 2x longer, you lose by having 2x the density of words
jimwilson has quit [Quit: Leaving]
aburgess has quit [Ping timeout: 240 seconds]
<cousteau>
So you need to address as many 16b words as you would 32b words with the longer line size (and also handle misaligned 32-bit words)
jimwilson has joined #riscv
<cousteau>
But well, I'll believe you. There are probably implications of making cache lines "denser" vs longer vs more numerous that I would need to think about. For instance, a cache miss is less disastrous if fetching a single line from L2 takes half as long (because the line size is kept short)
<cousteau>
And maybe that's part of the reason
<cousteau>
Good; I'll think about it! I have a 2h+ plane trip ahead of me I can spend thinking about that
iorem has quit [Quit: Connection closed]
pjw_ is now known as pjw
jedix has quit [Ping timeout: 240 seconds]
jedix has joined #riscv
<jrtc27>
increased line size doesn't usually have a huge bearing on fill time
<jrtc27>
it's the latency not the bandwidth that kills you
<jrtc27>
though if your L1<->L2 interface is full width rather than bursting lines it'd increase routing congestion
smartin has quit [Read error: Connection reset by peer]
smartin has joined #riscv
smartin has quit [Remote host closed the connection]
smartin has joined #riscv
<cousteau>
OK, thanks for the info! I was thinking of a sequential burst model, didn't consider full-width one-line-per-cycle transmission
<cousteau>
Gotta go, bye!
cousteau has quit [Quit: Bye]
rektide has quit [Remote host closed the connection]
<palmer>
is kito in here?
Narrat has joined #riscv
jwillikers has quit [Remote host closed the connection]
mahmutov has quit [Remote host closed the connection]
mahmutov has joined #riscv
Narrat has quit [Quit: They say a little knowledge is a dangerous thing, but it's not one half so bad as a lot of ignorance.]
winterflaw has quit [Remote host closed the connection]
winterflaw has joined #riscv
<sorear>
kito-cheng used to be here but not currently
winterflaw has quit [Ping timeout: 244 seconds]
Andre_H has quit [Ping timeout: 276 seconds]
pecastro has quit [Ping timeout: 240 seconds]
stikonas has joined #riscv
<stikonas>
Hi, am I supposed to use the clone syscall instead of fork on linux? I've looked around a bit but the fork syscall does not seem to exist
<sorear>
correct
<sorear>
same reason you have to use openat and not open
<stikonas>
yeah, I was already using openat...
vagrantc has quit [Quit: leaving]
<xentrac>
yeah, on Linux fork is a library function
<xentrac>
if I understand correctly
<stikonas>
xentrac: there is also a syscall, but looking at this it seems that newer libcs use the clone syscall internally rather than the fork syscall. I guess older syscalls that have newer replacements are removed. Anyway, I have to use syscalls directly rather than via libc
<xentrac>
of course
<xentrac>
I think older syscalls that have newer replacements are only removed on new architectures
<xentrac>
Linus doesn't like to break userspace
<stikonas>
yes, that's what I meant by removed (only on newer architectures)
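(For reference, a minimal sketch of what a libc-free fork looks like on riscv64 — assumed details: __NR_clone from the asm-generic table and the usual a7/a0..a5/ecall convention. With only SIGCHLD in the flags and every other argument zero, clone behaves like the classic fork.)

    #include <signal.h>          /* SIGCHLD */
    #include <sys/syscall.h>     /* __NR_clone (220 in the asm-generic table) */

    /* Sketch only: fork()-style process creation via the raw clone
     * syscall on riscv64, where the legacy fork syscall was never wired
     * up.  Convention: syscall number in a7, arguments in a0..a5,
     * ecall, result back in a0 (negative errno on failure). */
    static long fork_via_clone(void)
    {
        register long a7 __asm__("a7") = __NR_clone;
        register long a0 __asm__("a0") = SIGCHLD;   /* flags: plain fork      */
        register long a1 __asm__("a1") = 0;         /* child stack: share     */
        register long a2 __asm__("a2") = 0;         /* parent_tid: unused     */
        register long a3 __asm__("a3") = 0;         /* tls: unused            */
        register long a4 __asm__("a4") = 0;         /* child_tid: unused      */
        __asm__ volatile("ecall"
                         : "+r"(a0)
                         : "r"(a7), "r"(a1), "r"(a2), "r"(a3), "r"(a4)
                         : "memory");
        return a0;   /* 0 in the child, child's pid in the parent */
    }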
solrize has joined #riscv
solrize has quit [Changing host]
solrize has joined #riscv
<solrize>
whee, sparkfun selling esp32-c3 modules and boards