#riscv on 2021-07-29 — irc logs at libera.irclog.whitequark.org

2021-05-20 20:58 sorear changed the topic of #riscv to: RISC-V instruction set architecture | https://riscv.org | Logs: https://libera.irclog.whitequark.org/riscv | Backup if libera.chat and freenode fall over: irc.oftc.net

00:02 smaeul has quit [Ping timeout: 276 seconds]

00:35 smaeul has joined #riscv

01:07 smaeul has quit [Quit: Down for maintenance...]

01:17 smaeul has joined #riscv

01:44 zjason has quit [Read error: No route to host]

02:44 jwillikers has quit [Remote host closed the connection]

03:00 radu242407 has quit [Ping timeout: 245 seconds]

03:41 dlan has quit [Ping timeout: 268 seconds]

03:47 dlan has joined #riscv

04:49 mahmutov_ has quit [Ping timeout: 258 seconds]

04:51 compscipunk has joined #riscv

04:52 compscipunk has quit [Client Quit]

05:00 compscipunk has joined #riscv

05:00 compscipunk has quit [Client Quit]

05:25 iorem has joined #riscv

06:02 iorem has quit [Ping timeout: 250 seconds]

06:12 hendursaga has quit [Ping timeout: 244 seconds]

06:14 hendursaga has joined #riscv

06:24 peeps[zen] has quit [Ping timeout: 258 seconds]

06:42 BOKALDO has joined #riscv

07:01 K285 has joined #riscv

07:20 pecastro has joined #riscv

07:37 winterflaw has joined #riscv

07:37 smartin has joined #riscv

07:45 peepsalot has joined #riscv

07:47 K285 has quit [Quit: Client closed]

07:51 iorem has joined #riscv

08:05 hendursa1 has joined #riscv

08:08 hendursaga has quit [Ping timeout: 244 seconds]

08:11 valentin has joined #riscv

08:52 iorem has quit [Ping timeout: 258 seconds]

09:05 TMM_ has quit [Quit: https://quassel-irc.org - Chat comfortably. Anywhere.]

09:05 TMM_ has joined #riscv

10:17 Ivyy has joined #riscv

10:36 khem has quit [Quit: Bridge terminating on SIGTERM]

10:36 charlesap[m] has quit [Quit: Bridge terminating on SIGTERM]

10:36 psydroid has quit [Quit: Bridge terminating on SIGTERM]

10:36 kaji has quit [Quit: Bridge terminating on SIGTERM]

10:36 CarlosEDP has quit [Quit: Bridge terminating on SIGTERM]

10:36 EmanuelLoos[m] has quit [Quit: Bridge terminating on SIGTERM]

10:36 pierce has quit [Quit: Bridge terminating on SIGTERM]

10:38 zjason has joined #riscv

10:40 kaji has joined #riscv

10:40 pierce has joined #riscv

10:40 CarlosEDP has joined #riscv

10:40 EmanuelLoos[m] has joined #riscv

10:40 charlesap[m] has joined #riscv

10:40 khem has joined #riscv

10:50 davidlt has joined #riscv

12:07 iorem has joined #riscv

12:28 jwillikers has joined #riscv

12:39 BOKALDO has quit [Quit: Leaving]

13:00 cousteau has joined #riscv

13:04 <cousteau> Hi

13:04 psydroid has joined #riscv

13:05 <cousteau> Why are compresses instructions so important? I see many simple implementations including the C extension but I don't really see the point for it

13:05 <cousteau> *compressed

13:06 <rjek> Smaller code → better cache usage → higher performance

13:06 <cousteau> Like, won't adding a block to decode compressed instructions add more latency and area to the design? Or is it minimal compared to a benefit I still haven't quite understood?

13:07 <cousteau> Oh, cache

13:07 <rjek> If it's like ARM Thumb, the "decompression" is essentially free as it's just a bitmap process

13:07 <rjek> ie, there is a single, obvious, simple mapping of each bit in a compressed instruction to each bit in the full-width instruction

13:08 <cousteau> I was thinking just memory usage, which didn't sound as important (who cares if the .text section is 4 or 5 KB?), but I see how caches would play an important role

13:09 <cousteau> rjek: but there really isn't; it may be quite straightforward but some decoding is still needed

13:10 <rjek> Well, I don't know how they've implemented it: but it could just be a *different* instruction decoder being used to configure the ALU etc

13:10 <rjek> Rather than a discrete step

13:13 <cousteau> Plus many C instructions are like "do this if rd=0, do that if rd≠0"

13:13 <cousteau> Well yeah I guess that would avoid the timing issues I mentioned

13:13 <cousteau> But no longer "free"

13:14 <rjek> Early implementations of ARM Thumb had terrible performance (it would do 16 bit loads for each individual instruction, decode into ARM, then decode the ARM) but modern parts there's no measurable performance penalty IIRC

13:14 <cousteau> I mean, I don't know how it's internally implemented either but it is possible to see how the instructions are encoded

13:15 <cousteau> (I suspect nobody does; they just type the behavioral code and let the synthesis tool figure out how to map that to gates and flip flops)

13:42 <cousteau> rjek: hmm... The spec says that RVC "fetches 25-30% fewer instruction bits" (OK, seems reasonable), "which reduces I$ misses by 20-25%" (I don't quite get why not 25-30% but I guess I'd need to do the math), "or roughly the same performance impact as DOUBLING the I$ size" (...wait, what??)

13:44 <cousteau> I don't get how being able to store 33-43% more instructions (due to the reduction of 25-30%) can be equivalent to increasing the cache size a 100%

13:45 <rjek> Degrading returns from adding more cache. Doubling it doesn't double performance

13:46 hendursa1 has quit [Quit: hendursa1]

13:46 hendursaga has joined #riscv

13:48 <cousteau> I guess that explains how it goes from 25-30% fewer bytes to 20-25% fewer misses, but not the equivalence to 2x cache

13:49 BOKALDO has joined #riscv

13:49 <cousteau> If I understand correctly, if you somehow managed to make all instructions half as large, you'd be able to fit twice as many in the I$, so it would be equivalente to having 2x as much I$

13:50 <cousteau> So if it were a 50% reduction I'd understand it were equivalent to a 2x increase

13:57 <rjek> If you could halve the size of every instruction, you wouldn't need the wider representation :)

14:01 <cousteau> I mean, halving every instruction would be equivalent to not halving any instruction and instead doubling the cache, unless I'm missing your point

14:03 <cousteau> Lol, I'm reading the reference where they apparently state this 2x increase equivalence, and it's an MSc thesis from 2011. The instruction structure is so "wrong"! (just outdated, I guess)

14:05 <cousteau> It has the rd on the MSbits of the instruction, followed by rs1 and rs2 to its right. It's funny to see how much the instruction set has evolved since its inception.

14:06 <cousteau> It already includes the R4 instruction type though (btw, I'd have called it F rather than R4; I don't like there being a 2-char type where every other type uses only 1)

14:07 <cousteau> You know, F as in Four registers, Fused multiply-add, or Floating point registers. Dunno, I think it's a cool mnemonic.

14:28 compscipunk has joined #riscv

14:28 <jrtc27> rule of thumb is that cache miss rate scales with the square root of the cache size

14:28 <jrtc27> well, inverse square root

14:29 <jrtc27> 1/sqrt(2) is 0.7, ie about a 30% decrease in miss rate

14:29 <jrtc27> which matches your quote

14:30 <jrtc27> as for decompression adding latency, not really, it's just a bit of combinatorial logic you shove very early in decode

14:30 <jrtc27> it's not normally on a critical path for riscv

14:39 cousteau has quit [Ping timeout: 252 seconds]

14:41 cousteau has joined #riscv

14:41 cousteau has quit [Client Quit]

14:41 cousteau has joined #riscv

14:41 <xentrac> thanks, I'd been wondering about that a lot, jrtc27

14:42 <cousteau> Did anyone say anything after my last message? Did anyone *see* my last message? ("OK about the latency...")

14:43 <xentrac> yes

14:43 <xentrac> well no

14:43 <xentrac> I mean your last messages was "...cool mnemonic."

14:44 <cousteau> OK about the latency. If it's not on a critical path then I guess there's nothing to fear, and if the area overhead is negligible as it seems to be the case then there's nothing to lose with a C extension, I guess

14:44 <xentrac> and then jrtc27 explained why doubling the effective instruction cache size would drop the cache miss rate by 30%

14:44 <cousteau> Anyway. From jrtc27's formula, I get an 1.56-1.78x "bigger" cache, which is close to 2x

14:47 <jrtc27> the bigger issue with C is having to make sure your uarch can handle unaligned ifetch for 32-bit instructions

14:47 <jrtc27> it's not much logic but it can be a bit fiddly to get the edge cases right

14:48 <cousteau> But then I don't get how reducing the instruction size by 25-30% reduces the cache miss rate by that much, and not 13-16% or so

14:49 <jrtc27> why 13-16?

14:50 <cousteau> sqrt(1-0.25~0.30) ≈ 1-0.13~0.16

14:51 <cousteau> Because I still think that making instructions 2x smaller is equivalent to making the cache 2x bigger

14:51 <jrtc27> it's equivalent to making the cache *lines* 2x bigger

14:51 <jrtc27> which isn't the same thing

14:51 <xentrac> maybe "fetches 25-30% fewer instruction bits" refers to how much traffic results from I$ misses, not how many instruction bits the fetch unit fetches from the I$

14:51 <jrtc27> no it'll be the latter

14:52 <xentrac> hmm, okay

14:52 <cousteau> (or, in this case, making the instructions 0.70~0.75 times as big...

14:52 <cousteau> Ooooh the instruction LINES, I think I see where this is going

14:52 <cousteau> Was gonna ask if it was 2x bigger in terms of line size, number of lines, or number of ways

14:53 <jrtc27> line size is hard to make bigger, byte select logic gets too deep

14:53 <cousteau> Hmm

14:53 <jrtc27> and also has implications for reservation granularity in some arches, plus cache management ops, plus people trying to avoid false sharing

14:53 <cousteau> Wouldn't doubling the line size merely add 1 level to that?

14:53 <jrtc27> so in practice needs to be architectural

14:54 <jrtc27> yes but you have double the amount of logic

14:54 <jrtc27> so it's both deeper and more congested

14:54 <cousteau> And wouldn't I$ byte select actually need to be word select? You don't need to access individual bytes of the I$

14:54 <jrtc27> either word of half-word depending on uarch if you support compressed

14:55 <cousteau> And hence, what you gain for not making the line 2x longer, you lose by having a 2x density of words

15:00 jimwilson has quit [Quit: Leaving]

15:03 aburgess has quit [Ping timeout: 240 seconds]

15:03 <cousteau> So you need to address as many 16b words as you would 32b words with the longer line size (and also handle misaligned 32-bit words)

15:03 jimwilson has joined #riscv

15:06 <cousteau> But well, I'll believe you. There are probably implications on making cache lines "denser" vs longer vs more numerous that I would need to think about. For instance, a cache miss is less disastrous if fetching a single line from L2 takes half as long (because the line size is kept short)

15:07 <cousteau> And maybe that's part of the reason

15:08 <cousteau> Good; I'll think about it! I have a 2h+ plane trip ahead of me I can spend thinking about that

15:33 iorem has quit [Quit: Connection closed]

15:45 pjw_ is now known as pjw

15:47 jedix has quit [Ping timeout: 240 seconds]

15:47 jedix has joined #riscv

15:49 <jrtc27> increased line size doesn't usually have a huge bearing on fill time

15:49 <jrtc27> it's the latency not the bandwidth that kills you

15:49 <jrtc27> though if your L1<->L2 interface is full width rather than bursting lines it'd increase routing congestion

15:57 smartin has quit [Read error: Connection reset by peer]

15:59 smartin has joined #riscv

16:23 smartin has quit [Remote host closed the connection]

16:24 smartin has joined #riscv

16:28 <cousteau> OK, thanks for the info! I was thinking on a sequential burst model, didn't consider full width one-line-per-cycle transmission

16:28 <cousteau> Gotta go, bye!

16:28 cousteau has quit [Quit: Bye]

16:40 rektide has quit [Remote host closed the connection]

16:48 <palmer> is kito in here?

17:08 Narrat has joined #riscv

17:15 jwillikers has quit [Remote host closed the connection]

17:21 <jrtc27> not that I know of

18:05 Andre_H has joined #riscv

18:11 adjtm has quit [Quit: Leaving]

18:22 mahmutov has joined #riscv

18:37 aburgess has joined #riscv

18:41 mahmutov has quit [Ping timeout: 256 seconds]

18:53 mahmutov has joined #riscv

19:01 zjason` has joined #riscv

19:04 davidlt has quit [Ping timeout: 245 seconds]

19:05 zjason has quit [Ping timeout: 265 seconds]

19:05 vagrantc has joined #riscv

19:08 solrize has quit [Ping timeout: 256 seconds]

19:20 TMM_ has quit [Quit: https://quassel-irc.org - Chat comfortably. Anywhere.]

19:20 TMM_ has joined #riscv

19:21 BOKALDO has quit [Quit: Leaving]

20:16 adjtm has joined #riscv

20:25 devcpu has quit [Quit: leaving]

20:27 devcpu has joined #riscv

20:47 smartin has quit [Quit: smartin]

20:58 Ivyy has quit [Remote host closed the connection]

21:21 mahmutov has quit [Remote host closed the connection]

21:21 mahmutov has joined #riscv

21:55 Narrat has quit [Quit: They say a little knowledge is a dangerous thing, but it's not one half so bad as a lot of ignorance.]

22:25 winterflaw has quit [Remote host closed the connection]

22:25 winterflaw has joined #riscv

22:49 <sorear> kito-cheng used to be here but not currently

22:52 winterflaw has quit [Ping timeout: 244 seconds]

22:58 Andre_H has quit [Ping timeout: 276 seconds]

23:17 pecastro has quit [Ping timeout: 240 seconds]

23:26 stikonas has joined #riscv

23:27 <stikonas> Hi, am I supposed to use clone syscall instead of fork on linux? Somehow I looked a bit but fork syscall does not seem to exist

23:28 <sorear> correct

23:29 <sorear> same reason you have to use openat and not open

23:30 <stikonas> yeah, I was already using openat...

23:31 vagrantc has quit [Quit: leaving]

23:34 <xentrac> yeah, on Linux fork is a library function

23:34 <xentrac> if I understand correctly

23:42 <stikonas> xentrac: there is also a syscall, but looking at this it seems that newer libcs use clone syscall internaly rather than fork syscall. I guess older syscalls that have newer replacements are removed. Anyway, I have to use syscalls directly rather than via libc

23:45 <xentrac> of course

23:45 <xentrac> I think older syscalls that have newer replacements are only removed on new architectures

23:46 <xentrac> Linus doesn't like to break userspace

23:48 <stikonas> yes, that's what I meant by removed (only newer architectures)

23:54 solrize has joined #riscv

23:54 solrize has quit [Changing host]

23:54 solrize has joined #riscv

23:55 <solrize> whee, sparkfun selling esp32-c3 modules and boards

23:55 <xentrac> sparkfun are great

23:56 <solrize> https://www.sparkfun.com/products/18036 well its an espressif board but yeah

23:57 <xentrac> have you looked at their Ambiq boards? that's the thing that I wish I could get my hands on

23:57 <solrize> https://www.sparkfun.com/products/18034 hmm not the same module

23:57 <solrize> they sell the ambiq boards, a guy i know ordered the first one

23:57 <xentrac> has he measured its power consumption?

23:57 <solrize> i don't think so

23:58 <xentrac> that's the part that's really exciting to me (I guess that's sort of Ambiq's whole sales pitch)

23:58 <solrize> there is starting to be other super low power stuff without using sub-threshold, i guess just by using smaller processors

23:58 <solrize> and maybe suffering from higher leakage current

23:58 <xentrac> I haven't seen anything else that comes close (to the Ambiq datasheet numbers)

23:59 <solrize> i like the ambiq idea but also feel like i'm buying into software bloat by even desiring such large mcu's