<solrize>
hey can i go way off topic here for a little while, or move somewhere else? i want to talk about some AVR8 code
<solrize>
i'm looking at an AVR flashlight controller which fills up the code space on the smaller AVR parts, and it seems to me that the avr-gcc output is not all that dense, so i'm wondering about the idea of using a bytecode interpreter or similar
<sorear>
sure i guess, no one else seems to want the floor right now, although i wonder what makes this the most attractive channel for you
<solrize>
i remember you wrote a post about code density on different cpus so i looked for you here
<solrize>
figuring you might have thoughts on the topic
<solrize>
sec
<sorear>
ah. I don't recall ever making "a post" on that subject although it's come up on IRC a few times
<sorear>
I also have only the most passing knowledge of AVR instruction encoding
<solrize>
hmm maybe i'm confusing you with someone... it wasn't about avr or hardware cpus as much as encodings in general. it compared forth with smalltalk
<solrize>
or the question of machine code (C compiler output that does a fair amount of 16 bit ops) vs interpreted code on 8 bit cpus in general
<sorear>
the one that comes up somewhat regularly is vincent weaver's work (which I rather disagree with on the grounds that his choice of mostly compression-related benchmarks is not representative of benchmarks I would pick), but that doesn't address either forth or smalltalk
<solrize>
hmm ok i'll see if i can find vincent weaver's work and also will look for the post i'm thinking of
<sorear>
this is actually the first time i've heard of anyone using *smalltalk* specifically as a base for deeply embedded systems; forth is much more well-trodden ground
<solrize>
the smalltalk comparison was only about code density
<solrize>
this flashlight thing might have been an ok forth application though
<meowray>
"Yes, issue is embedded, people care a lot about code size there so can’t change the implementation until binutils has relaxation support." citation needed for the embedded claim
<jrtc27>
yes, well, I've given up fighting over 10/12 bytes there
<wingsorc>
I care about code size you can cite me :)
<jimwilson>
I have pointed at uses of undefined weak in newlib many times. Particularly in crt0.S.
<jimwilson>
The main issue is with naive users that just build a toolchain, build a benchmark, and then decide that RISC-V is broken because code is larger than ARM, without any attempt to understand what is actually going on. This is a problem for the entire RISC-V community. psABI changes that increase code size are reckless, and I won't agree to them.
<wingsorc>
to be honest people roll their own crt0.S
<jimwilson>
but a naive user looking at RISC-V for the first time for a quick evaluation isn't going to do that
<wingsorc>
true. Actually we had people coming in complaining that RISC-V code was 10% larger than ARM
<wingsorc>
I don't remember the exact configuration that was used though...
<meowray>
the people who roll their own crt0.o very likely need -mcmodel=medany -fno-pic ..
<meowray>
s/likely/unlikely/
haritz has joined #riscv
haritz has quit [Changing host]
haritz has joined #riscv
BOKALDO has quit [Quit: Leaving]
zjason has quit [Read error: Connection reset by peer]
zjason has joined #riscv
<solrize>
is risc-v code larger than arm in real life?
<solrize>
is it a matter of adding a feature to binutils (relaxation = shrinking down variable length operations when possible?)
<solrize>
brb
GenTooMan has quit [Ping timeout: 258 seconds]
<jrtc27>
the answer is likely "which Arm, which RISC-V and what software"...
GenTooMan has joined #riscv
<jimwilson>
for embedded code, yes, risc-v is larger than arm in real life, the B extension helps a little, the zce* extensions will help more
<jimwilson>
the C extension was designed using SPEC, which is a good unix benchmark but useless for embedded. this is why we have compressed float/double load/store (SPEC needs them) but not compressed char/short load/store (SPEC doesn't need them), even though many embedded systems have no float and use a lot of char/short data to reduce data size. this hurts embedded code size, but zce* will fix this
GenTooMan has quit [Ping timeout: 248 seconds]
<jrtc27>
Zce ranges from "this is an obvious omission" to "what on earth no that's not what RISC-V should look like" IMO..
<jrtc27>
hopefully the latter ones are not needed to be competitive for code size, because I really don't like them...
<jrtc27>
how much has GCC been optimised for code size, too? I know Craig and people keep finding new code size wins in LLVM
<jrtc27>
some of it could just be a lack of having time (money...) poured into it
GenTooMan has joined #riscv
<jimwilson>
gcc is well optimized for dhrystone and coremark code size and performance
<jimwilson>
we get slightly better results for SPEC CPU2006 with gcc than llvm, but we have more people working on llvm than gcc now, so I expect that to eventually change
<jrtc27>
I know a lack of linker relaxation support does hurt LLD, we see that with our tiny set of embedded benchmarks
<jimwilson>
there were some jump threading patches in llvm recently that helped narrow the gap to gcc
<jrtc27>
oh I remember that one, caught my eye as it mentioned coremark explicitly
<solrize>
i'm glad this stuff is being addressed, like 1 and 2 byte operations
dermato has joined #riscv
<solrize>
i still want to see bignum benchmarks to check the claim that int overflow detection doesn't matter
<jrtc27>
what do you mean? why would trapping be helpful?
<jrtc27>
(or flags)
<jrtc27>
surely you'd need exactly the same amount of code to proactively detect overflow and allocate more space as to reactively detect it?
<solrize>
well on most cpus if you want a multi precision add, you use a carry flag, and there is an add with carry instruction
<solrize>
and if you divide by 0 there is a hardware trap
<solrize>
and ideally since int overflow is usually a bug, a hw trap would help there too
<solrize>
so you have to emit extra instructions to test all that stuff
<jrtc27>
if I wanted to make add-with-carry efficient I'd probably have c.slti[u] exist and then do c.slti[u]; c.addi
<jrtc27>
and then macro-op fuse that
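[editor's sketch] The exchange above is about multi-precision addition on an ISA without a carry flag. A minimal Python model of the usual flagless approach follows, assuming 32-bit limbs stored least-significant first. The name `add_limbs` and the limb width are illustrative choices, not something from the discussion. The carry out of each limb is recovered with an unsigned "result less than operand" comparison, which is the job an `sltu`-style instruction does after each add.

```python
MASK = 0xFFFFFFFF  # assumption: 32-bit limbs


def add_limbs(a, b):
    """Add two equal-length little-endian limb lists.

    Returns (sum_limbs, carry_out). No carry flag is modelled: each carry
    is derived by comparing the wrapped result with one of the operands.
    """
    assert len(a) == len(b)
    out = []
    carry = 0
    for x, y in zip(a, b):
        s = (x + y) & MASK       # plain wrapping add
        c1 = 1 if s < x else 0   # unsigned compare: wrapped iff s < x
        t = (s + carry) & MASK   # fold in the carry from the previous limb
        c2 = 1 if t < s else 0   # second unsigned compare
        out.append(t)
        carry = c1 | c2          # the two wraps cannot both happen
    return out, carry
```

Each limb costs two adds and two compares here, against a single add-with-carry on a flag-based ISA. That difference is the overhead being argued about.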
<jrtc27>
uh, no, you do not want to trap on int overflow
<jrtc27>
mips tried that, it was unused
vagrantc has joined #riscv
<jrtc27>
everything just used the non-trapping instruction
<solrize>
were the trapping ones slower or anything like that?
<solrize>
and mips, that was before people cared about this stuff
<jrtc27>
it was mips, everything was slow
<jrtc27>
but, you just broke too much code
<solrize>
if code depended on non-trapping it was already broken--signed int overflow in C is UB
<jrtc27>
r6 removed the trapping version
<jrtc27>
sure
<jrtc27>
lots of things are UB
<jrtc27>
shitty code still exists
<jrtc27>
and people like to assume two's complement
<solrize>
thus the desirability of traps, to flag the shitty code instead of running it and letting it corrupt stuff
<solrize>
if they want 2s complement they can use unsigned or -fwrapv
<solrize>
which disables some optimizations
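[editor's sketch] To make the two behaviours being argued over concrete, here is a small Python model of 32-bit signed addition. `add_wrapping` mimics two's-complement wraparound (roughly what `-fwrapv` guarantees), and `add_trapping` raises on overflow (roughly what a trapping add or `-ftrapv` would do). The function names and the 32-bit width are assumptions for illustration. The sign test is the standard "operands agree in sign, result disagrees" check, written out in software because there is no overflow flag to read.

```python
def wrap_i32(v):
    """Reduce an arbitrary Python int to a signed 32-bit value."""
    v &= 0xFFFFFFFF
    return v - (1 << 32) if v & 0x80000000 else v


def add_wrapping(a, b):
    """32-bit signed add with two's-complement wraparound."""
    return wrap_i32(a + b)


def add_trapping(a, b):
    """32-bit signed add that raises OverflowError instead of wrapping."""
    r = wrap_i32(a + b)
    # Overflow iff both operands have the same sign and the result's sign
    # differs from theirs; the AND of the two XORs is then negative.
    if ((a ^ r) & (b ^ r)) < 0:
        raise OverflowError("signed 32-bit overflow")
    return r
```

For example, `add_wrapping(2**31 - 1, 1)` returns `-2147483648`, while `add_trapping(2**31 - 1, 1)` raises.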
<jrtc27>
I like your optimism that this forces people to fix their code rather than makes people just ignore mips
<solrize>
they ignore mips for many other reasons why not one more
<solrize>
anyway it's a significant sticking point, if people want C to always allow wrapping then they should take it up with the C standard committee. unintentional overflow may not happen much on 64 bit machines but it was a real issue with 32 bit because it often escaped detection. with 16 bit it happened so much that it usually got caught
<jrtc27>
-fsanitize=undefined
<solrize>
hmm ok if that reliably catches overflow, but i mean if it inserts a bunch of extra code and slows down the program then people won't use it
<solrize>
i tried -ftrapv and there wasn't much difference on x86
<jrtc27>
well it does a whole bunch of things, integer overflow detection being just one of them
<jrtc27>
ubsan is pretty cheap, it's things like msan where it gets slow
<jrtc27>
the headline figure for msan is ~3 times slower, and ~2 times slower for asan
<solrize>
nice thanks right now i primarily use gcc
<solrize>
but i think gcc also has sanitize undefined
<jrtc27>
gcc has support for some of them, don't know exactly what though
<solrize>
yeah
<jrtc27>
yeah it vendors parts of llvm in its tree
<solrize>
wow interesting i didn't know that
<jrtc27>
(the run-time parts of the sanitizers, in libsanitizer)
<solrize>
thanks
valentin has quit [Quit: Leaving]
peeps[zen] has quit [Read error: Connection reset by peer]
peepsalot has joined #riscv
nvmd has quit [Quit: Later, nerds.]
<meowray>
-fsanitize-trap=undefined is needed to make ubsan cheap
<dh`>
"with 16 bit it happened so much that it usually got caught"
<dh`>
so you'd think, but virtually every DOS game has some 16-bit overflow in it
<dh`>
I remember in the original railroad tycoon there was a whole succession of 16-bit overflows you'd hit as you expanded your railroad
<sorear>
zce is surprisingly reasonable imo... i'd like to see the detailed benchmark results (later), hopefully this wasn't just tested on one decompression algorithm
<jrtc27>
which parts of zce?
<jrtc27>
most of it is fine
<jrtc27>
a couple of the instructions are way too specialist, and a couple are just "no" (e.g. tbljal, no, don't do that, please)
<jrtc27>
push/pop, meh, I hate it but people do that on microcontroller ISAs
<sorear>
tbljal is close to word for word something I worked out months ago while trying to come up with a non-terrible version of the andes code density instruction
<jrtc27>
if you want to do tbljal, make it a less architecturally crippled version and just add a load-and-branch instruction...
<jrtc27>
what I don't like is that it's using a new CSR
<jrtc27>
as the implicit base
<sorear>
hmm, if it used gp it'd be compatible with fdpic shared libs... or you could make it truncate pc
<jimwilson>
push/pop and tbljal are the ones that give the most benefit, but you don't have to implement them on unix parts where performance matters more than code size
<jrtc27>
did they consider a generalised load-and-branch?
<jrtc27>
because that has wider applicability
<jrtc27>
and yeah you could have a compressed form that used say gp as the base
<sorear>
I-types don't grow on trees, especially if you insist on encoding imm[0:1] despite the fact it will always be zero
<jrtc27>
you could make it a J-type at least and shave off bit 0
<sorear>
J = 8 times the space of I
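[editor's sketch] The "8 times" figure can be checked from the base RISC-V 32-bit formats. An I-type instruction takes one `funct3` slot under a major opcode and has 12 immediate bits plus two 5-bit register fields. A J-type instruction takes a whole major opcode and has 20 immediate bits plus one 5-bit register field. The dictionary and function names below are illustrative.

```python
# Operand bits available to ONE instruction in each base format.
I_TYPE_FIELDS = {"imm": 12, "rs1": 5, "rd": 5}  # funct3 selects the instruction
J_TYPE_FIELDS = {"imm": 20, "rd": 5}            # no funct3: whole major opcode


def encoding_points(fields):
    """Number of distinct 32-bit encodings one instruction of this format uses."""
    return 2 ** sum(fields.values())


ratio = encoding_points(J_TYPE_FIELDS) // encoding_points(I_TYPE_FIELDS)
```

The three extra operand bits in J-type (25 against 22) are the `funct3` field that I-type spends on selecting the instruction, which gives the factor 2**3 = 8.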
<jrtc27>
oh right
<jrtc27>
hmm
<jimwilson>
I don't recall discussion of load-and-branch, but I haven't followed all of the discussions
wingsorc__ has joined #riscv
<jrtc27>
yeah I haven't either for various reasons
<jrtc27>
still have concerns about the mismatch between the code corpus in use and the intended application space for the more interesting instructions...
wingsorc has quit [Read error: Connection reset by peer]