<deathmist>
hmm reverting the commit didn't seem to affect anything... I'll have to come back to this tomorrow
dlan has quit [Ping timeout: 244 seconds]
Tenkawa has joined #riscv
dlan has joined #riscv
<Tenkawa>
mps: mine did keep lower cpu load like yours after I tried again... it was just really slow to go down after the boot.. I wasn't waiting long enough
<Tenkawa>
So I do like the load avg workaround (which works with my xfs chainload need)
Tenkawa has quit [Quit: Was I really ever here?]
Trifton has joined #riscv
<notgull>
How do RiscV systems boot? Do you have access to the usual BIOS interrupts?
<muurkha>
no
<muurkha>
the RISC-V ISA spec itself doesn't specify that at all, and different implementations of the ISA vary a lot
<notgull>
I see, so SBI is generally what is used?
<sorear>
that's not really what SBI does
<notgull>
Huh, so how do boot loaders, for instance, read from the disk before drivers are set up?
<sorear>
depends on the boot stage and the platform
<sorear>
u-boot has its own set of disk drivers, or you might use the UEFI disk interface
<notgull>
Hmm, so there's no real standard?
<notgull>
Oh, so UEFI can be used in this instance?
<sorear>
there are several real standards, but trying to be all things to all people has unavoidable costs
<notgull>
I get it, so I have to research which platform
<notgull>
I'm developing for?
<sorear>
yes
<muurkha>
yeah. RISC-V is more like "8086" and less like "IBM PC"
<notgull>
I get it. Thanks!
<sorear>
if you want to support multiple platforms, you can take a device tree or UEFI/ACPI approach
<notgull>
👍
Narrat has quit [Quit: They say a little knowledge is a dangerous thing, but it's not one half so bad as a lot of ignorance.]
MaxGanzII_ has quit [Ping timeout: 246 seconds]
drmpeg has quit [Ping timeout: 248 seconds]
drmpeg has joined #riscv
<muurkha>
how widespread is the use of x8 as a frame pointer these days?
<muurkha>
I feel like the ABI's requirement (?) of always maintaining 16-byte alignment coupled with the lack of LDM/STM instructions or preincrement/postincrement addressing modes means that the regular stack pointer is pretty much a frame pointer; you can't push a register on the stack because that would break alignment. instead you have to addi to allocate a new stack frame, at which point you start storing stuff in it
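A rough sketch of the point above, in Python purely for illustration: because sp has to stay 16-byte aligned, a prologue allocates the whole frame with one addi instead of pushing registers one at a time. The helper name `frame_size` and the 8-byte slot size (RV64) are assumptions made for this example, not something taken from the ABI text.

```python
STACK_ALIGN = 16  # alignment the psABI requires for sp
SLOT = 8          # bytes per saved register on RV64 (assumption for this sketch)

def frame_size(saved_regs: int, local_bytes: int) -> int:
    """Round the bytes a function needs up to a multiple of STACK_ALIGN."""
    needed = saved_regs * SLOT + local_bytes
    return (needed + STACK_ALIGN - 1) // STACK_ALIGN * STACK_ALIGN

# Saving only ra still costs a full 16-byte frame: addi sp, sp, -16
print(frame_size(saved_regs=1, local_bytes=0))   # 16
# ra + s0 + 20 bytes of locals need 36 bytes, so the frame is 48
print(frame_size(saved_regs=2, local_bytes=20))  # 48
```

In other words, the unit you push and pop is the rounded frame, never a single register.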
<sorear>
how are you operationally defining "is pretty much a frame pointer"?
guerby has joined #riscv
<muurkha>
well, I mean what you push and pop are entire activation records, not register values
<muurkha>
so if you wanted to maintain a traversable linked list of stack frames without any debugging metadata, you could do it without a separate frame pointer register
<muurkha>
yeah, that's exactly what I was thinking about
<sorear>
you need the link to have a linked list
<muurkha>
yes, for sure
<sorear>
x8 also gets used in non-ABI fashion for local variable access in functions that use VLAs or alloca
<muurkha>
is struct stackframe in asm/stacktrace.h?
<muurkha>
aah, thanks, I hadn't thought about that
<muurkha>
I forget, the RISC-V ABI doesn't define a red zone, does it?
<muurkha>
not that that would help with VLAs...
<sorear>
no red zone, everything below sp is volatile
<muurkha>
so you have to displace sp to alloca(), at which point you can no longer index your local vars off sp
<muurkha>
also you can no longer addi to restore sp in the epilogue; you have to read the saved sp out of the stack frame
<muurkha>
thanks!
<sorear>
with the current implementation the saved sp *is* the stack frame, and also the dwarf cfa
<sorear>
which ironically means that you do in fact use addi to restore sp, since mv sp,s0 is an alias for addi x2,x8,0
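To make the alias concrete, here is a small Python sketch that encodes ADDI by hand using the standard I-type layout (imm[11:0], rs1, funct3, rd, opcode). The function name `encode_addi` is invented for this example. sp is x2 and s0/fp is x8, so `mv sp,s0` comes out as an ordinary ADDI word with a zero immediate.

```python
def encode_addi(rd: int, rs1: int, imm: int) -> int:
    """Encode an I-type ADDI instruction as a 32-bit word."""
    assert 0 <= rd < 32 and 0 <= rs1 < 32
    assert -2048 <= imm < 2048
    OPCODE_OP_IMM = 0x13
    FUNCT3_ADDI = 0b000
    return ((imm & 0xFFF) << 20) | (rs1 << 15) | (FUNCT3_ADDI << 12) | (rd << 7) | OPCODE_OP_IMM

SP, S0 = 2, 8  # x2 is sp, x8 is s0/fp

# mv sp, s0  ==  addi x2, x8, 0
print(f"{encode_addi(SP, S0, 0):#010x}")    # 0x00040113
# the usual frame allocation: addi sp, sp, -16
print(f"{encode_addi(SP, SP, -16):#010x}")  # 0xff010113
```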
<sorear>
(locals are negative(x8), stack arguments are positive(x8))
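A toy model of walking such a frame chain, in Python. It assumes the RV64 layout that gcc and llvm appear to emit with -fno-omit-frame-pointer: fp holds the caller's sp on entry, the saved ra is at fp-8, and the saved caller fp is at fp-16. Treat that layout, the function name `walk_frames`, and all the addresses as assumptions for the sketch.

```python
def walk_frames(mem: dict[int, int], fp: int, max_depth: int = 64) -> list[int]:
    """Follow saved-fp links and collect return addresses, innermost first."""
    return_addrs = []
    while fp != 0 and len(return_addrs) < max_depth:
        return_addrs.append(mem[fp - 8])  # saved ra sits just below fp
        fp = mem[fp - 16]                 # saved caller fp sits below that
    return return_addrs

# Toy stack for main -> f -> g; every address is invented for the example.
mem = {
    0x7FF0 - 8: 0x1000, 0x7FF0 - 16: 0,       # main's frame, no outer fp
    0x7FC0 - 8: 0x2040, 0x7FC0 - 16: 0x7FF0,  # f's frame, returns into main
    0x7FA0 - 8: 0x3080, 0x7FA0 - 16: 0x7FC0,  # g's frame, returns into f
}
print([hex(a) for a in walk_frames(mem, 0x7FA0)])  # ['0x3080', '0x2040', '0x1000']
```

The walk needs nothing but the link stored in each frame, which is the "you need the link to have a linked list" point.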
<muurkha>
current implementation in what? I don't know anything about dwarf unfortunately
<muurkha>
is this for C variadic-supporting calling conventions where the callee may not know enough to pop its own arguments?
<sorear>
i'm familiar with frame pointer discipline in gcc and llvm, which is essentially the same and depends on -fno-omit-frame-pointer
<sorear>
the callee never pops its own arguments
<sorear>
functions with more than 8 words of arguments are rare enough that nobody's bothered to define a PASCAL/STDCALL for riscv
<muurkha>
well, in amd64 SBCL and most Pascal implementations it does
<muurkha>
I don't know if SBCL supports RISC-V (yet?)
<muurkha>
the note about how mv is actually addi is well-taken. I wasn't thinking about that :)
<dh`>
basically it's the same as mips
<muurkha>
you could say that about most things in RISC-V :)
<dh`>
there are certain disadvantages to fomiting on the frame pointer by default, but mostly you come out ahead
<dh`>
especially in a world that seems to have accepted the notion of having enormous unwind tables and libunwind all over the place
* muurkha
towels off the frame pointer, eyeing dh` evilly
<sorear>
if sbcl/risc-v is not using the standard calling convention (likely, if it needs to support fully general tail calls) it's out of scope for the ABI
<muurkha>
yeah, the tradeoffs that make sense may be different in a world where development machines routinely have tens of gigabytes of RAM
<muurkha>
agreed, sorear
<jrtc27>
cf. GHC's STG that does its own wildly-different thing internally
<jrtc27>
(full of CPS)
<muurkha>
hmm, I feel like combinator graph reduction is really kind of different from CPS...
<muurkha>
I don't remember whether SBCL supports fully general tail calls. CL doesn't require it
<sorear>
I know V8 does some weird ABI stuff for tail call reasons
ntwk has quit [Quit: ntwk]
<muurkha>
why would fully general tail calls require violating the ABI?
<sorear>
more annoying is that riscv copies the arm/mips frame layout with pc and sp at the high addresses... except pc and sp are *low* registers on riscv, which means that the saved register area is in reverse register order, but they decided to make cm.push use forward register order, so it's incompatible with frame pointer ABI
<muurkha>
cm.push?
<sorear>
if A() calls B(x) and there are no argument registers, A needs to allocate stack space for x, but since it's a tail call A cannot deallocate stack space, which leaves B as the only option
<sorear>
from Zcmp
<muurkha>
aha
<jrtc27>
yeah cm.push not putting ra and sp round the other way sucks
<jrtc27>
(as for the frame layout, what's specified is whatever gcc did long ago...)
<muurkha>
sorear: I'm not following you about A() and B(x)
<muurkha>
by B(x) do you mean B(a, b, c, d, e, f, g, h, x), so you ran out of argument registers?
<muurkha>
also I think you can totally deallocate stack space; it's just addi sp, sp, 16, once you're done storing whatever you needed to store on the stack for arguments
<sorear>
muurkha: I said "and there are no argument registers"
<muurkha>
but in the RISC-V ABI there are argument registers, 8 of them?
<muurkha>
(not counting floating-point)
<dh`>
yes but obviously the same thing happens if you run out
<sorear>
are we not discussing hypotheticals here?
<muurkha>
oh, I thought we were talking about whether the RISC-V ABI prevented general tail-call elimination?
<sorear>
for the purposes of answering that question, 8 and 0 are both finite numbers and therefore equivalent
<muurkha>
(SBCL always defines its own ABI on every platform and it's always weird as hell)
<muurkha>
okay. so you need to allocate stack space for x, so that on entry to B, sp points at x, right?
<sorear>
yes
<sorear>
but when B returns to A's caller, 0(sp) is part of the caller's stack frame
<muurkha>
can't A just allocate stack space to call B with addi sp, sp, -16?
<muurkha>
oh, now I understand
<muurkha>
that won't work if the A and B's caller is responsible for deallocating it
<muurkha>
as it must be in ABIs that support C varargs
<muurkha>
is that what you were saying?
<sorear>
[enormous unwind tables] I'm not very enthusiastic about frame pointers these days because I don't see an unsymbolized list of return addresses as particularly useful; if you have enough information to symbolize it, you can decode a stack dump and there are fewer things that can go wrong
<muurkha>
since it's a tail call A cannot deallocate stack space after B returns
<muurkha>
that was the part I was failing to grasp
<sorear>
frame pointers were invented for ancient compilers that change sp mid-expression to handle argument pushing and can't track that in their symbol tables, and are being kept alive as a half-useful workaround for DWARF CFI being barely fit for purpose
<sorear>
[after B returns] precisely
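The tail-call problem can be sketched as a tiny sp bookkeeping model in Python. Everything here (the function name, the starting sp, the 16-byte argument area) is hypothetical; it only tracks where sp ends up when B returns straight to A's caller.

```python
ARG_BYTES = 16  # one stack argument, padded to keep sp 16-byte aligned

def sp_drift_after_tail_call(callee_pops: bool) -> int:
    """Return sp on return to A's caller minus sp when A was called."""
    sp = 0x8000
    sp_at_call = sp
    sp -= ARG_BYTES      # A makes room for x, stores it, then jumps to B
    if callee_pops:
        sp += ARG_BYTES  # B frees its own argument area before returning
    # B's ret goes straight to A's caller; A never runs again to clean up
    return sp - sp_at_call

print(sp_drift_after_tail_call(callee_pops=False))  # -16: stack left unbalanced
print(sp_drift_after_tail_call(callee_pops=True))   # 0: balanced
```

With a caller-pops convention nobody is left to undo A's allocation, so a general tail call with stack arguments needs the callee to do the cleanup.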
<muurkha>
frame pointers are also useful for spaghetti stacks
<sorear>
i'd say something about 16-bit x86 and its lack of [SP+imm] but I'm not sure if that's a chicken or an egg
<muurkha>
yeah, [bp+imm] is pretty important on the 8086
<muurkha>
but clearly that was designed because the designers were previously familiar with frame pointers
<muurkha>
maybe due to the iAPX432's B5000 heritage? or maybe the 432 started later
<muurkha>
RVC kind of swings the other way: there's c.lwsp, c.swsp, etc., but no c.lwfp and c.swfp. you can use c.lw and c.sw to index off x8 but you only get 5 bits of offset and can only access the 8 RVC registers
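A quick way to see that asymmetry is a Python check of when a 32-bit load can use a compressed form. It assumes the usual RVC limits: c.lw takes rd and base from x8..x15 with a 5-bit offset scaled by 4, while c.lwsp takes any rd except x0 with a 6-bit offset scaled by 4. The function names are made up for this sketch.

```python
RVC_REGS = range(8, 16)  # x8..x15, the registers a 3-bit RVC field can name

def fits_c_lw(rd: int, base: int, offset: int) -> bool:
    """c.lw: rd and base in x8..x15, offset a multiple of 4 in 0..124."""
    return rd in RVC_REGS and base in RVC_REGS and offset % 4 == 0 and 0 <= offset <= 124

def fits_c_lwsp(rd: int, offset: int) -> bool:
    """c.lwsp: any rd except x0, offset a multiple of 4 in 0..252, base is sp."""
    return 1 <= rd <= 31 and offset % 4 == 0 and 0 <= offset <= 252

print(fits_c_lw(rd=10, base=8, offset=124))  # True: a0 loaded from 124(s0)
print(fits_c_lw(rd=10, base=8, offset=128))  # False: offset out of range
print(fits_c_lw(rd=5, base=8, offset=0))     # False: t0 is not an RVC register
print(fits_c_lwsp(rd=5, offset=128))         # True: sp-relative form reaches further
```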
<muurkha>
for things like Smalltalk and Scheme you'd maybe like a "self pointer" or "closure pointer" register, but RVC was optimized for C, not for Smalltalk
<muurkha>
and the penalty of having to use a full-width instruction is a lot less severe than the corresponding things with Thumb-1 or 8086
<muurkha>
does the ABI require you to leave gp and tp unchanged so your callees have access to them? it's not clear to me in the version I'm reading here, I'm just inferring that from their names. if so, does that also apply to interrupt handlers, or is it okay to save them, clobber them, and then restore them before you call a callee?
<dh`>
it only matters if you care about the debugger being able to cope
madge has joined #riscv
vagrantc has quit [Quit: leaving]
<sorear>
anything that you expect to run in a unix shared libraries environment needs to leave gp and tp unchanged at all times, because the main program might install a signal handler that accesses a _Thread_local variable
<sorear>
if you control the interrupt process and can install a good gp/tp before running the interrupt handler, you have more freedom
kaol has quit [Server closed connection]
<sorear>
[optimized for C] i would say that the more you optimize, especially type-aware and flow-aware optimizations, the more all languages converge on something that resembles RISC instructions
kaol has joined #riscv
<sorear>
if it were truly "optimized for C" it'd look more like VAX with complex addressing modes and memory-memory instructions
<dh`>
maybe not, generating those from a C compiler isn't exactly trivial
<dh`>
anyone remember Hobbit?
billchenchina has joined #riscv
BootLayer has joined #riscv
<muurkha>
the AT&T chip?
<muurkha>
sorear: thanks! that's kind of what I thought
<muurkha>
I think it's reasonable that everything ends up resembling RISC instructions
<muurkha>
the particular thing I was talking about being optimized for C was that there's no compressed instructions for loading and storing instance variables/closure variables
<muurkha>
maybe that's not really so important, since even in Smalltalk or Scheme you end up accessing local variables a lot more often than those
EchelonX has joined #riscv
madge has quit [Quit: madge]
<sorear>
if it's c++ your instance pointer will normally be in a0, and you can use compressed instructions to access instance variables...
<dh`>
yeah, the AT&T chip
<dh`>
also re closure variables, ordinarily your closure pointer's going to be an argument...
zjason` is now known as zjason
<muurkha>
a way RISC-V could be more optimized for C would be to have more i386-like or ARM-like addressing modes in its load/store instructions. Smalltalk, Java, ML, and Lisp only need simple base+offset access modes, because their stack frames and records are simple vectors
<muurkha>
i386 has basereg+offsetreg(*scale)+immediate
<dh`>
uh, all of those languages have arrays of some kind
<muurkha>
yeah, but not embedded inside another object
<dh`>
doesn't matter, offsets still aren't fixed
<muurkha>
fp and sp are normally arguments (passed from the caller) and also call-preserved; that's what you'd want for instance pointers too
<dh`>
to access the a1'th element of the array in a0, you do slli t1, a1, 2; add t1, t1, a0; lw t1, 0(t1)
<muurkha>
hmm? I mean in C if you have an array that is a local variable, or an access to a struct field inside an array, you index it with frame pointer + immediate offset (to the beginning of the array, or to the struct field) + scaled index (for the index into the array)
<dh`>
if the array is at some offset inside a struct, that offset replaces the 0 in the lw
<muurkha>
yes
<dh`>
(unless it's too large, but that's a different issue)
<dh`>
so whether arrays are embedded in structs or not is immaterial
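The shift/add/load sequence above amounts to the following arithmetic, sketched in Python. The function name and parameters are illustrative only; the comments map each step to the instruction it stands in for, and the struct offset just rides along as the load's immediate.

```python
def effective_address(base: int, index: int, elem_size: int = 4, field_offset: int = 0) -> int:
    """Address of element `index` of an array at `field_offset` inside the object at `base`."""
    shift = elem_size.bit_length() - 1  # elem_size must be a power of two
    assert 1 << shift == elem_size
    t1 = index << shift                 # slli t1, a1, shift
    t1 = t1 + base                      # add  t1, t1, a0
    return t1 + field_offset            # lw   t1, field_offset(t1) adds the immediate

print(hex(effective_address(0x1000, 3)))                  # 0x100c
print(hex(effective_address(0x1000, 3, field_offset=8)))  # 0x1014
```

A base + scaled-index + immediate addressing mode would fold all three additions into the load itself, which is the one-or-two-instruction saving being discussed.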
<muurkha>
but typically adding three addends like that is a short enough path length to fit inside a clock cycle
<dh`>
you save one instruction by having a lw t1, (a0 + t1) instruction, like sparc did
<dh`>
and another by assimilating the shift like x86
<muurkha>
so on an in-order microarchitecture you can win by having an x86-like addressing mode
<dh`>
I think the answer to that is supposed to be "micro-op fusion"
<muurkha>
a big complicated chip can do that with micro-op fusion, sure
<sorear>
this is your irregular reminder that carry-save adders exist
<muurkha>
yes, that's why three addends isn't especially slower than two
<sorear>
you can forward a sum into another addition or subtraction at negligible cost
<muurkha>
but you still need to wait for the carries to propagate to form the effective address to put on the memory bus
<dh`>
and this is why sparc (and mips64 too) had register + register addressing
<muurkha>
yeah
<dh`>
the reason riscv doesn't is that it doesn't fit in the instruction word, or alternatively does but only at the cost of making it a lot more irregular
<muurkha>
plausibly, yeah. it helps if you can use an instruction format with fewer registers, like RVC
Jackneill_ has joined #riscv
<muurkha>
usually the biggest cost of this sort of thing is that you need three ports on the register file, which makes the bits bigger than if you only need two. but you need three ports anyway to do regular RISC-V instructions in one cycle
<muurkha>
(or, for that matter, regular 8086 instructions)
<dh`>
arguably, 16 registers is enough if you don't waste several of them like arm32 did
<sorear>
x86 has 4 operands (segment base, base reg or pc, scaled index, displacement), arm and mips only do base reg and index OR displacement
<muurkha>
the number of registers depends on what you're doing
<sorear>
regular risc-v instructions are all 2R1W, which is enough for base + scaled index loads but not stores
<muurkha>
yes, true!
<dh`>
I can't remember the last time I saw code that had > 16 locals all live at once that didn't also need a rewrite
<muurkha>
as a trivial example, emulating an arm32 can be significantly faster if you can have 16 locals all live at once
<dh`>
maybe aggressive inlining changes that
<muurkha>
or, say, 19 or so
<sorear>
i have a suspicion it was mostly sized for dgemm
<muurkha>
also I think there are cryptographic algorithms that would get a significant speedup that way
<dh`>
maybe, I haven't looked in any crypto sources in a long time
<muurkha>
sorear: what, ARM's 16 registers?
<dh`>
by tradition they're usually encrypted after all
<sorear>
four independent 2d arrays with general strides and upper/lower bounds, and if you can keep the floats separate that's great
davidlt has joined #riscv
<gurki>
i hope that the hpc folks will have some nice proposals for specific instructions; as it is riscv performance is rather abysmal in comparison to "classical" architectures :S
<gurki>
but then, we kinda lack hardware thats even meant to compete to begin with
<muurkha>
what kind of classical architectures do you mean?
<gurki>
x86, arm
<muurkha>
amd64?
<muurkha>
I had assumed you meant, like, Cray-1
<gurki>
i consider that a part of x86 :3
<gurki>
(im aware its an extension)
<muurkha>
I don't think there's ever been an amd64 chip with a part number ending in "86"
<gurki>
well its even worse for gpus but thats by no metric a fair or reasonable comparison so i skipped it
<dh`>
there hasn't been anything with a part number ending in "86" for a good twenty years
<muurkha>
I feel like mostly RISC-V performance sucks compared to things like the M1 because people aren't fabbing RISC-V parts in those process nodes
<dh`>
thirty if you don't count cyrix and early amd stuff
<muurkha>
I mean *also* there's microarchitecture stuff like scoreboards and branch predictors
<gurki>
nah. fabbing at 5nm doesnt get you _that_much_
<dh`>
actually I bet someone still makes 486s for industrial/hardened apps
crabbedhaloablut has joined #riscv
<muurkha>
but I think that's sort of only useful if you have the real estate for all the functional units
<muurkha>
dh`: I think you can still get an 80186
<dh`>
could be
<gurki>
thats the thing. nowadays cpus are fast since they kinda are isa + big blob of stuff that actually makes it fast