<jrtc27>
H was held up because people felt it was dependent on AIA
<jrtc27>
or, really, they wanted to make sure that AIA wouldn't cause issues with a premature H ratification
<jrtc27>
but, that's silly, H is pretty simple
<xentrac>
I think probably someone is confusing the H extension with H mode
<xentrac>
again
<sorear>
let's not ascribe to malice that which can be explained by the universal slowness of bureaucracies
<jrtc27>
exactly
crabbedhaloablut has quit [Ping timeout: 244 seconds]
<jrtc27>
save the tin foil hats for another day
crabbedhaloablut has joined #riscv
<xentrac>
do we really want bureaucracies on the critical path for technical progress then?
<jrtc27>
yes, because the alternative is a clusterfuck of a mess where there's zero standardisation
<jrtc27>
and it's not really bureaucracy in this case, it's just that many involved are volunteers so progress is slow
<xentrac>
that sounds implausible
<jrtc27>
who decides what the standard extensions are if there's zero bureaucracy?
<xentrac>
(that the only alternative is zero standardization, I mean, not the volunteers)
<xentrac>
there have been a lot of alternatives tried for that in the past. some have worked better than others
<xentrac>
right now I'm reading through the PDF spec, which came out of Adobe's bureaucracy and then miraculously got pushed through as ISO 32000, and it's an amazing clusterfuck of a mess
<jrtc27>
that's the complete opposite
<xentrac>
did you know that in PDF strings, a line break represents 0x0a, regardless of whether the PDF file encodes it as 0x0a, 0x0d, or 0x0d 0x0a, while in annotation text, you have to separate paragraphs with 0x0d characters?
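A minimal illustrative sketch of that string rule (the helper name `normalize_pdf_string_eol` is made up, and a real PDF parser also has to handle escapes and nested parens):

```cpp
#include <cassert>
#include <string>

// Inside a PDF literal string, any of 0x0a, 0x0d, or 0x0d 0x0a in the file
// stands for a single 0x0a, per the rule described above.
std::string normalize_pdf_string_eol(const std::string& raw) {
    std::string out;
    for (std::size_t i = 0; i < raw.size(); ++i) {
        if (raw[i] == '\r') {
            out.push_back('\n');
            if (i + 1 < raw.size() && raw[i + 1] == '\n') ++i; // CRLF -> one LF
        } else {
            out.push_back(raw[i]);
        }
    }
    return out;
}
```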
<jrtc27>
developed behind closed doors then shoved into the open and declared a standard
<xentrac>
yup. and it's the epitome of bureaucracy
<jrtc27>
your argument seems very confused
<jrtc27>
your example is entirely different to what you think the issue with riscv is
<jrtc27>
bureaucracy can encompass many things
<jrtc27>
and be involved in many different ways
<xentrac>
sorry, allow me to attempt to state my position more clearly; it's hardly surprising that the above is confusing!
<xentrac>
sorear ascribes the slow development of the H extension to the universal slowness of bureaucracies. plausibly this is correct. slow development of technical progress is a drawback, because it hurts everyone everywhere; they benefit only once the progress is actually realized and available. you argue that the bureaucracy involved in the slow development is worth the drawback, because "the alternative is
<xentrac>
a clusterfuck of a mess where there's zero standardisation". but in fact standardization without bureaucracy or development slowdowns or a clusterfuck of a mess is possible, and the available evidence suggests that adding more bureaucracy tends to increase clusterfucks of a mess, not decrease them, the PDF standard being today's exemplary shit sandwich.
<xentrac>
you offered a second hypothesis: progress is slow because many involved are volunteers. many certainly are, but it's not clear that this is a situation that tends to slow progress
<xentrac>
I mean, that's also the case with, for example, 3-D printers, and progress there is enormously more rapid than before the volunteers showed up
<xentrac>
from my point of view, the crucial question is whether the situation enables the volunteers or other workers to build on one another's progress, or instead causes them to get in one another's way
<sorear>
"progress" is a nasty term, can mean whatever people want it to
<xentrac>
there's a whole ideological question there, it's true
<sorear>
and there's nothing stopping you from building on others' work, the drafts exist today
<xentrac>
maybe not; certainly you can tape out silicon implementing the draft extension
<sorear>
if you want something that will be supported in X0 years with no further action on your part, then yes, you have a schelling point problem with a large number of stakeholders and it will take a while for that to reach equilibrium
<xentrac>
yup
pierce has joined #riscv
<sorear>
i could say some things about the real problem being trying to make an ISA be all things to all people
pierce has quit [Ping timeout: 256 seconds]
<xentrac>
it's a real problem, yeah
<jrtc27>
that's primarily because the embedded world is very different to everything else
<jrtc27>
I wouldn't be opposed to the idea of two privileged specs, one embedded and one application
<jrtc27>
but that ship has sailed
<xentrac>
interesting, what's the relevant difference? paging?
<jrtc27>
(and that's kind of what the fast interrupt group is trying to do, but you just can't do it the way they're trying to do and have it fit properly into the existing privileged ISA)
<jrtc27>
paging is already optional
<xentrac>
it is, yes
<xentrac>
what's the friction with the existing privileged ISA and fast interrupts?
<jrtc27>
but the basic problem is that it's taking Arm-M's interrupt design and trying to bolt it onto RISC-V which looks more like Arm-A
<jrtc27>
my primary objection is the fact that there are cases where xEPC ends up holding *data pointers* not *code pointers*
<jrtc27>
which is complete nonsense
<jrtc27>
a data pointer is never a program counter
<jrtc27>
and thus should never be the exception program counter
<jrtc27>
but, really, I still don't get why they have all this special handling for resumable interrupt handling
<jrtc27>
because restarting entirely should function identically...
<jrtc27>
the only real justification is possible hardware simplicity due to being able to break things up slightly more
<jrtc27>
but... that's a poor one when it leads to such an insane architecture
<xentrac>
agreed, a data pointer is never a program counter
<jrtc27>
(specifically, if there is an access fault in loading the interrupt handler address from the handler table, both xepc and xtval get set to the address of the entry in the table...)
<jrtc27>
(which is nuts)
<xentrac>
shades of VAX
<jrtc27>
(previously it was just xepc, which I pointed out was a real WAT for software to handle because access fault handlers will look at xtval already, so their "solution" was to just write it to both rather than fix the insanity)
<xentrac>
krste is claiming "I don't believe in general we can avoid resuming table load (versus restart of original interrupted instruction)." but I don't understand his argument
<jrtc27>
the arguments are (a) possible slight microarchitectural simplicity (b) edge-triggered interrupts, but if you lose them in that case *then your core is already broken because if you temporarily masked interrupts you would have lost that one*
<jrtc27>
(i.e. the interrupt controller MUST latch edge-triggered interrupts internally and not view them as completed until it has successfully vectored to the handler)
<jrtc27>
(which is a requirement *anyway* and renders (b) completely invalid an argument)
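That latching requirement can be modelled in a few lines (a hypothetical toy, not any real interrupt controller): the pending bit is only cleared once vectoring succeeds, so a faulting table load cannot lose an edge-triggered interrupt.

```cpp
#include <cassert>

// Toy model: the controller latches the edge in a pending bit and treats the
// interrupt as completed only after it has successfully vectored to the handler.
struct ToyIntc {
    bool pending = false;
    void edge() { pending = true; }            // edge arrives, latched internally
    bool try_vector(bool table_load_ok) {
        if (!pending) return false;
        if (!table_load_ok) return false;      // access fault: stays pending
        pending = false;                       // completed only on success
        return true;
    }
};
```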
<xentrac>
hmm, you don't have to implement interrupt masking in such a way as to fail to latch edge-triggered interrupts
<jrtc27>
right, but the same thing that makes interrupt masking not lose interrupts is the same thing that makes trapping on loading from the handler table not lose interrupts
<xentrac>
but I guess you could implement it in such a way as to clear the latch too early in a case like this
<xentrac>
right
<jrtc27>
yeah
<jrtc27>
which is just crappy hardware
<jrtc27>
not a spec issue
<jrtc27>
well, it's a spec issue in that such hardware should be non-compliant
<xentrac>
it sounds like it but i'm not confident enough to be sure whether you or krste is right
<xentrac>
since you both know a lot more about hardware design than i do
<jrtc27>
it's quite possible there is another reason that he *hasn't* said
<jrtc27>
but what he *has* said does not change my mind
<jrtc27>
and I will never be convinced that a data pointer should end up in xepc
<jrtc27>
ever
<xentrac>
agreed :)
<jrtc27>
doesn't mean *my* proposal is necessarily the right one though
<jrtc27>
if there are things I'm missing
<xentrac>
it does make sense that fast interrupts would be critically important for real-time stuff in a way that they just aren't for a cellphone or laptop
<xentrac>
the worst-case latency vs. average-case throughput tradeoff pervades systems design
<sorear>
it's not a data pointer
<sorear>
it's an instruction in a special ISA mode where jumps have a 32-bit immediate
<sorear>
and it's not reasonable to roll back the interrupt after a subsequent instruction page fault because the instruction has already been vectored
Sos has quit [Quit: Leaving]
<xentrac>
haha
<xentrac>
that's a novel concept
<xentrac>
my ISA doesn't have variable-length instructions! it just has a bunch of mode switches that only ever stay on for a single instruction!
<sorear>
there's only one reasonable way to do this. are you going to create an entire separate redundant set of paths for loading a word from memory, trapping if it's not accessible, and using it to redirect fetch, or are you going to use the one that's already there?
<xentrac>
couldn't you just prohibit the OS from paging out the page tables its real-time interrupts need to find their interrupt handlers? maybe I'm misunderstanding the scenario
<xentrac>
I feel like a page fault in an interrupt handler in response to a time-critical IRQ line is going to be bad news no matter what
<sorear>
congratulations! you just introduced a hypervisor vulnerability
<xentrac>
hmm, so you're running your real-time tasks under a real-time OS that's running in U-mode under an S-mode hypervisor?
<sorear>
you need to not be able to break a hypervisor by doing weird things under it
<xentrac>
but the hypervisor is letting the RTOS code set up page tables that it will then use to run interrupt handlers? I suppose the hypervisor has to translate both the page tables and the interrupt vectors before handing them off to the hardware, right?
<sorear>
"why are you trying to run real-time tasks" is not a valid excuse for deadlocking or otherwise violating security properties
<xentrac>
oh, agreed!
<xentrac>
I mean I feel like if the hypervisor is letting the RTOS set up its own interrupt vectors that point to unmapped memory, the RTOS can probably get up to other mischief as well by pointing those interrupt vectors at random bits of hypervisor code
<sorear>
why is there hypervisor code in the guest address space?
<xentrac>
why would there be?
<sorear>
you just claimed there was
<xentrac>
I did?
<sorear>
> pointing those interrupt vectors at random bits of hypervisor code
<sorear>
you can't do that unless it's in the relevant address space
<xentrac>
presumably the interrupts in question aren't going to be handled in U-mode after context-switching to an RTOS task address space, right?
<sorear>
every part of this is wrong
<xentrac>
hardly surprising!
<xentrac>
what's the truth?
<sorear>
let's start from scratch
<xentrac>
okay!
<sorear>
you have a core which implements the vectoring modes described in the base spec. you want to remove the 21-bit restriction on handler addresses that emerges from the use of normal jumps. the easy way to do this is to add 1 flop to your fetch/decode unit "we're fetching a handler address", then when you fetch a 32-bit word with that bit set, treat the whole thing as a jump
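The "1 flop" scheme can be sketched as a toy decode step (illustrative only, not real RTL): while the flag is set, the fetched 32-bit word is not decoded as an instruction at all but taken as an absolute jump target.

```cpp
#include <cassert>
#include <cstdint>

struct ToyFetch {
    bool fetching_handler_addr = false;   // the extra flop
    uint32_t pc = 0;
    void take_interrupt() { fetching_handler_addr = true; }
    void on_fetch(uint32_t word) {
        if (fetching_handler_addr) {      // whole fetched word = jump target
            pc = word;
            fetching_handler_addr = false;
        } else {
            pc += 4;                      // normal sequential fetch (sketch)
        }
    }
};
```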
<xentrac>
okay
<sorear>
there's a much harder way with no real advantages where when the interrupt arrives you stop executing instructions, inject a command into the data memory system to fetch a vector, then use a dedicated state machine to vector to that when it arrives... now why would you do this on cores intended to compete with M4
<xentrac>
we're talking about physical interrupts here, right? where some external hardware needs attention either very rarely or with very low latency?
<xentrac>
couldn't you point the normal jumps at a table of two-instruction trampolines somewhere in the low 2 MiB of RAM?
<xentrac>
I mean that costs you an extra pipeline flush and two more instructions; is that the cost we're trying to avoid?
<sorear>
i think so, but I don't want to get into an argument about whether saving 10 cycles on an interrupt in 2021 is actually useful
<xentrac>
it's definitely useful in some cases
ewdwasright has quit [Ping timeout: 265 seconds]
<xentrac>
when we're handling these interrupts in the hypervisor scenario, the hypervisor might not be running the RTOS at all, right? it might be running some best-effort task, like Linux or something
gector has joined #riscv
<xentrac>
so the current page table wouldn't even be the RTOS guest's page table (or any of them if it has more than one)?
<sorear>
no-one said anything about actually running this stuff under hypervisors productively
<xentrac>
well, if it's *unproductively*, then the hypervisor can emulate all the interrupt handling in software because it doesn't have to be fast, right?
<sorear>
the simplest possible thing to do is to treat it like any other instruction fetch, which is what the issue does
<sorear>
everything you've proposed complicates things
<xentrac>
you mean jrtc27?
<jrtc27>
if you're loading your handler address through your icache then you need to be veeeeeeeery careful about coherence
<sorear>
no, I mean you, you keep saying bizarre things about hypervisors
<jrtc27>
but yes I agree it makes sense to load it around the fetch unit not your load store unit
<jrtc27>
but that's a microarchitectural implementation detail
<jrtc27>
it shouldn't pollute the spec
<xentrac>
the only thing I proposed was that it isn't worth worrying too much about what happens when your fast interrupt handler needs to get paged in from disk
<xentrac>
I was trying to understand the concern you raised about hypervisor vulnerabilities
<xentrac>
i'm asking these questions because I've never built a system like this and so I think it's very likely that I'm missing the forest for the trees
<xentrac>
it sounds like you were talking about a case where the RTOS is running under a hypervisor, but not "productively", which I interpreted as either it doesn't have access to real hardware or it doesn't need to meet real-time deadlines; is that what you meant?
<sorear>
there's no RTOS here. there's a hypervisor, and someone trying to break out of the hypervisor by entering states that you say "isn't worth worrying about what happens"
<sorear>
anyway, we've already spent more time on this than CLIC will ever save
<xentrac>
well, but they're states that the hypervisor has to prevent the guest from entering anyway, aren't they?
<sorear>
I'm not interested in continuing this.
<sorear>
&
<xentrac>
okay. well, thank you for your explanation so far!
<xentrac>
sorry I was so dense that I couldn't understand what you were trying to explain to me :(
<sorear>
not a fault situation, I'm not upset and am not looking for an apology
<xentrac>
okay!
<sorear>
i'm just saying (1) it's important to specify _some_ behavior or set of behaviors in every possible situation so that security can be exhaustively analyzed (2) there's an obvious way to implement the functionality (3) the obvious implementation does a specific thing in off-nominal cases, which is good enough to specify
<sorear>
if that helps, great, if not, can try again tomorrow, I'm exhausted for now
<sorear>
jrtc27: i'm probably going to get around to auditing this approximately never but there does need to be a fence.i requirement yes
<jrtc27>
which is.. awful :P
<sorear>
vector table is going to be in ROM in relevant cases
<jrtc27>
(your proposal of specifying it as a separate execution mode where every instruction is XLEN bytes and is a jump to that absolute address is an... interesting way of making everything "fit")
<xentrac>
yeah, I agree with (1) and I think your argument for (2) is plausible
<xentrac>
not sure about (3)
<sorear>
sorry for my impatience
* jrtc27
has had fun discovering a hole in our C++ spatial safety implementation :)
<dh`>
is it really faster to make a special-case table fetch like this vs. doing it as the first instruction of the trap handler?
<jrtc27>
std::make_shared allocates the control block inline with the data being pointed to
<dh`>
granted the latter requires jiggering all the bits so you can do it that way, which riscv doesn't have but mips did at one point
<sorear>
i just don't think CLIC is especially useful, it only saves a handful of cycles over the baseline arch, it's a far less complete solution for "write executives in C" than arm-M has, and if you really cared about cycle-precise event handling in 2021 you'd be using some kind of efpga, which wasn't an option back when cortex-M was new
<jrtc27>
yet it's implemented in the SiFive E whatever core
<sorear>
yes, because they need to make a table of arm features and sifive features, doesn't matter if it's actually equivalently useful
<jrtc27>
I agree it's really not very useful, but it exists, so I want to at least try and make it not awful
<jrtc27>
not that I'll ever have to care about it
<jrtc27>
CHERI-RISC-V can just stick with a sane CLINT
<jrtc27>
and whatever AIA eventually ends up being
<sorear>
anyway tell me more about shared_ptr
<jrtc27>
and maybe by the time that ratification happens people will be starting to take CHERI seriously within RISC-V
<jrtc27>
oh
<jrtc27>
shared_ptr itself is fine
<jrtc27>
but it has two jobs: store the pointer and track ref counts, both shared (strong) and weak
<jrtc27>
if you do shared_ptr(new Foo) then you get a control block and a pointer to your data
<jrtc27>
if you do make_shared<Foo>() then the control block and Foo are combined into one allocation
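The two construction forms being discussed, side by side (`Foo` is a placeholder type; the adjacency of control block and object under make_shared is implementation behaviour, noted in the comment rather than asserted):

```cpp
#include <cassert>
#include <memory>

struct Foo { int x = 42; };

// shared_ptr(new Foo): the control block and the Foo are separate heap
// allocations. make_shared<Foo>(): one combined allocation, with the control
// block typically adjacent to the object -- which is what puts the control
// block inside the object's capability bounds in the CHERI case above.
void demo() {
    std::shared_ptr<Foo> a(new Foo);     // two allocations
    auto b = std::make_shared<Foo>();    // one combined allocation
    assert(a.use_count() == 1 && b.use_count() == 1);
    assert(a->x == 42 && b->x == 42);
}
```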
<jrtc27>
so oops the bounds of the capability include the control block
<sorear>
is that a problem? is shared_ptr enforcing any kind of access boundary? people could implement their own shared_ptr without your mitigation
<jrtc27>
it means you can corrupt ref counts
<jrtc27>
and sure, they can, but they can pick up the pieces if they do that
<jrtc27>
same as implementing their own custom allocator in general
<sorear>
corrupting ref counts sounds like a temporal safety problem
gector has quit [Ping timeout: 252 seconds]
<jrtc27>
yeah, my guess is this one is *probably* fine in practice if you've already got our heap temporal safety turned on
<jrtc27>
either you bump the counts up too high and cause resource leaks
<jrtc27>
or you decrease them and cause early free, which will either be safe due to quarantine or, post-revocation, deterministically trap
<jrtc27>
which isn't great, but it's fail-safe, so long as DoS isn't a concern
<jrtc27>
and, well, fail-stop is the way we roll
<jrtc27>
and the only thing you really can do at that point
<sorear>
if you're relying on caps for something, and you don't have temporal safety, I think you've already lost because you can hold on to the "payload" cap after the ref count hits zero
<jrtc27>
agreed
<jrtc27>
well, depends on your threat model
<xentrac>
sorear: no apology necessary, you have no obligation to explain yourself to me. i'm very interested in the topic of how to avoid having hypervisor vulnerabilities, and if you have insights I can learn from, I'd be delighted, but you don't owe me any of your time
<jrtc27>
but it's certainly important for stopping a lot of memory safety vulnerabilities from being exploitable
<jrtc27>
anyway, I hear birds chirping, must mean it's time for me to sleep
<dh`>
here at this time of year they start up at like 3am, it's crazy
<dh`>
not that this invalidates the conclusion
<xentrac>
not sure about the efpga thing. the cheapest microcontroller is 3¢, the cheapest 32-bit microcontroller is something like 140¢, and the cheapest FPGA is more like 180¢
<xentrac>
interestingly the 12¢ version of the 3¢ microcontroller has a totally different answer to interrupt latency problems
<sorear>
mouser has the ice40lp384 for 120¢ @ 1000 and that's a couple times bigger than what I have in mind
<xentrac>
oh cool, that's cheaper than digi-key (@100)
<jrtc27>
(also, it's interesting to note that this "optimisation" does have noticeable side-effects: weak_ptr's have to keep the control block alive after all shared_ptr's are gone, which means shared_ptr(new Foo) can delete the pointer and thus free the memory for Foo, but make_shared() can't free the storage without also freeing the control block, so can only run the destructor for Foo, keeping the memory still around until all weak references disappear)
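The observable half of that lifetime split can be demonstrated directly (the "storage stays allocated until the last weak_ptr dies" part is internal to the allocator, so it lives in a comment rather than an assert):

```cpp
#include <cassert>
#include <memory>

static bool destroyed = false;
struct Big { ~Big() { destroyed = true; } };

// Once the last shared_ptr goes away the destructor runs and the weak_ptr is
// expired -- but with make_shared the combined object+control-block storage
// can only be returned to the allocator when the last weak_ptr is gone too.
void demo() {
    auto sp = std::make_shared<Big>();
    std::weak_ptr<Big> wp = sp;
    sp.reset();                   // destructor runs here...
    assert(destroyed);
    assert(wp.expired());         // ...but wp still pins the allocation
    assert(wp.lock() == nullptr);
}                                 // wp gone: only now can storage be freed
```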
<xentrac>
although maybe that's because they don't have the lp384 in stock, just the ul640
<xentrac>
does mouser also have cheaper stm32 clones than that?
<sorear>
but at this level you're mostly paying for the package, not the circuit
<sorear>
didn't look
<xentrac>
mostly but an avr is still 40¢
<xentrac>
the 12¢ padauk chips use round-robin hardware multithreading with the idea that you can dedicate one of the threads to busy-waiting on your I/O when necessary, so you have worst-case response of 125ns with a 16MHz clock
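The arithmetic behind that 125ns figure, assuming two hardware threads scheduled round-robin so the busy-waiting thread runs every other cycle:

```cpp
// 16 MHz clock -> 62.5 ns per cycle; with 2-way round-robin the polling
// thread gets a slot every 2 cycles, so worst-case response is 125 ns.
constexpr double clock_hz = 16e6;
constexpr double cycle_ns = 1e9 / clock_hz;      // 62.5 ns
constexpr int    threads  = 2;                   // assumed round-robin depth
constexpr double worst_ns = threads * cycle_ns;  // 125 ns
static_assert(worst_ns == 125.0, "matches the figure quoted above");
```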
<xentrac>
similar to the propeller or the ga144
<xentrac>
hard to beat that with interrupt response. but it's also hard to represent as a checkmark in an arm vs. brand-x comparison table
<sorear>
i feel like that approach makes more sense than trying to do anything with multi-level interrupts
<xentrac>
me too
<xentrac>
jrtc27: it's dismaying that weak references can retain the control block indefinitely, particularly if it's part of the same allocation that contains Foo, which could thus be arbitrarily large
<xentrac>
it's not really a new approach i guess, it's how the cdc 6600 did i/o too
<xentrac>
with the costs I was just thinking that maybe risc-v microcontrollers will continue to be cheaper than fpgas for a significant amount of time
gector has joined #riscv
<xentrac>
(or start to be, I guess; not sure how much a GD32VF103 is but it's probably not <120¢)
davidlt has joined #riscv
frost has joined #riscv
davidlt has quit [Ping timeout: 268 seconds]
<GreaseMonkey>
hmm, is there an SVD for the FU740?
<xentrac>
solrize: this is a good point. also btw the RP2040 has a totally different way to do low-latency response things called "pioasm" that doesn't require fast interrupt response and is also cheaper than an FPGA, called . or would be if you could buy it separately (is there anything similar out there as a standalone chip)?
<xentrac>
I haven't actually tried pioasm so I don't know if it's as nifty as it sounds
mahmutov has joined #riscv
<xentrac>
ugh, editing fail
<xentrac>
two PIO blocks each containing four state machines, each with four 32-bit registers, that share a 32-instruction program memory; each state machine is connected to the AHB-Lite bus through a 4-deep FIFO each way, and then the state machines can twiddle and read I/O pins
<xentrac>
it looks like if you wanted to program a quadrature encoder or a PDM generator or SDIO or something it would be a lot easier to do with this pioasm coprocessor than with a CPLD and maybe even easier than with a general-purpose computer
choozy has joined #riscv
wingsorc has quit [Quit: Leaving]
chronon has joined #riscv
<xentrac>
(they can also tug on the main processor's IRQs, of course)
<xentrac>
(hmm, PDM might be beyond its capacity...)
Sos has quit [Ping timeout: 250 seconds]
Sos has joined #riscv
<solrize>
if you look at an rp2040 die shot, it is almost all ram arrays, so no reason they couldn't have made the pio's much more powerful or added realtime risc-v cores or whatever. maybe some future chip will do that. the beaglebone has realtime coprocessors (PRU's) for stuff like that. There are two of them each with 32-bit registers and 8k of ram partly shared with the host cpu
<solrize>
i think the chip is also in the $1 range
<xentrac>
sweet :)
<xentrac>
I think the only ALUish operations in pioasm are bit shifts, decrements, and zero tests, but I'm not done with the datasheet yet
<xentrac>
so I don't think you can do, say, a digital differential analyzer in it
<xentrac>
but you can drive a lot of common wire protocols in programs that are only two to six instructions long, at the chip's full clock speed and in lockstep between the state machines
<xentrac>
oh I guess there's an equality test too
<xentrac>
I think this is the first ISA I've seen designed after 01980 with an XEQ instruction (but called OUT EXEC)
<jrtc27>
"Besides making people’s eyes bulge"
<jrtc27>
heh, indeed
<jrtc27>
out exec and mov exec are not normal things...
<xentrac>
haha
<xentrac>
I guess nowadays single-stepping through ROM for debugging is usually handled with an 8086-style trace flag or something?
<xentrac>
I think that was one of the original uses for XEQ
<jrtc27>
yes, if an ISA is to be taken seriously it should have hardware single-step and breakpoint functionality
<jrtc27>
RISC-V only partially fulfils that
<jrtc27>
in that its hardware debug support only exists for bare-metal debugging, there's nothing yet that operating systems can use
<xentrac>
that's gonna be a problem if you're running a user task from ROM, which I guess is a thing you might want to do
<xentrac>
these days RAM is faster, right? so if you have enough RAM to copy the code into, it won't make the user task run *slower*, and then you can single-step and breakpoint by dropping little turds into its instruction stream and fence.i'ing them
<dh`>
yes but execute-in-place is desirable
<xentrac>
yeah, I'm just saying, it sounds like a pain in the ass to implement in software, but it sounds like it's at least not impossible?
<jrtc27>
you can do it in software with a tiny little buffer
<jrtc27>
you just have to be careful about things that read pc (ie jalr and auipc, plus potentially exceptions)
<xentrac>
and you also have to fence.i your buffer updates, right?
<xentrac>
I guess all those fence.is could get pretty expensive if you're single-stepping under program control instead of interactively
<jrtc27>
fence.i is going to be lost in the noise
<xentrac>
cool, I was thinking of the ridiculous cost of full cache flushes on old MIPS
<xentrac>
naturally enough none of the risc-v specs tell you how fast things are
<xentrac>
so I thought modern RISC-V implementations might run into the same kind of swamp with heavily self-modifying code like that
<xentrac>
you extend the tiny-little-buffer/be-careful approach a little further and before you know it you're writing qemu
<dh`>
well
<jrtc27>
depends on the core...
<dh`>
single-stepping one instruction at a time so the debugger can step one line
<dh`>
is usually pretty slow
<dh`>
but mostly you don't notice
<xentrac>
yeah, the case where I noticed that kind of thing recently was when I tried GDB's reverse-debugging via record-and-replay thing a couple of months ago
<xentrac>
in theory it's magic: you run the program until you reproduce the bug, locate the problem in memory, set a watchpoint, and execute backwards to see how that memory got that way
<xentrac>
and, yeah, instruction-wise record-and-replay via ptrace isn't ever gonna be fast, right? you pay a few hundred nanoseconds in context switch overhead for every instruction
<xentrac>
so maybe your 15-millisecond program will take 15 seconds to run. that's totally fine!
<xentrac>
but GDB never ceases to find new ways to disappoint me, because actually it took 20 minutes, which is not fine at all
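The expectation vs. reality arithmetic, spelled out (the ~1µs-per-instruction figure is an assumed round number for a ptrace round-trip, vs. ~1ns natively):

```cpp
// A ~1000x slowdown turns 15 ms into 15 s, roughly what a microsecond of
// ptrace overhead per instruction predicts. The observed 20 minutes is a
// slowdown of about 80,000x instead.
constexpr double native_s   = 15e-3;
constexpr double expected_s = native_s * 1000;   // ~15 s
constexpr double observed_s = 20 * 60;           // 1200 s
static_assert(expected_s > 14.9 && expected_s < 15.1, "predicted");
static_assert(observed_s / native_s > 79999 && observed_s / native_s < 80001,
              "observed slowdown");
```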
<dh`>
heh
<xentrac>
also, 4 gigs of RAM, which is okay
<xentrac>
but wouldn't always be
<xentrac>
If I had to prioritize, hardware watchpoint support seems a lot more important than hardware breakpoint support
<xentrac>
(when your memory is virtual, anyway. if you have to XIP then not having hardware breakpoints will make baby xentrac cry)
<xentrac>
just because software watchpoint support usually has the same kind of slowdown as record-and-replay
<dh`>
right
<jrtc27>
rr gets iffy for lr/sc
<jrtc27>
or any kind of interesting concurrency tbh
<dh`>
I was talking to someone about that a while back and iirc I brought their concerns here and nobody was very interested :-)
<solrize>
gdb supports hw watchpoints but i didn't know it had reverse execution at all
<dh`>
my recollection is that I tried it and found it didn't actually work
<dh`>
but that was some time back
<jrtc27>
it's easier on x86 where you don't have lr/sc and have a relatively strong memory model
<jrtc27>
or, perhaps, not easier, but you can get away with things more as you're less likely to hit issues
<jrtc27>
good luck doing rr on a concurrent process on alpha
<xentrac>
just to be clear, I wasn't using rr-project, which has a GDB interface and reportedly works a lot better; I was using GDB's internal record-and-replay functionality
<xentrac>
not sure whether you meant rr-project by "rr", or the generic functionality it provides
<xentrac>
I agree that concurrency is a huge problem, even without lr/sc, and I think that's where most of the effort goes
<xentrac>
dh`: I had to fight with it a lot to get it to work
choozy has quit [Remote host closed the connection]
<dh`>
in a sense one of the reasons record/replay is interesting is specifically to capture particular concurrent executions
<solrize>
as this is #riscv i wonder if it's possible to make a special riscv cpu in an fpga, that does the trace recording automatically, making an undo record for each instruction as it runs
mahmutov has quit [Ping timeout: 258 seconds]
<dh`>
probably
<xentrac>
then set args; start; set record full insn-number-max 2000000; record; c
<dh`>
the question is where you stream the undo log to
<solrize>
dram
<xentrac>
ideally, a SAN
<solrize>
or if it's only 200000 and it's a big fpga then maybe on chip ram blocks
<xentrac>
(perhaps obviously the above gdb line was on amd64)
<xentrac>
IIRC GDB also doesn't support streaming the undo log out to disk
<jrtc27>
we don't have it on our riscv cores currently, but our mips core could log a trace of every instruction to a circular buffer that you could trigger on various conditions and then dump out over jtag
<xentrac>
nice! including data read and written, or just the instructions?
<solrize>
yeah that would work, especially if the trace output has enough data to reverse the insn
<jrtc27>
yes, it'd include register and memory writes
<xentrac>
fabulous
<xentrac>
one of gdb's record backends is an Intel feature that records only branches
<jrtc27>
more used to find hardware issues than software issues though :D
<xentrac>
I guess if you were designing a CPU to maximize replayability (rather than, say, efficiency) you'd default to XCHG rather than MOV
<xentrac>
so the occasional OVERWRITE or MUL instruction would be logged, but the XCHGs and ADDs and SUBs wouldn't
<xentrac>
unless you were trying to find hardware problems, of course
valentin has quit [Remote host closed the connection]
Andre_H has quit [Ping timeout: 250 seconds]
<xentrac>
heh, there's a section in the datasheet about how to perform an addition with pioasm
<xentrac>
of the rp2040
<xentrac>
"A full 32-bit addition takes only around one minute at 125 MHz. The program pulls two numbers from the TX FIFO and pushes their sum to the RX FIFO, which is perfect for use either with the system DMA, or directly by the processor."
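Sanity-checking the "around one minute" claim: the PIO has no adder, so the datasheet program loops a decrement, worst case on the order of 2^32 iterations (the ~2 cycles per iteration here is an assumption, not from the datasheet):

```cpp
constexpr double clock_hz = 125e6;
constexpr double iters    = 4294967296.0;            // 2^32
constexpr double seconds  = iters * 2.0 / clock_hz;  // ~68.7 s
static_assert(seconds > 60.0 && seconds < 75.0, "order of a minute");
```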
<xentrac>
and yet you can do PWM in 7 instructions, I²C in 19 instructions (with DMA, and even clock stretching!), and the WS2812 protocol in 4 instructions
<xentrac>
and apparently there's an example in the SDK book that uses pioasm to get a 125Msps logic analyzer, piping the data into RAM via DMA
<dh`>
heh, only one minute
<dh`>
that also sounds like it'd be a fun widget to muck about with formal verification for