<cr1901>
And then spent a large amount of time debugging "huh, writes don't work... why?"
<cr1901>
You can reasonably argue, "yes I should know better and I should've read the docs". But watching Number Go Down is fun :).
<whitequark[cis]>
ah, the C approach to optimization
<whitequark[cis]>
"who cares if it's correct so long as it's fast"
<whitequark[cis]>
sorry, I mean the "hacker news post about C" approach to optimization
<whitequark[cis]>
I have no idea why the hell alignment=2 helps but I have a suspicion
<whitequark[cis]>
it adds registers on the datapath and maybe that breaks up comb paths that would otherwise synthesize into too many LUTs because abc does not care about area very much
<whitequark[cis]>
can i clone this somehow?
<cr1901>
sure... git clone doesn't work?
<cr1901>
I feel like I misunderstand the question
<whitequark[cis]>
I don't know where to clone from
<cr1901>
Do you mind if I go on a related tangent while it's on my mind?
<whitequark[cis]>
sure!
<cr1901>
Regarding -dff: I now believe abc9 -dff was, in fact, correct to optimize the design down to nothing, and abc9 without -dff just couldn't prove it.
<cr1901>
If csr.Decoder's addr_width is too big when attached to a wishbone bus, it'll generate addresses to the subordinate bus that are impossible for downstream peripherals to match.
<whitequark[cis]>
ahhh
<cr1901>
I didn't save the traces I saw of this behavior yesterday
<cr1901>
I'll see if I can dup
<cr1901>
Anyways, if no peripherals can be read/written, abc9 decides "well, all this logic is useless", and optimizes the design away
<cr1901>
I just dup'd. If you wish to try this yourself, change line 380 of attosoc.py _after applying my patch_ to "periph_bus = csr.Decoder(addr_width=25,"
<cr1901>
(Half of me typing this out is for future-me anyway :P)
<cr1901>
> for a single write to a single 8-bit register to succeed you would have to write to these two addresses <-- I am not accounting for this
<cr1901>
oooooh wait
<cr1901>
disregard please
<cr1901>
whitequark[cis]: I took one of my custom tests that tests the SoCs, and I rewrote it to play around and look at waveforms. This, along w your help (tyvm!), will let me iterate quickly on good, bad, and ugly alignments
<cr1901>
Probably what I should've done hours ago...
<cr1901>
So I don't have to remember those pesky opcodes :P
<cr1901>
Also, just lost 20 minutes because I was doing a "sw" on an unaligned boundary and forgot "oh wait, my core doesn't implement unaligned stores" lmfao
<cr1901>
jfng: Now that I've figured out what I've been doing wrong, I've tried again: without alignment tricks, it's not possible to fit an amaranth-soc into the icestick's 1280 LUTs using the CSR bus. I'm hovering in the low 1320s for any combination of address decoding that doesn't optimize out parts of the design. I guess the atomic stuff is just too much overhead.
<whitequark[cis]>
you are only using csr.Elements that are 8 bit wide, is that correct?
<cr1901>
yes
<whitequark[cis]>
then the atomic stuff never comes into play.
<whitequark[cis]>
why do you think it adds too much overhead? have you verified that the overhead comes from it?
<whitequark[cis]>
it is the increase of alignment that starts to make use of the atomic writes, so what you're describing doesn't make a lot of sense in principle
<whitequark[cis]>
it sounds like there is some other change that is happening to your design that saves so many LUTs that it works even despite the overhead of the atomic transactions
<cr1901>
Well I took out all alignment in my code except alignment=25 in wishbone.Decoder
<cr1901>
>why do you think it adds too much overhead? Because I have a wishbone version of the SoC that fits, and the only thing I changed was adding a Wishbone2CSR bridge
<whitequark[cis]>
there is also the overhead of CSR decoding
<cr1901>
I was under the assumption that the overhead from using Wishbone2CSR (extra decoding) would be offset by WB xactions being more complex
<whitequark[cis]>
more complex?
<cr1901>
At least, CSR doesn't need to ack the CPU
<cr1901>
doesn't need to check for CYC & STB & ~WE & SEL
<whitequark[cis]>
that check is uh, one LUT
<whitequark[cis]>
so if you have more than one LUT worth of decoding it's net negative
<whitequark[cis]>
the main advantage of the CSR bus is that it's narrower and therefore the data path has fewer ... everything on it
<whitequark[cis]>
(in terms of size, that is)
<cr1901>
I guess my design is such that I can't overcome the initial overhead to start seeing space savings
<whitequark[cis]>
yes, exactly
<whitequark[cis]>
it also sounds like there's something else going on
<whitequark[cis]>
but it's hard to see without digging into your design
<cr1901>
My WB version has registers per SEL line instead of at incrementing addresses
<cr1901>
so you can do "store-word" and it would write/read all registers at once
<cr1901>
(so "don't do that :P")
<cr1901>
And I finally made the design consistently fit by increasing the size of the memory chunk allocated to each peripheral (to reduce the amount of decoding logic)
<cr1901>
I assumed that I could effectively use the same trick for csr.Decoder, but I think the synthesizer either has less opportunity to optimize OR saw decode optimizations without me having to "guide" it
<cr1901>
whitequark[cis]: https://github.com/cr1901/sentinel/blob/refactor/examples/attosoc.py This is the version of the SoC that works. I did want to test jfng's PR; I think the idea of a full amaranth-soc fitting onto icestick using proper design principles is really cool lol
<cr1901>
But maybe a boneless SoC can fit that niche
<whitequark[cis]>
and boneless isn't even very small yet
<cr1901>
I'm gonna go back to alignment tricks and see what happens. Definitely weird as hell that those work at all tho
<whitequark[cis]>
have you looked at what consumes your area?
<cr1901>
No, and tho I wrote a script a while back to tally which module consumes which resources, I didn't find it reliable, so I took it out of the repo
<whitequark[cis]>
I don't think that blindly trying things at random and checking if anything changes is the approach to take here
<whitequark[cis]>
(funnily, someone made a machine learning script to find out which vivado options save the most area, and after many years of compilation it found, drumroll, the "optimize for area" flag and not much else)
<whitequark[cis]>
(but the authors got a paper out of it so hey)
<cr1901>
Haaaaaa...
<Darius>
whitequark[cis]: still useful to test the null hypothesis..
<whitequark[cis]>
Darius: not if it changes behavior to something non-equivalent
<Darius>
that's why you have bench tests right? :D
<cr1901>
> I don't think that blindly trying things at random and checking if anything changes is the approach to take here. Fiiiine, I did some measurements; the CSR version performs worse on all relevant modules except the decoder, which is 3 measly LUTs bigger in the wb-only version
<whitequark[cis]>
what makes them bigger?
<cr1901>
I don't know rn; these modules are small enough I could do yosys show commands to look them over. Or run synth_ice40 halfway, write out the verilog for both versions, and diff the two
<cr1901>
(I really need to spin out the scripts I have for this into a CLI)
<whitequark[cis]>
they should
ravenslofty[m] has joined #amaranth-lang
<ravenslofty[m]>
(do Discord replies go through the bridge to Matrix or wherever?)
<whitequark[cis]>
try replying to this?
<ravenslofty[m]>
test
<whitequark[cis]>
they do not
<cr1901>
I got the test reply
<whitequark[cis]>
i thought they did, huh
<tpw_rules>
i did too
<whitequark[cis]>
oh, i thought ravenslofty was asking whether they're formatted as replies or just appear standalone
<whitequark[cis]>
(they go through but do not appear as replies)
<ravenslofty[m]>
I was going to reply to "(but the authors got a paper out of it so hey)" but then wondered if that context would be preserved through the bridge or if it would appear out-of-context
<ravenslofty[m]>
anyway: I am aware of a certain company which proudly admits they have gotten Yosys to be competitive with commercial tooling in area
<ravenslofty[m]>
and their secret ingredient is looping abc calls
<whitequark[cis]>
right
<whitequark[cis]>
is it antmicro
<ravenslofty[m]>
it is not antmicro
<ravenslofty[m]>
I have respect for antmicro
<ravenslofty[m]>
I do not have respect for this company
<whitequark[cis]>
i see
<ravenslofty[m]>
not least because I used to work for its CTO, who traumatised me enough that the subject of technology mapping is something I do not feel comfortable working on still >.>
<cr1901>
whitequark[cis]: You have made your point re: measuring... I blackboxed top.leds and then top.leds.mux. Compared to the "good" WB version of the leds peripheral, I save a whopping... 0 FFs and 9 LUTs
<cr1901>
The mux brings in 20 FFs and 24 LUTs
<whitequark[cis]>
the mux being?
<whitequark[cis]>
CSRMultiplexer?
<cr1901>
yes
<whitequark[cis]>
I wonder why it adds flops. is that with alignment=1?
<ravenslofty[m]>
(though on the subject of technology mapping, ABC9-by-default is officially in Yosys 0.36, by virtue of nobody filing any bugs)
<cr1901>
well, the default alignment is 0, but yes. self.mux = csr.bus.Multiplexer(addr_width=2, data_width=8, name="gpio")
<whitequark[cis]>
it's because CSR reads are registered
<whitequark[cis]>
the CSR bus captures the value on the same cycle the read is issued, and gives it back the next cycle
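[A minimal Amaranth sketch of the registered read just described, to show where the extra flops come from; this is not amaranth-soc's actual implementation, and the signal names are made up. The register value is captured on the cycle the read strobe is asserted and presented on the bus one cycle later:]

    from amaranth import Module, Signal

    m = Module()
    r_stb  = Signal()    # read strobe from the CSR bus
    reg    = Signal(8)   # register backing a csr.Element
    r_data = Signal(8)   # bus read data; valid the cycle *after* r_stb

    with m.If(r_stb):
        m.d.sync += r_data.eq(reg)  # captured now, visible next cycle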
<ravenslofty[m]>
catherine: if it's a hint, this company also said they would leverage the power of chatgpt for hdl.
<cr1901>
ahhh, I remember reading that re: registered
<whitequark[cis]>
I only know of some researchers using chatgpt for hdl
<cr1901>
The timer peripheral is the same size in both versions. My guess is that b/c it's only a single read-only register, the CSR version becomes morally equivalent to the wishbone version.
<tpw_rules>
whitequark[cis]: is that guaranteed for all peripherals?
<whitequark[cis]>
I think so (but you should read the docs)
<tpw_rules>
i've been mulling the design of a simple axi to csr converter and that's an important thing to know (i know what the timings are but it's good to know where the register is)
<whitequark[cis]>
oh, you mean the registered nature of reads?
<whitequark[cis]>
yes, that is a property of the CSR bus
<tpw_rules>
no, i mean the fact that the read data connects directly to the register
<cr1901>
I think my question for jfng on Friday will be "will it be possible to use a subset of the CSR register API for devices on a wishbone bus"?
<cr1901>
Also, I lied about registers sharing SEL lines. Looks like I was smart enough to move each register to its own 32-bit chunk (using SEL[0] to select)
<whitequark[cis]>
the CSR bus consciously makes a design choice for predictable latency and atomicity at the cost of some resource consumption
<whitequark[cis]>
I think there's nothing stopping you from writing your own version of the CSR multiplexer that just shoves CSR elements on the Wishbone bus
<cr1901>
I thought about that; I feel like the CSR register API could nominally be bus neutral. It might confuse ppl who look at my code and notice that I gave up the atomicity requirement ("so it's not CSR then, is it?")
<whitequark[cis]>
I mean, if it's not atomic it's not CSR
<whitequark[cis]>
the reason CSR bakes in atomicity guarantees is that torn accesses should not be possible, ever
<whitequark[cis]>
if you somehow end up with torn accesses, for any reason, you're basically fucked. in a lot of the cases you need to throw out your ASIC and respin it
<whitequark[cis]>
in some others there are awful and unreliable software workarounds, but this is how you end up with many pages of errata
<whitequark[cis]>
I think we might allow, via an option that enforces the precondition, the use of CSRMultiplexer with no shadow registers at all when you don't have any elements bigger than the bus, at the cost of slightly changed access latency
<cr1901>
You should see the recommended software loop for reading the machine time in riscv
<cr1901>
(yes it allows torn reads)
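[For reference, a minimal Python model of the loop cr1901 mentions; the spec's version is RISC-V assembly, and read_lo/read_hi are hypothetical accessors for the two 32-bit halves of the 64-bit mtime register. The high half is re-read until it is stable, because a carry propagating between the two loads would otherwise tear the value:]

    def read_mtime(read_lo, read_hi):
        while True:
            hi = read_hi()
            lo = read_lo()
            if read_hi() == hi:  # no carry crossed the 32-bit boundary
                return (hi << 32) | lo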
<whitequark[cis]>
I have
<whitequark[cis]>
this is exactly why the Amaranth CSR bus has such a strong stance
<tpw_rules>
cr1901: link?
<cr1901>
tpw_rules: Gimme a sec, muxing _badly_ here
<cr1901>
> via an option that enforces the precondition <-- can you elaborate? 1/2
<cr1901>
>I think we might allow <-- I would be very interested in this. Failing that, I would also be interested in a register API subset for wishbone. The stuff that csr.Multiplexer automates for memory maps is valuable.
<whitequark[cis]>
I highly doubt a register API subset for Wishbone would be merged, and I personally would vote -1 on it (though it's JF's decision)
<whitequark[cis]>
we will (provided metadata RFC gets merged) expose enough hooks for you to eventually be able to talk to the BSP generator yourself and get the same results as with the CSR API
<whitequark[cis]>
other than that, we should focus on improving the CSR stuff rather than just giving up and adding yet another API and creating confusion
<cr1901>
Well, right now, I need to put the CSR stuff aside. I want the demo to fit into icestick, and right now I am unprepared to support two versions of the SoC (one wishbone, one CSR). If there's an option to disable shadow registers, _and/or I can find high-hanging fruit for size optimizations_, it'll be nice to revisit it.
<whitequark[cis]>
tbh I think the icestick is pretty useless
<cr1901>
That's your prerogative
<whitequark[cis]>
(which influences what I think resources should be spent on support of)
<whitequark[cis]>
it's not even cheap! it's a $50 devboard with one of the smallest production FPGAs and barely any IO. I think it doesn't even have a button?
<cr1901>
Was it that expensive? I've had mine since... 2016?
<cr1901>
I forget
<whitequark[cis]>
I think it was half as expensive before
<whitequark[cis]>
it's still really small. in software terms, it's a bit like how some people try to squeeze Linux into half a megabyte of RAM
<cr1901>
So, a much more useful small-FPGA application would be the machxo2 1200HC. Same number of LUTs as the ice40hx1k, but fewer constraints on sharing LUTs and FFs. I suspect I can get 100 more SLICEs out of that, not to mention that there's hard IP for SPI/I2C and a timer over an 8-bit wishbone bus
<whitequark[cis]>
could Linux do better at the very low end of memory consumption? yes, pretty sure (though I personally found it difficult to cut it down below half a meg of code size, which seems excessive, but I cannot prove that it is). should it be? honestly no, use any OS kernel with less ambition for that
<whitequark[cis]>
(if you really have to use an OS kernel that is--I don't think anyone ever anticipated running a 32-bit core on one of these FPGAs)
<cr1901>
lmao, I don't implement enough of the extensions to run a kernel. It's purely a microcontroller core
<cr1901>
ice40hx1k is the Lowest Common Denominator of FPGA I could feasibly target, so I wanted to make a valiant effort, knowing that any other target will be easier. I made the RV core that was fun for me to make.
<whitequark[cis]>
I was thinking of noMMU Linux
<whitequark[cis]>
(I evaluated what it would take to run noMMU Linux on Minerva, which is a similar sort of goal as yours)
<whitequark[cis]>
(though I wanted to run it on CXXRTL, which still benefits from simplicity of the logic, of course)
<cr1901>
Someone did that w/ AVR emulating an ARM, and driving a DRAM from the GPIO
<whitequark[cis]>
I think they used SIMMs
<whitequark[cis]>
oh that's still DRAM
<cr1901>
yea, simpler async DRAM, but DRAM. But you know the video I mean
<cr1901>
I think noMMU Linux would be cool, but if you try doing it w/ my core, I would have several questions
<whitequark[cis]>
I figured out that you can't reasonably build Linux for 32-bit RISCV
<whitequark[cis]>
which is ... interesting
<whitequark[cis]>
I certainly haven't expected that
<cr1901>
gatecat did it once, but Idk what bullshit she went thru to do it (nextpnr was running on RV32 synthesizing itself)
<whitequark[cis]>
they only target riscv64gc, and while you can cut down on the gc, it's much harder to cut down to the 32
<whitequark[cis]>
iirc
<cr1901>
what's the minimum? RV32IA?
<cr1901>
err 64*,
<whitequark[cis]>
officially RV64GC, but in practice I was able to build it for RV32IA, yes
<whitequark[cis]>
and of course the atomics could be no-op'd
<whitequark[cis]>
s/RV32IA/RV64IA/
<cr1901>
Yea, and do that kernel-helper thing like on ARMv4
<whitequark[cis]>
you don't need that if you have one core
<whitequark[cis]>
atomics become just normal loads/stores
<whitequark[cis]>
oh, for userland? I dunno just run sed before gas or something
<whitequark[cis]>
I think gcc might have support for that already
<cr1901>
that's cursed. And yea, I meant for userland; there's a kernel helper that detects if you get preempted in a critical section. Only works for one core, but avoids disabling interrupts
<whitequark[cis]>
well
<whitequark[cis]>
any aligned load is atomic on a single-core CPU, no?
<whitequark[cis]>
oh wait, riscv has ll/sc
<cr1901>
yes (I would hope any aligned load <= word len is atomic period)
<whitequark[cis]>
yes, I wasn't thinking
<whitequark[cis]>
yeah ok I have completely neglected to think of what happens if the userland wants an atomic rmw, oops
<whitequark[cis]>
probably because I never got around to building the userland
<cr1901>
For linux, you would port the kernel helper (if someone hadn't done so); for bare-metal, you disable interrupts for single core. For multicore, "do whatever RP2040 does in hardware :P?"
<cr1901>
(Idk what RP2040 does... is it basically a lock-free queue in h/w?)
<whitequark[cis]>
oooh, just got record/replay to work with CXXRTL
<whitequark[cis]>
I saved an incremental replay log of 10000 cycles of SoC simulation, then replayed it, and started toggling the clock
<whitequark[cis]>
the simulation has continued as if from the 10001st cycle
Wanda[cis] has joined #amaranth-lang
<Wanda[cis]>
<whitequark[cis]> "oooh, just got record/replay..." <- ha, good job!
<hexastorm>
I have a simple question to which I could not find the answer;
<hexastorm>
if I want to multiply a signal value by a fixed constant, e.g. divide by 180, I can achieve this by adding the following sequence: (value >> 8) + (value >> 10) + (value >> 11) + ...
<hexastorm>
A problem I have is that I now have to manually write this out and figure out the correct bitshifts. Is there a faster way?
<hexastorm>
typo: the sequence is (value >> 8) + (value >> 10) + (value >> 11); e.g. (180_000 >> 8) + (180_000 >> 10) + (180_000 >> 11) = 965, which is approx 180_000 / 180
<whitequark[cis]>
can you write this in pure Python?
<zyp[m]>
this is effectively just a fixedpoint multiplication with 0b0.0000_0001_011
<hexastorm>
in code it looks something like this m.d.sync += countsperdegree.eq((hallcntr >> 8) + (hallcntr >> 10))
<hexastorm>
or could I also write m.d.sync += countsperdegree.eq(hallcntr / 180)
<whitequark[cis]>
/ is unlikely to synthesize very well
<whitequark[cis]>
it's just present for simulation
<hexastorm>
okay, so each time I have to figure out the correct bitshifts?
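[One possible pure-Python answer, along the lines whitequark suggests: compute the shift amounts from the constant at elaboration time instead of finding them by hand. shift_approx is a hypothetical helper, not an Amaranth API:]

    def shift_approx(frac, max_shift=12):
        """Greedily pick shifts s so that sum(2**-s) approximates frac."""
        shifts, acc = [], 0.0
        for s in range(1, max_shift + 1):
            if acc + 2.0 ** -s <= frac:
                acc += 2.0 ** -s
                shifts.append(s)
        return shifts

    shifts = shift_approx(1 / 180)  # -> [8, 10, 11]
    # m.d.sync += countsperdegree.eq(sum(hallcntr >> s for s in shifts))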
<cr1901>
I probably conflated synth runs w/ one another yesterday night and convinced myself that one of the <1280-LC runs was actually working (when I, for instance, loaded the wrong bitstream)
<cr1901>
(alignment=2 _does_ save a LUT or 2, but not worth the other problems it has)
<cr1901>
the other problems it has in the context of how I was using it with sparse=True*
hexastorm has quit [Quit: Client closed]
<_whitenotifier-3>
[amaranth-lang/amaranth-lang.github.io] github-merge-queue[bot] 52e7faa - Deploying to main from @ amaranth-lang/amaranth@7db049f37f675758c08528277c5222cadb7ba9a9 🚀
<_whitenotifier-3>
[amaranth-lang/amaranth-lang.github.io] github-merge-queue[bot] d5bcfb6 - Deploying to main from @ amaranth-lang/amaranth@422ba9ea51855472e1ed50c3c6eb297a4bff446d 🚀