<jfng[m]>
<tpw_rules> "jfng: random Q, i noticed you..." <- yeah, it's just to explicitly indicate that the `pin_i_sync_ff` variable is no longer used past the body of the loop
<jfng[m]>
<tpw_rules> "jfng: maybe it should be renamed..." <- `offset_granularity` would be a more explicit name, but i'd still keep 8 as default (as it is what most users are familiar with)
<jfng[m]>
i.e. even with a 32-bit CSR bus, using an 8-bit granularity gives you consistent offsets between your implementation, documentation, etc
<tpw_rules>
but the only place it matters in implementation is the offset parameter to csr.Builder, and even then it is subject to restrictions
<jfng[m]>
yeah, and if you care about providing explicit offsets for your registers (e.g. for backward compatibility between silicon revisions), it gives you a single point of truth between implementation, documentation, testbenches, etc
<tpw_rules>
all of those things except documentation need the non-granularized offset though
<jfng[m]>
docs and testbenches and drivers would use the products of a BSP generator, but that offset could be directly obtained from the implementation
<jfng[m]>
tpw_rules: you'd probably want your firmware drivers to use 8-bit granularity, i think
<tpw_rules>
not if i'm writing boneless code
<jfng[m]>
right, and in any case it would be parameterized/translated by the BSP generator
<tpw_rules>
then why don't the BSP generator and the docs generator take a granularity? what does the peripheral have to say about it?
<tpw_rules>
or rather, why does the peripheral have a say in it?
<jfng[m]>
that's a good point
<tpw_rules>
like, there could be 10 different granularities between the user saying "store 3 to 0x42" and the peripheral in a sufficiently perverted bus architecture
<jfng[m]>
in a sense, what would really matter in `csr.Builder` is whatever is most convenient for the peripheral designers
<tpw_rules>
that's why i say default to data_width. they can't specify an offset with precision smaller than data_width//granularity anyway
<tpw_rules>
like, i have a concern that someone's going to say "okay great i'm gonna put all my registers at offsets that are multiples of 4" and needlessly break compatibility with a 64 bit bus
<jfng[m]>
jfng[m]: and then defaulting to `data_width` can become attractive; i'm still not sure that it is a better choice though
<jfng[m]>
it does make things much simpler to reason about
<tpw_rules>
i would think that adding registers in a consistent order and not specifying an offset is the best way to guarantee backwards compatibility
<tpw_rules>
also with it being data_width, nobody looking at the peripheral or using it on the FPGA side can use addresses that are not referenced to that. the 32 bit peripheral designer says "okay register 1 is at offset 4" but, say, logic or testbenches that access the peripheral use address 1
<jfng[m]>
also a good point
<tpw_rules>
also also the granularity is not compatible with the addr on e.g. csr.Decoder.add, and it's not really clear how it could be made so. that uses the data bus width too
<tpw_rules>
so to reiterate, my optimal proposal is to drop granularity from csr.Builder. it's only useful to a designer who knows the SoC CPU's granularity, is designing peripherals with a known data_width, and wants to use offsets when adding registers.
<tpw_rules>
the other proposal is to rename granularity to offset_granularity to emphasize what it affects, and default it to data_width so in the default case the offsets used line up with the rest of the FPGA design. that way someone could specify a different one, but they would have to know it could break data_width parameterization and confuse readers
<tpw_rules>
in both cases the granularity from csr.Builder should not be exposed to any generator; the generator should take in its own granularity from the perspective of what the artifacts are being generated for and scale the addresses appropriately
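A rough sketch of that proposal in plain Python (the `rescale` helper is illustrative only, not part of the actual amaranth-soc API): the peripheral records register offsets in `data_width` units, and each generator rescales them to whatever granularity its artifact needs.

```python
# Hypothetical helper a BSP/docs generator might use: convert an offset
# counted in data_width units into one counted in granularity-bit units
# (e.g. 8 for byte-addressed firmware headers).
def rescale(offset, data_width, granularity):
    assert data_width % granularity == 0
    return offset * (data_width // granularity)

# A 32-bit peripheral places register 1 at offset 1 (in 32-bit units);
# a C-header generator working at 8-bit granularity sees byte offset 4.
assert rescale(1, data_width=32, granularity=8) == 4

# The same register map reused on a 64-bit bus stays valid; only the
# generated byte offset changes.
assert rescale(1, data_width=64, granularity=8) == 8
```

This is the "single point of truth" jfng described, but with the scaling pushed entirely into the generators.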
<mcc111[m]>
Imagine I tell Amaranth to add a semi-large group of numbers— say a + b + c + d. will it do something smart like (a + b) + (c + d) or will it add these linearly?
<tpw_rules>
"arithmetics on Amaranth values never overflows because the width of the arithmetic expression is always sufficient to represent all possible results.". so it doesn't much matter. i think the toolchain will end up rearranging the logic. i expect the output verilog to be linear iirc
<Wanda[cis]>
mcc111[m]: this goes to the underlying synthesis tool unchanged
<Wanda[cis]>
yosys has a pass that will group these things, and do some pretty smart lowering
<Wanda[cis]>
(basically: there is only one adder in the final netlist, and a bunch of 3-to-2 compressors)
<Wanda[cis]>
I'd expect any other reasonably smart synth tool to do the same, it's a well-known technique
<galibert[m]>
what's a compressor in that context?
<Wanda[cis]>
(or something equivalently smart; for Altera FPGAs the hardware can actually do 3-way addition natively-ish)
<Wanda[cis]>
3-2 compressor is a circuit that takes 3 numbers (a, b, c), and outputs 2 numbers (e, f)
<Wanda[cis]>
such that a + b + c == e + f
<Wanda[cis]>
it's cheaper than addition, and doesn't have a long carry chain, it's just simple per-bit gates
<galibert[m]>
how does that work?
<tpw_rules>
i thought expressions were broken up into intermediate variables in verilog
<Wanda[cis]>
so when you have a bunch of numbers to add, you construct a DAG of these compressors smashing numbers together until there are only two left, then do a normal carry-chain addition on the final two
<galibert[m]>
I can see muxes if at least one bit is zero, but what if they're all one?
<Wanda[cis]>
(this is also how you lower multipliers, multipliers are multi-input adders with some masking)
<Wanda[cis]>
simple
<Wanda[cis]>
you use a full adder
<Wanda[cis]>
take bit-slice 0
<Wanda[cis]>
compute a[0] + b[0] + c[0] == x
<Wanda[cis]>
x is 2-bit
<Wanda[cis]>
you wire x[0] to e[0], x[1] to f[1]
<Wanda[cis]>
basically e is low-order bits of all single-bitslice sums, f is high-order bits of those
<galibert[m]>
ohhh, e is non-carry addition result, f is carries?
<Wanda[cis]>
(and you tie f[0] to 0)
<Wanda[cis]>
yuuup
<galibert[m]>
beautiful
<galibert[m]>
makes a lot of sense
<Wanda[cis]>
if you're on FPGA, that's two LUTs or one frangible LUT
<Wanda[cis]>
per bitslice
<Wanda[cis]>
with no long critical path
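The whole scheme above can be checked on plain Python integers (purely to show the dataflow; in hardware each compressor step is per-bit gates with no carry chain):

```python
# 3:2 compressor, exactly as Wanda describes it: e collects the per-bit
# sum bits (a XOR b XOR c), f collects the carry bits (majority function),
# shifted left one place so that a + b + c == e + f.
def compress_3_2(a, b, c):
    e = a ^ b ^ c
    f = ((a & b) | (a & c) | (b & c)) << 1
    return e, f

# Reduce many addends by repeatedly compressing three into two until only
# two remain, then do the single carry-chain addition at the end.
def add_many(nums):
    nums = list(nums)
    while len(nums) > 2:
        a, b, c = nums.pop(), nums.pop(), nums.pop()
        nums.extend(compress_3_2(a, b, c))
    return sum(nums)

# Identity check and a multi-input sum.
e, f = compress_3_2(3, 5, 7)
assert e + f == 3 + 5 + 7
assert add_many([3, 5, 7, 11, 13]) == 39
```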
<galibert[m]>
yeah, it's just beautiful
<galibert[m]>
the real adder is just used at the end to fold everything
<Wanda[cis]>
mhm!
<cr1901>
Searching for frangible LUT returns exactly 2 results on Google, one from wq, the other from lofty
<cr1901>
What _is_ a frangible LUT?
<galibert[m]>
cr1901: the pretend-LUT6 in the cyclonev labcells, to give one example, is in reality two LUT4 and four LUT3
<Wanda[cis]>
it may be fracturable LUTs?
<cr1901>
oh, so it's the "A LUT6 is two LUT5" thing Xilinx 7-series does?
<Wanda[cis]>
basically large LUTs with multiple outputs where the uhh LUTting power can be distributed between outputs in funny ways
<galibert[m]>
so you can use it as a LUT6, but you can also split it into parts and have things done in parallel
<Wanda[cis]>
yes
<Wanda[cis]>
Xilinx has a pretty simple construct of LUT6 or LUT5×2
<Wanda[cis]>
Altera has... something more complex
<Wanda[cis]>
but, same core idea really
<cr1901>
I actually really missed that feature on ice40 recently... turns out 2s-complementing an n-bit number requires N LUTs for the complement part. But with fracturable LUTs, I could do it in N/2 (_I think_)
<cr1901>
But I also wouldn't care about LUT counting if I wasn't targeting ice40, so...
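cr1901's point, checked numerically: two's-complementing an n-bit value is a per-bit invert followed by +1, i.e. `-x == (~x + 1) mod 2**n`. The invert stage is what costs one LUT per bit on ice40 (one output per LUT4); a fracturable LUT with two independent outputs could plausibly invert two bits per cell, which is where the N/2 estimate comes from (the +1 still needs a carry chain either way).

```python
# Two's complement of x as an n-bit value: invert every bit, then add one.
def twos_complement(x, n):
    mask = (1 << n) - 1
    return ((x ^ mask) + 1) & mask

assert twos_complement(1, 8) == 0xFF            # -1 as an 8-bit value
assert twos_complement(0x2A, 8) == (-0x2A) & 0xFF
```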