whitequark[cis] changed the topic of #amaranth-lang to: Amaranth hardware definition language · weekly meetings: Amaranth each Mon 1700 UTC, Amaranth SoC each Fri 1700 UTC · code https://github.com/amaranth-lang · logs https://libera.irclog.whitequark.org/amaranth-lang · Matrix #amaranth-lang:matrix.org
nelgau has joined #amaranth-lang
lf has quit [Ping timeout: 246 seconds]
lf has joined #amaranth-lang
cyrozap_ has quit [Quit: ZNC 1.8.2+deb3.1 - https://znc.in]
cyrozap has joined #amaranth-lang
<whitequark[cis]> so I discovered something interesting
<whitequark[cis]> remember how we discussed whether it should be possible to specify a signature as Stream payload?
<whitequark[cis]> I just found a use case for this
<whitequark[cis]> consider a registered IO buffer with clock enable
<whitequark[cis]> you could represent its signature with something like:... (full message at <https://catircservices.org/_matrix/media/v3/download/catircservices.org/NnmslbUChJWUIAZGShGKNnVF>)
<whitequark[cis]> the "valid" input is the clock enable for the IO buffer registers (whenever it's low, all the registers stop toggling, and data gets transferred neither way); and the "ready" output, while always 1 in most cases, allows you to do wait states which are useful for soft gearboxes
<whitequark[cis]> one example where a soft gearbox is useful in this context is what I call a "0.5x rate pin": suppose you have some protocol where you need to output a center-aligned clock (and the setup/hold constraints reflect that), but you don't want to use up a PLL, or maybe you don't want to phase shift with a PLL (say your clock frequency changes and your phase offset is in picoseconds). in that case you can output the clock at half the rate of the clock
<whitequark[cis]> input to the IOB, and then use DDR registers to output 0,1 for a posedge and 1,0 for a negedge
<whitequark[cis]> (this is how I'm currently wiring up HyperRAM on Glasgow. 8 bits/cycle @48M is plenty of bandwidth already and most applets don't need that much, but being able to free up one of the only two PLLs on that chip is very handy, plus you get to keep it all in a single domain)
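A minimal sketch of that half-rate clock output, assuming the pre-0.5 `Pin` interface returned by `platform.request(..., xdr=2)`; the resource name `"bus_clk"` is invented:

```python
from amaranth import Elaboratable, Module, Signal, ClockSignal

class HalfRateClockOut(Elaboratable):
    def elaborate(self, platform):
        m = Module()
        pin = platform.request("bus_clk", 0, xdr=2)  # DDR output buffer
        phase = Signal()                             # toggles every cycle
        m.d.sync += phase.eq(~phase)
        m.d.comb += [
            pin.o_clk.eq(ClockSignal()),  # DDR registers run in this domain
            pin.o0.eq(phase),             # 0,1 -> centered rising edge
            pin.o1.eq(~phase),            # 1,0 -> centered falling edge
        ]
        return m
```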
<whitequark[cis]> I think looking at IO buffer primitives from a stream perspective is very valuable, because it lets you apply the existing understanding and bag of tricks to something that would otherwise be a separate area
<whitequark[cis]> so basically allowing signatures as stream payloads allows one to make a stream that transfers, on each cycle, one word in either direction, but it still logically flows from place A to place B
<whitequark[cis]> however you would not be able to use most stream infrastructure with it, so i don't know if the actual Stream class should allow it; it exists for interop after all
Degi_ has joined #amaranth-lang
Degi has quit [Ping timeout: 255 seconds]
Degi_ is now known as Degi
nelgau has quit [Read error: Connection reset by peer]
peepsalot has quit [Ping timeout: 252 seconds]
peepsalot has joined #amaranth-lang
<zyp[m]> I'm not sure I fully understand the point, don't you run into issues with latency between the output and input signals?
<zyp[m]> I've written e.g. some SPI IO blocks with separate output and input streams, and the input stream would go valid a couple of cycles after the output stream to account for the latency in the IO registers
<whitequark[cis]> you do
<whitequark[cis]> this is useful for uniformly representing IO when soft gearboxes and other intermediaries are involved
<whitequark[cis]> as an extreme case, imagine doing SPI over BSCAN
<whitequark[cis]> you set valid, (a lot of stuff happens), you get back ready
<whitequark[cis]> it is useful to have a protocol core that can talk to this, and also normal SPI, with the same interface
<whitequark[cis]> also, I've long had the idea that output-to-input latency should be a property of the IO buffer rather than hardcoded; this is still something yet to be proven but it's a big thing missing in the current Pin interface
<whitequark[cis]> (well, output-to-pad and pad-to-input)
<galibert[m]> Something to add to your todo list on pins, a way to express "this output signal must change only after those other ones are settled", thinking CS on sram accesses there but I'm sure there are other cases
<galibert[m]> when the netlist changes everything simultaneously, at the end of a wishbone bus for instance
<whitequark[cis]> can you rephrase? I'm not sure what the condition you are describing actually is
<galibert[m]> sram that says "edge on CS once the address and r/w are set"
<whitequark[cis]> oh, a fully async interface
<whitequark[cis]> Amaranth isn't really built to handle those
<whitequark[cis]> it does handle them, as a matter of fact, but modern logic is synchronous
<galibert[m]> lemme find the datasheet again
<whitequark[cis]> I don't think the core language should have anything in it that specifically addresses fully async interfaces
<whitequark[cis]> it's too much work for too little gain
<galibert[m]> as far as I can tell that's a pretty standard sram
<galibert[m]> not something old or insane, it's in one of my cv eval boards
<whitequark[cis]> yep
<whitequark[cis]> SRAM like this is one of the very few places where fully async interfaces still exist. other notable examples include UART
<whitequark[cis]> just about everything else includes some sort of clock such that most of the other signals are referenced to that clock
<galibert[m]> I would be a little sad if sram interfacing was out of the scope of amaranth :-)
<whitequark[cis]> SRAM interfacing isn't because it's not special in any way, just an old parallel interface
<whitequark[cis]> but describing SRAM-related invariants isn't in scope
<whitequark[cis]> actually, mm
<whitequark[cis]> if I remember correctly, you can just hold CE low forever, tie nOE to nWE, and have essentially two timing arcs: address-to-data and address+data-to-update
<whitequark[cis]> well, nOE to !nWE
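A sketch of that wiring, assuming an `"sram"` resource with `nce`/`noe`/`nwe` subsignals and the pre-0.5 `Pin` interface (the names are assumptions):

```python
from amaranth import Elaboratable, Module

class AsyncSRAMGlue(Elaboratable):
    def elaborate(self, platform):
        m = Module()
        sram = platform.request("sram", 0)
        m.d.comb += [
            sram.nce.o.eq(0),            # hold nCE asserted (active low)
            sram.noe.o.eq(~sram.nwe.o),  # nOE = !nWE: read when not writing
        ]
        # what remains is address -> data (reads) and
        # address+data -> array update (writes)
        return m
```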
<whitequark[cis]> actually, hold on #2
<galibert[m]> that's ok, I'm writing annoying reports, the fpga is turned off :-)
<whitequark[cis]> galibert: for both read and write cycles, no delay is required between transitions on address and CE
<whitequark[cis]> page 7 note 3 for reads
<whitequark[cis]> page 8 row tSA for writes, illustrated on the next page
<whitequark[cis]> <galibert[m]> "Something to add to your todo..." <- so the SRAM doesn't actually require the condition you're talking about
<galibert[m]> ahhhhhhhhhhhh nice, really nice
<galibert[m]> Love it, much thanks
<whitequark[cis]> I think most if not all SRAMs are like this
<whitequark[cis]> certainly all for which I've read the docs before (I was thinking of another SRAM-style interface--not actual SRAM--that does have tighter requirements)
<whitequark[cis]> (it's the FX2 async memory-style bus)
<galibert[m]> That seems highly probable, terasic probably wouldn't go out of its way to find a weird one
<whitequark[cis]> I don't actually know why tSA would even be a thing
<whitequark[cis]> like, the CE input is just a gate for OE and WE
<whitequark[cis]> it's literally ANDed first thing on the chip
<whitequark[cis]> there isn't any logic in there running that would actually care about transitions on CE more than transitions on OE or WE, and those don't have a setup time
<galibert[m]> yeah, never really understood why CE/CS exists in general, outside of "because we've always put one", when there's an OE
<whitequark[cis]> probably so that you can put it directly on a 8080-style parallel bus
<whitequark[cis]> iirc one of 8080 or 6800 is slightly easier to interface with if you have CE, and the other if you have OE, or something like that
<zyp[m]> <whitequark[cis]> "you set valid, (a lot of stuff..." <- the issue with that is that you lose the ability to do pipelining, i.e. have multiple bits inflight at once
<whitequark[cis]> zyp: not necessarily!
<whitequark[cis]> the hyperbus PHY we've been playing with has pipelining in the PHY because it uses IOB registers
<whitequark[cis]> if you want to use those (to get predictable timing, useful for nextpnr that doesn't really have input/output constraints), even the half rate PHY requires a register before your output and after your input, which means you have one inflight word either direction no matter what you do
<whitequark[cis]> sorry, one inflight halfword
<whitequark[cis]> so your MAC or controller or whatever has to accommodate nonzero output-stream-to-pin / pin-to-input-stream latencies, which means you can pipeline however much you like so long as your controller can handle it
<whitequark[cis]> the only thing your IO core needs to do is to define this latency via an attribute so that the controller can subtract it from the turnaround period
<whitequark[cis]> does this make sense?
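A hedged sketch of that attribute (every name here is invented; this is not the actual Glasgow/HyperBus code): the PHY advertises its register latencies and the controller derives its wait counts from them.

```python
class HalfRatePHY:
    # assumed: one IOB register stage in each direction
    o_latency = 1   # cycles from the output stream to the pad
    i_latency = 1   # cycles from the pad to the input stream

class Controller:
    def __init__(self, phy, turnaround=6):
        # the protocol turnaround is measured at the pads; subtract the
        # round trip through the buffers to get fabric-side wait cycles
        self.wait_cycles = turnaround - (phy.o_latency + phy.i_latency)
```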
<zyp[m]> for one of the things I'm working on, I've got a SPI PHY that's hooked up to daisy chained ADCs, whose sample rate is already bottlenecked by how fast I can read them out, and I'm using DDR registers for that in the same manner
<whitequark[cis]> in your design for the IO core, you have the ability to introduce a "bubble" into the pipeline, where you e.g. lower valid for a cycle and then get a cycle without ready back after a delay equal to the pipeline latency
<whitequark[cis]> the existence of this bubble needs to be tracked somewhere inside of the IO core; a flop chain can do this so it's not like this is super costly
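A minimal sketch of such a flop chain (`LATENCY` and the signal names are assumptions): cycles where no word was accepted shift a 0 in, and that 0 pops out as a bubble on the returning side.

```python
from amaranth import Module, Signal, Cat

LATENCY = 2          # assumed IOB round-trip latency, in cycles

m = Module()
o_valid = Signal()   # a word was accepted towards the pads this cycle
i_valid = Signal()   # the corresponding returning word is valid

taps = Signal(LATENCY)                        # one flop per in-flight slot
m.d.sync += taps.eq(Cat(o_valid, taps[:-1]))  # shift a 1 in per real word
m.d.comb += i_valid.eq(taps[-1])              # ...and back out LATENCY later
```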
<zyp[m]> sounds like you're effectively taking two streams with only valid signals, gluing them together and renaming the output valid to ready
<zyp[m]> or input, depending on perspective
<whitequark[cis]> I think it's a bit more complicated than that, maybe?
<whitequark[cis]> actually, thinking more about it, I think you're right, and a useful way to view this would be: we have a black box (representing the outside world), which has an input stream and an output stream. for each word in the input stream we eventually get a word in the output stream back
<galibert[m]> should a ram be considered a stream?
<whitequark[cis]> depends which type
<whitequark[cis]> actually, I think I may have misunderstood the question
<whitequark[cis]> whitequark[cis]: the input of the black box is "state of the FPGA output pins on the next cycle" and the output of the black box is "the state of the FPGA input pins on the previous cycle"
<whitequark[cis]> so it's fully generic as long as you're using a register for input/output
<galibert[m]> ah, very scsi of you
<whitequark[cis]> zyp: okay, I think I now see the advantages of your approach
<whitequark[cis]> in it, the latency is implicitly tracked in the I/O buffer, and the controller mostly doesn't need to care about it (other than e.g. for configuration purposes)
<whitequark[cis]> in your protocol you have everything defined in terms of cycle offset from some reference (say lowering CSn) so you have an FSM or something that counts cycles, and if you know you start getting input data after 4 cycles of address and turnaround, you can just skip 4 words and go straight to data
<whitequark[cis]> which works regardless of your I/O buffer latency as long as it's within what the protocol can handle
<whitequark[cis]> and because your CSn control signal lives in the same output stream as your data signals, those are also synchronized
<galibert[m]> how often does that really happen, fixed pipelining like that?
<galibert[m]> with more than one request in flight?
<whitequark[cis]> galibert: if we had infrastructure built for this, half the applets in glasgow would use it
<whitequark[cis]> and be like 5x faster for that
<whitequark[cis]> you really do want to have e.g. several JTAG bits in flight potentially
<galibert[m]> Just to be clear, what I mean is that it looked to me like fixed pipelining had gone away in favor of tagged frames and possible reordering
<whitequark[cis]> zyp: what do you do with the clock if o_valid is low?
<whitequark[cis]> galibert: I have no idea what that means in this context
<galibert[m]> axi(?) has a command id on every access, and the answer comes with the id, but there can be reordering of the answers w.r.t the requests
<galibert[m]> because caches and stuff
<whitequark[cis]> axi is very complex and a lot of solutions are not like axi at all
<whitequark[cis]> you will see a lot of fixed-function stream graphs, e.g. for data acquisition and signal processing applications, built out of fixed-latency stream-based blocks
<whitequark[cis]> with no reordering or tagging whatsoever
<galibert[m]> aren't they directly connected to one another, usually?
<whitequark[cis]> also, even with axi, usually people do not instantiate it with that tag capability, because it makes interconnect more complicated
<zyp[m]> whitequark[cis]: I don't output the clock until I receive the next output bit
<zyp[m]> here's what I've got so far, I haven't hooked up the input side to the source endpoint yet: https://paste.jvnv.net/view/WSWrT
<whitequark[cis]> and it's very much built to allow traditional pipelining where interleaved requests lead to interleaved responses and the requester is supposed to determine that by the order
<whitequark[cis]> zyp[m]: right, ok, so you have something like a DDR clock repeater and you output 0,0 whenever you don't have o_valid?
<zyp[m]> correct
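A sketch of that gated DDR clock output, again assuming the pre-0.5 `Pin` interface with `xdr=2` and an invented `"spi_sck"` resource; `o0`/`o1` are the two half-cycle values, so 0,0 suppresses the pulse when there is nothing to shift.

```python
from amaranth import Elaboratable, Module, Signal, ClockSignal

class GatedSCK(Elaboratable):
    def __init__(self):
        self.o_valid = Signal()   # a bit is being shifted this cycle

    def elaborate(self, platform):
        m = Module()
        sck = platform.request("spi_sck", 0, xdr=2)
        m.d.comb += [
            sck.o_clk.eq(ClockSignal()),
            sck.o0.eq(0),             # first half-cycle always low
            sck.o1.eq(self.o_valid),  # pulse only when shifting a bit
        ]
        return m
```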
<whitequark[cis]> i should just post our hyperbus phy
<galibert[m]> Ok, I see where I'm going wrong. I was thinking of an input/output stream, but there's no reason to have the input and the output in the same place
<zyp[m]> the module is also enforcing cs assertion to first sck and last sck to cs deassertion times, as well as minimum cs deasserted time to have time for ADC conversion
<whitequark[cis]> in this case, if ready is low, you actually cannot shift another word into the PHY, so it is a ready output
<whitequark[cis]> zyp: the whole thing is a bit cursed but I hope the comments explain why it is the way it is
<zyp[m]> hmm, you don't have registers on rwds/data?
<zyp[m]> <whitequark[cis]> "in it, the latency is implicitly..." <- and yeah, this is very much the point
<Wanda[cis]> zyp[m]: there is a register in the SB_IO
<Wanda[cis]> ie. IOBufferWithEn
<zyp[m]> is that a register, not just a tristate buffer?
<whitequark[cis]> with this PIN_TYPE there is a register on oe, o, and i
<whitequark[cis]> SB_IO is weird, you cannot pack a flop into the IOB on ice40 using any existing tools, but you can configure the IOB to have an internal flop
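A sketch of such an SB_IO instantiation; the PIN_TYPE value is my reading of the Lattice iCE Technology Library (0b1101 for registered output with registered enable, 0b00 for registered input) and should be double-checked against the documentation, and `pad`/`o`/`oe`/`i`/`valid` are assumed signals.

```python
from amaranth import Module, Signal, ClockSignal, Instance

m = Module()
pad, o, oe, i, valid = Signal(), Signal(), Signal(), Signal(), Signal()

m.submodules.iob = Instance("SB_IO",
    p_PIN_TYPE=0b1101_00,            # reg. output, reg. OE, reg. input
    io_PACKAGE_PIN=pad,
    i_INPUT_CLK=ClockSignal(),
    i_OUTPUT_CLK=ClockSignal(),
    i_CLOCK_ENABLE=valid,            # the stream "valid" as clock enable
    i_OUTPUT_ENABLE=oe,
    i_D_OUT_0=o,                     # goes through the output register
    o_D_IN_0=i,                      # comes through the input register
)
```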
<zyp[m]> okay, I have limited experience with the amaranth platform resource system so far, and even less with ice40
<whitequark[cis]> so actually the idea with the hyperbus thing is to improve the design of the amaranth platform as it relates to I/O
<zyp[m]> anyway, this looks close to how I'd do it, except I'd have rwds.i and data.i in their own stream with their own valid
<whitequark[cis]> you can ignore the amaranth platform related parts there and just look at SB_IO
<whitequark[cis]> this is the schematic for SB_IO
<whitequark[cis]> it's not actually all that different from Xilinx or Altera IOB, it just has a silly way to manually pack a flop into it instead of a separate primitive
<zyp[m]> back-propagating ready from that stream to the other stream is annoying and something I'd rather avoid, hence why I wanted the ability to make streams explicitly without ready
<whitequark[cis]> so for D_IN_0 and D_OUT_0, there is a single posedge flop in the chain the way IOBufferWithEn is configured
<whitequark[cis]> zyp: but for the purpose of e.g. this HyperBus thing we wrote, `ready` is in fact necessary
<whitequark[cis]> so the hyperbus controller cannot assume that ready never deasserts
<whitequark[cis]> I guess you could omit ready on the output stream (of data_i, etc)
<zyp[m]> yes, valid_o, ready_o and valid_i are all useful and easy, but for the phy to respect ready_i, it needs to have internal buffering that can hold any bytes that it's already issued clocks for before it can back-propagate to ready_o
<whitequark[cis]> can you elaborate?
<zyp[m]> if you have two cycles of round trip delay in the IO registers and you're issuing clocks back to back, that means you can have two transfers inflight at any time, whose input data you have to capture in the next two cycles
<zyp[m]> but if ready_i goes low, you can't just stream them out, so you have to buffer them
<zyp[m]> and even if you set ready_o low as soon as ready_i goes low, there's still the two transfers you already issued clocks for
<whitequark[cis]> I feel like I'm missing something
<zyp[m]> give me a few minutes, I'll draw up something in wavedrom
<whitequark[cis]> don't you need the same amount of buffering regardless of whether you have ready_i or not?
<zyp[m]> imagine something like this:
<zyp[m]> output stream feeds a continuous stream of transfers back to back, output registers have one cycle of latency before they hit the bus pins, input registers have another cycle of latency before the data read back is fed to the input stream
<zyp[m]> and in this case, whatever is hooked to the input stream receives the first transfer and deasserts ready_i, which back-propagates to ready_o
<zyp[m]> but because of the latency, the second and third transfer are already clocked out, and thus have to be captured in the next two cycles
<whitequark[cis]> ohhhhh.
<zyp[m]> there's two ways to solve this; either this block can add enough buffering to hold those bytes, or the layer above this can just avoid issuing more transfers than it knows it has capacity to receive back
<zyp[m]> and in the latter case, it'd be good if the stream explicitly indicates it doesn't support backpressure
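A sketch of the first option (enough buffering for the words already in flight), using amaranth.lib.fifo; `ROUND_TRIP`, the width, and the signal names are assumptions.

```python
from amaranth import Module, Signal
from amaranth.lib.fifo import SyncFIFOBuffered

ROUND_TRIP = 2     # assumed output-register + input-register latency

m = Module()
m.submodules.rx_buf = rx_buf = SyncFIFOBuffered(width=8, depth=ROUND_TRIP + 1)

captured = Signal(8)   # word coming back from the input registers
capture  = Signal()    # strobes once per returning word

m.d.comb += [
    rx_buf.w_data.eq(captured),
    rx_buf.w_en.eq(capture),   # cannot overflow if depth covers the
                               # worst-case number of in-flight words
]
# rx_buf.r_data / rx_buf.r_rdy / rx_buf.r_en face the input stream,
# which is now free to deassert ready_i without losing data.
```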
<galibert[m]> I've seen an interesting design in a mpeg decoder chip: you configure a buffer size, 2-32 bytes iirc, and the chip issues a drq only when it has at least that much space available. When drq is set you can then push data blindly at max speed
<galibert[m]> drq has a delay to be unset but that's within the time of the transfer iirc
<whitequark[cis]> (brb)
<galibert[m]> so it's kind of a packetized backpressure, or flow control
<galibert[m]> they connect that to a block dma mode of the h8, and the cpu doesn't need to care and the bus usage is really nice
<gruetzkopf> VLSI's combo compressed-audio decoder/DACs also do that
<galibert[m]> makes sense, it's the same requirement
nelgau has joined #amaranth-lang
frgo has quit [Quit: Leaving...]
frgo has joined #amaranth-lang
peeps[zen] has joined #amaranth-lang
peepsalot has quit [Ping timeout: 264 seconds]
nelgau has quit [Read error: Connection reset by peer]
nelgau has joined #amaranth-lang
nelgau has quit [Read error: Connection reset by peer]
nelgau has joined #amaranth-lang
nelgau has quit [Read error: Connection reset by peer]
nelgau has joined #amaranth-lang
nelgau has quit [Read error: Connection reset by peer]
nelgau has joined #amaranth-lang
lf has quit [Ping timeout: 240 seconds]
lf has joined #amaranth-lang
lf has quit [Ping timeout: 252 seconds]
lf has joined #amaranth-lang
nelgau has quit [Read error: Connection reset by peer]
nelgau has joined #amaranth-lang
nelgau has quit [Ping timeout: 255 seconds]
anubis has joined #amaranth-lang
anubis has quit [Remote host closed the connection]
jess has joined #amaranth-lang
iposthuman[m] has joined #amaranth-lang
<iposthuman[m]> Hello, i have been looking at a port of Bruno Levy's code to Amaranth and ran into some code that I don't know where it comes from in Amaranth:
<iposthuman[m]> This is from bl0x's port. What is Amaranth's memory module? Thanks. 🤔
<iposthuman[m]> Cool. I'll check it out. Much appreciated 🙂
<iposthuman[m]> Ah. I see now. Thanks
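For context, a minimal sketch of the built-in memory the port is most likely using; depending on the Amaranth version it lives in amaranth.hdl.mem (shown here) or in amaranth.lib.memory in newer releases.

```python
from amaranth import Elaboratable, Module
from amaranth.hdl.mem import Memory

class RAM(Elaboratable):
    def elaborate(self, platform):
        m = Module()
        mem = Memory(width=32, depth=256, init=[])   # 256 x 32-bit words
        m.submodules.rdport = rdport = mem.read_port()
        m.submodules.wrport = wrport = mem.write_port()
        # rdport.addr / rdport.data; wrport.addr / wrport.data / wrport.en
        return m
```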
FFY00 has joined #amaranth-lang