#amaranth-lang on 2023-10-31 — irc logs at libera.irclog.whitequark.org

2023-08-14 23:59 whitequark[cis] changed the topic of #amaranth-lang to: Amaranth hardware definition language · weekly meetings: Amaranth each Mon 1700 UTC, Amaranth SoC each Fri 1700 UTC · code https://github.com/amaranth-lang · logs https://libera.irclog.whitequark.org/amaranth-lang · Matrix #amaranth-lang:matrix.org

00:39 nelgau has joined #amaranth-lang

00:47 lf has quit [Ping timeout: 246 seconds]

00:48 lf has joined #amaranth-lang

02:26 cyrozap_ has quit [Quit: ZNC 1.8.2+deb3.1 - https://znc.in]

02:27 cyrozap has joined #amaranth-lang

02:59 <whitequark[cis]> so I discovered something interesting

02:59 <whitequark[cis]> remember how we discussed whether it should be possible to specify a signature as Stream payload?

02:59 <whitequark[cis]> I just found a use case for this

02:59 <whitequark[cis]> consider a registered IO buffer with clock enable

03:01 <whitequark[cis]> you could represent its signature with something like:... (full message at <https://catircservices.org/_matrix/media/v3/download/catircservices.org/NnmslbUChJWUIAZGShGKNnVF>)

03:03 <whitequark[cis]> the "valid" input is the clock enable for the IO buffer registers (whenever it's low, all the registers stop toggling, and data gets transferred neither way); and the "ready" output, while always 1 in most cases, allows you to do wait states which are useful for soft gearboxes

03:05 <whitequark[cis]> one example where a soft gearbox is useful in this context is what I call a "0.5x rate pin": suppose you have some protocol where you need to output a center-aligned clock (and the setup/hold constraints reflect that), but you don't want to use up a PLL, or maybe you don't want to phase shift with a PLL (say your clock frequency changes and your phase offset is in picos). in that case you can output the clock at half the clock

03:05 <whitequark[cis]> input to the IOB, and then use DDR registers to output 0,1 for a posedge and 1,0 for a negedge

03:07 <whitequark[cis]> (this is how I'm currently wiring up HyperRAM on Glasgow. 8 bits/cycle @48M is plenty of bandwidth already and most applets don't need that much, but being able to free up one of the only two PLLs on that chip is very handy, plus you get to keep it all in a single domain)

03:08 <whitequark[cis]> I think looking at IO buffer primitives from a stream perspective is very valuable, because it lets you apply the existing understanding and bag of tricks to something that would otherwise be a separate area

03:10 <whitequark[cis]> so basically allowing signatures as stream payloads allows one to make a stream that transfers, on each cycle, one word in either direction, but it still logically flows from place A to place B

03:11 <whitequark[cis]> however you would not be able to use most stream infrastructure with it, so i don't know if the actual Stream class should allow it; it exists for interop after all

04:37 Degi_ has joined #amaranth-lang

04:38 Degi has quit [Ping timeout: 255 seconds]

04:38 Degi_ is now known as Degi

05:00 nelgau has quit [Read error: Connection reset by peer]

06:59 peepsalot has quit [Ping timeout: 252 seconds]

07:02 peepsalot has joined #amaranth-lang

09:13 <zyp[m]> I'm not sure I fully understand the point, don't you run into issues with latency between the output and input signals?

09:15 <zyp[m]> I've written e.g. some SPI IO blocks with separate output and input streams, and the input stream would go valid a couple of cycles after the output stream to account for the latency in the IO registers

09:23 <whitequark[cis]> you do

09:24 <whitequark[cis]> this is useful for uniformly representing IO when soft gearboxes and other intermediaries are involved

09:24 <whitequark[cis]> as an extreme case, imagine doing SPI over BSCAN

09:24 <whitequark[cis]> you set valid, (a lot of stuff happens), you get back ready

09:25 <whitequark[cis]> it is useful to have a protocol core that can talk to this, and also normal SPI, with the same interface

09:26 <whitequark[cis]> also, I've long had the idea that output-to-input latency should be a property of the IO buffer rather than hardcoded; this is still something yet to be proven but it's a big thing missing in the current Pin interface

09:26 <whitequark[cis]> (well, output-to-pad and pad-to-input

09:26 <whitequark[cis]> * and pad-to-input)

09:27 <galibert[m]> Something to add to your todo list on pins, a way to express "this output signal must change only after those other ones are settled", thinking CS on sram accesses there but I'm sure there are other cases

09:27 <galibert[m]> when the netlist changes everything simultaneously, at the end of a wishbone bus for instance

09:27 <whitequark[cis]> can you rephrase? I'm not sure what the condition you are describing actually is

09:28 <galibert[m]> sram that says "edge on CS once the address and r/w are set"

09:28 <whitequark[cis]> oh, a fully async interface

09:28 <whitequark[cis]> Amaranth isn't really built to handle those

09:28 <whitequark[cis]> it does as a matter of fact but modern logic is synchronous

09:29 <galibert[m]> lemme find the datasheet again

09:29 <whitequark[cis]> I don't think the core language should have anything in it that specifically addresses fully async interfaces

09:29 <whitequark[cis]> it's too much work for too little gain

09:30 <galibert[m]> https://og.kervella.org/61-64WV25616EDBLL.pdf

09:30 <galibert[m]> as far as I can tell that's a pretty standard sram

09:30 <galibert[m]> not something old or insane, it's in one of my cv eval boards

09:30 <whitequark[cis]> yep

09:31 <whitequark[cis]> SRAM like this is one of the very few places where fully async interfaces still exist. other notable examples include UART

09:31 <whitequark[cis]> just about everything else includes some sort of clock such that most of the other signals are referenced to that clock

09:32 <galibert[m]> I would be a little sad if sram interfacing was out of the scope of amaranth :-)

09:33 <whitequark[cis]> SRAM interfacing isn't because it's not special in any way, just an old parallel interface

09:33 <whitequark[cis]> but describing SRAM-related invariants isn't in scope

09:33 <whitequark[cis]> actually, mm

09:34 <whitequark[cis]> if I remember correctly, you can just hold CE low forever, tie nOE to nWE, and have essentially two timing arcs: address-to-data and address+data-to-update

09:34 <whitequark[cis]> well, nOE to !nWE

09:37 <whitequark[cis]> actually, hold on #2

09:38 <galibert[m]> that's ok, I'm writing annoying reports, the fpga is turned off :-)

09:38 <whitequark[cis]> galibert: for both read and write cycles, no delay is required between transitions on address and CE

09:38 <whitequark[cis]> page 7 note 3 for reads

09:39 <whitequark[cis]> page 8 row tSA for writes, illustrated on the next page

09:39 <whitequark[cis]> <galibert[m]> "Something to add to your todo..." <- so the SRAM doesn't actually require the condition you're talking about

09:39 <galibert[m]> ahhhhhhhhhhhh nice, really nice

09:40 <galibert[m]> Love it, much thanks

09:41 <whitequark[cis]> I think most if not all SRAMs are like this

09:42 <whitequark[cis]> certainly all for which I've read the docs before (I was thinking of another SRAM-style interface--not actual SRAM--that does have tighter requirements)

09:42 <whitequark[cis]> (it's the FX2 async memory-style bus)

09:42 <galibert[m]> That seems highly probable, terasic probably wouldn't go out if its way to find a weird one

09:42 <whitequark[cis]> I don't actually know why tSA would even be a thing

09:42 <whitequark[cis]> like, the CE input is just a gate for OE and WE

09:42 <whitequark[cis]> it's literally ANDed first thing on the chip

09:43 <whitequark[cis]> there isn't any logic in there running that would actually care about transitions on CE more than transitions on OE or WE, and those don't have a setup time

09:43 <galibert[m]> yeah, never really understood why CE/CS in general, outside of "because we've always put one", when there's a OE

09:44 <whitequark[cis]> probably so that you can put it directly on a 8080-style parallel bus

09:44 <whitequark[cis]> iirc one of 8080 or 6800 is slightly easier to interface with if you have CE, and the other if you have OE, or something like that

09:44 <zyp[m]> <whitequark[cis]> "you set valid, (a lot of stuff..." <- the issue with that is that you lose the ability to do pipelining, i.e. have multiple bits inflight at once

09:45 <whitequark[cis]> zyp: not necessarily!

09:45 <whitequark[cis]> the hyperbus PHY we've been playing with has pipelining in the PHY because it uses IOB registers

09:47 <whitequark[cis]> if you want to use those (to get predictable timing, useful for nextpnr that doesn't really have input/output constraints), even the half rate PHY requires a register before your output and after your input, which means you have one inflight word either direction no matter what you do

09:47 <whitequark[cis]> sorry, one inflight halfword

09:48 <whitequark[cis]> so your MAC or controller or whatever has to accommodate nonzero output-stream-to-pin / pin-to-input-stream latencies, which means you can pipeline however much you like so long as your controller can handle it

09:49 <whitequark[cis]> the only thing your IO core needs to do is to define this latency via an attribute so that the controller can subtract it from the turnaround period

09:49 <whitequark[cis]> does this make sense?

09:49 <zyp[m]> for one of the things I'm working on, I've got a SPI PHY that's hooked up to daisy chained ADCs, whose sample rate is already bottlenecked by how fast I can read them out, and I'm using DDR registers for that in the same manner

09:50 <whitequark[cis]> in your design for the IO core, you have the ability to introduce a "bubble" into the pipeline, where you e.g. lower valid for a cycle and then get a cycle without ready back after a delay equal to the pipeline latency

09:51 <whitequark[cis]> the existence of this bubble needs to be tracked somewhere inside of the IO core; a flop chain can do this so it's not like this is super costly

09:52 <zyp[m]> sounds like you're effectively taking two streams with only valid signals, gluing them together and renaming the output valid to ready

09:53 <zyp[m]> or input, depending on perspective

09:53 <whitequark[cis]> I think it's a bit more complicated than that, maybe?

09:55 <whitequark[cis]> actually, thinking more about it, I think you're right, and a useful way to view this would be: we have a black box (representing the outside world), which has an input stream and an output stream. for each word in the input stream we eventually get a word in the output stream back

09:56 <galibert[m]> should a ram be considered a stream?

09:56 <whitequark[cis]> depends which type

09:56 <whitequark[cis]> actually, I think I may have misunderstood the question

09:57 <whitequark[cis]> whitequark[cis]: the input of the black box is "state of the FPGA output pins on the next cycle" and the output of the black box is "the state of the FPGA input pins on the previous cycle"

09:57 <whitequark[cis]> so it's fully generic as long as you're using a register for input/output

09:58 <galibert[m]> ah, very scsi of you

09:58 <whitequark[cis]> zyp: okay, I think I now see the advantages of your approach

09:59 <whitequark[cis]> in it, the latency is implicitly tracked in the I/O buffer, and the controller mostly doesn't need to care about it (other than e.g. for configuration purposes)

10:00 <whitequark[cis]> in your protocol you have everything defined in terms of cycle offset from some reference (say lowering CSn) so you have an FSM or something that counts cycles, and if you know you start getting input data after 4 cycles of address and turnaround, you can just skip 4 words and go straight to data

10:00 <whitequark[cis]> which works regardless of your I/O buffer latency as long as it's within what the protocol can handle

10:01 <whitequark[cis]> and because your CSn control signal lives in the same output stream as your data signals, those are also synchronized

10:01 <galibert[m]> how often does that really happen, fixed pipelining like that?

10:02 <galibert[m]> with more than one request in flight?

10:02 <whitequark[cis]> galibert: if we had infrastructure built for this, half the applets in glasgow would use it

10:02 <whitequark[cis]> and be like 5x faster for that

10:02 <whitequark[cis]> you really do want to have e.g. several JTAG bits in flight potentially

10:03 <galibert[m]> Just to be clear, what I mean is that it looked to me like fixed pipelining had gone away in favor of tagged frames and possible reordering

10:03 <whitequark[cis]> zyp: what do you do with the clock if o_valid is low?

10:04 <whitequark[cis]> galibert: I have no idea what that means in this context

10:04 <galibert[m]> axi(?) has a command id on every access, and the answer cames with the id, but there can be reordering of the answers w.r.t the requests

10:04 <galibert[m]> because caches and stuff

10:05 <whitequark[cis]> axi is very complex and a lot of solutions are not like axi at all

10:05 <whitequark[cis]> you will see a lot of fixed function stream graphs e.g. for data acquisition and signal processing application having a lot of fixed latency stream based blocks

10:05 <whitequark[cis]> with no reordering or tagging whatsoever

10:05 <galibert[m]> aren't they directly connected to one another, usually?

10:06 <whitequark[cis]> also, even with axi, usually people do not instantiate it with that tag capability, because it makes interconnect more complicated

10:06 <zyp[m]> whitequark[cis]: I don't output the clock until I receive the next output bit

10:06 <zyp[m]> here's what I've got so far, I haven't hooked up the input side to the source endpoint yet: https://paste.jvnv.net/view/WSWrT

10:06 <whitequark[cis]> and it's very much built to allow traditional pipelining where interleaved requests lead to interleaved responses and the requester is supposed to determine that by the order

10:07 <whitequark[cis]> zyp[m]: right, ok, so you have something like a DDR clock repeater and you output 0,0 whenever you don't have o_valid?

10:07 <zyp[m]> correct

10:07 <whitequark[cis]> i should just post our hydrabus phy

10:07 <whitequark[cis]> https://paste.debian.net/1296794/

10:08 <galibert[m]> Ok, I see where I'm going wrong. I was thinking of a input/output stream, but there's no reason to have the input and the output in the same place

10:08 <zyp[m]> the module is also enforcing cs assertion to first sck and last sck to cs deassertion times, as well as minimum cs deasserted time to have time for ADC conversion

10:08 <whitequark[cis]> in this case, if ready is low, you actually cannot shift another word into the PHY, so it is a ready output

10:11 <whitequark[cis]> zyp: the whole thing is a bit cursed but I hope the comments explain why it is the way it is

10:24 <zyp[m]> hmm, you don't have registers on rwds/data?

10:26 <zyp[m]> <whitequark[cis]> "in it, the latency is implicitly..." <- and yeah, this is very much the point

10:28 <Wanda[cis]> zyp[m]: there is a register in the SB_IO

10:28 <Wanda[cis]> ie. IOBufferWithEn

10:30 <zyp[m]> is that a register, not just a tristate buffer?

10:31 <whitequark[cis]> with this PIN_TYPE there is a register on oe, o, and i

10:31 <whitequark[cis]> SB_IO is weird, you cannot pack a flop into the IOB on ice40 using any existing tools, but you can configure the IOB to have an internal flop

10:32 <zyp[m]> okay, I have limited experience with the amaranth platform resource system so far, and even less with ice40

10:33 <whitequark[cis]> so actually the idea with the hyperbus thing is to improve the design of the amaranth platform as it relates to I/O

10:33 <zyp[m]> anyway, this looks close to how I'd do it, except I'd have rwds.i and data.i in their own stream with their own valid

10:33 <whitequark[cis]> you can ignore the amaranth platform related parts there and just look at SB_IO

10:34 <whitequark[cis]> this is the schematic for SB_IO

10:34 * whitequark[cis] uploaded an image: (328KiB) < https://catircservices.org/_matrix/media/v3/download/matrix.org/wsZhIYxNovzEaUjBNvbQYmHB/Screenshot_20231031_103412.png >

10:34 <whitequark[cis]> it's not actually all that different from Xilinx or Altera IOB, it just has a silly way to manually pack a flop into it instead of a separate primitive

10:35 <zyp[m]> back-propagating ready from that stream to the other stream is annoying and something I'd rather avoid, hence why I wanted the ability to make streams explicitly without ready

10:35 <whitequark[cis]> so for D_IN_0 and D_OUT_0, there is a single posedge flop in the chain the way IOBufferWithEn is configured

10:36 <whitequark[cis]> zyp: but for the purpose of e.g. this HyperBus thing we wrote, `ready` is in fact necessary

10:36 <whitequark[cis]> so the hyperbus controller cannot assume that ready is never there

10:37 <whitequark[cis]> I guess you could omit ready on the output stream (of data_i, etc)

10:38 <zyp[m]> yes, valid_o, ready_o and valid_i are all both useful and easy, but for the phy to respect ready_i, it needs to have internal buffering that can hold any bytes that it's already issued clocks for before it can back-propagate to ready_o

10:39 <whitequark[cis]> can you elaborate?

10:41 <zyp[m]> if you have two cycles of round trip delay in the IO registers and you're issuing clocks back to back, that means you can have two transfers inflight at any time, whose input data you have to capture in the next two cycles

10:42 <zyp[m]> but if ready_i goes low, you can't just stream them out, so you have to buffer them

10:42 <zyp[m]> and even if you set ready_o low as soon as ready_i goes low, there's still the two transfers you already issued clocks for

10:44 <whitequark[cis]> I feel like I'm missing something

10:46 <zyp[m]> give me a few minutes, I'll draw up something in wavedrom

10:46 <whitequark[cis]> don't you need the same amount of buffering regardless of whether you have ready_i or not?

10:54 * zyp[m] uploaded an image: (16KiB) < https://catircservices.org/_matrix/media/v3/download/matrix.org/UwINnglwGQCtNyirAjFIBRGe/image.png >

10:54 <zyp[m]> imagine something like this:

10:58 <zyp[m]> output stream feeds a continuous stream of transfers back to back, output registers have one cycle of latency before they hit the bus pins, input registers have another cycle of latency for the data read back and is fed to the input stream

10:59 <zyp[m]> and in this case, whatever is hooked to the input stream receives the first transfer and deasserts ready_i, whick backpropagates to ready_o

11:00 <zyp[m]> but because of the latency, the second and third transfer are already clocked out, and thus have to be captured in the next two cycles

11:01 <whitequark[cis]> ohhhhh.

11:03 <zyp[m]> there's two ways to solve this; either this block can add enough buffering to hold those bytes, or the layer above this can just avoid issuing more transfers than it knows it has capacity to receive back

11:03 <zyp[m]> and in the latter case, it'd be good if the stream explicitly indicates it doesn't support backpressure

11:05 <galibert[m]> I've seen an interesting design in a mpeg decoder chip: you configure a buffer size, 2-32 bytes iirc, and the chip issues a drq only when it has at least that much space available. When drq is set you can then push data blindly at max speed

11:05 <galibert[m]> drq has a delay to be unset but that's within the time of the transfert iirc

11:05 <whitequark[cis]> (brb)

11:06 <galibert[m]> so it's kind of a packatized backpressure

11:06 <galibert[m]> * so it's kind of a packetized backpressure, or flow control

11:06 <galibert[m]> they connect that to a block dma mode of the h8, and the cpu doesn't need to care and the bus usage is really nice

12:04 <gruetzkopf> VLSIs combo compressed audiors decoders/DACs also do that

12:05 <galibert[m]> makes sense, it's the same requirement

13:12 nelgau has joined #amaranth-lang

13:16 frgo has quit [Quit: Leaving...]

13:27 frgo has joined #amaranth-lang

14:11 peeps[zen] has joined #amaranth-lang

14:14 peepsalot has quit [Ping timeout: 264 seconds]

14:48 nelgau has quit [Read error: Connection reset by peer]

14:48 nelgau has joined #amaranth-lang

15:26 nelgau has quit [Read error: Connection reset by peer]

15:26 nelgau has joined #amaranth-lang

15:38 nelgau has quit [Read error: Connection reset by peer]

15:38 nelgau has joined #amaranth-lang

16:57 nelgau has quit [Read error: Connection reset by peer]

16:57 nelgau has joined #amaranth-lang

17:09 lf has quit [Ping timeout: 240 seconds]

17:09 lf has joined #amaranth-lang

17:25 lf has quit [Ping timeout: 252 seconds]

17:26 lf has joined #amaranth-lang

17:41 nelgau has quit [Read error: Connection reset by peer]

17:42 nelgau has joined #amaranth-lang

19:21 nelgau has quit [Ping timeout: 255 seconds]

19:47 anubis has joined #amaranth-lang

20:12 anubis has quit [Remote host closed the connection]

20:32 jess has joined #amaranth-lang

21:55 iposthuman[m] has joined #amaranth-lang

21:55 <iposthuman[m]> Hello, i have been looking at the a port of Bruno Levy's code to Amaranth and ran into some code that I don't know where in Amaranth it comes from:

21:55 * iposthuman[m] sent a code block: https://catircservices.org/_matrix/media/v3/download/catircservices.org/mnQfADHDBvRuumucHQLoGvIf

21:56 <iposthuman[m]> This is from bl0x's port. What is Amarant's memory module? Thanks. 🤔

21:57 <galibert[m]> https://github.com/amaranth-lang/amaranth/blob/main/amaranth/hdl/mem.py#L12

22:00 <iposthuman[m]> Cool. I'll check it out. Much appreciated 🙂

22:01 <iposthuman[m]> Ah. I see now. Thanks

23:51 FFY00 has joined #amaranth-lang