whitequark[cis] changed the topic of #glasgow to: https://glasgow-embedded.org · digital interface explorer · https://www.crowdsupply.com/1bitsquared/glasgow · code https://github.com/GlasgowEmbedded/glasgow · logs https://libera.irclog.whitequark.org/glasgow · matrix #glasgow-interface-explorer:matrix.org · discord https://1bitsquared.com/pages/chat
redstarcomrade has joined #glasgow
purdeaandrei[m] has joined #glasgow
<purdeaandrei[m]> Understood; sorry, for some reason I thought you said you would test that with a real chip, and when I saw it merged I assumed everything was okay. I hope it didn't derail the demo.
whitequark[m] has joined #glasgow
<whitequark[m]> I completely forgot about it, it seems... I've been overloaded and stressed recently, so not surprising
<whitequark[m]> the demo went okay since I accounted for the possibility of this and had a few hours to drill into all the issues
<whitequark[m]> I think I might've tested one of the PRs but not both
abby has quit [Ping timeout: 252 seconds]
\\ has quit [Ping timeout: 276 seconds]
<purdeaandrei[m]> I'm trying out 48MHz now with the current main branch, but I think my wires are too long; it's not reliable
<purdeaandrei[m]> I have something like 20cm of wire, a SOIC clip (floating in air; checked with a multimeter that it makes contact with each pin), and a small adapter
<purdeaandrei[m]> 10cm + SOIC ZIF socket also not working right at 48MHz, but works at 24MHz
<whitequark[m]> how does it fail? ff's after a while?
<purdeaandrei[m]> 0s after a while
<purdeaandrei[m]> I'll try to solder down the chip, I just want to double-check: 10cm shouldn't be too long, right?
<whitequark[m]> twist a ground wire with the clock wire
<whitequark[m]> and another one with a cs wire
<whitequark[m]> that usually does the trick
<purdeaandrei[m]> re-reading my old messages 🙂
meklort has quit [Quit: ZNC 1.7.5+deb4 - https://znc.in]
tec has joined #glasgow
meklort has joined #glasgow
<purdeaandrei[m]> Alright, so 25x SPI flash chips sample on the rising edge and output data on the falling edge
<purdeaandrei[m]> As far as I can tell from comments in the code, IOClocker is designed such that it always starts outputting data on a falling edge and samples on the rising edge
<purdeaandrei[m]> This is (mostly) true, except that in DDR mode it doesn't abstract away whether i[1] or i[0] should be read
<purdeaandrei[m]> IOClocker in DDR mode with divisor != 0 puts the rising-edge sample in i[0]
<purdeaandrei[m]> but in DDR mode with divisor == 0, the rising-edge sample ends up in i[1]
<purdeaandrei[m]> QSPIController always connects i[0] to the deframer, which is why it's wrong
<purdeaandrei[m]> The reason it works despite the wrong IOStreamer sample time is that sampling actually happens on the falling edge, but it's the same falling edge on which the SPI chip drives new data. The time for SCK to reach the chip + the chip's clock-to-output time + the time back to the IO is long enough to satisfy the FPGA's hold time requirements.
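A minimal sketch of the hold-time argument above; the helper name and all delay figures are placeholders, not measurements:

```python
# Illustrative check of the reasoning above: sampling on the "wrong" falling
# edge still works as long as the round trip (SCK out to the chip, the chip's
# clock-to-output delay, and the return path back into the FPGA IO) exceeds
# the FPGA's input hold requirement.
def falling_edge_sample_ok(sck_to_chip_ns: float, chip_clk_to_out_ns: float,
                           return_path_ns: float, fpga_hold_ns: float) -> bool:
    return sck_to_chip_ns + chip_clk_to_out_ns + return_path_ns >= fpga_hold_ns

# Placeholder numbers: ~3 ns per level shifter pass, a guessed 8 ns flash
# clock-to-output, and a sub-nanosecond hold requirement.
print(falling_edge_sample_ok(3.0, 8.0, 3.0, 0.5))  # True
```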
<purdeaandrei[m]> hmm, I started implementing a fix, there's a bit more to this than I thought...
<purdeaandrei[m]> Since 24MHz fails too
<purdeaandrei[m]> Okay, so I realized two more things:
<purdeaandrei[m]> 1) IOClocker in DDR mode but with divisor != 0 is not equivalent to IOClocker in SDR mode
<purdeaandrei[m]> because all edges are generated on the falling edge of f_sync
<purdeaandrei[m]> but data output happens on the rising edge of f_sync. So data output always happens 10ns before the falling edge, instead of at the falling edge
<purdeaandrei[m]> See this capture:
<purdeaandrei[m]> (logic analyzer ran at 500MHz, so it sometimes sees 10ns as 12ns, sometimes as 8ns)
<purdeaandrei[m]> which means that sampling also happens 10ns before the rising edge
<purdeaandrei[m]> 2) The reason 24MHz was also failing is that this leaves only 10ns of round-trip time from SCLK changing to the FPGA being able to sample the result, and 10ns is just not enough
<purdeaandrei[m]> This was my initial fix attempt for 48MHz not working, but it failed for the same reason: sampling 10ns after the falling edge is not enough.... (full message at <https://catircservices.org/_matrix/media/v3/download/catircservices.org/FAzizmtlkGRXqddBFZMUirOd>)
<purdeaandrei[m]> Note: 10ns is not enough for the chip I'm playing with right now, which is an "EON EN25F80", and in the particular environment I'm in. Other chips might have a stronger driver. Also, the Glasgow's level shifters probably add a significant amount of delay that wouldn't be there in a pure FPGA implementation.
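As a rough illustration of why a 10ns budget falls short, here is a back-of-the-envelope round-trip calculation; the clock-to-output and wiring figures are assumptions, and only the ~3ns level-shifter figure is mentioned later in the log:

```python
# Rough, illustrative round-trip budget for "10 ns is just not enough":
# the sample point must wait for SCK to reach the chip, the chip to drive
# its output, and the data to travel back through the level shifters.
level_shifter_one_way_ns = 3.0   # ballpark figure discussed later in the log
chip_clk_to_output_ns    = 8.0   # hypothetical flash clock-to-output time
wiring_round_trip_ns     = 1.0   # clip/adapter wiring, guessed

round_trip_ns = 2 * level_shifter_one_way_ns + chip_clk_to_output_ns + wiring_round_trip_ns
print(f"need ~{round_trip_ns} ns, have 10 ns -> ok: {round_trip_ns <= 10}")  # ok: False
```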
\\ has joined #glasgow
abby has joined #glasgow
redstarcomrade has quit [Read error: Connection reset by peer]
<purdeaandrei[m]> So the big question is: When should we sample?
<purdeaandrei[m]> A simple solution would be to always sample on the falling edge (like we accidentally do right now in main when running at 48MHz)
<purdeaandrei[m]> But that would not work at, for example, 96MHz in my environment. For that I would need a feature to delay the sampling of the signal
<purdeaandrei[m]> whitequark (@_discord_182174208896401419:catircservices.org) you said you tested it working up to 60MHz, but 85MHz is what the toolchain said it should work at. Assuming you tried >60MHz and it didn't work, I'd guess that 16.6ns is enough round-trip time in your environment. That is: I think >60MHz should be possible if we delay the signal sampling point.
<purdeaandrei[m]> Do we intend to support QSPI with >48MHz at some point in the future?
<purdeaandrei[m]> For an example of other designs also permitting a configurable sample time, see the RP2040 datasheet section "4.10.9.1.1. RXD Sample Delay"
<purdeaandrei[m]> What I would do is: sample by default on the falling edge, and allow adjusting it by a configurable number of f_sync half-clock-cycles.
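A small sketch of what "a configurable number of f_sync half-clock-cycles" could look like as a helper; the function name, the 48MHz default, and the round-up policy are assumptions for illustration:

```python
import math

# Convert a desired RX sample delay into f_sync half-clock-cycles,
# rounding up so the sample never lands earlier than requested.
def sample_delay_half_cycles(desired_delay_ns: float, f_sync_hz: float = 48e6) -> int:
    half_cycle_ns = 1e9 / (2 * f_sync_hz)            # ~10.4 ns at 48 MHz
    return math.ceil(desired_delay_ns / half_cycle_ns)

print(sample_delay_half_cycles(16.6))  # 2 half-cycles, i.e. ~20.8 ns
```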
redstarcomrade has joined #glasgow
<purdeaandrei[m]> The only issue is that the final bit (or final nibble, or whatever) would then need to be sampled without a clock cycle being generated. If that is associated with an o_stream transfer, then we'd be wasting time between bytes (i.e. the last clock cycle will be 1 f_sync period longer). Therefore we must avoid the additional o_stream transfer.
<purdeaandrei[m]> There are two options:
<purdeaandrei[m]> 1) remember that we still need an i_en transfer and perform it on the next clock cycle only if needed, and do an additional o_stream transfer only when we're doing the dummy transfer to deselect the chip.
<purdeaandrei[m]> + IOStreamer would need to support a configurable sample delay
<purdeaandrei[m]> ^I don't like this option
<purdeaandrei[m]> 2) IOStreamer must support a configurable sample delay anyway. So just use the rising edge to launch a sample request, but default the sample delay to half a SPI clock cycle length.
Eli2_ has joined #glasgow
tec3 has joined #glasgow
kitaleth_ has joined #glasgow
siriusfox_ has joined #glasgow
redstarcomrade has quit [Read error: Connection reset by peer]
tec has quit [*.net *.split]
whitequark[cis] has quit [*.net *.split]
siriusfox has quit [*.net *.split]
kitaleth has quit [*.net *.split]
Eli2 has quit [*.net *.split]
kitaleth_ is now known as kitaleth
tec3 is now known as tec
<_whitenotifier-1> [glasgow] purdeaandrei opened pull request #684: Reapply IOStreamer DDR sampling fix, and also fix QSPI - https://github.com/GlasgowEmbedded/glasgow/pull/684
<purdeaandrei[m]> Alright, I have a proposed fix here: https://github.com/GlasgowEmbedded/glasgow/pull/684
<purdeaandrei[m]> consider it an RFC cause I realise how complex it is
whitequark[cis] has joined #glasgow
<purdeaandrei[m]> the message I'm replying to is fixed in this commit: https://github.com/GlasgowEmbedded/glasgow/pull/684/commits/f6caa350bcd82b2e0b3fbbe7f25ef0430e0640f2
<whitequark[m]> noted; thank you for the investigative work
<purdeaandrei[m]> the problem mentioned in the message I'm replying to is fixed in this commit: https://github.com/GlasgowEmbedded/glasgow/pull/684/commits/4d7d3aea7156dea8539672f133bb52402f8de23b (which is the same as my initial fix attempt but is incomplete)
<purdeaandrei[m]> And 48MHz works with this commit, since the sample delay is automatically set to 1 half-cycle if the frequency is above 24MHz: https://github.com/GlasgowEmbedded/glasgow/pull/684/commits/f8ca20e0c00033351995a5eec6ebda453f6e0a7f
<purdeaandrei[m]> I chose not to do this to minimize logic in the IOStreamer. Low frequencies will not get an additional sample delay; they will continue sampling on the rising edge.
<purdeaandrei[m]> whitequark (@_discord_182174208896401419:catircservices.org) you now may be able to go above 60MHz with this changeset. Do you have a PLL-enabled version in some branch somewhere so I can play with it?
<whitequark[m]> nope, I kinda just reproduce the support code by hand whenever I want to try it... which isn't much good for you of course
<whitequark[m]> I'm gonna be on a train for two hours in a bit, let's discuss the conceptual approach then
<whitequark[m]> i'm available
<whitequark[m]> i'm also incredibly sleep deprived, so if i seem slow to understand, that's why
<whitequark[m]> oh... for some reason I was convinced they output data on the previous rising edge (think the same way that a posedge clock domain inside an FPGA would work)
<whitequark[m]> you are right of course
<whitequark[m]> I measured this; iirc glasgow level shifters add about 3 ns of delay one-way
<whitequark[m]> the datasheet is actually very generous on this
<whitequark[m]> oh, wrong table
<whitequark[m]> 3 ns for 3V3 to 3V3 is right in the middle of the datasheet range, so it seems reasonable and I probably didn't misremember
<whitequark[m]> I was trying really hard to avoid this, which is why only modes 0 and 3 are supported
<whitequark[m]> yeah, this seems reasonable (if somewhat counterintuitive), and is actually in line with my original design goal for IOStreamer/IOClocker, which is to sample and update on the same edge. I guess I just picked the wrong edge
<whitequark[m]> actually, the flashes seem to do something different for the very first bit, which is updated at the falling edge of CS# and not SCLK
<purdeaandrei[m]> After I wrote that, and realized it would be painful to implement, I fell back on sampling on the rising edge + configurable delay
<whitequark[m]> I see
<whitequark[m]> why would it be painful to implement again?
<whitequark[m]> the last nibble?
<purdeaandrei[m]> yes
<purdeaandrei[m]> when back-to-back transfers take place, basically you'd have to sample the last nibble of the previous transaction on the same clock cycle on which you're launching the first nibble of the new transaction
<purdeaandrei[m]> + when we're done and want to deselect that chip, you'd also want to sample when doing the "dummy cycle"
<purdeaandrei[m]> (the one that isn't generating a clock)
<purdeaandrei[m]> this way, by launching the sample on the rising edge, we could still end up sampling on the falling edge, if we set a sample delay equal to half of the spi clock
<purdeaandrei[m]> but at slow speeds that would add a lot of skid buffer and delay logic in iostreamer
<whitequark[m]> I am uncomfortable with the added complexity of the sample delay
<purdeaandrei[m]> and there's no reason to do it at slow speeds
<purdeaandrei[m]> it is how other solutions do it, see the RP2040 datasheet
<purdeaandrei[m]> see this section:
<whitequark[m]> yeah, am looking at it
<purdeaandrei[m]> the waveform is a little wrong cause it shows rxd_slv responding to the positive edge
<whitequark[m]> ok, DW_apb_ssi is an industry standard solution, I have to seriously consider what they're doing
<whitequark[m]> DesignWare cores are usually pretty good
<purdeaandrei[m]> Yeah, I've seen it in other devices as well
<whitequark[m]> so this is basically a digital only delay line
<whitequark[m]> that relies on ... not oversampling, but the ability to oversample
<whitequark[m]> (since with oversampling you'd usually consider multiple samples, while here only one is used)
<purdeaandrei[m]> What's better about my current implementation is that it allows the sample point to be specified with half-clock-cycle precision when using ratio=2. It makes it more precise
<purdeaandrei[m]> Yeah, it's literally just a delay, where the sampling happens later than the event that's causing it
<purdeaandrei[m]> But if you really don't like the complexity added by that, then we can just have clock-cycle-only granularity, and it ends up as a really simple change that only adds the value to the latency
<whitequark[m]> the RP2040 sample delay is configured to provide up to 256 cycles of delay
<whitequark[m]> that's 256 register stages and a 256-input mux (?!)
<whitequark[m]> how do they do it without spending a ton of silicon area
<whitequark[m]> the reason I'm looking at it this way is because IOStreamer should become a part of amaranth-stdio, and QSPIController should likewise
<whitequark[m]> this means that they have to be considerate of non-Glasgow implementations (including ASIC ones) where the delay would have to be configurable
<purdeaandrei[m]> well, in our case we don't really need one
<whitequark[m]> I guess it's just a 256x8 BRAM
<purdeaandrei[m]> aham, got it
<purdeaandrei[m]> they may have an SSI-specific implementation
<whitequark[m]> this is also why I'm scrutinizing your PRs particularly closely, besides my usual attention to detail that goes into Glasgow; Glasgow is being used as a testbed for amaranth-stdio like this
<purdeaandrei[m]> IOStreamer is much more generic
<whitequark[m]> but they do need to store 256 cycles worth of data somewhere, no?
<purdeaandrei[m]> maybe they just need a timer of when to start the shift register.
<whitequark[m]> unless there's a restriction on RX_SAMPLE_DLY <= SCKDV or something
<whitequark[m]> I am thinking that your approach is probably the right one, but I want to restructure things in a particular way
<whitequark[m]> given how complex IOStreamer is getting, I think we would benefit from reorganizing it into multiple individually testable units
<whitequark[m]> the way I'm thinking I would approach it is by making a much simpler component which has always_valid/always_ready streams that merely sample the pin and do nothing else, and do everything else using stream transformers
<whitequark[m]> that would also make simulating DDR inputs a lot less messy
<whitequark[m]> s/inputs/scenarios/
<whitequark[m]> I guess there would be a skid buffer / queue module, an RX delay module, the core component, and the toplevel
<whitequark[m]> also, I am wondering whether the "bus hold" function that's currently a part of IOStreamer should not be combined with the "cycle stretch" function that is currently a part of IOClocker
<purdeaandrei[m]> here's another device using the same designware ip
<purdeaandrei[m]> see this note:
<whitequark[m]> (I am a little surprised by how complex solving this problem generally turns out to be)
<purdeaandrei[m]> looks like they're using shift registers
<purdeaandrei[m]> but because it's not a generic iostreamer it doesn't need a skid buffer
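A minimal Amaranth sketch of the shift-register flavour of this idea, delaying the sample-request strobe rather than buffering data; the module and signal names are hypothetical and this is not the PR's implementation:

```python
from amaranth import Cat, Elaboratable, Module, Signal

class SampleDelayStrobe(Elaboratable):
    """Delay a one-cycle sample-request strobe by a runtime-configurable
    number of sync clock cycles, using a plain shift register of strobes."""
    def __init__(self, max_delay: int):
        self.max_delay = max_delay
        self.delay = Signal(range(max_delay + 1))   # configured RX sample delay
        self.i_en  = Signal()                       # undelayed sample request
        self.o_en  = Signal()                       # delayed sample request

    def elaborate(self, platform):
        m = Module()
        # taps[k] holds i_en delayed by k+1 cycles; delay == 0 bypasses the line
        taps = Signal(max(self.max_delay, 1))
        m.d.sync += taps.eq(Cat(self.i_en, taps[:-1]))
        with m.If(self.delay == 0):
            m.d.comb += self.o_en.eq(self.i_en)
        with m.Else():
            m.d.comb += self.o_en.eq(taps.bit_select(self.delay - 1, 1))
        return m
```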
<purdeaandrei[m]> by the way, we should make sure that if IOStreamer's i_stream is always ready, no skid buffer is instantiated
<whitequark[m]> that would be fairly rare I think, given that the inputs eventually go into some FIFO
<whitequark[m]> ok, so it might not actually be 256 deep
<whitequark[m]> The threshold value allows you to provide early indication to the processor that the receive FIFO is nearly full. A receive... (full message at <https://catircservices.org/_matrix/media/v3/download/catircservices.org/dKLTzsVWbjhwZhpjZUJNbksd>)
<whitequark[m]> huh, so DW didn't try to solve the problem the same way we're doing
<whitequark[m]> I feel that it's unacceptable to lose data, hence the skid buffer and somewhat complex out-to-in readiness logic
<purdeaandrei[m]> Haven't read that part yet, I'm slow 🙂 I was trying to figure out what the depth was on RP2040 but it doesn't seem documented. I would have to do an experiment to figure it out. Anyway, I doubt most devices will actually implement a depth of 256. What should be done is to have it be parametrizable, and the user can make a calculation based on the maximum system frequency and maximum analog round-trip delay that needs to be supported.
<whitequark[m]> yeah
<whitequark[m]> RP2040 docs are weird sometimes
<purdeaandrei[m]> Let me read what you wrote above about the restructuring...
<whitequark[m]> there is a lot of stuff about the current structure that feels very hacky or ad-hoc, with lots of conditionals and complex custom connections and stuff
<whitequark[m]> this is the best I could do and I wanted to ship it, but I'm unhappy with it
<whitequark[m]> it's actually like 3rd or 4th iteration, I've scrapped everything before
<purdeaandrei[m]> I think the only way it can lose data is if the software drives the SSI to keep reading more and more without actually extracting the data.
<purdeaandrei[m]> i.e. the flow control must be done in software
<whitequark[m]> you could have this happen with DMA for example, so I feel the added safety is worth it
<whitequark[m]> I'm thinking that QSPIFramer should maybe actually output a pair of samples to put on the bus, with both clock polarities
<whitequark[m]> so half of IOClocker becomes a part of QSPIFramer, and another half of IOClocker becomes IOBaudGen or something
<purdeaandrei[m]> Is the RX delay any different from the buffer latency? (not counting my half-cycle RX delay)
<whitequark[m]> buffer latency is fixed but RX delay is configurable
<whitequark[m]> I'm wondering if you could get away with less skid buffer than your max RX delay
<purdeaandrei[m]> yes, of course
<whitequark[m]> during the RX delay period after a sample has been requested, you (by definition) cannot have any samples, so it feels like there's no need to allocate that much skid buffer space
<purdeaandrei[m]> the skid buffer will be full, if you start by issuing a sample request on every cycle, and after a while i_stream suddenly becomes non-ready
<purdeaandrei[m]> but during the rx delay period you can always send more sample requests
<purdeaandrei[m]> I think any delay would need a matching skid buffer size here
<whitequark[m]> you only need skid buffer space for pipelining in the system, not the delay
<whitequark[m]> so basically, the thought here is that any protocol that wishes to work in DDR mode, should be producing 2-lane streams
<whitequark[m]> then, when the cut down IOStreamer is configured in SDR mode, it accepts 1-lane streams, and there needs to be a down-converter for the 2-lane streams produced by QSPIFramer
<whitequark[m]> when the cut down IOStreamer is configured in DDR mode, it just accepts them as-is
<purdeaandrei[m]> but the RX delay can also be pipelined
<purdeaandrei[m]> imagine very long transmission lines many meters long
<purdeaandrei[m]> the SPI chip is far far away
<purdeaandrei[m]> you could send all the clocks you need
<purdeaandrei[m]> and then get back the data stream
<purdeaandrei[m]> since it can be pipelined, there needs to be guaranteed space for it in the skid buffer before sending it
<whitequark[m]> I think this would avoid some of the kinda nightmarish data munging that is currently a part of the interface
<whitequark[m]> that won't work
<whitequark[m]> you basically can't make SPI work in an environment like that. if you really need to, you have to use special LVDS converters
<purdeaandrei[m]> my point is a theoretical one, i'm not trying to build this
<whitequark[m]> mmm
<purdeaandrei[m]> I don't see why the skid buffer could be smaller
<gruetzkopf> a problem i've run into was running a SPI link over RS422 transceivers. they were fast enough, but propagation delay was an issue
smkz has quit [Quit: smkz]
<whitequark[m]> the idea here is that RX delay cannot be more than the period
<whitequark[m]> as you point out, you could theoretically construct a system where that is desirable, but we don't need to support those systems
<whitequark[m]> with the limitation I propose, I think you can save on the skid buffer space
smkz has joined #glasgow
<purdeaandrei[m]> The current round-trip time on the Glasgow is more than 10ns. If your 60MHz limit was because of the sample point, then it could be around 16ns. The chip I'm using can run at 100MHz, which is a 10ns period, so that's already longer than a period
<whitequark[m]> we could make it two periods
<whitequark[m]> that adds one entry to the skid buffer I think?
<whitequark[m]> it's something like ... (IO buffer latency) + (RX delay / CLK period)
<whitequark[m]> if I'm thinking about this right
<whitequark[m]> I guess the problem then becomes that if you allow it to be two periods, I think you can't do a simple implementation like a counter
<purdeaandrei[m]> you mean sync clock period and maximum supported RX delay?
<whitequark[m]> can you rephrase that?
<whitequark[m]> what I'm thinking is that there can be no more "interesting" samples in flight than (RX delay / CLK period)
<purdeaandrei[m]> with the current implementation in my branch, the skid buffer size will be IO_Buffer_latency + (maximum_RX_sample_delay (nanoseconds) / system clock period (ns))
<whitequark[m]> because there is at most one "interesting" sample per CLK period
<whitequark[m]> yes, which is actually the same as my number exactly
<whitequark[m]> so I think we might be agreeing?
<purdeaandrei[m]> yes, okay, so you're not necessarily talking about redesigning how the skid buffer works; you're just saying that the maximum_RX_sample_delay, as measured in system clock cycles, doesn't need to be large
<purdeaandrei[m]> which I agree
<purdeaandrei[m]> But ultimately it doesn't need to be our choice what the maximum is
<purdeaandrei[m]> We can offer a sensible default for the maximum
<purdeaandrei[m]> but we can also let it be customizable
<purdeaandrei[m]> if somebody wants more or less
<purdeaandrei[m]> (and for glasgow, the command line argument could continue to allow picking the right size)
<whitequark[m]> I didn't specify units, so it was ambiguous
<purdeaandrei[m]> but I don't think a counter can work here
<purdeaandrei[m]> as long as i_stream is free to push back
<whitequark[m]> ah actually no it's not the same
<whitequark[m]> different clock periods
<purdeaandrei[m]> I guess in the case of DW SSI, they rely on software to never issue too many reads, so they don't have push back
<purdeaandrei[m]> by 2-lane stream here you mean a single stream that transfers both halves of an SPI clock cycle in a single system clock cycle?
<whitequark[m]> yes
<purdeaandrei[m]> if you meant the SPI clock period, then what you mean is that we already know the IOClocker will rate-limit the number of sample requests it can send, and that is why we can be sure the number of in-flight sample requests is limited. Then indeed, we could potentially allow a larger sample delay (in terms of system clocks) without increasing the skid buffer size by the same number. However, that puts an additional constraint on the stream interface: if we implement it like that, then IOClocker MUST observe that rate limit
<purdeaandrei[m]> otherwise data loss will occur
<purdeaandrei[m]> so that becomes part of the contract
<purdeaandrei[m]> So IOStreamer with 2 cycles of DDR buffer latency and 4 cycles of rx_sample_delay would, in the most general case, need a skid buffer depth of 6; however, if the user has a rate limit, then it can get away with a smaller skid buffer
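A back-of-the-envelope helper for the sizing rule stated above (depth = buffer latency + maximum sample delay, both in sync clock cycles); purely illustrative, the real parametrization lives in IOStreamer:

```python
# Worst case: a sample request may be issued every cycle, so every
# in-flight request needs a slot in case i_stream stalls.
def skid_buffer_depth(buffer_latency_cycles: int, max_sample_delay_cycles: int) -> int:
    return buffer_latency_cycles + max_sample_delay_cycles

assert skid_buffer_depth(2, 4) == 6   # the DDR example from the discussion
```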
<whitequark[m]> like I said elsewhere, I think the components as a whole should be restructured
<whitequark[m]> this will allow maintaining a reasonable contract by using slightly different interfaces
<whitequark[m]> I think the individual pieces probably should not each try to guarantee that no data is lost; the peripheral itself must guarantee that their assembly is correct
<purdeaandrei[m]> How do you think this should go, should I add reorganizing commits on top of my current branch?
meklort_ has joined #glasgow
<purdeaandrei[m]> I would prefer not to start reorganizing the current IOStreamer on main, because it's broken in two ways
meklort has quit [Ping timeout: 248 seconds]
meklort_ is now known as meklort
<purdeaandrei[m]> well, three
<purdeaandrei[m]> the DDR sample point is wrong, there is a condition in which a sample could be lost, and it can deadlock with certain istream consumers
<purdeaandrei[m]> I would prefer to start from a working base, perhaps add a DDR QSPI test, and go from there
<purdeaandrei[m]> and rerun tests to confirm that each refactoring step keeps existing functionality
<purdeaandrei[m]> going the opposite way could be error prone (i.e. start by reorganizing the broken IOStreamer with non-functional refactoring changes, followed by fixing all the problems again on the new organization)
<_whitenotifier-1> [glasgow] purdeaandrei synchronize pull request #672: applet.program.psoc1: implemented - https://github.com/GlasgowEmbedded/glasgow/pull/672
<whitequark[m]> I'm thinking I would just start rewriting it from scratch
<whitequark[m]> almost every piece needs replacement
<purdeaandrei[m]> umm, I'm not confident enough to rewrite it from scratch, it's too complex for that
<whitequark[m]> I feel like it would take far too much effort to replace it piece-by-piece when I wrote the current version in a few days anyway (like 3 or 4?)
<whitequark[m]> there are a few parts that I feel are completely wrong in terms of design, like handling of DDR
<whitequark[m]> I think i[0]/i[1] should not be exposed as-is (i.e. with the same interface as in io.Buffer), but rather that the logic should be written in terms of processing _n_ samples per clock cycle, where _n_ can be 2, 1, <1
<whitequark[m]> I'm also thinking about HyperRAM, which would have a complicated DDR-to-DDR transformation which also has to phase shift the clock
<whitequark[m]> that transformation is much more easily expressed in terms of lane conversion (expand 2 lanes to 4, phase shift the clock, then narrow 4 lanes down to 2 or 1 as appropriate), which can be done with generic stream processing components
<whitequark[m]> then, on top of all that, you could have an RX sample delay, which could operate in terms of a single lane, or every lane, or maybe even half the lanes as a compromise. this is actually just another generic stream processing component, I think? not sure about this one
andymandias has quit [Remote host closed the connection]
<whitequark[m]> - IOStreamer: now has always_ready=True for stream of outputs, always_valid=True for stream of inputs, and does not handle either requesting sampling, or looping back ancillary data. all it does is pack io.Buffer inputs/outputs into streams with lanes, like io_streamer.o_stream.p.lane[1].io2.oe
<whitequark[m]> - IOSampler: has always_ready=True incoming stream of incoming samples, normal stream of sample requests, and normal stream of outgoing samples coupled with loopback ancillary data. also has an internal skid buffer and an adjustable delay (combined in their implementation, I think?)
<whitequark[m]> (it's not completely clear how to maintain the alignment of incoming samples with sample requests here, I'd have to think more about it)
<whitequark[m]> - IOStretcher: has normal incoming stream of updates, always_valid=True outgoing stream of outputs, and outgoing normal stream of sample requests
<whitequark[m]> so if you assemble the three, you have incoming o+oe+ancillary data+sample enable, outgoing i+ancillary data, and a way to adjust the sampling point
<whitequark[m]> but you can test most of the logic separately from any actual pins, and the output path separately from the input path
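A very rough sketch, using amaranth.lib.stream and amaranth.lib.wiring, of the stream shapes just described: a pin-facing core that never back-pressures its output stream and never withholds its input stream, with one payload lane per sample in a sync cycle (2 for DDR, 1 for SDR). All names here are hypothetical, not the PR or amaranth-stdio:

```python
from amaranth.lib import data, stream, wiring
from amaranth.lib.wiring import In, Out

def o_lanes(width, ratio):
    # per-lane output payload: pin data plus output enable
    return data.ArrayLayout(data.StructLayout({"o": width, "oe": 1}), ratio)

def i_lanes(width, ratio):
    # per-lane input payload: sampled pin data
    return data.ArrayLayout(data.StructLayout({"i": width}), ratio)

class SimpleIOStreamerSignature(wiring.Signature):
    def __init__(self, width=4, ratio=2):
        super().__init__({
            # outputs come in and are always accepted (always_ready=True)
            "o_stream": In(stream.Signature(o_lanes(width, ratio), always_ready=True)),
            # inputs go out and are always offered (always_valid=True)
            "i_stream": Out(stream.Signature(i_lanes(width, ratio), always_valid=True)),
        })
```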
<purdeaandrei[m]> I think I liked what I thought the previous idea was better: have a SimpleIOStreamer that is always_valid and always_ready and always samples, and connect its i_stream to something that implements the skid buffer and throws away samples when sampling wasn't requested.
<whitequark[m]> that's what I'm describing, no?
<purdeaandrei[m]> then I misunderstood
<whitequark[m]> how did you understand my explanation? it may not have been very clear
<purdeaandrei[m]> oh I read "does not handle either requesting sampling" and got confused, but that's all my fault
<purdeaandrei[m]> so IOSampler is the "something" I was describing
<whitequark[m]> yes
<whitequark[m]> i think the stream of sample requests might need to be always_valid=True
<whitequark[m]> this avoids any desynchronization
<whitequark[m]> when you reset the whole thing, it knows what the phase offset is, and then, since it transfers one sample request [i_en] and one sample per cycle, it will always stay in sync
<whitequark[m]> if the stream of sample requests stops (because of lost readiness), it means IOStretcher can submit no more updates, and IOSampler can fill and then empty its skid buffer without fear of data loss
<whitequark[m]> i guess technically that would be OStretcher and ISampler
<purdeaandrei[m]> I think I understand the big picture
<whitequark[m]> excellent
<purdeaandrei[m]> My issue is: thinking through all the corner cases of a design written from ground up feels painful
<whitequark[m]> it's not totally clear to me yet where the lane conversion would go in, but I think IOStretcher and IOSampler should operate on the same width as the underlying IOStreamer, so 2 lanes for DDR, 1 lane for SDR
<purdeaandrei[m]> however I already understand the corner cases of the current IOStreamer with my patches
<purdeaandrei[m]> if I refactor that I can make sure that each change I make does not modify behavior
<purdeaandrei[m]> and then I can be confident of the result
<whitequark[m]> I have the complementary problem, which is that I somewhat lost the understanding of corner cases with your patches
<whitequark[m]> part of it is that we just haven't worked together for a while yet, part of it is that I've been sleep deprived constantly due to various issues and it's just difficult to do good knowledge work in this state
<whitequark[m]> however, it is my hope that there would be fewer corner cases with a better design
<whitequark[m]> for example not having to worry about all the various DDR/SDR combinations and only think about lanes seems rather nice
<purdeaandrei[m]> oh, one thing about the lanes: there is a single OE for both lanes, I wonder how to handle that
<whitequark[m]> hmm
<whitequark[m]> the way protocols work is that there is a turnaround period of at least one full clock cycle
<whitequark[m]> so we can just OR them together in IOStreamer
<purdeaandrei[m]> but we're wasting shift registers on an extra OE
andymandias has joined #glasgow
<whitequark[m]> wasting where?
<purdeaandrei[m]> oh, wait, sorry, OE isn't delayed, only meta, and i_en, ignore me
andymandias has quit [Remote host closed the connection]
<whitequark[m]> and IOStretcher only stores one lane worth of state anyway
<whitequark[m]> I think there are actually some FPGAs which let you set OE per-bit
<purdeaandrei[m]> yeah, it's certainly possible to implement that. Then Amaranth would need a new type of DDRBuffer for that.
<whitequark[m]> I imagine there would be a fair amount of custom IOStreamer like cores for things like OSERDES in Xilinx
Wanda[cis] has joined #glasgow
<Wanda[cis]> Xilinx FPGAs are capable of supporting DDR on OE too
<Wanda[cis]> ugh netsplit
andymandias has joined #glasgow
mwkmwkmwk[m] has joined #glasgow
<whitequark[m]> considering IOStretcher, I'm thinking it would have two separate data paths with a mux:
<whitequark[m]> - divisor==0 path, which completely bypasses the entire core (I like how clean this is)
<whitequark[m]> - divisor!=0 path, which first down-converts any multi-lane input to single-lane, and stretches the individual lanes as appropriate, and then up-converts back to multi-lane if needed
<whitequark[m]> this also naturally results in the behavior where divisor==0 for SDR IOStretcher does the same thing as divisor==1, I think
<mwkmwkmwk[m]> OSERDES are... weird, I don't remember all the combinations and devices, but on some devices you can do full per-bit OE up to 1:4 serialization, while for higher rates you're on your own and can do only comb OE or some shit like that
<purdeaandrei[m]> I'm confused here. divisor is something the IOClocker (or IOBaud) has; IOStretcher, as described above, only extends the last state
<whitequark[m]> IOStretcher in the description above would have to take on extending the state because there's nothing to output, and also extending the state because of divisor / baud rate
<whitequark[m]> sorry, that was confusing
<whitequark[m]> one change here is that clk_en is gone, meaning there's no way to just quickly toggle CS# or something (well, not without altering the baud rate)
<whitequark[m]> so every update to the bus is going to be valid for one half-period or more
<whitequark[m]> one thing I didn't like about IOClocker is that it is a very generic primitive that at the same time encodes fairly specific approach to clocking a peripheral
<whitequark[m]> i.e. at 0 degree offset. for HyperRAM, which needs 90 degree offset, you'd have to rewrite it
<whitequark[m]> I want to separate the "ensure the transitions on the bus are slow enough" function from the "toggle the clock" function
<whitequark[m]> this is related to moving the responsibility of generating DDR waveforms or 2-lane stream, together with the clock toggling itself, to QSPIEnframer; if IOClocker no longer has to clock anything, it is simplified, doesn't need the weird interface with clk_en, and can focus on just stretching the bit times
<whitequark[m]> does this make sense?
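A toy software model (not HDL, hypothetical names) of the idea that the enframer outputs a pair of bus states per bit, with both clock polarities, so that clock toggling no longer lives in IOClocker; mode 0, quad data:

```python
# One SPI bit time as two lanes: SCK low with data already driven,
# then SCK high with a sample request on the rising edge.
def frame_bit(io_nibble: int, sample: bool = True) -> list[dict]:
    return [
        {"sck": 0, "io": io_nibble, "i_en": False},
        {"sck": 1, "io": io_nibble, "i_en": sample},
    ]

# e.g. framing one byte (two nibbles) in quad mode:
lanes = [lane for nibble in (0xA, 0x5) for lane in frame_bit(nibble)]
print(lanes)
```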
\\ has quit [Quit: bye]
\\ has joined #glasgow
redstarcomrade has joined #glasgow
l0rds474n[m] has joined #glasgow
<l0rds474n[m]> Just got my Glasgow C3; is there some tutorial on how to use it besides the tiny manual on the site?
redstarcomrade has quit [Read error: Connection reset by peer]
redstarcomrade has joined #glasgow
<whitequark[m]> nope, that's it
puck has quit [Remote host closed the connection]
puck has joined #glasgow