whitequark[cis] changed the topic of #glasgow to: https://glasgow-embedded.org · digital interface explorer · https://www.crowdsupply.com/1bitsquared/glasgow · code https://github.com/GlasgowEmbedded/glasgow · logs https://libera.irclog.whitequark.org/glasgow · matrix #glasgow-interface-explorer:matrix.org · discord https://1bitsquared.com/pages/chat
redstarcomrade has joined #glasgow
purdeaandrei[m] has joined #glasgow
<purdeaandrei[m]> Understood; sorry, for some reason I thought you said you would test that with a real chip, and when I saw it merged I assumed everything was okay. I hope it didn't derail the demo.
whitequark[m] has joined #glasgow
<whitequark[m]> I completely forgot about it, it seems... I've been overloaded and stressed recently, so not surprising
<whitequark[m]> the demo went okay since I accounted for the possibility of this and had a few hours to drill into all the issues
<whitequark[m]> I think I might've tested one of the PRs but not both
abby has quit [Ping timeout: 252 seconds]
\\ has quit [Ping timeout: 276 seconds]
<purdeaandrei[m]> I'm trying out 48MHz now with the current main branch, but I think my wires are too long; it's not reliable
<purdeaandrei[m]> I have something like 20cm of wire, a SOIC clip (floating in air; checked with a multimeter that it makes contact with each pin), and a small adapter
<purdeaandrei[m]> 10cm + SOIC ZIF socket also not working right at 48MHz, but works at 24MHz
<whitequark[m]> how does it fail? ff's after a while?
<purdeaandrei[m]> 0s after a while
<purdeaandrei[m]> I'll try to solder down the chip, I just want to double-check: 10cm shouldn't be too long, right?
<whitequark[m]> twist a ground wire with the clock wire
<whitequark[m]> and another one with a cs wire
<whitequark[m]> that usually does the trick
<purdeaandrei[m]> re-reading my old messages 🙂
meklort has quit [Quit: ZNC 1.7.5+deb4 - https://znc.in]
tec has joined #glasgow
meklort has joined #glasgow
<purdeaandrei[m]> Alright, so 25x SPI flash chips sample on the rising edge and output data on the falling edge
<purdeaandrei[m]> As far as I can tell from comments in the code, IOClocker is designed such that it always starts outputting data on a falling edge and samples on the rising edge
<purdeaandrei[m]> This is (mostly) true, except that in DDR mode it doesn't abstract away whether i[1] or i[0] should be read
<purdeaandrei[m]> IOClocker in DDR mode with divisor != 0 puts the rising-edge sample in i[0]
<purdeaandrei[m]> but in DDR mode with divisor == 0, the rising-edge sample ends up in i[1]
<purdeaandrei[m]> QSPIController always connects i[0] to the deframer, which is why it's wrong
<purdeaandrei[m]> The reason it works despite the wrong IOStreamer sample time is that sampling actually happens on the falling edge, but it's the same falling edge on which the SPI chip drives new data. The time for SCK to reach the chip + the chip's clock-to-output time + the time back to the IO is long enough to satisfy the FPGA's hold time requirements.
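A minimal sketch of the hold-time argument above; the helper name and all delay figures are placeholders, not measurements:

```python
# Illustrative check of the reasoning above: sampling on the "wrong" falling
# edge still works as long as the round trip (SCK out to the chip, the chip's
# clock-to-output delay, and the return path back into the FPGA IO) exceeds
# the FPGA's input hold requirement.
def falling_edge_sample_ok(sck_to_chip_ns: float, chip_clk_to_out_ns: float,
                           return_path_ns: float, fpga_hold_ns: float) -> bool:
    return sck_to_chip_ns + chip_clk_to_out_ns + return_path_ns >= fpga_hold_ns

# Placeholder numbers: ~3 ns per level shifter pass, a guessed 8 ns flash
# clock-to-output, and a sub-nanosecond hold requirement.
print(falling_edge_sample_ok(3.0, 8.0, 3.0, 0.5))  # True
```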
<purdeaandrei[m]> hmm, I started implementing a fix, there's a bit more to this than I thought...
<purdeaandrei[m]> Since 24MHz fails too
<purdeaandrei[m]> Okay, so I realized two more things:
<purdeaandrei[m]> 1) IOClocker in DDR mode but with divisor != 0 is not equivalent to IOClocker in SDR mode
<purdeaandrei[m]> because all edges are generated on the falling edge of f_sync
<purdeaandrei[m]> but data output happens on the rising edge of f_sync. So data output always happens 10ns before the falling edge, instead of at the falling edge
<purdeaandrei[m]> See this capture:
<purdeaandrei[m]> (logic analyzer ran at 500MHz, so it sometimes sees 10ns as 12ns, sometimes as 8ns)
<purdeaandrei[m]> which means that sampling also happens 10ns before the rising edge
<purdeaandrei[m]> 2) The reason 24MHz was also failing is that this leaves only 10ns of round-trip time from SCLK changing to the FPGA being able to sample the result, and 10ns is just not enough
<purdeaandrei[m]> This was my initial fix attempt for 48MHz not working, but it failed for the same reason: sampling 10ns after the falling edge is not enough.... (full message at <https://catircservices.org/_matrix/media/v3/download/catircservices.org/FAzizmtlkGRXqddBFZMUirOd>)
<purdeaandrei[m]> Note: 10ns is not enough for the chip I'm playing with right now, which is an "EON EN25F80", and in the particular environment I'm in. Other chips might have a stronger driver. Also, the Glasgow's level shifters probably add a significant amount of delay that wouldn't be there in a pure FPGA implementation.
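As a rough illustration of why a 10ns budget falls short, here is a back-of-the-envelope round-trip calculation; the clock-to-output and wiring figures are assumptions, and only the ~3ns level-shifter figure is mentioned later in the log:

```python
# Rough, illustrative round-trip budget for "10 ns is just not enough":
# the sample point must wait for SCK to reach the chip, the chip to drive
# its output, and the data to travel back through the level shifters.
level_shifter_one_way_ns = 3.0   # ballpark figure discussed later in the log
chip_clk_to_output_ns    = 8.0   # hypothetical flash clock-to-output time
wiring_round_trip_ns     = 1.0   # clip/adapter wiring, guessed

round_trip_ns = 2 * level_shifter_one_way_ns + chip_clk_to_output_ns + wiring_round_trip_ns
print(f"need ~{round_trip_ns} ns, have 10 ns -> ok: {round_trip_ns <= 10}")  # ok: False
```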
\\ has joined #glasgow
abby has joined #glasgow
redstarcomrade has quit [Read error: Connection reset by peer]
<purdeaandrei[m]> So the big question is: When should we sample?
<purdeaandrei[m]> A simple solution would be to always sample on the falling edge (like we accidentally do right now in main when running at 48MHz)
<purdeaandrei[m]> But that would not work at, for example, 96MHz in my environment. For that I would need a feature to delay the sampling of the signal
<purdeaandrei[m]> whitequark (@_discord_182174208896401419:catircservices.org) you said you tested it working up to 60MHz, but 85MHz is what the toolchain said it should work at. Assuming you tried >60MHz and it didn't work, I'd guess that 16.6ns is enough round-trip time in your environment. That is: I think >60MHz should be possible if we delay the signal sampling point.
<purdeaandrei[m]> Do we intend to support QSPI with >48MHz at some point in the future?
<purdeaandrei[m]> For an example of other designs also permitting a configurable sample time, see the RP2040 datasheet section "4.10.9.1.1. RXD Sample Delay"
<purdeaandrei[m]> What I would do is: sample by default on the falling edge, and allow adjusting it by a configurable number of f_sync half-clock-cycles.
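A small sketch of what "a configurable number of f_sync half-clock-cycles" could look like as a helper; the function name, the 48MHz default, and the round-up policy are assumptions for illustration:

```python
import math

# Convert a desired RX sample delay into f_sync half-clock-cycles,
# rounding up so the sample never lands earlier than requested.
def sample_delay_half_cycles(desired_delay_ns: float, f_sync_hz: float = 48e6) -> int:
    half_cycle_ns = 1e9 / (2 * f_sync_hz)            # ~10.4 ns at 48 MHz
    return math.ceil(desired_delay_ns / half_cycle_ns)

print(sample_delay_half_cycles(16.6))  # 2 half-cycles, i.e. ~20.8 ns
```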
redstarcomrade has joined #glasgow
<purdeaandrei[m]> The only issue is that the final bit (or final nibble, or whatever) would then need to be sampled without a clock cycle being generated. If that is associated with an o_stream transfer, then we'd be wasting time between bytes (i.e. the last clock cycle will be 1 f_sync period longer). Therefore we must avoid the additional o_stream transfer.
<purdeaandrei[m]> There are two options:
<purdeaandrei[m]> 1) remember that we still need an i_en transfer and perform it on the next clock cycle only if needed, and do an additional o_stream transfer only when we're doing the dummy transfer to deselect the chip.
<purdeaandrei[m]> + IOStreamer would need to support a configurable sample delay
<purdeaandrei[m]> ^I don't like this option
<purdeaandrei[m]> 2) IOStreamer must support a configurable sample delay anyway. So just use the rising edge to launch a sample request, but default the sample delay to half a SPI clock cycle length.
Eli2_ has joined #glasgow
tec3 has joined #glasgow
kitaleth_ has joined #glasgow
siriusfox_ has joined #glasgow
redstarcomrade has quit [Read error: Connection reset by peer]
tec has quit [*.net *.split]
whitequark[cis] has quit [*.net *.split]
siriusfox has quit [*.net *.split]
kitaleth has quit [*.net *.split]
Eli2 has quit [*.net *.split]
kitaleth_ is now known as kitaleth
tec3 is now known as tec
<_whitenotifier-1> [glasgow] purdeaandrei opened pull request #684: Reapply IOStreamer DDR sampling fix, and also fix QSPI - https://github.com/GlasgowEmbedded/glasgow/pull/684
<purdeaandrei[m]> Alright, I have a proposed fix here: https://github.com/GlasgowEmbedded/glasgow/pull/684
<purdeaandrei[m]> consider it an RFC cause I realise how complex it is
whitequark[cis] has joined #glasgow
<purdeaandrei[m]> the message I'm replying to is fixed in this commit: https://github.com/GlasgowEmbedded/glasgow/pull/684/commits/f6caa350bcd82b2e0b3fbbe7f25ef0430e0640f2
<whitequark[m]> noted; thank you for the investigative work
<purdeaandrei[m]> the problem mentioned in the message I'm replying to is fixed in this commit: https://github.com/GlasgowEmbedded/glasgow/pull/684/commits/4d7d3aea7156dea8539672f133bb52402f8de23b (which is the same as my initial fix attempt but is incomplete)
<purdeaandrei[m]> And 48MHz works with this commit, since the sample delay is automatically set to 1 half-cycle if the frequency is above 24MHz: https://github.com/GlasgowEmbedded/glasgow/pull/684/commits/f8ca20e0c00033351995a5eec6ebda453f6e0a7f
<purdeaandrei[m]> I chose not to do this to minimize logic in the IOStreamer. Low frequencies will not get an additional sample delay; they will continue sampling on the rising edge.
<purdeaandrei[m]> whitequark (@_discord_182174208896401419:catircservices.org) you now may be able to go above 60MHz with this changeset. Do you have a PLL-enabled version in some branch somewhere so I can play with it?
<whitequark[m]> nope, I kinda just reproduce the support code by hand whenever I want to try it... which isn't much good for you of course
<whitequark[m]> I'm gonna be on a train for two hours in a bit, let's discuss the conceptual approach then
<whitequark[m]> i'm available
<whitequark[m]> i'm also incredibly sleep deprived, so if i seem slow to understand, that's why
<whitequark[m]> oh... for some reason I was convinced they output data on the previous rising edge (think the same way that a posedge clock domain inside an FPGA would work)
<whitequark[m]> you are right of course
<whitequark[m]> I measured this; iirc glasgow level shifters add about 3 ns of delay one-way
<whitequark[m]> the datasheet is actually very generous on this
<whitequark[m]> oh, wrong table
<whitequark[m]> 3 ns for 3V3 to 3V3 is right in the middle of the datasheet range, so it seems reasonable and I probably didn't misremember
<whitequark[m]> I was trying really hard to avoid this, which is why only modes 0 and 3 are supported
<whitequark[m]> yeah, this seems reasonable (if somewhat counterintuitive), and is actually in line with my original design goal for IOStreamer/IOClocker, which is to sample and update on the same edge. I guess I just picked the wrong edge
<whitequark[m]> actually, the flashes seem to do something different for the very first bit, which is updated at the falling edge of CS# and not SCLK
<purdeaandrei[m]> After I wrote that, and realized it would be painful to implement, I fell back on sampling on the rising edge + configurable delay
<whitequark[m]> I see
<whitequark[m]> why would it be painful to implement again?
<whitequark[m]> the last nibble?
<purdeaandrei[m]> yes
<purdeaandrei[m]> when back-to-back transfers take place, basically you'd have to sample the last nibble of the previous transaction on the same clock cycle on which you're launching the first nibble of the new transaction
<purdeaandrei[m]> + when we're done and want to deselect that chip, you'd also want to sample when doing the "dummy cycle"
<purdeaandrei[m]> (the one that isn't generating a clock)
<purdeaandrei[m]> this way, by launching the sample on the rising edge, we could still end up sampling on the falling edge, if we set a sample delay equal to half of the spi clock
<purdeaandrei[m]> but at slow speeds that would add a lot of skid buffer and delay logic in iostreamer
<whitequark[m]> I am uncomfortable with the added complexity of the sample delay
<purdeaandrei[m]> and there's no reason to do it at slow speeds
<purdeaandrei[m]> it is how other solutions do it, see the RP2040 datasheet
<purdeaandrei[m]> see this section:
<whitequark[m]> yeah, am looking at it
<purdeaandrei[m]> the waveform is a little wrong cause it shows rxd_slv responding to the positive edge
<whitequark[m]> ok, DW_apb_ssi is an industry standard solution, I have to seriously consider what they're doing
<whitequark[m]> DesignWare cores are usually pretty good
<purdeaandrei[m]> Yeah, I've seen it in other devices as well
<whitequark[m]> so this is basically a digital only delay line
<whitequark[m]> that relies on ... not oversampling, but the ability to oversample
<whitequark[m]> (since with oversampling you'd usually consider multiple samples, while here only one is used)
<purdeaandrei[m]> What's better about my current implementation is that it allows the sample point to be specified with half-clock-cycle precision when using ratio=2. It makes it more precise
<purdeaandrei[m]> Yeah, it's literally just a delay, where the sampling happens later than the event that's causing it
<purdeaandrei[m]> But if you really don't like the complexity added by that, then we can just have clock-cycle-only granularity, and it ends up as a really simple change that only adds the value to the latency
<whitequark[m]> the RP2040 sample delay is configured to provide up to 256 cycles of delay
<whitequark[m]> that's 256 register stages and a 256-input mux (?!)
<whitequark[m]> how do they do it without spending a ton of silicon area
<whitequark[m]> the reason I'm looking at it this way is because IOStreamer should become a part of amaranth-stdio, and QSPIController should likewise
<whitequark[m]> this means that they have to be considerate of non-Glasgow implementations (including ASIC ones) where the delay would have to be configurable
<purdeaandrei[m]> well, in our case we don't really need one
<whitequark[m]> I guess it's just a 256x8 BRAM
<purdeaandrei[m]> aham, got it
<purdeaandrei[m]> they may have an SSI-specific implementation
<whitequark[m]> this is also why I'm scrutinizing your PRs particularly closely, besides my usual attention to detail that goes into Glasgow; Glasgow is being used as a testbed for amaranth-stdio like this
<purdeaandrei[m]> IOStreamer is much more generic
<whitequark[m]> but they do need to store 256 cycles worth of data somewhere, no?
<purdeaandrei[m]> maybe they just need a timer of when to start the shift register.
<whitequark[m]> unless there's a restriction on RX_SAMPLE_DLY <= SCKDV or something
<whitequark[m]> I am thinking that your approach is probably the right one, but I want to restructure things in a particular way
<whitequark[m]> given how complex IOStreamer is getting, I think we would benefit from reorganizing it into multiple individually testable units
<whitequark[m]> the way I'm thinking I would approach it is by making a much simpler component which has always_valid/always_ready streams that merely sample the pin and do nothing else, and do everything else using stream transformers
<whitequark[m]> that would also make simulating DDR inputs a lot less messy
<whitequark[m]> s/inputs/scenarios/
<whitequark[m]> I guess there would be a skid buffer / queue module, an RX delay module, the core component, and the toplevel
<whitequark[m]> also, I am wondering whether the "bus hold" function that's currently a part of IOStreamer should not be combined with the "cycle stretch" function that is currently a part of IOClocker
<purdeaandrei[m]> here's another device using the same designware ip
<purdeaandrei[m]> see this note:
<whitequark[m]> (I am a little surprised by how complex solving this problem generally turns out to be)
<purdeaandrei[m]> looks like they're using shift registers
<purdeaandrei[m]> but because it's not a generic iostreamer it doesn't need a skid buffer
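A minimal Amaranth sketch of the shift-register flavour of this idea, delaying the sample-request strobe rather than buffering data; the module and signal names are hypothetical and this is not the PR's implementation:

```python
from amaranth import Cat, Elaboratable, Module, Signal

class SampleDelayStrobe(Elaboratable):
    """Delay a one-cycle sample-request strobe by a runtime-configurable
    number of sync clock cycles, using a plain shift register of strobes."""
    def __init__(self, max_delay: int):
        self.max_delay = max_delay
        self.delay = Signal(range(max_delay + 1))   # configured RX sample delay
        self.i_en  = Signal()                       # undelayed sample request
        self.o_en  = Signal()                       # delayed sample request

    def elaborate(self, platform):
        m = Module()
        # taps[k] holds i_en delayed by k+1 cycles; delay == 0 bypasses the line
        taps = Signal(max(self.max_delay, 1))
        m.d.sync += taps.eq(Cat(self.i_en, taps[:-1]))
        with m.If(self.delay == 0):
            m.d.comb += self.o_en.eq(self.i_en)
        with m.Else():
            m.d.comb += self.o_en.eq(taps.bit_select(self.delay - 1, 1))
        return m
```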
<purdeaandrei[m]> by the way, we should make sure that if IOStreamer's i_stream is always ready, no skid buffer is instantiated
<whitequark[m]> that would be fairly rare I think, given that the inputs eventually go into some FIFO
<whitequark[m]> ok, so it might not actually be 256 deep
<whitequark[m]> The threshold value allows you to provide early indication to the processor that the receive FIFO is nearly full. A receive... (full message at <https://catircservices.org/_matrix/media/v3/download/catircservices.org/dKLTzsVWbjhwZhpjZUJNbksd>)
<whitequark[m]> huh, so DW didn't try to solve the problem the same way we're doing
<whitequark[m]> I feel that it's unacceptable to lose data, hence the skid buffer and somewhat complex out-to-in readiness logic
<purdeaandrei[m]> Haven't read that part yet, I'm slow 🙂 I was trying to figure out what the depth was on RP2040 but it doesn't seem documented. I would have to do an experiment to figure it out. Anyway, I doubt most devices will actually implement a depth of 256. What should be done is to have it be parametrizable, and the user can make a calculation based on the maximum system frequency and maximum analog round-trip delay that needs to be supported.
<whitequark[m]> yeah
<whitequark[m]> RP2040 docs are weird sometimes
<purdeaandrei[m]> Let me read what you wrote above about the restructuring...
<whitequark[m]> there is a lot of stuff about the current structure that feels very hacky or ad-hoc, with lots of conditionals and complex custom connections and stuff
<whitequark[m]> this is the best I could do and I wanted to ship it, but I'm unhappy with it
<whitequark[m]> it's actually like 3rd or 4th iteration, I've scrapped everything before
<purdeaandrei[m]> I think the only way it can lose data is if the software drives the SSI to keep reading more and more without actually extracting the data.
<purdeaandrei[m]> i.e. the flow control must be done in software
<whitequark[m]> you could have this happen with DMA for example, so I feel the added safety is worth it
<whitequark[m]> I'm thinking that QSPIFramer should maybe actually output a pair of samples to put on the bus, with both clock polarities
<whitequark[m]> so half of IOClocker becomes a part of QSPIFramer, and another half of IOClocker becomes IOBaudGen or something
<purdeaandrei[m]> Is the RX delay any different from the buffer latency? (not counting my half-cycle RX delay)
<whitequark[m]> buffer latency is fixed but RX delay is configurable
<whitequark[m]> I'm wondering if you could get away with less skid buffer than your max RX delay
<purdeaandrei[m]> yes, of course
<whitequark[m]> during the RX delay period after a sample has been requested, you (by definition) cannot have any samples, so it feels like there's no need to allocate that much skid buffer space
<purdeaandrei[m]> the skid buffer will be full, if you start by issuing a sample request on every cycle, and after a while i_stream suddenly becomes non-ready
<purdeaandrei[m]> but during the rx delay period you can always send more sample requests
<purdeaandrei[m]> I think any delay would need a matching skid buffer size here
<whitequark[m]> you only need skid buffer space for pipelining in the system, not the delay
<whitequark[m]> so basically, the thought here is that any protocol that wishes to work in DDR mode, should be producing 2-lane streams
<whitequark[m]> then, when the cut down IOStreamer is configured in SDR mode, it accepts 1-lane streams, and there needs to be a down-converter for the 2-lane streams produced by QSPIFramer
<whitequark[m]> when the cut down IOStreamer is configured in DDR mode, it just accepts them as-is
<purdeaandrei[m]> but the RX delay can also be pipelined
<purdeaandrei[m]> imagine very long transmission lines many meters long
<purdeaandrei[m]> the SPI chip is far far away
<purdeaandrei[m]> you could send all the clocks you need
<purdeaandrei[m]> and then get back the data stream
<purdeaandrei[m]> since it can be pipelined, there needs to be guaranteed space for it in the skid buffer before sending it
<whitequark[m]> I think this would avoid some of the kinda nightmarish data munging that is currently a part of the interface
<whitequark[m]> that won't work
<whitequark[m]> you basically can't make SPI work in an environment like that. if you really need to, you have to use special LVDS converters
<purdeaandrei[m]> my point is a theoretical one, i'm not trying to build this
<whitequark[m]> mmm
<purdeaandrei[m]> I don't see why the skid buffer could be smaller
<gruetzkopf> a problem i've run into was running a SPI link over RS422 transceivers. they were fast enough, but propagation delay was an issue
smkz has quit [Quit: smkz]
<whitequark[m]> the idea here is that RX delay cannot be more than the period
<whitequark[m]> as you point out, you could theoretically construct a system where that is desirable, but we don't need to support those systems
<whitequark[m]> with the limitation I propose, I think you can save on the skid buffer space
smkz has joined #glasgow
<purdeaandrei[m]> The current round-trip time on the Glasgow is more than 10ns. If your 60MHz limit was because of the sample point, then it could be around 16ns. The chip I'm using can run at 100MHz, which is a 10ns period, so that's already longer than a period
<whitequark[m]> we could make it two periods
<whitequark[m]> that adds one entry to the skid buffer I think?
<whitequark[m]> it's something like ... (IO buffer latency) + (RX delay / CLK period)
<whitequark[m]> if I'm thinking about this right
<whitequark[m]> I guess the problem then becomes that if you allow it to be two periods, I think you can't do a simple implementation like a counter
<purdeaandrei[m]> you mean sync clock period and maximum supported RX delay?
<whitequark[m]> can you rephrase that?
<whitequark[m]> what I'm thinking is that there can be no more "interesting" samples in flight than (RX delay / CLK period)
<purdeaandrei[m]> with the current implementation in my branch, the skid buffer size will be IO_Buffer_latency + (maximum_RX_sample_delay (nanoseconds) / system clock period (ns))
<whitequark[m]> because there is at most one "interesting" sample per CLK period
<whitequark[m]> yes, which is actually the same as my number exactly
<whitequark[m]> so I think we might be agreeing?
<purdeaandrei[m]> yes, okay, so you're not necessarily talking about redesigning how the skid buffer works; you're just saying that the maximum_RX_sample_delay, as measured in system clock cycles, doesn't need to be large
<purdeaandrei[m]> which I agree
<purdeaandrei[m]> But ultimately it doesn't need to be our choice what the maximum is
<purdeaandrei[m]> We can offer a sensible default for the maximum
<purdeaandrei[m]> but we can also let it be customizable
<purdeaandrei[m]> if somebody wants more or less
<purdeaandrei[m]> (and for glasgow, the command line argument could continue to allow picking the right size)
<whitequark[m]> I didn't specify units, so it was ambiguous
<purdeaandrei[m]> but I don't think a counter can work here
<purdeaandrei[m]> as long as i_stream is free to push back
<whitequark[m]> ah actually no it's not the same
<whitequark[m]> different clock periods
<purdeaandrei[m]> I guess in the case of DW SSI, they rely on software to never issue too many reads, so they don't have push back
<purdeaandrei[m]> by 2-lane stream here you mean a single stream that transfers both halves of an SPI clock cycle in a single system clock cycle?
<whitequark[m]> yes
<purdeaandrei[m]> if you meant the SPI clock period, then what you mean is that we already know the IOClocker will rate-limit the number of sample requests it can send, and that is why we can be sure the number of in-flight sample requests is limited. Then indeed, we could potentially allow a larger sample delay (in terms of system clocks) without increasing the skid buffer size by the same number. However, that puts an additional constraint on the stream interface: if we implement it like that, then IOClocker MUST observe that rate limit
<purdeaandrei[m]> otherwise data loss will occur
<purdeaandrei[m]> so that becomes part of the contract
<purdeaandrei[m]> So IOStreamer with 2 cycles of DDR buffer latency and 4 cycles of rx_sample_delay would, in the most general case, need a skid buffer depth of 6; however, if the user has a rate limit, then it can get away with a smaller skid buffer
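A back-of-the-envelope helper for the sizing rule stated above (depth = buffer latency + maximum sample delay, both in sync clock cycles); purely illustrative, the real parametrization lives in IOStreamer:

```python
# Worst case: a sample request may be issued every cycle, so every
# in-flight request needs a slot in case i_stream stalls.
def skid_buffer_depth(buffer_latency_cycles: int, max_sample_delay_cycles: int) -> int:
    return buffer_latency_cycles + max_sample_delay_cycles

assert skid_buffer_depth(2, 4) == 6   # the DDR example from the discussion
```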
<whitequark[m]> like I said elsewhere, I think the components as a whole should be restructured
<whitequark[m]> this will allow maintaining a reasonable contract by using slightly different interfaces
<whitequark[m]> I think the individual pieces probably should not each try to guarantee that no data is lost; the peripheral itself must guarantee that their assembly is correct
<purdeaandrei[m]> How do you think this should go, should I add reorganizing commits on top of my current branch?
meklort_ has joined #glasgow
<purdeaandrei[m]> I would prefer not to start reorganizing the current IOStreamer on main, because it's broken in two ways
meklort has quit [Ping timeout: 248 seconds]
meklort_ is now known as meklort
<purdeaandrei[m]> well, three
<purdeaandrei[m]> the DDR sample point is wrong, there is a condition in which a sample could be lost, and it can deadlock with certain istream consumers
<purdeaandrei[m]> I would prefer to start from a working base, perhaps add a DDR QSPI test, and go from there
<purdeaandrei[m]> and rerun tests to confirm that each refactoring step keeps existing functionality
<purdeaandrei[m]> going the opposite way could be error prone (i.e. start by reorganizing the broken IOStreamer with non-functional refactoring changes, followed by fixing all the problems again on the new organization)
<_whitenotifier-1> [glasgow] purdeaandrei synchronize pull request #672: applet.program.psoc1: implemented - https://github.com/GlasgowEmbedded/glasgow/pull/672
<whitequark[m]> I'm thinking I would just start rewriting it from scratch
<whitequark[m]> almost every piece needs replacement
<purdeaandrei[m]> umm, I'm not confident enough to rewrite it from scratch, it's too complex for that
<whitequark[m]> I feel like it would take far too much effort to replace it piece-by-piece when I wrote the current version in a few days anyway (like 3 or 4?)
<whitequark[m]> there are a few parts that I feel are completely wrong in terms of design, like handling of DDR
<whitequark[m]> I think i[0]/i[1] should not be exposed as-is (i.e. with the same interface as in io.Buffer), but rather that the logic should be written in terms of processing _n_ samples per clock cycle, where _n_ can be 2, 1, <1
<whitequark[m]> I'm also thinking about HyperRAM, which would have a complicated DDR-to-DDR transformation which also has to phase shift the clock
<whitequark[m]> that transformation is much more easily expressed in terms of lane conversion (expand 2 lanes to 4, phase shift the clock, then narrow 4 lanes down to 2 or 1 as appropriate), which can be done with generic stream processing components
<whitequark[m]> then, on top of all that, you could have an RX sample delay, which could operate in terms of a single lane, or every lane, or maybe even half the lanes as a compromise. this is actually just another generic stream processing component, I think? not sure about this one
andymandias has quit [Remote host closed the connection]
<whitequark[m]> - IOStreamer: now has always_ready=True for stream of outputs, always_valid=True for stream of inputs, and does not handle either requesting sampling, or looping back ancillary data. all it does is pack io.Buffer inputs/outputs into streams with lanes, like io_streamer.o_stream.p.lane[1].io2.oe
<whitequark[m]> - IOSampler: has always_ready=True incoming stream of incoming samples, normal stream of sample requests, and normal stream of outgoing samples coupled with loopback ancillary data. also has an internal skid buffer and an adjustable delay (combined in their implementation, I think?)
<whitequark[m]> (it's not completely clear how to maintain the alignment of incoming samples with sample requests here, I'd have to think more about it)
<whitequark[m]> - IOStretcher: has normal incoming stream of updates, always_valid=True outgoing stream of outputs, and outgoing normal stream of sample requests
<whitequark[m]> so if you assemble the three, you have incoming o+oe+ancillary data+sample enable, outgoing i+ancillary data, and a way to adjust the sampling point
<whitequark[m]> but you can test most of the logic separately from any actual pins, and the output path separately from the input path
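A very rough sketch, using amaranth.lib.stream and amaranth.lib.wiring, of the stream shapes just described: a pin-facing core that never back-pressures its output stream and never withholds its input stream, with one payload lane per sample in a sync cycle (2 for DDR, 1 for SDR). All names here are hypothetical, not the PR or amaranth-stdio:

```python
from amaranth.lib import data, stream, wiring
from amaranth.lib.wiring import In, Out

def o_lanes(width, ratio):
    # per-lane output payload: pin data plus output enable
    return data.ArrayLayout(data.StructLayout({"o": width, "oe": 1}), ratio)

def i_lanes(width, ratio):
    # per-lane input payload: sampled pin data
    return data.ArrayLayout(data.StructLayout({"i": width}), ratio)

class SimpleIOStreamerSignature(wiring.Signature):
    def __init__(self, width=4, ratio=2):
        super().__init__({
            # outputs come in and are always accepted (always_ready=True)
            "o_stream": In(stream.Signature(o_lanes(width, ratio), always_ready=True)),
            # inputs go out and are always offered (always_valid=True)
            "i_stream": Out(stream.Signature(i_lanes(width, ratio), always_valid=True)),
        })
```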
<purdeaandrei[m]> I think I liked what I thought the previous idea was better: have a SimpleIOStreamer that is always_valid and always_ready and always samples, and connect its i_stream to something that implements the skid buffer and throws away samples when sampling wasn't requested.
<whitequark[m]> that's what I'm describing, no?
<purdeaandrei[m]> then I misunderstood
<whitequark[m]> how did you understand my explanation? it may not have been very clear
<purdeaandrei[m]> oh I read "does not handle either requesting sampling" and got confused, but that's all my fault
<purdeaandrei[m]> so IOSampler is the "something" I was describing
<whitequark[m]> yes
<whitequark[m]> i think the stream of sample requests might need to be always_valid=True
<whitequark[m]> this avoids any desynchronization
<whitequark[m]> when you reset the whole thing, it knows what the phase offset is, and then, since it transfers one sample request [i_en] and one sample per cycle, it will always stay in sync
<whitequark[m]> if the stream of sample requests stops (because of lost readiness), it means IOStretcher can submit no more updates, and IOSampler can fill and then empty its skid buffer without fear of data loss
<whitequark[m]> i guess technically that would be OStretcher and ISampler
<purdeaandrei[m]> I think I understand the big picture
<whitequark[m]> excellent
<purdeaandrei[m]> My issue is: thinking through all the corner cases of a design written from ground up feels painful
<whitequark[m]> it's not totally clear to me yet where the lane conversion would go in, but I think IOStretcher and IOSampler should operate on the same width as the underlying IOStreamer, so 2 lanes for DDR, 1 lane for SDR
<purdeaandrei[m]> however I already understand the corner cases of the current IOStreamer with my patches
<purdeaandrei[m]> if I refactor that I can make sure that each change I make does not modify behavior
<purdeaandrei[m]> and then I can be confident of the result
<whitequark[m]> I have the complementary problem, which is that I somewhat lost the understanding of corner cases with your patches
<whitequark[m]> part of it is that we just haven't worked together for a while yet, part of it is that I've been sleep deprived constantly due to various issues and it's just difficult to do good knowledge work in this state
<whitequark[m]> however, it is my hope that there would be fewer corner cases with a better design
<whitequark[m]> for example not having to worry about all the various DDR/SDR combinations and only think about lanes seems rather nice
<purdeaandrei[m]> oh, one thing about the lanes: there is a single OE for both lanes, I wonder how to handle that
<whitequark[m]> hmm
<whitequark[m]> the way protocols work is that there is a turnaround period of at least one full clock cycle
<whitequark[m]> so we can just OR them together in IOStreamer
<purdeaandrei[m]> but we're wasting shift registers on an extra OE
andymandias has joined #glasgow
<whitequark[m]> wasting where?
<purdeaandrei[m]> oh, wait, sorry, OE isn't delayed, only meta, and i_en, ignore me
andymandias has quit [Remote host closed the connection]
<whitequark[m]> and IOStretcher only stores one lane worth of state anyway
<whitequark[m]> I think there are actually some FPGAs which let you set OE per-bit
<purdeaandrei[m]> yeah, it's certainly possible to implement that. Then Amaranth would need a new type of DDRBuffer for that.
<whitequark[m]> I imagine there would be a fair amount of custom IOStreamer like cores for things like OSERDES in Xilinx
Wanda[cis] has joined #glasgow
<Wanda[cis]> Xilinx FPGAs are capable of supporting DDR on OE too
<Wanda[cis]> ugh netsplit
andymandias has joined #glasgow
mwkmwkmwk[m] has joined #glasgow
<whitequark[m]> considering IOStretcher, I'm thinking it would have two separate data paths with a mux:
<whitequark[m]> - divisor==0 path, which completely bypasses the entire core (I like how clean this is)
<whitequark[m]> - divisor!=0 path, which first down-converts any multi-lane input to single-lane, and stretches the individual lanes as appropriate, and then up-converts back to multi-lane if needed
<whitequark[m]> this also naturally results in the behavior where divisor==0 for SDR IOStretcher does the same thing as divisor==1, I think
<mwkmwkmwk[m]> OSERDES are... weird, I don't remember all the combinations and devices, but on some devices you can do full per-bit OE up to 1:4 serialization, while for higher rates you're on your own and can do only comb OE or some shit like that
<purdeaandrei[m]> I'm confused here. divisor is something the IOClocker (or IOBaud) has; IOStretcher, as described above, only extends the last state
<whitequark[m]> IOStretcher in the description above would have to take on extending the state because there's nothing to output, and also extending the state because of divisor / baud rate
<whitequark[m]> sorry, that was confusing
<whitequark[m]> one change here is that clk_en is gone, meaning there's no way to just quickly toggle CS# or something (well, not without altering the baud rate)
<whitequark[m]> so every update to the bus is going to be valid for one half-period or more
<whitequark[m]> one thing I didn't like about IOClocker is that it is a very generic primitive that at the same time encodes fairly specific approach to clocking a peripheral
<whitequark[m]> i.e. at 0 degree offset. for HyperRAM, which needs 90 degree offset, you'd have to rewrite it
<whitequark[m]> I want to separate the "ensure the transitions on the bus are slow enough" function from the "toggle the clock" function
<whitequark[m]> this is related to moving the responsibility of generating DDR waveforms or 2-lane stream, together with the clock toggling itself, to QSPIEnframer; if IOClocker no longer has to clock anything, it is simplified, doesn't need the weird interface with clk_en, and can focus on just stretching the bit times
<whitequark[m]> does this make sense?
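A toy software model (not HDL, hypothetical names) of the idea that the enframer outputs a pair of bus states per bit, with both clock polarities, so that clock toggling no longer lives in IOClocker; mode 0, quad data:

```python
# One SPI bit time as two lanes: SCK low with data already driven,
# then SCK high with a sample request on the rising edge.
def frame_bit(io_nibble: int, sample: bool = True) -> list[dict]:
    return [
        {"sck": 0, "io": io_nibble, "i_en": False},
        {"sck": 1, "io": io_nibble, "i_en": sample},
    ]

# e.g. framing one byte (two nibbles) in quad mode:
lanes = [lane for nibble in (0xA, 0x5) for lane in frame_bit(nibble)]
print(lanes)
```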
\\ has quit [Quit: bye]
\\ has joined #glasgow
redstarcomrade has joined #glasgow
l0rds474n[m] has joined #glasgow
<l0rds474n[m]> Just got my Glasgow C3; is there some tutorial on how to use it besides the tiny manual on the site?
redstarcomrade has quit [Read error: Connection reset by peer]
redstarcomrade has joined #glasgow
<whitequark[m]> nope, that's it
puck has quit [Remote host closed the connection]
puck has joined #glasgow