#scopehal on 2022-05-02 — irc logs at libera.irclog.whitequark.org

2022-03-25 21:41 azonenberg changed the topic of #scopehal to: libscopehal, libscopeprotocols, and glscopeclient development and testing | https://github.com/glscopeclient/scopehal-apps | Logs: https://libera.irclog.whitequark.org/scopehal

00:15 <d1b2> <Darius> seems like a reasonable place to start

00:16 <d1b2> <Darius> if it becomes a problem then it can be looked at more later

00:22 Degi_ has joined #scopehal

00:23 Degi has quit [Ping timeout: 256 seconds]

00:23 Degi_ is now known as Degi

04:28 lethalbit has quit [*.net *.split]

04:28 lethalbit has joined #scopehal

06:04 <azonenberg> Sooo this is lovely

06:04 <azonenberg> apparently the ARWV Name command does not actually work

06:04 <azonenberg> i.e. you cannot do C1:ARWV Name,Cot

06:04 <azonenberg> or similar

06:05 <azonenberg> or C1:ARWV Name,ExpFal

06:05 <azonenberg> (cc: @mubes)

06:05 <azonenberg> What you have to do instead is C1:ARWV Index,10

06:05 <azonenberg> and first you have to do the STL? command to figure out the mapping of indexes to names

06:05 <azonenberg> lol
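
A minimal sketch of the workaround described above: query `STL?`, build a name-to-index map, then select the waveform by index. The response layout parsed here (`STL M<index>, <name>, ...`) and the sample indices are assumptions for illustration, not taken from the log or verified against an instrument; `parse_stl` and `select_builtin` are hypothetical helper names.

```python
def parse_stl(response: str) -> dict[str, int]:
    """Parse an assumed 'STL M10, ExpFal, M16, Cot' style reply into {name: index}."""
    body = response.strip()
    if body.upper().startswith("STL"):
        body = body[3:]
    fields = [f.strip() for f in body.split(",") if f.strip()]
    mapping = {}
    # Fields are assumed to alternate: index token ("M10"), then waveform name.
    for idx_field, name in zip(fields[0::2], fields[1::2]):
        mapping[name] = int(idx_field.lstrip("Mm"))
    return mapping


def select_builtin(channel: int, name: str, mapping: dict[str, int]) -> str:
    """Build the index-based command string, since selecting by name does not work."""
    return f"C{channel}:ARWV Index,{mapping[name]}"


# Made-up sample reply; real indices must come from the instrument itself.
sample = "STL M10, ExpFal, M16, Cot"
table = parse_stl(sample)
print(select_builtin(1, "Cot", table))
```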

06:30 <d1b2> <Hardkrash> RE: The memory bandwidth limit. I would prefer error and return the partial data collected. There is value for the data captured after the trigger even if the bandwidth was saturated later. If compression crosses a high water mark could padding data mux in and replace the actual data into the stream and trigger a stop of the capture?

06:32 <azonenberg> so part of the problem is, and the reason i am leaning towards an abort

06:32 <azonenberg> is that it's likely you would overflow before you even encountered the trigger

06:32 <azonenberg> i.e. while trying to fill the pretrigger buffer

06:32 <d1b2> <Hardkrash> The high watermark might also be based on the memory bandwidth as opposed to a FIFO level. e.g. if the memory hits 98% bandwidth utilization trigger a graceful termination.

06:33 <d1b2> <Hardkrash> Ahh in that case a trigger failed and the data leading up to the failure is of less value

06:33 <azonenberg> and bw utilization is hard to benchmark

06:34 <azonenberg> basically, any given cycle the memory either is ready to accept a command or not

06:34 <azonenberg> if it's ready, and i have data to push, i send that data

06:34 <azonenberg> it's possible for me to have a burst of traffic then the memory to get busy and i don't lose anything

06:34 <azonenberg> but if a refresh happens at the wrong time the same burst could overflow

06:34 <azonenberg> https://www.antikernel.net/temp/ddr-performance-2bankmachines.png

06:35 <d1b2> <Hardkrash> The case i was thinking of was post trigger compression failure.

06:35 <azonenberg> https://www.antikernel.net/temp/ddr-performance-4bankmachines.png

06:35 <azonenberg> https://www.antikernel.net/temp/ddr-performance-8bankmachines.png

06:35 <azonenberg> you can see with 2 bank machines we lose data in 2 of the fifos

06:35 <azonenberg> but we still "have" 3.6 Gbps worth of slots in which the ram is ready to accept a command and we have nothing for it to do

06:36 <azonenberg> since actual throughput is so dependent on access patterns it's very hard to determine how close you are to saturation

06:37 <azonenberg> with 4 vs 8 bank machines we lose no data, but with 4 machines the ram is available in roughly 1/3 of the slots where we're not actively writing

06:37 <azonenberg> while with 8 we're available 2/3 of the slots

06:37 <azonenberg> the other issue wrt overflows and sending partial data is that the compression is streaming

06:37 <azonenberg> and variable rate

06:37 <d1b2> <Hardkrash> is this on the host computer or the capture hardware?

06:37 <azonenberg> this is FPGA side

06:38 <azonenberg> and the only timestamp i will have is the *end* of the acquisition

06:38 <azonenberg> basically, as soon as i arm the capture i start shoving 80 Gbps into the compression blocks

06:38 <azonenberg> and write to ram in 16 separate circular buffers

06:38 <azonenberg> each one writing at its own rate depending on compressibility of the input

06:38 <azonenberg> when the acquisition is over, everything stops

06:38 <azonenberg> i will then (not yet implemented) read the fifos out in reverse order

06:39 <azonenberg> walking back until the start of the acquisition

06:39 <azonenberg> and then go forward again to send the data to the PC

06:39 <d1b2> <Hardkrash> Is there a catch on not being able to have the starting timestamp?

06:39 <azonenberg> I know when i armed the trigger

06:40 <azonenberg> i do not know a priori when $PRETRIGGER_DELAY samples before the trigger event was

06:40 <azonenberg> i have to wait until the trigger then back up

06:40 <azonenberg> without compression this is easy, you just go $MEMDEPTH samples back

06:40 <azonenberg> but with VBR compression, the only way to know how many compression blocks $MEMDEPTH samples ago was is to look at the decompressed length of every block

06:41 <azonenberg> I do have some unused bits in each 512-bit dram burst, it's possible i could store some kind of index in there to allow faster than linear search

06:41 <azonenberg> but the initial implementation will be linear

06:41 <azonenberg> the point is, though, if you drop data in the fifo

06:41 <azonenberg> you've lost timestamps for all data prior to that point

06:41 <azonenberg> you have the samples but have no idea how much data was lost

06:41 <azonenberg> even if you count how many compression blocks were lost you don't know the decompressed length

06:42 <azonenberg> i feel like trying to track all of that would be a nightmare
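
A rough sketch of the linear walk-back described above: start at the newest compressed word and sum decompressed lengths until $MEMDEPTH samples are covered. It uses the 17-bit word format described later in this log (06:45 to 06:46); the exact bit positions (flag in bit 16, two 8-bit codes with a 7-bit count each) and the function names are assumptions for illustration, not the actual FPGA implementation.

```python
def decompressed_length(word: int) -> int:
    """Samples represented by one 17-bit word (assumed layout)."""
    if (word >> 16) & 1:
        # RLE word: two codes, each with a 7-bit repetition count.
        return ((word >> 8) & 0x7F) + (word & 0x7F)
    # Verbatim word: 16 raw samples.
    return 16


def find_start(words: list[int], memdepth: int):
    """Walk backwards from the newest word.

    Returns (index, skip): the oldest word needed and how many of its
    leading samples fall outside the window, or None if the buffer is too short.
    """
    total = 0
    for idx in range(len(words) - 1, -1, -1):
        total += decompressed_length(words[idx])
        if total >= memdepth:
            return idx, total - memdepth
    return None
```

If any word is dropped, `total` no longer corresponds to elapsed time, which is the reason given above for preferring an abort over returning partial data.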

06:42 <d1b2> <Hardkrash> yea, you would have to add some other journal style metadata that is robust that tracked the samples in each block

06:42 <d1b2> <Hardkrash> is the compressed block fully compressed before going into the FIFO?

06:43 <d1b2> <Hardkrash> or more of a stream?

06:43 <azonenberg> So, the raw data per channel is 5 Gbps, provided as 8 bits at 625 MHz

06:44 <azonenberg> I do some shuffling and toggling to convert this to 16 bits at 312.5 MHz

06:44 <azonenberg> (that initial processing path has a critical path that is literally one lut and barely makes timing, lol)

06:44 <azonenberg> it's all hand floorplanned

06:44 <azonenberg> the 16 bit path at 312.5 MHz is a biiit more forgiving. That's what does the actual compression

06:45 <azonenberg> But i still have to keep my paths short and heavily pipelined

06:45 <azonenberg> The current logic is a 3 stage pipeline

06:45 <azonenberg> each 16 bit block is turned into zero or one 17-bit blocks

06:45 <azonenberg> either 1'b0, original block verbatim

06:45 <azonenberg> (if not compressible)

06:46 <azonenberg> or 1'b1 followed by two 8-bit RLE codes

06:46 <azonenberg> each one consisting of a bit and a 7-bit repetition count from 0 to 127

06:46 <azonenberg> (zero is a legal repetition count used at the end of the stream among other things)

06:46 <azonenberg> as of now, the compressor only supports compressing blocks with a single toggle within the 16-bit window

06:46 <azonenberg> or no toggles at all

06:47 <azonenberg> theoretically it's possible for some blocks with 2 toggles to be compressed by appending to a previous block and emitting a new block

06:47 <azonenberg> but that will complicate the logic more than i wanted to do at this point
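
The block format described above can be sketched as a stateless encoder and decoder for a single 16-bit block. Several details are assumptions not stated in the log: bit 0 is treated as the earliest sample, the first RLE code sits in bits 15:8, and each code is `{bit, 7-bit count}`. This sketch always emits one word per block; the real compressor can also emit zero words, which is not modelled here.

```python
def count_toggles(block: int) -> int:
    """Number of transitions between adjacent bits of a 16-bit block."""
    return bin((block ^ (block >> 1)) & 0x7FFF).count("1")


def encode_block(block: int) -> int:
    """Return a 17-bit word: flag 0 + verbatim data, or flag 1 + two RLE codes."""
    assert 0 <= block < (1 << 16)
    if count_toggles(block) > 1:
        return block  # bit 16 stays 0: not compressible, stored verbatim
    first_bit = block & 1
    run1 = 0
    while run1 < 16 and ((block >> run1) & 1) == first_bit:
        run1 += 1
    run2 = 16 - run1  # may be zero, which is a legal repetition count
    code1 = (first_bit << 7) | run1
    code2 = ((first_bit ^ 1) << 7) | run2
    return (1 << 16) | (code1 << 8) | code2


def decode_block(word: int) -> int:
    """Inverse of encode_block for a single 17-bit word."""
    if not (word >> 16) & 1:
        return word & 0xFFFF
    out, pos = 0, 0
    for code in ((word >> 8) & 0xFF, word & 0xFF):
        bit, count = code >> 7, code & 0x7F
        if bit:
            out |= ((1 << count) - 1) << pos
        pos += count
    return out
```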

06:47 <d1b2> <Hardkrash> is the output of the compression a 512bit burst or a collection of bursts?

06:47 <azonenberg> The output of the compression is a stream of 17 bit words and a valid bit at 312.5 MHz

06:48 <d1b2> <Hardkrash> and these are what would be dropped on the floor in the FIFO

06:49 <azonenberg> Not quite

06:49 <azonenberg> I take the 17 bit words and push them into a temporary working buffer made out of DFFs

06:49 <azonenberg> when i have 7 of them, i pad out to 128 bits with nine zero bits

06:49 <azonenberg> This then goes into a CDC FIFO between the capture domain and a similar-rate (i think 325 MHz) clock derived from the ram controller clock

06:50 <azonenberg> This is the buffer that drops

06:50 <azonenberg> output of that buffer then goes into an arbiter within the logic pod controller to mix the 8 128-bit streams down to one

06:50 <azonenberg> and does so in bursts of 4 words

06:51 <azonenberg> so every free cycle the arbiter picks one of the 8 fifos, then over the next 4 clocks pops 4 words from it and sends into a single output fifo

06:51 <azonenberg> (this is only done when the output fifo has space, so it cannot drop there)

06:51 <azonenberg> at this stage, i also assign dram addresses to the stream of words by concatenating the fifo base address in physical memory with the fifo pointer for that channel

06:53 <azonenberg> Then there is a second stage arbiter that pops 1x addr + 4x 128b data bursts from this buffer (choosing between the two pod subsystems, or other logic i have yet to build elsewhere in the system)

06:53 <azonenberg> reshuffles the data again from 4x 128b to 2x 256b

06:53 <azonenberg> clock domain shifts yet again

06:53 <azonenberg> and then that goes into the xilinx ddr3 controller
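
As a small illustration of the first packing step above (seven 17-bit words padded to 128 bits with nine zero bits), here is a sketch. The placement of word 0 in the least-significant bits and the padding at the top are assumptions; the log only gives the counts.

```python
WORD_BITS = 17
WORDS_PER_BLOCK = 7


def pack7(words: list[int]) -> int:
    """Pack seven 17-bit words into one 128-bit value; the top 9 bits stay zero."""
    assert len(words) == WORDS_PER_BLOCK
    block = 0
    for i, w in enumerate(words):
        assert 0 <= w < (1 << WORD_BITS)
        block |= w << (i * WORD_BITS)
    return block


def unpack7(block: int) -> list[int]:
    """Recover the seven 17-bit words from a 128-bit block."""
    mask = (1 << WORD_BITS) - 1
    return [(block >> (i * WORD_BITS)) & mask for i in range(WORDS_PER_BLOCK)]
```

Seven words use 119 bits, so 9 of every 128 bits written to DRAM carry no data in this layout.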

06:57 <d1b2> <Hardkrash> Ok i think i have it, curious of the performance impact from dropping down to 6x words per buffer and adding a counter

06:58 <d1b2> <Hardkrash> or a counter intermixed every nth packet

06:59 <azonenberg> yeah there's lots of possibilities. more a question of if the benefits are worth it

06:59 <azonenberg> note that anything we add will increase the bw required to move the same amount of data

06:59 <azonenberg> making overflows more likely

07:00 <azonenberg> in fact i am thinking of striping one more word out across the 9 padding bits

07:00 <azonenberg> so instead of having 7 words per block i have 7.5

07:00 <azonenberg> then i waste an average of one bit every 256 instead of 18

07:02 <d1b2> <Hardkrash> the alternative is the compressor would take 15 bits in and then it would not create the remainder issue.

07:02 <_whitenotifier-e> [scopehal] azonenberg pushed 1 commit to master [+0/-0/±3] https://github.com/glscopeclient/scopehal/compare/2a42a0439f70...a99f4b48c287

07:02 <_whitenotifier-e> [scopehal] azonenberg a99f4b4 - SiglentSCPIOscilloscope: initial support for function generator option. Should easily port to SDG series signal generators in the future. Fixes #581.

07:02 <_whitenotifier-e> [scopehal] azonenberg closed issue #581: Siglent: Support AWG - https://github.com/glscopeclient/scopehal/issues/581

07:03 <_whitenotifier-e> [scopehal-apps] azonenberg pushed 1 commit to master [+0/-0/±2] https://github.com/glscopeclient/scopehal-apps/compare/84e7befae689...48a47e8ee659

07:03 <_whitenotifier-e> [scopehal-apps] azonenberg 48a47e8 - FunctionGeneratorDialog: added new predefined waveforms for Siglent AWG

07:03 <d1b2> <Hardkrash> You lose the 0-127 on both positions in the compression block,

07:08 <azonenberg> That's not possible

07:08 <azonenberg> the sampling hardware generates 16 bits

07:08 <azonenberg> processing 15 would mean creating a new clock domain at 16/15 of the sampling frequency and having a messy asynchronous gearbox

07:08 <azonenberg> could be done in theory, but would be a ton of area and probably be very hard to make timing in the fpga

07:10 <d1b2> <Hardkrash> i take it that the striping at the 7 -> 7.5 word is happening in a much lower speed thus more practical.

07:11 <azonenberg> Well, 7 -> 7.5 is a simple 1:2 split

07:11 <azonenberg> i just have a toggle register to record if i'm even or odd

07:11 <azonenberg> if even, store 7 words then bits 16:8 of the 8th

07:11 <azonenberg> if odd, store bits 7:0 of the previous word and then 7 more words

07:12 <azonenberg> it's still gearboxing but far simpler than 16/15 which is a really awkward ratio

07:12 <azonenberg> which means more complex muxing

07:12 <azonenberg> i did a 72 -> 64 bit gearbox for a customer once. that wasn't too bad because they are both multiples of 8

07:12 <azonenberg> but 15 and 16 are relatively prime

07:13 <d1b2> <Hardkrash> agreed that 15 to 16 is impractical.

07:14 <azonenberg> so it's not that it's lower speed, it's that it is 2:1 gearboxing of a fairly small amount of data within a single clock domain
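
The even/odd striping discussed above could look roughly like this: 15 words go into two 128-bit blocks, with the middle word split across them. A 17-bit word has bits 16:0, so this sketch puts bits 16:8 (9 bits) into the former padding of the even block and bits 7:0 at the start of the odd block. The bit placement within each block is an assumption for illustration.

```python
WORD_MASK = (1 << 17) - 1


def pack15(words: list[int]) -> tuple[int, int]:
    """Pack fifteen 17-bit words into an (even, odd) pair of 128-bit blocks."""
    assert len(words) == 15 and all(0 <= w <= WORD_MASK for w in words)
    even = 0
    for i in range(7):
        even |= words[i] << (i * 17)
    even |= (words[7] >> 8) << 119       # upper 9 bits of the split word
    odd = words[7] & 0xFF                # lower 8 bits of the split word
    for i in range(7):
        odd |= words[8 + i] << (8 + i * 17)
    return even, odd                     # odd uses 127 bits, leaving 1 spare


def unpack15(even: int, odd: int) -> list[int]:
    """Inverse of pack15."""
    first = [(even >> (i * 17)) & WORD_MASK for i in range(7)]
    split = ((even >> 119) << 8) | (odd & 0xFF)
    last = [(odd >> (8 + i * 17)) & WORD_MASK for i in range(7)]
    return first + [split] + last
```

This accounts for 255 of every 256 bits, matching the "one bit every 256 instead of 18" figure mentioned earlier.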

07:16 <d1b2> <Hardkrash> Just had a left field idea... ECC DDR ram? 😛

07:18 <_whitenotifier-e> [scopehal] azonenberg pushed 1 commit to master [+0/-0/±1] https://github.com/glscopeclient/scopehal/compare/a99f4b48c287...bd66583b934d

07:18 <_whitenotifier-e> [scopehal] azonenberg bd66583 - Disabled "negative pulse" waveform as does not seem to work in hardware despite docs

07:18 <_whitenotifier-e> [scopehal-apps] azonenberg pushed 1 commit to master [+0/-0/±1] https://github.com/glscopeclient/scopehal-apps/compare/48a47e8ee659...854fa5781b9f

07:18 <_whitenotifier-e> [scopehal-apps] azonenberg 854fa57 - Updated to latest scopehal

07:18 <azonenberg> @hardkrash: board already exists

07:18 <azonenberg> has a 64 bit sodimm, i'm stuck with it

07:19 <azonenberg> the dram is currently clocked at 650 MT/s (325 MHz) because that's the closest to 667 MT/s (333 MHz) that i can get with the oscillator on the board

07:19 <azonenberg> i have a 200 MHz osc because i had intended to run at 800 MT/s (400 MHz), which the kintex-7 datasheet says you can do for 1.35V DDR3 in a HR bank on a kintex7 -2

07:19 <azonenberg> however the xilinx IP is refusing to let me select anything faster than 667

07:20 <d1b2> <Hardkrash> That's unfortunate

07:20 <azonenberg> one of the options i'm exploring is to generate the ip telling it my input clock is a different speed than it actually is

07:20 <azonenberg> then patch the constraints file for the correct frequency

07:20 <azonenberg> so it will do static timing at the correct freq but bypass what i think is probably a bug in the ip generator

07:24 <azonenberg> or even just manually edit the generated rtl wrapper that has the clock synthesis PLL to tweak the multiplier

07:26 <azonenberg> It looks like right now the PLL is taking 200 MHz, divide by 2 to get 100 MHz, multiply by 13 to get 1.3 GHz, then has outputs at 1/2, 1/4, 1/8, 1/64 of that (650, 325, 162.5, 20.3125 MHz)

07:27 <azonenberg> so i should be able to just bump that 13 to 16 to run at 1.6 GHz (the PLL VCO can go up to 1866 in -2 speed)

07:28 <azonenberg> and as long as i never regenerate the IP it will work
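
The clock arithmetic above can be checked with a few lines. The divider chain (divide by 2, multiply, then output dividers of 2, 4, 8, 64) and the 1866 MHz VCO ceiling are the figures quoted in the log; the function name is hypothetical, and this does not model anything the MIG generator itself checks.

```python
VCO_MAX_MHZ = 1866.0  # limit quoted above for the -2 speed grade


def pll_outputs(ref_mhz: float, div: int, mult: int, out_divs=(2, 4, 8, 64)):
    """Return (vco_mhz, [output_mhz, ...]) for a simple PLL divider chain."""
    vco = ref_mhz / div * mult
    return vco, [vco / d for d in out_divs]


for mult in (13, 16):
    vco, outs = pll_outputs(200.0, 2, mult)
    status = "ok" if vco <= VCO_MAX_MHZ else "exceeds VCO limit"
    print(f"mult={mult}: VCO {vco} MHz ({status}), outputs {outs}")
```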

07:28 <d1b2> <Darius> tried asking on their forums (or your FAE) about why it won't run at 800MT/s?

07:28 <d1b2> <Hardkrash> Setting the PLL manually seems better and rather straightforward.

07:28 <azonenberg> I have a post on the forum from over a week ago, no replies

07:29 <azonenberg> I do not currently have a xilinx FAE assigned

07:29 <azonenberg> https://support.xilinx.com/s/question/0D52E000078tRiNSAU/unable-to-configure-mig-for-ddr3l-800-sodimm-on-hr-banks-of-2-speed-kintex7-fbg484-package-in-vivado-20212?language=en_US

07:30 <azonenberg> and it gets a bit more complicated because i am seeing references to a MMCM elsewhere in the design

07:30 <azonenberg> so i have to make sure i get relationships between those right etc

07:30 <azonenberg> i'd rather fix it at the generation

07:36 <azonenberg> Digikey is an authorized distributor though so i might be able to get in touch with one through them

07:42 <d1b2> <Hardkrash> That just pushes up the bandwidth limit where this is encountered to 51.2Gbps. The compression allows for bursts that are well behaved and is a great idea, the down side is the pathological poor compression cases and getting it out the door is more of a priority.
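
For reference, the bandwidth figures in this discussion follow from simple arithmetic: a 64-bit SODIMM at a given transfer rate versus 16 channels at 5 Gbps each. This sketch computes only the theoretical peak; it ignores refresh, bank conflicts, the 17-in-16 flag overhead and block padding, so real headroom is lower.

```python
def peak_gbps(mt_per_s: float, bus_bits: int = 64) -> float:
    """Theoretical peak DRAM bandwidth in Gbps for a given transfer rate."""
    return mt_per_s * bus_bits / 1000.0


RAW_GBPS = 16 * 5.0  # 16 channels at 5 Gbps each, i.e. 80 Gbps

for rate in (650, 800):
    peak = peak_gbps(rate)
    ratio = RAW_GBPS / peak
    print(f"{rate} MT/s: peak {peak:.1f} Gbps, "
          f"average compression needed at least {ratio:.2f}:1")
```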

07:43 <azonenberg> Correct. And yeah, the intended use case of this board is sniffing things like SPI flash and eMMC

07:43 <azonenberg> it's unlikely anything i'd use it with would actually push toggles fast enough for this to be an issue, although really noisy slow risetime edges might be problematic during the transition region

07:43 <azonenberg> i'm just sampling at 5 Gsps to get more accurate timing

07:43 <azonenberg> and because i can :p

07:44 <d1b2> <Hardkrash> Every 14 words could have its entire compressed length stored every other frame 7.25 words per transaction with a length of data since start 😛

07:45 <d1b2> <Hardkrash> But that's another side distraction.

07:46 <azonenberg> and it's not just bursts, it's the reality that chip selects, write enables, etc tend to be slow

07:46 <d1b2> <Hardkrash> and if we set sampling to 2.5 then the whore bandwidth issue goes away for slow signals.

07:46 <d1b2> <Hardkrash> err. Whole

07:47 <d1b2> <Hardkrash> Other open drain signals would not be fun either.

07:47 <azonenberg> ehhhhh no. i still can't push the theoretical bw in linear writes

07:48 <azonenberg> the probe pods have some hysteresis also

07:48 <azonenberg> tunable, default may be a bit light

08:39 <d1b2> <mubes> More of a documentation thing than bug? Although if it’s documentation then it’s a usability problem too. Difficult for them to change field functionality that isn’t actually broken though.

11:07 bvernoux has joined #scopehal

14:47 <azonenberg> yeah i think it's a documentation error

14:48 <azonenberg> or more precisely, the docs say a command works when in reality it does not

15:39 <azonenberg> @mubes and any other siglent users here - please play with the function generator mode for SDS2000X+ when you get a chance

15:39 <azonenberg> let me know if you have any issues

17:35 <tnt> azonenberg: does the pinout you're using (and clock source) match the requirement for the "native mode IO" thing ?

17:35 <tnt> (wrt the 667 limit)

17:36 <tnt> nm, you're not on ultrascale, I'm dumb.

17:38 <azonenberg> yeah i'm on kintex7 -2. and it's not even letting me get to the point of picking pinouts

17:38 <azonenberg> it's on like the third page of the mig after i select memory type and target devices

17:39 <azonenberg> as soon as i say 800M it complains that i have to use HP banks

20:45 bvernoux has quit [Quit: Leaving]

23:49 Bird|ghosted has quit [Remote host closed the connection]