azonenberg changed the topic of #scopehal to: libscopehal, libscopeprotocols, and glscopeclient development and testing | https://github.com/glscopeclient/scopehal-apps | Logs: https://libera.irclog.whitequark.org/scopehal
<d1b2> <Darius> seems like a reasonable place to start
<d1b2> <Darius> if it becomes a problem then it can be looked at more later
Degi_ has joined #scopehal
Degi has quit [Ping timeout: 256 seconds]
Degi_ is now known as Degi
lethalbit has quit [*.net *.split]
lethalbit has joined #scopehal
<azonenberg> Sooo this is lovely
<azonenberg> apparently the ARWV Name command does not actually work
<azonenberg> i.e. you cannot do C1:ARWV Name,Cot
<azonenberg> or similar
<azonenberg> or C1:ARWV Name,ExpFal
<azonenberg> (cc: @mubes)
<azonenberg> What you have to do instead is C1:ARWV Index,10
<azonenberg> and first you have to do the STL? command to figure out the mapping of indexes to names
<azonenberg> lol
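A minimal Python sketch of the workaround described above: look the waveform up by name in an index table, then send the Index form of ARWV instead of the Name form. The function name and the table contents are hypothetical placeholders, not values read from a real instrument. In practice the table would be built from the STL? reply, whose format is not shown in this log.

```python
def arwv_select_command(channel: int, name: str, index_by_name: dict) -> str:
    """Build the index-based ARWV command for a named built-in waveform."""
    try:
        index = index_by_name[name]
    except KeyError:
        raise ValueError(f"waveform {name!r} is not in the index table") from None
    return f"C{channel}:ARWV Index,{index}"


# Hypothetical table: real indexes must come from the instrument's STL? reply.
example_table = {"ExpFal": 10, "Cot": 11}
command = arwv_select_command(1, "ExpFal", example_table)
```

Keeping the name-to-index lookup in one place means callers can keep using waveform names even though the instrument only accepts indexes.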
<d1b2> <Hardkrash> RE: The memory bandwidth limit. I would prefer error and return the partial data collected. There is value for the data captured after the trigger even if the bandwidth was saturated later. If compression crosses a high water mark could padding data mux in and replace the actual data into the stream and trigger a stop of the capture?
<azonenberg> so part of the problem is, and the reason i am leaning towards an abort
<azonenberg> is that it's likely you would overflow before you even encountered the trigger
<azonenberg> i.e. while trying to fill the pretrigger buffer
<d1b2> <Hardkrash> The high watermark might also be based on the memory bandwidth as opposed to a FIFO level. e.g. if the memory hits 98% bandwidth utilization trigger a graceful termination.
<d1b2> <Hardkrash> Ahh in that case a trigger failed and the data leading up to the failure is of less value
<azonenberg> and bw utilization is hard to benchmark
<azonenberg> basically, any given cycle the memory either is ready to accept a command or not
<azonenberg> if it's ready, and i have data to push, i send that data
<azonenberg> it's possible for me to have a burst of traffic then the memory to get busy and i don't lose anything
<azonenberg> but if a refresh happens at the wrong time the same burst could overflow
<d1b2> <Hardkrash> The case i was thinking of was post trigger compression failure.
<azonenberg> you can see with 2 bank machines we lose data in 2 of the fifos
<azonenberg> but we still "have" 3.6 Gbps worth of slots in which the ram is ready to accept a command and we have nothing for it to do
<azonenberg> since actual throughput is so dependent on access patterns it's very hard to determine how close you are to saturation
<azonenberg> with 4 vs 8 bank machines we lose no data, but with 4 machines we are available roughly 1/3 of the slots we're not actively writing
<azonenberg> while with 8 we're available 2/3 of the slots
<azonenberg> the other issue wrt overflows and sending partial data is that the compression is streaming
<azonenberg> and variable rate
<d1b2> <Hardkrash> is this on the host computer or the capture hardware?
<azonenberg> this is FPGA side
<azonenberg> and the only timestamp i will have is the *end* of the acquisition
<azonenberg> basically, as soon as i arm the capture i start shoving 80 Gbps into the compression blocks
<azonenberg> and write to ram in 16 separate circular buffers
<azonenberg> each one writing at its own rate depending on compressibility of the input
<azonenberg> when the acquisition is over, everything stops
<azonenberg> i will then (not yet implemented) read the fifos out in reverse order
<azonenberg> walking back until the start of the acquisition
<azonenberg> and then go forward again to send the data to the PC
<d1b2> <Hardkrash> Is there a catch on not being able to have the starting timestamp?
<azonenberg> I know when i armed the trigger
<azonenberg> i do not know a priori when $PRETRIGGER_DELAY samples before the trigger event was
<azonenberg> i have to wait until the trigger then back up
<azonenberg> without compression this is easy, you just go $MEMDEPTH samples back
<azonenberg> but with VBR compression, the only way to know how many compression blocks $MEMDEPTH samples ago was is to look at the decompressed length of every block
<azonenberg> I do have some unused bits in each 512-bit dram burst, it's possible i could store some kind of index in there to allow faster than linear search
<azonenberg> but the initial implementation will be linear
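The linear search described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the FPGA logic: the per-block decompressed lengths are given as a plain list, wraparound of the circular buffer is ignored, and the function name is hypothetical.

```python
def find_start_block(decompressed_lengths, end_index, memdepth):
    """Walk back from the last written block until memdepth samples are covered.

    decompressed_lengths: samples represented by each block, oldest first
    end_index: index of the final block of the acquisition
    Returns (start_index, samples_covered).
    """
    covered = 0
    index = end_index
    while index >= 0:
        # Each block's length is only known by inspecting that block.
        covered += decompressed_lengths[index]
        if covered >= memdepth:
            return index, covered
        index -= 1
    raise ValueError("buffer holds fewer than memdepth samples")


start, covered = find_start_block([16, 16, 100, 16, 200, 16], end_index=5, memdepth=300)
```

Every block between the end and the start has to be visited, so the cost grows with the number of blocks. A single missing block also leaves the running total undefined, because its decompressed length is unknown.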
<azonenberg> the point is, though, if you drop data in the fifo
<azonenberg> you've lost timestamps for all data prior to that point
<azonenberg> you have the samples but have no idea how much data was lost
<azonenberg> even if you count how many compression blocks were lost you don't know the decompressed length
<azonenberg> i feel like trying to track all of that would be a nightmare
<d1b2> <Hardkrash> yea, you would have to add some other journal style metadata that is robust that tracked the samples in each block
<d1b2> <Hardkrash> is the compressed block fully compressed before going into the FIFO?
<d1b2> <Hardkrash> or more of a stream?
<azonenberg> So, the raw data per channel is 5 Gbps, provided as 8 bits at 625 MHz
<azonenberg> I do some shuffling and toggling to convert this to 16 bits at 312.5 MHz
<azonenberg> (that initial processing path has a critical path that is literally one lut and barely makes timing, lol)
<azonenberg> it's all hand floorplanned
<azonenberg> the 16 bit path at 312.5 MHz is a biiit more forgiving. That's what does the actual compression
<azonenberg> But i still have to keep my paths short and heavily pipelined
<azonenberg> The current logic is a 3 stage pipeline
<azonenberg> each 16 bit block is turned into zero or one 17-bit blocks
<azonenberg> either 1'b0, original block verbatim
<azonenberg> (if not compressible)
<azonenberg> or 1'b1 followed by two 8-bit RLE codes
<azonenberg> each one consisting of a bit and a 7-bit repetition count from 0 to 127
<azonenberg> (zero is a legal repetition count used at the end of the stream among other things)
<azonenberg> as of now, the compressor only supports compressing blocks with a single toggle within the 16-bit window
<azonenberg> or no toggles at all
<azonenberg> theoretically it's possible for some blocks with 2 toggles to be compressed by appending to a previous block and emitting a new block
<azonenberg> but that will complicate the logic more than i wanted to do at this point
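A simplified Python model of the 17-bit word format described above. Several details are assumptions made for illustration only: the flag is placed in bit 16, the first RLE code sits in bits 15:8, each code holds the value in its top bit, and samples are ordered least-significant bit first. The sketch is also stateless, so every 16-bit block yields exactly one word. The real pipeline can emit no word for a block, which is not modelled here.

```python
def block_bits(block):
    """Samples of a 16-bit block, least-significant bit first (assumed order)."""
    return [(block >> i) & 1 for i in range(16)]


def encode_block(block):
    """Encode one 16-bit block as a 17-bit word."""
    bits = block_bits(block)
    toggles = [i for i in range(1, 16) if bits[i] != bits[i - 1]]
    if len(toggles) > 1:
        return block  # flag bit 0: original block verbatim
    first_len = toggles[0] if toggles else 16
    code1 = (bits[0] << 7) | first_len           # value bit + 7-bit count
    code2 = (bits[15] << 7) | (16 - first_len)   # count may be zero
    return (1 << 16) | (code1 << 8) | code2


def decode_word(word):
    """Expand a 17-bit word back into a list of samples."""
    if not (word >> 16) & 1:
        return block_bits(word & 0xFFFF)
    samples = []
    for code in ((word >> 8) & 0xFF, word & 0xFF):
        samples += [code >> 7] * (code & 0x7F)
    return samples
```

For example, `decode_word(encode_block(0x00FF))` gives eight ones followed by eight zeros, while a block with several toggles such as `0xA5A5` passes through verbatim with the flag bit clear.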
<d1b2> <Hardkrash> is the output of the compression a 512bit burst or a collection of bursts?
<azonenberg> The output of the compression is a stream of 17 bit words and a valid bit at 312.5 MHz
<d1b2> <Hardkrash> and these are what would be dropped on the floor in the FIFO
<azonenberg> Not quite
<azonenberg> I take the 17 bit words and push them into a temporary working buffer made out of DFFs
<azonenberg> when i have 7 of them, i pad out to 128 bits with nine zero bits
<azonenberg> This then goes into a CDC FIFO between the capture domain and a similar-rate (i think 325 MHz) clock derived from the ram controller clock
<azonenberg> This is the buffer that drops
<azonenberg> output of that buffer then goes into an arbiter within the logic pod controller to mix the 8 128-bit streams down to one
<azonenberg> and does so in bursts of 4 words
<azonenberg> so every free cycle the arbiter picks one of the 8 fifos, then over the next 4 clocks pops 4 words from it and sends into a single output fifo
<azonenberg> (this is only done when the output fifo has space, so it cannot drop there)
<azonenberg> at this stage, i also assign dram addresses to the stream of words by concatenating the fifo base address in physical memory with the fifo pointer for that channel
<azonenberg> Then there is a second stage arbiter that pops 1x addr + 4x 128b data bursts from this buffer (choosing between the two pod subsystems, or other logic i have yet to build elsewhere in the system)
<azonenberg> reshuffles the data again from 4x 128b to 2x 256b
<azonenberg> clock domain shifts yet again
<azonenberg> and then that goes into the xilinx ddr3 controller
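The address assignment mentioned above, concatenating a per-channel base address with that channel's FIFO pointer, can be sketched in Python. The pointer width and the idea that the pointer counts whole bursts are assumptions chosen for illustration. The log does not give either.

```python
PTR_BITS = 20                 # assumed pointer width, not taken from the log
PTR_MASK = (1 << PTR_BITS) - 1


def burst_address(fifo_base, fifo_ptr):
    """Concatenate the channel's base (upper bits) with its pointer (lower bits)."""
    return (fifo_base << PTR_BITS) | (fifo_ptr & PTR_MASK)


def advance(fifo_ptr):
    """Step the pointer by one burst, wrapping so the buffer is circular."""
    return (fifo_ptr + 1) & PTR_MASK


address = burst_address(fifo_base=3, fifo_ptr=5)
```

Because the pointer only occupies the low bits, wrapping it keeps each channel inside its own region without any comparison against a region limit.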
<d1b2> <Hardkrash> Ok i think i have it, curious of the performance impact from dropping down to 6x words per buffer and adding a counter
<d1b2> <Hardkrash> or a counter intermixed every nth packet
<azonenberg> yeah there's lots of possibilities. more a question of if the benefits are worth it
<azonenberg> note that anything we add will increase the bw required to move the same amount of data
<azonenberg> making overflows more likely
<azonenberg> in fact i am thinking of striping one more word out across the 9 padding bits
<azonenberg> so instead of having 7 words per block i have 7.5
<azonenberg> then i waste an average of one bit every 256 instead of 18
<d1b2> <Hardkrash> the alternative is the compressor would take 15 bits in and then it would not create the remainder issue.
<_whitenotifier-e> [scopehal] azonenberg pushed 1 commit to master [+0/-0/±3] https://github.com/glscopeclient/scopehal/compare/2a42a0439f70...a99f4b48c287
<_whitenotifier-e> [scopehal] azonenberg a99f4b4 - SiglentSCPIOscilloscope: initial support for function generator option. Should easily port to SDG series signal generators in the future. Fixes #581.
<_whitenotifier-e> [scopehal] azonenberg closed issue #581: Siglent: Support AWG - https://github.com/glscopeclient/scopehal/issues/581
<_whitenotifier-e> [scopehal-apps] azonenberg pushed 1 commit to master [+0/-0/±2] https://github.com/glscopeclient/scopehal-apps/compare/84e7befae689...48a47e8ee659
<_whitenotifier-e> [scopehal-apps] azonenberg 48a47e8 - FunctionGeneratorDialog: added new predefined waveforms for Siglent AWG
<d1b2> <Hardkrash> You lose the 0-127 on both positions in the compression block,
<azonenberg> That's not possible
<azonenberg> the sampling hardware generates 16 bits
<azonenberg> processing 15 would mean creating a new clock domain at 16/15 of the sampling frequency and having a messy asynchronous gearbox
<azonenberg> could be done in theory, but would be a ton of area and probably be very hard to make timing in the fpga
<d1b2> <Hardkrash> i take it that the striping at the 7 -> 7.5 word is happening in a much lower speed thus more practical.
<azonenberg> Well, 7 -> 7.5 is a simple 1:2 split
<azonenberg> i just have a toggle register to record if i'm even or odd
<azonenberg> if even, store 7 words then bits 16:8 of the 8th
<azonenberg> if odd, store bits 7:0 of the previous word and then 7 more words
<azonenberg> it's still gearboxing but far simpler than 16/15 which is a really awkward ratio
<azonenberg> which means more complex muxing
<azonenberg> i did a 72 -> 64 bit gearbox for a customer once. that wasn't too bad because they are both multiples of 8
<azonenberg> but 15 and 16 are relatively prime
<d1b2> <Hardkrash> agreed that 15 to 16 is impractical.
<azonenberg> so it's not that it's lower speed, it's that it is 2:1 gearboxing of a fairly small amount of data within a single clock domain
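A Python sketch of the 7 -> 7.5 word striping discussed above, packing fifteen 17-bit words into two 128-bit blocks. With a 17-bit word indexed 16:0, the even block carries seven words plus the upper nine bits (16:8) of the split word, and the odd block carries its lower eight bits (7:0) plus seven more words, leaving one spare bit. The ordering of fields inside each 128-bit block is an assumption made for illustration.

```python
WORD_BITS = 17
WORD_MASK = (1 << WORD_BITS) - 1


def pack_pair(words):
    """Pack 15 words into (even_block, odd_block), each a 128-bit integer."""
    if len(words) != 15:
        raise ValueError("expected exactly 15 words")
    split = words[7]
    even = 0
    for word in words[:7]:
        even = (even << WORD_BITS) | word
    even = (even << 9) | (split >> 8)      # 7*17 + 9 = 128 bits
    odd = split & 0xFF
    for word in words[8:]:
        odd = (odd << WORD_BITS) | word    # 8 + 7*17 = 127 bits
    return even, odd << 1                  # one spare bit per 256


def unpack_pair(even, odd):
    """Inverse of pack_pair."""
    split_hi = even & 0x1FF
    even >>= 9
    odd >>= 1
    first, last = [], []
    for _ in range(7):
        first.append(even & WORD_MASK)
        even >>= WORD_BITS
        last.append(odd & WORD_MASK)
        odd >>= WORD_BITS
    split = (split_hi << 8) | (odd & 0xFF)
    return first[::-1] + [split] + last[::-1]
```

Two blocks then hold 255 payload bits out of 256, compared with 238 out of 256 when each block holds seven words and nine zero bits.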
<d1b2> <Hardkrash> Just had a left field idea... ECC DDR ram? 😛
<_whitenotifier-e> [scopehal] azonenberg pushed 1 commit to master [+0/-0/±1] https://github.com/glscopeclient/scopehal/compare/a99f4b48c287...bd66583b934d
<_whitenotifier-e> [scopehal] azonenberg bd66583 - Disabled "negative pulse" waveform as does not seem to work in hardware despite docs
<_whitenotifier-e> [scopehal-apps] azonenberg pushed 1 commit to master [+0/-0/±1] https://github.com/glscopeclient/scopehal-apps/compare/48a47e8ee659...854fa5781b9f
<_whitenotifier-e> [scopehal-apps] azonenberg 854fa57 - Updated to latest scopehal
<azonenberg> @hardkrash: board already exists
<azonenberg> has a 64 bit sodimm, i'm stuck with it
<azonenberg> the dram is currently clocked at 650 MT/s (325 MHz) because that's the closest to 667 MT/s (333 MHz) that i can get with the oscillator on the board
<azonenberg> i have a 200 MHz osc because i had intended to run at 800 MT/s (400 MHz), which the kintex-7 datasheet says you can do for 1.35V DDR3 in a HR bank on a kintex7 -2
<azonenberg> however the xilinx IP is refusing to let me select anything faster than 667
<d1b2> <Hardkrash> That's unfortunate
<azonenberg> one of the options i'm exploring is to generate the ip telling it my input clock is a different speed than it actually is
<azonenberg> then patch the constraints file for the correct frequency
<azonenberg> so it will do static timing at the correct freq but bypass what i think is probably a bug in the ip generator
<azonenberg> or even just manually edit the generated rtl wrapper that has the clock synthesis PLL to tweak the multiplier
<azonenberg> It looks like right now the PLL is taking 200 MHz, divide by 2 to get 100 MHz, multiply by 13 to get 1.3 GHz, then has outputs at 1/2, 1/4, 1/8, 1/64 of that (650, 325, 162.5, 20.3125 MHz)
<azonenberg> so i should be able to just bump that 13 to 16 to run at 1.6 GHz (the PLL VCO can go up to 1866 in -2 speed)
<azonenberg> and as long as i never regenerate the IP it will work
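The PLL arithmetic above can be checked with a few lines of Python. The numbers come from the log (200 MHz reference, divide by 2, output dividers 2, 4, 8 and 64, VCO limit 1866 MHz for the -2 speed grade). The function name is hypothetical and the sketch says nothing about whether the patched configuration actually meets timing.

```python
VCO_MAX_MHZ = 1866  # limit quoted in the log for the -2 speed grade


def pll_outputs_mhz(ref_mhz, divide, multiply, output_dividers=(2, 4, 8, 64)):
    """Return (vco_mhz, [output_mhz, ...]) for a simple divide/multiply PLL."""
    vco = ref_mhz / divide * multiply
    if vco > VCO_MAX_MHZ:
        raise ValueError(f"VCO at {vco} MHz exceeds {VCO_MAX_MHZ} MHz")
    return vco, [vco / d for d in output_dividers]


current_vco, current_outputs = pll_outputs_mhz(200, 2, 13)
patched_vco, patched_outputs = pll_outputs_mhz(200, 2, 16)
```

With a multiplier of 13 the outputs are 650, 325, 162.5 and 20.3125 MHz. With 16 the VCO runs at 1600 MHz, under the quoted limit, and the outputs become 800, 400, 200 and 25 MHz.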
<d1b2> <Darius> tried asking on their forums (or your FAE) about why it won't run at 800MT/s?
<d1b2> <Hardkrash> Setting the PLL manually seems better and rather straightforward.
<azonenberg> I have a post on the forum from over a week ago, no replies
<azonenberg> I do not currently have a xilinx FAE assigned
<azonenberg> and it gets a bit more complicated because i am seeing references to a MMCM elsewhere in the design
<azonenberg> so i have to make sure i get relationships between those right etc
<azonenberg> i'd rather fix it at the generation
<azonenberg> Digikey is an authorized distributor though so i might be able to get in touch with one through them
<d1b2> <Hardkrash> That just pushes up the bandwidth limit where this is encountered to 51.2Gbps. The compression allows for bursts that are well behaved and is a great idea, the down side is the pathological poor compression cases and getting it out the door is more of a priority.
<azonenberg> Correct. And yeah, the intended use case of this board is sniffing things like SPI flash and eMMC
<azonenberg> it's unlikely anything i'd use it with would actually push toggles fast enough for this to be an issue, although really noisy slow risetime edges might be problematic during the transition region
<azonenberg> i'm just sampling at 5 Gsps to get more accurate timing
<azonenberg> and because i can :p
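For reference, the bandwidth figures in this discussion follow from simple arithmetic, sketched here in Python. The bus width, transfer rates and channel rate are taken from the log. The results are theoretical peaks only, which the real controller does not reach because throughput depends on access patterns.

```python
def peak_dram_mbps(bus_bits, mega_transfers_per_s):
    """Theoretical peak DRAM bandwidth in Mbps."""
    return bus_bits * mega_transfers_per_s


raw_capture_mbps = 16 * 5000              # 16 channels at 5 Gbps each
current_peak = peak_dram_mbps(64, 650)    # 64-bit SODIMM at 650 MT/s
target_peak = peak_dram_mbps(64, 800)     # same bus at 800 MT/s

# Average compression ratio needed just to fit under the theoretical peak.
min_ratio_current = raw_capture_mbps / current_peak
min_ratio_target = raw_capture_mbps / target_peak
```

This gives 41.6 Gbps today and 51.2 Gbps at 800 MT/s against 80 Gbps of raw samples, so even at the higher rate the stream must compress by at least about 1.56:1 on average.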
<d1b2> <Hardkrash> Every 14 words could have its entire compressed length stored every other frame 7.25 words per transaction with a length of data since start 😛
<d1b2> <Hardkrash> But that's another side distraction.
<azonenberg> and it's not just bursts, it's the reality that chip selects, write enables, etc tend to be slow
<d1b2> <Hardkrash> and if we set sampling to 2.5 then the whore bandwidth issue goes away for slow signals.
<d1b2> <Hardkrash> err. Whole
<d1b2> <Hardkrash> Other open drain signals would not be fun either.
<azonenberg> ehhhhh no. i still cant push the theoretical bw in linear writes
<azonenberg> the probe pods have some hysteresis also
<azonenberg> tunable, default may be a bit light
<d1b2> <mubes> More of a documentation thing than bug? Although if it’s documentation then it’s a usability problem too. Difficult for them to change field functionality that isn’t actually broken though.
bvernoux has joined #scopehal
<azonenberg> yeah i think it's a documentation error
<azonenberg> or more precisely, the docs say a command works when in reality it does not
<azonenberg> @mubes and any other siglent users here - please play with the function generator mode for SDS2000X+ when you get a chance
<azonenberg> let me know if you have any issues
<tnt> azonenberg: does the pinout you're using (and clock source) match the requirement for the "native mode IO" thing ?
<tnt> (wrt the 667 limit)
<tnt> nm, you're not on ultrascale, I'm dumb.
<azonenberg> yeah i'm on kintex7 -2. and it's not even letting me get to the point of picking pinouts
<azonenberg> it's on like the third page of the mig after i select memory type and target devices
<azonenberg> as soon as i say 800M it complains that i have to use HP banks
bvernoux has quit [Quit: Leaving]
Bird|ghosted has quit [Remote host closed the connection]