<whitequark[cis]>
you call it once in your software
<mcc111[m]>
<whitequark[cis]> "nope" <- darn
<mcc111[m]>
Looks like the programming method is "libero soc", which has a nightmarish-looking licensing scheme (pay per-seat per-year, no option for an indefinite license)
<mcc111[m]>
galibert[m]: in my current case, games/graphics (so I'm not concerned with adversarial/cryptographic cases)
<whitequark[cis]>
if they really didn't want you to use it they would not let you download it
<mcc111[m]>
hm ok. well at least they have a linux version
<galibert[m]>
It's a very nice card but I'm not sure what I could do with it, it's annoying
<galibert[m]>
E141, you can't just toss it into the "nice toy" pile
<galibert[m]>
The wife will disagree :-)
<mcc111[m]>
galibert[m]: A long-term daydream I've had for a while is "run linux on a hard riscv core, have an fpga attached i can write a very simple gpu in". but i assume 23k 4luts probably isn't enough for that without additional hardware.
<whitequark[cis]>
23k 4luts is barely enough for a single high performance CPU core
<galibert[m]>
beware that I suspect the #1 problem with writing a "very simple gpu" is that you need a cpu-ish core for shaders (and if possible a bunch of them)
<galibert[m]>
it has to have an isa that's a good fit for shaders, and you of course need to write the compiler for it
<galibert[m]>
so nowadays very simple is already not that simple
<mcc111[m]>
what if i simply wrote a very _bad_ softgpu
<whitequark[cis]>
i mean you don't have to have shaders in your gpu
<galibert[m]>
you don't have to, but then you won't run anything created after opengl 1.1
<whitequark[cis]>
I don't think you'll be able to design any GPU that runs regular software with any FPGA you can afford
<galibert[m]>
Catherine: hey, if you go single-gpu-execution-core and seconds-per-frame you maybe can
<whitequark[cis]>
that's not "runs" really
RobTaylor[m] has joined #amaranth-lang
<RobTaylor[m]>
One option is to use the N64 gpu implementation. There are opengl1 implementations for that
<galibert[m]>
at that point you go riscv core and llvmpipe :-)
<mcc111[m]>
My theory is a very bad gpu, or one that does nothing but accelerate wayland compositing slightly, would at least be a proof of concept, and if someone (me) did the proof of concept of wiring vulkan-on-fpga into a linux kernel (i can believe i could do this more easily than i could believe i could write a gpu in gateware), maybe someone who's better at gateware than me or has a better fpga than me could write a better gpu someday.
<mcc111[m]>
But maybe it is possible that no consumer-accessible fpga will ever run a "good enough" gpu under any circumstances.
<galibert[m]>
grab a de10-nano and you'll have a dual-core arm handy that boots on an SD card and can run a full linux. Then you can see how far you manage to go
<mcc111[m]>
galibert[m]: hm. i guess i gotta do this anyway if i ever want to support mister with my pocket cores…
<RobTaylor[m]>
mcc111[m]: Should be possible to do a compositing video accelerator for sure
<whitequark[cis]>
mcc111[m]: I mean a consumer can buy an FPGA like that, you just need to dump 5-10k$ on it
<whitequark[cis]>
which consumers do pay for gaming rigs! so it's consumer accessible. I just assume you don't have that much money to put into a Xilinx devboard
<galibert[m]>
arria V is like 3K, don't exaggerate :-)
<whitequark[cis]>
I was thinking of something like kcu105, which would be at the low end of what you'd use to make a shader-enabled GPU
<whitequark[cis]>
or htg960, which would be at the high end (~30k)
<mcc111[m]>
whitequark[cis]: hm, i see. hard to imagine a situation where you convince someone to do that rather than buy an amd gpu
<whitequark[cis]>
another benefit of an AMD GPU: it doesn't involve export control paperwork :D
<galibert[m]>
when it comes to bang-for-buck, it's kinda obvious that fpgas suck at cpu or gpu
<whitequark[cis]>
though this might change soon, what with the AI restrictions
<mcc111[m]>
i've been describing the idea kind of backward here, but my ultimate goal is a proof of concept of "could you make a desktop pc with an entirely open hardware stack" or "could the MNT Reform RKX7 ever be useful"
<galibert[m]>
you'll have an easier time doing a MPW cpu and/or gpu
<whitequark[cis]>
for XC7K325T, you would probably be able to use partial reconfig to translate your shaders to FPGA logic
<whitequark[cis]>
sort of like how Glasgow works
<whitequark[cis]>
this will be a much better option than trying to shoehorn a full shader-enabled GPU into an FPGA, and give you way better performance too, especially for simple applications
<mcc111[m]>
mcc111[m]: Or maybe "useful" is asking too much, maybe just "something other than a toy"
<whitequark[cis]>
i.e. you dedicate e.g. a quarter of your FPGA to the CPU, and use other three quarters as shader accelerators, dynamically translated and executed
<galibert[m]>
Catherine: that sounds fun
<mcc111[m]>
whitequark[cis]: Sure, "compile shaders to gateware" is very interesting to me, although I don't know if you can mesh it with Vulkan without some extensions
<whitequark[cis]>
of course this requires running Vivado, so in practice you will rely on a shader cache, populated by spending hours in Vivado somewhere off the Reform (which cannot run Vivado)
<whitequark[cis]>
mcc111[m]: you can definitely compile normal Vulkan shaders to HDL
<galibert[m]>
also, fwiw, as far as I know, the fpga in the de10-nano can do partial reconfig (even if I haven't tried yet)
<mcc111[m]>
galibert[m]: What is an mpw?
<whitequark[cis]>
multi project wafer
<galibert[m]>
I need to RE the partial reconfig at some point
<whitequark[cis]>
but also "easier" is ... a way to describe it
<galibert[m]>
mcc111: multiple IC projects put together that share a wafer, which reduces costs a lot
<galibert[m]>
Catherine: well, from what I understand "easier" is exactly what your employer is about :-)
<whitequark[cis]>
that is correct
<whitequark[cis]>
although I don't know when we'll offer something useful to mcc111 specifically for this project
<galibert[m]>
heh
<mcc111[m]>
whitequark[cis]: In opengl world, it was possible with extensions to precompile all of the shaders to the gpu's machine language and load those shaders at runtime, skipping the shader compile step completely. Video game consoles do something like this. I think this was what I was envisioning: that the "precompiled shader" shipped with the software is a netlist. though I don't know if this can be meshed with any modern API like WebGPU or Vulkan.
<whitequark[cis]>
Steam definitely ships precompiled shaders for Vulkan games
<whitequark[cis]>
I think it's standard in Vulkan?
<galibert[m]>
there are at least two levels of compilation in vulkan, it's fun
<galibert[m]>
whatever (usually glsl 4.mumble) to spirv, and then spirv to the gpu isa
<galibert[m]>
there are recent-ish extensions to make it possible to grab the gpu isa objects and save/reload it
<galibert[m]>
to avoid compilation time
<mcc111[m]>
whitequark[cis]: It seems to me that if "small batch IC manufacturing" (which is what I interpret this as making possible?) becomes a thing in the next 15 years, the first step to designing that IC probably involves a toy example running HDL code on an FPGA
<galibert[m]>
glsl -> spirv was usually already done ahead of time
<mcc111[m]>
galibert[m]: Cool
<whitequark[cis]>
mcc111[m]: generally yes, this is the current industry standard
<mcc111[m]>
So I've noticed, looking at other HDLs, a number of them (PipelineC, Filament HDL) follow this "pipelining" concept where algorithms are written in a linear, basically-imperative way, and the HDL splits your code across cycles, automatically overlaps statements which could overlap and produces circuits that perform your code in the minimum reasonable number of cycles
<mcc111[m]>
Amaranth doesn't have anything like this currently, does it? I seem to find myself "manually" pipelining by like having a sequence of bits which sync-activate the next bit on the chain on the next cycle
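(For concreteness, a minimal sketch of the manual "valid-bit chain" pipelining described above, assuming current Amaranth syntax; the per-stage arithmetic is a placeholder:)

```python
from amaranth import Elaboratable, Module, Signal

class ManualPipeline(Elaboratable):
    """Three-stage hand-pipelined datapath: each stage registers its
    result, and a chained valid bit marks which stage holds live data."""
    def __init__(self):
        self.i       = Signal(16)
        self.i_valid = Signal()
        self.o       = Signal(16)
        self.o_valid = Signal()

    def elaborate(self, platform):
        m = Module()
        s1, s2 = Signal(16), Signal(16)
        v1, v2 = Signal(), Signal()
        m.d.sync += [
            s1.eq(self.i + 1),    # stage 1 (placeholder op)
            v1.eq(self.i_valid),  # the valid bit follows the data
            s2.eq(s1 ^ 0x5A5A),   # stage 2
            v2.eq(v1),
            self.o.eq(s2 + s2),   # stage 3
            self.o_valid.eq(v2),
        ]
        return m
```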
<mcc111[m]>
Would this be something that could feasibly be layered on top of amaranth (either in the standard library or in an addon library)?
<cr1901>
You can put stages of ffs at the end of your algorithm, and ask the synthesizer to place the ffs at the locations which optimize timing (as long as "whatever's attached to the output of your algorithm" can't tell the difference). This is what retiming passes can do (called Hyperpipelining in Altera/Intel world). abc in yosys has retiming passes, but I've not had much luck playing with them.
<cr1901>
But "putting stages of ffs at the outputs of your algorithm and asking the synthesizer to place the ffs in better locations" would be a cheap way to get autopipelining.
<cr1901>
I've used PipelineC for toy multiplier experiments; it's pretty cool how it adds pipeline register stages one stage at a time. But Idk how PipelineC decides where to place FFs as it adds consecutive stages.
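(A sketch of the "FF stages at the output" trick cr1901 describes, assuming Amaranth; `add_retime_stages` is a hypothetical helper, and whether the flops actually migrate into the comb cloud depends on the synthesizer's retiming support discussed below:)

```python
from amaranth import Module, Signal

def add_retime_stages(m, comb_result, stages=3, name="out"):
    # Append `stages` flip-flops after a combinational result. A
    # retiming-capable synthesizer may then pull these registers back
    # into the comb cloud to balance delay; downstream logic only ever
    # sees the final flop, so it can't tell where the intermediate
    # stages ended up. Adds `stages` cycles of latency.
    last = comb_result
    for i in range(stages):
        r = Signal.like(comb_result, name=f"{name}_r{i}")
        m.d.sync += r.eq(last)
        last = r
    return last

m = Module()
a, b = Signal(8), Signal(8)
product = add_retime_stages(m, a * b, stages=2)  # 2-cycle multiplier
```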
<mcc111[m]>
By FF you mean, FFSynchronizer and AsyncFFSynchronizer specifically?
<cr1901>
No, I mean plain old flip-flops
<whitequark[cis]>
<mcc111[m]> "So I've noticed, looking at..." <- > <@mcc111:matrix.org> So I've noticed, looking at other HDLs, a number of them (PipelineC, Filament HDL) follow this "pipelining" concept where algorithms are written in a linear, basically-imperative way, and the HDL splits your code across cycles, automatically... (full message at <https://catircservices.org/_matrix/media/v3/download/catircservices.org/AWhahySuLzMpLsZSTCFCXVuF>)
<cr1901>
(I can explain this much better with a picture)
<whitequark[cis]>
FFSynchronizer is specifically for clock domain crossing, you should not be using it for anything else
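(For reference, the intended use of FFSynchronizer, a minimal sketch: bringing an asynchronous input into the "sync" domain.)

```python
from amaranth import Module, Signal
from amaranth.lib.cdc import FFSynchronizer

m = Module()
button_raw  = Signal()  # asynchronous input, e.g. straight from a pin
button_sync = Signal()  # safe to use in the "sync" clock domain
# Two chained flops (the default) to resolve metastability:
m.submodules.button_cdc = FFSynchronizer(button_raw, button_sync, o_domain="sync")
```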
<mcc111[m]>
cr1901: oh
<mcc111[m]>
whitequark[cis]: Hm. The fact Amaranth is already an eDSL makes it seem like building a language on top of Amaranth could be done in a pretty user-non-disruptive way.
<cr1901>
retiming will move the "FF stage 1" FFs into the comb logic cloud
<mcc111[m]>
I'm not sure I understand. You mean, this is an optimization pass FPGA compilers can do?
<cr1901>
yes
<cr1901>
pipelineC goes a bit further and will add "FF stage 2,3,4,etc" in between the comb cloud and FF out _and_ "somehow" (Idk the details) move those FFs into the comb cloud to optimize timing.
<adamgreig[m]>
yes, though I don't think yosys does much or any of it, but it's common in xilinx/altera stuff
<cr1901>
As long as the rest of your logic only interacts w/ the "FF out" flip-flops, the rest of your design doesn't care how the FF stage 1-n are placed inside your design
<mcc111[m]>
cr1901: I see. And the FPGA compiler, when it does the retiming, is smart enough to not retime things in a way that would break the tree of data dependencies?
<mcc111[m]>
adamgreig[m]: So if you compile pipelineC using yosys, does that imply it will "not work" as intended or will it work because pipelineC is itself doing the retiming, not the fpga compiler?
<adamgreig[m]>
if you have like a complicated comb expression and then three successive registers, the compiler can do an as-if optimisation by moving two of the registers to some point inside the comb expression; the input/output is identical but the longest path between registers got smaller so timing is better
<cr1901>
If the rest of your design only depends on COMB INPUTs and FF Outs, then the compiler can do what it wants with everything in between w/o breaking your tree of data deps
<adamgreig[m]>
don't know about pipelineC, or much about yosys+retiming, but it sounds like pipelineC already inserts the registers at sensible places, without relying on synthesis doing retiming
<cr1901>
(If you're reading the thread, --out-of-context mode for nextpnr is required to prevent the following scenario: https://twitter.com/cr1901/status/1570936521783676928, where nextpnr maximizes the clock frequency but severely penalizes the path from I/O pins to the registers)
<cr1901>
Galaxy brain: "i/o to clocked FFs is an asynchronous domain right? So we don't need to include that critical path in Fmax for the domain"
ravenslofty[m] has joined #amaranth-lang
<ravenslofty[m]>
Yosys flows on FPGAs do not perform register retiming, because ABC's support for register retiming is either:
<ravenslofty[m]>
- poor on ABC flows, because Yosys must partition the design based on flop control sets, and ABC can only place flops around LUTs
<ravenslofty[m]>
- nonexistent on ABC9 flows (that people should really be using), because while ABC9 is "aware" of flop control sets, Alan would rather do an integrated mapping/retiming command than do mapping then retiming like in ABC
<adamgreig[m]>
ravenslofty: thanks, i thought that might be the case but wasn't really sure
<cr1901>
I don't know what a flop control set is
<adamgreig[m]>
does that mean abc9 might eventually learn to be good at retiming?
<ravenslofty[m]>
cr1901: please don't tell people to use out of context mode for...basically anything that isn't the world's most hypothetical what-if scenario
<galibert[m]>
what-if... the quartus file formats were sane?
<cr1901>
I mean, PipelineC uses it :D
<ravenslofty[m]>
adamgreig: make no mistake, retiming is a Hard Problem, but on top of the improved performance you can get from ABC9, Alan thinks a combined mapper/retimer can reduce clock period by 20%
<ravenslofty[m]>
cr1901: by "flop control set" I mean the set of {clock, enable, async reset}, that kind of thing
<ravenslofty[m]>
galibert: crazy talk; Intel could never comprehend such a thing
<cr1901>
Why do partitions based on flop control sets make retiming poor? Does this have to do with the fact that yosys feeds abc your design in pieces (thus fewer opportunities to move flops around while LUTs are being created from the internal cell lib)?
<whitequark[cis]>
<mcc111[m]> "Hm. The fact Amaranth is already..." <- yes
<adamgreig[m]>
ravenslofty: yea, I totally appreciate it being a really hard problem! I think I mostly see retiming as a synthesis convenience to let the author just litter FFs at the end and hope for the best so I've never been too worried about it :P
<whitequark[cis]>
if people should really be using ABC9 flows why are they still not the default?
<whitequark[cis]>
my understanding is that there were significant enough issues with ABC9 that they are still considered strictly experimental
<adamgreig[m]>
huh, I've been using abc9 for everything for ages without apparent issues
<ravenslofty[m]>
cr1901: Retiming works based on knowing what the lowest global clock period is; feeding the design in pieces like that means it might come up with different clock periods depending on what the bits of HDL it gets are.
<cr1901>
Is my understanding correct that yosys feeds the design in in pieces?
<ravenslofty[m]>
Catherine: they're the default in the synth scripts I've written, e.g. `synth_intel_alm` (and `synth_quicklogic`). Yes, there are a few bugs with `&mfs` and `&dc2`, I won't claim otherwise; but the `&mfs2` issues get handled resiliently by Yosys and the `&dc2` bug is extremely rare. Conversely, the reward in improved synthesis quality is pretty major, and I personally am tired of dealing with "Yosys eats all my FPGA" messages
<ravenslofty[m]>
that I have to reply to with "please use `-abc9`"
<whitequark[cis]>
what is the impact of the `&mfs` and `&dc2` bugs? (I love ABC command names, they're so descriptive)
<ravenslofty[m]>
cr1901: not exactly. it will under certain conditions (an ABC flow with -dff), but ABC flows that are entirely combinational and ABC9 flows do not get partitioned.
<whitequark[cis]>
I remember once discussing whether -abc9 should be the default in Amaranth and the answer at the time was "conclusively no"
<whitequark[cis]>
at which time I've decided that the ABC9 flow will become the default in Amaranth whenever it becomes the default in Yosys
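(In the meantime, opting in per-build should work without waiting for a default change: Amaranth platforms accept a `synth_opts` override that is forwarded to the Yosys synth pass. A minimal sketch, assuming an ICE40 board from amaranth-boards; `MyTop` is a placeholder design:)

```python
from amaranth_boards.icebreaker import ICEBreakerPlatform

# `synth_opts` is forwarded to `synth_ice40`; "-abc9" opts in to the
# ABC9 mapping flow discussed above.
ICEBreakerPlatform().build(MyTop(), synth_opts="-abc9")
```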
<ravenslofty[m]>
they both cause ABC to crash, however `&mfs` is a post-mapping improvement pass, so Yosys writes the design, calls `&mfs`, and writes the design again; if `&mfs` crashes you suffer from worse QoR but synthesis continues. `&dc2` crashing is a show-stopper, if it happens.
<ravenslofty[m]>
roughly: `&mfs` takes a design that is mapped to LUTs and uses a SAT solver to find don't-care terms to minimise LUTs.
<cr1901>
In synth_ice40 "abc -dff -D 1 (only if -retime)". So... the design is partitioned only if retiming is enabled. But... partitioning the design is the thing that causes retiming to be suboptimal in the first place. I think I'm confused.
<ravenslofty[m]>
and `&dc2` is a pre-mapping fine-grained heavy optimisation pass
<cr1901>
(Or I'm asking the wrong question, and a better question might be "why does yosys force partitioning if retiming is enabled, when that partitioning makes retiming suboptimal?")
<ravenslofty[m]>
because it is needed for retiming to be correct.
<cr1901>
Why? (asking sincerely/not being an ass, I promise)
<ravenslofty[m]>
ABC assumes that all flops it has are compatible; that is, any flop [plus or minus async controls] can be moved and the resulting netlist is sequentially correct.
<ravenslofty[m]>
*sequentially equivalent
<cr1901>
ahhh okay. So why doesn't everything break later when abc is called again in map_luts ("abc -dress -lut 4 (skip if -noabc)")? abc is being fed all flops at this point regardless of control set, correct?
<ravenslofty[m]>
not correct: ABC is being fed no flops at that point.
<cr1901>
or does the abc passes called by this command leave the flops alone and only focus on the comb logic?
<cr1901>
ahhh
<cr1901>
fair enough, thanks :D
<cr1901>
(Don't feed ABC flops after midnight)
<ravenslofty[m]>
(or do, but make sure you use ABC9 instead)
<whitequark[cis]>
<ravenslofty[m]> "they both cause ABC to crash..." <- ah ok, a crash isn't so bad
<ravenslofty[m]>
oh, Eddie closed the `&dc2` bug as no longer reproducing
<cr1901>
It might not be incredibly useful, but is there a way to get yosys to keep the temp abc directory (to look at the contents)?
<ravenslofty[m]>
And it's not like ABC was bug-free either
<ravenslofty[m]>
cr1901: abc -nocleanup
<cr1901>
oh oops. Staring me right in the face all this time
<ravenslofty[m]>
Catherine: looking at the Yosys bug tracker, there is #3384, which starts off well with `read_verilog -nomem2reg -yydebug` and may or may not be fixed by #3670, and there's #3346 which is too big to debug. Those are the two open ABC9 bugs I can find.
<ravenslofty[m]>
So, I'm going to make an argument that even if ABC9 is not stable and has bugs, the only way we will find those bugs is by increasing adoption of it to get it tested more.
<whitequark[cis]>
that sounds like ABC9 should be enabled across the board with resources dedicated to fixing the fallout from that, then
<ravenslofty[m]>
resource singular. my headmate's thoroughly leaving me alone on this one.
<cr1901>
I have at least one design I'm working on where, in terms of ice40 LUT usage, abc still beats flowmap and abc9 soundly. Hopefully I have some code to share soon.
<whitequark[cis]>
(beating flowmap is really not surprising)
<cr1901>
By nearly a factor of 2?
<whitequark[cis]>
yep
<cr1901>
Okay, if that's expected, then disregard the flowmap part.
<adamgreig[m]>
does anyone know if lattice's ice40 tools create a startup delay to work around the bram bug, like amaranth's ice40platform does?
<mcc111[m]>
<whitequark[cis]> "mcc111: oh, it looks like..." <- Ah, cool
<mcc111[m]>
Should I be worried it says 1 year lol
<whitequark[cis]>
I think you just renew it a year after
<adamgreig[m]>
and make a fake eth0 interface with a made up mac address because it doesn't understand to check enp4s0f0 instead, sigh