<mcc111[m]>
I'm assuming WSL, but if Amaranth ever wants to talk to USB, then a Windows Python will make more sense
<Darius>
I thought WSL could do USB these days
<whitequark[cis]1>
WSL can do USB
<whitequark[cis]1>
it is mostly stable
<whitequark[cis]1>
Amaranth is designed to work well on Windows Python; if it doesn't, that's a bug
<mcc111[m]>
USB working on WSL is a surprise to me. Does it work on WSL1? I still use WSL1 because the Windows drive performance is better.
<Darius>
I doubt it works with v1
charlottia has joined #amaranth-lang
<charlottia>
Can confirm it works well with v2 (with some hacks to do the passthrough); v1 would be a little surprising, but I've been more surprised before.
<gruetzkopf>
oh they *finally* added USB passthrough to Hyper-V?
<charlottia>
I'm not sure of the exact mechanism; there's a device driver for the passthrough, and at one point a custom (Linux) kernel was needed too. I guess whatever it was got mainlined.
<galibert[m]>
the two Cyclone V evaluation boards I have (DE10-Nano and Cyclone V GX Starter Kit) have an ADV7513 HDMI encoder. It has a 24-pin parallel interface for the pixels, so you only need to run at the pixel clock rate
<galibert[m]>
That's roughly 135 MHz for full HD at 60 Hz, and it does that without problems. The ADV itself can go up to 165 MHz, so full HD is in practice the limit
<adamgreig[m]>
In comparison, without the ADV7513, if you output the HDMI TMDS signals directly it's about 10x the clock
<adamgreig[m]>
So like 1.5 GHz for 1080p60, which is much more challenging
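(For concreteness, a back-of-envelope version of that arithmetic; the CEA-861 2200x1125 total raster below is an assumption, not something stated above:)
```python
# Back-of-envelope check of the rates above, assuming standard CEA-861
# 1080p60 timing (2200 x 1125 total pixels); the exact blanking, and hence
# the pixel clock, depends on which timing standard you pick.
h_total, v_total, fps = 2200, 1125, 60
pixel_clock_hz = h_total * v_total * fps
tmds_bit_rate_hz = pixel_clock_hz * 10   # TMDS sends 10 bits per pixel per lane

print(f"pixel clock:   {pixel_clock_hz / 1e6:.1f} MHz")                # ~148.5 MHz
print(f"TMDS bit rate: {tmds_bit_rate_hz / 1e9:.3f} Gbit/s per lane")  # ~1.485
```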
<galibert[m]>
yeah, the cyclone wouldn't be able to run at that speed
<adamgreig[m]>
If you had an ECP5 with SERDES transceivers maybe you could do it, but I don't think the Colorlight boards have them, and it'd be a lot of work
<Allie>
SDI is probably more approachable
<adamgreig[m]>
Really, at 1080p? Isn't it even higher clocks due to being a single stream?
<adamgreig[m]>
I've never looked into generating it
<mcc111[m]>
I've found it surprisingly difficult to figure out what speed any of these FPGA boards is running at.
<mcc111[m]>
Like, it's not mentioned on the sales page for the Colorlight
<mcc111[m]>
I assumed the answer to "what is the FPGA clock rate?" is "it's complicated"
<adamgreig[m]>
Yea, the answer is always "it depends", but for any of those you might imagine 200 MHz is possible for reasonably well-designed logic
<galibert[m]>
mcc111: typical point-to-point routing time in the cyclone v is 200-500 picoseconds
<Allie>
576i is about 270 Mbit/s, which is very achievable with good logic
<Allie>
and tbh SD is good enough for anybody :>
<adamgreig[m]>
Seems less annoying to generate than the hdmi tmds too
<galibert[m]>
mcc111: making timing essentially requires having a propagation delay from one FF to the next that is less than a clock period
<galibert[m]>
which is why the maximal reachable clock speed is very very very design dependent
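(A minimal sketch of that timing budget, with made-up delay numbers just to show the shape of the calculation:)
```python
# Toy static-timing budget: the clock period must cover clock-to-out, the
# combinational logic, the routing, and the setup time of the next FF.
# All numbers are invented for illustration, not Cyclone V datasheet values.
t_clk_to_q_ns = 0.3
t_logic_ns    = 2.0   # combinational logic between the two FFs
t_routing_ns  = 0.4   # roughly the 200-500 ps point-to-point figure above
t_setup_ns    = 0.1

t_min_period_ns = t_clk_to_q_ns + t_logic_ns + t_routing_ns + t_setup_ns
print(f"f_max ~= {1e3 / t_min_period_ns:.0f} MHz")   # ~357 MHz for these numbers
```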
<mcc111[m]>
When one talks about "a clock" in an FPGA, are they talking about an actual clock circuit, like a crystal or something?
<mcc111[m]>
somebody mentioned the Cyclone V having "PLL"s (plural)
<Allie>
mcc111[m]: do you know what a PLL is?
<galibert[m]>
yeah, between... 4 and 7 iirc. I would need to check
<Allie>
adamgreig[m]: yeah, for SD it's just Cb Y Cr Y' Cb Y Cr Y' with a sync word at the start of every line and another one at the start of every hblank
<Allie>
super trivial to generate and therefore very fun
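(A toy Python sketch of that sample ordering for one active line; the sync words and blanking a real SD stream needs are left out:)
```python
# 4:2:2 multiplex for one active SD line: one Cb/Cr pair is shared between
# two luma samples, giving the Cb Y Cr Y' Cb Y Cr Y' ... ordering.
def mux_422(y, cb, cr):
    stream = []
    for i, (cb_s, cr_s) in enumerate(zip(cb, cr)):
        stream += [cb_s, y[2 * i], cr_s, y[2 * i + 1]]
    return stream

# 4 luma samples and 2 chroma pairs -> 8 words on the wire
print(mux_422(y=[0x10, 0x11, 0x12, 0x13], cb=[0x80, 0x81], cr=[0x7F, 0x7E]))
```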
<adamgreig[m]>
Most FPGAs will have one or more PLLs, and then a lowish-frequency crystal or oscillator or other clock source on the board
<Allie>
HD is basically the same except you have a run of Y Y' Y Y' Y Y' Y Y' and a run of Cb Cr Cb Cr Cb Cr Cb Cr
<adamgreig[m]>
Then your design sets up the PLL to generate whatever frequency (or frequencies) you need based on the external clock(s)
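(Roughly, that setup boils down to picking multiply/divide ratios; the brute-force search below is purely illustrative, since real PLLs also constrain the VCO range and which counter values are legal:)
```python
# Find integer multiply/divide ratios that get an input clock close to a
# target frequency. Purely illustrative; check the device datasheet for the
# actual VCO and counter constraints before trusting any result.
def pll_ratios(f_in_mhz, f_target_mhz, max_mul=64, max_div=64):
    best = None
    for m in range(1, max_mul + 1):
        for d in range(1, max_div + 1):
            err = abs(f_in_mhz * m / d - f_target_mhz)
            if best is None or err < best[0]:
                best = (err, m, d, f_in_mhz * m / d)
    return best

# e.g. 50 MHz board oscillator -> ~148.5 MHz pixel clock
err, mul, div, f_out = pll_ratios(50.0, 148.5)
print(f"x{mul}/{div} = {f_out:.3f} MHz (error {err:.3f} MHz)")
```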
<adamgreig[m]>
Allie: Aah fun, weird though, why not keep doing each pixel?
<mcc111[m]>
<adamgreig[m]> "Yea, the answer is always "it..." <- I'm surprised to learn it's *that* low given modern CPUs are in the 3ghz - 5.8ghz range. That calls into question my assumption that FPGAs actually can keep up with CPUs for modern day workloads. Like I understand that FPGAs can potentially "do things" on every clock cycle whereas that usually is not the case for a CPU, but still.
<adamgreig[m]>
Does sound fun though, I'd like to try it... what do you use to view SDI?
<Allie>
I use a £30k sony trimaster EL. you can use a £35 converter from blackmagic :P
<adamgreig[m]>
These are like $30 FPGAs from 10 years ago; they're in a different league from a modern desktop CPU
<galibert[m]>
mcc111: fpgas are damn slow compared to cpus or gpus
<Allie>
adamgreig[m]: there aren't really pixels in SDI
<Allie>
certainly not square ones
<adamgreig[m]>
Even a super modern $$$$ FPGA on a latest process is way way slower than a modern CPU though, yea
<mcc111[m]>
MNT is currently prototyping a laptop where the CPU/GPU are replaced with a Xilinx Kintex-7 XC7K325T-FFG676.
<Allie>
(SDTV is a *very* efficient way to use a bunch of airtime)
<galibert[m]>
mcc111: it will probably be comparable to... oh... maybe a 2005 laptop?
<mcc111[m]>
galibert[m]: okay. interesting.
<galibert[m]>
unless it has an embedded cpu core
<mcc111[m]>
i feel like we hit a "this computer is fast enough. it never needs to go any faster" point for me sometime between 2010 and 2015
<mcc111[m]>
2005 i might be able to put up with as long as the applications are carefully chosen
<galibert[m]>
I still want a faster computer
<galibert[m]>
(I want a faster fpga too :-) )
<mcc111[m]>
<Allie> "mcc111: do you know what a PLL..." <- I do, but my understanding of what you do with a PLL once you have one is relatively shallow.
<galibert[m]>
well, the pll/clock networks in a cyclone v can build a clock for more or less any frequency between... something like 1 and 550 MHz
<mcc111[m]>
Are the PLL units in a modern FPGA finite, like there's a specific number of dedicated "PLL"s etched in the chip and you task them as you see fit, or is a PLL simply one of the configurations you can force LABs into?
<crzwdjk>
PLLs are dedicated hardware units which can be configured by the FPGA bitstream
bob_twinkles[m] has joined #amaranth-lang
<bob_twinkles[m]>
you *can* build an oscillator out of LABs but it's not going to be very good
<crzwdjk>
But you can also e.g. build a clock divider in logic
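(For instance, a minimal Amaranth sketch of a divide-by-2N toggle; it's a toy, and for real designs a PLL output or a clock enable is usually preferable to a logic-derived clock:)
```python
from amaranth import Elaboratable, Module, Signal

class ClockDivider(Elaboratable):
    """Toggle `out` every `div` cycles of the sync clock, so `out` runs at
    f_sync / (2 * div). Assumes div >= 2."""
    def __init__(self, div):
        self.div = div
        self.out = Signal()

    def elaborate(self, platform):
        m = Module()
        counter = Signal(range(self.div))
        with m.If(counter == self.div - 1):
            m.d.sync += [counter.eq(0), self.out.eq(~self.out)]
        with m.Else():
            m.d.sync += counter.eq(counter + 1)
        return m
```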
<bob_twinkles[m]>
PLLs have specific analog considerations that make them hard to emulate in digital logic
<mcc111[m]>
Thanks for the explanations
<galibert[m]>
low-frequency PLLs like those used in floppy controllers are doable digitally (the Amiga and the WD177x, for instance, have a digital PLL; there are patents about them), but yeah, low frequency
<crzwdjk>
But yeah, FPGAs are not going to beat a CPU for raw speed. But an FPGA lets you take multiple processes that would each need a whole CPU (because of needing careful timing, for example) and run them in parallel
<galibert[m]>
then cpus can have a bunch of cores, and gpus a very big bunch of them
<crzwdjk>
Everything has its tradeoffs of course
<galibert[m]>
honestly, non-hobby use of FPGAs is, in my opinion, prototyping, handling lots of I/O channels, and things where you need very low latency
<mcc111[m]>
While I'm double checking my assumptions
<adamgreig[m]>
Check out tinytapeout.com, you can really get an ASIC with your little HDL design just like that
<adamgreig[m]>
The difference is mostly in the IO. For serious or fast chips there's still going to be a bunch of extra work though.
Wanda[cis] has joined #amaranth-lang
<Wanda[cis]>
there are differences in how ASICs work and what is possible on them
<adamgreig[m]>
The tinytapeout designs will be running at a few kHz despite being ASICs, heh
<Wanda[cis]>
for one, on FPGAs, initializing everything is basically free (because there's startup logic that has to configure the FPGA for you anyway); on ASICs you have to deal with every ROM and register being undefined on powerup
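(A concrete Amaranth-flavoured illustration of that point; `init=` is the Amaranth 0.5 spelling, older releases used `reset=`:)
```python
from amaranth import Module, Signal

m = Module()
# On an FPGA this initial value is loaded by the configuration bitstream,
# essentially for free. On an ASIC there is no bitstream: the register powers
# up undefined, so you have to rely on an explicit reset in the sync domain.
counter = Signal(8, init=0)
m.d.sync += counter.eq(counter + 1)
```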
<crzwdjk>
TinyTapeout 4 promises a clock speed of "around 50 MHz"
<adamgreig[m]>
I'm still waiting to get my TT01 chips :p
<Wanda[cis]>
so ... you can reuse the HDL file, provided that you wrote it with ASIC limitations in mind in the first place
<adamgreig[m]>
So like, you could take your FPGA HDL and turn it right into an ASIC (except for thinking about I/O and doing something about the FPGA blocks like memory and PLLs), but usually ASICs allow for different optimisations
<galibert[m]>
Note that 4-6 GHz is rather high even for a CPU
<adamgreig[m]>
<galibert[m]> "honestly non-hobby use of..." <- I think maybe DSP too? My work use is a mix of "custom serial protocols that mcus don't have a peripheral for" and "fast parallel dsp on a bunch of channels"
<bob_twinkles[m]>
the OpenROAD project is trying to get to the point of approximately "push button, get ASIC manufacturing artifacts", but it will likely always require some amount of artistry to get a design that is actually manufacturable. In industry there are entire teams devoted to doing nothing but patching up the holes left by the automation
<bob_twinkles[m]>
it's also extremely expensive, at least if you want to use a modern process node
<galibert[m]>
for fast parallel DSP on a lot of channels, it's really hard to beat a GPU
<mcc111[m]>
galibert[m]: If the question is "does it make more sense to solve this problem in an fpga or on my desktop computer" i think it's reasonable to compare to an average intel chip on amazon (ofc at some step you'll get into the differences between the two, but it makes sense to start the comparison there)
<crzwdjk>
Low-volume hardware that needs to shuffle a lot of data around weird digital interfaces is also a pretty common FPGA use case
<bob_twinkles[m]>
if you don't care about latency, yeah, but if you have hard real-time constraints, GPUs can't really deliver that because the software stack above them doesn't really support it
<galibert[m]>
heh, I have "vulkan-compatible gpu-ish thing" in my infinite todo list
<mcc111[m]>
So that makes me curious how big the "wall" between FPGAs and real chip fab is. Because if, by the time I'm done, I have made an Apache-licensed design which, running on an FPGA, can do 160p, but a "real company" can steal it, fab it, and have it run at 1080p, then I've done something worthwhile.
<mcc111[m]>
But if all I've done is make a proof of concept … it's less clear this was anything but a way to have fun.
<crzwdjk>
Tangentially, I just heard a presentation by a RISC-V person about a RISC-V GPU. Literally just today. But, CPUs are CPUs and GPUs are kind of their own thing for various reasons.
<galibert[m]>
cpu ISAs suck for gpus
<mcc111[m]>
galibert[m]: Right. I feel like the "hard part" is the interface between the kernel and the silicon. Because the hardware people aren't necessarily good at that and the Vulkan spec is (IMO anyway) big.
<mcc111[m]>
That is the part where having a "standard" design might be useful (even if you have to redesign everything else to actually fab)
<crzwdjk>
Btw, real GPU companies do use FPGAs to emulate their new GPU designs, albeit at a much slower speed and, at least when I last saw them (which was quite a while ago), in a huge refrigerator-sized box
<mcc111[m]>
crzwdjk: lol
<mcc111[m]>
crzwdjk: You mean like, the shader units are little riscv processors (this was a thing I had been thinking about exploring)? Or just a gpu designed to work with riscv
<bob_twinkles[m]>
yeah, those run ~10-100 times slower than the final product actually will, IIRC
<bob_twinkles[m]>
partly because the full design doesn't actually fit on a single FPGA die so you have a ton of bit shuffling across slow macroscopic busses I think
<crzwdjk>
mcc111[m]: shader units running risc-v code as I understand it. This idea seems to come up periodically.
<mcc111[m]>
bob_twinkles[m]: ok.
<mcc111[m]>
so that makes it sound like if my final design runs ~10-100 times slower than what you'd need for a regular laptop, i don't necessarily need to panic then lol
<mcc111[m]>
crzwdjk: it seems like a natural place to *start*. as galibert says it's probably not optimal, but if you have code for a riscv softcore already, hey…
<crzwdjk>
Well, also this "100x slower" is for a desktop GPU which is a pretty big chip
<bob_twinkles[m]>
yeah
<bob_twinkles[m]>
i think your project is really cool, but i also sit sort of close to the hardware designers at a big-name GPU company, so i think i know too much about how much work designing a modern GPU is
<bob_twinkles[m]>
there is a lot of stuff in there
<crzwdjk>
I kind of had the idea of making a GPU too; there are all kinds of potential crazy ideas for making something that's "a GPU" but also "doesn't even try to get into the same neighborhood as competing with NVIDIA"
<bob_twinkles[m]>
https://moonbaseotago.github.io/ this project might be of interest -- it's basically what you're proposing but for a RISC-V CPU
<crzwdjk>
Idk, maybe a "tiling GPU" designed to run on actual, physical tiles of an LED panel. Or I had the idea of making a specifically 2D GPU
<bob_twinkles[m]>
Apache-2.0 licensed CPU implementation that's architected to work well in an ASIC context (i.e. provide server-class performance) but is currently being developed on FPGAs
<mcc111[m]>
bob_twinkles[m]: That's very exciting.
<galibert[m]>
anyone who wants to make a gpu with shaders should have a look at the intel documentation of their gpu isa. I don't mean you have to imitate them, but it shows how different a gpu isa is compared to a cpu one
<bob_twinkles[m]>
his approach is to rent time on some big lads at Amazon.com rather than try and squeeze it on to something that a mortal can afford
<mcc111[m]>
galibert[m]: would reading this document create patent encumbrance?
<crzwdjk>
You can also try to look at the various mobile GPU ISAs that Mesa supports
<galibert[m]>
crzwdjk: when you just want to have a high-level look at what it looks like, documentation is nicer than RE results :-)
<bob_twinkles[m]>
the Apple GPU architecture has also been black-box reversed for the asahi project
<crzwdjk>
Sometimes the RE result blog posts are more readable (and accurate!) than the original docs!
<bob_twinkles[m]>
IIRC it's like Mali-derived? but i haven't looked too closely at that (see: I work at a GPU company that's not one of those)
<crzwdjk>
I think the Asahi folks have a pretty good overview post about how the ISA works somewhere
<crzwdjk>
But big common themes are a) SIMD, b) each SIMD lane is one shader invocation (pixel, vertex, whatever), c) because of the previous two items, some kind of per-lane predication mechanism
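(A tiny NumPy sketch of (b) and (c): every lane evaluates both sides of a branch and a per-lane mask selects which result survives:)
```python
import numpy as np

def predicated_shader(x):
    mask = x > 0.5               # per-lane branch condition
    taken = x * 2.0              # "then" path, computed on every lane
    not_taken = x + 1.0          # "else" path, also computed on every lane
    return np.where(mask, taken, not_taken)

pixels = np.array([0.1, 0.7, 0.4, 0.9])   # one value per SIMD lane
print(predicated_shader(pixels))           # [1.1 1.4 1.4 1.8]
```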
<galibert[m]>
Also not a set of registers but more like an array of fast dedicated per-core memory
<crzwdjk>
Lots of registers too, sometimes you also get to make a tradeoff between more registers and more parallelism
<mcc111[m]>
yeah what i've been told is the Hard Part of building a GPU is lining up the caches so that the data arrives when you want it
<mcc111[m]>
since a "normal" modern texture is much larger than any cache. you have to request the textures a shader unit is going to sample long in advance of the shader unit running
<bob_twinkles[m]>
stamping down an adder or a multiplier is easy, keeping it fed and happy is where things get tricky 😅
<galibert[m]>
well, if you look at the video RAM bandwidth, and you divide by the number of cores and the clock frequency, you realize how little data can be accessed per cycle
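(Plugging invented round numbers into that division, just to show the scale:)
```python
# Bytes of video RAM bandwidth available per core per cycle. The figures
# below are invented round numbers, not any specific product.
bandwidth_bytes_per_s = 450e9   # 450 GB/s of VRAM bandwidth (assumed)
num_cores             = 2048    # shader ALUs (assumed)
clock_hz              = 1.5e9   # core clock (assumed)

per_core_per_cycle = bandwidth_bytes_per_s / (num_cores * clock_hz)
print(f"{per_core_per_cycle:.3f} bytes/core/cycle")   # ~0.146
```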
<crzwdjk>
Yeah, I think shader ISAs in general let you issue a memory read well in advance of where you need the result
<crzwdjk>
And yeah, memory bandwidth is really the tough constraint
<galibert[m]>
ten years ago it was like one byte per cycle. I suspect it hasn't gotten that much better
<bob_twinkles[m]>
if anything it's gotten worse yeah
<crzwdjk>
Even in my not-quite-a-GPU that I am currently working on lol
<mcc111[m]>
anyway i just want to start by picking one small part of the problem and nibbling at it
<mcc111[m]>
termite logic
<galibert[m]>
well, start with an fp32 MAC? :-)
<mcc111[m]>
and if the farthest i get is "i added a 3D mode which can draw untextured polygons to the Project Freedom fantasy console on the Analogue Pocket", that will be satisfying
<galibert[m]>
you'll need a divider
<mcc111[m]>
I was told the Cyclone has DSP units. Can't I do at least some of the math on those?
<mcc111[m]>
If I can't do that, my plan was to make this "flopoco" thing do the work
<mcc111[m]>
to start
<galibert[m]>
they're integer multipliers
<mcc111[m]>
oh no
<galibert[m]>
9x9, 18x18 or 27x27
<mcc111[m]>
(mind you, writing some floating point ALUs sounds fun)
<crzwdjk>
I mean, you can build a floating point multiplier on top of that with a bit of work?
<mcc111[m]>
anyway crzwdjk it's very likely you'll never hear me mention this again and i encourage you to run with your plan
<galibert[m]>
you can, it's also on my todo list :-) There are nice papers about how to implement an fp MAC
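(For flavour, a toy Python model of a float32 multiply built from the integer pieces the DSP blocks give you; normal numbers only, truncation instead of proper rounding, no zeros/infs/NaNs, so it's a sketch rather than an IEEE-correct unit:)
```python
import struct

def f32_to_fields(x):
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    return bits >> 31, (bits >> 23) & 0xFF, bits & 0x7FFFFF  # sign, exp, frac

def f32_mul(a, b):
    sa, ea, fa = f32_to_fields(a)
    sb, eb, fb = f32_to_fields(b)
    ma = fa | 0x800000            # restore implicit leading 1 (normals only)
    mb = fb | 0x800000
    sign = sa ^ sb
    exp = ea + eb - 127           # exponents add, remove one bias
    prod = ma * mb                # 24x24 -> 48-bit integer multiply
    if prod & (1 << 47):          # product of [1,2) mantissas is in [1,4)
        prod >>= 1
        exp += 1
    frac = (prod >> 23) & 0x7FFFFF   # truncate (no rounding in this toy)
    bits = (sign << 31) | (exp << 23) | frac
    return struct.unpack(">f", struct.pack(">I", bits))[0]

print(f32_mul(1.5, 2.25), 1.5 * 2.25)   # both print 3.375
```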
<crzwdjk>
mcc111[m]: My idea is probably even less practical than yours; apparently HW-accelerated path rendering is a hard problem, given that people keep putting out research projects about it.
<mcc111[m]>
Path rendering?
<bob_twinkles[m]>
it's memory bandwidth again right?
<crzwdjk>
2D paths like SVG
<mcc111[m]>
oh, that's very interesting
<mcc111[m]>
So an important decision I made thinking about the RISCV GPU idea
<mcc111[m]>
The most important thing is *not 3D*
<mcc111[m]>
The important thing is to accelerate 2D interface compositing
<mcc111[m]>
Because if the goal is "the MNT Reform RKX7 should someday be able to run a desktop operating system", that's what you *actually* need
<mcc111[m]>
That and video decoding, but I assume video decoding is too much of a patent minefield to ever be possible
<crzwdjk>
Most 2D acceleration just uses 3D hardware these days. But memory bandwidth is, as always, a problem.
<bob_twinkles[m]>
especially in the mobile space, lots of GPUs do have a dedicated 2D engine that does compositing during scanout
<bob_twinkles[m]>
since you can do that with much less power than spinning up the big 3D pipeline
<crzwdjk>
Ah yeah, that is definitely a thing, and now they can composite like, 8 things at once or something, which is a big step up in terms of capability
<crzwdjk>
Also they do stuff like colorspace conversion for video (this has been a thing display engines could do for quite a while)
<bob_twinkles[m]>
yep, it's super handy since modern phone interfaces tend to have like 4-5 active layers at a time
<bob_twinkles[m]>
status bar, main application view, soft buttons, and maybe a video overlay
<galibert[m]>
it's also due to the fact that mobile gpus are really bad at blending
<bob_twinkles[m]>
FWIW most modern desktop toolkits do render through the 3D APIs (e.g. GNOME and Qt both use GL or VK as their rendering backend, with the exception of text, which usually ends up on the CPU)
<bob_twinkles[m]>
so while the display server work (slapping windows together) could benefit from 2D acceleration, because you could hack that into the compositor and use custom APIs, most of the actual application interfaces will still want good 3D performance
<crzwdjk>
Text is hard, as it turns out
<bob_twinkles[m]>
it's path rendering, but worse because you have like 4 separate turing-complete virtual machines to run 😅
<bob_twinkles[m]>
Western Digital is also a major contributor to the RISC-V standard body IIRC...
<crzwdjk>
Not surprising, RISC-V is a good replacement for all those weird bespoke embedded architectures that live in various hardware
<cr1901>
And an alphabet soup of extensions (sorry, I LOVE the base spec and its unapologetic minimalism, and... not much else)
<cr1901>
bob_twinkles[m]: Oh I remember this article/the linked hddguru thread. I tried doing this on one of my old failing hard drives and I think I just made things worse lmao
<bob_twinkles[m]>
heh, i can believe it. modern persistent storage is full of dark magic and arcane arts 😁
<mcc111[m]>
<bob_twinkles[m]> "WesternDigital is also a major..." <- well that solves the cache problem then… we will simply store the data on a flash drive
<bob_twinkles[m]>
heh, i think that's about an order of magnitude too slow...
<bob_twinkles[m]>
i was searching around to see if AMD contributed to RISC-V and stumbled across the "RV64X" project, perhaps that would be of interest to you if you haven't seen it already
<crzwdjk>
mcc111[m]: my current not-quite-a-GPU thing is a dumb terminal that stores its (full unicode, more or less) font in SPI flash.
<galibert[m]>
My latest amusement is trying to generate a PM5644 test pattern live in Amaranth. Circles are reasonably doable with just adders... but the radius lines I currently have no idea about
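(For the circles, the usual trick is a midpoint/Bresenham-style iteration whose decision variable is updated with additions only, which is why it maps well onto simple logic; a Python sketch of one octant:)
```python
# Midpoint-style circle: the decision variable d is updated with additions
# (and shifts), so a hardware version needs only adders and comparators.
def circle_points(r):
    x, y = r, 0
    d = 1 - r                  # decision variable
    pts = []
    while x >= y:
        pts.append((x, y))     # one octant; mirror for the other seven
        y += 1
        if d < 0:
            d += 2 * y + 1
        else:
            x -= 1
            d += 2 * (y - x) + 1
    return pts

print(circle_points(8))
```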
<sorear>
much of the issue is that CPU architecture reached an essentially modern form ~1995, while GPUs, especially mobile ones, have been in flux much more recently