<mcc111[m]>
I'm assuming WSL, but if Amaranth ever wants to talk to USB, then a Windows Python will make more sense
<Darius>
I thought WSL could do USB these days
<whitequark[cis]1>
WSL can do USB
<whitequark[cis]1>
it is mostly stable
<whitequark[cis]1>
Amaranth is designed to work well on Windows Python; if it doesn't, that's a bug
<mcc111[m]>
USB working on WSL is a surprise to me. Does it work on WSL1? I still use WSL1 because the Windows drive performance is better.
<Darius>
I doubt it works with v1
charlottia has joined #amaranth-lang
<charlottia>
Can confirm it works well with v2 (with some hacks to do the passthrough); v1 would be a little surprising, but I've been more surprised before.
<gruetzkopf>
oh they *finally* added USB passthrough to Hyper-V?
<charlottia>
I'm not sure of the exact mechanism; there's a device driver for the passthrough, and at one point a custom (Linux) kernel was needed too. I guess whatever it was got mainlined.
<galibert[m]>
the two Cyclone V evaluation boards I have (DE10-Nano and Cyclone V GX Starter Kit) have an ADV7513 HDMI encoder. It has a 24-pin parallel interface for the pixels, so you only need to run at the pixel clock rate
<galibert[m]>
That's roughly 135 MHz for full HD at 60 Hz, and it does that without problems. The ADV itself can go up to 165 MHz, so full HD is in practice the limit
<adamgreig[m]>
In comparison, without the ADV7513, if you output the HDMI TMDS signals directly it's about 10x the clock
<adamgreig[m]>
So like 1.5 GHz for 1080p60, which is much more challenging
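(For concreteness, a back-of-envelope version of that arithmetic; the CEA-861 2200x1125 total raster below is an assumption, not something stated above:)
```python
# Back-of-envelope check of the rates above, assuming standard CEA-861
# 1080p60 timing (2200 x 1125 total pixels); the exact blanking, and hence
# the pixel clock, depends on which timing standard you pick.
h_total, v_total, fps = 2200, 1125, 60
pixel_clock_hz = h_total * v_total * fps
tmds_bit_rate_hz = pixel_clock_hz * 10   # TMDS sends 10 bits per pixel per lane

print(f"pixel clock:   {pixel_clock_hz / 1e6:.1f} MHz")                # ~148.5 MHz
print(f"TMDS bit rate: {tmds_bit_rate_hz / 1e9:.3f} Gbit/s per lane")  # ~1.485
```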
<galibert[m]>
yeah, the cyclone wouldn't be able to run at that speed
<adamgreig[m]>
If you had an ECP5 with SERDES transceivers maybe you could do it, but I don't think the Colorlight boards have them, and it'd be a lot of work
<Allie>
SDI is probably more approachable
<adamgreig[m]>
Really, at 1080p? Isn't it even higher clocks due to being a single stream?
<adamgreig[m]>
I've never looked into generating it
<mcc111[m]>
I've found it surprisingly difficult to figure out what speed any of these FPGA boards is running at.
<mcc111[m]>
Like, it's not mentioned on the sales page for the Colorlight
<mcc111[m]>
I assumed the answer to "what is the FPGA clock rate?" is "it's complicated"
<adamgreig[m]>
Yea, the answer is always "it depends", but for any of those you might imagine 200 MHz is possible for reasonably well-designed logic
<galibert[m]>
mcc111: typical point-to-point routing time in the cyclone v is 200-500 picoseconds
<Allie>
576i is about 270 Mbit/s, which is very achievable with good logic
<Allie>
and tbh SD is good enough for anybody :>
<adamgreig[m]>
Seems less annoying to generate than the hdmi tmds too
<galibert[m]>
mcc111: making timing essentially requires having a propagation delay from one FF to the next that is less than a clock period
<galibert[m]>
which is why the maximal reachable clock speed is very very very design dependent
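(A minimal sketch of that timing budget, with made-up delay numbers just to show the shape of the calculation:)
```python
# Toy static-timing budget: the clock period must cover clock-to-out, the
# combinational logic, the routing, and the setup time of the next FF.
# All numbers are invented for illustration, not Cyclone V datasheet values.
t_clk_to_q_ns = 0.3
t_logic_ns    = 2.0   # combinational logic between the two FFs
t_routing_ns  = 0.4   # roughly the 200-500 ps point-to-point figure above
t_setup_ns    = 0.1

t_min_period_ns = t_clk_to_q_ns + t_logic_ns + t_routing_ns + t_setup_ns
print(f"f_max ~= {1e3 / t_min_period_ns:.0f} MHz")   # ~357 MHz for these numbers
```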
<mcc111[m]>
When one talks about "a clock" in an FPGA, are they talking about an actual clock circuit, like a crystal or something?
<mcc111[m]>
somebody mentioned the Cyclone V having "PLL"s (plural)
<Allie>
mcc111[m]: do you know what a PLL is?
<galibert[m]>
yeah, between... 4 and 7 iirc. I would need to check
<Allie>
adamgreig[m]: yeah, for SD it's just Cb Y Cr Y' Cb Y Cr Y' with a sync word at the start of every line and another one at the start of every hblank
<Allie>
super trivial to generate and therefore very fun
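(A toy Python sketch of that sample ordering for one active line; the sync words and blanking a real SD stream needs are left out:)
```python
# 4:2:2 multiplex for one active SD line: one Cb/Cr pair is shared between
# two luma samples, giving the Cb Y Cr Y' Cb Y Cr Y' ... ordering.
def mux_422(y, cb, cr):
    stream = []
    for i, (cb_s, cr_s) in enumerate(zip(cb, cr)):
        stream += [cb_s, y[2 * i], cr_s, y[2 * i + 1]]
    return stream

# 4 luma samples and 2 chroma pairs -> 8 words on the wire
print(mux_422(y=[0x10, 0x11, 0x12, 0x13], cb=[0x80, 0x81], cr=[0x7F, 0x7E]))
```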
<adamgreig[m]>
Most FPGAs will have one or more PLLs, and then a lowish-frequency crystal or oscillator or other clock source on the board
<Allie>
HD is basically the same except you have a run of Y Y' Y Y' Y Y' Y Y' and a run of Cb Cr Cb Cr Cb Cr Cb Cr
<adamgreig[m]>
Then your design sets up the PLL to generate whatever frequency (or frequencies) you need based on the external clock(s)
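(Roughly, that setup boils down to picking multiply/divide ratios; the brute-force search below is purely illustrative, since real PLLs also constrain the VCO range and which counter values are legal:)
```python
# Find integer multiply/divide ratios that get an input clock close to a
# target frequency. Purely illustrative; check the device datasheet for the
# actual VCO and counter constraints before trusting any result.
def pll_ratios(f_in_mhz, f_target_mhz, max_mul=64, max_div=64):
    best = None
    for m in range(1, max_mul + 1):
        for d in range(1, max_div + 1):
            err = abs(f_in_mhz * m / d - f_target_mhz)
            if best is None or err < best[0]:
                best = (err, m, d, f_in_mhz * m / d)
    return best

# e.g. 50 MHz board oscillator -> ~148.5 MHz pixel clock
err, mul, div, f_out = pll_ratios(50.0, 148.5)
print(f"x{mul}/{div} = {f_out:.3f} MHz (error {err:.3f} MHz)")
```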
<adamgreig[m]>
Allie: Aah fun, weird though, why not keep doing each pixel?
<mcc111[m]>
<adamgreig[m]> "Yea, the answer is always "it..." <- I'm surprised to learn it's *that* low given modern CPUs are in the 3ghz - 5.8ghz range. That calls into question my assumption that FPGAs actually can keep up with CPUs for modern day workloads. Like I understand that FPGAs can potentially "do things" on every clock cycle whereas that usually is not the case for a CPU, but still.
<adamgreig[m]>
Does sound fun though, I'd like to try it... what do you use to view SDI?
<Allie>
I use a £30k sony trimaster EL. you can use a £35 converter from blackmagic :P
<adamgreig[m]>
These are like $30 FPGAs from 10 years ago; they're in a different league from a modern desktop CPU
<galibert[m]>
mcc111: fpgas are damn slow compared to cpus or gpus
<Allie>
adamgreig[m]: there aren't really pixels in SDI
<Allie>
certainly not square ones
<adamgreig[m]>
Even a super modern $$$$ FPGA on a latest process is way way slower than a modern CPU though, yea
<mcc111[m]>
MNT is currently prototyping a laptop where the CPU/GPU are replaced with a Xilinx Kintex-7 XC7K325T-FFG676.
<Allie>
(SDTV is a *very* efficient way to use a bunch of airtime)
<galibert[m]>
mcc111: it will probably be comparable to... oh... maybe a 2005 laptop?
<mcc111[m]>
galibert[m]: okay. interesting.
<galibert[m]>
unless it has an embedded cpu core
<mcc111[m]>
i feel like we hit a "this computer is fast enough. it never needs to go any faster" point for me sometime between 2010 and 2015
<mcc111[m]>
2005 i might be able to put up with as long as the applications are carefully chosen
<galibert[m]>
I still want a faster computer
<galibert[m]>
(I want a faster fpga too :-) )
<mcc111[m]>
<Allie> "mcc111: do you know what a PLL..." <- I do, but my understanding of what you do with a PLL once you have one is relatively shallow.
<galibert[m]>
well, the pll/clock networks in a cyclone v can build a clock for more or less any frequency between... something like 1 and 550 MHz
<mcc111[m]>
Are the PLL units in a modern FPGA finite, like there's a specific number of dedicated "PLL"s etched in the chip and you task them as you see fit, or is a PLL simply one of the configurations you can force LABs into?
<crzwdjk>
PLLs are dedicated hardware units which can be configured by the FPGA bitstream
bob_twinkles[m] has joined #amaranth-lang
<bob_twinkles[m]>
you *can* build an oscillator out of LABs but it's not going to be very good
<crzwdjk>
But you can also e.g. build a clock divider in logic
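(For instance, a minimal Amaranth sketch of a divide-by-2N toggle; it's a toy, and for real designs a PLL output or a clock enable is usually preferable to a logic-derived clock:)
```python
from amaranth import Elaboratable, Module, Signal

class ClockDivider(Elaboratable):
    """Toggle `out` every `div` cycles of the sync clock, so `out` runs at
    f_sync / (2 * div). Assumes div >= 2."""
    def __init__(self, div):
        self.div = div
        self.out = Signal()

    def elaborate(self, platform):
        m = Module()
        counter = Signal(range(self.div))
        with m.If(counter == self.div - 1):
            m.d.sync += [counter.eq(0), self.out.eq(~self.out)]
        with m.Else():
            m.d.sync += counter.eq(counter + 1)
        return m
```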
<bob_twinkles[m]>
PLLs have specific analog considerations that make them hard to emulate in digital logic
<mcc111[m]>
Thanks for the explanations
<galibert[m]>
low-frequency PLLs like those used in floppy controllers are doable digitally (the Amiga and the WD177x, for instance, have a digital PLL; there are patents about them), but yeah, low frequency
<crzwdjk>
But yeah, FPGAs are not going to beat a CPU for raw speed. But an FPGA lets you take multiple processes that would each need a whole CPU (because of needing careful timing, for example) and run them in parallel
<galibert[m]>
then cpus can have a bunch of cores, and gpus a very big bunch of them
<crzwdjk>
Everything has its tradeoffs of course
<galibert[m]>
honestly, non-hobby use of FPGAs is, in my opinion, prototyping, handling lots of I/O channels, and things where you need very low latency
<mcc111[m]>
While I'm double checking my assumptions
<adamgreig[m]>
Check out tinytapeout.com, you can really get an ASIC with your little HDL design just like that
<adamgreig[m]>
The difference is mostly in the IO. For serious or fast chips there's still going to be a bunch of extra work though.
Wanda[cis] has joined #amaranth-lang
<Wanda[cis]>
there are differences in how ASICs work and what is possible on them
<adamgreig[m]>
The tinytapeout designs will be running at a few kHz despite being ASICs, heh
<Wanda[cis]>
for one, on FPGAs, initializing everything is basically free (because there's startup logic that has to configure the FPGA for you anyway); on ASICs you have to deal with every ROM and register being undefined on powerup
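(A concrete Amaranth-flavoured illustration of that point; `init=` is the Amaranth 0.5 spelling, older releases used `reset=`:)
```python
from amaranth import Module, Signal

m = Module()
# On an FPGA this initial value is loaded by the configuration bitstream,
# essentially for free. On an ASIC there is no bitstream: the register powers
# up undefined, so you have to rely on an explicit reset in the sync domain.
counter = Signal(8, init=0)
m.d.sync += counter.eq(counter + 1)
```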
<crzwdjk>
TinyTapeout 4 promises a clock speed of "around 50 MHz"
<adamgreig[m]>
I'm still waiting to get my TT01 chips :p
<Wanda[cis]>
so ... you can reuse the HDL file, provided that you wrote it with ASIC limitations in mind in the first place
<adamgreig[m]>
So like, you could take your FPGA HDL and turn it right into an ASIC (except for thinking about I/O and doing something about the FPGA blocks like memory and PLLs), but usually ASICs allow for different optimisations
<galibert[m]>
Note that 4-6 GHz is rather high even for a CPU
<adamgreig[m]>
<galibert[m]> "honestly non-hobby use of..." <- I think maybe DSP too? My work use is a mix of "custom serial protocols that mcus don't have a peripheral for" and "fast parallel dsp on a bunch of channels"
<bob_twinkles[m]>
the OpenROAD project is trying to get to the point of approximately "push button, get ASIC manufacturing artifacts", but it will likely always require some amount of artistry to get a design that is actually manufacturable. In industry there are entire teams devoted to doing nothing but patching up the holes left by the automation
<bob_twinkles[m]>
it's also extremely expensive, at least if you want to use a modern process node
<galibert[m]>
for fast parallel DSP on a lot of channels, it's really hard to beat a GPU
<mcc111[m]>
galibert[m]: If the question is "does it make more sense to solve this problem in an fpga or on my desktop computer" i think it's reasonable to compare to an average intel chip on amazon (ofc at some step you'll get into the differences between the two, but it makes sense to start the comparison there)
<crzwdjk>
Low-volume hardware that needs to shuffle a lot of data around weird digital interfaces is also a pretty common FPGA use case
<bob_twinkles[m]>
if you don't care about latency, yeah, but if you have hard real-time constraints, GPUs can't really deliver that because the software stack above them doesn't really support it
<galibert[m]>
heh, I have "vulkan-compatible gpu-ish thing" in my infinite todo list
<mcc111[m]>
So that makes me curious how big the "wall" between FPGAs and real chip fab is. Because if, by the time I'm done, I have made an Apache-licensed design which, running on an FPGA, can do 160p, but a "real company" can steal it, fab it, and have it run at 1080p, then I've done something worthwhile.
<mcc111[m]>
But if all I've done is make a proof of concept … it's less clear this was anything but a way to have fun.
<crzwdjk>
Tangentially, I just heard a presentation by a RISC-V person about a RISC-V GPU. Literally just today. But, CPUs are CPUs and GPUs are kind of their own thing for various reasons.
<galibert[m]>
cpu ISAs suck for gpus
<mcc111[m]>
galibert[m]: Right. I feel like the "hard part" is the interface between the kernel and the silicon. Because the hardware people aren't necessarily good at that and the Vulkan spec is (IMO anyway) big.
<mcc111[m]>
That is the part where having a "standard" design might be useful (even if you have to redesign everything else to actually fab)
<crzwdjk>
Btw, real GPU companies do use FPGAs to emulate their new GPU designs, albeit at a much slower speed and, at least when I last saw them (which was quite a while ago), in a huge refrigerator-sized box
<mcc111[m]>
crzwdjk: lol
<mcc111[m]>
crzwdjk: You mean like, the shader units are little riscv processors (this was a thing I had been thinking about exploring)? Or just a gpu designed to work with riscv
<bob_twinkles[m]>
yeah, those run ~10-100 times slower than the final product actually will, IIRC
<bob_twinkles[m]>
partly because the full design doesn't actually fit on a single FPGA die so you have a ton of bit shuffling across slow macroscopic busses I think
<crzwdjk>
mcc111[m]: shader units running risc-v code as I understand it. This idea seems to come up periodically.
<mcc111[m]>
bob_twinkles[m]: ok.
<mcc111[m]>
so that makes it sound like if my final design runs ~10-100 times slower than what you'd need for a regular laptop, i don't necessarily need to panic then lol
<mcc111[m]>
crzwdjk: it seems like a natural place to *start*. as galibert says it's probably not optimal, but if you have code for a riscv softcore already, hey…
<crzwdjk>
Well, also this "100x slower" is for a desktop GPU which is a pretty big chip
<bob_twinkles[m]>
yeah
<bob_twinkles[m]>
i think your project is really cool, but i also sit sort of close to the hardware designers at a big-name GPU company, so i think i know too much about how much work designing a modern GPU is
<bob_twinkles[m]>
there is a lot of stuff in there
<crzwdjk>
I kind of had the idea of making a GPU too; there are all kinds of potential crazy ideas for making something that's "a GPU" but also "doesn't even try to get into the same neighborhood as competing with NVIDIA"
<bob_twinkles[m]>
https://moonbaseotago.github.io/ this project might be of interest -- it's basically what you're proposing but for a RISC-V CPU
<crzwdjk>
Idk, maybe a "tiling GPU" designed to run on actual, physical tiles of an LED panel. Or I had the idea of making a specifically 2D GPU
<bob_twinkles[m]>
Apache-2.0 licensed CPU implementation that's architected to work well in an ASIC context (i.e. provide server-class performance) but is currently being developed on FPGAs
<mcc111[m]>
bob_twinkles[m]: That's very exciting.
<galibert[m]>
anyone who wants to make a gpu with shaders should have a look at the intel documentation of their gpu isa. I don't mean you have to imitate them, but it shows how different a gpu isa is compared to a cpu one
<bob_twinkles[m]>
his approach is to rent time on some big lads at Amazon.com rather than try and squeeze it on to something that a mortal can afford
<mcc111[m]>
galibert[m]: would reading this document create patent encumbrance?
<crzwdjk>
You can also try to look at the various mobile GPU ISAs that Mesa supports
<galibert[m]>
crzwdjk: when you just want to have a high-level look at what it looks like, documentation is nicer than RE results :-)
<bob_twinkles[m]>
the Apple GPU architecture has also been black-box reversed for the asahi project
<crzwdjk>
Sometimes the RE result blog posts are more readable (and accurate!) than the original docs!
<bob_twinkles[m]>
IIRC it's like Mali-derived? but i haven't looked too closely at that (see: I work at a GPU company that's not one of those)
<crzwdjk>
I think the Asahi folks have a pretty good overview post about how the ISA works somewhere
<crzwdjk>
But big common themes are a) SIMD, b) each SIMD lane is one shader invocation (pixel, vertex, whatever), c) because of the previous two items, some kind of per-lane predication mechanism
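(A tiny NumPy sketch of (b) and (c): every lane evaluates both sides of a branch and a per-lane mask selects which result survives:)
```python
import numpy as np

def predicated_shader(x):
    mask = x > 0.5               # per-lane branch condition
    taken = x * 2.0              # "then" path, computed on every lane
    not_taken = x + 1.0          # "else" path, also computed on every lane
    return np.where(mask, taken, not_taken)

pixels = np.array([0.1, 0.7, 0.4, 0.9])   # one value per SIMD lane
print(predicated_shader(pixels))           # [1.1 1.4 1.4 1.8]
```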
<galibert[m]>
Also not a set of registers but more like an array of fast dedicated per-core memory
<crzwdjk>
Lots of registers too, sometimes you also get to make a tradeoff between more registers and more parallelism
<mcc111[m]>
yeah what i've been told is the Hard Part of building a GPU is lining up the caches so that the data arrives when you want it
<mcc111[m]>
since a "normal" modern texture is much larger than any cache. you have to request the textures a shader unit is going to sample long in advance of the shader unit running
<bob_twinkles[m]>
stamping down an adder or a multiplier is easy, keeping it fed and happy is where things get tricky 😅
<galibert[m]>
well, if you look at the video RAM bandwidth, and you divide by the number of cores and the clock frequency, you realize how little data can be accessed per cycle
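(Plugging invented round numbers into that division, just to show the scale:)
```python
# Bytes of video RAM bandwidth available per core per cycle. The figures
# below are invented round numbers, not any specific product.
bandwidth_bytes_per_s = 450e9   # 450 GB/s of VRAM bandwidth (assumed)
num_cores             = 2048    # shader ALUs (assumed)
clock_hz              = 1.5e9   # core clock (assumed)

per_core_per_cycle = bandwidth_bytes_per_s / (num_cores * clock_hz)
print(f"{per_core_per_cycle:.3f} bytes/core/cycle")   # ~0.146
```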
<crzwdjk>
Yeah, I think shader ISAs in general let you issue a memory read well in advance of where you need the result
<crzwdjk>
And yeah, memory bandwidth is really the tough constraint
<galibert[m]>
ten years ago it was like one byte per cycle. I suspect it hasn't gotten that much better
<bob_twinkles[m]>
if anything it's gotten worse yeah
<crzwdjk>
Even in my not-quite-a-GPU that I am currently working on lol
<mcc111[m]>
anyway i just want to start by picking one small part of the problem and nibbling at it
<mcc111[m]>
termite logic
<galibert[m]>
well, start with an fp32 MAC? :-)
<mcc111[m]>
and if the farthest i get is "i added a 3D mode which can draw untextured polygons to the Project Freedom fantasy console on the Analogue Pocket", that will be satisfying
<galibert[m]>
you'll need a divider
<mcc111[m]>
I was told the Cyclone has DSP units. Can't I do at least some of the math on those?
<mcc111[m]>
If I can't do that, my plan was to make this "flopoco" thing do the work
<mcc111[m]>
to start
<galibert[m]>
they're integer multipliers
<mcc111[m]>
oh no
<galibert[m]>
9x9, 18x18 or 27x27
<mcc111[m]>
(mind you, writing some floating point ALUs sounds fun)
<crzwdjk>
I mean, you can build a floating point multiplier on top of that with a bit of work?
<mcc111[m]>
anyway crzwdjk it's very likely you'll never hear me mention this again and i encourage you to run with your plan
<galibert[m]>
you can, it's also on my todo list :-) There are nice papers about how to implement an fp MAC
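(For flavour, a toy Python model of a float32 multiply built from the integer pieces the DSP blocks give you; normal numbers only, truncation instead of proper rounding, no zeros/infs/NaNs, so it's a sketch rather than an IEEE-correct unit:)
```python
import struct

def f32_to_fields(x):
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    return bits >> 31, (bits >> 23) & 0xFF, bits & 0x7FFFFF  # sign, exp, frac

def f32_mul(a, b):
    sa, ea, fa = f32_to_fields(a)
    sb, eb, fb = f32_to_fields(b)
    ma = fa | 0x800000            # restore implicit leading 1 (normals only)
    mb = fb | 0x800000
    sign = sa ^ sb
    exp = ea + eb - 127           # exponents add, remove one bias
    prod = ma * mb                # 24x24 -> 48-bit integer multiply
    if prod & (1 << 47):          # product of [1,2) mantissas is in [1,4)
        prod >>= 1
        exp += 1
    frac = (prod >> 23) & 0x7FFFFF   # truncate (no rounding in this toy)
    bits = (sign << 31) | (exp << 23) | frac
    return struct.unpack(">f", struct.pack(">I", bits))[0]

print(f32_mul(1.5, 2.25), 1.5 * 2.25)   # both print 3.375
```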
<crzwdjk>
mcc111[m]: My idea is probably even less practical than yours; apparently HW-accelerated path rendering is a hard problem, given that people keep putting out research projects about it.
<mcc111[m]>
Path rendering?
<bob_twinkles[m]>
it's memory bandwidth again right?
<crzwdjk>
2D paths like SVG
<mcc111[m]>
oh, that's very interesting
<mcc111[m]>
So an important decision I made thinking about the RISCV GPU idea
<mcc111[m]>
The most important thing is *not 3D*
<mcc111[m]>
The important thing is to accelerate 2D interface compositing
<mcc111[m]>
Because if the goal is "the MNT Reform RKX7 should someday be able to run a desktop operating system", that's what you *actually* need
<mcc111[m]>
That and video decoding, but I assume video decoding is too much of a patent minefield to ever be possible
<crzwdjk>
Most 2D acceleration just uses 3D hardware these days. But memory bandwidth is, as always, a problem.
<bob_twinkles[m]>
especially in the mobile space, lots of GPUs do have a dedicated 2D engine that does compositing during scanout
<bob_twinkles[m]>
since you can do that with much less power than spinning up the big 3D pipeline
<crzwdjk>
Ah yeah, that is definitely a thing, and now they can composite like, 8 things at once or something, which is a big step up in terms of capability
<crzwdjk>
Also they do stuff like colorspace conversion for video (this has been a thing display engines could do for quite a while)
<bob_twinkles[m]>
yep, it's super handy since modern phone interfaces tend to have like 4-5 active layers at a time
<bob_twinkles[m]>
status bar, main application view, soft buttons, and maybe a video overlay
<galibert[m]>
it's also due to the fact that mobile gpus are really bad at blending
<bob_twinkles[m]>
FWIW most modern desktop toolkits do render through the 3D APIs (e.g. GNOME and Qt both use GL or VK as their rendering backend, with the exception of text, which usually ends up on the CPU)
<bob_twinkles[m]>
so while the display server work (slapping windows together) could benefit from 2D acceleration, because you could hack that into the compositor and use custom APIs, most of the actual application interfaces will still want good 3D performance
<crzwdjk>
Text is hard, as it turns out
<bob_twinkles[m]>
it's path rendering, but worse because you have like 4 separate turing-complete virtual machines to run 😅
<bob_twinkles[m]>
Western Digital is also a major contributor to the RISC-V standard body IIRC...
<crzwdjk>
Not surprising, RISC-V is a good replacement for all those weird bespoke embedded architectures that live in various hardware
<cr1901>
And an alphabet soup of extensions (sorry, I LOVE the base spec and its unapologetic minimalism, and... not much else)
<cr1901>
bob_twinkles[m]: Oh I remember this article/the linked hddguru thread. I tried doing this on one of my old failing hard drives and I think I just made things worse lmao
<bob_twinkles[m]>
heh, i can believe it. modern persistent storage is full of dark magic and arcane arts 😁
<mcc111[m]>
<bob_twinkles[m]> "WesternDigital is also a major..." <- well that solves the cache problem then… we will simply store the data on a flash drive
<bob_twinkles[m]>
heh, i think that's about an order of magnitude too slow...
<bob_twinkles[m]>
i was searching around to see if AMD contributed to RISC-V and stumbled across the "RV64X" project, perhaps that would be of interest to you if you haven't seen it already
<crzwdjk>
mcc111[m]: my current not-quite-a-GPU thing is a dumb terminal that stores its (full unicode, more or less) font in SPI flash.
<galibert[m]>
My latest amusement is trying to generate a PM5644 test pattern live in Amaranth. Circles are reasonably doable with just adders... but the radius lines I currently have no idea about
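(For the circles, the usual trick is a midpoint/Bresenham-style iteration whose decision variable is updated with additions only, which is why it maps well onto simple logic; a Python sketch of one octant:)
```python
# Midpoint-style circle: the decision variable d is updated with additions
# (and shifts), so a hardware version needs only adders and comparators.
def circle_points(r):
    x, y = r, 0
    d = 1 - r                  # decision variable
    pts = []
    while x >= y:
        pts.append((x, y))     # one octant; mirror for the other seven
        y += 1
        if d < 0:
            d += 2 * y + 1
        else:
            x -= 1
            d += 2 * (y - x) + 1
    return pts

print(circle_points(8))
```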
<sorear>
much of the issue is that CPU architecture reached an essentially modern form ~1995, while GPUs, especially mobile ones, have been in flux much more recently