azonenberg changed the topic of ##openfpga to: Open source tools for FPGAs, CPLDs, etc. Silicon RE, bitfile RE, synthesis, place-and-route, and JTAG are all on topic. Channel logs: https://libera.irclog.whitequark.org/~h~openfpga
mewt_ has quit [Remote host closed the connection]
mewt has joined ##openfpga
Degi has quit [Ping timeout: 276 seconds]
Degi has joined ##openfpga
<azonenberg> pie_: so, a few other things to consider
<azonenberg> With a GPU, your only way to get data in or out is PCIe
<azonenberg> So you're potentially bottlenecked on other parts of the system for data transfer (same as a CPU)
<azonenberg> with an FPGA you have the option of direct inputs on I/O pins or high speed serial transceivers
<azonenberg> GPUs typically come as premade boards with a ton of very fast ram; this is difficult to beat for access to large quantities of data when your problem is memory bound. Way more BW than a CPU can get
<azonenberg> Some newer gen high end FPGAs have HBM on them which is pretty competitive, and bigger ones with a ton of IO can run many channels of DDRx
<azonenberg> on the flip side, however, for smaller datasets that fit in on-die SRAM, you can get extreme bandwidth on an FPGA because every single block RAM can run in parallel (or you can cascade them to sacrifice bandwidth for larger size on a given block)
<azonenberg> Hard numbers for what I have handy right now: my workstation CPU is a Xeon Scalable Gold 6144, with 6 channels of DDR4-2666 per socket (I have two CPUs but let's talk apples to apples)
<azonenberg> So that's six 64-bit memory buses ignoring ECC, so 6*64 = 384 bits @ 2666 MT/s = 1.024 Tbps
<azonenberg> my GPU is an RTX 2080 Ti, which has a 352-bit memory bus at 14 GT/s (GDDR6 is much faster than DDR4), so a total bandwidth of 4.93 Tbps
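[editor's note: the bandwidth figures quoted above can be sanity-checked with a quick calculation. This is a sketch; the bus widths and transfer rates are as stated in the chat, not independently verified against the Intel and NVIDIA datasheets.]

```python
# Back-of-the-envelope check of the CPU and GPU peak memory
# bandwidth figures quoted in the chat.

def bus_bandwidth_tbps(width_bits: int, transfer_rate_mts: float) -> float:
    """Peak bandwidth of a memory bus in terabits per second."""
    return width_bits * transfer_rate_mts * 1e6 / 1e12

# Xeon Gold 6144: 6 channels x 64-bit DDR4-2666, ECC bits ignored
cpu = bus_bandwidth_tbps(6 * 64, 2666)

# RTX 2080 Ti: 352-bit GDDR6 bus at 14 GT/s
gpu = bus_bandwidth_tbps(352, 14000)

print(f"CPU: {cpu:.3f} Tbps")  # ~1.024 Tbps
print(f"GPU: {gpu:.2f} Tbps")  # ~4.93 Tbps
```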
<azonenberg> Compare this to a fairly midrange FPGA, the Xilinx XC7K160T-2FFG900. It has 325 36 Kbit SRAM blocks, 350 HR I/O pins, and 150 HP I/O pins
<azonenberg> The HR pins can collectively fit two 64-bit channels of DDR3-1066, the HP pins can do one channel of DDR3-1866. This is a combined 0.256 Tbps (256 Gbps), so about a quarter the CPU and 1/20 the GPU
<azonenberg> But the block RAMs can clock up to 543 MHz (or 601 in the -3 speed grade, so let's be generous and use that)
<azonenberg> And they each have a 36 bit read and 36 bit write bus, and can do both simultaneously. If you have equal mixes of reads and writes that's equivalent to a 72 bit bus, so you have 43.2 Gbps *per block RAM*
<azonenberg> Times 325 gives 14.04 Tbps, about 14x the CPU and 2.8x the GPU
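[editor's note: the Kintex-7 arithmetic above, checked the same way. BRAM count, clock rate, and DDR3 configurations are as quoted in the chat; the totals here carry full precision, so they land slightly above the chat's rounded figures.]

```python
# Check of the XC7K160T bandwidth arithmetic from the chat.

MTS = 1e6  # megatransfers/s -> transfers/s

# External memory: two 64-bit DDR3-1066 channels on the HR banks
# plus one 64-bit DDR3-1866 channel on the HP banks
ext_gbps = (2 * 64 * 1066 * MTS + 64 * 1866 * MTS) / 1e9
print(f"external DDR3: {ext_gbps:.0f} Gbps")  # ~256 Gbps

# On-die block RAM: 36-bit read + 36-bit write per cycle
# (true dual port, both ports active) at the -3 grade's 601 MHz
per_bram_gbps = 72 * 601e6 / 1e9
total_tbps = per_bram_gbps * 325 / 1e3
print(f"per BRAM:  {per_bram_gbps:.1f} Gbps")  # ~43.3 Gbps
print(f"all 325:   {total_tbps:.2f} Tbps")     # ~14.06 Tbps
```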
<azonenberg> now the CPUs and GPUs have caches too, of course, but they're still somewhat limited because they have only so many cores for the cache to feed
<azonenberg> if you can optimize your FPGA design with sufficient parallelism you can get ridiculous bandwidth. And with an even bigger FPGA, of course, the difference is bigger
<azonenberg> And that was a 28nm FPGA. a Kintex-Ultrascale+ XCKU19P has 1728 block RAMs which can clock to 825 MHz, so 59.4 Gbps * 1728 blocks = 102.6 Tbps. And that's not the largest FPGA available by a long shot
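[editor's note: the same per-BRAM arithmetic scaled to the UltraScale+ part. The block count (1728) and clock (825 MHz) are as quoted in the chat, not taken from the XCKU19P datasheet.]

```python
# XCKU19P block RAM aggregate bandwidth, per the chat's figures.
per_bram_gbps = 72 * 825e6 / 1e9       # 36b read + 36b write per cycle
total_tbps = per_bram_gbps * 1728 / 1e3
print(f"{per_bram_gbps:.1f} Gbps/BRAM, {total_tbps:.1f} Tbps total")
# ~59.4 Gbps/BRAM, ~102.6 Tbps total
```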
<azonenberg> FPGAs are not, however, very good at floating point math because you have to use fabric logic resources to build FPUs, while a CPU or GPU will have the floating point stuff in hard wired logic so they're faster and you have more of them
<azonenberg> So you'll get your butt kicked on floating point heavy stuff. But for fixed point integer work, they excel because they have hundreds or thousands of integer multipliers that can all run in parallel, plus you can build tons of adders in fabric logic
<azonenberg> FPGA vs ASIC is also an interesting thing, ignoring the startup costs for ASIC. In particular, the relative cost of different things changes
<azonenberg> in FPGAs, RAM is cheap. block RAMs are implemented as basically ASIC SRAM macros, while logic is expensive since it needs to be done in LUTs
<azonenberg> so your ratio of logic to RAM is quite low
<azonenberg> in ASIC, RAM is proportionally enormous because your logic shrinks and the memory doesn't
<azonenberg> the same is true of things like serial transceivers. A double digit Gbps ASIC SERDES IP is going to cost you seven digits in licensing fees and take up a huge amount of die area compared to logic
<azonenberg> in FPGA, you get the transceivers for free and they're relatively abundant compared to the amount of logic you have available
<azonenberg> and of course in CPU/GPU you have transceivers free too, but they're hardwired to certain functions like USB, PCIe, etc. So if you need 100GbE to your compute, you're out of luck unless you add a PCIe NIC and some code to marshal data around
<Flea86> azonenberg: Great writeup mate. Also TIL Intel Arria 10 FPGAs have single precision float support in their hard math blocks :>
<Flea86> Hmm, I see that a lot of the high end Xilinx parts can do it too.
<pie_> saved
specing has quit [Ping timeout: 260 seconds]
specing has joined ##openfpga
ZipCPU has quit [Remote host closed the connection]
ZipCPU_ has joined ##openfpga
ZipCPU_ is now known as ZipCPU
<pie_> Does anyone know any good material on YouTube on Verilog?
<pie_> azonenberg: I don't know if this is relevant to your pcb checklist but it looks helpful for beginners at least https://resources.ema-eda.com/i/1008704-the-hitchhikers-guide-to-pcb-design/ via https://www2.umbc.edu/ieee/resources.html
enok_ has quit [Remote host closed the connection]
enok has joined ##openfpga
enok has quit [Remote host closed the connection]
enok has joined ##openfpga