azonenberg changed the topic of ##openfpga to: Open source tools for FPGAs, CPLDs, etc. Silicon RE, bitfile RE, synthesis, place-and-route, and JTAG are all on topic. Channel logs: https://libera.irclog.whitequark.org/~h~openfpga
mewt_ has quit [Remote host closed the connection]
mewt has joined ##openfpga
Degi has quit [Ping timeout: 276 seconds]
Degi has joined ##openfpga
<azonenberg> pie_: so, a few other things to consider
<azonenberg> With a GPU, your only way to get data in or out is PCIe
<azonenberg> So you're potentially bottlenecked on other parts of the system for data transfer (same as a CPU)
<azonenberg> with an FPGA you have the option of direct inputs on I/O pins or high speed serial transceivers
<azonenberg> GPUs typically come as premade boards with a ton of very fast ram; this is difficult to beat for access to large quantities of data when your problem is memory bound. Way more BW than a CPU can get
<azonenberg> Some newer gen high end FPGAs have HBM on them which is pretty competitive, and bigger ones with a ton of IO can run many channels of DDRx
<azonenberg> on the flip side, however, for smaller datasets that fit in on-die SRAM, you can get extreme bandwidth on an FPGA because every single block RAM can run in parallel (or you can cascade them to sacrifice bandwidth for larger size on a given block)
<azonenberg> Hard numbers for what I have handy right now: my workstation CPU is a Xeon Scalable Gold 6144, with 6 channels of DDR4-2666 per socket (I have two CPUs but let's talk apples to apples)
<azonenberg> So that's six 64-bit memory buses ignoring ECC, so 6*64 = 384 bits @ 2666 MT/s = 1.024 Tbps
<azonenberg> my GPU is an RTX 2080 Ti, which has a 352-bit memory bus at 14 GT/s (GDDR6 is much faster than DDR4), so a total bandwidth of 4.93 Tbps
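[editor's note: the bandwidth figures quoted above can be sanity-checked with a quick calculation. This is a sketch; the bus widths and transfer rates are as stated in the chat, not independently verified against the Intel and NVIDIA datasheets.]

```python
# Back-of-the-envelope check of the CPU and GPU peak memory
# bandwidth figures quoted in the chat.

def bus_bandwidth_tbps(width_bits: int, transfer_rate_mts: float) -> float:
    """Peak bandwidth of a memory bus in terabits per second."""
    return width_bits * transfer_rate_mts * 1e6 / 1e12

# Xeon Gold 6144: 6 channels x 64-bit DDR4-2666, ECC bits ignored
cpu = bus_bandwidth_tbps(6 * 64, 2666)

# RTX 2080 Ti: 352-bit GDDR6 bus at 14 GT/s
gpu = bus_bandwidth_tbps(352, 14000)

print(f"CPU: {cpu:.3f} Tbps")  # ~1.024 Tbps
print(f"GPU: {gpu:.2f} Tbps")  # ~4.93 Tbps
```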
<azonenberg> Compare this to a fairly midrange FPGA, the Xilinx XC7K160T-2FFG900. It has 325 36 Kbit SRAM blocks, 350 HR I/O pins, and 150 HP I/O pins
<azonenberg> The HR pins can collectively fit two 64-bit channels of DDR3-1066, the HP pins can do one channel of DDR3-1866. This is a combined 0.256 Tbps (256 Gbps), so about a quarter the CPU and 1/20 the GPU
<azonenberg> But the block RAMs can clock up to 543 MHz (or 601 in the -3 speed grade, so let's be generous and use that)
<azonenberg> And they each have a 36 bit read and 36 bit write bus, and can do both simultaneously. If you have equal mixes of reads and writes that's equivalent to a 72 bit bus, so you have 43.2 Gbps *per block RAM*
<azonenberg> Times 325 gives 14.04 Tbps, about 14x the CPU and 2.8x the GPU
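[editor's note: the Kintex-7 arithmetic above, checked the same way. BRAM count, clock rate, and DDR3 configurations are as quoted in the chat; the totals here carry full precision, so they land slightly above the chat's rounded figures.]

```python
# Check of the XC7K160T bandwidth arithmetic from the chat.

MTS = 1e6  # megatransfers/s -> transfers/s

# External memory: two 64-bit DDR3-1066 channels on the HR banks
# plus one 64-bit DDR3-1866 channel on the HP banks
ext_gbps = (2 * 64 * 1066 * MTS + 64 * 1866 * MTS) / 1e9
print(f"external DDR3: {ext_gbps:.0f} Gbps")  # ~256 Gbps

# On-die block RAM: 36-bit read + 36-bit write per cycle
# (true dual port, both ports active) at the -3 grade's 601 MHz
per_bram_gbps = 72 * 601e6 / 1e9
total_tbps = per_bram_gbps * 325 / 1e3
print(f"per BRAM:  {per_bram_gbps:.1f} Gbps")  # ~43.3 Gbps
print(f"all 325:   {total_tbps:.2f} Tbps")     # ~14.06 Tbps
```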
<azonenberg> now the CPUs and GPUs have caches too, of course, but they're still somewhat limited because they have only so many cores for the cache to feed
<azonenberg> if you can optimize your FPGA design with sufficient parallelism you can get ridiculous bandwidth. And with an even bigger FPGA, of course, the difference is bigger
<azonenberg> And that was a 28nm FPGA. a Kintex-Ultrascale+ XCKU19P has 1728 block RAMs which can clock to 825 MHz, so 59.4 Gbps * 1728 blocks = 102.6 Tbps. And that's not the largest FPGA available by a long shot
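[editor's note: the same per-BRAM arithmetic scaled to the UltraScale+ part. The block count (1728) and clock (825 MHz) are as quoted in the chat, not taken from the XCKU19P datasheet.]

```python
# XCKU19P block RAM aggregate bandwidth, per the chat's figures.
per_bram_gbps = 72 * 825e6 / 1e9       # 36b read + 36b write per cycle
total_tbps = per_bram_gbps * 1728 / 1e3
print(f"{per_bram_gbps:.1f} Gbps/BRAM, {total_tbps:.1f} Tbps total")
# ~59.4 Gbps/BRAM, ~102.6 Tbps total
```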
<azonenberg> FPGAs are not, however, very good at floating point math because you have to use fabric logic resources to build FPUs, while a CPU or GPU will have the floating point stuff in hard wired logic so they're faster and you have more of them
<azonenberg> So you'll get your butt kicked on floating point heavy stuff. But for fixed point integer work, they excel because they have hundreds or thousands of integer multipliers that can all run in parallel, plus you can build tons of adders in fabric logic
<azonenberg> FPGA vs ASIC is also an interesting thing, ignoring the startup costs for ASIC. In particular, the relative cost of different things changes
<azonenberg> in FPGAs, RAM is cheap. block RAMs are implemented as basically ASIC SRAM macros, while logic is expensive since it needs to be done in LUTs
<azonenberg> so your ratio of logic to RAM is quite low
<azonenberg> in ASIC, RAM is proportionally enormous because your logic shrinks and the memory doesn't
<azonenberg> the same is true of things like serial transceivers. A double digit Gbps ASIC SERDES IP is going to cost you seven digits in licensing fees and take up a huge amount of die area compared to logic
<azonenberg> in FPGA, you get the transceivers for free and they're relatively abundant compared to the amount of logic you have available
<azonenberg> and of course in CPU/GPU you have transceivers free too, but they're hardwired to certain functions like USB, PCIe, etc. So if you need 100GbE to your compute, you're out of luck unless you add a PCIe NIC and some code to marshal data around
<Flea86> azonenberg: Great writeup mate. Also TIL Intel Arria 10 FPGAs have single precision float support in their hard math blocks :>
<Flea86> Hmm, I see that a lot of the high end Xilinx parts can do it too.
<pie_> saved
specing has quit [Ping timeout: 260 seconds]
specing has joined ##openfpga
ZipCPU has quit [Remote host closed the connection]
ZipCPU_ has joined ##openfpga
ZipCPU_ is now known as ZipCPU
<pie_> Does anyone know any good material on YouTube on Verilog?
<pie_> azonenberg: I don't know if this is relevant to your pcb checklist but it looks helpful for beginners at least https://resources.ema-eda.com/i/1008704-the-hitchhikers-guide-to-pcb-design/ via https://www2.umbc.edu/ieee/resources.html
enok_ has quit [Remote host closed the connection]
enok has joined ##openfpga
enok has quit [Remote host closed the connection]
enok has joined ##openfpga