birm[m] has quit [Quit: You have been kicked for being idle]
<rcurtin[m]>
here's a nice bandicoot benchmark: randu() for 100M floats takes ~1.35s with Armadillo... using cuRAND with bandicoot, it takes ~0.001s. speedup of 1000x+ :)
<rcurtin[m]>
that's a rate of roughly 360 GB/s for randu generation on my RTX 2080 Ti, which has a max. memory bandwidth of 616 GB/s, so I guess the cuRAND developers did a pretty good job :)
<rcurtin[m]>
however, I have to write the randu kernels by hand for OpenCL... so I don't know if I'll succeed at getting the same performance levels...