<jjb[m]>
ryan nice! I saw a few items on the “Should require only C++” that I’ll aim to tackle.
<zoq[m]>
Some really good numbers.
<rcurtin[m]>
I still have some minor bugs in my OpenCL XORWOW implementation, but I have it within ~4x of the runtime of cuRand. I'll probably spend a couple more hours with it, but `randu()` performance is not the most important thing in the world so probably not much more time than that... for now 😃
<rcurtin[m]>
The Philox generator you wrote will be what I use for randn() 👍️
<zoq[m]>
So far the implementation is simple, so it's easy to review.
<rcurtin[m]>
Those array operations are the easiest kernels to write and tune; they're mostly boilerplate. There might be some extra performance that one could squeeze out of each operation, but that's a task for another time...