<rcurtin[m]>
at least personally I don't have the time, but I am definitely hoping that Bandicoot plus integration of Bandicoot into ensmallen and mlpack will provide a framework upon which LLMs can be efficiently implemented in mlpack!
<jonpsy[m]>
Interesting
<jonpsy[m]>
How's bandicoot doing these days?
<rcurtin[m]>
very good! release day is... today? tomorrow? I am finishing up benchmarks to be posted on the website
<rcurtin[m]>
here is the graph I am looking at right now:
<rcurtin[m]>
so, sometimes the CPU really can be faster (and I think on my benchmark system, the CPU is newer than the RTX 2080 Ti I am using as a GPU, so, not totally surprised)
<rcurtin[m]>
it depends a lot on the operation that's being done
<zoq[m]>
and the size; 64-bit precision is almost the same for GPU/CPU after 10^3?
<rcurtin[m]>
yeah; I was hoping to see ~an order of magnitude speedup for double matrix multiplication, but even cuBLAS isn't really outperforming OpenBLAS (those underlying libraries are really what's being compared here)
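[Editor's note: a minimal sketch of the kind of head-to-head being discussed, assuming Bandicoot's Armadillo-like API (coot::mat next to arma::mat, coot::fill::randu, coot::accu) and that Armadillo is linked against OpenBLAS while Bandicoot's CUDA backend dispatches to cuBLAS. This is illustrative only, not the benchmark code behind the website graphs.]

  #include <armadillo>
  #include <bandicoot>
  #include <chrono>
  #include <cstddef>
  #include <iostream>

  int main()
  {
    const std::size_t n = 1000; // one matrix size; the graphs sweep many sizes

    // CPU: Armadillo's operator* dispatches dgemm to the linked BLAS
    // (OpenBLAS here).
    arma::mat a(n, n, arma::fill::randu), b(n, n, arma::fill::randu);
    const auto c0 = std::chrono::steady_clock::now();
    arma::mat c = a * b;
    const auto c1 = std::chrono::steady_clock::now();

    // GPU: Bandicoot's operator* dispatches dgemm to cuBLAS (or clBLAS,
    // depending on the backend).  coot::accu() pulls a scalar back to the
    // host, so the GPU work is actually finished before the clock stops.
    coot::mat ga(n, n, coot::fill::randu), gb(n, n, coot::fill::randu);
    coot::mat warmup = ga * gb;      // warm-up: exclude one-time kernel/context setup
    (void) coot::accu(warmup);
    const auto g0 = std::chrono::steady_clock::now();
    coot::mat gc = ga * gb;
    (void) coot::accu(gc);
    const auto g1 = std::chrono::steady_clock::now();

    std::cout << "CPU dgemm: " << std::chrono::duration<double>(c1 - c0).count()
              << "s; GPU dgemm: " << std::chrono::duration<double>(g1 - g0).count()
              << "s" << std::endl;
  }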
<zoq[m]>
Did you run something else at the same time while generating the first graph?
<zoq[m]>
Because of the strange peak.
<rcurtin[m]>
oh, the small spike for the CUDA backend? yeah, let me just run the whole thing again and see if that's still there. sometimes particular sizes can do better or worse depending on how they match up with the warp size on the GPU, but that looks pretty random
<zoq[m]>
I see
<rcurtin[m]>
yeah, I reran it and the spike just moved a little bit:
<jonpsy[m]>
rcurtin[m]: Interesting, you mean exceeding the batch size?
<rcurtin[m]>
I take the minimum of 5 trials, but maybe I should do it with 10
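[Editor's note: for context, a generic min-of-N timing helper in the spirit of that approach; the trial count and the timed callable are placeholders, not the actual benchmark harness.]

  #include <algorithm>
  #include <chrono>
  #include <cstddef>
  #include <functional>
  #include <limits>

  // Run `work` `trials` times and report the fastest run; taking the minimum
  // filters out one-off interference from other processes (like a stray spike).
  double min_runtime_seconds(const std::function<void()>& work,
                             const std::size_t trials = 5)
  {
    double best = std::numeric_limits<double>::max();
    for (std::size_t t = 0; t < trials; ++t)
    {
      const auto start = std::chrono::steady_clock::now();
      work();
      const auto stop = std::chrono::steady_clock::now();
      best = std::min(best,
          std::chrono::duration<double>(stop - start).count());
    }
    return best;
  }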
<rcurtin[m]>
yeah, basically, the nvidia warp size is usually 32 threads, so if I have a vector that's 33 elements, then I need two warps, and in the second warp I only use one thread (which is not good utilization)
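[Editor's note: to make the utilization point concrete, a small illustrative calculation in plain C++ using the numbers from the example above.]

  #include <cstddef>
  #include <iostream>

  int main()
  {
    const std::size_t warp_size = 32;  // typical NVIDIA warp size
    const std::size_t n_elems   = 33;  // one element past a warp boundary

    // Round up to whole warps: 33 elements -> 2 warps -> 64 launched threads,
    // of which only 33 do useful work.
    const std::size_t warps   = (n_elems + warp_size - 1) / warp_size;
    const std::size_t threads = warps * warp_size;
    const double utilization  = double(n_elems) / double(threads);

    std::cout << warps << " warps, " << threads << " threads, "
              << "utilization = " << utilization << std::endl;  // ~0.52
  }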
CaCode has joined #mlpack
<jonpsy[m]>
okay, got it
<jonpsy[m]>
Welp, I'm having a little trouble comprehending log graphs