rcurtin_irc changed the topic of #mlpack to: mlpack: a scalable machine learning library (https://www.mlpack.org/) -- channel logs: https://libera.irclog.whitequark.org/mlpack -- NOTE: messages sent here might not be seen by bridged users on matrix, gitter, or slack
NishantTekwani[m has joined #mlpack
<rcurtin[m]> at least personally I don't have the time, but I am definitely hoping that Bandicoot plus integration of Bandicoot into ensmallen and mlpack will provide a framework upon which LLMs can be efficiently implemented in mlpack!
<jonpsy[m]> Interesting
<jonpsy[m]> How's bandicoot doing these days?
<rcurtin[m]> very good! release day is... today? tomorrow? I am finishing up benchmarks to be posted on the website
<rcurtin[m]> here is the graph I am looking at right now:
<jonpsy[m]> Impeccable timing
<rcurtin[m]> not everything is better on GPUs, though; check out matrix multiplication benchmarks for float and for double:
<rcurtin[m]> so, sometimes the CPU really can be faster (and I think on my benchmark system the CPU is newer than the RTX 2080 Ti I am using as a GPU, so I'm not totally surprised)
<rcurtin[m]> it depends a lot on the operation that's being done
<zoq[m]> and on the size too; 64-bit precision is almost the same for GPU/CPU after 10^3?
<rcurtin[m]> yeah; I was hoping to see ~an order of magnitude speedup for double matrix multiplication, but even cuBLAS isn't really outperforming OpenBLAS (those underlying libraries are really what's being compared here)
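(A minimal sketch of the kind of matrix multiplication benchmark being discussed, assuming Armadillo linked against OpenBLAS for the CPU side; the sizes and setup here are illustrative, not the actual benchmark script. The GPU side of the comparison would go through Bandicoot/cuBLAS instead.)

```cpp
// Illustrative only: times one dense double-precision matrix multiplication
// on the CPU via Armadillo, which dispatches to the underlying BLAS (dgemm).
#include <armadillo>
#include <chrono>
#include <iostream>

int main()
{
  const size_t n = 1024; // illustrative size; the plots sweep many sizes
  arma::mat A(n, n, arma::fill::randu);
  arma::mat B(n, n, arma::fill::randu);

  const auto start = std::chrono::steady_clock::now();
  arma::mat C = A * B; // handled by OpenBLAS (or whatever BLAS is linked)
  const auto stop = std::chrono::steady_clock::now();

  std::cout << "double " << n << "x" << n << " multiply: "
            << std::chrono::duration<double>(stop - start).count()
            << " s" << std::endl;
  return 0;
}
```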
<zoq[m]> Did you run something else for the first graph at the same time?
<zoq[m]> Because of the strange peak.
<rcurtin[m]> oh, the small spike for the CUDA backend? yeah, let me just run the whole thing again and see if that's still there. sometimes particular sizes can do better or worse depending on how they match up with the warp size on the GPU, but that looks pretty random
<zoq[m]> I see
<rcurtin[m]> yeah, I reran it and the spike just moved a little bit:
<jonpsy[m]> rcurtin[m]: Interesting, you mean exceeding the batch size?
<rcurtin[m]> I take the minimum of 5 trials, but maybe I should do it with 10
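(For context, taking the minimum over repeated trials can be done with a small helper like the sketch below; the harness actually used for the plots is not shown in the log.)

```cpp
// Sketch of a "minimum of N trials" timer: run the workload several times
// and report the best (lowest) wall-clock time observed.
#include <algorithm>
#include <chrono>
#include <functional>
#include <limits>

double min_time_of(const std::function<void()>& work, const size_t trials = 5)
{
  double best = std::numeric_limits<double>::infinity();
  for (size_t t = 0; t < trials; ++t)
  {
    const auto start = std::chrono::steady_clock::now();
    work();
    const auto stop = std::chrono::steady_clock::now();
    best = std::min(best, std::chrono::duration<double>(stop - start).count());
  }
  return best;
}
```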
<rcurtin[m]> yeah, basically, the nvidia warp size is usually 32 threads, so if I have a vector that's 33 elements, then I need two warps, and in the second warp I only use one thread (which is not good utilization)
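(The warp arithmetic described above works out as in this sketch, using the typical 32-thread NVIDIA warp size mentioned; the numbers are just the example from the discussion.)

```cpp
// Warps needed to cover a vector of n elements with a 32-thread warp,
// and the resulting thread utilization.  For n = 33: 2 warps, 33/64 ≈ 52%.
#include <cstddef>
#include <cstdio>

int main()
{
  const std::size_t warp_size = 32;
  const std::size_t n = 33; // the example from the discussion

  const std::size_t warps = (n + warp_size - 1) / warp_size;      // ceil(33 / 32) = 2
  const double utilization = double(n) / double(warps * warp_size); // 33 / 64

  std::printf("%zu elements -> %zu warps, %.1f%% thread utilization\n",
              n, warps, utilization * 100.0);
  return 0;
}
```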
CaCode has joined #mlpack
<jonpsy[m]> okay, got it
<jonpsy[m]> Welp, I'm having a little trouble comprehending log graphs