ChanServ changed the topic of #mlpack to: "mlpack: a fast, flexible machine learning library :: We don't always respond instantly, but we will respond; please be patient :: Logs at http://www.mlpack.org/irc/"
< zoq> rcurtin: In your bandicoot experiments did you look into cooperative groups? I did a first quick test and it looks like it's faster compared to the reduce kernel, I'm not using the exact same kernel that is in bandicoot right now, will do that next.
< zoq> But maybe you already did a quick test.
< rcurtin> not at all, I don't know much about cooperative groups
< rcurtin> do you mean that it's faster than the accu() kernel I wrote? if so awesome!
< zoq> Maybe, I'm not sure nvprof shows the correct numbers.
< rcurtin> I guess you could also test with the benchmarks/accu.cpp program to see
< rcurtin> I had a lot of fun tuning the accu kernel, but I didn't use any CUDA features that were super advanced (mostly because I am not *that* advanced at CUDA :))
< zoq> Yeah, it's quite a lot of fun :)
< zoq> 0.01% 76.185us 3 25.395us 7.1040us 58.421us cuLaunchKernel
< zoq> that's the result for the bandicoot accu kernel
< zoq> Time(%) Time Calls Avg Min Max Name
< rcurtin> did the output come through right? I only see the header and one row for cuLaunchKernel
< rcurtin> (and they came through out of order, at least here)
< zoq> Yeah, I posted the header afterwards.
< zoq> Now I can't find the result for the other kernel ..
< zoq> 0.01% 15.359us 1 15.359us 15.359us 15.359us cudaLaunchKernel
< zoq> that's the one that uses Cooperative Groups
< rcurtin> nice, although it only ran once, vs. the accu kernel that ran 3 times
< rcurtin> the min for the accu kernel is actually less than cooperative groups, but I wonder if that is just noise
< rcurtin> certainly the average looks way better for the cooperative groups kernel
< zoq> the accu kernel has 3 kernel calls, I think
< zoq> the device function is included
< rcurtin> oh, good point
< rcurtin> I think at least one of those calls is just getting the result element from the GPU
< zoq> I think it's accu, accu_warp_reduce and whatever the name of the device function is
< rcurtin> ah, ok
< zoq> Will integrate the kernel and use the benchmark script.
< rcurtin> yeah, awesome if you can get such a huge speedup in addition to what's already there
< rcurtin> how big is the matrix you are accu()'ing in those numbers?
< zoq> I use the same settings as for the benchmark script.
< zoq> 1000000000
< rcurtin> wow, am I reading that wrong? 1 billion elements in 15 microseconds?
< zoq> The numbers are strange.
< rcurtin> well, if they are *relatively* correct (i.e. the new kernel *is* that much faster than the old one), I guess it doesn't matter, it is still more speedup :)
< zoq> True :)
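For reference, a minimal sketch of the kind of cooperative-groups sum reduction discussed above (this is not the actual bandicoot accu kernel or zoq's test kernel; the kernel name, launch configuration, and the atomicAdd-based final combine are illustrative assumptions, and cg::reduce requires CUDA 11 or newer):

    // accu_cg.cu -- illustrative cooperative-groups reduction, not bandicoot code.
    #include <cooperative_groups.h>
    #include <cooperative_groups/reduce.h>
    #include <cuda_runtime.h>
    #include <cstdio>

    namespace cg = cooperative_groups;

    __global__ void accu_cg(const float* in, float* out, size_t n)
    {
      cg::thread_block block = cg::this_thread_block();
      cg::thread_block_tile<32> warp = cg::tiled_partition<32>(block);

      // Grid-stride loop: each thread accumulates its share of the input.
      float sum = 0.0f;
      for (size_t i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
           i += (size_t) blockDim.x * gridDim.x)
        sum += in[i];

      // Reduce within each warp, then combine the warp results atomically.
      sum = cg::reduce(warp, sum, cg::plus<float>());
      if (warp.thread_rank() == 0)
        atomicAdd(out, sum);
    }

    int main()
    {
      const size_t n = 1 << 20;
      float *in, *out;
      cudaMallocManaged(&in, n * sizeof(float));
      cudaMallocManaged(&out, sizeof(float));
      for (size_t i = 0; i < n; ++i) in[i] = 1.0f;
      *out = 0.0f;

      accu_cg<<<256, 256>>>(in, out, n);
      cudaDeviceSynchronize();
      printf("sum = %f\n", *out);  // expect 1048576

      cudaFree(in);
      cudaFree(out);
      return 0;
    }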
< rcurtin> I think the Julia numbers I showed in the bandicoot issue were incorrect---I had a small bug in the objective computation that meant it did a little less work than it needed to
< rcurtin> (I don't think it's a huge difference though)
< zoq> Good, that's the main reason why I'm looking into the accu kernel :)
< rcurtin> well, I still think there will be a decent speed difference, so some acceleration will still be necessary, I just don't know quite how much :)
< rcurtin> I may still have a bug in my code somewhere---I'm still debugging some strange results
< rcurtin> but it appears that for the runs that I *do* have working, MNIST with 50 epochs with a batch size of 128, Julia takes ~9.1s, bandicoot+CUDA takes ~14.3s
< rcurtin> (both converge to the same accuracy)
< zoq> Didn't know Julia had such good GPU acceleration.
< rcurtin> yeah, my guess is that they are just calling kernels with no overhead. I wouldn't be surprised if, for instance, bandicoot is waiting for results to come back from the GPU before starting the next operation, whereas the kernels could be enqueued even before they finish
< rcurtin> but, I haven't seen it in a profiler---at this point I think only you have :)
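As an aside, the behavior rcurtin guesses at above is just the default CUDA launch semantics: kernel launches return immediately and queue up on the device, and the host only blocks when it explicitly waits (or copies a result back). A small illustrative sketch, not bandicoot code, with a made-up kernel name:

    // async_launch.cu -- illustrative sketch of asynchronous kernel enqueueing.
    #include <cuda_runtime.h>
    #include <cstdio>

    __global__ void scale(float* x, size_t n, float a)
    {
      size_t i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n)
        x[i] *= a;
    }

    int main()
    {
      const size_t n = 1 << 20;
      float* x;
      cudaMalloc(&x, n * sizeof(float));
      cudaMemset(x, 0, n * sizeof(float));

      // All three launches return immediately; the work is queued on the device
      // and the host does not wait for any intermediate result.
      scale<<<(n + 255) / 256, 256>>>(x, n, 2.0f);
      scale<<<(n + 255) / 256, 256>>>(x, n, 2.0f);
      scale<<<(n + 255) / 256, 256>>>(x, n, 2.0f);

      // The host only blocks here, when it explicitly waits for the device
      // (a device-to-host copy of a result would have the same effect).
      cudaDeviceSynchronize();

      printf("done\n");
      cudaFree(x);
      return 0;
    }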
< zoq> The numbers aren't too far off, but would be great to close the gap.
< zoq> if not we can just always compare against TF and PyTorch :)
< rcurtin> :)
< rcurtin> there are still many kernels in bandicoot that are totally unoptimized and could be implemented way better
< rcurtin> so I think it shouldn't be *too* hard to get closer
< abernauer[m]> rcurtin: I will spend some time this weekend making those changes you suggested to the PR. Just a little drained from 9-4 zoom interviews this week :')
< abernauer[m]> My knowledge of R and taking James' stats 385 class paid off today though.
< rcurtin> whew, 9-4! that's a lot
< abernauer[m]> There were breaks today. I had to move an intake form from Excel with macros to another technology, which I had to do in a two and a half hour time block while on a call.
bumpsh[m] has left #mlpack []
ib07 has quit [Quit: https://quassel-irc.org - Chat comfortably. Anywhere.]
ib07 has joined #mlpack
ImQ009 has joined #mlpack
pradkrish has quit [Remote host closed the connection]
pradkrish has joined #mlpack
pradkrish has quit [Remote host closed the connection]
heisenbug has joined #mlpack
heisenbug has left #mlpack []
heisenbug has joined #mlpack
< heisenbug> Hey, anybody active? I wanted to talk about data preprocessing in mlpack...
< heisenbug> Hello...
< zoq> heisenbug: I'm currently busy.
< heisenbug> Oh, it's fine... just ping me when you get free...
heisenbug has quit [Remote host closed the connection]
TurkeyGibby has left #mlpack []
< AakashkaushikGit> Hey, this might just be a silly question, but say we have a huge vector that's not going to change its size while in a for loop. Why don't we store the vector's size in a variable? It might just be a very small optimization, but say we have 20 loops in a file and a very large vector of layers, since models might get big; maybe 150 layers is an okay-ish number. Will this make any difference?
< AakashkaushikGit> I haven't actually read the machine code that the compiler produces, so I am not sure if it caches that vector.size() call in this specific case, but this just came to my mind.
togo has joined #mlpack
gotadachi has quit [Quit: Leaving...]
togo has quit [Ping timeout: 272 seconds]
gotadachi has joined #mlpack
ImQ009 has quit [Quit: Leaving]
< zoq> AakashkaushikGit: vector.size() is already a constant-time lookup; a call to .size() doesn't recalculate the size of the vector, so I don't think this will have an effect, but maybe that wasn't your question?
< zoq> heisenbug: Maybe it's easier if you ask here, and once someone has a chance they can provide an answer :)
< AakashkaushikGit> Hi @zoq, maybe I wasn't able to frame it right. When we call size() on a vector, that is basically something like the end pointer minus the start pointer being returned, right? And for the loop, won't that take more time than assigning it to a temporary variable and using that variable in the loop?
< abernauer[m]> I got the job!!!
< zoq> AakashkaushikGit: Haven't looked into the implementation, but if that's the case it will probably be optimized out.
< zoq> abernauer[m]: Congratulations!
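For what it's worth, the pattern discussed above looks roughly like the sketch below (illustrative only, not mlpack code; the function and variable names are made up). std::vector::size() is an O(1) lookup, and when the loop body cannot change the vector the compiler can usually hoist the call, so the two loops typically compile to the same thing:

    // size_cache.cpp -- illustrative only; not mlpack code.
    #include <cstddef>
    #include <vector>

    double SumLayers(const std::vector<double>& layers)
    {
      double total = 0.0;

      // Calling size() on every iteration.
      for (size_t i = 0; i < layers.size(); ++i)
        total += layers[i];

      // Caching the size in a local variable first.
      const size_t n = layers.size();
      for (size_t i = 0; i < n; ++i)
        total += layers[i];

      return total;
    }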