<shrit[m]>
Naman Jain: You need to link with the armadillo and mlpack libraries.
<shrit[m]>
you need to add `-lmlpack -larmadillo -L/path/to/mlpack/lib/ -L/path/to/armadillo/lib` to your compiler command
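A full compile command might look like the sketch below; the paths are placeholders that depend on where mlpack and Armadillo are installed on your system, and `my_program.cpp` is a hypothetical source file name.

```shell
# Hypothetical compile command; replace the -I/-L paths with the actual
# install locations of mlpack and Armadillo on your machine.
g++ -O2 -std=c++14 my_program.cpp -o my_program \
    -I/path/to/mlpack/include -I/path/to/armadillo/include \
    -L/path/to/mlpack/lib -L/path/to/armadillo/lib \
    -lmlpack -larmadillo
```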
aakashi2001 has joined #mlpack
aakashi2001 has quit [Remote host closed the connection]
aakashi2001 has joined #mlpack
aakashi2001 has quit [Ping timeout: 252 seconds]
Poozi has joined #mlpack
<jonpsy[m]>
rcurtin: I was hoping you'd be able to attend a meet @ 10 PM IST?
<rcurtin[m]>
that's right over lunch for me... I'm not sure if I'll be available
<rcurtin[m]>
some friends are in town so we were going to get lunch, but, I don't know when they are going to be awake ... anyway, I'll try and join... what is the meeting about? 😃
<jonpsy[m]>
so i really loved the way you parallelized code using SIMD
<jonpsy[m]>
i'm trying to do something similar
<jonpsy[m]>
like dividing into 4 blocks, and in each iteration update all four
<rcurtin[m]>
yeah, basically that strategy is just a "trick" to try and get the compiler to emit SIMD instructions
<jonpsy[m]>
yeah, i wanted to know a little about how I'd apply it in my case
<jonpsy[m]>
<rcurtin[m]> "some friends are in town so we w" <- that's fine, we can do it later when you're free?
<rcurtin[m]>
are you sure you need to? If you can express things as Armadillo operations, then it's likely that the Armadillo code will handle this type of thing under the hood
<ABHINAVANAND[m]>
zoq Is there a way to reduce the number of epochs in the group norm gradient test? It will help me in debugging.
<jonpsy[m]>
rcurtin[m]: I've thought about it. The original code used `torch.expand` and a lot of redundant stuff.
<jonpsy[m]>
but we could save all those if we do it smartly in a loop
<rcurtin[m]>
so, my first suggestion would be to determine whether or not this is actually a bottleneck that uses a significant amount of computation time... if not, then it's probably not worthwhile to have the extra code for the SIMD loop
<rcurtin[m]>
second, maybe you can use `.transform()` or something like this? if we *can* push SIMD logic down into Armadillo, we should
<rcurtin[m]>
the key here is the dependencies of each iteration... in the first loop, every iteration depends on the previous iteration's output and thus it can't be parallelized. in the second loop, every iteration is independent and so the compiler can automatically apply SIMD (and I think it typically will)
<rcurtin[m]>
so you only need to do the "crazy" stuff if your loop is like the first one :)
<rcurtin[m]>
sure, I see, and you'll have a bunch of these chunks and a bunch of `w`s and they all need to be multiplied
<jonpsy[m]>
rcurtin[m]: yes, for each chunk there is a unique `w`
<jonpsy[m]>
that `w` multiplied with `chunk` gives us a vec, which we will append to `res`
<rcurtin[m]>
can you express this as a matrix multiplication by reshaping the input array and grouping the `w`s into a matrix?
<rcurtin[m]>
the reason I suggest this is that while you can go to some lengths to get SIMD instructions and make things fast, the BLAS primitives like matrix-matrix multiplication are blindingly fast
<rcurtin[m]>
so if you can structure things such that they can be expressed as a BLAS primitive, that is the most likely route to making things fast
<rcurtin[m]>
(and as a bonus, the code remains clean because we can just do this as a couple Armadillo calls)
<rcurtin[m]>
yeah, can you do it *all* in batch? so, e.g., reshape `Q` such that each chunk corresponds to a single column, and then multiply directly with the `W` matrix?
<rcurtin[m]>
that is maybe not very satisfying because you don't get to play with SIMD instructions, but, if you can do that, the OpenBLAS matrix multiplication code is tuned like crazy and will almost certainly be faster than any other strategy, especially because 1 NxNxN matrix multiplication will tend to be faster than N vector-matrix multiplications of size 1xNxN
<jonpsy[m]>
rcurtin[m]: this could work...
<rcurtin[m]>
👍️ yeah, I like optimizing code at a really low level, but most of the time at least with matrix operations, if you can manage to express the problem as a BLAS primitive then OpenBLAS will blow away anything handwritten
halfy_ has joined #mlpack
<halfy_>
(testing the bridge quickly)
<NamanJain[m]>
<shrit[m]> "you need to add `-lmlpack -larm" <- Thanks, @shrit! It works. I have included only `-lmlpack -larmadillo`
<jonpsy[m]>
say4n: zoq you guys free?
<zoq[m]>
Still 5 minutes right?
<jonpsy[m]>
yep, just confirming if anybody wanted to opt out
<zoq[m]>
we are using meet right?
<jonpsy[m]>
let's use Zoom?
<zoq[m]>
you sent a meet link
<zoq[m]>
So I would stick to that.
<jonpsy[m]>
that was only for reminder xD
<jonpsy[m]>
anyway sure
<shrit[m]>
zoq: I am happy to open a new pull request and change the class name from `DiagonalGaussianDist` to `DiagonalGaussianDistType<>`, and then redefine `DiagonalGaussianDist` as using Armadillo internally. It is true that this policy should be easier to review and will not require modifying half of the codebase. If you prefer this, I will open a new pull request and close the current one.
aakashi2001 has joined #mlpack
aakashi2001 has quit [Ping timeout: 248 seconds]
<zoq[m]>
I just wanted to keep it simple to use (we discussed some of it here https://github.com/mlpack/mlpack/issues/2524). By adding a template parameter, I think instead of making it easier to use, it actually becomes more complicated.
<PranshuSrivastav>
> You are trying to link statically with armadillo, which is dynamic on your machine
<PranshuSrivastav>
Hey, I don't really understand this, could you please explain it to me in simpler terms?
<shrit[m]>
zoq: Agreed, I did not remember that note when I opened the pull request