verne.freenode.net changed the topic of #mlpack to: http://www.mlpack.org/ -- We don't respond instantly... but we will respond. Give it a few minutes. Or hours. -- Channel logs: http://www.mlpack.org/irc/
vivekp has quit [Ping timeout: 245 seconds]
vivekp has joined #mlpack
ImQ009 has joined #mlpack
ImQ009 has quit [Ping timeout: 240 seconds]
ImQ009 has joined #mlpack
manish7294 has joined #mlpack
travis-ci has joined #mlpack
< travis-ci> manish7294/mlpack#22 (lmnn - 7c6c8a4 : Manish): The build has errored.
travis-ci has left #mlpack []
< manish7294> rcurtin: I have a slightly different observation here. I agree that the recalculation of arma::find and arma::unique is inefficient, but I don't think it is the major troublemaker. I have implemented the precalculation as you advised, but it didn't have much effect. I have made quite a number of runs with timers on almost each and every part. Here are the observations: https://pastebin.com/TejDffks
manish7294 has quit [Quit: Page closed]
travis-ci has joined #mlpack
< travis-ci> manish7294/mlpack#23 (lmnn - 65920ee : Manish): The build passed.
travis-ci has left #mlpack []
< jenkins-mlpack> Project docker mlpack nightly build build #344: STILL UNSTABLE in 2 hr 41 min: http://masterblaster.mlpack.org/job/docker%20mlpack%20nightly%20build/344/
wenhao has joined #mlpack
sulan_ has joined #mlpack
witness_ has joined #mlpack
travis-ci has joined #mlpack
< travis-ci> mlpack/mlpack#5041 (mlpack-3.0.2 - 0ace41c : Ryan Curtin): The build passed.
travis-ci has left #mlpack []
< rcurtin> manish7294: right, ok. what were the results of each of those timers?
< rcurtin> note also, in my runs, I used 5k points randomly selected from covertype, not the whole dataset
manish7294 has joined #mlpack
< manish7294> rcurtin: Sorry, I don't have the exact data from the full covertype run, but approximately it was: computing_neighbors - 5 hrs 30 mins, whole Impostors() - 7 hrs 10 mins, the cost and gradient calculation part - 7 hrs 10 mins, total - 13 hrs 30 mins; optimizer - sgd, k = 12, batch size = 500
< manish7294> I also made a run on the first 5k points of the covertype-small dataset and the timings were more or less similar.
< rcurtin> ah, I did it with k=3, let's see if k=12 produces something more similar to yours
< rcurtin> actually, I realize I did it with default k, which is 1
< rcurtin> right, now I see results similar to yours
< rcurtin> but even so, with my timings, 21.6s is spent in Impostors() but only 5s is spent in the computing_neighbors timer
< rcurtin> I know we are on different systems so the optimizations could be somewhat different. also I am using the results before precalculation... so let's update and see what it does
< manish7294> Please try putting a timer over knn train and search.
< manish7294> and a timer over whole impostors
< rcurtin> that should be equivalent to tree_building + computing_neighbors, which was 0.126s + 5.33s, but yes, let me do that
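A minimal sketch of what such separate timers might look like with the mlpack 3.x NeighborSearch API; the "knn_train" and "knn_search" timer names are made up for illustration and are not timers that exist in mlpack:

    #include <mlpack/core.hpp>
    #include <mlpack/methods/neighbor_search/neighbor_search.hpp>

    // Time the reference-tree construction and the search separately, since
    // the built-in 'tree_building' timer only covers the query tree built
    // inside Search().
    void TimedNeighborSearch(const arma::mat& data,
                             const size_t k,
                             arma::Mat<size_t>& neighbors,
                             arma::mat& distances)
    {
      mlpack::Timer::Start("knn_train");
      mlpack::neighbor::KNN knn(data);      // builds the reference tree
      mlpack::Timer::Stop("knn_train");

      mlpack::Timer::Start("knn_search");
      knn.Search(k, neighbors, distances);  // query tree + dual-tree search
      mlpack::Timer::Stop("knn_search");
    }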
< rcurtin> oh, interesting, now that timer is basically the same as the whole impostors timer
< manish7294> Ya, that's what I am hung up on
< rcurtin> hmm. it seems like maybe there is something that is not being timed by NeighborSearch, let me take a quick glance
< manish7294> sure
< rcurtin> ah! I see what it is
< rcurtin> if you look in neighbor_search_impl.hpp, the 'tree_building' timer is started _only_ for the query tree building in the Search() method
< rcurtin> so the building of the reference tree in Train() is not counted
< rcurtin> so then, it seems like a big bottleneck is the continual re-training of the reference tree
< rcurtin> do you remember this idea of using a tree with a MahalanobisDistance metric that I wrote a bit about? it seems like that could be useful here
< rcurtin> either that, or simply reducing the number of times we call Impostors() could be helpful
< manish7294> maybe we can do both
< rcurtin> right, I think that could also be helpful
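A rough sketch of the MahalanobisDistance idea above, assuming the mlpack 3.x metric API; it only shows how the metric could be built from the current transformation L, and wiring it into NeighborSearch (so that the tree on the original data never needs rebuilding) is left out:

    #include <mlpack/core.hpp>
    #include <mlpack/core/metrics/mahalanobis_distance.hpp>

    // Since || L a - L b ||^2 = (a - b)^T (L^T L) (a - b), a single tree built
    // on the untransformed data could be searched with this metric instead of
    // rebuilding a tree on (L * data) after every update.
    mlpack::metric::MahalanobisDistance<> MetricForTransformation(
        const arma::mat& transformation)
    {
      return mlpack::metric::MahalanobisDistance<>(
          transformation.t() * transformation);
    }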
< rcurtin> for acceleration of the rest of EvaluateWithGradient(), I think that by caching the results from Impostors(), we could avoid the calls to metric.Evaluate()
< rcurtin> I suspect that will be a lot of the runtime
< rcurtin> and I think that would be a fairly simple optimization
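A purely illustrative shape of that caching idea (the real Impostors() in the LMNN branch looks different): the neighbor search already produces the distances, so keeping them around lets EvaluateWithGradient() read a cached value instead of calling metric.Evaluate() again for each (point, impostor) pair:

    #include <mlpack/core.hpp>
    #include <mlpack/methods/neighbor_search/neighbor_search.hpp>

    // Hypothetical Impostors()-like helper that hands back the distances the
    // search computed anyway, so the cost/gradient loop can reuse them.
    void ImpostorsWithDistances(const arma::mat& transformedData,
                                const size_t k,
                                arma::Mat<size_t>& impostors,
                                arma::mat& impostorDistances)
    {
      mlpack::neighbor::KNN knn(transformedData);
      knn.Search(k, impostors, impostorDistances);
    }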
< manish7294> for reducing the Impostors() calls, does the earlier bounding idea sound reasonable --- the one mentioned on the PR?
< manish7294> We can try that
< rcurtin> I believe that your idea will work, my only concern is that the overhead of calculating || (M_{t + 1} - M_t) p_i ||^2 for each p_i might cost more than just performing the search
< rcurtin> I guess if we keep oldCoordinates, it is just || p_i(t - 1) - p_i(t) ||^2 (sorry for the confusing notation, I wanted to use both i and t)
< rcurtin> but I do think it could provide some speedup, and maybe would not be too hard to implement. if you want to try and see what happens, I believe it would be worthwhile
< manish7294> correct
< rcurtin> I did not have time yesterday to finish writing my bound, unfortunately
< rcurtin> but I think that the bound I am coming up with could be used in tandem with your idea
< rcurtin> basically I want to calculate || M_{t + 1} - M_t ||^2 (or something like this that does not depend on the points), and if that quantity is small enough we skip the entire Impostors() step
< rcurtin> however, it would definitely be possible to do this check (once I figure out the bound), and if the check fails, then your idea can be done to reduce the number of points we need to find neighbors for
< rcurtin> so both could work in tandem
< rcurtin> do you think that would sound reasonable?
< manish7294> Right, I got that feeling from your idea but couldn't see beyond that
< manish7294> ya sure
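A rough sketch of how the two checks might work in tandem; 'slack', 'oldTransformation', and 'oldCoordinates' are illustrative names, and the actual bound (what 'slack' should be) is exactly the part that still has to be worked out:

    #include <mlpack/core.hpp>

    #include <utility>
    #include <vector>

    // Return the indices of points whose impostors should be recomputed; an
    // empty result means the cached impostors can be reused as-is.
    std::vector<size_t> PointsNeedingRecomputation(
        const arma::mat& dataset,            // points as columns
        const arma::mat& transformation,     // current L
        const arma::mat& oldTransformation,  // L from the last Impostors() call
        arma::mat& oldCoordinates,           // cached L * dataset
        const double slack)
    {
      std::vector<size_t> recompute;

      // Check 1: if the transformation barely moved, skip Impostors() entirely.
      if (arma::norm(transformation - oldTransformation, "fro") < slack)
        return recompute;

      // Check 2: otherwise, only re-search for points whose transformed
      // coordinates moved by more than the slack.
      arma::mat newCoordinates = transformation * dataset;
      for (size_t i = 0; i < dataset.n_cols; ++i)
      {
        if (arma::norm(newCoordinates.col(i) - oldCoordinates.col(i), 2) >= slack)
          recompute.push_back(i);
      }

      oldCoordinates = std::move(newCoordinates);
      return recompute;
    }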
< manish7294> and did you find anything about the divergence?
< rcurtin> not yet, I don't have any solution there. we could try adding a regularization parameter, but I haven't seen any issues with the implementation itself
< rcurtin> I'd like to take a look at the MATLAB solution and see how they solved the problem (or avoided it)
< rcurtin> I see, they have an adaptive learning rate:
< rcurtin> % Update learning rate
< rcurtin> if prev_C > C
< rcurtin> eta = eta * 1.01;
< rcurtin> else
< rcurtin> eta = eta * .5;
< rcurtin> end
< manish7294> Here sgd fails
< rcurtin> so it's a little like momentum... a little. if the step gives an improvement, increase step size by a factor of 1.01; if the step makes things bad, halve the step size
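The same rule in C++ form, just as an illustration; 'stepSize', 'objective', and 'prevObjective' are made-up names, not part of mlpack's optimizer API:

    // If the last step improved the objective, grow the step size slightly;
    // otherwise cut it in half.
    double AdaptStepSize(const double stepSize,
                         const double objective,
                         const double prevObjective)
    {
      return (objective < prevObjective) ? stepSize * 1.01 : stepSize * 0.5;
    }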
< rcurtin> I wonder if some of the cool optimizers Marcus has implemented could be used in place of SGD
< rcurtin> there are a couple in there with adaptive learning rates if I remember right
< manish7294> zoq: Please help us out with the optimizers that have an adaptive learning rate.
< manish7294> as you said earlier, we can try a penalty term too.
< rcurtin> I think BigBatchSGD could be one way to try
< rcurtin> agreed, I think a penalty term is no problem
< rcurtin> if you'd like to try that, go ahead---basically just add || L ||^2 to the objective
< rcurtin> that or b * || L ||^2 for some user-specified parameter b, then you could play around with it and see if you find a good value
< manish7294> I think the gradient will also need a corresponding change
< rcurtin> right, that will be just + b*L
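A small sketch of wiring the penalty into an existing objective/gradient pair; note that the exact gradient of b * ||L||^2 is 2 * b * L, with the factor of 2 absorbable into b. The function and parameter names are illustrative, not the actual LMNN code:

    #include <mlpack/core.hpp>

    // Add the Frobenius-norm penalty b * ||L||^2 to the objective and the
    // corresponding term to the gradient.
    void AddRegularization(const arma::mat& transformation,  // L
                           const double b,                   // penalty weight
                           double& objective,
                           arma::mat& gradient)
    {
      const double frobNorm = arma::norm(transformation, "fro");
      objective += b * frobNorm * frobNorm;
      gradient += 2.0 * b * transformation;
    }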
< manish7294> Okay, I will try these ideas.
< rcurtin> yeah, maybe it will help, let me know how it turns out :)
< manish7294> In one word - awesome :)
< manish7294> great results without any divergence at step size 50
< manish7294> on iris
< manish7294> which was earlier diverging even at 0.1
< rcurtin> ah, that is with the regularization penalty?
< manish7294> yes
< manish7294> tried with vc2 too
< rcurtin> great, that is nice to hear. I expect that with some larger datasets there may need to be some tuning of how much regularization to apply
< rcurtin> hmm, I just realized, another thing we could do is not apply a regularization but force a normalization
< rcurtin> it gives us identical KNN results if we use some transformation matrix L, or if we use 2*L
< rcurtin> thus, it would be reasonable to, after each step, simply normalize the matrix L so that ||L||^2 = 1
< rcurtin> but I think the regularization you are doing now is just fine, it accomplishes roughly the same thing
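A sketch of that normalization step, assuming it would be applied to the transformation matrix after each optimizer update:

    #include <mlpack/core.hpp>

    // Rescaling L does not change the induced k-NN ordering, so force
    // ||L||_F = 1 after each step to keep it from blowing up.
    void NormalizeTransformation(arma::mat& transformation)
    {
      const double frobNorm = arma::norm(transformation, "fro");
      if (frobNorm > 0.0)
        transformation /= frobNorm;
    }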
< manish7294> that sounds promising
< manish7294> will try that too
< rcurtin> the MATLAB implementation does not seem to normalize the learned matrix
< rcurtin> but like I said, I think it is a minor issue. the penalty should be another completely sufficient way of doing it, and the runtime cost of computing either b*|| L ||^2 or b*L should be pretty insignificant
< rcurtin> (as compared to the rest of the calculation)
< manish7294> This may sound stupid, but I made a deadly mistake: I didn't notice the lbfgs flag was set to true this whole time in the above runs, and it got our hopes up. Sorry for that, I should be more careful :(
< rcurtin> no worries, but it sounds like the regularization works for L-BFGS
< rcurtin> I would expect that the amount of penalty to apply (the 'b') would need to be tuned
< manish7294> no, I just made the change in the batch EvaluateWithGradient
< manish7294> moreover lbfgs doesn't suffer from this problem
killer_bee[m] has quit [Ping timeout: 256 seconds]
prakhar_code[m] has quit [Ping timeout: 260 seconds]
< manish7294> maybe I should try normalization.
vivekp has quit [Ping timeout: 264 seconds]
manish7294 has quit [Quit: Page closed]
manish7294 has joined #mlpack
sulan_ has quit [Read error: Connection reset by peer]
sulan_ has joined #mlpack
< manish7294> rcurtin: Hopefully this time I didn't make any mistake: normalizing the coordinates matrix by dividing it by its norm helped avoid the divergence. But it raises the issue of how to update the coordinates; currently I made a change in sgd itself for testing purposes, and I think the optimization time has also increased a bit.
manish7294 has quit [Quit: Page closed]
sulan_ has quit [Quit: Leaving]
prakhar_code[m] has joined #mlpack
ImQ009 has quit [Quit: Leaving]
witness_ has quit [Quit: Connection closed for inactivity]
killer_bee[m] has joined #mlpack