verne.freenode.net changed the topic of #mlpack to: http://www.mlpack.org/ -- We don't respond instantly... but we will respond. Give it a few minutes. Or hours. -- Channel logs: http://www.mlpack.org/irc/
vivekp has quit [Ping timeout: 245 seconds]
vivekp has joined #mlpack
ImQ009 has joined #mlpack
ImQ009 has quit [Ping timeout: 240 seconds]
ImQ009 has joined #mlpack
manish7294 has joined #mlpack
travis-ci has joined #mlpack
< travis-ci>
manish7294/mlpack#22 (lmnn - 7c6c8a4 : Manish): The build has errored.
< manish7294>
rcurtin: I have a slightly different observation here. I agree that the recalculation of arma::find and arma::unique is inefficient, but I don't think it is the major troublemaker. I implemented the precalculation as you advised, but it didn't have much effect. I have made quite a number of runs with timers on almost every part. Here are the observations: https://pastebin.com/TejDffks
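As a reference point, a minimal sketch of the precalculation being discussed (the function and variable names here are illustrative, not the actual LMNN code): compute arma::unique() and arma::find() once up front and reuse the results, rather than recomputing them inside every call.

    #include <armadillo>
    #include <vector>

    // Precompute the unique labels and the per-class index lists once, so that
    // Impostors()/EvaluateWithGradient() can reuse them instead of calling
    // arma::unique() and arma::find() on every invocation.
    void PrecomputeClassIndices(const arma::Row<size_t>& labels,
                                arma::Row<size_t>& uniqueLabels,
                                std::vector<arma::uvec>& indexLists)
    {
      uniqueLabels = arma::unique(labels);
      indexLists.clear();
      for (size_t i = 0; i < uniqueLabels.n_elem; ++i)
        indexLists.push_back(arma::find(labels == uniqueLabels[i]));
    }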
< rcurtin>
manish7294: right, ok. what were the results of each of those timers?
< rcurtin>
note also, in my runs, I used 5k points randomly selected from covertype, not the whole dataset
manish7294 has joined #mlpack
< manish7294>
rcurtin: Sorry, I don't have the exact data I got on the full covertype, but approximately it was: computing_neighbors - 5 hrs 30 mins, whole Impostors() - 7 hrs 10 mins, the cost and gradient calculation part - 7 hrs 10 mins, total - 13 hrs 30 mins; optimizer - SGD, k = 12, batch size = 500
< manish7294>
I also made a run on the first 5k points of the covertype-small dataset and the timings were more or less similar.
< rcurtin>
ah, I did it with k=3, let's see if k=12 produces something more similar to yours
< rcurtin>
actually, I realize I did it with default k, which is 1
< rcurtin>
right, now I see results similar to yours
< rcurtin>
but even so, with my timings, 21.6s is spent in Impostors() but only 5s is spent in the computing_neighbors timer
< rcurtin>
I know we are on different systems so the optimizations could be somewhat different. also I am using the results before precalculation... so let's update and see what it does
< manish7294>
Please try putting a timer over the kNN Train() and Search() calls.
< manish7294>
and a timer over the whole Impostors() call
< rcurtin>
that should be equivalent to tree_building + computing_neighbors, which was 0.126s + 5.33s, but yes, let me do that
< rcurtin>
oh, interesting, now that timer is basically the same as the whole impostors timer
< manish7294>
Ya, that's what I am hung up on
< rcurtin>
hmm. it seems like maybe there is something that is not being timed by NeighborSearch, let me take a quick glance
< manish7294>
sure
< rcurtin>
ah! I see what it is
< rcurtin>
if you look in neighbor_search_impl.hpp, the 'tree_building' timer is started _only_ for the query tree building in the Search() method
< rcurtin>
so the building of the reference tree in Train() is not counted
< rcurtin>
so then, it seems like a big bottleneck is the continual re-training of the reference tree
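To make that visible, a quick sketch of timing the reference tree build separately from the search; this assumes mlpack's Timer interface and the standard NeighborSearch overloads, and the timer names here are made up for the illustration.

    #include <mlpack/core.hpp>
    #include <mlpack/methods/neighbor_search/neighbor_search.hpp>

    using namespace mlpack;
    using namespace mlpack::neighbor;

    // Euclidean kNN on the L-transformed data, as Impostors() effectively does.
    typedef NeighborSearch<NearestNeighborSort, metric::EuclideanDistance> KnnType;

    void TimedImpostorSearch(const arma::mat& transformedData, const size_t k,
                             arma::Mat<size_t>& neighbors, arma::mat& distances)
    {
      Timer::Start("knn_train");
      KnnType knn(transformedData);  // Builds the reference tree: this is the cost
                                     // that 'tree_building' was not counting and
                                     // that is re-paid on every Impostors() call.
      Timer::Stop("knn_train");

      Timer::Start("knn_search");
      knn.Search(k, neighbors, distances);
      Timer::Stop("knn_search");
    }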
< rcurtin>
do you remember this idea of using a tree with a MahalanobisDistance metric that I wrote a bit about? it seems like that could be useful here
< rcurtin>
either that, or simply reducing the number of times we call Impostors() could be helpful
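A rough sketch of the MahalanobisDistance idea mentioned above: build one tree on the original data and let the metric carry M = L^T L, so only the metric needs updating instead of re-transforming the data and rebuilding a Euclidean tree each time. The NeighborSearch constructor overload used to pass the metric instance is an assumption here, and a ball tree is used because the default kd-tree bound assumes an L_p metric.

    #include <mlpack/core.hpp>
    #include <mlpack/core/metrics/mahalanobis_distance.hpp>
    #include <mlpack/methods/neighbor_search/neighbor_search.hpp>

    using namespace mlpack;
    using namespace mlpack::metric;
    using namespace mlpack::neighbor;

    typedef NeighborSearch<NearestNeighborSort, MahalanobisDistance<>, arma::mat,
                           tree::BallTree> MahalanobisKNN;

    void SearchWithMetric(const arma::mat& dataset, const arma::mat& l,
                          const size_t k, arma::Mat<size_t>& neighbors,
                          arma::mat& distances)
    {
      MahalanobisDistance<> metric;
      metric.Covariance() = l.t() * l;  // M = L^T L, so d(a, b) = || L a - L b ||.

      MahalanobisKNN knn(dataset, DUAL_TREE_MODE, 0.0, metric);
      knn.Search(k, neighbors, distances);
    }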
< manish7294>
maybe we can do both
< rcurtin>
right, I think that could also be helpful
< rcurtin>
for acceleration of the rest of EvaluateWithGradient(), I think that by caching the results from Impostors(), we could avoid the calls to metric.Evaluate()
< rcurtin>
I suspect that will be a lot of the runtime
< rcurtin>
and I think that would be a fairly simple optimization
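A minimal illustration of the caching idea (the names are made up, not the actual LMNN code): the kNN search inside Impostors() already produces the pairwise distances, so storing them next to the indices lets the cost/gradient loop look them up instead of calling metric.Evaluate() again.

    #include <armadillo>

    // The search that finds the impostors also computes their distances, so keep
    // both around.
    struct ImpostorCache
    {
      arma::Mat<size_t> indices;  // indices(j, i): j-th impostor of point i.
      arma::mat distances;        // distances(j, i): || L x_i - L x_{indices(j, i)} ||.
    };

    // Reuse inside EvaluateWithGradient(): a lookup instead of a recomputation
    // via metric.Evaluate() on the transformed points.
    inline double CachedImpostorDistance(const ImpostorCache& cache,
                                         const size_t i, const size_t j)
    {
      return cache.distances(j, i);
    }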
< manish7294>
for reducing the Impostors() calls, does the earlier bounding idea sound reasonable --- the one mentioned on the PR?
< manish7294>
We can try that
< rcurtin>
I believe that your idea will work, my only concern is that the overhead of calculating || (M_{t + 1} - M_t) p_i ||^2 for each p_i might cost more than just performing the search
< rcurtin>
I guess if we keep oldCoordinates, it is just || p_i(t - 1) - p_i(t) ||^2 (sorry for the confusing notation, I wanted to use both i and t)
< rcurtin>
but I do think it could provide some speedup, and maybe would not be too hard to implement. if you want to try and see what happens, I believe it would be worthwhile
< manish7294>
correct
< rcurtin>
I did not have time yesterday to finish writing my bound, unfortunately
< rcurtin>
but I think that the bound I am coming up with could be used in tandem with your idea
< rcurtin>
basically I want to calculate || M_{t + 1} - M_t ||^2 (or something like this that does not depend on the points), and if that quantity is small enough we skip the entire Impostors() step
< rcurtin>
however, it would definitely be possible to do this check (once I figure out the bound), and if the check fails, then your idea can be done to reduce the number of points we need to find neighbors for
< rcurtin>
so both could work in tandem
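A hedged sketch of how the two checks could fit together; the actual bound was still being worked out at this point, so 'globalTolerance' and 'pointTolerance' below are placeholder thresholds rather than derived quantities.

    #include <armadillo>

    // Global check: if the transformation barely changed between iterations, skip
    // the whole Impostors() recomputation.
    bool SkipImpostorsEntirely(const arma::mat& oldL, const arma::mat& newL,
                               const double globalTolerance)
    {
      return arma::norm(newL - oldL, "fro") < globalTolerance;
    }

    // Per-point check: only re-run neighbor search for points whose transformed
    // coordinates moved enough that their impostors may have changed, i.e. where
    // || p_i(t) - p_i(t - 1) || is large.
    arma::uvec PointsToRecompute(const arma::mat& oldCoordinates,  // L_{t-1} * X
                                 const arma::mat& newCoordinates,  // L_t * X
                                 const double pointTolerance)
    {
      const arma::rowvec movement = arma::sqrt(
          arma::sum(arma::square(newCoordinates - oldCoordinates), 0));
      return arma::find(movement > pointTolerance);
    }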
< rcurtin>
do you think that would sound reasonable?
< manish7294>
Right, I got that feeling from your idea but couldn't see beyond that
< manish7294>
ya sure
< manish7294>
and did you find anything about the divergence?
< rcurtin>
not yet, I don't have any solution there. we could try adding a regularization parameter, but I haven't seen any issues with the implementation itself
< rcurtin>
I'd like to take a look at the MATLAB solution and see how they solved the problem (or avoided it)
< rcurtin>
I see, they have an adaptive learning rate:
< rcurtin>
% Update learning rate
if prev_C > C
  eta = eta * 1.01;
else
  eta = eta * .5;
end
< manish7294>
This is where SGD fails (it has no such adaptive learning rate)
< rcurtin>
so it's a little like momentum... a little. if the step gives an improvement, increase step size by a factor of 1.01; if the step makes things bad, halve the step size
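For reference, the same rule transcribed into C++ (the variable names are placeholders):

    // Grow the step size slightly after an improving step; halve it after a
    // worsening one, mirroring the MATLAB snippet above.
    double UpdateStepSize(const double eta, const double prevCost,
                          const double cost)
    {
      if (prevCost > cost)  // The last step improved the objective.
        return eta * 1.01;
      else                  // The last step made things worse.
        return eta * 0.5;
    }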
< rcurtin>
I wonder if some of the cool optimizers Marcus has implemented could be used in place of SGD
< rcurtin>
there are a couple in there with adaptive learning rates if I remember right
< manish7294>
zoq: Please help us out with the optimizers that have adaptive learning rates.
< manish7294>
as you said earlier, we can try a penalty term too.
< rcurtin>
I think BigBatchSGD could be one way to try
< rcurtin>
agreed, I think a penalty term is no problem
< rcurtin>
if you'd like to try that, go ahead---basically just add || L ||^2 to the objective
< rcurtin>
that or b * || L ||^2 for some user-specified parameter b, then you could play around with it and see if you find a good value
< manish7294>
I think the gradient will also need some change
< rcurtin>
right, that will be just + b*L
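A small illustrative sketch of the penalized objective and gradient (not the actual LMNNFunction code); it uses the (b / 2) * ||L||_F^2 convention so that the gradient contribution is exactly b * L (with b * ||L||_F^2 it would be 2 * b * L, and the factor of 2 can simply be folded into b).

    #include <armadillo>

    // 'unregularizedCost' and 'gradient' are assumed to already hold the plain
    // LMNN cost and gradient for this batch; only the penalty is added here.
    double RegularizedEvaluateWithGradient(const arma::mat& coordinates,  // L
                                           const double unregularizedCost,
                                           arma::mat& gradient,
                                           const double b)
    {
      gradient += b * coordinates;  // d/dL of (b / 2) * || L ||_F^2.
      const double penalty = 0.5 * b * arma::accu(arma::square(coordinates));
      return unregularizedCost + penalty;
    }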
< manish7294>
Okay, I will try these ideas.
< rcurtin>
yeah, maybe it will help, let me know how it turns out :)
< manish7294>
In one word - awesome :)
< manish7294>
great results without any divergence at step size 50
< manish7294>
on iris
< manish7294>
which was earlier diverging at even 0.1
< rcurtin>
ah, that is with the regularization penalty?
< manish7294>
yes
< manish7294>
tried with vc2 too
< rcurtin>
great, that is nice to hear. I expect that with some larger datasets, there may need to be some tuning of how much regularization to apply
< rcurtin>
hmm, I just realized, another thing we could do is not apply a regularization but force a normalization
< rcurtin>
it gives us identical KNN results if we use some transformation matrix L, or if we use 2*L
< rcurtin>
thus, it would be reasonable to, after each step, simply normalize the matrix L so that ||L||^2 = 1
< rcurtin>
but I think the regularization you are doing now is just fine, it accomplishes roughly the same thing
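A minimal sketch of that normalization, assuming L is stored as the optimizer's coordinates matrix: rescale it after each step so that ||L||_F = 1; since kNN results are invariant to scaling L by a constant, this does not change which neighbors are found.

    #include <armadillo>

    // Rescale the transformation so its Frobenius norm is 1; the kNN results it
    // induces are unchanged, but the matrix can no longer grow without bound.
    void NormalizeTransformation(arma::mat& coordinates)  // L, modified in place.
    {
      const double norm = arma::norm(coordinates, "fro");
      if (norm > 0.0)
        coordinates /= norm;
    }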
< manish7294>
that sounds promising
< manish7294>
will try that too
< rcurtin>
the MATLAB implementation does not seem to normalize the learned matrix
< rcurtin>
but like I said, I think it is a minor issue. the penalty should be another completely sufficient way of doing it, and the runtime cost of computing either b*|| L ||^2 or b*L should be pretty insignificant
< rcurtin>
(as compared to the rest of the calculation)
< manish7294>
This may sound stupid, but I made a deadly mistake: I didn't notice that the lbfgs flag was set to true this whole time in the above runs, which got our hopes high. Sorry for that, I should be more careful :(
< rcurtin>
no worries, but it sounds like the regularization works for L-BFGS
< rcurtin>
I would expect that the amount of penalty to apply (the 'b') would need to be tuned
< manish7294>
no, I just made the change in the batch EvaluateWithGradient()
< manish7294>
moreover lbfgs doesn't suffer from this problem
killer_bee[m] has quit [Ping timeout: 256 seconds]
prakhar_code[m] has quit [Ping timeout: 260 seconds]
< manish7294>
maybe I should try normalization.
vivekp has quit [Ping timeout: 264 seconds]
manish7294 has quit [Quit: Page closed]
manish7294 has joined #mlpack
sulan_ has quit [Read error: Connection reset by peer]
sulan_ has joined #mlpack
< manish7294>
rcurtin: Hopefully this time I didn't make any mistake: normalizing the coordinates matrix by dividing it by its norm helped avoid the divergence. But it raises the issue of how to update the coordinates; currently I made a change in SGD itself for testing purposes, and I think the optimization time has also increased a bit.
manish7294 has quit [Quit: Page closed]
sulan_ has quit [Quit: Leaving]
prakhar_code[m] has joined #mlpack
ImQ009 has quit [Quit: Leaving]
witness_ has quit [Quit: Connection closed for inactivity]