#mlpack on 2014-07-07 — irc logs at libera.irclog.whitequark.org

2014-05-21 16:24 naywhayare changed the topic of #mlpack to: http://www.mlpack.org/ -- We don't respond instantly... but we will respond. Give it a few minutes. Or hours. -- Channel logs: http://www.mlpack.org/irc/

04:11 Anand has joined #mlpack

04:19 Anand has quit [Quit: Page closed]

05:35 witness___ has joined #mlpack

05:42 govg has quit [Read error: Connection reset by peer]

07:54 witness___ has quit [Quit: Connection closed for inactivity]

08:03 Anand has joined #mlpack

08:22 Anand has quit [Ping timeout: 246 seconds]

09:12 sumedh_ has joined #mlpack

09:55 Gys has joined #mlpack

09:57 < Gys> Hello

09:57 < Gys> Can someone help me with the k-means?

09:59 < sumedh_> Gys: hello Gys, what problem are you facing??

10:01 < Gys> hi sumedh_ , thanks for replying

10:03 < Gys> I would like to know if it's possible to do a "weighted k-means"

10:05 < Gys> I mean, is it a way to calculate clusters by considering a weight for each point?

10:09 < Gys> I thought to simulate this weight by setting several points at the same location. But I would be sure that all points from a same location, which define a weighted point, won't be split into 2 or more clusters...

10:11 govg has joined #mlpack

10:15 < sumedh_> Gys: I think the simulation idea would work... points at same location cannot be split into 2 different clusters... As I haven't implemented K-means in MLPACK... naywhayare will be able to help you out...

10:16 < sumedh_> you can test the simulation by creating a small test case...

10:16 < sumedh_> and check if the weighted point is getting split or not...

10:17 < Gys> Yes, I will. Thanks for your reply!

10:18 govg has quit [Ping timeout: 240 seconds]

10:24 govg has joined #mlpack

10:24 govg has quit [Changing host]

10:24 govg has joined #mlpack

10:35 Anand has joined #mlpack

10:36 < Anand> Marcus : It seems that a negative index got passed in the mse unit test which caused the build failure. I have taken care of that case now. It should not give an error. If it still does, then the problem is somewhere else.

10:37 < Anand> Also, let me know your views about weka logistic predictions.

10:37 < Anand> I am starting with linear regression soon.

10:42 govg has quit [Ping timeout: 240 seconds]

10:45 < naywhayare> Gys: unfortunately, mlpack doesn't implement weighted k-means, but it would probably be easy to modify the code to do so

10:46 Anand has quit [Ping timeout: 246 seconds]

10:49 < Gys> hi naywhayare, I'm diving into the src code :)

10:49 < naywhayare> :)

10:50 < naywhayare> what's implemented there is the standard k-means algorithm, so it should be pretty easy to deal with. if the FastCluster() function is still there (I think I removed it) you can just ignore it; the Cluster() function is the one of interest

10:51 < Gys> ok thanks

11:12 govg has joined #mlpack

11:28 < jenkins-mlpack> Starting build #1994 for job mlpack - svn checkin test (previous build: SUCCESS)

11:49 govg has quit [Read error: Connection reset by peer]

11:54 govg has joined #mlpack

11:54 govg has quit [Changing host]

11:54 govg has joined #mlpack

12:28 oldbeardo has joined #mlpack

12:29 < oldbeardo> naywhayare: L_BFGS works fine for Reg SVD

12:31 oldbeardo has quit [Changing host]

12:31 oldbeardo has joined #mlpack

12:33 < naywhayare> oldbeardo: ok, cool. does it run much faster than the mlpack implementation of SGD?

12:33 < naywhayare> also, I committed a CMake fix in http://www.mlpack.org/trac/changeset/16767 to bump the minimum version of Boost to 1.49

12:33 < naywhayare> I think you forgot that little bit

12:34 < naywhayare> very simple change :)

12:34 < oldbeardo> naywhayare: it runs quite a bit faster

12:35 < oldbeardo> I didn't mention 1.49 because the build doesn't work

12:35 < oldbeardo> on my system that is

12:35 < naywhayare> oh? can you explain more (about the build not working)?

12:36 < oldbeardo> well, it says Boost 1.49 package not found

12:36 < naywhayare> not "Detected version of Boost is too old. Requested version was 1.49 (or newer)." ?

12:37 < oldbeardo> no, I installed 1.49

12:38 < naywhayare> so it finds the package without the change I made (adding 1.49 to find_package(Boost ...)), but when you add 1.49 to the find_package call, it then fails?

12:38 < oldbeardo> yes

12:39 < naywhayare> where did you install boost?

12:39 < naywhayare> or, how? through the package manager?

12:39 < oldbeardo> usr/local/lib

12:39 < naywhayare> so, what happens if you call 'cmake -DBOOST_ROOT=/usr/local/ ../' ?

12:40 < oldbeardo> well I didn't try that, simply removed it

12:40 < naywhayare> ok, can you go ahead and try that? maybe cmake is not noticing the newer version of boost in /usr/local/ and only finding an older version in /usr, and then giving up

12:41 < naywhayare> we definitely need the '1.49' there, though, since the cosine tree code won't compile without boost.heap

12:42 < oldbeardo> I will try it tomorrow, I'm working on Reg SVD for the time being

12:43 andrewmw94 has joined #mlpack

12:44 < sumedh_> naywhayare: can we talk about that incremental learning issue right now??

12:44 < naywhayare> sumedh_: can it wait please? I have not had a chance to look into it yet, and am busy with a few other things... the day has just started for me...

12:44 < naywhayare> maybe a couple of hours?

12:45 < sumedh_> sure... no problem :)

12:45 < naywhayare> ok, thanks

12:46 < naywhayare> oldbeardo: so you said that regularized svd with L-BFGS runs much more quickly than the mlpack SGD implementation... but how does it compare to the SGD implementation you wrote by hand?

12:48 < oldbeardo> naywhayare: nice catch, I forgot to compare those two, I will do it today, also what about the unit tests?

12:48 < jenkins-mlpack> Project mlpack - svn checkin test build #1994: SUCCESS in 1 hr 19 min: http://big.cc.gt.atl.ga.us:8080/job/mlpack%20-%20svn%20checkin%20test/1994/

12:48 < jenkins-mlpack> Ryan Curtin: Bump minimum required Boost version to 1.49 for boost.heap.

12:48 < jenkins-mlpack> Starting build #1995 for job mlpack - svn checkin test (previous build: SUCCESS)

12:49 < naywhayare> oldbeardo: so for unit tests, what we ought to do is have a few tests that make sure Evaluate() and Gradient() are working correctly

12:49 < naywhayare> usually these are easiest to write with simple synthetic dataset

12:49 < naywhayare> datasets*

12:49 < naywhayare> and then you can calculate the result by hand and hard-code it into the test (just double-check your calculation first ;))

12:50 < naywhayare> then we should also make sure that we can actually perform regularized SVD using some mlpack optimizers on some simple datasets that won't take too long to run

12:50 < oldbeardo> okay, should I remove the Evaluate() and Gradient() functions written for SGD?

12:51 < naywhayare> let's leave them there for now, and maybe we can think of a good way to make them faster

12:52 < naywhayare> representing the gradient as an arma::sp_mat object might be one direction that could provide a solution, but that would require a decent amount of refactoring on the part of the optimizer, and it would need to take some template parameters and so forth...

12:53 < oldbeardo> right, I will leave it as it is for now

12:53 < naywhayare> also, you haven't posted a status update to the blog or mailing list in a while, can you please be sure to do that each week in the future?

12:55 < oldbeardo> sorry about that, I just posted it around half an hour back

12:58 < naywhayare> oh, ok... I think I looked 45 minutes ago, I should have waited a moment, I guess

12:58 < naywhayare> thank you for posting that

13:04 oldbeardo has quit [Quit: Page closed]

13:06 Anand has joined #mlpack

13:15 udit has joined #mlpack

13:23 udit has quit [Quit: leaving]

13:31 Anand has quit [Ping timeout: 246 seconds]

13:34 Anand has joined #mlpack

13:53 Anand has quit [Ping timeout: 246 seconds]

14:07 < jenkins-mlpack> Project mlpack - svn checkin test build #1995: SUCCESS in 1 hr 18 min: http://big.cc.gt.atl.ga.us:8080/job/mlpack%20-%20svn%20checkin%20test/1995/

14:07 < jenkins-mlpack> Ryan Curtin: Split RAQueryStat into its own class.

14:07 < jenkins-mlpack> Starting build #1996 for job mlpack - svn checkin test (previous build: SUCCESS)

14:09 < andrewmw94> naywhayare: do you have any data on how important it is to have the exact value of furthestDescendantDistance?

14:09 < andrewmw94> because with deletion of points, it could get ugly

14:09 < andrewmw94> should I just use 0.5 * bound.Diam()

14:12 < naywhayare> so, technically, furthestDescendantDistance just needs to be greater than or equal to the true furthest descendant distance

14:12 Anand has joined #mlpack

14:12 < naywhayare> you are right, deleting points could make the calculation really ugly...

14:12 < jenkins-mlpack> Project mlpack - svn checkin test build #1996: FAILURE in 5 min 50 sec: http://big.cc.gt.atl.ga.us:8080/job/mlpack%20-%20svn%20checkin%20test/1996/

14:12 < jenkins-mlpack> Ryan Curtin: Refactor RASearch so that it does not accept a leafSize parameter and can build

14:12 < jenkins-mlpack> arbitrary tree types.

14:13 < naywhayare> you could just use 0.5 * bound.Diam(), unless you have a cleverer idea

14:13 < naywhayare> I'm pretty sure that's what the BinarySpaceTree does

14:14 < jenkins-mlpack> Starting build #1997 for job mlpack - svn checkin test (previous build: FAILURE -- last SUCCESS #1995 1 hr 26 min ago)

14:14 < andrewmw94> yeah, I misread the comment. That's what it does too.

14:15 < naywhayare> the cover tree actually calculates the furthest descendant distance, but it only does it because the construction algorithm happens to calculate that anyway, and the node just caches the information

14:17 < andrewmw94> yeah. Should I change the comment on the variable in the Binary space tree

14:18 < andrewmw94> well, actually I guess it is cached, it just caches the 0.5 * diam rather than the actual distance

14:18 < andrewmw94> so I should probably leave it

14:19 < andrewmw94> but do you mind if I clarify what is cached?

14:19 Anand has quit [Ping timeout: 246 seconds]

14:22 < naywhayare> yeah, go for it

14:23 < naywhayare> you don't need to ask before clarifying comments, presuming you don't have doubts that what you wrote isn't correct :)

14:23 < andrewmw94> "don't doubts isn't correct." Let's see. Yes, that is true.

14:24 < andrewmw94> :)

14:24 < naywhayare> yeah, the double negative was a little bit odd, but if I wrote "presuming you have no doubts that what you wrote is correct", that would imply a level of certainty that's higher than necessary

14:25 < naywhayare> but maybe the double negative ends up having the same meaning? I'm not sure. I can debug template errors but not English errors

14:26 < andrewmw94> yeah. I find it amusing how programmers use English in a different way than most people. I frequently nest parentheses for example.

14:26 < andrewmw94> and or is always taken in the inclusive sense

14:28 < naywhayare> nested parentheses are incredibly useful for clarifying potentially ambiguous clauses (like nested boolean operators)

14:33 < andrewmw94> yeah, but english teachers don't seem to like them. I have the traversal working, at least, the output matches that of the other traversers for the first few lines

14:33 < naywhayare> do you mean the output of the algorithm?

14:33 < andrewmw94> yes, the neighbors.csv file

14:34 < andrewmw94> I'll write a test case for it to automate it

14:34 < naywhayare> ok. if it's not the same for some lines, I would try to debug with the smallest dataset you can reproduce the error with

14:34 < naywhayare> so the allknn_test.cpp file has a dataset used for the ExhaustiveSyntheticTest that may be useful here

14:34 < naywhayare> it's 13 points, I think

14:34 < naywhayare> so you can trace the entire traversal through the 13-point tree and make sure no invalid prunes are happening

14:35 < andrewmw94> yeah. I meant it's the same for all of the lines I checked, which happened to be the first few

14:35 < naywhayare> ah, ok

14:35 < naywhayare> you can just diff neighbors_rtree.csv neighbors_kdtree.csv, that's what I usually do

14:35 < naywhayare> and distances_rtree.csv distances_kdtree.csv

14:36 < naywhayare> if you end up having to debug the traversal, which you probably will sooner or later, ideas include commenting out all of the 'return DBL_MAX' in the rules class, so that nothing is pruned

14:36 < naywhayare> and if it still gives the wrong result with nothing pruned, the traversal must not be reaching all of the nodes, or must be reaching some nodes twice, or something like that

14:36 < andrewmw94> yeah. Diff says they're the same

14:37 < naywhayare> cool; I'd test for a handful more datasets (the larger the better... my usual arsenal includes corel.csv, covertype.csv, and LCDM_q.csv/LCDM_r.csv; if you need those files, I can make them available)

14:38 < andrewmw94> That would be great. I had trouble getting most of the datasets last time.

14:39 < naywhayare> marcus_zoq hosts a bunch of them on bitbucket: https://bitbucket.org/zoqbits/benchmark-data.git

14:39 < naywhayare> I'm trying to convince bitbucket to show me what files are in that repo...

14:41 < naywhayare> ok, I ran out of patience for bitbucket; use this: http://ratml.org/datasets/

14:41 < naywhayare> the LCDM_q set is meant to be the query set and the LCDM_r set is meant to be the reference set

14:41 < naywhayare> for nearest neighbor search, I'd expect a runtime of... maybe 15 minutes? for LCDM

14:41 < naywhayare> should be about 4-5 mins for covertype

14:41 < naywhayare> and about a minute or two for the corel dataset

14:43 < andrewmw94> thanks

14:43 < andrewmw94> my comcastic internet is at work.

14:44 < naywhayare> heh, good times. fast for a few seconds then painfully slow for the rest of the file

15:07 govg has quit [Quit: leaving]

15:08 govg has joined #mlpack

15:08 govg has quit [Changing host]

15:08 govg has joined #mlpack

15:19 govg has quit [Quit: leaving]

15:21 govg has joined #mlpack

15:31 sumedh__ has joined #mlpack

15:33 < jenkins-mlpack> Yippie, build fixed!

15:33 < jenkins-mlpack> Project mlpack - svn checkin test build #1997: FIXED in 1 hr 19 min: http://big.cc.gt.atl.ga.us:8080/job/mlpack%20-%20svn%20checkin%20test/1997/

15:33 < jenkins-mlpack> * Ryan Curtin: Minor formatting changes according to the style guide (mostly, I think?).

15:33 < jenkins-mlpack> * Ryan Curtin: First pass -- move files to match naming policy, change initialize() to

15:33 < jenkins-mlpack> Initialize(), standardize comment formatting, fix some Doxygen commands. No

15:33 < jenkins-mlpack> serious functionality changes.

15:33 < jenkins-mlpack> * Ryan Curtin: Fix constructor calls, and automatically construct a cover tree with the default

15:33 < jenkins-mlpack> RASearch constructor.

15:35 sumedh_ has quit [Ping timeout: 272 seconds]

15:40 < naywhayare> sumedh__: ok, let's talk about incremental SVD, whenever you're ready. sorry for the delay

15:41 < sumedh__> naywhayare: I will be taking dinner in 15 min :( but okay I can at least tell you the problem...

15:41 < sumedh__> before that...

15:41 < naywhayare> sure, that sounds good. that'll give me a little time to think

15:42 < naywhayare> I'm assuming you are implementing Algorithm 3 in the paper "A Guide to Singular Value Decomposition for Collaborative Filtering"

15:42 < sumedh__> yes... 3 and 4... they are almost similar...

15:43 < sumedh__> sorry... 2 and 3...

15:43 < sumedh__> 1 and 4 are done...

15:43 < naywhayare> yeah, ok, I see

15:43 < sumedh__> if you see algorithm 2 ... the wUpdate and HUpdate are linked with each other,,,

15:44 < sumedh__> In that we have to call WUpdate and HUpdate for each individual user...

15:44 < naywhayare> I understand the WUpdate and HUpdate functions to be steps 2(a)iii and 2(a)iv

15:45 < sumedh__> yes... but that step is just updating ith row or jth column of respective matrix

15:45 < sumedh__> not the entire one...

15:45 < sumedh__> after updating both of these vectors... we move on to the next user...

15:46 < sumedh__> our current abstraction is incapable of handling this scenario...

15:46 < naywhayare> my first thought is that you can implement WUpdate() and HUpdate() to only work on one user

15:46 < naywhayare> and each time you call WUpdate you increment the index of the user that is being used

15:46 < sumedh__> yes... that is another solution...

15:46 < naywhayare> do you think that would work?

15:47 < sumedh__> but combining both of these function would give a better solution...

15:48 < naywhayare> ok, so the choices seem to be that we can keep them separate and use the AMF abstraction, or we can create a new class that implements those in a different way

15:48 < sumedh__> I will just take dinner and come ...

15:49 < naywhayare> ok, sounds good

15:49 < naywhayare> let me know when you're back, and we can continue

15:49 < naywhayare> I'll try and think of any other possible solutions

16:13 < sumedh__> naywhayare: okay I am back...

16:17 < sumedh__> naywhayare: if I am using an iterator... and I say it++ ... which direction is given priority?? row or column??

16:19 < naywhayare> it will increment the row first, then the column

16:19 < naywhayare> so you'll access all the elements in one column, then move to the next column

16:19 < naywhayare> if you need the other way you can use row_iterator, but row_iterator can be very slow...

16:21 < naywhayare> about the incremental svd... I can't think of anything better. you could modify the termination policy so it only checks for convergence every time all users have been looped over

16:21 < naywhayare> but maybe this doesn't fit into the AMF abstraction very well

16:21 < naywhayare> what do you think? do you think writing another class for incremental SVD is the right option?

16:21 < sumedh__> ohh... I was thinking about how we can work around with current abstraction... we have to use iterator... or sp_mat operations will be slow...

16:22 < sumedh__> changing termination policy is not a good option I agree...

16:23 < sumedh__> so we are left with template specialization... and combining both update functions...

16:23 < sumedh__> or... use some kind of dynamic checking ....

16:25 < naywhayare> I think it's possible to use the current abstraction with the idea I suggested -- increment the user index each time WUpdate() is called and then perform HUpdate() for that user

16:25 < naywhayare> then only check for termination every time the user index is 0

16:26 < sumedh__> we can keep both options open... either implement a single function for both updates and implement 2 separate functions...

16:26 < naywhayare> so I would say if you would rather implement just a single function, we should write a separate class because that doesn't fit in the abstraction very well

16:26 < naywhayare> but with 2 separate functions and a custom termination policy, I think the abstraction could work

16:27 < naywhayare> in my view, either of these options work, with the 2 functions idea being nice because you get some code reuse

16:27 < naywhayare> what do you think?

16:28 < sumedh__> okay... according to current abstraction step function increments iteration...

16:28 < sumedh__> okay I got it...

16:29 < sumedh__> lets pass step function a boolean...

16:29 < sumedh__> depending on that boolean step will increase the iteration count...

16:30 < sumedh__> but this boolean has to be generated by update function somehow...

16:31 < naywhayare> try this -- have the TerminationPolicy class be owned by (or the same as) the update rules class

16:31 < naywhayare> then you can hold the iteration index internally in the TerminationPolicy class and access it in the update rules class

16:31 < naywhayare> or something like that. maybe that is helpful?

16:32 < sumedh__> what if we force update function to return a variable which will be passed to step function??

16:33 < naywhayare> then we have to modify the AMF abstraction; I'd prefer not to do that, if possible

16:34 < sumedh__> no... we don't have to do that... right now we have ...

16:34 < sumedh__> update.WUpdate(V, W, H);

16:34 < sumedh__> update.HUpdate(V, W, H);

16:34 < sumedh__> t_policy.Step(W, H);

16:34 < sumedh__> now it will be

16:35 < sumedh__> bool w_res = update.WUpdate(...

16:35 < sumedh__> bool h_res = update.HUpdate(...

16:35 < sumedh__> t_policy.Step(W, H, w_res && h_res)

16:36 witness___ has joined #mlpack

16:38 < sumedh__> the problem with the other approach is... how can be make update class own termination policy class.. if we make it a template parameter... it will be difficult to initialize update object...

16:39 < sumedh__> keeping update rule and termination policy separate is a better design...

16:41 < sumedh__> what about this approach...

16:42 < sumedh__> do

16:42 < sumedh__> {

16:42 < sumedh__> update.WUpdate(..)

16:42 < sumedh__> update.HUpdate(..)

16:42 < sumedh__> } while(update.ISUpdated())

16:43 < sumedh__> t_policy.step()

16:44 < naywhayare> I'm reluctant to make the abstraction more complex, because it means that a user who just wants to try something simple has to do more work and understand a more complex abstraction

16:44 < naywhayare> you could make the UpdateRules policy take a reference to the TerminationPolicy as part of the constructor

16:44 < naywhayare> and the TerminationPolicy holds the iteration number

16:48 < sumedh__> but then update_rule has to be templatized with termination_policy... its better to keep these two policies separate I think,... there has to be a better way..

16:49 < naywhayare> another idea is to have the update rule take a 'const double&' in its constructor, which will represent the iteration number

16:49 < naywhayare> either that or just have the update rule count iterations on its own and also have the termination policy count iterations on its own

16:54 < sumedh__> if we make update rule own termination policy then there is no need of termination policy in AMF... AMF will only interact with update rule...

16:54 < sumedh__> humm.. you are right...

16:54 < sumedh__> that is a good option...

16:54 < sumedh__> make termination policy part of update rule rather than AMF...

16:55 < sumedh__> naywhayare: what you say??

16:58 < naywhayare> no, my idea was that the termination rule and update rule can just count the iteration separately

16:58 < naywhayare> I don't think combining the termination policy and update rule is the best idea, because the two things are fairly orthogonal (in most cases, with this case being an exception)

17:00 < sumedh__> yes... keeping them separate is the best choice... but if they count iteration separately how to force termination??

17:01 < naywhayare> I don't know what you mean; the termination should happen when the validation rmse increases

17:01 < naywhayare> but you only want to check the validation rmse when (iteration % numUsers) == 0

17:01 < naywhayare> so you just write some code to do that in a TerminationPolicy class and I think that should work; what do you think?

17:03 < sumedh__> yes.. thats right... but what about other update rules... when they are using termination policies this additional thing has to be turned off... I am confused about that...

17:03 < sumedh__> do we pass numUsers in termination policy??

17:04 < naywhayare> the constructor of the termination policy can take the number of users as input

17:04 < naywhayare> you can write a special termination policy for these algorithms, 'IncrementalValidationRMSETermination'

17:04 < naywhayare> and we'll only use this for Algorithms 2 and 3 (the incremental SVD algorithms)

17:04 < naywhayare> so a user _could_ use IncrementalValidationRMSETermination with some other set of update rules, but it's meant to be used with IncrementalSVDUpdateRules (or whatever you name the class)

17:05 < naywhayare> does that make sense? or maybe I have overlooked something?

17:06 < sumedh__> won't this create more confusion?? :(

17:07 < naywhayare> I don't think so; we're not setting IncrementalValidationRMSETermination to be the default termination policy or anything

17:07 < naywhayare> it's just one of several options users have (including writing their own), and we can write in the documentation that this termination policy is meant to be used with the incremental SVD rules and will probably act weirdly if used with something else

17:11 < sumedh__> yes... thats the issue... user has to use that update with SVDIncrementalLearning... but yes... user has to read the documentation...

17:11 < naywhayare> a user could use another termination policy, but it might be slower

17:11 < naywhayare> it'll still work, though

17:12 < naywhayare> it'll just check for convergence at every user

17:12 < sumedh__> and might not converge... cause of maxIterations...

17:13 < sumedh__> if there are 6000 users... like in the case of MovieLens...

17:13 < naywhayare> yeah; but in that case, the user should have read the documentation, or just increased the number of iterations

17:14 < naywhayare> I would like to provide some nice typedefs so that users who aren't playing with new ideas can just have some convenient typedefs

17:14 < naywhayare> i.e. 'typedef AMF<> NMF'

17:14 < naywhayare> (because AMF with default parameters is AMF)

17:14 < naywhayare> 'typedef AMF<RandomInitialization, SVDBatchLearning, ValidationRMSETermination<sp_mat> > SVDBatch'

17:14 < naywhayare> things like that

17:14 < naywhayare> and that should help clarify things, too, in addition to documentation

17:15 < sumedh__> yes.. thats a very good idea...

17:16 < sumedh__> that will definitely clarify things while reading the code..

17:16 < sumedh__> and most people do read header file to see the abstraction...

17:19 < naywhayare> yeah; and we can make the abstraction clear in a tutorial, too, which is probably what most users read first

17:19 < naywhayare> I think many users are just thinking "I want to run NMF/SVD/incremental SVD/batch SVD and I don't care how, I just want the right commands to type"

17:25 < sumedh__> the problem still persists for algorithm 3... :(

17:26 < naywhayare> which problem? how?

17:29 < sumedh__> in algorithm 3... we have to change the parameter from numUsers to total non zero entries...

17:29 < naywhayare> okay... then you can write a very slightly different validation RMSE termination class

17:30 < sumedh__> but user will have to compute the number of non-zero entries in the matrix...

17:31 < sumedh__> rather than that... lets pass the matrix...

17:32 < naywhayare> that works too

17:32 < sumedh__> but then we have to write 2 different set of termination rules ... one for algorithm 2 and one for algorithm 3...

17:33 < sumedh__> cause in algorithm 2 we have to count number of users...

17:33 < sumedh__> and in algorithm 3 we have to count number of non-zero entries...

17:33 < naywhayare> you can structure those two classes in such a way to reduce code reuse

17:34 < naywhayare> they can hold a ValidationRMSETermination object inside of them and forward the calls when (iteration % numUsers) or (iteration % n_nonzero) is 0

17:34 < naywhayare> and you can use a boolean template parameter to indicate whether or not the number of users or the number of nonzero entries should be used

17:35 < sumedh__> I think dynamic checking is a better way... one can implement just one update function... or separate functions for W and H update....

17:35 < sumedh__> we can check it with templates...

17:35 < sumedh__> like the implementation in PrefixedOutStream...

17:36 < sumedh__> where we check if we have ToString function...

17:36 < naywhayare> that's a really ugly implementation and I'd prefer to avoid reusing that code anywhere else

17:36 < naywhayare> honestly I'd prefer to completely get rid of it because it's so ugly

17:36 < naywhayare> but it's the only way I know of to provide operator<< automatically...

17:37 < naywhayare> what's wrong with something like this? (below)

17:37 < naywhayare> class IncrementalValidationRMSETermination {

17:37 < naywhayare> oops, crap

17:37 < naywhayare> hang on, I'll do this in pastebin, it'll be easier

17:41 < sumedh__> there will be too many TerminationPolicies... and user may get confused... :( I am only concerned about that...

17:42 < sumedh__> how to use pastebin btw??

17:42 < naywhayare> http://ratml.org/misc/termination.txt

17:42 < naywhayare> pastebin is overloaded, but when it works, I just write into the window and hit 'post', then give a link to the page it produces

17:43 < naywhayare> so I ended up just putting it on my website, and I gave the link above

17:43 < naywhayare> if each termination policy is documented adequately, I don't think it's a problem to have very many. if the user doesn't know what to do, then they can just use the default

17:45 oldbeardo has joined #mlpack

17:45 < sumedh__> ohh... and we don't have to pass anything new in the constructor... initialize function in Termination policy takes V matric...

17:45 < sumedh__> we can initialize our increment index there...

17:53 < andrewmw94> naywhayare: I'm getting some puzzling results from running the r_tree and the dual_tree algorithms against the corel.csv data

17:53 < andrewmw94> The diff of the distances file only returns one line where they differ, but the diff of the neighbors file returns many

17:53 < andrewmw94> [awells@host-113-53 bin]$ diff distances_out.csv distances_out1.csv 17087c17087

17:54 < andrewmw94> < 1.403729444017e-01,1.541401652717e-01,1.762187944176e-01

17:54 < andrewmw94> ---

17:54 < andrewmw94> > 1.403729444017e-01,1.541401652717e-01,1.769238377184e-01

17:54 < andrewmw94> [awells@host-113-53 bin]$ diff neighbors_out.csv neighbors_out1.csv 15269c15269

17:54 < andrewmw94> < 2727,35448,36680

17:54 < andrewmw94> ---

17:54 < andrewmw94> > 2727,36680,35448

17:54 < andrewmw94> 17087c17087

17:54 < andrewmw94> < 26542,9524,19580

17:54 < andrewmw94> ---

17:54 < andrewmw94> > 26542,9524,9712

17:54 < andrewmw94> 19687c19687

17:54 < andrewmw94> < 35448,36680,617

17:54 < andrewmw94> ---

17:55 < naywhayare> so it is possible that two points are equidistant from a particular query point

17:55 < naywhayare> which would explain why the distance results are the same but the neighbor indices are not

17:55 < naywhayare> I didn't think that happened with the corel dataset, but if the distances are right and the neighbors are wrong, it's probably just mis-ordering

17:56 < naywhayare> and you can see that in the first diff section of neighbors, it's just a change in ordering: 2727,35448,36680 vs. 2727,36680,35448

17:56 < andrewmw94> yeah, that's true for most of them. But how would the tree cause that. I thought the BaseCase took care of sorting?

17:56 < naywhayare> but that one different distance in the distances file indicates a bug to me... which was distances_out.csv? the kd-tree or r-tree?

17:57 < naywhayare> BaseCase sorts by distance, but doesn't make any guarantees about identical distances

17:57 < naywhayare> if the distances file is the same but the neighbors is different, then I would still classify the implementation as correct

17:57 < naywhayare> but the neighbors file is a useful place to start when trying to figure out why a particular point gets pruned...

17:58 < andrewmw94> distances_out.csv is the r_tree, distances_out1.csv was the kd-tree

17:58 < naywhayare> that implies a bug in the kd-tree... can you test against allknn run with the -N option (naive)?

17:59 < andrewmw94> let me make sure I remember correctly first.

17:59 < andrewmw94> yeah, that was right

17:59 < andrewmw94> ok, I'll try naive

18:00 < naywhayare> fascinating. let's see what the naive approach gives...

18:00 < naywhayare> if the r-tree results agree with the naive approach, but the kd-tree results don't, then it seems like we have uncovered a kd-tree bug (or... something like that)

18:01 < andrewmw94> once in 37,000 points. That'll be fun to fix...

18:03 < naywhayare> well, it takes a bit of tracing, but the general workflow will look like this:

18:03 < naywhayare> determine the sequence of nodes leading to the point that is getting incorrectly pruned

18:04 < naywhayare> set breakpoints in gdb on Score(size_t, TreeType&) for each of these nodes

18:04 < naywhayare> then figure out which one is getting incorrectly pruned

18:04 < naywhayare> then figure out why that's happening (this is the hardest part...)

18:06 < andrewmw94> ok, so the cover tree says the kd-tree is wrong in the same way. I'll do the naive now

18:07 < naywhayare> if it's a kd-tree bug, I'll dig in and try to figure out what's wrong

18:07 < naywhayare> since I wrote that code

18:08 < andrewmw94> ok. I don't think I mentioned that it was the dual-

18:09 < andrewmw94> tree traverser. I haven't tried --single yet

18:11 < oldbeardo> naywhayare: hey, I tried SGD vs L_BFGS

18:12 < oldbeardo> naywhayare: SGD performs better taking 3.26 secs as opposed to 4.50 secs taken by L_BFGS

18:12 < naywhayare> oldbeardo: this is your implementation of SGD, right?

18:12 < oldbeardo> naywhayare: this is the time taken for 10 iterations on the GroupLens dataset

18:12 < oldbeardo> naywhayare: yes, that's right

18:13 < naywhayare> how many iterations does it take to converge?

18:13 < oldbeardo> naywhayare: well, somewhere in between 50 to 100 iterations

18:14 < oldbeardo> the relative error converges to 0.19

18:15 < oldbeardo> SGD also performs better in that sense, after 10 iterations the relative error for SGD is 0.23 whereas for L_BFGS it is 0.25

18:23 < naywhayare> okay, I figured SGD would converge better

18:25 < naywhayare> also, as a side note, for some reason the svn authentication file has just completely been nuked for a reason I don't understand. a backup is on the way, but until then, svn commits will fail...

18:26 < naywhayare> oldbeardo: I see a couple of options for how we can accelerate the mlpack SGD+regularized SVD implementation

18:27 < oldbeardo> okay, thanks for that

18:27 < naywhayare> the first I already mentioned -- use arma::sp_mat for the gradient calculation

18:27 < naywhayare> another option is to write a template specialization for SGD<RegularizedSVDFunction>::Optimize()

18:27 < oldbeardo> naywhayare: I ran a test with 100 iterations, SGD is still more time efficient, but L_BFGS gives a lower relative error

18:29 < oldbeardo> naywhayare: I guess that makes sense, since SGD is more of a quick approximation

18:30 < naywhayare> yeah, if you used a specialization you could avoid the expensive calls to mat.zeros() for the gradient

18:30 < naywhayare> and just reset the relevant columns and rows to 0

18:30 < naywhayare> a sparse matrix would also be applicable, and the .clear() operation would reset it to 0

18:31 < naywhayare> you would need to refactor SGD so that it could accept mat or sp_mat from the Gradient() call (using templates, probably)

18:32 < naywhayare> andrewmw94: I can reproduce the error for kd-trees with the corel dataset; lots of errors compared to the naive implementation or the cover tree implementation

18:32 < naywhayare> I'll take a look into it, but probably not today

18:32 < naywhayare> you don't need to worry about it though, other than just noting that comparing with the kd-tree results isn't a great idea at the moment :)

18:34 < oldbeardo> naywhayare: what should I do then? I don't really understand the improvements you are suggesting

18:35 < naywhayare> oldbeardo: I suggested two ways in which you can accelerate the SGD optimization, since that is the main problem you are facing with your implementation

18:36 < naywhayare> the first is to use a template specialization for SGD<RegularizedSVDFunction>::Optimize(); if you do that, you can write custom code that calculates the gradient without calling gradient.zeros(), which is probably more like the implementation you wrote by hand

18:36 < naywhayare> the other option is to refactor RegularizedSVDFunction::Gradient so that it takes an arma::sp_mat& as the gradient and not an arma::mat&

18:37 < naywhayare> because the gradient for an individual user is going to be sparse, this might be a good choice

18:37 < naywhayare> is that a decent clarification? if not, how can I help clarify further?

18:40 < oldbeardo> yes, I understood, I'm not sure of how a template specialization might be written

18:42 < oldbeardo> though it sounds that this may be an easier option than using sp_mat

18:42 < naywhayare> take a look in src/mlpack/core/optimizers/lrsdp/lrsdp_function.cpp, down at the bottom

18:42 < naywhayare> there are two template specializations there

18:43 < oldbeardo> okay, which one of the two would you prefer?

18:44 < naywhayare> I think either is fine; it's up to you. I think you're right that the template specialization is easier, so if I was in your position I'd probably choose to do that

18:46 < oldbeardo> okay, I will give that a try tomorrow, is there anything else that I may be forgetting?

18:49 < naywhayare> about regularized SVD? not that I can think of

18:50 < naywhayare> have you thought about how to integrate QUIC_SVD into the CF class?

18:51 < oldbeardo> I think we had this discussion in the past week

18:53 < naywhayare> we had a discussion on how well QUIC-SVD works for collaborative filtering, but I don't think we can conclude that it works poorly for all datasets

18:54 < naywhayare> therefore, we should make sure that CF<QUIC_SVD> is possible (or something like that) and uses QUIC-SVD for matrix decomposition

18:57 < oldbeardo> I think we can conclude that, GroupLens is a fairly small dataset, the performance will once be worse for bigger datasets

18:57 < oldbeardo> *only be worse

18:58 < oldbeardo> or maybe I'm mistaken, in what case will it work well?

19:12 < naywhayare> my intuition suggests that QUIC-SVD will work better for denser datasets

19:12 < naywhayare> but I'm not certain. either way, if a user could quickly do collaborative filtering with QUIC-SVD, we could easily determine where it performs well and where it does not :)

19:14 < oldbeardo> okay, now for the technical difficulties

19:15 < oldbeardo> QUIC-SVD gives u, v and sigma, how should we handle that?

19:18 < naywhayare> please develop a couple possible ideas for how this can be handled, and then we can talk about which one is best for mlpack

19:21 < oldbeardo> okay, can you give me pointers to resources which deal with problems like these?

19:24 < naywhayare> I am not certain of what resources to point you towards; the SVD itself decomposes a matrix X into U*S*V^T, and CF does a similar decomposition

19:24 < naywhayare> many of the papers you referenced in your proposal or in various IRC conversations should have sufficient detail of how to put the two pieces together, I think

19:25 < oldbeardo> okay, I will think about it

19:27 oldbeardo has quit [Quit: Page closed]

19:55 Gys has quit [Quit: Page closed]

20:10 < sumedh__> naywhayare: incremental learning working... :)

20:11 < sumedh__> we don't have to create another set of termination policies...

20:11 < sumedh__> we can do this...

20:11 < sumedh__> AMF<IncrementalTermination<SimpleToleranceTermination<mat> >, RandomInitialization, SVDIncrementalLearning> amf;

20:12 < sumedh__> works like a wrapper...

20:18 < naywhayare> cool, it just wraps any other termination policy?

20:18 < naywhayare> that sounds like a good solution

20:18 < sumedh__> yup ... any termination policy :)

20:18 < sumedh__> tried them all...

20:18 < naywhayare> we just need to be sure that IncrementalTermination is documented so that people know to use it with SVDIncrementalLearning

20:19 < sumedh__> okay I have changed its name to IncompleteIncrementalLearning... its actually algorithm 2 of the paper...

20:19 < sumedh__> yes... that very important in this case...

20:20 < sumedh__> I am right now trying to speed up incremental learning for sparse matrix... I guess I have to use row_iterator... it must be faster than addressing each element.. right??

20:32 < andrewmw94> naywhayare: just in case you didn't find out already, it seems that the Binary Space Tree traverser works correctly when in single mode; it's a bug in the dual-tree traverser.

20:39 < sumedh__> naywhayare: oops...

20:40 < sumedh__> the paper uses SVDUSER... but we pass the V matrix as m * n... I need to change the implementation...

20:44 < jenkins-mlpack> Starting build #1998 for job mlpack - svn checkin test (previous build: FIXED)

20:49 < sumedh__> naywhayare: there is no sp_mat column iterator :(

21:08 < naywhayare> andrewmw94: yeah, I had it isolated to the dual-tree case

21:08 < naywhayare> sumedh__: the default sp_mat iterator is the column-major iterator

21:09 < sumedh__> but how start from certain column??

21:09 < sumedh__> or end at certain column??

21:13 < naywhayare> sumedh__: I think begin_col() and end_col() are what you are looking for

21:13 < naywhayare> http://arma.sourceforge.net/docs.html#iterators_mat

21:14 < sumedh__> I thought those are for normal matrices... okay thanks...

21:14 < naywhayare> it says they are for SpMat too :)

21:20 < sumedh__> naywhayare: sorry for that :) okay I have changed the implementation of incomplete incremental learning to column major fashion...

21:22 < sumedh__> incomplete incremental learning is much more stable and much faster... down to 0.98 in 15 iterations...

21:23 < sumedh__> naywhayare: I will just commit the code right now... so that you can have a look... tomorrow I will write tests... is that fine??

21:35 < naywhayare> sumedh__: of course, that's fine; I probably won't look over it until tests are there

21:36 < sumedh__> naywhayare: it too late right now... or I would have done it right now :(

21:36 < sumedh__> should get some sleep...

21:53 sumedh__ has quit [Quit: Leaving]

22:06 < jenkins-mlpack> Project mlpack - svn checkin test build #1998: SUCCESS in 6 hr 32 min: http://big.cc.gt.atl.ga.us:8080/job/mlpack%20-%20svn%20checkin%20test/1998/

22:06 < jenkins-mlpack> andrewmw94: tree traversal

22:06 < jenkins-mlpack> Starting build #1999 for job mlpack - svn checkin test (previous build: SUCCESS)

23:18 andrewmw94 has left #mlpack []

23:26 < jenkins-mlpack> Project mlpack - svn checkin test build #1999: SUCCESS in 1 hr 19 min: http://big.cc.gt.atl.ga.us:8080/job/mlpack%20-%20svn%20checkin%20test/1999/

23:26 < jenkins-mlpack> * sumedhghaisas: * added SVD Incomplete incremental learning

23:26 < jenkins-mlpack> * added Termination Policy wrapper for SVD Incomplete Learning

23:26 < jenkins-mlpack> * andrewmw94: R tree traversal test code.

23:26 < jenkins-mlpack> * birm: Added check to solve ticket 347. Note that this does not changeany functionality, just adds a warning.