naywhayare changed the topic of #mlpack to: http://www.mlpack.org/ -- We don't respond instantly... but we will respond. Give it a few minutes. Or hours. -- Channel logs: http://www.mlpack.org/irc/
naywhayare has joined #mlpack
< jenkins-mlpack> Project mlpack - nightly matrix build build #523: STILL FAILING in 4 hr 54 min: http://big.cc.gt.atl.ga.us:8080/job/mlpack%20-%20nightly%20matrix%20build/523/
< jenkins-mlpack> * saxena.udit: Adaboost.mh implemented
< jenkins-mlpack> * saxena.udit: Changes to implementation of adaboost. Implemented adaboost.m1
< jenkins-mlpack> * Marcus Edel: Avoid direct multiplication with a diagmat.
Anand has joined #mlpack
Anand has quit [Quit: Page closed]
andrewmw94 has joined #mlpack
< naywhayare> marcus_zoq: I took a look through the NaiveKernelRule and NystroemKernelRule classes... do you think maybe we could condense the code a little by changing ApplyKernelMatrix() to a function that just returns the kernel matrix?
< naywhayare> then the KernelPCA class can do the centering and eigendecomposition
< naywhayare> it looks like right now the NaiveKernelRule class uses eig_sym() and the NystroemKernelRule uses svd()... maybe svd() is faster?
< marcus_zoq> naywhayare: This was also my first intuition, but then I realized that in this case we would need to add a new parameter to the ApplyKernelMatrix function which returns G (Nystroem method), so that we can transform the data. Since the naive method doesn't provide G, I ended up with duplicated code for the matrix centering and eigendecomposition. Maybe this was a bad idea and we should change it?
< marcus_zoq> naywhayare: In some cases svd is faster, so I ended up using svd.
< naywhayare> marcus_zoq: I see what you mean; the G matrix is necessary for the Nystroem method
< jenkins-mlpack> Starting build #2034 for job mlpack - svn checkin test (previous build: SUCCESS)
< naywhayare> marcus_zoq: I can't think of a better solution for now, so let's leave it as-is
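A rough sketch of the trade-off being discussed, with hypothetical names (this is not mlpack's actual NaiveKernelRule/NystroemKernelRule interface): a naive rule could simply return the kernel matrix K and let KernelPCA center it and run eig_sym(), but the Nystroem rule also has to expose its G matrix to transform the data, which is why each rule currently does its own centering and eigendecomposition.

    // Illustrative only; a real kernel rule in mlpack has a different interface.
    #include <armadillo>

    struct LinearKernelSketch
    {
      double Evaluate(const arma::vec& a, const arma::vec& b) const
      {
        return arma::dot(a, b);
      }
    };

    // Build the full kernel matrix K for the given data and kernel.
    template<typename KernelType>
    arma::mat KernelMatrix(const arma::mat& data, KernelType& kernel)
    {
      arma::mat K(data.n_cols, data.n_cols);
      for (arma::uword i = 0; i < data.n_cols; ++i)
        for (arma::uword j = 0; j < data.n_cols; ++j)
          K(i, j) = kernel.Evaluate(data.col(i), data.col(j));
      return K;
    }

    int main()
    {
      arma::mat data = arma::randu<arma::mat>(3, 10);
      LinearKernelSketch kernel;
      arma::mat K = KernelMatrix(data, kernel);

      // KernelPCA could now center K and eigendecompose it; the Nystroem
      // method additionally needs its G matrix to transform the data.
      arma::vec eigval;
      arma::mat eigvec;
      arma::eig_sym(eigval, eigvec, K);
      return 0;
    }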
< naywhayare> do you think we should add the option to use the Nystroem method to kernel_pca_main.cpp?
< marcus_zoq> naywhayare: If you like, sure :)
< naywhayare> yeah, it'd probably be useful to someone out there
< naywhayare> returning an arma::mat* from KMeansSelection::Select() is a little scary to me because it requires the user to delete it, but I don't see an easy way to make it better without a lot of refactoring
< naywhayare> so it's probably not worth changing it
< naywhayare> anyway everything else looks great, I don't see any issues, and the tests seem fine too
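As an aside on the Select() ownership point above, a hypothetical illustration (not mlpack's actual KMeansSelection interface) of why returning a raw arma::mat* is scary: every caller has to remember to free it.

    #include <armadillo>

    // Stand-in for a Select()-style function; the real code would choose the
    // columns by running k-means rather than taking the first m columns.
    arma::mat* Select(const arma::mat& data, const size_t m)
    {
      return new arma::mat(data.cols(0, m - 1));  // the caller now owns this
    }

    int main()
    {
      arma::mat data = arma::randu<arma::mat>(3, 20);
      arma::mat* centers = Select(data, 5);
      // ... use *centers ...
      delete centers;  // easy to forget; a smart pointer would remove this burden
      return 0;
    }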
< naywhayare> andrewmw94: you can use BOOST_REQUIRE_CLOSE(), BOOST_REQUIRE_EQUAL(), etc. in place of assert() in your tests
< naywhayare> you can even use it in the functions that are called by the tests, like checkContainment(), etc.
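A stand-alone example of the Boost.Test macros mentioned above (mlpack's own test suite defines the test module once and links against the Boost unit test library rather than using the included header):

    #define BOOST_TEST_MODULE ExampleTest
    #include <boost/test/included/unit_test.hpp>

    BOOST_AUTO_TEST_CASE(SimpleChecks)
    {
      const double computed = 3.14159;
      BOOST_REQUIRE_CLOSE(computed, 3.14159, 1e-5);  // third argument is a
                                                     // percentage tolerance
      BOOST_REQUIRE_EQUAL(2 + 2, 4);                 // exact comparison
    }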
< andrewmw94> ahh
< andrewmw94> that would be nice, I wasn't sure if the bool &= stuff was a good idea
< andrewmw94> I have a design question I was hoping you could help me with.
< naywhayare> it's clever, at least :)
< naywhayare> yeah, sure, go ahead
< marcus_zoq> naywhayare: Yeah, maybe we can use an auto_ptr or something like that? I can make the necessary changes to kernel_pca_main.cpp; is there any deadline for the new release?
< andrewmw94> so the R* tree is done, except for the reinsertion. The paper says that they allow one reinsertion on each level of the tree for each (non reinsertion) insertion
< andrewmw94> so I was thinking the best way to do that would be to have a static std::vector<bool> stored in the root node, and keep track of which levels have already had a reinsertion. But that would be shared with other trees
< andrewmw94> unless I'm mistaken?
< naywhayare> marcus_zoq: don't worry about the auto_ptr for now; I was hoping to get the release done by the end of this week, but if you want more time that's fine; I'm still waiting on one or two other things
< naywhayare> andrewmw94: let me read a little so I can understand the problem better
< andrewmw94> do you want a link to the paper?
< naywhayare> http://epub.ub.uni-muenchen.de/4256/1/31.pdf is what I'm looking at
< andrewmw94> yeah. So it's section 4.
< naywhayare> ok, thanks
< andrewmw94> 4.3 sorry
< marcus_zoq> naywhayare: Okay, I'll integrate the Nystroem method tonight or tomorrow.
< naywhayare> marcus_zoq: excellent, thanks
< naywhayare> andrewmw94: why not hold the std::vector<bool> locally in the Insert() method?
< naywhayare> i.e. if I call Insert(), then the std::vector<bool> is created and initialized to false... as Insert() calls other methods, a reference to the vector is passed
< naywhayare> to me it doesn't make much sense to hold the vector in the tree itself, since it's only relevant when inserting points
< andrewmw94> hmm. How will that work with insert being called, and then called again before the first one returns?
< naywhayare> when you say called again, do you mean in the algorithm OverflowTreatment when ReInsert is called?
< andrewmw94> yeah. What I do is delete the point and then call root.Insert(point). But it's also called on leaf nodes when I split them.
< andrewmw94> though I could change those to be a different method I guess.
< naywhayare> so if you are going to hold the vector locally in Insert(point), then you'll probably need some method like Insert(point, vector)
< naywhayare> which is what gets called on all of the children nodes, to pass around that local vector
< naywhayare> if you re-inserted, then you could call root.Insert(point, vector)
< naywhayare> do you think that might work?
< andrewmw94> I think so
< andrewmw94> thanks
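A minimal sketch of the approach just described, using a simplified stand-in class (this is not the real RectangleTree interface): the per-level reinsertion flags live in a local vector created by the public Insert() and are passed by reference through the recursion, so nothing insertion-specific has to be stored in the tree itself.

    #include <cstddef>
    #include <vector>

    class TreeSketch
    {
     public:
      explicit TreeSketch(const size_t depth) : depth(depth) { }

      // Public entry point: create the flags, then recurse with them.
      void Insert(const size_t point)
      {
        std::vector<bool> relevels(depth, false);
        Insert(point, relevels);
      }

     private:
      // Internal version that child nodes (and forced reinsertions, via the
      // root) call, so the same flags are shared for the whole insertion.
      void Insert(const size_t point, std::vector<bool>& relevels)
      {
        const size_t level = 0;  // placeholder; the real code knows its level
        if (!relevels[level])
        {
          relevels[level] = true;  // at most one reinsertion per level
          // ... remove some entries and re-insert them from the root ...
        }
        // ... otherwise split the node as usual ...
        (void) point;
      }

      size_t depth;
    };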
< andrewmw94> also, are the tests rigorous enough?
< andrewmw94> I think I should change checkContainment() to ensure all of the bounds are as tight as possible (I wrote that, but had problems and lost it when I tried to fix them with SVN). Other than that, is there anything that needs to change?
< naywhayare> let me take a look through the tests quickly
< andrewmw94> and also about the tests, do I need to test every possible combination of DescentHeuristic and SplitHeuristic ?
< naywhayare> it would be nice if you did; is that tractable? I think there will be something like 8 possibilities total?
< andrewmw94> at least. I'm not sure. I'm also considering templatizing the bulk loading strategy
< naywhayare> reading the tests is a little hard because there aren't very many comments, so I may be overlooking some things that the tests actually check
< andrewmw94> ahh. I'll have to add comments. I thought the names described them pretty well, but it's obviously easier for me to tell what they do if I wrote them
< naywhayare> yeah; when you write comments, assume that whoever is trying to read them is up at four in the morning trying to figure out why the test is failing
< andrewmw94> Hehe. So write something like: "Relax and go to bed. The IRC will solve it tomorrow. Oh wait, you have a paper due at noon. Sucks to be you..."
< naywhayare> haha
< naywhayare> also, you may want to use some typedefs to simplify the really really long lines
< naywhayare> I think SingleTreeTraverserTest should actually be in AllkNNTest, but that's a minor detail
< andrewmw94> yeah, I haven't gone through that code again to clean it up. Actually, a lot of the code has changed since I last refactored it. Basically the tests ensure (unless the tests have bugs) that all of the points are inserted and that the bounds are all legal. There are some other things I could do, like ensuring all of the nodes are on the same level, the bounds are as tight as possible, etc. But I wanted to make sure I had the gener
< naywhayare> yeah, ok. the tests seem reasonable to me
< naywhayare> instead of testing all the possible combinations of descent heuristic and split heuristic, you might consider writing tests for those individually
< naywhayare> for instance, a test for RTreeSplit::SplitLeafNode() and RTreeSplit::SplitNonLeafNode()
< andrewmw94> hmm, that's a good idea because then they can be exhaustive
< naywhayare> yeah; you can manually create some R tree nodes as input and then call the splitting procedure
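A sketch of that kind of isolated split test; the MedianSplit() below is a toy stand-in rather than the real RTreeSplit::SplitLeafNode() (whose interface is still in flux). The point is the structure: build a node's points by hand, split them, and check an invariant such as min/max fill.

    #define BOOST_TEST_MODULE SplitSketchTest
    #include <boost/test/included/unit_test.hpp>
    #include <armadillo>
    #include <utility>

    // Toy splitter: put the first half of the points in one node and the
    // second half in the other.
    std::pair<arma::mat, arma::mat> MedianSplit(const arma::mat& points)
    {
      const arma::uword half = points.n_cols / 2;
      return std::make_pair(arma::mat(points.cols(0, half - 1)),
                            arma::mat(points.cols(half, points.n_cols - 1)));
    }

    BOOST_AUTO_TEST_CASE(SplitRespectsFill)
    {
      arma::mat points = arma::randu<arma::mat>(2, 8);  // a "leaf" with 8 points
      const size_t minFill = 3, maxFill = 7;

      std::pair<arma::mat, arma::mat> halves = MedianSplit(points);

      BOOST_REQUIRE_GE(halves.first.n_cols, minFill);
      BOOST_REQUIRE_LE(halves.first.n_cols, maxFill);
      BOOST_REQUIRE_GE(halves.second.n_cols, minFill);
      BOOST_REQUIRE_LE(halves.second.n_cols, maxFill);
    }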
Anand has joined #mlpack
< andrewmw94> Ok. I'll write up a list of all the guarantees for Rectangle Trees and write tests for Min / Max fill too. Some of these things seem like you couldn't possibly break them, but I guess it doesn't hurt to test.
< naywhayare> they're useful as tests, because when someone comes along to debug something bigger they can easily say "well, the problem should not be in <some component whose tests are not failing>"
< naywhayare> and then hope that the tests are comprehensive enough that that statement is true...
Anand_ has joined #mlpack
< andrewmw94> yeah.
Anand has quit [Ping timeout: 246 seconds]
< jenkins-mlpack> Project mlpack - svn checkin test build #2034: SUCCESS in 1 hr 26 min: http://big.cc.gt.atl.ga.us:8080/job/mlpack%20-%20svn%20checkin%20test/2034/
< jenkins-mlpack> Ryan Curtin: Comment normalization.
< jenkins-mlpack> Starting build #2035 for job mlpack - svn checkin test (previous build: SUCCESS)
< Anand_> Marcus : I added the scripts for perceptron and decision stump. However, I am not sure about the output file formats and structure. Do the two methods generate label files like other methods do (one label per line)?
< naywhayare> Anand_: that should be how those two function
< naywhayare> (one label per line, that is)
< naywhayare> they produce labels into the file given with -l (or --labels_file) for each test point from the testing dataset (given with -T or --test_file)
< Anand_> so, decision_stump -i self.dataset[0] -T self.dataset[1] -l "name_of_labels_file" -v options will work?
< Anand_> Ryan : Maybe Marcus will answer this?
< marcus_zoq> Anand_: Yeah, this should work.
< Anand_> Ok, so unlike others we have to specify the labels file name for these two, right?
< marcus_zoq> Anand_: No: "If not specified, the labels are assumed to be the last row of the training data. Default value ''."
udit_s has joined #mlpack
< marcus_zoq> Anand_: decision_stump -i self.dataset[0] -T self.dataset[1] -v options
< Anand_> Marcus : Ok, so no need of the "-l" in the scripts?
< marcus_zoq> Anand_: Right, sorry for the confusion. Btw, it's actually -t, not -i.
< Anand_> -t for train and -T for test?
< marcus_zoq> Anand_: Yes!
< Anand_> Alright!
< Anand_> Marcus : You can have a look now!
< marcus_zoq> Anand_: Okay, sure.
< marcus_zoq> Anand_: I've sent you an email.
Anand_ has quit [Ping timeout: 246 seconds]
Anand has joined #mlpack
< Anand> Marcus : Fixed the issues. And yes, we can definitely add scikit and shogun perceptron.
< marcus_zoq> Anand: Okay, great!
< Anand> MATLAB must also have a perceptron, then?
< marcus_zoq> Anand: Yeah, I think we can use the neural network toolbox.
< Anand> Ok. I might need your help there. We will need matlab code, right? I will ask
oldbeardo has joined #mlpack
< oldbeardo> naywhayare: hey, there?
< naywhayare> oldbeardo: yeah, I am here. did your travel go well?
< oldbeardo> naywhayare: yes, it did, not exactly travel, I switched cities
< naywhayare> yeah; maybe I should have said "transit" instead of "travel" :)
< oldbeardo> heh, yeah, so did you get a chance to go through the paper?
< naywhayare> ack, I forgot. let me do that now
< marcus_zoq> Anand: The matlab code is simple: http://www.mathworks.de/de/help/nnet/ref/perceptron.html
< oldbeardo> no problem, I'm here for another hour or so
< naywhayare> I'm not sure I will have it done by then, so if you are gone I will send an email
< naywhayare> just to make sure I am doing the right thing, the question is whether or not regularized SVD is a special case of PMF
< naywhayare> is that right?
< oldbeardo> sure, but the part I'm supposed to do shouldn't take more than half an hour
< oldbeardo> yes, that's right
< naywhayare> what do you mean? I'm planning to read both papers so I can understand them completely
< naywhayare> I doubt I can do it in thirty minutes :(
< Anand> Marcus : Ok, so we need to use the net(x)? And will it give us the labels?
< oldbeardo> well, the PMF paper has the description of the cost function in the first three pages itself I think
< naywhayare> sure, but I'm going to read the rest of it anyway so that I'm fully clear on what the algorithms are
< marcus_zoq> Anand: Yes, the predicted labels.
< oldbeardo> naywhayare: okay, no problem
< Anand> Ok. I will see that. Should be done by tomorrow.
< jenkins-mlpack> Project mlpack - svn checkin test build #2035: SUCCESS in 1 hr 25 min: http://big.cc.gt.atl.ga.us:8080/job/mlpack%20-%20svn%20checkin%20test/2035/
< jenkins-mlpack> * Ryan Curtin: Refactor to be stricter with types (arma::Col<size_t> instead of arma::vec).
< jenkins-mlpack> * andrewmw94: R* tree split. Default to using the R* tree in allknn
< marcus_zoq> Anand: Sounds good to me!
< udit_s> naywhayare: Hey. You there ?
< oldbeardo> naywhayare: I had one more question, how do I test the template specialization that I wrote for SGD?
< naywhayare> udit_s: I am here, but I am busy for probably 30 to 45 minutes
< naywhayare> oldbeardo: I'm not sure what you mean. it shouldn't be any different than writing tests for anything else
< udit_s> naywhayare: Okay. I'll catch you in a while then. I just wanted to talk to you about tests.
< oldbeardo> naywhayare: I have written the Optimize() method, I'm not sure how I should test that
< naywhayare> call it on a handful of different datasets (very simple to moderately hard) and ensure that it performs well on each of them?
< naywhayare> I'm not sure I am answering the exact question that you're asking
< udit_s> my problem was that running on datasets was the only test available.
< oldbeardo> hmmm, how should I quantify 'performing well'?
< naywhayare> udit_s: sorry, I should have addressed oldbeardo specifically for that, I didn't realize my response might apply to both of you :)
< naywhayare> oldbeardo: take a look at nystroem_method_test.cpp for an example of the type of test you might want to try
< oldbeardo> naywhayare: okay, I saw the file, how did you decide the number 1e-3 in Rank10Test?
< oldbeardo> was it selected after running trials?
< naywhayare> no, I believe I chose 1e-3 as something that was reasonably larger than the absolute value of the noise I added (which is 1e-5)
< naywhayare> writing tests like that is difficult, though
< naywhayare> if I select a value by running trials with my own implementation, I am still not guaranteed that the test results are accurate
< naywhayare> the only guarantee that particular test makes is that if someone comes along later and makes it way worse, the test will fail
< naywhayare> further, even if I compare against the results in some paper, I am not even guaranteed that their implementation is correct
< naywhayare> so for testing numerical algorithms like this, there's a limit to how rigorously we can actually test it, and to some extent we just have to do the best we can
< naywhayare> if you have some other ideas for testing that I've overlooked, feel free to use those too
< oldbeardo> well, I think that test is heavily dependent on the algorithm and the problem you're trying to solve
< oldbeardo> I'm not sure how that will work for SVD
< naywhayare> yeah, of course -- all three of those tests are very specific to the nystroem algorithm and aren't specifically applicable to SVD
< naywhayare> what I meant to share was the basic strategy of the test itself
< naywhayare> *of the tests
< naywhayare> a very simple synthetic dataset that SVD should be able to recover basically perfectly (maybe start with U, s, and V, then build the input matrix and use SVD on it and ensure the results are about the same?)
< naywhayare> a less easy synthetic dataset that regularized SVD should be able to perform well on
< naywhayare> maybe some tests that check that regularized SVD performs better than non-regularized SVD
< naywhayare> and then maybe a real-world test where you are trying to reproduce the results of some paper
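A rough sketch of the first of those ideas, in plain Armadillo; the call to the regularized SVD code itself is left out (that interface is still being written), so Armadillo's own svd_econ() stands in for whatever factorization is under test. Build a matrix from known low-rank factors, factor it again, and require near-perfect reconstruction.

    #include <armadillo>
    #include <cassert>

    int main()
    {
      const arma::uword n = 100, m = 80, rank = 5;

      // Known low-rank factors.
      arma::mat U = arma::randn<arma::mat>(n, rank);
      arma::mat V = arma::randn<arma::mat>(m, rank);
      arma::mat X = U * V.t();

      // Recover a rank-5 factorization (stand-in for the method under test).
      arma::mat Uhat, Vhat;
      arma::vec s;
      arma::svd_econ(Uhat, s, Vhat, X);

      arma::mat Xhat = Uhat.cols(0, rank - 1) * arma::diagmat(s.head(rank)) *
          Vhat.cols(0, rank - 1).t();

      // On noiseless rank-5 data the relative error should be essentially zero.
      assert(arma::norm(X - Xhat, "fro") / arma::norm(X, "fro") < 1e-10);

      return 0;
    }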
< oldbeardo> the last part may be difficult, since people test Reg-SVD only with Netflix :)
< oldbeardo> anyway, I get your point
< naywhayare> yeah, the netflix dataset is very large, unfortunately :(
< naywhayare> sumedh has found in his work that the movielens/grouplens datasets are often benchmarked against too
< naywhayare> so maybe you can find something using that dataset
< naywhayare> for your algorithms
< naywhayare> I've finished going through the PMF and regularized SVD papers
< naywhayare> do you have time to talk about that now, or would you rather wait for tomorrow, or would you rather I write up an email you can read in the morning?
sumedhghaisas has joined #mlpack
< oldbeardo> I guess I can talk now
< naywhayare> if you are too tired, we can easily wait; it's no problem
< oldbeardo> no, I'll get a good night's sleep if I finish this now
< naywhayare> okay
< naywhayare> so, after reading through the papers, I would agree that both PMF and Regularized SVD are very similar, if not exactly the same
< naywhayare> however, each paper (the NIPS PMF paper and the Paterek SVD paper) describes further extensions, and those extensions differ
< naywhayare> for example the Paterek paper adds a bias to the model, and the PMF paper describes extensions for constrained PMF
< oldbeardo> right, I think that answers my question
< naywhayare> so, the way I think we can proceed is by implementing just one of them, since they are essentially the same
< naywhayare> and then this sort of invalidates your project proposal a little bit since those two things are the same, leaving you with a lot of extra time
< naywhayare> so, we can consider some of the extensions to PMF or regularized SVD; or, we could focus on the deep learning code that you had been implementing
< naywhayare> do you think that seems reasonable, or did you have another idea?
< oldbeardo> oh, this reminds me, the Softmax Regression code hasn't been added yet
< oldbeardo> I had sent it to you the week before GSoC started
< oldbeardo> with written tests
< oldbeardo> and since you are planning a release, it would be great if you could have a look at it
< oldbeardo> coming back, your plan seems reasonable, I'll look into what else I can do
Anand has quit [Ping timeout: 246 seconds]
< naywhayare> oldbeardo: ok, I remember some issue with the softmax regression code, but I will try to look into it this week before the release
< naywhayare> I am still waiting on a few things from a few people, but I'm hoping to do it by the end of the week
< oldbeardo> naywhayare: thanks, I will add the Reg SVD tests in a day or two
oldbeardo has quit [Quit: Page closed]
< udit_s> naywhayare: free now ?
< naywhayare> udit_s: no, one or two other things to take care of first; sorry
< naywhayare> udit_s: okay, sorry for the delay
< naywhayare> I'm not really interested in going through the adaboost code until there are tests that show that it works
< udit_s> okay.
< udit_s> I'm having some difficulties coming up with tests...
< naywhayare> so I just had a discussion with oldbeardo on the types of ideas for tests you can come up with
< naywhayare> your problem is classification, not matrix decomposition, so it's a little different, but the same ideas still apply
< naywhayare> you could start with a simple synthetic dataset that will obtain perfect performance
< naywhayare> then you could move on to more complex datasets where boosting is necessary to obtain perfect performance
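A sketch of how such a simple synthetic dataset could be generated with Armadillo. The AdaBoost training and prediction calls are left out, since that interface is still under development; a test would train on (data, labels), predict on the same points, and require every predicted label to match (e.g. with BOOST_REQUIRE_EQUAL).

    #include <armadillo>

    int main()
    {
      const size_t pointsPerClass = 50;

      // Two well-separated Gaussian clusters: class 0 near the origin,
      // class 1 shifted to (10, 10), so perfect classification is achievable.
      arma::mat data = arma::randn<arma::mat>(2, 2 * pointsPerClass);
      data.cols(pointsPerClass, 2 * pointsPerClass - 1) += 10.0;

      arma::Row<size_t> labels(2 * pointsPerClass);
      labels.subvec(0, pointsPerClass - 1).fill(0);
      labels.subvec(pointsPerClass, 2 * pointsPerClass - 1).fill(1);

      // ... train AdaBoost on (data, labels), predict on data, and require
      // 100% accuracy ...
      return 0;
    }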
< udit_s> So basically testing on datasets should suffice? Because, due to the mathematical nature of the algorithm, I can't really see any other way to test it.
< naywhayare> if the algorithm makes no theoretical guarantees that you can easily test, then testing on datasets is probably the only thing you can do
< udit_s> If possible, I could test the lemmas which accompany the algorithm.
< naywhayare> yes, that would be a good idea
< udit_s> Okay - let me see if I find any that can be properly tested.
< udit_s> And I think I found a paper which discusses weighted instances in Decision Stumps.
< udit_s> Basically modifying the gain value.
< naywhayare> can you send me a link to the paper?
sumedhghaisas has quit [Ping timeout: 240 seconds]
< udit_s> Yeah, but it's not what we want - it gives weight to attributes not instances - Nunez (1998).
< udit_s> But I'm unable to access this: http://dl.acm.org/citation.cfm?id=2116011
< udit_s> And I think this might be a very good lead.
< naywhayare> I'm trying to get a copy of the PDF for you
< udit_s> I was hoping you could.
< udit_s> naywhayare: Hold on! Actually, this one too talks about attributes and not instances.
< udit_s> naywhayare: So it won't work!
< udit_s> I think I'll try modifying the information gain function and extending it to weights.
< udit_s> I'll get back to you tomorrow on this. I've also pushed adaboost.hpp with the BibTeX citation...
udit_s has quit [Quit: Leaving]
< jenkins-mlpack> Starting build #2036 for job mlpack - svn checkin test (previous build: SUCCESS)
< jenkins-mlpack> Project mlpack - svn checkin test build #2036: SUCCESS in 1 hr 26 min: http://big.cc.gt.atl.ga.us:8080/job/mlpack%20-%20svn%20checkin%20test/2036/
< jenkins-mlpack> saxena.udit: BiBTex added to adaboost_impl.hpp