naywhayare changed the topic of #mlpack to: http://www.mlpack.org/ -- We don't respond instantly... but we will respond. Give it a few minutes. Or hours. -- Channel logs: http://www.mlpack.org/irc/
sumedh_ has quit [Ping timeout: 240 seconds]
witness___ has quit [Quit: Connection closed for inactivity]
andrewmw94 has quit [Quit: Leaving.]
Anand has joined #mlpack
Anand has quit [Ping timeout: 246 seconds]
zGz|govg has quit [Ping timeout: 240 seconds]
zGz|govg has joined #mlpack
zGz|govg has quit [Ping timeout: 248 seconds]
witness___ has joined #mlpack
zGz|govg has joined #mlpack
Anand has joined #mlpack
< marcus_zoq>
Anand: Hello, I think in line 32 of the 'LogisticRegression.java' file you should return the index.
< Anand>
Marcus: I didn't get you. Are you talking about weka?
< Anand>
Btw, I have fixed the error. Now all tests are succeeding
< Anand>
I have also added the modified linear regression files for scikit and mlpack
< Anand>
You can have a look
< marcus_zoq>
Anand: I'm talking about the 'maxProb()' function in the weka code. Right now the function returns the max prob, but we need the predicted class, right?
< Anand>
Ok, I got you. I have made the changes. But will this always work? I doubt it.
< marcus_zoq>
Anand: It should work, if you use the index of the max probability.
< Anand>
But can we always say that the index of the prob will directly denote its class? What if indices are 0,1,2 but classes are 1,2,3?
< Anand>
You got my point, right?
< marcus_zoq>
Anand: Yes, good point, we need to map the predicted classes.
< Anand>
But to map the classes, we need to know the classes first.
< Anand>
And for this, I guess there is no other way but to parse the true labels file once and retrieve all the classes
< Anand>
What do you say?
< marcus_zoq>
Anand: We can iterate through the train set (weka instances) and extract the classes with getClass.
< marcus_zoq>
Anand: That's what we do in the mlpack nbc code.
< Anand>
Ok. You mean data.instance(i).getClass() ?
< Anand>
Meanwhile, please try to build again and have a look at the other code I mentioned, whenever you get time. :)
< marcus_zoq>
Anand: Did you commit the mlpack linear regression?
< Anand>
Yes, I did. I have some doubts regarding the structure there. We never changed the structure for mlpack.
< marcus_zoq>
Anand: Okay, I'll take a look in a few minutes.
< Anand>
Ok, cool!
witness___ has quit [Quit: Connection closed for inactivity]
Anand has quit [Ping timeout: 246 seconds]
Anand has joined #mlpack
< Anand>
Marcus: Added the class mapping too! Have a look
Anand has quit [Ping timeout: 246 seconds]
zGz|govg has quit [Ping timeout: 245 seconds]
andrewmw94 has joined #mlpack
< naywhayare>
andrewmw94: ok, I did some thinking and I think I have an idea
< naywhayare>
it involves... a bit of refactoring. but it's not impossible
< naywhayare>
(at least, I think)
< naywhayare>
so I think the assumption to make is that when we insert a new point into a tree, its given index should be (new number of points in tree - 1)
< naywhayare>
we can have the Insert() function return a size_t with the index of the point, or, just make it clear what the index will be in the documentation
< naywhayare>
for the R tree, then, each node can hold a std::vector<arma::vec*> or arma::mat or some structure that only holds the points that are held in that particular node
< naywhayare>
and it can also hold a std::vector<size_t>, which holds the indices of each of those points
< naywhayare>
then, we use return type overloading to provide both 'size_t Point(const size_t i)' and 'arma::vec& Point(const size_t i)'
< naywhayare>
lastly, we change all the BaseCase() functions to have the signature BaseCase(const size_t queryIndex, const VecType& queryPoint, const size_t referenceIndex, const VecType& referencePoint)
< naywhayare>
and then modify the traversers for the slightly modified BaseCase() signature
< naywhayare>
what do you think of this idea? if you think it's terrible, that's okay :)
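A minimal C++ sketch of the node layout being proposed here (the names are illustrative, not mlpack's actual API). One caveat: C++ cannot overload on return type alone, so the two Point() accessors would have to differ by more than their return type; this sketch gives the coordinate accessor its own name.

    #include <mlpack/core.hpp>
    #include <vector>

    // Hypothetical R tree node storage: the node holds the coordinates of
    // its own points plus the dataset-wide index of each of those points.
    class RectangleTreeNode
    {
     public:
      // Dataset-wide index of the i'th point held in this node.
      size_t Point(const size_t i) const { return indices[i]; }

      // Coordinates of the i'th point held in this node.  Note this cannot
      // be an overload of Point() differing only in return type.
      arma::vec LocalPoint(const size_t i) const { return points.col(i); }

     private:
      arma::mat points;             // Only the points held in this node.
      std::vector<size_t> indices;  // Index of each column of 'points'.
    };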
< andrewmw94>
I'm not sure I understand it entirely, but it sounds very similar to what we have now
< andrewmw94>
with the addition of holding both the arma::vec* and the size_t in the leaf nodes
< andrewmw94>
I think we discussed this idea about a week ago
< andrewmw94>
let me see if I can find the logs
< naywhayare>
yeah, this is similar to a discussion we had about a week ago
< naywhayare>
the only real thing I came up with last night is the modification to BaseCase() and how Insert() should work
< naywhayare>
I have to walk to campus... I'll be back in about 15
< andrewmw94>
ok
< naywhayare>
alright, back
< andrewmw94>
ok
< andrewmw94>
So I think I can prove that it's not possible to store points the way you do it in Binary Space Trees and have insertion/deletion
< naywhayare>
yes, you are probably right about that
< naywhayare>
the key about the BinarySpaceTree point storage is just that points in a leaf are contiguous
< naywhayare>
if you were to store each point in an R tree leaf in a single arma::mat object (or something like that), you'd end up with much the same effect
< naywhayare>
unless I have overlooked something
< andrewmw94>
yeah. That's what it does now. You just lose the contiguity
< naywhayare>
well, it has contiguity at least for points in a single node
< andrewmw94>
inserting points can be done at the end. Deleting points would just require swapping the point in question with the last point (in the matrix) and then updating the last point's index in the tree
< naywhayare>
hm, I didn't think about what to do with the indices when a point is deleted
< andrewmw94>
The R tree?
< naywhayare>
yeah, I'm talking about the R tree
< andrewmw94>
currently, at least, it doesn't have any contiguity. And I don't think it's possible to have contiguity in one matrix and have arbitrary insertions/deletions
< andrewmw94>
we can get contiguity for about 15 points at a time by storing the matrices in the leaf nodes
< naywhayare>
yeah, and that's probably good enough contiguity
< andrewmw94>
(conveniently, the user can adjust that at will by changing the leaf size and minimumLeafFill)
< naywhayare>
I think when a user calls Delete() it should remove the requested point, but not update any indices
< naywhayare>
updating the indices could take forever...
< naywhayare>
so I guess when a user calls Insert(), if they have deleted lots of things from the tree, the index of the inserted point may be much larger than the number of points currently in the tree
< andrewmw94>
Well, that depends. Can I tell you my thoughts?
< naywhayare>
yeah, go for it
< andrewmw94>
Ok, so I can think of two options.
< andrewmw94>
1) We can store an arma::mat in each leaf. This gives us contiguity for about 15 points at a time. However, it is unclear how you index the points.
< andrewmw94>
2) We store all of the points in a central matrix, and each leaf stores indices to the points it holds. We have the cost of dereferencing, and if we can insert and delete points, we lose the ability to have the data matrix contiguous
< andrewmw94>
if it is not contiguous, we can insert points at the end, and we can delete points by swapping the last point for the deleted point in the central matrix, and then finding that point in the tree and updating it (searching for a point you have an exact match for should be very efficient, so I think that's ok)
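A sketch of the delete-by-swap idea from option (2), assuming the central dataset is a plain arma::mat (illustrative only, not mlpack code):

    #include <mlpack/core.hpp>

    // Remove column i by overwriting it with the last column and shedding
    // the last column; the point that lived at the last index must then be
    // re-indexed in the tree, as described above.
    void DeletePoint(arma::mat& data, const size_t i)
    {
      const size_t last = data.n_cols - 1;
      if (i != last)
        data.col(i) = data.col(last);  // Move the last point into the hole.
      data.shed_col(last);             // Drop the now-duplicate last column.
    }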
< andrewmw94>
Can you think of any other ways you could theoretically do it?
< naywhayare>
those are basically the two ways that I see
< naywhayare>
for idea (1), I think indexing the points can be done by having each leaf hold a std::vector<size_t>
< andrewmw94>
Also, as a side note, can I assume that there are never two points at the exact same location?
< naywhayare>
no, that's probably not a good assumption, unfortunately
< naywhayare>
there are some test datasets I use that can have duplicate points (I think the corel dataset does, actually)
< andrewmw94>
hmm. ok. I still think searching for a point which you have an exact match of will be fairly quick.
< andrewmw94>
but I'm not positive
< naywhayare>
the only thing is that if you delete a point and change the index of another, the user has to have some way to know what point's index changed and what it changed to
< naywhayare>
which is why I thought maybe it was better to just remove deleted points from the tree without modifying any indices
< naywhayare>
and then when you insert a new point, the root node has some variable which is tracking the next index it should use
< andrewmw94>
ahh, I didn't think of that. Why wouldn't the index of the next point always be the index of the column in the data matrix?
< andrewmw94>
(last column/column where the point was added)
< naywhayare>
that depends on whether you are using strategy (1) or (2)
< naywhayare>
if you're using strategy (2), then yeah, the index of the last column in the data matrix works; but then, insertion will require allocation of an entire new data matrix and copying of the whole thing
< andrewmw94>
well, I suspect that the best way to do it is to combine them. Strategy 1 has no obvious way to index points and allow arbitrary insertions/deletions
< andrewmw94>
if by index we mean a way to map it back to a central matrix
< andrewmw94>
but now I'm not sure if that's what you mean
< naywhayare>
by index I don't necessarily mean a way to map it back to a central matrix
< naywhayare>
what I mean is just some unique identifier for each point in the dataset, whether or not it is contiguous
< andrewmw94>
ok. That should be equivalent. The central matrix that is used now isn't contiguous.
< andrewmw94>
How much do you know about how arma::mat works? Do you know if a matrix stores each arma::vec?
< naywhayare>
no; arma::mat stores a contiguous block of memory, not individual arma::vec objects
< naywhayare>
I wrote a quick little program to demonstrate kind of what I have in mind when a user may be adding points to a tree
< andrewmw94>
ahh. I think that would work too. Is there a way to get the memory address of an element of the matrix?
< naywhayare>
I guess you could do mat.memptr() + row + mat.n_rows * col
< naywhayare>
but that's a little hackish-seeming when armadillo is meant to provide a nice interface to matrices so you don't have to do pointer arithmetic...
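For what it's worth, Armadillo also exposes per-column pointers via colptr(), which avoids most of the manual arithmetic; a small sketch (column-major layout, as Armadillo uses):

    #include <armadillo>

    // Two equivalent ways to get the address of element (row, col).
    const double* ElementAddress(const arma::mat& m,
                                 const size_t row, const size_t col)
    {
      const double* a = m.memptr() + row + m.n_rows * col;  // As above.
      const double* b = m.colptr(col) + row;                // Same address.
      return b;
    }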
< andrewmw94>
yeah. My idea is that we could use the memory address of the first row of each point (vector) as an input to a Map to get the index out. But if we have to know the index in the matrix, that doesn't work.
< naywhayare>
wouldn't just holding a std::vector<size_t> in each RTree node work?
< andrewmw94>
I think it should but I thought there was a problem with it when I suggested it last week. I'm looking through the IRC logs.
< andrewmw94>
14:05 < andrewmw94> that should work. We discussed it before and I thought we agreed it would work but it would have bad memory locality
< andrewmw94>
14:05 < naywhayare> it could have bad memory locality, yeah
< andrewmw94>
14:05 < andrewmw94> perhaps storing both the vectors and the size_t?
< andrewmw94>
14:06 < andrewmw94> it's not too much memory and would solve the locality issue
< andrewmw94>
14:06 < naywhayare> that won't help, it has the same memory locality problem
< andrewmw94>
14:06 < naywhayare> because you'll have to store a pointer to the vector
< andrewmw94>
14:06 < andrewmw94> oh, I forgot the base case will use the index rather than the stored point
< andrewmw94>
14:06 < andrewmw94> duh
< andrewmw94>
14:06 < naywhayare> either that or you store the vector itself, and that takes a ton of space
< andrewmw94>
14:07 < naywhayare> the NeighborSearchRules abstractions are modifiable, if you have an idea that I don't, so don't rule that out as a possibility
< andrewmw94>
14:07 < naywhayare> the key being that it can still work with the other types of trees
< andrewmw94>
14:08 < naywhayare> unfortunately making all these things work together can get really difficult :-S
< andrewmw94>
so if I understand your solution correctly, we do what I suggested and change the base case so that it takes both an index and a reference to arma::vec
< andrewmw94>
That seems like it should work. Or am I missing something?
< naywhayare>
yeah, I think that could work
< naywhayare>
it looks like a week ago I said that exactly what I'm proposing now was a bad idea
< naywhayare>
and now I'm saying exactly the opposite
udit has joined #mlpack
udit has quit [Client Quit]
< andrewmw94>
haha. I'll mention it in my blog post :P
< naywhayare>
and it also seems possible that I told you your idea had problems, and then a week later came back with the exact same idea
udit_s has joined #mlpack
< andrewmw94>
Is it ok if I store the whole matrix twice though?
< naywhayare>
hm, yeah, that is undesirable
< andrewmw94>
I mean, if I'm using pointers to arma::vec, I might as well just store size_t right?
< naywhayare>
I'm not sure what you mean, could you clarify?
< andrewmw94>
so I'm going to store a vector<size_t> to map to the original indices
< andrewmw94>
and then we also want to store a vector<arma::vec> or a vector<arma::vec*> to handle the points
< andrewmw94>
but I don't think the vector<arma::vec*> is much better than just using the vector<size_t> to find the points
< naywhayare>
I think you could just store an arma::mat instead of a vector<arma::vec> or vector<arma::vec*>, and resize it as necessary
< naywhayare>
I was also struck with another idea
< naywhayare>
suppose we created another matrix class and shoved it into the Armadillo namespace 'cause why not... we could call it 'arma::split_mat' or something, and it wouldn't hold contiguous memory, but instead a vector of vectors
< naywhayare>
this means insert_cols() is O(1) (amortized) and so is shed_cols()
< naywhayare>
this also means we get to have just one matrix object, and we don't have to modify our abstractions
< naywhayare>
it would need to imitate the Armadillo API, but that's not too hard to do, I don't think; we just need col(), row(), element-wise operators, and other things like that
< naywhayare>
this is kind of a half-baked idea right now. there are probably lots of things I am overlooking
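A very rough sketch of what such a split_mat could look like (entirely hypothetical; nothing like this exists in Armadillo or mlpack):

    #include <armadillo>
    #include <cstddef>
    #include <vector>

    // Hypothetical non-contiguous matrix: each column is a separate
    // arma::vec, so appending a column is amortized O(1) and removing one
    // never copies element data, only moves column handles.
    class split_mat
    {
     public:
      arma::uword n_rows = 0;
      arma::uword n_cols = 0;

      const arma::vec& col(const arma::uword i) const { return cols[i]; }
      arma::vec& col(const arma::uword i) { return cols[i]; }

      void insert_col(const arma::vec& v)
      {
        n_rows = v.n_elem;
        cols.push_back(v);  // Amortized O(1); existing columns don't move.
        ++n_cols;
      }

      void shed_col(const arma::uword i)
      {
        cols.erase(cols.begin() + static_cast<std::ptrdiff_t>(i));
        --n_cols;
      }

     private:
      std::vector<arma::vec> cols;
    };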
< andrewmw94>
So this would replace the central matrix or both of them?
< naywhayare>
what do you think? hopefully I communicated the idea at least semi-coherently
< naywhayare>
this would replace the central matrix
< naywhayare>
the whole tree would only need to hold one of these objects
< naywhayare>
and indexing is no longer a problem, since we have just one matrix
< andrewmw94>
and the main advantage is that adding a column to an arma::mat is really slow?
< naywhayare>
yeah; insert_cols() for arma::mat is devastating if the matrix is large
< andrewmw94>
ok. Then I think I understand and I think it's a good idea.
< naywhayare>
so, in this case, write your tree to hold one central matrix, and insert points into it using insert_cols()
< andrewmw94>
ok sounds good
< naywhayare>
if the user is planning to insert lots of points, they should use MatType = arma::split_mat (or whatever we call it)
< naywhayare>
otherwise they can use arma::mat and get memory locality bonuses
< naywhayare>
we (or I) can worry about implementing arma::split_mat later
< andrewmw94>
alright
< naywhayare>
if you wanted to implement it, feel free, but maybe it's a little outside the scope of your project, so that one's up to you
< naywhayare>
almost certainly Conrad Sanderson (the Armadillo guy) will never accept the matrix type as a patch
< naywhayare>
but that's why we have src/mlpack/core/arma_extend/ :)
< andrewmw94>
haha. We'll have to put the lunar cycle stuff in there sometime too.
< naywhayare>
n
< naywhayare>
oops, IRC is not gdb
< andrewmw94>
I'm rather surprised that CMake didn't think it was an issue. I guess it is pretty easy to work around, but you would think using the whole file name would be an obvious thing to do.
< naywhayare>
I imagine whoever wrote the code just needed to get it done fast and thought "hopefully people don't have duplicate filenames"
< naywhayare>
I don't think it would be too hard to dig into the CMake codebase and figure out a solution, but that is absolutely not on my list of "things I am enthusiastic about doing"
< naywhayare>
we have a workaround, at least, so that is good
< naywhayare>
I also get the impression that the CMake developer team is quite a bit overworked, judging by the number of backlogged bugs...
< andrewmw94>
yeah. It's a popular project.
Anand has joined #mlpack
< Anand>
Marcus, I fixed the weka logistic regression code as per your suggestion. However, I don't understand what you wanted to convey regarding the mlpack linear regression code.
< udit_s>
naywhayare: Have you had time to look at the mail I sent you and Marcus?
< udit_s>
naywhayare: Also, let's talk about the Perceptron Code Review when you're free today?
< naywhayare>
udit_s: I saw the mail you sent. I hope you are feeling better! I've been looking through the perceptron code, so I'll let you know when I am done
< naywhayare>
currently I am writing one more test for the decision stump
< udit_s>
naywhayare: Thanks. Almost better. About the decision stump, a test for what? Anything I can help with? Were you satisfied with the decision stump code changes I had done?
Anand has quit [Ping timeout: 246 seconds]
< naywhayare>
yeah, the changes were fine. I just wanted to write one more that I was thinking of, it's no problem
< naywhayare>
udit_s: can you tell me how you generated the dataset for the CorrectAttributeChosen test?
Anand has joined #mlpack
< udit_s>
naywhayare: I picked up that data set from the example we were discussing.
< naywhayare>
the document there says that Attribute1 is the best attribute to split on, but in your test you are checking that Attribute2 is the one that is split on
< naywhayare>
I think this maybe was a simple zero-indexing off by one error :)
oldbeardo has joined #mlpack
< oldbeardo>
naywhayare: hey, needed some help
< naywhayare>
oldbeardo: I'm about to step out for lunch, but maybe it's quick?
< oldbeardo>
yeah, I wrote a template specialization for SGD<>::Optimize
< oldbeardo>
I get this error
< oldbeardo>
error: no matching function for call to ‘mlpack::svd::RegularizedSVDFunction::Gradient(arma::mat&, size_t&, arma::mat&)
< oldbeardo>
I have commented the Gradient() function out
< oldbeardo>
but it is not using the specialization for some reason
< naywhayare>
do you mean that you have commented the call to the Gradient() function out inside of your SGD<>::Optimize specialization?
< naywhayare>
can you show me the code you used to declare the specialization?
< naywhayare>
just as a test, can you try putting it into sgd_impl.hpp? make it an inline function so that the compiler doesn't complain about it being defined multiple times
< naywhayare>
you might have to add an #include <../regularized_svd_function.hpp> at the top of that file just to make it work
< udit_s>
naywhayare: have a look at how they have arrived at that. It isn't an off-by-one error.
< oldbeardo>
naywhayare: /usr/local/include/mlpack/core/optimizers/sgd/sgd_impl.hpp: In member function ‘double mlpack::optimization::SGD<DecomposableFunctionType>::Optimize(arma::mat&) [with DecomposableFunctionType = mlpack::svd::RegularizedSVDFunction, arma::mat = arma::Mat<double>]’
< udit_s>
naywhayare: The way they split the second column is different than the way we're doing it.
< oldbeardo>
naywhayare: it is using the Optimize already present in the library
< udit_s>
Needless to say, I checked the maths behind our method for inpBucketSize = 3,4,5 and it was the second column which was the best split.
< naywhayare>
udit_s: your math must be backwards; it is picking the dimension with the worst split, not the best split
< naywhayare>
take a look at this test I have written:
< naywhayare>
actually, hang on, this will be easier if I check it in
< naywhayare>
give me just a moment
< naywhayare>
oldbeardo: I understand it is using the wrong one, but put your specialization into sgd_impl.hpp to see if it picks up the correct one
< oldbeardo>
naywhayare: okay, I'll try that
< udit_s>
naywhayare: Okay. calculateEntropy returns the least entropy (most negative) according to our method for the second attribute, somewhere around -0.089...
< udit_s>
what is the test that you have written?
< naywhayare>
udit_s: okay, I checked in a new test in r16793; could you take a look at it?
< naywhayare>
I have made a 4-dimensional dataset
< udit_s>
yeah, hang on.
< naywhayare>
the observations in each dimension come from two gaussians
< naywhayare>
and in some dimensions the gaussians are overlapping, and in others they aren't
< naywhayare>
so the decision stump should choose to split on the dimension with gaussians that are furthest apart
< naywhayare>
but what I found was that it always split on the dimension with the closest gaussians... the worst dimension
< naywhayare>
the first couple diffs in decision_stump_impl.hpp
< naywhayare>
I think your calculation might actually be returning information gain, which is negative entropy. I'm not certain, which is why I'm asking you to take a look
< naywhayare>
but regardless of whether or not my change is correct, I think something was still wrong either way
< naywhayare>
I need to get lunch now... I am starving
< naywhayare>
oldbeardo: hopefully that idea will work. I will try and think about more things that the problem could be, but my best guess is that the compiler is not finding your specialization where it needs to find it
< naywhayare>
I'll have to think about where it should actually go, or why this is even a problem. C++ can be very weird sometimes...
< oldbeardo>
naywhayare: okay, but the same abstraction works for LRSDP, that's why I asked you
< naywhayare>
yeah, which is why I thought it might have to do with the file that it's in
< naywhayare>
do you have regularized_svd_function.cpp in the CMakeLists.txt?
< jenkins-mlpack>
Starting build #2003 for job mlpack - svn checkin test (previous build: SUCCESS)
< oldbeardo>
I'm not working in the source code, I have a separate folder for Reg SVD
< naywhayare>
oldbeardo: can you tell me how you are compiling your code and which command is producing the error?
< udit_s>
naywhayare: If you look at the diff, you've changed the initial value of bestEntropy to -DBL_MAX and then are looking for the maximum entropy by changing the 'if' condition statement from '<' to '>'.
< udit_s>
But if you look at the value of entropy you are returning, it is actually sum(p(x)*log(p(x))) and not -sum(p(x)*log(p(x)))
< udit_s>
Hence you're getting the worst split and not the best split.
< udit_s>
I feel that is what you might have overlooked.
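Spelling out the sign convention at issue (standard definitions, nothing mlpack-specific): entropy is non-negative,

    H(S) = -\sum_c p_c \log_2 p_c \ge 0,

and a stump should pick the attribute that minimizes the weighted child entropy, or equivalently maximizes the information gain

    G(A) = H(S) - \sum_i \frac{|S_i|}{|S|} H(S_i).

A function that instead returns \sum_c p_c \log_2 p_c = -H(S) and is then minimized will select the attribute with the largest entropy, i.e. the worst split.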
< naywhayare>
udit_s: sorry, maybe I should have clarified more
< naywhayare>
if you revert the changes I made to the selection algorithm
< naywhayare>
i.e. change > back to < and change -DBL_MAX to DBL_MAX
< naywhayare>
then the stump selects the worst splitting attribute on the test I wrote
< naywhayare>
but the changes I made cause the test to pass
< naywhayare>
so even if what I have changed it to is wrong (which it may very well be), I think there was a problem before I changed things too
< udit_s>
hmm... I'm trying with different bucket sizes, but I don't think that makes a difference.
< oldbeardo>
naywhayare: I compile the code using 'g++ test.cpp regularized_svd_function.cpp -lmlpack -larmadillo'
Anand has quit [Ping timeout: 246 seconds]
< naywhayare>
oldbeardo: did putting the specialization into sgd_impl.hpp and marking it inline help?
< naywhayare>
udit_s: I tried different bucket sizes but that didn't seem to make much of a difference
< oldbeardo>
naywhayare: didn't try it yet, had gone for a bath
< naywhayare>
ah, ok
< oldbeardo>
naywhayare: I'm going to include everything into the source and then try, it may be easier that way
< naywhayare>
udit_s: I noticed that when I printed the entropy, it was ordered backwards -- the best dimension had smallest (most negative) entropy, the next best dimension had next smallest, etc., and the worst dimension had entropy closest to 0
< udit_s>
naywhayare: hang on, let me check something.
oldbeardo has quit [Ping timeout: 246 seconds]
sumedhghaisas has joined #mlpack
< sumedhghaisas>
naywhayare: free for some time??
oldbeardo has joined #mlpack
< oldbeardo>
naywhayare: I built the source with that part of the code commented out
< jenkins-mlpack>
* Ryan Curtin: Another test to make sure the correct splitting attribute is used.
< jenkins-mlpack>
* Ryan Curtin: Fix some formatting, fix backwards entropy splitting, add getters/setters, and
< jenkins-mlpack>
comment a little bit about the internal structure of the class.
< naywhayare>
oldbeardo: but you are trying to use SGD; I don't understand what you mean
< naywhayare>
sumedhghaisas: I am here, what do you need?
< oldbeardo>
naywhayare: now when I include the code, I get the following error
< naywhayare>
oh, ok, you weren't done typing yet, sorry about that
< oldbeardo>
for some reason my messages aren't going through
< oldbeardo>
expected initializer before ‘<’ token
< naywhayare>
you probably need to include regularized_svd_function.hpp, or make a forward declaration of the RegularizedSVDFunction class
< sumedhghaisas>
with the use of reverseStepTolerance, sometimes the local optimum is not returned... like RMSE reaches 0.91, but then increases to 0.98, cause fewer than 3 jumps are allowed... so it takes 2 jumps... then again decreases... again 2 jumps... can we somehow store the matrices associated with the least RMSE...
< naywhayare>
sumedhghaisas: sometimes that is necessary, to keep the local optimum. I guess there is no other choice if reverseStepTolerance > 1
< naywhayare>
oldbeardo: unfortunately that's not enough information for me to be able to figure it out. can you send me a copy of the file you are using or something like that?
< oldbeardo>
naywhayare: okay, I'll send you all the files
< sumedhghaisas>
oldbeardo: sometimes I get that error cause of interlinked includes...
< sumedhghaisas>
naywhayare: so is it okay to not return that minimum value??
< naywhayare>
sumedhghaisas: what do you mean? it sounds like the best thing to do is store the best matrix, and return that when the algorithm has converged
< naywhayare>
sorry, the best W and H, I mean
< sumedhghaisas>
ohh okay... I thought you were against storing those matrices... sorry...
< oldbeardo>
naywhayare: I sent you the files
< naywhayare>
well, in this case we don't have a choice
< oldbeardo>
sumedhghaisas: I will check that out
< naywhayare>
but I would like to do this only in the ValidationRMSETermination policy
< naywhayare>
so the bestW and bestH matrices should be stored there, not in AMF
< naywhayare>
because not all termination policies require holding the bestW or bestH matrices
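A sketch of that bookkeeping inside a termination policy (class and variable names are illustrative, not mlpack's actual code):

    #include <mlpack/core.hpp>
    #include <limits>

    // Hypothetical fragment of a termination policy that remembers the
    // best factorization seen so far, so AMF itself never stores it.
    class BestFactorization
    {
     public:
      void Update(const double rmse, const arma::mat& W, const arma::mat& H)
      {
        if (rmse < bestRMSE)
        {
          bestRMSE = rmse;
          bestW = W;  // Copies happen only when a new minimum is found.
          bestH = H;
        }
      }

      double bestRMSE = std::numeric_limits<double>::max();
      arma::mat bestW, bestH;
    };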
< sumedhghaisas>
oldbeardo: I am not sure... But when SGD is not defined C++ will give such an error...
< sumedhghaisas>
naywhayare: ummm... both validation RMSE and simple tolerance...
< naywhayare>
oldbeardo: I need the test file too
< sumedhghaisas>
naywhayare: Okay I will figure out some way...
< naywhayare>
sumedhghaisas: you're right, we will need to modify both of those
< naywhayare>
sumedhghaisas: thank you; I don't think it should be too hard to do that
< oldbeardo>
naywhayare: got it?
< naywhayare>
oldbeardo: yes, thank you
< udit_s>
naywhayare: I wrote one more test. And a little change/improvement. I think I have it, but I'm just going to be sure, and get back to you in a moment.
< naywhayare>
udit_s: thank you for looking into that. let me know when you check it in
< naywhayare>
oldbeardo: I added #include <mlpack/core/optimizers/sgd/sgd.hpp> to the top of regularized_svd_function.cpp and it seems to be getting further
< oldbeardo>
naywhayare: ah, I had commented that out
< naywhayare>
but the errors are different now, and are not the same ones you were getting with an undefined reference to RegularizedSVDFunction::Gradient
< oldbeardo>
yeah, I got those, have no idea what they mean
< naywhayare>
the issue is that you are trying to access local variables of RegularizedSVDFunction, but you are inside of the class SGD
< naywhayare>
so you may need to refactor a little bit so that you can access the lambda parameter in RegularizedSVDFunction
< naywhayare>
maybe provide RegularizedSVDFunction::Lambda()
< oldbeardo>
I'm also getting errors like these -> /home/mlpack/trunk/src/mlpack/../mlpack/core/optimizers/sgd/sgd.hpp:85:13: error: ‘size_t’ does not name a type
< naywhayare>
now there is an interesting one
< naywhayare>
what is the first error in that chain of errors? or is that the only one?
< oldbeardo>
this is the first one
< naywhayare>
oh... <mlpack/core.hpp> wasn't included in sgd.hpp
< naywhayare>
just add that to your list of includes, before including sgd.hpp
< naywhayare>
I'm going to modify sgd.hpp to include that
< naywhayare>
(committed in r16794)
< oldbeardo>
right, now I get the manageable errors, I will get back in some time
< andrewmw94>
naywhayare: I have another question that can hopefully resolved with a C++ feature. The RectangleTree class takes several template parameters: SplitType and DescentType are the ones I need to change for the R* trees and X trees. I want to pass a RectangleTree<> class to methods in each of these (SplitType and DescentType) classes. However, as far as I know, that means these classes need to have templates for each other, which put
< andrewmw94>
How do you deal with stuff like that?
< naywhayare>
template<typename TreeType> class SplitType ?
< naywhayare>
the second is better if you only need the tree for the constructor
< andrewmw94>
But if I do that, when I declare an R Tree, wouldn't I need to do: RectangleTree<SplitType<RectangleTree<SplitType...
< jenkins-mlpack>
Starting build #2004 for job mlpack - svn checkin test (previous build: SUCCESS)
< naywhayare>
in the second case, no, the RectangleTree<...> gets inferred by the compiler
< naywhayare>
also, I think we have some confusion, maybe
< naywhayare>
in the changeset you just committed, it seems like the R tree nodes now each hold their own dataset
< andrewmw94>
they each hold a reference to the central dataset and a local dataset
< naywhayare>
ok; can you clarify what each is for?
< naywhayare>
I know we've gone back and forth on this several times, so confusion is likely (especially since I'm involved and I can never keep anything straight)
< andrewmw94>
The central dataset is the original matrix, so it is used to keep track of the indices
< andrewmw94>
The local dataset holds say 8 to 20 points contiguously
< naywhayare>
oh, ok, I see
< andrewmw94>
It should be possible to remove the central dataset entirely I think, but I kind of want to keep it because I have the start of an idea of how to make it contiguous
< naywhayare>
so when tree.Insert(vec&) is called, then insert_cols() is called on the main dataset, and the local dataset has the vector appended?
< andrewmw94>
(continuing my previous comment) if the user specifies that they don't want to have a dynamic tree
< naywhayare>
I guess I was under the impression that the best thing we can do is store one central matrix, and later write an arma::split_mat class that holds a vector of arma::vec
< andrewmw94>
(in answer to yours) currently, you would add the datapoint to the matrix yourself, and then call tree.Insert(index) to insert it
< naywhayare>
okay
< andrewmw94>
Yeah, I think the arma::split_mat idea is good. But what if we had the option to build a static tree? Then the central dataset can be remapped in a reasonable time at the end of tree construction I think.
< naywhayare>
true
< naywhayare>
maybe this is something the user should do after construction? some function like RectangleTree::LinearizePoints(std::vector<size_t>& oldFromNew /* the mappings */)
< naywhayare>
and it would only really make sense for the user to call that if they were using a static tree with arma::mat not arma::split_mat
< naywhayare>
or we could even use templates to detect when arma::mat is being used and then assume that the user isn't interested in inserting/deleting points, and linearize it
< andrewmw94>
Yeah. That's the idea. I'm not sure how to handle the point remapping in treetraits then
< andrewmw94>
since it is sometimes true sometimes false
< andrewmw94>
I guess you could just say it modifies the dataset and have it be slightly slower if it really didn't
< naywhayare>
hm, we'll have to make a decision... if arma::mat always implies remapping of points, then we just partially specialize TreeTraits to the case where RectangleTree is using arma::mat
< naywhayare>
I think that's reasonable to say, because if you're using arma::mat it doesn't really make any sense to insert and delete since it will take forever to call insert_cols() on the big dataset
< naywhayare>
(in fact we could even use templates to disable Insert() and Delete() when arma::mat is being used, or issue some huge warning or something)
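A sketch of that partial specialization (the RectangleTree template parameters here are guesses, and this is a simplified stand-in for mlpack's actual TreeTraits):

    #include <mlpack/core.hpp>

    // Suppose, hypothetically, the tree is parameterized like this:
    template<typename SplitType, typename DescentType, typename MatType>
    class RectangleTree;

    // Generic traits: assume nothing about dataset rearrangement.
    template<typename TreeType>
    class TreeTraits
    {
     public:
      static const bool RearrangesDataset = false;
    };

    // With a plain arma::mat, treat the tree as static and allow the
    // dataset to be remapped (linearized) after construction.
    template<typename SplitType, typename DescentType>
    class TreeTraits<RectangleTree<SplitType, DescentType, arma::mat> >
    {
     public:
      static const bool RearrangesDataset = true;
    };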
< andrewmw94>
yeah, I think this should work.
< naywhayare>
I think we can worry about the mapping later, though; that should be straightforward to work in according to the ideas you've proposed after we have the whole thing working and cleaned up
< naywhayare>
also, I got to the bottom of the kd-tree issue
< naywhayare>
I traced out the entire path down to the query node and reference node, then kept track of all the relevant pruning decisions and Score() calls (and some other ones too...)
< naywhayare>
I ended up collecting way more information than necessary, but I don't ever mind because it often results in interesting insights
< naywhayare>
for instance, note that for whatever reason, on this dataset, the children at the top of the tree are very unbalanced
< naywhayare>
I've written down the first point in the node, and the count of points in the node
< naywhayare>
so the first left child has 36858 points... and the first right child has about 900
< naywhayare>
the next right child has about 200
< naywhayare>
and it continues like that for 30 or 40 levels of the tree, with the right child having very few points in it
< andrewmw94>
interesting.
< andrewmw94>
I guess there's always the possibility of getting a skewed tree. But that's a really big difference
< naywhayare>
using median split instead of mean split would produce a perfectly balanced tree, but I wonder how it would perform in practice in comparison to mean split
< naywhayare>
I think the corel dataset is special because it appears to be a sparse dataset
< naywhayare>
so points only have nonzero values in a handful of the 32 dimensions
< naywhayare>
my guess is that the mean in a dimension ends up being a very small number, with the majority of points having values of 0
< naywhayare>
so there are tons of points to the left of the mean, but only a handful of outliers to the right
< andrewmw94>
ahh.
< andrewmw94>
Perhaps it would work better to sample n points randomly and use the median of those?
< andrewmw94>
but that still doesn't deal with selecting a dimension
< andrewmw94>
I guess you could choose the dimension that has the most even split with that sample
< naywhayare>
the dimension is selected as the dimension with maximum variance; but the actual splitting is done in the MeanSplit class
< naywhayare>
and it's templatized, so it'd be really easy to figure something else out
< naywhayare>
I suppose the dimension selection could be templatized too...
< naywhayare>
either way, the actual bug doesn't have anything to do with the weird splits, it's an invalid assumption in neighbor_search_rules_impl.hpp
< naywhayare>
I haven't yet figured out the best way to fix it
< andrewmw94>
I would actually move it to the SplitType code. (the dimension selection)
< naywhayare>
anyway, back to the original topic, any confusion on how we should proceed with the RectangleTree?
< naywhayare>
and yeah, maybe the dimension selection should be in the SplitType code
< andrewmw94>
I'm still not sure on how to do the templatized function for TreeDescent.
< andrewmw94>
It isn't a constructor, so I don't think the second option would work.
< naywhayare>
hm, let me think for a minute
< naywhayare>
could you just templatize all the functions? template<typename TreeType> RTreeSplit::SplitLeafNode(TreeType* node)
< naywhayare>
and assume that TreeType is RectangleTree<...>
< andrewmw94>
Basically, I want to pass a node of the tree to the function DescentType::EvalNode() so that it can look at the children. The problem is that if I templatize DescentType, then I need to include SplitType in the template. And SplitType needs to have DescentType included in its template. I can do what you suggested there (I think, I'm still not fluent with templates) but when I declare the tree won't I have issues?
< naywhayare>
specifically, line 12 -- you don't need to explicitly specify the type to the template function DoSomeThings(), because the compiler can already deduce it from the argument passed to DoSomeThings()
< naywhayare>
so this is basically the same idea... let the compiler deduce the tree type by what's passed to SplitNode::SplitLeafNode() and DescentType::EvalNode()
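A minimal sketch of that deduction (hypothetical names):

    // The split policy never names the tree type; any TreeType works.
    class RTreeSplit
    {
     public:
      template<typename TreeType>
      static void SplitLeafNode(TreeType* node)
      {
        // ... split 'node' using only TreeType's interface ...
      }
    };

    // At the call site the compiler deduces TreeType from the argument,
    // so declaring the tree needs no circular template arguments:
    //   RectangleTree<RTreeSplit, RTreeDescent> tree(dataset);
    //   RTreeSplit::SplitLeafNode(&tree);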
< andrewmw94>
alright sounds like it should work.
< naywhayare>
it should, and it will make all your function signatures a lot cleaner too :)
< naywhayare>
let me know if you have any problems with that approach
< andrewmw94>
ok thanks.
< naywhayare>
udit_s: you said you had added a new test to the decision stump?
< udit_s>
yeah, I just pushed the code, r16796
< udit_s>
so basically what we were calculating was gain. So I changed the constructor to calculate the least negative gain (take care of the minus sign), hence the gain < bestgain condition.
< naywhayare>
ah, ok! I had thought maybe the issue was something like that, but I wasn't exactly sure
< udit_s>
I was going through the Classify function.
< udit_s>
and I couldn't understand line 93.
< udit_s>
why are you taking the row as the splitAttribute row from test?
< naywhayare>
that's the value that we need to place in one of the bins
< naywhayare>
for test point i, we want the attribute (or dimension) splitAttribute, which is the attribute/dimension that the stump is built on
< naywhayare>
or did I misinterpret what that should be?
< naywhayare>
(I made a very simple style fix in r16797, by the way)
< udit_s>
Yeah, I had left it there for your clarity. Btw, what is the analogue of git diff in svn?
< naywhayare>
svn diff :)
< naywhayare>
you can specify revision numbers too... svn diff -rXXXXX
< naywhayare>
that will give you the changes between revision XXXXX and trunk
< naywhayare>
or you can get the changes between revisions X and Y... svn diff -rX:Y
< naywhayare>
there may be a lot of lines that look identical in diffs... those are usually my editor stripping whitespace from the end of lines
< udit_s>
oh. okay.
< sumedhghaisas>
naywhayare: okay making a copy at each iteration is very costly...
< sumedhghaisas>
so I implemented a different solution...
< sumedhghaisas>
every time the index starts to increase I make a copy in the termination policy...
< sumedhghaisas>
and keep a note that copy is made...
< naywhayare>
udit_s: my last comment (at least for now) about the decision stump is that the if statement at line 47 (if (isDistinct<double>(data.row(i)))) is checking for a situation that will happen extremely infrequently with continuous data
< naywhayare>
so I think that's going to slow things down a lot; it's an O(N) loop each time it's called
< naywhayare>
sumedhghaisas: ok, that sounds good
< udit_s>
( :D - "at least for now" )
< udit_s>
okay, but it is an edge case, is it not?
< naywhayare>
what do you mean? will SetupSplitAttribute() have problems if all the points are the same?
< sumedhghaisas>
naywhayare: then some case handling here and there... and we can save the minimum with a minimal number of copies... :)
< sumedhghaisas>
GroupLens takes only 2 copies...
< naywhayare>
sumedhghaisas: great, good to hear that
< sumedhghaisas>
but only one thing... I am storing W and H after the increase is detected...
< udit_s>
naywhayare: Ah- that, no it won't.
< sumedhghaisas>
so the minimum is not stored...
< sumedhghaisas>
but the jumps are not that high... so it's almost equivalent to the minimum...
< naywhayare>
udit_s: ok; so I think we can do one of two things here: we can drop the call to isDistinct<> since it happens so infrequently, or, we can rewrite isDistinct to take a lot less time, something like this:
< naywhayare>
double val = featureRow[0]; for (size_t i = 1; i < featureRow.n_elem; ++i) { if (featureRow[i] != val) return false; } return true;
< naywhayare>
(sorry that it's all on one line, I just did that to keep it compact for IRC)
< naywhayare>
the cost should be two comparisons for most datasets (it gets to the i = 1 iteration of the for loop, and featureRow[0] != featureRow[1], so it terminates)
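The same check, expanded for readability (a sketch; the real mlpack signature may differ, and note the sense: it returns true when every value is identical, which is the case the call site guards against):

    #include <cstddef>

    // Early-exit check: true only if every element of featureRow equals
    // the first.  For continuous data this usually exits after comparing
    // the first two elements, instead of a full O(N) pass.
    template<typename RowType>
    bool AllValuesIdentical(const RowType& featureRow)
    {
      const double val = featureRow[0];
      for (size_t i = 1; i < featureRow.n_elem; ++i)
        if (featureRow[i] != val)
          return false;
      return true;
    }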
< naywhayare>
sumedhghaisas: that's okay, I think that's still better than copying it every single time
< udit_s>
yeah, we could do that. This would be definitely better (faster) than what we have now.
< naywhayare>
ok, would you like to make that change?
< naywhayare>
also we should change it from isDistinct to IsDistinct so it's in line with the rest of mlpack function names (I know, it is a very trivial change... :))
< udit_s>
Yeah, I'll do that.
< naywhayare>
ok, thank you
< udit_s>
Anything else? Or will you look at the Perceptron after this?
< oldbeardo>
naywhayare: when I include the specialization in sgd_impl.hpp I get this error : /home/mlpack/trunk/src/mlpack/../mlpack/core/optimizers/sgd/sgd_impl.hpp:117:12: error: ‘svd’ is not a member of ‘mlpack’
< naywhayare>
udit_s: yeah, I am looking at the perceptron now. usually I think it's easier to do a code review in email, so I am writing one up and I'll send it to you (and CC marcus) soon
< naywhayare>
oldbeardo: I thought you had it working with the specialization in regularized_svd_function.cpp ?
< naywhayare>
I got it to compile and run after fixing the issues where SGD was trying to access internal variables of RegularizedSVDFunction
< naywhayare>
(or, more specifically, I just removed those bits to make sure I could get it to compile...)
< oldbeardo>
well, the source code builds, but the test function does not
< naywhayare>
really? I had it compiling fine... let me send you what I have
< oldbeardo>
it gives the same Gradient() missing error
< naywhayare>
ok, sent... be careful, the overload is no longer correct because I simply removed the lambda parameter from the two lines it appeared in
< naywhayare>
and it compiled fine; you'll have to change the name of the directories given to -I and -L, but I think that should work for you
< oldbeardo>
naywhayare: the code you sent me has the optimizer as L_BFGS, the errors arise when SGD is used as AugLagrangian is used for LRSDP
< naywhayare>
oh. I see...
< naywhayare>
hang on...
< oldbeardo>
I added this in the initialization list in regularized_svd_impl.hpp -> rSVDFunc(data, rank, lambda), optimizer(rSVDFunc, 0.01, iterations * data.n_cols),
< oldbeardo>
and declared the same in the header file
< naywhayare>
okay, now I am reproducing the same problem you were seeing
< oldbeardo>
yeah, this doesn't make sense to me, the same thing works for LRSDP
< naywhayare>
okay, I made it compile, by providing a forward declaration of the specialization in regularized_svd_function.hpp
< naywhayare>
I also had to include sgd.hpp in that file
< oldbeardo>
okay, let me try it out
< oldbeardo>
how do you declare a specialization?
< naywhayare>
the same way you declare a function --
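naywhayare's reply trails off above; for reference, a forward declaration of a full member-function specialization looks roughly like this (the Optimize() signature is taken from the compiler error quoted earlier; treat the rest as a sketch):

    #include <mlpack/core.hpp>
    #include <mlpack/core/optimizers/sgd/sgd.hpp>

    namespace mlpack {
    namespace svd { class RegularizedSVDFunction; /* forward declaration */ }
    namespace optimization {

    // Declaring (not defining) the specialization in a header tells every
    // translation unit not to instantiate the generic SGD<>::Optimize for
    // this type; the definition then lives in one .cpp file.
    template<>
    double SGD<mlpack::svd::RegularizedSVDFunction>::Optimize(
        arma::mat& iterate);

    } // namespace optimization
    } // namespace mlpack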
< jenkins-mlpack>
* andrewmw94: R tree now has dataset and indices
< jenkins-mlpack>
* Ryan Curtin: Include mlpack/core.hpp.
< jenkins-mlpack>
Starting build #2005 for job mlpack - svn checkin test (previous build: SUCCESS)
< naywhayare>
did you use the version I sent you? I knowingly broke that one just to make it compile
< naywhayare>
the specialization is way wrong because it doesn't use the lambda parameter of the RegularizedSVDFunction at all
< oldbeardo>
yes, but the error is for index out of bounds
< oldbeardo>
not a performance issue
< naywhayare>
was it preceded by this message? [WARN ] Cannot open file 'GroupLens100k.csv'; load failed.
< naywhayare>
I didn't put that file in the tarball I sent you
< oldbeardo>
heh, no, I put the file in the folder
< naywhayare>
I would start over with the implementation you sent me earlier today... any changes I made were only to make it compile, and like I said, I probably broke the whole thing
< oldbeardo>
yeah, I will try it out with the source version
< udit_s>
naywhayare: I've committed changes in r16801. So that's Decision Stump, for now.
< naywhayare>
thanks! I think I am just about done with the code review
< naywhayare>
the perceptron is way simpler than the decision stump
< udit_s>
Yeah. It is. I don't think I really grasped how complex the decision stump could be when I proposed a time line for it. :)
< udit_s>
And, now I'm off to sleep. I'll look at the mail when I wake up tomorrow. Thanks for everything today.
udit_s has quit [Quit: Leaving]
< oldbeardo>
naywhayare: the mistake was in the new Optimize() function
< oldbeardo>
rating = data(2, i)
< oldbeardo>
the 'i' should have been 'currentFunction'
< naywhayare>
ah, glad that you found the issue
< oldbeardo>
naywhayare: the implementation works, though it takes as much time as the L_BFGS implementation
< jenkins-mlpack>
Starting build #2006 for job mlpack - svn checkin test (previous build: SUCCESS)
< sumedhghaisas>
naywhayare: okay that problem is fixed... committing that code...
< sumedhghaisas>
how to test SVDIncremental learning? The regularization is the same as SVDBatch... so no need to test that?
< naywhayare>
even if the regularization is the same, there's no guarantee (in the future) that it'll be implemented exactly the same as SVDBatch, so you should write a test for it too
< naywhayare>
you can reuse many of the test cases from SVDBatch, but you will probably have to change the RMSE that it converges to, because it's a slightly different algorithm
< sumedhghaisas>
okay... so 2 tests...
< sumedhghaisas>
negative element test and regularization test...
< naywhayare>
we should also have a test to make sure it converges for simple datasets
< naywhayare>
you could randomly generate simple datasets or load them
< naywhayare>
there should also be a test like that for SVDBatch
< sumedhghaisas>
generating random matrix would be better for this test...
< naywhayare>
if you want it to be sparse you can use sp_mat::sprandu() or sprandn()
< sumedhghaisas>
i won't make it too big...
< sumedhghaisas>
yes... I was going to ask that... thanks...
< sumedhghaisas>
very boring match.... :(
< naywhayare>
less exciting than yesterday, huh? :)
sumedhghaisas has quit [Ping timeout: 240 seconds]