naywhayare changed the topic of #mlpack to: http://www.mlpack.org/ -- We don't respond instantly... but we will respond. Give it a few minutes. Or hours. -- Channel logs: http://www.mlpack.org/irc/
govg has quit [Ping timeout: 248 seconds]
udit has joined #mlpack
< jenkins-mlpack> Starting build #1974 for job mlpack - svn checkin test (previous build: SUCCESS)
sumedh_ has joined #mlpack
< jenkins-mlpack> Project mlpack - svn checkin test build #1974: SUCCESS in 1 hr 14 min: http://big.cc.gt.atl.ga.us:8080/job/mlpack%20-%20svn%20checkin%20test/1974/
< jenkins-mlpack> saxena.udit: Fixed subvec calls.
< jenkins-mlpack> Project mlpack - nightly matrix build build #502: FAILURE in 4 hr 31 min: http://big.cc.gt.atl.ga.us:8080/job/mlpack%20-%20nightly%20matrix%20build/502/
< jenkins-mlpack> * saxena.udit: Fixed subvec calls.
< jenkins-mlpack> * saxena.udit: Rewinding the code review
< jenkins-mlpack> * siddharth.950: Combined CosineNode and CosineTree classes.
< jenkins-mlpack> * andrewmw94: memory leak fix
udit has quit [Quit: leaving]
andrewmw94 has joined #mlpack
< andrewmw94> naywhayare: let me know when you are free
< naywhayare> andrewmw94: on my way to lab now, I need about 15 minutes
< andrewmw94> ok
< naywhayare> alright, I am here now
< andrewmw94> right. So I have questions about the API I should have for the R tree and about the neighbor_search_rules
< andrewmw94> which do you want first?
< naywhayare> let's start with the R tree
< andrewmw94> ok. So as I think I mentioned, I have the normal destructor and a "soft delete" that I use for when I copy nodes. Is it ok if I assume the user only uses the normal destructor and only on the root node?
< naywhayare> that's a valid assumption; for the most part, the user won't explicitly call the destructor but it'll be implicitly called when the tree goes out of scope
< andrewmw94> ok, that's good
< naywhayare> I remember talking about your soft destructor... why not just clear the array of children and call the regular destructor?
< andrewmw94> that's basically what it does
< naywhayare> ah, okay
< andrewmw94> The next question is about copy constructors. Is there a standard way these are done? Will the user expect certain behavior from these?
< andrewmw94> because I use this to get an exact copy when I duplicate nodes, which may be bad
< andrewmw94> I can set everything by hand, but if the user wanted to try this it could cause issues
< naywhayare> the copy constructor should have signature RectangleTree(const RectangleTree& other)
< naywhayare> just like BinarySpaceTree
< naywhayare> and it should produce an exact copy of the identical tree
< andrewmw94> including pointers?
< naywhayare> but the only thing to be careful about is that all pointers point to new objects, and not to the same objects that the old tree points to
< naywhayare> yeah, pointers are the one exception :)
< naywhayare> so you will have to create all new children, etc., but the non-pointer members you can just copy
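    A minimal C++ sketch of the deep-copying copy constructor described here (the members bound, count, and children are hypothetical placeholders, not the actual RectangleTree layout):

        RectangleTree::RectangleTree(const RectangleTree& other) :
            bound(other.bound),  // non-pointer members can be copied directly
            count(other.count)
        {
          // Pointers are the exception: every child must be a new object,
          // built by recursively deep-copying the old tree's children.
          for (size_t i = 0; i < other.children.size(); ++i)
            children.push_back(new RectangleTree(*other.children[i]));
        }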
< andrewmw94> ahh, that's the problem. I need to get the pointers so that I can have the root stay at the same address
< naywhayare> why does the root need to stay at the same address?
< andrewmw94> so that the address returned by the constructor is the root
< naywhayare> ...but why is the regular constructor employing the copy constructor?
< andrewmw94> it isn't. But when I split the root node I make a copy.
< naywhayare> ah, okay
< andrewmw94> we discussed this a while back. The alternative was to pass the address of the new root back to the user, but I think this is better
< naywhayare> I think we talked about this some time back, and for the root node, there is no perfect solution
< naywhayare> the user should expect to type 'RectangleTree r(dataset)' and then suddenly a tree is built and they don't have to think about it
< andrewmw94> yeah. But is there a way for me to have a method, say getCompleteCopy(), that will copy all data exactly?
< naywhayare> but I think we are on the same page about that
< naywhayare> well, you can copy the contents of the object exactly
< naywhayare> however, you still end up with a different object that's located at a different place in memory
< naywhayare> so I don't think this solves your problem (unless I've misunderstood)
< andrewmw94> yeah, that's fine. I just need to have all of the same pointers
< naywhayare> so, the default copy constructor will do that (a shallow copy, where it copies each object exactly) but that's very unsafe to expose to the user
< naywhayare> because the user will try to make a copy of a tree, delete the original, and try to use the copy and suddenly all the pointers reference dead memory
< andrewmw94> yeah.
< andrewmw94> so I want to have a copy constructor that gets new memory, but still have a way to copy the root node and keep all the old pointers.
< naywhayare> is this a copy constructor that the user is going to use, or a copy constructor that is for splitting the root node?
< andrewmw94> I should be able to set all of the pointers by hand, but then I need to add a bunch of methods so I can get a pointer to the dataset rather than a reference etc.
< naywhayare> (that is, a copy constructor that is used in the process of splitting the root node)
< andrewmw94> I think I agree with you on what the user needs and I need to change the tree for that.
< andrewmw94> I'm trying to figure out what I can do for splitting the root now
< andrewmw94> it isn't possible to have multiple copy constructors (unless I missed something)
< naywhayare> so the best solution that I can think of (which is not necessarily the best solution) is to create two new children, split everything from the root node into those two, and then set the children of the root to be those two new nodes
< naywhayare> and yeah, you can't have multiple copy constructors that use the same semantics as the normal copy constructor
< andrewmw94> and is there a way to have a good copy constructor and also have a method that makes an exact copy?
< naywhayare> I mean, you could do some hack constructor like RectangleTree(const RectangleTree& other, const bool exactCopy = 0)
< naywhayare> but I'm still not certain this is the best way to approach the problem
< andrewmw94> Let's see if I understand the two new children thing. So I would create two empty nodes, and then have a method that will assign the information from the root node to these two new ones
< andrewmw94> ok. It seems like that should be equivalent to the current functionality.
< naywhayare> something kinda like RectangleTree leftChild, rightChild; leftChild.children().add(children.subvec(all left children)); rightChild.children().add(children.subvec(all right children)); children.clear(); children.add(leftChild, rightChild);
< naywhayare> that's not actually valid C++ but it's semi-pseudocode that maybe gets the idea across
< naywhayare> maybe that is helpful?
< andrewmw94> yeah, I think that should work
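    A rough C++ sketch of that root split: the root object stays at the same address and simply gains two freshly allocated children (GoesLeft() is a hypothetical stand-in for whatever partitioning rule is used):

        void RectangleTree::SplitRoot()
        {
          RectangleTree* left = new RectangleTree();
          RectangleTree* right = new RectangleTree();

          // Distribute the root's current children between the two new nodes.
          for (size_t i = 0; i < children.size(); ++i)
          {
            if (GoesLeft(i))  // hypothetical partitioning test
              left->children.push_back(children[i]);
            else
              right->children.push_back(children[i]);
          }

          // The root itself is untouched, so the pointer (or reference) the
          // user holds remains valid after the split.
          children.clear();
          children.push_back(left);
          children.push_back(right);
        }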
< naywhayare> (back in a moment)
< andrewmw94> while I'm at it, I should probably change the code so that a split node only makes one new node rather than two and then deleting itself
< andrewmw94> ok, that's all my questions about that. Ready to move on to the nearest_neighbor_rule or did you have something else to add?
< andrewmw94> ok
< naywhayare> (ok, back)
< naywhayare> yeah, nothing else to add; I will look through the code when you have it working and tested (so I can try ideas and still check that it all works)
< andrewmw94> ok (construction is working and tested. Traversal needs this question.)
< andrewmw94> so, I'm looking through the neighbor_search_rule, and I'm concerned about the baseCase()
< andrewmw94> when it takes the indices of points, will that still work when I store all of the points in the leaf nodes?
< naywhayare> no; it will require a little bit of refactoring of BinarySpaceTree, CoverTree, and maybe RectangleTree
< naywhayare> right now, the BaseCase(queryIndex, referenceIndex) function basically calls metric.Evaluate(querySet.col(queryIndex), referenceSet.col(referenceIndex))
< naywhayare> however, if each TreeType.Dataset() function returned a reference to the dataset, then we could do something like
< naywhayare> metric.Evaluate(queryNode.Dataset().col(queryIndex), referenceNode.Dataset().col(referenceIndex))
< naywhayare> I think this needs a little more thought, though; we still need some way of obtaining a unique index for each point
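    In a NeighborSearchRules-style class, the refactored BaseCase() might look like the following sketch; the per-node Dataset() accessor is exactly the part still under discussion, and metric is assumed to be a member of the rules class:

        template<typename TreeType>
        double BaseCase(TreeType& queryNode, const size_t queryIndex,
                        TreeType& referenceNode, const size_t referenceIndex)
        {
          // Indices here are local to each node's own matrix, not the
          // original dataset -- which is why a separate scheme for unique
          // point indices is still needed.
          return metric.Evaluate(queryNode.Dataset().col(queryIndex),
                                 referenceNode.Dataset().col(referenceIndex));
        }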
< andrewmw94> ok. should I change them to do that?
< naywhayare> if each RectangleTree node holds its own arma::mat object, then the first point in every node has index 0
< andrewmw94> I think I understand
< andrewmw94> yes
< naywhayare> so we need some way of obtaining the original index of each point
< andrewmw94> do we want to keep that in reference to the original data matrix?
< andrewmw94> theoretically we could just use a map from the points to the indices in the original matrix right?
< andrewmw94> but that may be too much overhead
< naywhayare> the BinarySpaceTree maps points all over the place and then returns a std::vector<size_t> at construction time to map from new indices to old indices
< naywhayare> but the BinarySpaceTree also builds the entire tree on one contiguous dataset
< naywhayare> but I seem to remember that this is not realistic for the RectangleTree, or, we chose not to do the RectangleTree on one arma::mat object for some reason, but I can't remember the exact reason
< andrewmw94> yes. That makes less sense with an R tree, though, because we split up the data and there isn't a logically "leftmost" child of each node
< andrewmw94> though I guess if nothing else we can just descend to the node that happens to be at index 0, then 1, ...
< naywhayare> an arbitrary ordering is fine; that doesn't completely make sense for a BinarySpaceTree either
< naywhayare> the key being that the points in each leaf are contiguous in memory
< andrewmw94> but then will we have to traverse the tree like that to find the index of each point?
< andrewmw94> because if so, a map is probably faster
< naywhayare> no, when you build the tree, you can assemble the std::vector<size_t> object
< naywhayare> but again, we don't have completely unique indices
< naywhayare> if you have two RectangleTree nodes, the indices for each will start at 0
< andrewmw94> hmm. So what is the ordering in the BSP tree? The original ordering of the matrix?
< andrewmw94> or more to the point, what would the order for the R tree have to end up as so that the output would be exactly the same
< naywhayare> no, it's basically an arbitrary ordering; the leftmost child holds the first points in the matrix; the next leftmost child (left.left.left...right) has the next points, and so forth
< naywhayare> the actual ordering of the points is less important
< andrewmw94> what I mean is for the test code to be able to compare the outputs, what would the R tree need to do? I thought I remembered the brute force being compared to the BSP tree
< naywhayare> the more important part for the NeighborSearchRules abstraction (and other abstractions) is that the BaseCase() function knows what index to store in the case where it gets a new best nearest neighbor candidate
< naywhayare> yeah, the brute force algorithm gets compared
< naywhayare> if you can set up the RectangleTree in such a way that it knows the true indices of points that it is holding, then it will work
< naywhayare> i.e. instead of holding std::vector<arma::vec*> (or similar), it holds std::vector<size_t> where each size_t is just an index of a particular column in the dataset
< naywhayare> do you think refactoring like that would be straightforward?
Anand has joined #mlpack
< naywhayare> then there is no need to split the big dataset into little pieces
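    A sketch of that representation, with each leaf holding plain indices into the one shared dataset (the member layout here is an assumption, not the real class):

        #include <armadillo>
        #include <vector>

        class RectangleTree
        {
          const arma::mat& dataset;    // the single, unrearranged dataset
          std::vector<size_t> points;  // original column indices in this leaf

         public:
          // The i-th point held by this leaf, fetched by its original index;
          // BaseCase() can then report points[i] directly as a neighbor index.
          arma::vec Point(const size_t i) const { return dataset.col(points[i]); }
        };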
< andrewmw94> I think the most reasonable would be to map them to the indices of the points in the original data matrix. It doesn't seem like it would be that much harder than anything else.
< naywhayare> well, but I think this would be difficult unless you have some kind of unique mapping
< naywhayare> again consider two RectangleTree nodes in the same tree, but in different places
< andrewmw94> I'm actually expecting that having a map would be faster than that. It seems like using the indices would have bad memory locality
< naywhayare> the first point in each of these nodes has index 0
< andrewmw94> well, if nothing else we can use Map<arma::col_view, size_t> right?
< andrewmw94> or will the arma::col_view (or whatever the real name is) be different each time it is made?
< naywhayare> yes, it will be a different object every time you call mat.col()
< naywhayare> I'm not sure the actual problem is entirely clear. the NeighborSearchRules class maintains a list of nearest neighbor candidates (and distances) for every possible queryIndex
< naywhayare> when I call BaseCase(queryIndex, referenceIndex), these lists are updated if the distance between the points referred to by queryIndex and referenceIndex is smaller than any distance in the list for that particular queryIndex
< naywhayare> so if I am going to update the list of candidates, then we need a referenceIndex that we can use to do so
< andrewmw94> ok. I think I get it. The original index of the point in the dataset (the one that was passed to the constructor) should work right?
< andrewmw94> but anything else will also work, as long as it is unique?
< naywhayare> well, if you use anything else, then we will later need to remap it to the old index
< naywhayare> but yes, it could be made to work
< andrewmw94> ok. So if we had a Map that went from an arma::vec to the original index that should work, provided that it doesn't have too much overhead and I can think of a way to have each vector mapped by value.
< naywhayare> but then each vector comparison is going to be O(d) and that is going to be particularly slow
< andrewmw94> other alternatives are to store the original indices of the points in each node.
< naywhayare> yes, if you stored the original indices of the points in the node, then you can just store a std::vector<size_t> that holds the indices
< andrewmw94> or to come up with an order of traversing the tree. If we go, say, "leftmost" first, we could cache "base" values as we build the tree. but they will be updated a lot, slowing construction of the tree
< naywhayare> what do you mean? can you explain more thoroughly?
udit has joined #mlpack
< andrewmw94> well, suppose we traverse by always going to the lowest numbered child
< andrewmw94> then we could cache say a 0 as the root
< andrewmw94> and its leftmost child would have a 0 cached
< andrewmw94> let's say that one is a leaf to make it easier.
< andrewmw94> then it has say 17 vectors
< andrewmw94> so the child of index 1 from the root would need to cache 17 as its "base" value
< andrewmw94> and it could number the children starting there
< naywhayare> ok, that's what I thought
< udit> naywhayare: shall we start now ? are you free ?
< naywhayare> udit: let me finish this first please
< naywhayare> andrewmw94: I think that'll require a lot of updating of those base values during construction
< andrewmw94> yeah, that's what I think.
< naywhayare> do you think that it's not too restrictive to just store a std::vector<size_t> with each of the indices?
< andrewmw94> that should work. We discussed it before and I thought we agreed it would work but it would have bad memory locality
< naywhayare> it could have bad memory locality, yeah
< andrewmw94> perhaps storing both the vectors and the size_t?
< andrewmw94> it's not too much memory and would solve the locality issue
< naywhayare> that won't help, it has the same memory locality problem
< naywhayare> because you'll have to store a pointer to the vector
< andrewmw94> oh, I forgot the base case will use the index rather than the stored point
< andrewmw94> duh
< naywhayare> either that or you store the vector itself, and that takes a ton of space
< naywhayare> the NeighborSearchRules abstractions are modifiable, if you have an idea that I don't, so don't rule that out as a possibility
< naywhayare> the key being that it can still work with the other types of trees
< naywhayare> unfortunately making all these things work together can get really difficult :-S
< andrewmw94> yeah. I'm probably going to need to check that in more detail to see exactly how it works
< andrewmw94> but I'm definitely stumped on how to do this
< andrewmw94> efficiently that is
< andrewmw94> doing it with a map is easy
< naywhayare> for now I think it would be better to store a std::vector<size_t> with indices of the points
< naywhayare> and then the dataset is separate from the tree, and each node in the tree does not have to store its own small dataset
udit has quit [Quit: leaving]
udit_s has joined #mlpack
udit_s is now known as udit
< andrewmw94> yeah. I'll try that for now. The locality won't actually be as bad as I was thinking. I forgot the distance computations don't use the points in leaf nodes anyways
< andrewmw94> it seems like it may mostly be the extra dereference
< naywhayare> so, locality definitely helps; this is why the BinarySpaceTree rearranges the points in the matrix so that they are contiguous
< naywhayare> but if we decide that is a problem, we can deal with that later
< naywhayare> we don't need to worry about it now
< andrewmw94> yeah
< andrewmw94> ok. Thanks
< naywhayare> ok, anything else I can help with for now?
< andrewmw94> No I think that's enough for now. I'll let you talk with udit.
< naywhayare> sure, feel free to get in touch if you are getting hung up on other things
< naywhayare> I'm sorry that I maybe sent you in the wrong direction before
< naywhayare> udit: you wanted to talk about the decision stump entropy calculation, right?
< udit> naywhayare: yeah. Did you have time to go through the mail ?
< udit> the last one on the thread.
< udit> oh and about the subvec parameter issue. I did get to that. It didn't give an error when I built it on my machine. But the build is unstable.
< naywhayare> the failures are specifically on i386 systems
< naywhayare> I'm pretty sure the issue is the use of 'long unsigned int'
< naywhayare> line 163
< naywhayare> use size_t not long unsigned int; size_t is not guaranteed to be long unsigned int on all platforms
< udit> hmm. okay. I was going to templatize CountMostFreq as well, but a similar problem was occurring. Maybe size_t should cut it.
< naywhayare> use whatever the type held by sortedLabels is
< naywhayare> which should be size_t
< udit> okay. Now, about the entropy calculation.
< naywhayare> why did you templatize CalculateEntropy to take a type for the labels? that should always be size_t anyway
< udit> I see that now.
< udit> Earlier I was using a rowvec for them, for more precise calculations.
< naywhayare> I don't understand how that makes it more precise; have I overlooked something?
< udit> no, it was my mistake.
< naywhayare> okay
< naywhayare> ah, okay, I understand what happened
< naywhayare> so when I mentioned templatizing, my suggestion would have been more template-y: template<typename VecType, typename LabelType> double DecisionStump<MatType>::CalculateEntropy(const VecType& attribute, const LabelType& labels)
< naywhayare> that way it handles both rowvec and subview_row
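    Spelled out, the suggested signature looks like this (a sketch; only the two template parameters and the argument list come from the discussion above):

        template<typename MatType>
        template<typename VecType, typename LabelType>
        double DecisionStump<MatType>::CalculateEntropy(const VecType& attribute,
                                                        const LabelType& labels)
        {
          // A rowvec and the subview_row returned by rowvec::subvec() both
          // bind here, so the subview class is never named explicitly.
        }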
< udit> actually, subview_row seems as if it has to be handled in a very dicey manner. Either that, or I haven't understood it that well, because there's very little documentation on the Armadillo pages.
< naywhayare> yes, those are kind of "hidden" armadillo classes that aren't documented very well
< naywhayare> I disagree with that decision, to hide them like that
< naywhayare> either way, they should work about the same as a rowvec
< udit> so should I try with size_t as it currently stands ? because I don't see where it'll need to be rowvec the way I'm using it now...
< naywhayare> I'd use my suggestion, to avoid explicitly using the subview_row class
< naywhayare> because that class isn't documented, it's possible that the API may change in the future
< naywhayare> and that class may change name or disappear entirely
< naywhayare> but rowvec::subvec() can reasonably be expected to always return something that acts like a rowvec
< udit> so the typename VecType will take rowvec::subvec ? and LabelType - Row<size_t>::subvec ?
< naywhayare> yeah
< udit> okay, now about the entropy calculation. Did you understand what I talked about in the mail ?
< naywhayare> yeah, I get what you are saying, but I don't see why this much simpler snippet won't act the same way:
< naywhayare> the entropy calculation does not in any way depend on the actual values of the points; only the labels themselves
< naywhayare> unless I have misunderstood the ID3 entropy calculation you are using
< udit> I think you might have. Think about the way oneR works. Entropy calculation might not depend directly upon the values of the points, but it still does depend on which value takes what label in which split/bin.
< udit> Try it the way you would proceed as in OneR.
< udit> Given an attribute, you create/mark bins. Then for each of the bins, you calculate the entropy, which is used to calculate the total entropy of the split.
Anand has quit [Ping timeout: 246 seconds]
< naywhayare> yeah, true -- but for each bin you are calling CalculateEntropy(), and for each bin, the calculation only depends on the distributions of labels and not the attribute
< naywhayare> is that correct?
< udit> so the calculateEntropy function is acting over each bin.
< udit> yeah
< udit> Each attribute's entropy is calculated in setupsplitAttribute.
< udit> this is then compared to the bestEntropy in the constructor.
< naywhayare> right
< udit> now in each bin,
< udit> you need the uniqueAtt to construct the entropyArray, which is used to calculate entropy just for that bin.
< udit> it helps to know the labels of each element and numElem helps calculate the value of p2.
< naywhayare> but in each bin, the entropy does not depend on the actual attribute values; just the labels
< udit> it does.
< udit> upon the distribution.
< naywhayare> H(S) = -sum_{x \in X} p(x) log_2 p(x)
< naywhayare> where X is the set of classes
< naywhayare> and p(x) is the number of points in class x divided by the total number of points in the set S
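    In code, that definition depends only on the per-class counts; an illustrative snippet, assuming counts[c] holds the number of points in the bin with class c:

        #include <cmath>
        #include <vector>

        double Entropy(const std::vector<size_t>& counts, const size_t total)
        {
          double h = 0.0;
          for (size_t c = 0; c < counts.size(); ++c)
          {
            const double p = (double) counts[c] / (double) total;
            if (p > 0.0)  // taking 0 log 0 = 0 skips absent classes
              h -= p * std::log2(p);
          }
          return h;
        }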
< udit> p(x) is for this bin.
< udit> the current bin.
< naywhayare> yes, S is the set of points in the current bin, is it not?
< udit> yes, which changes in each bin, but the classes, numClass here, remains the same.
< naywhayare> well, you have to ignore any classes which aren't present in X anyway otherwise H(S) is infinity
< naywhayare> so I don't think that makes a difference and I'm still not seeing the necessity of uniqueAtt instead of just a count of how many points are in each class
< udit> well, the way you ignore those classes is that p3 is 0 when p2 is 0; you take 0 log(0) to be 0
< naywhayare> yeah, that's fine; that part works just fine
< udit> also, I looked into the oneClass and defaultClass issue too. It works fine. I just needed to map those to using mapping vectors
< naywhayare> I took a look at the references you sent, but this is still not making it clear to me why uniqueAtt is necessary. one of the references mentions that information gain only needs to be computed when the class label changes, but that is considering where the bin cutoff is, not the actual entropy calculation
< naywhayare> I'm sorry that I am kind of hung up on this, but I am concerned that the entropy calculation could be way faster
< naywhayare> and I still am not seeing how the snippet I pastebin'ed is incorrect; if it's correct, it (or something like it) will be significantly faster than the implementation that calls arma::unique()
< udit> okay, take the set: {(1,0),(2,1),(2,1),(1,0),(2,0)}
< udit> here, your code in the pastebin does not take into consideration the entropy difference due to (2,1) and (2,0), that is, 2 having two different labels, 1 and 0
< naywhayare> so, what should the entropy in this case be?
< udit> but while calculating entropy you need to do that.
< naywhayare> by my understanding it should be -(3/5 * log2(3/5) + 2/5 * log2(2/5))
Anand has joined #mlpack
< udit> naywhayare: just a sec.
< udit> naywhayare: okay, I think I got it wrong this whole time. It should not depend upon the discrete values of the attribute points, rather only the labels.
< udit> naywhayare: I just came across an example which confirmed your position.
< naywhayare> ok, thanks for digging into that. I found that most sources I could find on the entropy calculation algorithm were somewhat unclear
< udit> they haven't talked in depth about continuous attributes.
< naywhayare> yeah, that is what I found also
< naywhayare> it made it difficult to figure out how to extrapolate to continuous features
< naywhayare> anyway, if you think my idea is correct, we can refactor CalculateEntropy() significantly, and it won't need the attribute vector anymore
< naywhayare> I think we also need a test case that ensures that the stump splits on the correct dimension, because we have no test case that does that
< naywhayare> so if the stump is splitting incorrectly, it won't throw an error in the tests
< naywhayare> does that seem reasonable? or maybe I have misunderstood one of the tests?
< udit> yeah, one of the tests should also test that.
< udit> okay. good that we got this sorted out.
< udit> now about MergeRanges()
< udit> we still need it right ?
< naywhayare> if it is possible that we will have two adjacent bins with the same label, then yes
< udit> Because, once the bins have been decided, what this effectively does is merge adjacent bins with the same labels.
< udit> yeah.
< naywhayare> does the bin splitting algorithm take that into account?
< naywhayare> i.e. does it ensure that two adjacent bins will have different labels?
< udit> it can't. It just splits based on inputBinSize
< udit> it just splits the attribute. I take the corresponding labels and send them to calculateEntropy
< udit> the begin and end variables in SetupSplitAttribute
< naywhayare> okay, so then we do need MergeRanges()
< udit> now, then. How do we go about supporting a specified number of bins ?
< udit> Any ideas on that, because I thought it would become too complex to get that.
< naywhayare> hang on, let me do a little quick reading
< udit> Even the papers which talked about continuous attributes only talked about the number of points in each bin - not the number of bins.
< udit> yeah.
< naywhayare> fair enough
< naywhayare> looking at the OneR documentation, it doesn't support number of bins either
< naywhayare> so let's just leave it as it is
< naywhayare> although, I guess there is a fairly easy way
< udit> okay, so as of now, I have the templatization, the rewriting of calculateEntropy, that one unit test.
< udit> go on...
< naywhayare> if you have too many bins, just merge bins in a way that minimizes the increase in entropy until you have the right number of bins
< naywhayare> and you can only merge adjacent bins, so I don't think that would take too long
< naywhayare> whether or not you want to do that is up to you. other implementations don't seem to provide that functionality, so if you don't want to, don't worry about it
< naywhayare> but I think that would be probably the simplest and best way of doing it
< udit> that's what I thought initially too, but managing the labels and their respective indices, while at the same time handling the entropy minimization across the feature set, might be complex
< udit> because while going for entropy minimization, the attribute might change due to merging of bins after the attribute has been selected - but I think merging of bins will be done before selection.
< naywhayare> yeah, you could merge the bins before selecting the attribute, and take that into account in the entropy calculation
< udit> okay - I'll think about this, and if I do have time at the end, I'll get it done.
< naywhayare> I had thought that you would merge after selecting the attribute, but you are right that merging after selection could mean that the best attribute wasn't chosen
< naywhayare> anyway, you listed three things -- templatization, CalculateEntropy(), and the unit test
< udit> but the number of ways to merge the bins would be too many - it would be expensive too.
< naywhayare> true; there are many ways to merge the bins. the calculation of the best bin to merge for an attribute should be O(b) time, where b is the number of existing bins
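    Something like this greedy loop is what that O(b)-scan-per-merge suggests; it is entirely a sketch, with EntropyOfUnion() and Merge() as hypothetical helpers over some Bin type:

        #include <limits>
        #include <vector>

        template<typename Bin>
        void MergeToTargetBins(std::vector<Bin>& bins, const size_t targetBins)
        {
          while (bins.size() > targetBins)
          {
            // Find the adjacent pair whose merged bin has the lowest entropy.
            size_t best = 0;
            double bestEntropy = std::numeric_limits<double>::max();
            for (size_t i = 0; i + 1 < bins.size(); ++i)
            {
              const double e = EntropyOfUnion(bins[i], bins[i + 1]); // hypothetical
              if (e < bestEntropy) { bestEntropy = e; best = i; }
            }

            Merge(bins[best], bins[best + 1]); // hypothetical: fold right into left
            bins.erase(bins.begin() + best + 1);
          }
        }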
< naywhayare> my only other thought for how to improve the decision stump code is that maybe TrainOnAtt() and SetupSplitAttribute() have some code duplication, and maybe could be merged or refactored in some way to reduce that?
< naywhayare> I'm not sure about that one
< udit> yeah, so I'll look at it in the end. But as of now, the things that I listed previously - the templatization, the rewriting, and at least one more unit test is what we need to change.
< naywhayare> so maybe it is not possible
< naywhayare> yeah, just those three, and then let me know what you think about TrainOnAtt() and SetupSplitAttribute()
< udit> Yeah, some part of it could be. But that might be slightly more expensive. Let me think about that and I
< udit> *I'll get back to you once I've thought of something valuable.
< naywhayare> okay, sounds good
< udit> I guess that does it, then.
< naywhayare> yeah; I have one more thing I have to do on my end, though
< naywhayare> I need to go through SetupSplitAttribute() and understand how the binning decisions are done
< naywhayare> the comments you added will help, so thank you for committing those
< udit> sure
< udit> I guess you're back now, for good ?
< naywhayare> yeah, I am not traveling anymore
< udit> You also ( or maybe Marcus) need to go through the Perceptron.
< naywhayare> so I should have more time
< naywhayare> yeah, the perceptron is on my list too :)
< naywhayare> this Friday (July 4) is a holiday here, so I will probably be gone that day
< naywhayare> but other than that I don't have any travel plans for the rest of the summer, I think (unless I forgot some)
< udit> Oh, you'll have an extended weekend ! Nice.
< udit> My recollection (from various pop culture references) is that people usually have a barbecue. :D Especially on a 4th of July weekend.
< udit> good stuff.
< udit> I also wanted to talk to you about some other stuff. Maybe later. I'll have dinner and start work again in a while then.
< udit> see you later, then.
< naywhayare> sorry, I got intercepted by someone
Anand_ has joined #mlpack
< naywhayare> and yeah, I am planning to go to the mountains and barbecue :)
Anand has quit [Ping timeout: 246 seconds]
< naywhayare> if you want to talk about any of those other things now, I'll be here for most of the rest of the day (minus some short breaks here and there)
< andrewmw94> naywhayare: I forgot. One reason to have the data in separate matrices was that it allowed easy insertion and deletion. Deleting points is important when constructing R*-trees as they do forced reinsertion. I don't really see a way around this with how the code is set up, but I wanted to double check with you that I should go ahead. I think I recall you saying that rebuilding a tree is generally faster than moving the point
< naywhayare> well, but I think easy insertion and deletion is similarly possible if you're just holding size_t indices instead of actual arma::vec objects
< andrewmw94> how would you do deletion? The best I can think of is to remove the size_t from the vector in the tree, then on the actual data, you swap the arma::vec with the last arma::vec, but that requires updating the size_t in the tree, which requires a query.
< andrewmw94> not too good, but perhaps ok. I'm not sure how it works though, since the tree is being changed and thus the query may not actually work
< andrewmw94> I need to think about that more
< naywhayare> well, you're deleting points from the tree, but not from the actual dataset
< naywhayare> like you said they do forced reinsertion
< naywhayare> so you delete that particular index, then later reinsert it
< naywhayare> maybe that makes sense?
< andrewmw94> ah, yes. I just realized that.
< naywhayare> I don't think it actually really makes sense to have trees of dynamic size
< naywhayare> for mlpack algorithms, every point _has_ to have a unique index
< naywhayare> otherwise the results of, say, nearest neighbor search don't actually make any sense -- they have to return an index of the nearest point
< andrewmw94> yeah
< naywhayare> and if each point has a unique index, and has the same dimensionality (which it has to) then it all fits in a big arma::mat structure anyway
< naywhayare> if a user wants to grow a tree to add points, though, I'm not sure exactly what needs to happen. a call to insert_cols() or something?
< andrewmw94> ok. The changed code passes my tests and valgrind. Now I'm thinking: "that's way too easy. I MUST have broken something."
< andrewmw94> the way it was setup, I had an insertPoint() function that would take an arma::vec
< andrewmw94> but I just changed that
< andrewmw94> the tree can no longer grow
< andrewmw94> well, I guess it technically could. You would have to expand the arma::mat and then you could use insertPoint(size_t)
< andrewmw94> on the new columns.
< andrewmw94> that should work as it used to
< naywhayare> yeah, I think if we ever offer generalized support for adding new points to trees, it's going to have to be along those lines
< naywhayare> expand the matrix, then call a function to grow the tree... or something
< naywhayare> either way, unless you have a solution I don't see, I don't think we need to worry about that use case for now
< andrewmw94> yeah. I'll try getting the traversal working now that the dataset is handled like the other trees
< naywhayare> ok, sounds good. please let me know if you need help
< andrewmw94> then add the ability to delete points, then R*-trees, then X-trees
< naywhayare> traversals and rules are... pretty complex
< andrewmw94> yep. Thanks.
oldbeardo has joined #mlpack
govg has joined #mlpack
govg has quit [Changing host]
govg has joined #mlpack
< oldbeardo> naywhayare: hey, did you get a chance to look at Reg SVD?
< naywhayare> I have not had a chance to look as comprehensively as I would like
< naywhayare> could you go ahead and add the QUIC_SVD class and code to the repository?
< oldbeardo> yes, about that, where should I add it?
< naywhayare> src/mlpack/methods/ seems fine to me; that or src/mlpack/core/lin_alg/
Anand_ has quit [Ping timeout: 246 seconds]
govg has quit [Ping timeout: 272 seconds]
govg has joined #mlpack
< naywhayare> oldbeardo: one other thing -- do you want to integrate QUIC_SVD into CF now, or were you planning to do that as a last part of your project?
< oldbeardo> naywhayare: yes, this is relevant to the point we had discussed before
< oldbeardo> if you remember, I had explained to you why QUIC-SVD will not work very well in the case of sparse matrices
< oldbeardo> so, I'm not sure if should even link it to the CF implementation
< oldbeardo> *we should
udit has quit [Quit: Leaving]
< naywhayare> right, I remember that now
< naywhayare> let me see if I can dig up the logs to refresh my memory more completely
sumedh__ has joined #mlpack
sumedh_ has quit [Ping timeout: 260 seconds]
< naywhayare> I've plotted the singular values of the GroupLens matrix
< naywhayare> although it is very sparse, the image seems to indicate that it can be approximated fairly well by a fairly small number of singular vectors
< naywhayare> I know you tested your implementation against the declaration dataset, and you got good results there
< naywhayare> I suggested that you try adding the average rating for each item to the dataset, but this had a problem because many items only had one rating
< naywhayare> so maybe the better idea is to add the average rating for each user, not each item
< naywhayare> either way, maybe quic-svd just works poorly for that dataset, and that's okay (it's not your fault, it's Michael Holmes' fault :))
< naywhayare> I still think there may be situations in practice where quic-svd is useful, though, so we should at least make sure that CF can work with the QUIC_SVD class
< oldbeardo> okay, I was thinking that it should work well for every dataset, which will certainly not be the case for CF
< naywhayare> yeah. so we shouldn't make QUIC_SVD the default for CF, but we should make it so a user can type CF<QUIC_SVD> and it will work
< naywhayare> also, I responded to your email about the regularized SVD code, so hopefully that is helpful input
< jenkins-mlpack> Starting build #1975 for job mlpack - svn checkin test (previous build: SUCCESS)
< oldbeardo> naywhayare: the problem persists even on taking average ratings for items
< naywhayare> yeah, you said that some time back. did you try average ratings for users?
< oldbeardo> it doesn't matter, that's what I'm trying to say
< naywhayare> oh, ok, you tried both and it failed either way
< oldbeardo> since the matrix is sparse, almost all the columns will be the same whether we take average user ratings or average item ratings
< naywhayare> each column will be similar but not the same; it will have few nonzero values, but at least one
< oldbeardo> yes, but even if the matrix has 100 users/items, the difference in the vectors won't be able to capture the information we need
< oldbeardo> in addition replacing missing values with averages will have a huge impact on the outcome
< naywhayare> yes, definitely that will have a big impact
< naywhayare> so my understanding is that the cosine tree, during construction of a node, randomly samples a point, and then takes the cosine similarity of all other points to that randomly sampled point (or something similar to that)
< naywhayare> then, it splits the points with higher similarity to the left, and lower to the right
< naywhayare> if the matrix is very very highly sparse, it's possible that the cosine similarity between one point and all others is 0
< naywhayare> for instance if a user (where a user is a point) has rated only one movie that nobody else has rated
< oldbeardo> yes, didn't think about that, this is an additional issue
< naywhayare> I would imagine the length-squared sampling (where it more often picks points with higher norms) should help to prevent this
< naywhayare> but it is not guaranteed to
< naywhayare> I'd think that adding row averages or column averages would help with this issue
< naywhayare> you pointed out that with row averages (item averages), the average is often completely wrong because the item was only rated once
< naywhayare> sorry, I miswrote "row averages" in that last message when I meant "column averages"
< naywhayare> I would think that using row averages (user averages) would work better because each user in the GroupLens dataset has rated at least 5 movies
< naywhayare> but you say that that didn't give any better performance
< naywhayare> so I'm trying to think about what is actually going on there
< oldbeardo> well, that's because the cosine similarity is not the issue here
< oldbeardo> the issue is the selected basis vector, which is the centroid of the columns in the node
< oldbeardo> it does not represent the data well, and that is because the columns are very similar to one another after averaging
< oldbeardo> even if that wasn't the case, the obtained SVD won't be representative of the data that we have, since we introduced the average ratings
govg has quit [Ping timeout: 248 seconds]
< naywhayare> I see! thank you
< naywhayare> let me think about this more and wrap my head around it completely
< oldbeardo> sure, I have a question about what you wrote about Reg SVD in the mail
< oldbeardo> the cost function for Reg SVD is a sum over all the examples
< oldbeardo> so when you say that I should use L-BFGS, do you mean I should sum over everything inside a for loop?
< oldbeardo> if yes, then how will the gradient function look? since every user/item parameter vector may be affected by more than one rating
< naywhayare> yeah, you could probably write it all in a for loop
< naywhayare> for the gradient, you call zeros(), then you iterate over all the points, and add each of their contributions
govg has joined #mlpack
govg has quit [Changing host]
govg has joined #mlpack
< oldbeardo> yes, but what will the 'contributions' be?
< naywhayare> regularized_svd_function_impl.hpp, lines 55 and 57
< naywhayare> for each user/item combination
< oldbeardo> in the case of SGD we have a learning rate, and every update is independent of the other
< naywhayare> yeah; L-BFGS is different, but it's still optimizing the same objective function
< oldbeardo> true, but won't adding two separate contributions be different from the actual gradient?
< oldbeardo> or maybe I'm getting confused over something trivial
< naywhayare> the gradient function you have implemented is the partial gradient with respect to one user/item combination
< naywhayare> and that's how SGD works -- when you have a function where you can easily calculate partial gradients, then you iterate over each partial gradient
< naywhayare> but many other optimizers don't work that way and operate on the full gradient
< naywhayare> so my suggestion is, don't calculate just the partial gradient -- calculate the full gradient, all at once, then let L_BFGS take a step based on that information
< oldbeardo> yes, my question is 'is the sum over all the contributions the full gradient?'
< naywhayare> yeah, that is correct
govg has quit [Ping timeout: 264 seconds]
< oldbeardo> okay, that solves it then
< naywhayare> it is possible that when you calculate them all at once, you can find a more optimized implementation, but summing them all up works too
< naywhayare> it's also possible that L_BFGS won't perform as well as SGD, but we'll see :)
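    The full-gradient idea in sketch form, as a method of the objective-function class: zero the gradient, then accumulate every rating's partial contribution (structure only; AddPartialGradient() and numRatings are hypothetical stand-ins for the per-rating terms already present in the SGD implementation):

        void Gradient(const arma::mat& parameters, arma::mat& gradient) const
        {
          gradient.zeros(parameters.n_rows, parameters.n_cols);

          // The full gradient is the sum of the partial gradients over every
          // (user, item, rating) triple; L-BFGS then steps on the whole thing.
          for (size_t i = 0; i < numRatings; ++i)
            AddPartialGradient(i, parameters, gradient);
        }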
< oldbeardo> I will try it out tomorrow, also, do you follow football?
< oldbeardo> oh sorry, *soccer
< sumedh__> I do :)
< sumedh__> late night match is not that good though...
< oldbeardo> sumedh__: heh, which team are you supporting?
< naywhayare> oldbeardo: sumedh__: I watched some of the world cup games when I was eating; they had them on TV. but I haven't really followed very much more than that :(
< sumedh__> right now Netherlands.... they are playing with exceptional chemistry... and robben...
< sumedh__> ohh... me and my friends usually watch every match together... today's match was boring...
< oldbeardo> naywhayare: you may want to eat after around 45 minutes :)
govg has joined #mlpack
< oldbeardo> sumedh__: Netherlands is a nice choice
< naywhayare> oldbeardo: hah; but I already had lunch. :( I guess things are exciting, so maybe I should take a second lunch or something :)
< oldbeardo> naywhayare: heh, maybe you should, USA is playing Belgium
< sumedh__> oldbeardo: very risky though... they play constant counter ... any strategic manager like mourinho will make them cry... but its not the case in this world cup is it :)
< sumedh__> anyway which team do you support??
< sumedh__> naywhayare: hehe :)
< oldbeardo> sumedh__: I have always been a Germany supporter, mainly because of Klose and Schweinsteiger
< sumedh__> yesterday's match was very close....
< oldbeardo> yes, really poor defense from Germany's side
< sumedh__> i am sad that muller didn't get a goal...
< sumedh__> yeah totally... in the first half German defense was very poor...
< sumedh__> oldbeardo: hats off to Neuer... really... great defending...
< oldbeardo> yeah, anyway, I have to go, see you later sumedh__ naywhayare
oldbeardo has quit [Quit: Page closed]
< sumedh__> naywhayare: getting weird results with that function... index keeps on increasing...
< sumedh__> means residue keeps on increasing...
< naywhayare> residue is increasing... that shouldn't be happening...
< naywhayare> arma::norm(V - W * H, "fro"), right?
< sumedh__> yes exactly...
< jenkins-mlpack> Project mlpack - svn checkin test build #1975: SUCCESS in 1 hr 14 min: http://big.cc.gt.atl.ga.us:8080/job/mlpack%20-%20svn%20checkin%20test/1975/
< jenkins-mlpack> * Ryan Curtin: Very simple change to fix build on i386.
< jenkins-mlpack> * andrewmw94: change the tree to store size_t in the nodes and keep the dataset together. Other misc changes.
< naywhayare> is it increasing a lot or a little each iteration?
< sumedh__> starts with 4000... and increases in 4 or 5 iterations to 10000 and stops... I checked it with simple residue termination... it's doing fine...
< andrewmw94> naywhayare: so if I keep the dataset as it originally is, does that mean I can just remove the unmapping code for the R tree?
< naywhayare> andrewmw94: yeah, if you don't modify the dataset at all, then the unmapping code isn't necessary
< naywhayare> in the TreeTraits for the R tree, just set RearrangesDataset to false
< naywhayare> and then you shouldn't need to mess with any of the algorithm classes
< andrewmw94> ok. But there's code in the allknn file to handle unmapping. Can I just remove that when I copy stuff for the R tree?
< naywhayare> yeah, or you can ignore it with an if statement
< naywhayare> it still needs to be there for the BinarySpaceTree
< naywhayare> but you are right that it's unnecessary for the R tree
< andrewmw94> ok, thanks
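    The TreeTraits change mentioned above would look roughly like this; the RearrangesDataset member comes from the discussion, but the sketch is written as if RectangleTree took no template parameters, which the real tree likely does:

        // Specialization of TreeTraits for the R tree: the dataset is left
        // exactly as the user passed it in, so algorithms like allknn can
        // skip the unmapping step entirely.
        template<>
        class TreeTraits<RectangleTree>
        {
         public:
          static const bool RearrangesDataset = false;
        };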
andrewmw94 has quit [Quit: Leaving.]