naywhayare changed the topic of #mlpack to: http://www.mlpack.org/ -- We don't respond instantly... but we will respond. Give it a few minutes. Or hours. -- Channel logs: http://www.mlpack.org/irc/
govg has quit [Ping timeout: 248 seconds]
udit has joined #mlpack
< jenkins-mlpack> Starting build #1974 for job mlpack - svn checkin test (previous build: SUCCESS)
sumedh_ has joined #mlpack
< jenkins-mlpack> Project mlpack - svn checkin test build #1974: SUCCESS in 1 hr 14 min: http://big.cc.gt.atl.ga.us:8080/job/mlpack%20-%20svn%20checkin%20test/1974/
< jenkins-mlpack> saxena.udit: Fixed subvec calls.
< jenkins-mlpack> Project mlpack - nightly matrix build build #502: FAILURE in 4 hr 31 min: http://big.cc.gt.atl.ga.us:8080/job/mlpack%20-%20nightly%20matrix%20build/502/
< jenkins-mlpack> * saxena.udit: Fixed subvec calls.
< jenkins-mlpack> * saxena.udit: Rewinding the code review
< jenkins-mlpack> * siddharth.950: Combined CosineNode and CosineTree classes.
< jenkins-mlpack> * andrewmw94: memory leak fix
udit has quit [Quit: leaving]
andrewmw94 has joined #mlpack
< andrewmw94> naywhayare: let me know when you are free
< naywhayare> andrewmw94: on my way to lab now, I need about 15 minutes
< andrewmw94> ok
< naywhayare> alright, I am here now
< andrewmw94> right. So I have questions about the API I should have for the R tree and about the neighbor_search_rules
< andrewmw94> which do you want first?
< naywhayare> let's start with the R tree
< andrewmw94> ok. So as I think I mentioned, I have the normal destructor and a "soft delete" that I use for when I copy nodes. Is it ok if I assume the user only uses the normal destructor and only on the root node?
< naywhayare> that's a valid assumption; for the most part, the user won't explicitly call the destructor but it'll be implicitly called when the tree goes out of scope
< andrewmw94> ok, that's good
< naywhayare> I remember talking about your soft destructor... why not just clear the array of children and call the regular destructor?
< andrewmw94> that's basically what it does
< naywhayare> ah, okay
< andrewmw94> The next question is about copy constructors. Is there a standard way these are done? Will the user expect certain behavior from these?
< andrewmw94> because I use this to get an exact copy when I duplicate nodes, which may be bad
< andrewmw94> I can set everything by hand, but if the user wanted to try this it could cause issues
< naywhayare> the copy constructor should have signature RectangleTree(const RectangleTree& other)
< naywhayare> just like BinarySpaceTree
< naywhayare> and it should produce an exact copy of the identical tree
< andrewmw94> including pointers?
< naywhayare> but the only thing to be careful about is that all pointers point to new objects, and not to the same objects that the old tree points to
< naywhayare> yeah, pointers are the one exception :)
< naywhayare> so you will have to create all new children, etc., but the non-pointer members you can just copy
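    A minimal C++ sketch of the deep-copying copy constructor described here (the members bound, count, and children are hypothetical placeholders, not the actual RectangleTree layout):

        RectangleTree::RectangleTree(const RectangleTree& other) :
            bound(other.bound),  // non-pointer members can be copied directly
            count(other.count)
        {
          // Pointers are the exception: every child must be a new object,
          // built by recursively deep-copying the old tree's children.
          for (size_t i = 0; i < other.children.size(); ++i)
            children.push_back(new RectangleTree(*other.children[i]));
        }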
< andrewmw94> ahh, that's the problem. I need to get the pointers so that I can have the root stay at the same address
< naywhayare> why does the root need to stay at the same address?
< andrewmw94> so that the address returned by the constructor is the root
< naywhayare> ...but why is the regular constructor employing the copy constructor?
< andrewmw94> it isn't. But when I split the root node I make a copy.
< naywhayare> ah, okay
< andrewmw94> we discussed this a while back. The alternative was to pass the address of the new root back to the user, but I think this is better
< naywhayare> I think we talked about this some time back, and for the root node, there is no perfect solution
< naywhayare> the user should expect to type 'RectangleTree r(dataset)' and then suddenly a tree is built and they don't have to think about it
< andrewmw94> yeah. But is there a way for me to have a method, say getCompleteCopy(), that will copy all data exactly?
< naywhayare> but I think we are on the same page about that
< naywhayare> well, you can copy the contents of the object exactly
< naywhayare> however, you still end up with a different object that's located at a different place in memory
< naywhayare> so I don't think this solves your problem (unless I've misunderstood)
< andrewmw94> yeah, that's fine. I just need to have all of the same pointers
< naywhayare> so, the default copy constructor will do that (a shallow copy, where it copies each object exactly) but that's very unsafe to expose to the user
< naywhayare> because the user will try to make a copy of a tree, delete the original, and try to use the copy and suddenly all the pointers reference dead memory
< andrewmw94> yeah.
< andrewmw94> so I want to have a copy constructor that gets new memory, but still have a way to copy the root node and keep all the old pointers.
< naywhayare> is this a copy constructor that the user is going to use, or a copy constructor that is for splitting the root node?
< andrewmw94> I should be able to set all of the pointers by hand, but then I need to add a bunch of methods so I can get a pointer to the dataset rather than a reference etc.
< naywhayare> (that is, a copy constructor that is used in the process of splitting the root node)
< andrewmw94> I think I agree with you on what the user needs and I need to change the tree for that.
< andrewmw94> I'm trying to figure out what I can do for splitting the root now
< andrewmw94> it isn't possible to have multiple copy constructors (unless I missed something)
< naywhayare> so the best solution that I can think of (which is not necessarily the best solution) is to create two new children, split everything from the root node into those two, and then set the children of the root to be those two new nodes
< naywhayare> and yeah, you can't have multiple copy constructors that use the same semantics as the normal copy constructor
< andrewmw94> and is there a way to have a good copy constructor and also have a method that makes an exact copy?
< naywhayare> I mean, you could do some hack constructor like RectangleTree(const RectangleTree& other, const bool exactCopy = 0)
< naywhayare> but I'm still not certain this is the best way to approach the problem
< andrewmw94> Let's see if I understand the two new children thing. So I would create two empty nodes, and then have a method that will assign the information from the root node to these two new ones
< andrewmw94> ok. It seems like that should be equivalent to the current functionality.
< naywhayare> something kinda like RectangleTree leftChild, rightChild; leftChild.children().add(children.subvec(all left children)); rightChild.children().add(children.subvec(all right children)); children.clear(); children.add(leftChild, rightChild);
< naywhayare> that's not actually valid C++ but it's semi-pseudocode that maybe gets the idea across
< naywhayare> maybe that is helpful?
< andrewmw94> yeah, I think that should work
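    A rough C++ sketch of that root split: the root object stays at the same address and simply gains two freshly allocated children (GoesLeft() is a hypothetical stand-in for whatever partitioning rule is used):

        void RectangleTree::SplitRoot()
        {
          RectangleTree* left = new RectangleTree();
          RectangleTree* right = new RectangleTree();

          // Distribute the root's current children between the two new nodes.
          for (size_t i = 0; i < children.size(); ++i)
          {
            if (GoesLeft(i))  // hypothetical partitioning test
              left->children.push_back(children[i]);
            else
              right->children.push_back(children[i]);
          }

          // The root itself is untouched, so the pointer (or reference) the
          // user holds remains valid after the split.
          children.clear();
          children.push_back(left);
          children.push_back(right);
        }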
< naywhayare> (back in a moment)
< andrewmw94> while I'm at it, I should probably change the code so that a split node only makes one new node rather than two and then deleting itself
< andrewmw94> ok, that's all my questions about that. Ready to move on to the nearest_neighbor_rule or did you have something else to add?
< andrewmw94> ok
< naywhayare> (ok, back)
< naywhayare> yeah, nothing else to add; I will look through the code when you have it working and tested (so I can try ideas and still check that it all works)
< andrewmw94> ok (construction is working and tested. Traversal needs this question.)
< andrewmw94> so, I'm looking through the neighbor_search_rule, and I'm concerned about the baseCase()
< andrewmw94> when it takes the indices of points, will that still work when I store all of the points in the leaf nodes?
< naywhayare> no; it will require a little bit of refactoring of BinarySpaceTree, CoverTree, and maybe RectangleTree
< naywhayare> right now, the BaseCase(queryIndex, referenceIndex) function basically calls metric.Evaluate(querySet.col(queryIndex), referenceSet.col(referenceIndex))
< naywhayare> however, if each TreeType.Dataset() function returned a reference to the dataset, then we could do something like
< naywhayare> metric.Evaluate(queryNode.Dataset().col(queryIndex), referenceNode.Dataset().col(referenceIndex))
< naywhayare> I think this needs a little more thought, though; we still need some way of obtaining a unique index for each point
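    In a NeighborSearchRules-style class, the refactored BaseCase() might look like the following sketch; the per-node Dataset() accessor is exactly the part still under discussion, and metric is assumed to be a member of the rules class:

        template<typename TreeType>
        double BaseCase(TreeType& queryNode, const size_t queryIndex,
                        TreeType& referenceNode, const size_t referenceIndex)
        {
          // Indices here are local to each node's own matrix, not the
          // original dataset -- which is why a separate scheme for unique
          // point indices is still needed.
          return metric.Evaluate(queryNode.Dataset().col(queryIndex),
                                 referenceNode.Dataset().col(referenceIndex));
        }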
< andrewmw94> ok. should I change them to do that?
< naywhayare> if each RectangleTree node holds its own arma::mat object, then the first point in every node has index 0
< andrewmw94> I think I understand
< andrewmw94> yes
< naywhayare> so we need some way of obtaining the original index of each point
< andrewmw94> do we want to keep that in reference to the original data matrix?
< andrewmw94> theoretically we could just use a map from the points to the indices in the original matrix right?
< andrewmw94> but that may be too much overhead
< naywhayare> the BinarySpaceTree maps points all over the place and then returns a std::vector<size_t> at construction time to map from new indices to old indices
< naywhayare> but the BinarySpaceTree also builds the entire tree on one contiguous dataset
< naywhayare> but I seem to remember that this is not realistic for the RectangleTree, or, we chose not to do the RectangleTree on one arma::mat object for some reason, but I can't remember the exact reason
< andrewmw94> yes. That makes less sense with an R tree, though, because we split up the data and there isn't a logically "leftmost" child of each node
< andrewmw94> though I guess if nothing else we can just descend to the node that happens to be at index 0, then 1, ...
< naywhayare> an arbitrary ordering is fine; that doesn't completely make sense for a BinarySpaceTree either
< naywhayare> the key being that the points in each leaf are contiguous in memory
< andrewmw94> but then will we have to traverse the tree like that to find the index of each point?
< andrewmw94> because if so, a map is probably faster
< naywhayare> no, when you build the tree, you can assemble the std::vector<size_t> object
< naywhayare> but again, we don't have completely unique indices
< naywhayare> if you have two RectangleTree nodes, the indices for each will start at 0
< andrewmw94> hmm. So what is the ordering in the BSP tree? The original ordering of the matrix?
< andrewmw94> or more to the point, what would the order for the R tree have to end up as so that the output would be exactly the same
< naywhayare> no, it's basically an arbitrary ordering; the leftmost child holds the first points in the matrix; the next leftmost child (left.left.left...right) has the next points, and so forth
< naywhayare> the actual ordering of the points is less important
< andrewmw94> what I mean is for the test code to be able to compare the outputs, what would the R tree need to do? I thought I remembered the brute force being compared to the BSP tree
< naywhayare> the more important part for the NeighborSearchRules abstraction (and other abstractions) is that the BaseCase() function knows what index to store in the case where it gets a new best nearest neighbor candidate
< naywhayare> yeah, the brute force algorithm gets compared
< naywhayare> if you can set up the RectangleTree in such a way that it knows the true indices of points that it is holding, then it will work
< naywhayare> i.e. instead of holding std::vector<arma::vec*> (or similar), it holds std::vector<size_t> where each size_t is just an index of a particular column in the dataset
< naywhayare> do you think refactoring like that would be straightforward?
Anand has joined #mlpack
< naywhayare> then there is no need to split the big dataset into little pieces
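    A sketch of that representation, with each leaf holding plain indices into the one shared dataset (the member layout here is an assumption, not the real class):

        #include <armadillo>
        #include <vector>

        class RectangleTree
        {
          const arma::mat& dataset;    // the single, unrearranged dataset
          std::vector<size_t> points;  // original column indices in this leaf

         public:
          // The i-th point held by this leaf, fetched by its original index;
          // BaseCase() can then report points[i] directly as a neighbor index.
          arma::vec Point(const size_t i) const { return dataset.col(points[i]); }
        };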
< andrewmw94> I think the most reasonable would be to map them to the indices of the points in the original data matrix. It doesn't seem like it would be that much harder than anything else.
< naywhayare> well, but I think this would be difficult unless you have some kind of unique mapping
< naywhayare> again consider two RectangleTree nodes in the same tree, but in different places
< andrewmw94> I'm actually expecting that having a map would be faster than that. It seems like using the indices would have bad memory locality
< naywhayare> the first point in each of these nodes has index 0
< andrewmw94> well, if nothing else we can use Map<arma::col_view, size_t> right?
< andrewmw94> or will the arma::col_view (or whatever the real name is) be different each time it is made?
< naywhayare> yes, it will be a different object every time you call mat.col()
< naywhayare> I'm not sure the actual problem is entirely clear. the NeighborSearchRules class maintains a list of nearest neighbor candidates (and distances) for every possible queryIndex
< naywhayare> when I call BaseCase(queryIndex, referenceIndex), these lists are updated if the distance between the points referred to by queryIndex and referenceIndex is smaller than any distance in the list for that particular queryIndex
< naywhayare> so if I am going to update the list of candidates, then we need a referenceIndex that we can use to do so
< andrewmw94> ok. I think I get it. The original index of the point in the dataset (the one that was passed to the constructor) should work right?
< andrewmw94> but anything else will also work, as long as it is unique?
< naywhayare> well, if you use anything else, then we will later need to remap it to the old index
< naywhayare> but yes, it could be made to work
< andrewmw94> ok. So if we had a Map that went from an arma::vec to the original index that should work, provided that it doesn't have too much overhead and I can think of a way to have each vector mapped by value.
< naywhayare> but then each vector comparison is going to be O(d) and that is going to be particularly slow
< andrewmw94> other alternatives are to store the original indices of the points in each node.
< naywhayare> yes, if you stored the original indices of the points in the node, then you can just store a std::vector<size_t> that holds the indices
< andrewmw94> or to come up with an order of traversing the tree. If we go, say, "leftmost" first, we could cache "base" values as we build the tree. but they will be updated a lot, slowing construction of the tree
< naywhayare> what do you mean? can you explain more thoroughly?
udit has joined #mlpack
< andrewmw94> well, suppose we traverse by always going to the lowest numbered child
< andrewmw94> then we could cache say a 0 as the root
< andrewmw94> and its leftmost child would have a 0 cached
< andrewmw94> let's say that one is a leaf to make it easier.
< andrewmw94> then it has say 17 vectors
< andrewmw94> so the child of index 1 from the root would need to cache 17 as its "base" value
< andrewmw94> and it could number the children starting there
< naywhayare> ok, that's what I thought
< udit> naywhayare: shall we start now ? are you free ?
< naywhayare> udit: let me finish this first please
< naywhayare> andrewmw94: I think that'll require a lot of updating of those base values during construction
< andrewmw94> yeah, that's what I think.
< naywhayare> do you think that it's not too restrictive to just store a std::vector<size_t> with each of the indices?
< andrewmw94> that should work. We discussed it before and I thought we agreed it would work but it would have bad memory locality
< naywhayare> it could have bad memory locality, yeah
< andrewmw94> perhaps storing both the vectors and the size_t?
< andrewmw94> it's not too much memory and would solve the locality issue
< naywhayare> that won't help, it has the same memory locality problem
< naywhayare> because you'll have to store a pointer to the vector
< andrewmw94> oh, I forgot the base case will use the index rather than the stored point
< andrewmw94> duh
< naywhayare> either that or you store the vector itself, and that takes a ton of space
< naywhayare> the NeighborSearchRules abstractions are modifiable, if you have an idea that I don't, so don't rule that out as a possibility
< naywhayare> the key being that it can still work with the other types of trees
< naywhayare> unfortunately making all these things work together can get really difficult :-S
< andrewmw94> yeah. I'm probably going to need to check that in more detail to see exactly how it works
< andrewmw94> but I'm definitely stumped on how to do this
< andrewmw94> efficiently that is
< andrewmw94> doing it with a map is easy
< naywhayare> for now I think it would be better to store a std::vector<size_t> with indices of the points
< naywhayare> and then the dataset is separate from the tree, and each node in the tree does not have to store its own small dataset
udit has quit [Quit: leaving]
udit_s has joined #mlpack
udit_s is now known as udit
< andrewmw94> yeah. I'll try that for now. The locality won't actually be as bad as I was thinking. I forgot the distance computations don't use the points in leaf nodes anyways
< andrewmw94> it seems like it may mostly be the extra dereference
< naywhayare> so, locality definitely helps; this is why the BinarySpaceTree rearranges the points in the matrix so that they are contiguous
< naywhayare> but if we decide that is a problem, we can deal with that later
< naywhayare> we don't need to worry about it now
< andrewmw94> yeah
< andrewmw94> ok. Thanks
< naywhayare> ok, anything else I can help with for now?
< andrewmw94> No I think that's enough for now. I'll let you talk with udit.
< naywhayare> sure, feel free to get in touch if you are getting hung up on other things
< naywhayare> I'm sorry that I maybe sent you in the wrong direction before
< naywhayare> udit: you wanted to talk about the decision stump entropy calculation, right?
< udit> naywhayare: yeah. Did you have time to go through the mail ?
< udit> the last one on the thread.
< udit> oh and about the subvec parameter issue. I did get to that. It didn't give an error when I built it on my machine. But the build is unstable.
< naywhayare> the failures are specifically on i386 systems
< naywhayare> I'm pretty sure the issue is the use of 'long unsigned int'
< naywhayare> line 163
< naywhayare> use size_t not long unsigned int; size_t is not guaranteed to be long unsigned int on all platforms
< udit> hmm. okay. I was going to templatize CountMostFreq as well, but a similar problem was occurring. Maybe size_t should cut it.
< naywhayare> use whatever the type held by sortedLabels is
< naywhayare> which should be size_t
< udit> okay. Now, about the entropy calculation.
< naywhayare> why did you templatize CalculateEntropy to take a type for the labels? that should always be size_t anyway
< udit> I see that now.
< udit> Earlier I was using a rowvec for them, for more precise calculations.
< naywhayare> I don't understand how that makes it more precise; have I overlooked something?
< udit> no, it was my mistake.
< naywhayare> okay
< naywhayare> ah, okay, I understand what happened
< naywhayare> so when I mentioned templatizing, my suggestion would have been more template-y: template<typename VecType, typename LabelType> double DecisionStump<MatType>::CalculateEntropy(const VecType& attribute, const LabelType& labels)
< naywhayare> that way it handles both rowvec and subview_row
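    Spelled out, the suggested signature looks like this (a sketch; only the two template parameters and the argument list come from the discussion above):

        template<typename MatType>
        template<typename VecType, typename LabelType>
        double DecisionStump<MatType>::CalculateEntropy(const VecType& attribute,
                                                        const LabelType& labels)
        {
          // A rowvec and the subview_row returned by rowvec::subvec() both
          // bind here, so the subview class is never named explicitly.
        }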
< udit> actually, subview_row seems as if it has to be handled in a very dicey manner. Either that, or I haven't understood it that well, because there's very little documentation on the Armadillo pages.
< naywhayare> yes, those are kind of "hidden" armadillo classes that aren't documented very well
< naywhayare> I disagree with that decision, to hide them like that
< naywhayare> either way, they should work about the same as a rowvec
< udit> so should I try with size_t as it currently stands ? because I don't see where it'll need to be rowvec the way I'm using it now...
< naywhayare> I'd use my suggestion, to avoid explicitly using the subview_row class
< naywhayare> because that class isn't documented, it's possible that the API may change in the future
< naywhayare> and that class may change name or disappear entirely
< naywhayare> but rowvec::subvec() can reasonably be expected to always return something that acts like a rowvec
< udit> so the typename VecType will take rowvec::subvec ? and LabelType - Row<size_t>::subvec ?
< naywhayare> yeah
< udit> okay, now about the entropy calculation. Did you understand what I talked about in the mail ?
< naywhayare> yeah, I get what you are saying, but I don't see why this much simpler snippet won't act the same way:
< naywhayare> the entropy calculation does not in any way depend on the actual values of the points; only the labels themselves
< naywhayare> unless I have misunderstood the ID3 entropy calculation you are using
< udit> I think you might have. Think about the way oneR works. Entropy calculation might not depend directly upon the values of the points, but it still does depend on which value takes what label in which split/bin.
< udit> Try it the way you would proceed as in OneR.
< udit> Given an attribute, you create/mark bins. Then for each of the bins, you calculate the entropy, which is used to calculate the total entropy of the split.
Anand has quit [Ping timeout: 246 seconds]
< naywhayare> yeah, true -- but for each bin you are calling CalculateEntropy(), and for each bin, the calculation only depends on the distributions of labels and not the attribute
< naywhayare> is that correct?
< udit> so the calculateEntropy function is acting over each bin.
< udit> yeah
< udit> Each attribute's entropy is calculated in setupsplitAttribute.
< udit> this is then compared to the bestEntropy in the constructor.
< naywhayare> right
< udit> now in each bin,
< udit> you need the uniqueAtt to construct the entropyArray, which is used to calculate entropy just for that bin.
< udit> it helps to know the labels of each element and numElem helps calculate the value of p2.
< naywhayare> but in each bin, the entropy does not depend on the actual attribute values; just the labels
< udit> it does.
< udit> upon the distribution.
< naywhayare> H(S) = -sum_{x \in X} p(x) log_2 p(x)
< naywhayare> where X is the set of classes
< naywhayare> and p(x) is the number of points in class x divided by the total number of points in the set S
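    In code, that definition depends only on the per-class counts; an illustrative snippet, assuming counts[c] holds the number of points in the bin with class c:

        #include <cmath>
        #include <vector>

        double Entropy(const std::vector<size_t>& counts, const size_t total)
        {
          double h = 0.0;
          for (size_t c = 0; c < counts.size(); ++c)
          {
            const double p = (double) counts[c] / (double) total;
            if (p > 0.0)  // taking 0 log 0 = 0 skips absent classes
              h -= p * std::log2(p);
          }
          return h;
        }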
< udit> p(x) is for this bin.
< udit> the current bin.
< naywhayare> yes, S is the set of points in the current bin, is it not?
< udit> yes, which changes in each bin, but the classes, numClass here, remains the same.
< naywhayare> well, you have to ignore any classes which aren't present in X anyway otherwise H(S) is infinity
< naywhayare> so I don't think that makes a difference and I'm still not seeing the necessity of uniqueAtt instead of just a count of how many points are in each class
< udit> well, the way you ignore those classes is that p3 is 0 when p2 is 0; you take 0 log(0) to be 0
< naywhayare> yeah, that's fine; that part works just fine
< udit> also, I looked into the oneClass and defaultClass issue too. It works fine. I just needed to map those to using mapping vectors
< naywhayare> I took a look at the references you sent, but this is still not making it clear to me why uniqueAtt is necessary. one of the references mentions that information gain only needs to be computed when the class label changes, but that is considering where the bin cutoff is, not the actual entropy calculation
< naywhayare> I'm sorry that I am kind of hung up on this, but I am concerned that the entropy calculation could be way faster
< naywhayare> and I still am not seeing how the snippet I pastebin'ed is incorrect; if it's correct, it (or something like it) will be significantly faster than the implementation that calls arma::unique()
< udit> okay, take the set: {(1,0),(2,1),(2,1),(1,0),(2,0)}
< udit> here, your code in the pastebin does not take into consideration the entropy difference due to (2,1) and (2,0), that is, 2 having two different labels, 1 and 0
< naywhayare> so, what should the entropy in this case be?
< udit> but while calculating entropy you need to do that.
< naywhayare> by my understanding it should be -(3/5 * log2(3/5) + 2/5 * log2(2/5))
Anand has joined #mlpack
< udit> naywhayare: just a sec.
< udit> naywhayare: okay, I think I got it wrong this whole time. It should not depend upon the discrete values of the attribute points, rather only the labels.
< udit> naywhayare: I just came across an example which confirmed your position.
< naywhayare> ok, thanks for digging into that. I found that most sources I could find on the entropy calculation algorithm were somewhat unclear
< udit> they haven't talked in depth about continuous attributes.
< naywhayare> yeah, that is what I found also
< naywhayare> it made it difficult to figure out how to extrapolate to continuous features
< naywhayare> anyway, if you think my idea is correct, we can refactor CalculateEntropy() significantly, and it won't need the attribute vector anymore
< naywhayare> I think we also need a test case that ensures that the stump splits on the correct dimension, because we have no test case that does that
< naywhayare> so if the stump is splitting incorrectly, it won't throw an error in the tests
< naywhayare> does that seem reasonable? or maybe I have misunderstood one of the tests?
< udit> yeah, one of the tests should also test that.
< udit> okay. good that we got this sorted out.
< udit> now about MergeRanges()
< udit> we still need it right ?
< naywhayare> if it is possible that we will have two adjacent bins with the same label, then yes
< udit> Because, once the bins have been decided, what this effectively does is merge adjacent bins with the same labels.
< udit> yeah.
< naywhayare> does the bin splitting algorithm take that into account?
< naywhayare> i.e. does it ensure that two adjacent bins will have different labels?
< udit> it can't. It just splits based on inputBinSize
< udit> it just splits the attribute. I take the corresponding labels and send them to calculateEntropy
< udit> the begin and end variables in SetupSplitAttribute
< naywhayare> okay, so then we do need MergeRanges()
< udit> now, then. How do we go about supporting a specified number of bins ?
< udit> Any ideas on that, because I thought it would become too complex to get that.
< naywhayare> hang on, let me do a little quick reading
< udit> Even the papers which talked about continuous attributes only talked about the number of points in each bin - not the number of bins.
< udit> yeah.
< naywhayare> fair enough
< naywhayare> looking at the OneR documentation, it doesn't support number of bins either
< naywhayare> so let's just leave it as it is
< naywhayare> although, I guess there is a fairly easy way
< udit> okay, so as of now, I have the templatization, the rewriting of calculateEntropy, that one unit test.
< udit> go on...
< naywhayare> if you have too many bins, just merge bins in a way that minimizes the increase in entropy until you have the right number of bins
< naywhayare> and you can only merge adjacent bins, so I don't think that would take too long
< naywhayare> whether or not you want to do that is up to you. other implementations don't seem to provide that functionality, so if you don't want to, don't worry about it
< naywhayare> but I think that would be probably the simplest and best way of doing it
< udit> that's what I thought initially too, but managing the labels and their respective indices, while at the same time handling the entropy minimization across the feature set, might be complex
< udit> because while going for entropy minimization, the attribute might change due to merging of bins after the attribute has been selected - but I think merging of bins will be done before selection.
< naywhayare> yeah, you could merge the bins before selecting the attribute, and take that into account in the entropy calculation
< udit> okay - I'll think about this, and if I do have time at the end, I'll get it done.
< naywhayare> I had thought that you would merge after selecting the attribute, but you are right that merging after selection could mean that the best attribute wasn't chosen
< naywhayare> anyway, you listed three things -- templatization, CalculateEntropy(), and the unit test
< udit> but the number of ways to merge the bins would be too many - it would be expensive too.
< naywhayare> true; there are many ways to merge the bins. the calculation of the best bin to merge for an attribute should be O(b) time, where b is the number of existing bins
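    Something like this greedy loop is what that O(b)-scan-per-merge suggests; it is entirely a sketch, with EntropyOfUnion() and Merge() as hypothetical helpers over some Bin type:

        #include <limits>
        #include <vector>

        template<typename Bin>
        void MergeToTargetBins(std::vector<Bin>& bins, const size_t targetBins)
        {
          while (bins.size() > targetBins)
          {
            // Find the adjacent pair whose merged bin has the lowest entropy.
            size_t best = 0;
            double bestEntropy = std::numeric_limits<double>::max();
            for (size_t i = 0; i + 1 < bins.size(); ++i)
            {
              const double e = EntropyOfUnion(bins[i], bins[i + 1]); // hypothetical
              if (e < bestEntropy) { bestEntropy = e; best = i; }
            }

            Merge(bins[best], bins[best + 1]); // hypothetical: fold right into left
            bins.erase(bins.begin() + best + 1);
          }
        }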
< naywhayare> my only other thought for how to improve the decision stump code is that maybe TrainOnAtt() and SetupSplitAttribute() have some code duplication, and maybe could be merged or refactored in some way to reduce that?
< naywhayare> I'm not sure about that one
< udit> yeah, so I'll look at it in the end. But as of now, the things that I listed previously - the templatization, the rewriting, and at least one more unit test is what we need to change.
< naywhayare> so maybe it is not possible
< naywhayare> yeah, just those three, and then let me know what you think about TrainOnAtt() and SetupSplitAttribute()
< udit> Yeah, some part of it could be. But that might be slightly more expensive. Let me think about that and I
< udit> *I'll get back to you once I've thought of something valuable.
< naywhayare> okay, sounds good
< udit> I guess that does it, then.
< naywhayare> yeah; I have one more thing I have to do on my end, though
< naywhayare> I need to go through SetupSplitAttribute() and understand how the binning decisions are done
< naywhayare> the comments you added will help, so thank you for committing those
< udit> sure
< udit> I guess you're back now, for good ?
< naywhayare> yeah, I am not traveling anymore
< udit> You also ( or maybe Marcus) need to go through the Perceptron.
< naywhayare> so I should have more time
< naywhayare> yeah, the perceptron is on my list too :)
< naywhayare> this Friday (July 4) is a holiday here, so I will probably be gone that day
< naywhayare> but other than that I don't have any travel plans for the rest of the summer, I think (unless I forgot some)
< udit> Oh, you'll have an extended weekend ! Nice.
< udit> My recollection (from various pop culture references) is that people usually have a barbecue. :D Especially on a 4th of July weekend.
< udit> good stuff.
< udit> I also wanted to talk to you about some other stuff. Maybe later. I'll have dinner and start work again in a while then.
< udit> see you later, then.
< naywhayare> sorry, I got intercepted by someone
Anand_ has joined #mlpack
< naywhayare> and yeah, I am planning to go to the mountains and barbecue :)
Anand has quit [Ping timeout: 246 seconds]
< naywhayare> if you want to talk about any of those other things now, I'll be here for most of the rest of the day (minus some short breaks here and there)
< andrewmw94> naywhayare: I forgot. One reason to have the data in separate matrices was that it allowed easy insertion and deletion. Deleting points is important when constructing R*-trees as they do forced reinsertion. I don't really see a way around this with how the code is set up, but I wanted to double check with you that I should go ahead. I think I recall you saying that rebuilding a tree is generally faster than moving the point
< naywhayare> well, but I think easy insertion and deletion is similarly possible if you're just holding size_t indices instead of actual arma::vec objects
< andrewmw94> how would you do deletion? The best I can think of is to remove the size_t from the vector in the tree, then on the actual data, you swap the arma::vec with the last arma::vec, but that requires updating the size_t in the tree, which requires a query.
< andrewmw94> not too good, but perhaps ok. I'm not sure how it works though, since the tree is being changed and thus the query may not actually work
< andrewmw94> I need to think about that more
< naywhayare> well, you're deleting points from the tree, but not from the actual dataset
< naywhayare> like you said they do forced reinsertion
< naywhayare> so you delete that particular index, then later reinsert it
< naywhayare> maybe that makes sense?
< andrewmw94> ah, yes. I just realized that.
< naywhayare> I don't think it actually really makes sense to have trees of dynamic size
< naywhayare> for mlpack algorithms, every point _has_ to have a unique index
< naywhayare> otherwise the results of, say, nearest neighbor search don't actually make any sense -- they have to return an index of the nearest point
< andrewmw94> yeah
< naywhayare> and if each point has a unique index, and has the same dimensionality (which it has to) then it all fits in a big arma::mat structure anyway
< naywhayare> if a user wants to grow a tree to add points, though, I'm not sure exactly what needs to happen. a call to insert_cols() or something?
< andrewmw94> ok. The changed code passes my tests and valgrind. Now I'm thinking: "that's way too easy. I MUST have broken something."
< andrewmw94> the way it was setup, I had an insertPoint() function that would take an arma::vec
< andrewmw94> but I just changed that
< andrewmw94> the tree can no longer grow
< andrewmw94> well, I guess it technically could. You would have to expand the arma::mat and then you could use insertPoint(size_t)
< andrewmw94> on the new columns.
< andrewmw94> that should work as it used to
< naywhayare> yeah, I think if we ever offer generalized support for adding new points to trees, it's going to have to be along those lines
< naywhayare> expand the matrix, then call a function to grow the tree... or something
< naywhayare> either way, unless you have a solution I don't see, I don't think we need to worry about that use case for now
< andrewmw94> yeah. I'll try getting the traversal working now that the dataset is handled like the other trees
< naywhayare> ok, sounds good. please let me know if you need help
< andrewmw94> then add the ability to delete points, then R*-trees, then X-trees
< naywhayare> traversals and rules are... pretty complex
< andrewmw94> yep. Thanks.
oldbeardo has joined #mlpack
govg has joined #mlpack
govg has quit [Changing host]
govg has joined #mlpack
< oldbeardo> naywhayare: hey, did you get a chance to look at Reg SVD?
< naywhayare> I have not had a chance to look as comprehensively as I would like
< naywhayare> could you go ahead and add the QUIC_SVD class and code to the repository?
< oldbeardo> yes, about that, where should I add it?
< naywhayare> src/mlpack/methods/ seems fine to me; that or src/mlpack/core/lin_alg/
Anand_ has quit [Ping timeout: 246 seconds]
govg has quit [Ping timeout: 272 seconds]
govg has joined #mlpack
< naywhayare> oldbeardo: one other thing -- do you want to integrate QUIC_SVD into CF now, or were you planning to do that as a last part of your project?
< oldbeardo> naywhayare: yes, this is relevant to the point we had discussed before
< oldbeardo> if you remember, I had explained to you why QUIC-SVD will not work very well in the case of sparse matrices
< oldbeardo> so, I'm not sure if should even link it to the CF implementation
< oldbeardo> *we should
udit has quit [Quit: Leaving]
< naywhayare> right, I remember that now
< naywhayare> let me see if I can dig up the logs to refresh my memory more completely
sumedh__ has joined #mlpack
sumedh_ has quit [Ping timeout: 260 seconds]
< naywhayare> I've plotted the singular values of the GroupLens matrix
< naywhayare> although it is very sparse, the image seems to indicate that it can be approximated fairly well by a fairly small number of singular vectors
< naywhayare> I know you tested your implementation against the declaration dataset, and you got good results there
< naywhayare> I suggested that you try adding the average rating for each item to the dataset, but this had a problem because many items only had one rating
< naywhayare> so maybe the better idea is to add the average rating for each user, not each item
< naywhayare> either way, maybe quic-svd just works poorly for that dataset, and that's okay (it's not your fault, it's Michael Holmes' fault :))
< naywhayare> I still think there may be situations in practice where quic-svd is useful, though, so we should at least make sure that CF can work with the QUIC_SVD class
< oldbeardo> okay, I was thinking that it should work well for every dataset, which will certainly not be the case for CF
< naywhayare> yeah. so we shouldn't make QUIC_SVD the default for CF, but we should make it so a user can type CF<QUIC_SVD> and it will work
< naywhayare> also, I responded to your email about the regularized SVD code, so hopefully that is helpful input
< jenkins-mlpack> Starting build #1975 for job mlpack - svn checkin test (previous build: SUCCESS)
< oldbeardo> naywhayare: the problem persists even on taking average ratings for items
< naywhayare> yeah, you said that some time back. did you try average ratings for users?
< oldbeardo> it doesn't matter, that's what I'm trying to say
< naywhayare> oh, ok, you tried both and it failed either way
< oldbeardo> since the matrix is sparse, almost all the columns will be the same whether we take average user ratings or average item ratings
< naywhayare> each column will be similar but not the same; it will have few nonzero values, but at least one
< oldbeardo> yes, but even if the matrix has 100 users/items, the difference in the vectors won't be able to capture the information we need
< oldbeardo> in addition replacing missing values with averages will have a huge impact on the outcome
< naywhayare> yes, definitely that will have a big impact
< naywhayare> so my understanding is that the cosine tree, during construction of a node, randomly samples a point, and then takes the cosine similarity of all other points to that randomly sampled point (or something similar to that)
< naywhayare> then, it splits the points with higher similarity to the left, and lower to the right
< naywhayare> if the matrix is very very highly sparse, it's possible that the cosine similarity between one point and all others is 0
< naywhayare> for instance if a user (where a user is a point) has rated only one movie that nobody else has rated
< oldbeardo> yes, didn't think about that, this is an additional issue
< naywhayare> I would imagine the length-squared sampling (where it more often picks points with higher norms) should help to prevent this
< naywhayare> but it is not guaranteed to
< naywhayare> I'd think that adding row averages or column averages would help with this issue
< naywhayare> you pointed out that with row averages (item averages), the average is often completely wrong because the item was only rated once
< naywhayare> sorry, I miswrote "row averages" in that last message when I meant "column averages"
< naywhayare> I would think that using row averages (user averages) would work better because each user in the GroupLens dataset has rated at least 5 movies
< naywhayare> but you say that that didn't give any better performance
< naywhayare> so I'm trying to think about what is actually going on there
< oldbeardo> well, that's because the cosine similarity is not the issue here
< oldbeardo> the issue is the selected basis vector, which is the centroid of the columns in the node
< oldbeardo> it does not represent the data well, and that is because the columns are very similar to one another after averaging
< oldbeardo> even if that wasn't the case, the obtained SVD won't be representative of the data that we have, since we introduced the average ratings
govg has quit [Ping timeout: 248 seconds]
< naywhayare> I see! thank you
< naywhayare> let me think about this more and wrap my head around it completely
< oldbeardo> sure, I have a question about what you wrote about Reg SVD in the mail
< oldbeardo> the cost function for Reg SVD is a sum over all the examples
< oldbeardo> so when you say that I should use L-BFGS, do you mean I should sum over everything inside a for loop?
< oldbeardo> if yes, then how will the gradient function look? since every user/item parameter vector may be affected by more than one rating
< naywhayare> yeah, you could probably write it all in a for loop
< naywhayare> for the gradient, you call zeros(), then you iterate over all the points, and add each of their contributions
govg has joined #mlpack
govg has quit [Changing host]
govg has joined #mlpack
< oldbeardo> yes, but what will the 'contributions' be?
< naywhayare> regularized_svd_function_impl.hpp, lines 55 and 57
< naywhayare> for each user/item combination
< oldbeardo> in the case of SGD we have a learning rate, and every update is independent of the other
< naywhayare> yeah; L-BFGS is different, but it's still optimizing the same objective function
< oldbeardo> true, but won't adding two separate contributions be different from the actual gradient?
< oldbeardo> or maybe I'm getting confused over something trivial
< naywhayare> the gradient function you have implemented is the partial gradient with respect to one user/item combination
< naywhayare> and that's how SGD works -- when you have a function where you can easily calculate partial gradients, then you iterate over each partial gradient
< naywhayare> but many other optimizers don't work that way and operate on the full gradient
< naywhayare> so my suggestion is, don't calculate just the partial gradient -- calculate the full gradient, all at once, then let L_BFGS take a step based on that information
< oldbeardo> yes, my question is 'is the sum over all the contributions the full gradient?'
< naywhayare> yeah, that is correct
govg has quit [Ping timeout: 264 seconds]
< oldbeardo> okay, that solves it then
< naywhayare> it is possible that when you calculate them all at once, you can find a more optimized implementation, but summing them all up works too
< naywhayare> it's also possible that L_BFGS won't perform as well as SGD, but we'll see :)
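    The full-gradient idea in sketch form, as a method of the objective-function class: zero the gradient, then accumulate every rating's partial contribution (structure only; AddPartialGradient() and numRatings are hypothetical stand-ins for the per-rating terms already present in the SGD implementation):

        void Gradient(const arma::mat& parameters, arma::mat& gradient) const
        {
          gradient.zeros(parameters.n_rows, parameters.n_cols);

          // The full gradient is the sum of the partial gradients over every
          // (user, item, rating) triple; L-BFGS then steps on the whole thing.
          for (size_t i = 0; i < numRatings; ++i)
            AddPartialGradient(i, parameters, gradient);
        }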
< oldbeardo> I will try it out tomorrow, also, do you follow football?
< oldbeardo> oh sorry, *soccer
< sumedh__> I do :)
< sumedh__> late night match is not that good though...
< oldbeardo> sumedh__: heh, which team are you supporting?
< naywhayare> oldbeardo: sumedh__: I watched some of the world cup games when I was eating; they had them on TV. but I haven't really followed very much more than that :(
< sumedh__> right now Netherlands.... they are playing with exceptional chemistry... and robben...
< sumedh__> ohh... me and my friends usually watch every match together... today's match was boring...
< oldbeardo> naywhayare: you may want to eat after around 45 minutes :)
govg has joined #mlpack
< oldbeardo> sumedh__: Netherlands is a nice choice
< naywhayare> oldbeardo: hah; but I already had lunch. :( I guess things are exciting, so maybe I should take a second lunch or something :)
< oldbeardo> naywhayare: heh, maybe you should, USA is playing Belgium
< sumedh__> oldbeardo: very risky though... they play constant counter ... any strategic manager like mourinho will make them cry... but its not the case in this world cup is it :)
< sumedh__> anyway which team do you support??
< sumedh__> naywhayare: hehe :)
< oldbeardo> sumedh__: I have always been a Germany supporter, mainly because of Klose and Schweinsteiger
< sumedh__> yesterday's match was very close....
< oldbeardo> yes, really poor defense from Germany's side
< sumedh__> i am sad that muller didn't get a goal...
< sumedh__> yeah totally... in the first half German defense was very poor...
< sumedh__> oldbeardo: hats off to Neuer... really... great defending...
< oldbeardo> yeah, anyway, I have to go, see you later sumedh__ naywhayare
oldbeardo has quit [Quit: Page closed]
< sumedh__> naywhayare: getting weird results with that function... index keeps on increasing...
< sumedh__> means residue keeps on increasing...
< naywhayare> residue is increasing... that shouldn't be happening...
< naywhayare> arma::norm(V - W * H, "fro"), right?
< sumedh__> yes exactly...
< jenkins-mlpack> Project mlpack - svn checkin test build #1975: SUCCESS in 1 hr 14 min: http://big.cc.gt.atl.ga.us:8080/job/mlpack%20-%20svn%20checkin%20test/1975/
< jenkins-mlpack> * Ryan Curtin: Very simple change to fix build on i386.
< jenkins-mlpack> * andrewmw94: change the tree to store size_t in the nodes and keep the dataset together. Other misc changes.
< naywhayare> is it increasing a lot or a little each iteration?
< sumedh__> starts with 4000... and increases in 4 or 5 iterations to 10000 and stops... I checked it with simple residue termination... it's doing fine...
< andrewmw94> naywhayare: so if I keep the dataset as it originally is, does that mean I can just remove the unmapping code for the R tree?
< naywhayare> andrewmw94: yeah, if you don't modify the dataset at all, then the unmapping code isn't necessary
< naywhayare> in the TreeTraits for the R tree, just set RearrangesDataset to false
< naywhayare> and then you shouldn't need to mess with any of the algorithm classes
< andrewmw94> ok. But there's code in the allknn file to handle unmapping. Can I just remove that when I copy stuff for the R tree?
< naywhayare> yeah, or you can ignore it with an if statement
< naywhayare> it still needs to be there for the BinarySpaceTree
< naywhayare> but you are right that it's unnecessary for the R tree
< andrewmw94> ok, thanks
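    The TreeTraits change mentioned above would look roughly like this; the RearrangesDataset member comes from the discussion, but the sketch is written as if RectangleTree took no template parameters, which the real tree likely does:

        // Specialization of TreeTraits for the R tree: the dataset is left
        // exactly as the user passed it in, so algorithms like allknn can
        // skip the unmapping step entirely.
        template<>
        class TreeTraits<RectangleTree>
        {
         public:
          static const bool RearrangesDataset = false;
        };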
andrewmw94 has quit [Quit: Leaving.]