sumedhghaisas has quit [Remote host closed the connection]
sumedhghaisas has joined #mlpack
udit_s has joined #mlpack
< jenkins-mlpack> Project mlpack - nightly matrix build build #460: STILL UNSTABLE in 1 hr 53 min: http://big.cc.gt.atl.ga.us:8080/job/mlpack%20-%20nightly%20matrix%20build/460/
< jenkins-mlpack> * Ryan Curtin: Fix comment and clarify that it's pertaining to the runtime constructor, not the
< jenkins-mlpack> template parameters.
< jenkins-mlpack> * andrewmw94: Add files and some preliminary code for R tree
sumedhghaisas has quit [Remote host closed the connection]
sumedhghaisas has joined #mlpack
sumedhghaisas has quit [Ping timeout: 265 seconds]
koderok has joined #mlpack
koderok has quit [Ping timeout: 252 seconds]
oldbeardo has joined #mlpack
andrewmw94 has joined #mlpack
< oldbeardo> naywhayare: I have a question about Least Squared sampling
< naywhayare> oldbeardo: ok, go ahead
< oldbeardo> if you could open the QUIC-SVD paper and look at algorithm 3
< naywhayare> ok, I'm looking at it
< oldbeardo> okay, in step 1 the algorithm mentions sampling 's' rows from the matrix
< oldbeardo> are these 's' rows to be different, or is repetition allowed?
< naywhayare> hmm... hang on, let me do a little digging
< oldbeardo> okay, I looked at the old code, they have calculated the actual error over there
< naywhayare> so, they calculated error with all points, not just a sampled subset?
< oldbeardo> I was wrong, it's not the actual error, but doesn't look like what's mentioned in the paper either
< naywhayare> hang on, I am reading through it now
< oldbeardo> look at QuicSVD::addBasisFrom() and QuicSVD::curRelErr()
< naywhayare> yeah, that's what I'm looking through
< naywhayare> it looks to me like the code that's there is similar to what MCSqError() would be if every row was sampled
< oldbeardo> right, that doesn't help though
< naywhayare> also, by the way, it's 'length-squared sampling' not 'least-squared sampling'
< naywhayare> but you're right, nothing I've said has helped yet :)
< naywhayare> I'm trying to figure out which of these we should use
< naywhayare> ok, I think I understand
< oldbeardo> sorry, 'least-squared' is a more commonly used term I suppose
< naywhayare> in their paper, each row is a point; in the mlpack implementation (and I think in the original implementation), each column is a point
< naywhayare> so when they calculate MCSqError(), it takes O(N) time (where there are N points)
< naywhayare> sorry, that last statement was unclear
< naywhayare> what I meant to say was, when they call curRelError() (which is their implementation in the old code of something like MCSqError()), it takes O(N) time (where there are N points)
< naywhayare> but in the paper, because they are sampling O(log N) points, the operation takes much less time
< oldbeardo> I understand that
< naywhayare> so I would say that implementing MCSqError() as implemented in the paper would be the way to go, even though Michael Holmes didn't do that in his original code...
< oldbeardo> my question is about the sampled points, rather than the computation of the error
< naywhayare> right, I was getting back to that eventually :)
< naywhayare> I don't know if length-squared sampling is with or without replacement
< naywhayare> reference 3 ("Fast Monte-Carlo Algorithms for Finding Low-Rank Approximations") is where it was introduced, so I'm taking a look there
Anand has joined #mlpack
< naywhayare> ok, it looks like sampling is performed with replacement; that's a relief, because sampling without replacement is slower and harder to implement
< naywhayare> here is my source:
< naywhayare> section 1.2, subheading 'Algorithm'
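A minimal sketch of the length-squared sampling with replacement discussed above, assuming points are stored as columns of an Armadillo matrix (as in mlpack); the function name and structure are illustrative only, not mlpack's actual QUIC-SVD code:

    #include <mlpack/core.hpp>

    // Draw s column indices with replacement, each column sampled with
    // probability proportional to its squared L2 norm (length-squared sampling).
    arma::uvec LengthSquaredSample(const arma::mat& data, const size_t s)
    {
      // Squared norm of each column, normalized into a probability distribution.
      arma::vec probs = arma::trans(arma::sum(arma::square(data), 0));
      probs /= arma::accu(probs);

      // Cumulative distribution for inverse-transform sampling.
      const arma::vec cdf = arma::cumsum(probs);

      arma::uvec indices(s);
      for (size_t i = 0; i < s; ++i)
      {
        const double u = mlpack::math::Random(); // Uniform in [0, 1).
        const arma::uvec hit = arma::find(cdf > u, 1);
        // Repetition is allowed; the same column may be drawn more than once.
        indices[i] = hit.is_empty() ? (data.n_cols - 1) : hit[0];
      }
      return indices;
    }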
< naywhayare> thinking of Michael Littman, here he is singing Thriller: https://www.youtube.com/watch?v=DQWI1kvmwRg
Anand has quit [Ping timeout: 240 seconds]
< naywhayare> oldbeardo: when you write your code, if you can add a comment referencing that paper ("Efficient Singular Value Decomposition via Improved Document Sampling") to mention why replacement is being used, it'd probably be useful
< naywhayare> otherwise someone will come along a few years from now and ask the same question and nobody will remember why :)
< oldbeardo> sure, will do
< oldbeardo> by the way, I saw that video, got quite a lot of upvotes on Reddit :)
< naywhayare> I hadn't seen it until recently
< naywhayare> they do the recording of Littman and Isbell's machine learning class here at Georgia Tech
< naywhayare> and because I go to Isbell's lab meetings, I sometimes hear about how those recording sessions go... apparently they can never stop laughing and have a really hard time being serious
< oldbeardo> I know, it's for the online Masters in CS
< naywhayare> but I didn't know they were recording music videos too! :)
< oldbeardo> heh, from what I have heard that class is quite entertaining
< andrewmw94> For the R-type trees, I'd like to be able to add and remove points, since R* trees and X trees don't make too much sense without that. However, I'm a bit confused on how the data in the tree is stored. It seems to be in the data matrix, and then data is moved around as appropriate for the tree
< naywhayare> I've never seen any of the lectures; I know that the TAs have to do a ton of work because there are so many students
< andrewmw94> but if that is the case, it seems like deletion would have to take O(n) time
< andrewmw94> since the points in the data matrix seem to be contiguous
< oldbeardo> naywhayare: how many are we talking about?
< naywhayare> andrewmw94: for the binary space tree, the data is reordered so that all the points held in a leaf node are contiguous in memory
< naywhayare> oldbeardo: I think several hundred? I'm not certain. it seemed like way more work than a regular TA job
< andrewmw94> but then I can't delete arbitrary points in logarithmic time
< naywhayare> andrewmw94: right, I see what you mean, but you wouldn't be able to do that anyway, in general
< naywhayare> if you're holding a data matrix and you drop a point out of it, you have to copy the whole matrix anyway
< andrewmw94> naywhayare: how common is it to add points dynamically during a knn algorithm?
< andrewmw94> naywhayare: I know it's used in robocode, and it must be fairly common since certain tree types developed around it
< naywhayare> andrewmw94: usually this is not done. because rebalancing kd-trees (and cover trees) can be very time-consuming, it's often easier to throw away the whole tree and rebuild it
< naywhayare> but it also depends on how the tree is rebalanced, or if it is, after a new point is added
< andrewmw94> naywhayare: but R trees are always balanced, it's a question of how bad the rectangles are
< andrewmw94> they're like B trees. But if I have to keep the points in the matrix, then I'll need to keep them contiguous
< naywhayare> balanced in what sense?
< andrewmw94> all leaf nodes are on the same level
< andrewmw94> they may not be equally filled
< naywhayare> right
< naywhayare> so insertion of a single point can incur O(log N) rebalancing, right?
< andrewmw94> yes
< andrewmw94> but with this, I'd need to shift all of the points to the right, which makes it much slower
< naywhayare> I agree
< naywhayare> hang on, let me do a little reading and thinking
< andrewmw94> yeah, I think I'm going to go pace around for a bit
< naywhayare> :)
< andrewmw94> the first thing that comes to mind is to store the index of each point in the matrix in each leaf node of the tree
< naywhayare> so, one thought is this: my understanding of the R* construction procedure is that points are added and removed during the construction process
< naywhayare> but in the end you have a finalized tree structure of some sort
< naywhayare> so you could delay making the points contiguous in memory until the final tree is built
< oldbeardo> naywhayare: one more question for your consideration, you said that Mudit's implementation was row major, so instead of sampling rows I'm sampling columns, shouldn't make a difference right?
< naywhayare> or, you could do your idea -- just store the index of the points that are held in each leaf (this is basically what the cover tree does)
< naywhayare> oldbeardo: yes, each point is a column, so you should be sampling columns. you'll have to transpose the notation of just about everything in the paper...
Anand_ has joined #mlpack
< oldbeardo> naywhayare: okay, thanks
< naywhayare> andrewmw94: the problem with not having points contiguous in memory is that when you come to inner loops that loop over each point in a leaf, the memory will be accessed less efficiently
< naywhayare> but, I don't know how much of a difference that really makes
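For reference, a rough sketch of the index-based alternative andrewmw94 suggests above (much like what the cover tree does): the leaf stores only column indices into the original dataset, so nothing is copied, but iterating over a leaf then touches scattered columns. All names here are hypothetical:

    #include <mlpack/core.hpp>

    // Hypothetical leaf that stores only the indices of its points.
    struct IndexLeaf
    {
      arma::uvec pointIndices; // Column indices into the original dataset.
    };

    // Any inner loop over the leaf's points accesses non-contiguous columns of
    // 'data', which is what hurts cache and prefetch behavior.
    double SumOfNorms(const arma::mat& data, const IndexLeaf& leaf)
    {
      double sum = 0.0;
      for (size_t i = 0; i < leaf.pointIndices.n_elem; ++i)
        sum += arma::norm(data.col(leaf.pointIndices[i]), 2);
      return sum;
    }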
oldbeardo has quit [Quit: Page closed]
< andrewmw94> yeah
< andrewmw94> actually, I think that's a fairly big deal
< andrewmw94> since one of the advantages of R trees is the memory access
< naywhayare> I'd be interested to see numbers, but, also, after thinking about it a bit, it would be easy to take a non-contiguous tree of any type and rearrange the points
< andrewmw94> but on the other hand, they would need to have a lot of data before paging actually became an issue
< naywhayare> I think you'd see speedup just as a result of linear memory access vs. random memory access
< naywhayare> I'm reminded of a paper... let me find it
< naywhayare> Ulrich Drepper has written a couple of articles like this... very useful
< naywhayare> (although they are super long)
< naywhayare> let me find the section I am looking for...
< naywhayare> ok, I think Figure 3.15 on page 23 is what I was looking for
< naywhayare> or, the whole of section 3.3.2, if you like reading extremely verbose documents :)
< andrewmw94> nothing could be more fun
< naywhayare> for our particular situation, I don't know how much speedup we'd see by making points contiguous, but this is something we can do entirely independently of tree construction
< naywhayare> haha
Anand_ has quit [Ping timeout: 240 seconds]
< andrewmw94> naywhayare: yeah, I think the memory speedup would be an issue. The wikipedia page says the R tree is designed for storage on disk. I don't see why its design is more disk-specific than a kd-tree
< andrewmw94> but it seems like the memory would be an issue
< andrewmw94> reordering the points should be fine for bulk loading, but I'm not sure how it would work for dynamic insertions
< andrewmw94> are you certain that dynamically building the trees is rare? It sounds as though that is the main reason for R*trees (http://en.wikipedia.org/wiki/R*-tree)
udit_s has quit [Quit: Ex-Chat]
< andrewmw94> what is the difference between BinarySpaceTree::FurthestPointDistance() and BinarySpaceTree::FurthestDescendantDistance()
< naywhayare> tneilc cri ym htiw gnorw si gnihtemos ,hu
< naywhayare> sdrawkcab si gnihtyreve
< naywhayare> everything is backwards
< naywhayare> I think I need to logout and login
< naywhayare> ok, so, that was the weirdest thing I've had happen to me recently. my phone SSH client was connected but malfunctioning, and for some combination of reasons it was causing everything I typed into irssi to be entered backwards
< naywhayare> anyway, now I can answer your questions...
< naywhayare> FurthestPointDistance() returns the distance between the node's centroid and the furthest point held in that node (not held in children or descendants)
< naywhayare> whereas FurthestDescendantDistance() is the distance between the node's center and the furthest point held in that node or any descendants
< naywhayare> the /names
< naywhayare> excuse me... I am failing at typing
< naywhayare> the dynamic building of trees is rare in the context of kd-trees, where it is faster to just rebuild the entire tree than it is to do dynamic insertions or deletions
< andrewmw94> but aren't they both based on the boundaries, not the actual points?
< andrewmw94> sorry, too slow with my typing
< andrewmw94> I meant for the furthestPoint thing
< naywhayare> for the case of kd-trees, yeah, it is easier to calculate a possibly loose bound based on the boundaires
< naywhayare> *boundaries
< andrewmw94> that's what it looks like the code is doing
< andrewmw94> but unless I'm missing something, they should both be the same then
< naywhayare> I think FurthestPointDistance() is implemented incorrectly... the if(IsLeaf()) should be if(!IsLeaf()), I think
< andrewmw94> ahh
< andrewmw94> I thought the comment was wrong
< naywhayare> yeah... and the test is wrong in the exact same way
< naywhayare> for a non-leaf node, the furthest point distance should be zero because it holds no points
< andrewmw94> so is the comment or the code correct? because I changed the comment and kept the code
< andrewmw94> ahh
< andrewmw94> ok
< naywhayare> I'm testing and committing a fix now... sorry about that
< andrewmw94> but doesn't the MaxDistance function give the diagonal when called with that argument
< andrewmw94> that doesn't make sense either
< andrewmw94> it should at least be 0.5*diam
< andrewmw94> rather than diam
< andrewmw94> if I am correct on what it does
< naywhayare> this is in FurthestPointDistance()?
< andrewmw94> yes
< naywhayare> ok, I think you are right again... it should be multiplied by 0.5
< andrewmw94> reading the code in hrectbound_impl.hpp, it seems like the return in FurthestPointDistance() just gives the diam
< naywhayare> I must have been very tired when I wrote that
< andrewmw94> happens to me too :)
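Putting the two fixes together, the corrected logic would look roughly like the sketch below. The exact mlpack member function signature is not copied here; this is a free-function sketch over any tree type providing IsLeaf() and Bound(), showing only what was agreed on above: zero for non-leaf nodes, and half of the bound's diameter as a (possibly loose) upper bound for leaves.

    #include <mlpack/core.hpp>

    // Sketch of the corrected FurthestPointDistance() logic (illustrative; the
    // real member function lives in BinarySpaceTree).
    template<typename TreeType>
    double CorrectedFurthestPointDistance(const TreeType& node)
    {
      // A non-leaf node holds no points itself, so the distance is zero.
      if (!node.IsLeaf())
        return 0.0;

      // bound.MaxDistance(bound) gives the diameter of the bounding
      // hyperrectangle; half of it bounds the distance from the node's
      // centroid to its furthest point.
      return 0.5 * node.Bound().MaxDistance(node.Bound());
    }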
Anand has joined #mlpack
< naywhayare> also, thinking about on-disk matrices and trees built on disk-resident datasets, I have a guy working to extend arma::mat to work with on-disk data by using mmap()
< naywhayare> so when he gets that working, it is probably reasonable to create trees using mmap()-ed arma::mat objects and see how they perform
< andrewmw94> yeah
< naywhayare> at least, I think it will work without any modification being necessary...
< andrewmw94> I'm not sure whether I should just write it for bulk loading and hope I can get dynamic insertions working later, or whether I should keep trying to think of a way to get dynamic insertions without copying the whole matrix, while keeping the data contiguous and keeping insertions logarithmic
< andrewmw94> it seems like even trying to periodically reorder the data would break the prefetching of data from RAM
< andrewmw94> so keeping it contiguous is really important
< andrewmw94> copying it is definitely the easiest, and if they are actually inserting points dynamically it isn't that bad. Perhaps there should be two different versions?
< andrewmw94> a bulk loading one and a dynamic one?
< naywhayare> so, one of the problems here is that all the trees in mlpack are built on an arma::mat object
< naywhayare> but resizing the arma::mat object, no matter how it's done, requires a complete copy of the matrix because realloc() isn't being used (new/delete are used)
< naywhayare> and even if realloc() was being used that doesn't guarantee that a copy won't be necessary
< andrewmw94> or maybe we could just say that since the X tree is a superset of the R* tree, which only makes sense in a dynamic situation, and there's no reason to use an R tree instead in those situations, the R* tree will be dynamic and quite different from the binary space tree?
< andrewmw94> and the same for the X tree?
< naywhayare> yes, you could write a tree that doesn't hold an arma::mat but instead actually holds the arma::vec in each node
< andrewmw94> hmm, actually the X tree could be nice even when bulk loading
< andrewmw94> but that should be really rare
< naywhayare> let me take some time to think about how to do this (like, a day or a few days)
< naywhayare> I need to think about how the current abstractions might need to be modified to support dynamic trees
< naywhayare> currently, trees generally don't hold a reference to the dataset, so when you run AllkNN or something, the tree is separate from the dataset, and you need to pass both
< andrewmw94> yeah. I need to learn more about how bulk loading of R trees is actually done. I've never done bulk loading in robocode
< andrewmw94> I think that the main reason someone would want an R*tree or supersets would be if they were dynamically adding observations
< naywhayare> this allows the tree to be smaller in memory, because nodes don't need to hold a reference to the dataset -- they only hold the indices of the point held in each node
< andrewmw94> or if they were ignorant
< andrewmw94> which can't be ruled out
< andrewmw94> but if bulk loading, then the way you do it makes the most sense
< naywhayare> I think it would be nice to provide dynamic tree support, but I'm not sure how to do that without revising the current tree abstractions in a way that will make them slower
< andrewmw94> yeah. I think it makes sense to have the Rectangle type trees separate from the others
< andrewmw94> since they can hold more than just points and are generally different
< naywhayare> perhaps, but it's worth remembering that mlpack doesn't really support anything that's not points, and also that the rectangle trees should be able to work with the existing dual-tree algorithms
< andrewmw94> but machine learning applications will usually just be points
< naywhayare> depending on the application, yeah. for range search / proximity-type problems, it'll almost always be "just points"
< andrewmw94> would you expect it to be used for pathfinding-type stuff, like PRMs and RRTs?
< andrewmw94> because that's a common real world application that requires dynamic point insertion, but it's not really machine learning
< andrewmw94> in my opinion
< naywhayare> not familiar with PRM RRTs
< andrewmw94> it's just a path planning algorithm that's really common
< naywhayare> I sort of see mlpack's dual-tree algorithms as black-box solvers to problems like nearest neighbor, range search, etc.
< naywhayare> so a user passes in a data matrix and says "I need the 3 nearest neighbors of every point", then mlpack builds the tree, finds the neighbors, destroys the tree, and returns the neighbors
< naywhayare> in which case dynamic trees aren't necessary
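The black-box flow described above, sketched against the mlpack 1.x neighbor search interface (the class and method names are assumed from that era and may differ in other versions):

    #include <mlpack/core.hpp>
    #include <mlpack/methods/neighbor_search/neighbor_search.hpp>

    int main()
    {
      // Points are columns of the matrix.
      arma::mat data;
      mlpack::data::Load("dataset.csv", data, true);

      // Constructing the AllkNN object builds the tree internally.
      mlpack::neighbor::AllkNN allknn(data);

      // "I need the 3 nearest neighbors of every point."
      arma::Mat<size_t> neighbors;
      arma::mat distances;
      allknn.Search(3, neighbors, distances);

      // The tree is destroyed with the AllkNN object; the results remain.
      return 0;
    }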
< andrewmw94> yeah, that's actually really relevant to PRM
< andrewmw94> but not to RRT
< naywhayare> but, mlpack could also be used to create a tree, find nearest neighbors, modify the tree slightly, find nearest neighbors again, and so on and so forth
< naywhayare> which might be useful in the context of video searching, or something like that, where frames tend to change slowly
< naywhayare> or databases that grow over time... or something like that
< andrewmw94> exactly
< naywhayare> in current code, as you've seen, this isn't really all that feasible because there are no single-point insertions or deletions
< andrewmw94> yeah, and the matrix doesn't seem like it could possibly support that
< naywhayare> yes, but I'm trying to think of some workarounds
< naywhayare> the BinarySpaceTree class doesn't hold a reference to the dataset it's built on
< naywhayare> but if it did, then each node could reference a different dataset -- and these different datasets could consist of only what's in the node
< naywhayare> which gets you leaf nodes that have points that are contiguous in memory, and also insertions/deletions are nowhere near expensive, because your cost for modifying a leaf is at most O(d * (maximum number of points in a leaf)) where d is the dimensionality
< naywhayare> but the drawback here is now that sizeof(BinarySpaceTree<...>) is 8 bytes larger (or 4 on entertainingly ancient systems)
< naywhayare> adding 8 bytes to each node doesn't scare me; that shouldn't be a problem because usually the number of nodes is nowhere near the number of points (cover trees are an exception, but I'll ignore that for now)
Anand has quit [Ping timeout: 240 seconds]
< naywhayare> but this means that all of the dual-tree algorithms have to be refactored; lines like 'querySet.col(queryNode.Point(i))' turn into 'queryNode.Dataset().col(queryNode.Point(i))'
< naywhayare> or something like that
< naywhayare> I *think* this will incur slowdown, but I'm not sure how much or if it really matters
< naywhayare> if the slowdown is negligible, then great, we have a solution, but I'm not sure
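An illustrative before/after of that refactoring; NumPoints() and Point() follow the existing tree interface, while Dataset() is the hypothetical per-node accessor being discussed:

    #include <mlpack/core.hpp>

    // Current style: the algorithm carries a separate reference to the query
    // set, and the node only stores indices into it.
    template<typename TreeType>
    void BaseCases(const arma::mat& querySet,
                   TreeType& queryNode,
                   const arma::vec& referencePoint)
    {
      for (size_t i = 0; i < queryNode.NumPoints(); ++i)
      {
        const double dist = arma::norm(
            querySet.col(queryNode.Point(i)) - referencePoint, 2);
        (void) dist; // ...update neighbor bounds here...
      }
    }

    // With per-node dataset references, the node hands back its own data and
    // the extra querySet argument disappears.
    template<typename TreeType>
    void BaseCases(TreeType& queryNode, const arma::vec& referencePoint)
    {
      for (size_t i = 0; i < queryNode.NumPoints(); ++i)
      {
        const double dist = arma::norm(
            queryNode.Dataset().col(queryNode.Point(i)) - referencePoint, 2);
        (void) dist;
      }
    }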
< andrewmw94> but do you have to copy each dataset referenced by each node?
< andrewmw94> or are these still in the original matrix?
< naywhayare> let me use pastebin to sketch out my ideas, hang on...
< naywhayare> I guess I should have clarified a bit... for the BinarySpaceTree, every single node in the tree has the exact same value of arma::mat& dataset -- each node references the same dataset
< andrewmw94> ahh
< andrewmw94> that makes sense, we still can't add and delete efficiently in BSP trees, but it makes the R tree work and keeps them similar
< andrewmw94> the only thing is the copying of the data for the R tree
< andrewmw94> but if we want to keep it contiguous and keep insertions that seems necessary
< naywhayare> tree construction time is usually O(N log N) so I'm not too concerned with an O(N) cost
< andrewmw94> and it's a one time thing
< naywhayare> or, rather, I'm not too concerned with a one-time O(N) cost -- yeah
< andrewmw94> and I assume that these would only be in the leaf nodes
< naywhayare> what is nice is that the points don't need to be in any specific ordering in a leaf, so simply by adding the points to a node's local data matrix, we get points that are contiguous in memory
< naywhayare> yeah, unless you're working with a tree type that holds points in non-leaf nodes
< naywhayare> but I think all the R-tree variants hold points only in the leaves
< andrewmw94> ordering them in an R tree is weird anyways, since there isn't a split dimension
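A rough sketch (all names hypothetical) of the per-leaf local matrix idea above: the leaf copies its points into a small local arma::mat, keeping them contiguous, and remembers the original column indices so results can be mapped back to the user's dataset.

    #include <mlpack/core.hpp>

    class RectangleTreeLeaf
    {
     public:
      // Copy the selected columns of the original dataset into local storage
      // so the leaf's points are contiguous in memory.
      RectangleTreeLeaf(const arma::mat& dataset,
                        const arma::uvec& pointIndices) :
          localData(dataset.cols(pointIndices)),
          originalIndices(pointIndices) { }

      // Inserting a point costs at most O(d * (points in leaf)): one column
      // copy plus a possible reallocation of the small local matrix.
      void InsertPoint(const arma::vec& point, const size_t originalIndex)
      {
        localData.insert_cols(localData.n_cols, point);
        originalIndices.resize(originalIndices.n_elem + 1);
        originalIndices[originalIndices.n_elem - 1] = originalIndex;
      }

      const arma::mat& Dataset() const { return localData; }
      const arma::uvec& Indices() const { return originalIndices; }

     private:
      arma::mat localData;        // This leaf's points, contiguous in memory.
      arma::uvec originalIndices; // Column indices into the user's dataset.
    };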
Anand_ has joined #mlpack
Anand_ has quit [Ping timeout: 240 seconds]
< andrewmw94> naywhayare: also, I don't recall if I already mentioned this, but it seems like the FurthestPointDistance() function should check against the actual points, not the bounding box when it is in a leaf node. I'm not sure if the bounding box is actually set in a leaf node
< andrewmw94> naywhayare: I think it's using the constructor on line 137, which doesn't seem to set the bound
< andrewmw94> also, is my SVN not working or have you not committed the fix you mentioned yet?
< naywhayare> oh... I started the test then started doing other things and forgot about it
< naywhayare> I seem to do that often
< naywhayare> the check for FurthestPointDistance() is against the bounding box and not actual points because it's much quicker that way
< naywhayare> against the bounding box, the check time is O(d), but against all the points, the check time is O(d * (number of points)), which can be much larger especially given how often those functions are called in dual-tree algorithms
< naywhayare> it is possible that by giving an upper bound for FurthestPointDistance(), some algorithms might break, but I haven't yet encountered that problem
< naywhayare> so I've figured so far that if the problem does arise later it can be dealt with later :)
< andrewmw94> but are you sure that it works
< andrewmw94> the constructor doesn't seem to set the bound
< andrewmw94> and the way it usually seems to be set is in the SplitNode() function
< andrewmw94> I think the constructor is on line 137
< naywhayare> fixes from earlier today committed in r16523 and r16524
< naywhayare> the bound is set when the node is constructed; otherwise, it should default to an empty bound with no volume
< naywhayare> sorry, what I said was unclear. you are right that the bound is set when the node is constructed in SplitNode()
< naywhayare> so I meant "constructed" in the sense of tree-building, not C++ object construction time :)
< andrewmw94> but doesn't that only set the bound of the parent nodes?
< andrewmw94> I think the leaf nodes just have their constructors called (lines 641-644)
< andrewmw94> and these seem to be the constructors on 137, which I think set the bound to an empty bound of the correct dimensions
< naywhayare> when a leaf is constructed, the constructor on line 137 is called, like you pointed out
< naywhayare> then SplitNode() is called
< naywhayare> and the bound is expanded; however, the function returns at line 637
< naywhayare> because the leaf can't be further split
< andrewmw94> ahh
< naywhayare> line 615 is the one that extends the bound to the right size
< naywhayare> now... I suppose it's possible that during that calculation, the true furthest descendant distance could be cached
< naywhayare> without much (or any) extra overhead
< andrewmw94> yeah. The way it is now also seems like it should be equivalent to using furthestDescendantDistance
< andrewmw94> since you call bound.MaxDistance(bound)
< andrewmw94> which is, I think, the same as bound.Diam()
< naywhayare> yeah
< naywhayare> the thing about using furthestDescendantDistance to prune is that it implies a ball of radius furthestDescendantDistance
< naywhayare> instead of a potentially much less voluminous hyperrectangle
< andrewmw94> but don't we prune with a function that takes the query point anyways?
< andrewmw94> which reminds me of another thing that was bothering me before bed. Is the pruning safe for all LMetrics?
< andrewmw94> it doesn't seem like it could be if p is between 0 and 1 (exclusive)
< andrewmw94> but then again, that probably isn't a commonly used metric
< naywhayare> so when you're talking about pruning, it implies that you're solving a particular problem or have a particular application
< naywhayare> is this for nearest neighbor search?
< andrewmw94> yes
< naywhayare> ok
< andrewmw94> because I don't think the triangle inequality holds for L metrics when 0 < p < 1
< naywhayare> the dual-tree algorithm for nearest neighbor search in mlpack/methods/neighbor_search/ should be safe for any valid metric, not just l-metrics
< naywhayare> yeah; in that case, the L-metric isn't actually a metric
< naywhayare> but I don't think you can build an L-0.5 tree anyway because the template parameter to HRectBound has to be an int
< naywhayare> (due to C++ language restrictions on the types of template parameters)
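A quick illustration of that restriction, assuming mlpack's LMetric template with the power as an integer template parameter:

    #include <mlpack/core.hpp>
    #include <mlpack/core/metrics/lmetric.hpp>

    int main()
    {
      mlpack::metric::LMetric<1> manhattan;    // p = 1: a valid metric.
      mlpack::metric::LMetric<2> euclidean;    // p = 2: a valid metric.
      // mlpack::metric::LMetric<0.5> invalid; // Won't compile: the power is
                                               // an integer template parameter.

      arma::vec a("0 0"), b("3 4");
      // Evaluate() gives the distance under the chosen power (whether the
      // p-th root is taken depends on the TakeRoot template parameter).
      const double d = euclidean.Evaluate(a, b);
      (void) manhattan;
      (void) d;
      return 0;
    }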
< naywhayare> I didn't address one of your questions:
< naywhayare> 18:43 < andrewmw94> but don't we prune with a function that takes the query point anyways?
< naywhayare> in a single-tree case, yes, the function Score(point, node) is called; but still in that case it's often better for kd-trees to use a hyperrectangle bound instead of a ball with radius furthestDescendantDistance
< naywhayare> and that's also true in the dual-tree case where the function Score(node, node) is called
< andrewmw94> but since we are using a query case, it seems like it doesn't matter that a hyperrectangle is less voluminous than a hypersphere, since we are using a different function to get the worst case distances between them
< andrewmw94> and a point or another volume as needed
< naywhayare> I'm not sure I understand what you mean
< andrewmw94> so, we are using a volume or point when we want to prune some part of the tree. This, I assume, uses a different function which can handle the difference between a hyperrectangle and a hypersphere
< andrewmw94> right?
< naywhayare> so, considering the single-tree case, the basic flow of any single-tree algorithm is this: traverse the tree, visit a node, score/prune the node, do base cases between the query point and any points of the node
< naywhayare> in mlpack's implementation of these Score() functions, it's better to use MaxDistance(point, node) instead of distance(point, node.Centroid()) + node.FurthestDescendantDistance()
< naywhayare> or MinDistance(point, node) instead of distance(point, node.Centroid()) - node.FurthestDescendantDistance()
< naywhayare> and the reason for this is that MinDistance() and MaxDistance() are functions provided by the tree type, so they can take into account special shapes of the bounding shape of the node
< naywhayare> whereas the other code I wrote assumes that each node is a ball centered at node.Centroid() with radius node.FurthestDescendantDistance()
< naywhayare> the runtime of most (all?) single-tree and dual-tree algorithms is influenced quite heavily by how tight MinDistance() and MaxDistance() are to the true minimum and maximum distances
< naywhayare> where the true minimum distance is min_{p in descendants of node N} d(q, p)
< naywhayare> of course, the true minimum distance takes a very long time to calculate, which is why MinDistance() provides a rough bound that is calculable very quickly
< naywhayare> maybe I have clarified? or maybe I wandered off into some other explanation?
< andrewmw94> I think that's what I was trying to say. Unless I misunderstood you
< naywhayare> ok. I thought you were trying to say that it shouldn't matter whether the bounding ball or bounding hyperrectangle is used for pruning
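A hedged sketch of the single-tree Score() idea just described (the rules-class names in mlpack differ; this only shows why the shape-aware MinDistance() matters for pruning):

    #include <mlpack/core.hpp>
    #include <limits>

    // Score a reference node against a single query point: if even the
    // closest possible descendant of the node cannot beat the best distance
    // found so far, the whole subtree can be pruned.
    template<typename TreeType>
    double Score(const arma::vec& queryPoint,
                 TreeType& referenceNode,
                 const double bestDistanceSoFar)
    {
      // MinDistance() is provided by the tree type, so it can use the node's
      // actual bounding shape (a hyperrectangle for kd-trees), which is
      // tighter than a ball of radius FurthestDescendantDistance() around
      // the centroid.
      const double minDistance = referenceNode.MinDistance(queryPoint);

      return (minDistance > bestDistanceSoFar)
          ? std::numeric_limits<double>::max()  // Prune.
          : minDistance;                        // Recurse; lower scores first.
    }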
< andrewmw94> I just don't see the purpose of having both FurthestPointDistance() and FurthestDescendantDistance(), rather than just FurthestDescendantDistance()
naywhayare changed the topic of #mlpack to: "http://www.mlpack.org/ -- We don't respond instantly... but we will respond. Give it a few minutes. Or hours. -- Channel logs: http://www.mlpack.org/irc/"
< andrewmw94> as the code is written now, I think the only difference is that FurthestPointDistance() sometimes returns 0
< naywhayare> I see what you mean
< naywhayare> the specific reason those are implemented is because they are used for cover trees
< naywhayare> specifically for the B(N_q) function under Algorithm 3 (where did the page numbers go? they aren't there in this version)
< naywhayare> looks like the fifth page
< naywhayare> the kd-tree uses the first line of that bound, but the cover tree uses the second and third lines
< naywhayare> it turns out you can combine these bounds to get a bound that's tighter than either, but, this means that kd-trees need to implement phi(N_q) and lambda(N_q)
< naywhayare> which are the furthest point distance and furthest descendant distance, respectively
< naywhayare> I can't remember if that bound is actually being used in the mlpack code. I don't think it is, because it takes a long time to calculate, but I'm not certain
< andrewmw94> hmm. I like the idea of caching the furthest point once though
< naywhayare> the cover tree does that, and also does it for the furthest descendant
< naywhayare> but I'm not sure if the FurthestPointDistance() or FurthestDescendantDistance() functions are ever called in any of the mlpack dual-tree algorithms when TreeType = BinarySpaceTree<...>
< andrewmw94> ahh. Then it doesn't really matter. Do you mind if I change bound.MaxDistance(bound) to bound.Diameter() ? I think the second is easier to read
< naywhayare> in FurthestPointDistance()? yeah, I agree, go ahead and change it
< andrewmw94> yeah. I also changed the comment from 'if+' to 'unless'. I assume 'if+' is a typo and not some notation I'm unfamiliar with?
< naywhayare> probably, yeah
< naywhayare> where is "if+"? I don't see it in the code anywhere
< andrewmw94> in the comment
< naywhayare> ok, I don't see it, but I guess I will when you check it in
< naywhayare> I think your 'if+' was local :) http://www.mlpack.org/trac/changeset/16525
< andrewmw94> weird. I thought I "svn up"ed right before changing that
< naywhayare> I'd like to add your name to the list of contributors... is there a preferred email I should use? (or none at all?)
< andrewmw94> andrewmw94@gmail.com is probably the best
< naywhayare> ok
< naywhayare> added into src/mlpack/core.hpp, COPYRIGHT.txt, and mlpack.org/about.html
< andrewmw94> thanks
< naywhayare> I always type 'andrew wm' or 'awm' instead of 'amw'; I get mixed up because there's a guy named Andrew W. Moore who was a coauthor on a lot of the early dual-tree algorithm work
< naywhayare> I don't think he's still working on trees; I'm not sure what he's up to now
andrewmw94 has quit [Remote host closed the connection]