< jenkins-mlpack>
* Ryan Curtin: Fix comment and clarify that it's pertaining to the runtime constructor, not the template parameters.
< jenkins-mlpack>
* andrewmw94: Add files and some preliminary code for R tree
sumedhghaisas has quit [Remote host closed the connection]
sumedhghaisas has joined #mlpack
sumedhghaisas has quit [Ping timeout: 265 seconds]
koderok has joined #mlpack
koderok has quit [Ping timeout: 252 seconds]
oldbeardo has joined #mlpack
andrewmw94 has joined #mlpack
< oldbeardo>
naywhayare: I have a question about Least Squared sampling
< naywhayare>
oldbeardo: ok, go ahead
< oldbeardo>
if you could open the QUIC-SVD paper and look at algorithm 3
< naywhayare>
ok, I'm looking at it
< oldbeardo>
okay, in step 1 the algorithm mentions sampling 's' rows from the matrix
< oldbeardo>
are these 's' rows to be different, or is repetition allowed?
< naywhayare>
hmm... hang on, let me do a little digging
< oldbeardo>
okay, I looked at the old code, they have calculated the actual error over there
< naywhayare>
so, they calculated error with all points, not just a sampled subset?
< oldbeardo>
I was wrong, it's not the actual error, but doesn't look like what's mentioned in the paper either
< naywhayare>
hang on, I am reading through it now
< oldbeardo>
look at QuicSVD::addBasisFrom() and QuicSVD::curRelErr()
< naywhayare>
yeah, that's what I'm looking through
< naywhayare>
it looks to me like the code that's there is similar to what MCSqError() would be if every row was sampled
< oldbeardo>
right, that doesn't help though
< naywhayare>
also, by the way, it's 'length-squared sampling' not 'least-squared sampling'
< naywhayare>
but you're right, nothing I've said has helped yet :)
< naywhayare>
I'm trying to figure out which of these we should use
< naywhayare>
ok, I think I understand
< oldbeardo>
sorry, 'least-squared' is a more commonly used term I suppose
< naywhayare>
in their paper, each row is a point; in mlpack implementation (and I think in the original implementation), each column is a point
< naywhayare>
so when they calculate MCSqError(), it takes O(N) time (where there are N points)
< naywhayare>
sorry, that last statement was unclear
< naywhayare>
what I meant to say was, when they call curRelError() (which is their implementation in the old code of something like MCSqError()), it takes O(N) time (where there are N points)
< naywhayare>
but in the paper, because they are sampling O(log N) points, the operation takes much less time
< oldbeardo>
I understand that
< naywhayare>
so I would say that implementing MCSqError() as implemented in the paper would be the way to go, even though Michael Holmes didn't do that in his original code...
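A minimal sketch (not the paper's or mlpack's actual code) of what a Monte-Carlo squared-error estimate along the lines of MCSqError() might look like with Armadillo, assuming points are columns as in mlpack; the function name and the exact importance-sampled estimator are assumptions:

    #include <armadillo>
    #include <random>

    // Estimate || A - V V^T A ||_F^2 by sampling s columns with
    // length-squared probabilities instead of touching all N columns.
    double MonteCarloSqError(const arma::mat& A, const arma::mat& V,
                             const size_t s)
    {
      // p_j = ||a_j||^2 / ||A||_F^2.
      arma::vec probs(A.n_cols);
      for (size_t j = 0; j < A.n_cols; ++j)
        probs[j] = arma::dot(A.col(j), A.col(j));
      probs /= arma::accu(probs);

      std::mt19937 rng(std::random_device{}());
      std::discrete_distribution<size_t> dist(probs.begin(), probs.end());

      // Importance-sampled estimate: each sampled term is divided by its
      // sampling probability so the expectation matches the full sum.
      double estimate = 0.0;
      for (size_t i = 0; i < s; ++i)
      {
        const size_t j = dist(rng);
        const arma::vec residual = A.col(j) - V * (V.t() * A.col(j));
        estimate += arma::dot(residual, residual) / probs[j];
      }
      return estimate / s;
    }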
< oldbeardo>
my question is about the sampled points, rather than the computation of the error
< naywhayare>
right, I was getting back to that eventually :)
< naywhayare>
I don't know if length-squared sampling is with or without replacement
< naywhayare>
reference 3 ("Fast Monte-Carlo Algorithms for Finding Low-Rank Approximations") is where it was introduced, so I'm taking a look there
Anand has joined #mlpack
< naywhayare>
ok, it looks like sampling is performed with replacement; that's a relief, because sampling without replacement is slower and harder to implement
< naywhayare>
oldbeardo: when you write your code, if you can add a comment referencing that paper ("Efficient Singular Value Decomposition via Improved Document Sampling") to mention why replacement is being used, it'd probably be useful
< naywhayare>
otherwise someone will come along a few years from now and ask the same question and nobody will remember why :)
< oldbeardo>
sure, will do
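A minimal sketch (not mlpack's or Michael Holmes' actual code) of length-squared column sampling with replacement, as discussed above; the function name and the rescaling step are assumptions based on the Frieze/Kannan/Vempala reference:

    #include <armadillo>
    #include <cmath>
    #include <random>

    // Draw s columns of A with replacement, where column j is picked with
    // probability ||a_j||^2 / ||A||_F^2 (length-squared sampling).
    arma::mat SampleColumnsLS(const arma::mat& A, const size_t s)
    {
      arma::vec probs(A.n_cols);
      for (size_t j = 0; j < A.n_cols; ++j)
        probs[j] = arma::dot(A.col(j), A.col(j));
      probs /= arma::accu(probs);

      std::mt19937 rng(std::random_device{}());
      std::discrete_distribution<size_t> dist(probs.begin(), probs.end());

      arma::mat S(A.n_rows, s);
      for (size_t i = 0; i < s; ++i)
      {
        // With replacement: the same column may be drawn more than once.
        const size_t j = dist(rng);
        // The reference rescales each sampled column by 1 / sqrt(s * p_j) so
        // that S * S.t() approximates A * A.t() in expectation.
        S.col(i) = A.col(j) / std::sqrt(s * probs[j]);
      }
      return S;
    }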
< oldbeardo>
by the way, I saw that video, got quite a lot of upvotes on Reddit :)
< naywhayare>
I hadn't seen it until recently
< naywhayare>
they do the recording of Littman and Isbell's machine learning class here at Georgia Tech
< naywhayare>
and because I go to Isbell's lab meetings, I sometimes hear about how those recording sessions go... apparently they can never stop laughing and have a really hard time being serious
< oldbeardo>
I know, it's for the online Masters in CS
< naywhayare>
but I didn't know they were recording music videos too! :)
< oldbeardo>
heh, from what I have heard that class is quite entertaining
< andrewmw94>
For the R type trees, I'd like to be able to add and remove points, since R*trees and X trees don't make too much sense without that. However, I'm a bit confused about how the data in the tree is stored. It seems to be in the data matrix, and then data is moved around as appropriate for the tree
< naywhayare>
I've never seen any of the lectures; I know that the TAs have to do a ton of work because there are so many students
< andrewmw94>
but if that is the case, it seems like deletion would have to take O(n) time
< andrewmw94>
since the points in the data matrix seem to be contiguous
< oldbeardo>
naywhayare: how many are we talking about?
< naywhayare>
andrewmw94: for the binary space tree, the data is reordered so that all the points held in a leaf node are contiguous in memory
< naywhayare>
oldbeardo: I think several hundred? I'm not certain. it seemed like way more work than a regular TA job
< andrewmw94>
but then I can't delete arbitrary points in logarithmic time
< naywhayare>
andrewmw94: right, I see what you mean, but you wouldn't be able to do that anyway, in general
< naywhayare>
if you're holding a data matrix and you drop a point out of it, you have to copy the whole matrix anyway
< andrewmw94>
naywhayare: how common is it to add points dynamically during a knn algorithm?
< andrewmw94>
naywhayare: I know it's used in Robocode, and it must be fairly common since certain tree types developed around it
< naywhayare>
andrewmw94: usually this is not done. because rebalancing kd-trees (and cover trees) can be very time-consuming, it's often easier to throw away the whole tree and rebuild it
< naywhayare>
but it also depends on how the tree is rebalanced, or if it is, after a new point is added
< andrewmw94>
naywhayare: but R trees are always balanced, it's a question of how bad the rectangles are
< andrewmw94>
they're like B trees. But if I have to keep the points in the matrix, then I'll need to keep them contiguous
< naywhayare>
balanced in what sense?
< andrewmw94>
all leaf nodes are on the same level
< andrewmw94>
they may not be equally filled
< naywhayare>
right
< naywhayare>
so insertion of a single point can incur O(log N) rebalancing, right?
< andrewmw94>
yes
< andrewmw94>
but with this, I'd need to shift all of the points to the right, which makes it much slower
< naywhayare>
I agree
< naywhayare>
hang on, let me do a little reading and thinking
< andrewmw94>
yeah, I think I'm going to go pace around for a bit
< naywhayare>
:)
< andrewmw94>
the first thing that comes to mind is to store the index of each point in the matrix in each leaf node of the tree
< naywhayare>
so, one thought is this: my understanding of the R* construction procedure is that points are added and removed during the construction process
< naywhayare>
but in the end you have a finalized tree structure of some sort
< naywhayare>
so you could delay making the points contiguous in memory until the final tree is built
< oldbeardo>
naywhayare: one more question for your consideration: you said that Mudit's implementation was row major, so instead of sampling rows I'm sampling columns; that shouldn't make a difference, right?
< naywhayare>
or, you could do your idea -- just store the index of the points that are held in each leaf (this is basically what the cover tree does)
< naywhayare>
oldbeardo: yes, each point is a column, so you should be sampling columns. you'll have to transpose the notation of just about everything in the paper...
Anand_ has joined #mlpack
< oldbeardo>
naywhayare: okay, thanks
< naywhayare>
andrewmw94: the problem with not having points contiguous in memory is that when you come to inner loops that loop over each point in a leaf, the memory will be accessed less efficiently
< naywhayare>
but, I don't know how much of a difference that really makes
oldbeardo has quit [Quit: Page closed]
< andrewmw94>
yeah
< andrewmw94>
actually, I think that's a fairly big deal
< andrewmw94>
since one of the advantages of R trees is the memory access
< naywhayare>
I'd be interested to see numbers, but, also, after thinking about it a bit, it would be easy to take a non-contiguous tree of any type and rearrange the points
< andrewmw94>
but on the other hand, they would need to have a lot of data before paging actually became an issue
< naywhayare>
I think you'd see speedup just as a result of linear memory access vs. random memory access
< naywhayare>
I'm reminded of a paper... let me find it
< naywhayare>
Ulrich Drepper has written a couple of articles like this... very useful
< naywhayare>
(although they are super long)
< naywhayare>
let me find the section I am looking for...
< naywhayare>
ok, I think Figure 3.15 on page 23 is what I was looking for
< naywhayare>
or, the whole of section 3.3.2, if you like reading extremely verbose documents :)
< andrewmw94>
nothing could be more fun
< naywhayare>
for our particular situation, I don't know how much speedup we'd see by making points contiguous, but this is something we can do entirely independently of tree construction
< naywhayare>
haha
Anand_ has quit [Ping timeout: 240 seconds]
< andrewmw94>
naywhayare: yeah, I think the memory speedup would be an issue. The Wikipedia page says the R tree is designed for storage on disk. I don't see why its design is more disk-specific than a kd-tree's
< andrewmw94>
but it seems like the memory would be an issue
< andrewmw94>
reordering the points should be fine for bulk loading, but I'm not sure how it would work for the dynamic insertions
< andrewmw94>
are you certain that dynamically building the trees is rare? It sounds as though that is the main reason for R*trees (http://en.wikipedia.org/wiki/R*-tree)
udit_s has quit [Quit: Ex-Chat]
< andrewmw94>
what is the difference between BinarySpaceTree::FurthestPointDistance() and BinarySpaceTree::FurthestDescendantDistance()
< naywhayare>
tneilc cri ym htiw gnorw si gnihtemos ,hu
< naywhayare>
sdrawkcab si gnihtyreve
< naywhayare>
everything is backwards
< naywhayare>
I think I need to logout and login
< naywhayare>
ok, so, that was the weirdest thing I've had happen to me recently. my phone SSH client was connected but malfunctioning, and for some combination of reasons it was causing everything I typed into irssi to be entered backwards
< naywhayare>
anyway, now I can answer your questions...
< naywhayare>
FurthestPointDistance() returns the distance between the node's centroid and the furthest point held in that node (not held in children or descendants)
< naywhayare>
whereas FurthestDescendantDistance() is the distance between the node's center and the furthest point held in that node or any descendants
< naywhayare>
the /names
< naywhayare>
excuse me... I am failing at typing
< naywhayare>
the dynamic building of trees is rare in the context of kd-trees, where it is faster to just rebuild the entire tree than it is to do dynamic insertions or deletions
< andrewmw94>
but aren't they both based on the boundaries, not the actual points?
< andrewmw94>
sorry, too slow with my typing
< andrewmw94>
I meant for the furthestPoint thing
< naywhayare>
for the case of kd-trees, yeah, it is easier to calculate a possibly loose bound based on the boundaries
< andrewmw94>
that's what it looks like the code is doing
< andrewmw94>
but unless I'm missing something, they should both be the same then
< naywhayare>
I think FurthestPointDistance() is implemented incorrectly... the if(IsLeaf()) should be if(!IsLeaf()), I think
< andrewmw94>
ahh
< andrewmw94>
I thought the comment was wrong
< naywhayare>
yeah... and the test is wrong in the exact same way
< naywhayare>
for a non-leaf node, the furthest point distance should be zero because it holds no points
< andrewmw94>
so is the comment or the code correct? because I changed the comment and kept the code
< andrewmw94>
ahh
< andrewmw94>
ok
< naywhayare>
I'm testing and committing a fix now... sorry about that
< andrewmw94>
but doesn't the MaxDistance function give the diagonal when called with that argument
< andrewmw94>
that doesn't make sense either
< andrewmw94>
it should at least be 0.5*diam
< andrewmw94>
rather than diam
< andrewmw94>
if I am correct on what it does
< naywhayare>
this is in FurthestPointDistance()?
< andrewmw94>
yes
< naywhayare>
ok, I think you are right again... it should be multiplied by 0.5
< andrewmw94>
reading the code in hrectbound_impl.hpp, it seems like the return in FurthestPointDistance just gives the diam
< naywhayare>
I must have been very tired when I wrote that
< andrewmw94>
happens to me too :)
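A standalone sketch of the corrected logic discussed above, with hypothetical names; it assumes a node type whose Bound().MaxDistance(Bound()) returns the bound's diameter, and is not necessarily the exact fix being committed:

    // Non-leaf nodes hold no points themselves, so their furthest point
    // distance is zero; for a leaf, half the bound's diameter is a (possibly
    // loose) upper bound on the distance from the centroid to the furthest
    // point held in the node.
    template<typename TreeType>
    double FurthestPointDistance(const TreeType& node)
    {
      if (!node.IsLeaf())
        return 0.0;

      // bound.MaxDistance(bound) gives the diameter, hence the 0.5 factor.
      return 0.5 * node.Bound().MaxDistance(node.Bound());
    }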
Anand has joined #mlpack
< naywhayare>
also, thinking about on-disk matrices and trees built on disk-resident datasets, I have a guy working to extend arma::mat to work with on-disk data by using mmap()
< naywhayare>
so when he gets that working, it is probably reasonable to create trees using mmap()-ed arma::mat objects and see how they perform
< andrewmw94>
yeah
< naywhayare>
at least, I think it will work without any modification being necessary...
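A rough sketch of that idea, assuming a raw column-major file of doubles with known dimensions; error handling is omitted, and none of this is the actual extension being worked on:

    #include <armadillo>
    #include <fcntl.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main()
    {
      // Illustrative dimensions; the file is assumed to be a raw,
      // column-major array of doubles.
      const size_t rows = 10, cols = 1000000;
      const size_t bytes = rows * cols * sizeof(double);

      const int fd = open("dataset.bin", O_RDONLY);
      double* mem = (double*) mmap(NULL, bytes, PROT_READ, MAP_PRIVATE, fd, 0);

      // Armadillo's advanced constructor borrows existing memory:
      // copy_aux_mem = false uses the mapped region directly,
      // strict = true prevents the matrix from reallocating away from it.
      arma::mat data(mem, rows, cols, false, true);

      // ... build a tree on 'data' and run algorithms as usual (read-only,
      // since the region was mapped with PROT_READ) ...

      munmap(mem, bytes);
      close(fd);
      return 0;
    }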
< andrewmw94>
I'm not sure whether I should just write it for bulk loading and hope I can get dynamic insertions working later, or whether I should keep trying to think of a way to get dynamic insertions without copying the whole matrix, while keeping the data contiguous and keeping insertions logarithmic
< andrewmw94>
it seems like even trying to periodically reorder the data would break the prefetching of data from RAM
< andrewmw94>
so keeping it contiguous is really important
< andrewmw94>
copying it is definitely the easiest, and if they are actually inserting points dynamically it isn't that bad. Perhaps there should be two different versions?
< andrewmw94>
a bulk loading one and a dynamic one?
< naywhayare>
so, one of the problems here is that all the trees in mlpack are built on an arma::mat object
< naywhayare>
but resizing the arma::mat object, no matter how it's done, requires a complete copy of the matrix because realloc() isn't being used (new/delete are used)
< naywhayare>
and even if realloc() was being used that doesn't guarantee that a copy won't be necessary
< andrewmw94>
or maybe we could just say that since the X tree is a superset of the R* tree, which only makes sense in a dynamic situation, and there's no reason to use an R tree instead in those situations, the R* tree will be dynamic and quite different from the binary space tree?
< andrewmw94>
and the same for the X tree?
< naywhayare>
yes, you could write a tree that doesn't hold an arma::mat but instead actually holds the arma::vec in each node
< andrewmw94>
hmm, actually the X tree could be nice even when bulk loading
< andrewmw94>
but that should be really rare
< naywhayare>
let me take some time to think about how to do this (like, a day or a few days)
< naywhayare>
I need to think about how the current abstractions might need to be modified to support dynamic trees
< naywhayare>
currently, trees generally don't hold a reference to the dataset, so when you run AllkNN or something, the tree is separate from the dataset, and you need to pass both
< andrewmw94>
yeah. I need to learn more about how bulk loading of R trees is actually done. I've never done bulk loading in robocode
< andrewmw94>
I think that the main reason someone would want an R*tree or supersets would be if they were dynamically adding observations
< naywhayare>
this allows the tree to be smaller in memory, because nodes don't need to hold a reference to the dataset -- they only hold the indices of the points held in each node
< andrewmw94>
or if they were ignorant
< andrewmw94>
which can't be ruled out
< andrewmw94>
but if bulk loading, then the way you do it makes the most sense
< naywhayare>
I think it would be nice to provide dynamic tree support, but I'm not sure how to do that without revising the current tree abstractions in a way that will make them slower
< andrewmw94>
yeah. I think it makes sense to have the Rectangle type trees separate from the others
< andrewmw94>
since they can hold more than just points and are generally different
< naywhayare>
perhaps, but it's worth remembering that mlpack doesn't really support anything that's not points, and also that the rectangle trees should be able to work with the existing dual-tree algorithms
< andrewmw94>
but machine learning applications will usually just be points
< naywhayare>
depending on the application, yeah. for range search / proximity-type problems, it'll almost always be "just points"
< andrewmw94>
would you expect it to be used for pathfinding type stuff, like PRM RRT's
< andrewmw94>
because that's a common real world application that requires dynamic point insertion, but it's not really machine learning
< andrewmw94>
in my opinion
< naywhayare>
not familiar with PRM RRTs
< andrewmw94>
it's just a path planning algorithm that's really common
< naywhayare>
I sort of see mlpack's dual-tree algorithms as black-box solvers to problems like nearest neighbor, range search, etc.
< naywhayare>
so a user passes in a data matrix and says "I need the 3 nearest neighbors of every point", then mlpack builds the tree, finds the neighbors, destroys the tree, and returns the neighbors
< naywhayare>
in which case dynamic trees aren't necessary
< andrewmw94>
yeah, that's actually really relevant to PRM
< andrewmw94>
but not to RRT
< naywhayare>
but, mlpack could also be used to create a tree, find nearest neighbors, modify the tree slightly, find nearest neighbors again, and so on and so forth
< naywhayare>
which might be useful in the context of video searching, or something like that, where frames tend to change slowly
< naywhayare>
or databases that grow over time... or something like that
< andrewmw94>
exactly
< naywhayare>
in current code, as you've seen, this isn't really all that feasible because there are no single-point insertions or deletions
< andrewmw94>
yeah, and the matrix doesn't seem like it could possibly support that
< naywhayare>
yes, but I'm trying to think of some workarounds
< naywhayare>
the BinarySpaceTree class doesn't hold a reference to the dataset it's built on
< naywhayare>
but if it did, then each node could reference a different dataset -- and these different datasets could consist of only what's in the node
< naywhayare>
which gets you leaf nodes that have points that are contiguous in memory, and also insertions/deletions are nowhere near expensive, because your cost for modifying a leaf is at most O(d * (maximum number of points in a leaf)) where d is the dimensionality
< naywhayare>
but the drawback here is now that sizeof(BinarySpaceTree<...>) is 8 bytes larger (or 4 on entertainingly ancient systems)
< naywhayare>
adding 8 bytes to each node doesn't scare me; that shouldn't be a problem because usually the number of nodes is nowhere near the number of points (cover trees are an exception, but I'll ignore that for now)
Anand has quit [Ping timeout: 240 seconds]
< naywhayare>
but this means that all of the dual-tree algorithms have to be refactored; lines like 'querySet.col(queryNode.Point(i))' turn into 'queryNode.Dataset().col(queryNode.Point(i))'
< naywhayare>
or something like that
< naywhayare>
I *think* this will incur slowdown, but I'm not sure how much or if it really matters
< naywhayare>
if the slowdown is negligible, then great, we have a solution, but I'm not sure
< andrewmw94>
but do you have to copy each dataset referenced by each node?
< andrewmw94>
or are these still in the original matrix
< naywhayare>
let me use pastebin to sketch out my ideas, hang on...
< naywhayare>
I guess I should have clarified a bit... for the BinarySpaceTree, every single node in the tree has the exact same value of arma::mat& dataset -- each node references the same dataset
< andrewmw94>
ahh
< andrewmw94>
that makes sense, we still can't add and delete efficiently in BSP trees, but it makes the R tree work and keeps them similar
< andrewmw94>
the only thing is the copying of the data for the R tree
< andrewmw94>
but if we want to keep it contiguous and keep insertions cheap, that seems necessary
< naywhayare>
tree construction time is usually O(N log N) so I'm not too concerned with an O(N) cost
< andrewmw94>
and it's a one time thing
< naywhayare>
or, rather, I'm not too concerned with a one-time O(N) cost -- yeah
< andrewmw94>
and I assume that these would only be in the leaf nodes
< naywhayare>
what is nice is that the points don't need to be in any specific ordering in a leaf, so simply by adding the points to a node's local data matrix, we get points that are contiguous in memory
< naywhayare>
yeah, unless you're working with a tree type that holds points in non-leaf nodes
< naywhayare>
but I think all the R-tree variants hold points only in the leaves
< andrewmw94>
ordering them in an R tree is weird anyways, since there isn't a split dimension
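A very rough sketch of the per-node dataset idea, with hypothetical names: each leaf owns a small local matrix, so its points stay contiguous in memory and an insertion only touches that leaf's data. This is an illustration of the design being discussed, not mlpack code:

    #include <armadillo>
    #include <vector>

    class RectangleTreeLeaf
    {
     public:
      // Copy one column of the original dataset into this leaf's local
      // matrix; cost is at most O(d * (points in the leaf)).
      void InsertPoint(const arma::mat& dataset, const size_t index)
      {
        localData.insert_cols(localData.n_cols, dataset.col(index));
        originalIndices.push_back(index);
      }

      // The i-th point held by this leaf is Dataset().col(i), so inner loops
      // over a leaf's points access contiguous memory.
      const arma::mat& Dataset() const { return localData; }
      size_t NumPoints() const { return localData.n_cols; }

      // Map a local column back to its index in the original dataset, e.g.
      // for reporting neighbor indices.
      size_t OriginalIndex(const size_t i) const { return originalIndices[i]; }

     private:
      arma::mat localData;                  // This leaf's points, contiguous.
      std::vector<size_t> originalIndices;  // Indices into the original data.
    };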
Anand_ has joined #mlpack
Anand_ has quit [Ping timeout: 240 seconds]
< andrewmw94>
naywhayare: also, I don't recall if I already mentioned this, but it seems like the FurthestPointDistance() function should check against the actual points, not the bounding box when it is in a leaf node. I'm not sure if the bounding box is actually set in a leaf node
< andrewmw94>
naywhayare: I think it's using the constructor of line 137, which doesn't seem to set the bound
< andrewmw94>
also, is my SVN not working or have you not committed the fix you mentioned yet?
< naywhayare>
oh... I started the test then started doing other things and forgot about it
< naywhayare>
I seem to do that often
< naywhayare>
the check for FurthestPointDistance() is against the bounding box and not actual points because it's much quicker that way
< naywhayare>
against the bounding box, the check time is O(d), but against all the points, the check time is O(d * (number of points)), which can be much larger especially given how often those functions are called in dual-tree algorithms
< naywhayare>
it is possible that by giving an upper bound for FurthestPointDistance(), some algorithms might break, but I haven't yet encountered that problem
< naywhayare>
so I've figured so far that if the problem does arise later it can be dealt with later :)
< andrewmw94>
but are you sure that it works
< andrewmw94>
the constructor doesn't seem to set the bound
< andrewmw94>
and the way it seems to usually be set is the splitNode() function
< andrewmw94>
I think the constructor is on line 137
< naywhayare>
fixes from earlier today committed in r16523 and r16524
< naywhayare>
the bound is set when the node is constructed; otherwise, it should default to an empty bound with no volume
< naywhayare>
sorry, what I said was unclear. you are right that the bound is set when the node is constructed in SplitNode()
< naywhayare>
so I meant "constructed" in the sense of tree-building, not C++ object construction time :)
< andrewmw94>
but doesn't that only set the bound of the parent nodes?
< andrewmw94>
I think the leaf nodes just have their constructors called (lines 641-644)
< andrewmw94>
and these seem to be the constructors on 137, which I think set the bound to an empty bound of the correct dimensions
< naywhayare>
when a leaf is constructed, the constructor on line 137 is called, like you pointed out
< naywhayare>
then SplitNode() is called
< naywhayare>
and the bound is expanded; however, the function returns at line 637
< naywhayare>
because the leaf can't be further split
< andrewmw94>
ahh
< naywhayare>
line 615 is the one that extends the bound to the right size
< naywhayare>
now... I suppose it's possible that during that calculation, the true furthest descendant distance could be cached
< naywhayare>
without much (or any) extra overhead
< andrewmw94>
yeah. The way it is now also seems like it should be equivalent to using furthestDescendantDistance
< andrewmw94>
since you call bound.MaxDistance(bound)
< andrewmw94>
which is, I think, the same as bound.Diam()
< naywhayare>
yeah
< naywhayare>
the thing about using furthestDescendantDistance to prune is that it implies a ball of radius furthestDescendantDistance
< naywhayare>
instead of a potentially much less voluminous hyperrectangle
< andrewmw94>
but don't we prune with a function that takes the query point anyways?
< andrewmw94>
which reminds me of another thing that was bothering me before bed. Is the pruning safe for all LMetrics?
< andrewmw94>
it doesn't seem like it could be if p is between 0 and 1 (exclusive)
< andrewmw94>
but then again, that probably isn't a commonly used metric
< naywhayare>
so when you're talking about pruning, it implies that you're solving a particular problem or have a particular application
< naywhayare>
is this for nearest neighbor search?
< andrewmw94>
yes
< naywhayare>
ok
< andrewmw94>
because I don't think the triangle inequality holds for L metrics when 0 < p < 1
< naywhayare>
the dual-tree algorithm for nearest neighbor search in mlpack/methods/neighbor_search/ should be safe for any valid metric, not just l-metrics
< naywhayare>
yeah; in that case, the L-metric isn't actually a metric
< naywhayare>
but I don't think you can build an L-0.5 tree anyway because the template parameter to HRectBound has to be an int
< naywhayare>
(due to C++ language restrictions on the types of template parameters)
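A quick standalone check of the point about 0 < p < 1: the L_p "distance" violates the triangle inequality, so it isn't a metric. This is only an illustration, not mlpack code:

    #include <cmath>
    #include <cstdio>

    // d_p(x, y) = (sum_i |x_i - y_i|^p)^(1/p) for 2-dimensional points.
    double LpDistance(const double x[2], const double y[2], const double p)
    {
      const double s = std::pow(std::abs(x[0] - y[0]), p) +
                       std::pow(std::abs(x[1] - y[1]), p);
      return std::pow(s, 1.0 / p);
    }

    int main()
    {
      const double a[2] = { 0.0, 0.0 };
      const double b[2] = { 1.0, 0.0 };
      const double c[2] = { 1.0, 1.0 };

      // With p = 0.5: d(a, c) = (1 + 1)^2 = 4, but d(a, b) + d(b, c) = 2,
      // so d(a, c) > d(a, b) + d(b, c) and the triangle inequality fails.
      std::printf("d(a,c) = %f, d(a,b) + d(b,c) = %f\n",
                  LpDistance(a, c, 0.5),
                  LpDistance(a, b, 0.5) + LpDistance(b, c, 0.5));
      return 0;
    }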
< naywhayare>
I didn't address one of your questions:
< naywhayare>
18:43 < andrewmw94> but don't we prune with a function that takes the query point anyways?
< naywhayare>
in a single-tree case, yes, the function Score(point, node) is called; but still in that case it's often better for kd-trees to use a hyperrectangle bound instead of a ball with radius furthestDescendantDistance
< naywhayare>
and that's also true in the dual-tree case where the function Score(node, node) is called
< andrewmw94>
but since we are using a query case, it seems like it doesn't matter that a hyperrectangle is less voluminous than a hypersphere, since we are using a different function to get the worst-case distances between them
< andrewmw94>
and a point or another volume as needed
< naywhayare>
I'm not sure I understand what you mean
< andrewmw94>
so, we are using a volume or point when we want to prune some part of the tree. This, I assume, uses a different function which can handle the difference between a hyperrectangle and a hypersphere
< andrewmw94>
right?
< naywhayare>
so, considering the single-tree case, the basic flow of any single-tree algorithm is this: traverse the tree, visit a node, score/prune the node, do base cases between the query point and any points of the node
< naywhayare>
in mlpack's implementation of these Score() functions, it's better to use MaxDistance(point, node) instead of distance(point, node.Centroid()) + node.FurthestDescendantDistance()
< naywhayare>
or MinDistance(point, node) instead of distance(point, node.Centroid()) - node.FurthestDescendantDistance()
< naywhayare>
and the reason for this is that MinDistance() and MaxDistance() are functions provided by the tree type, so they can take into account special shapes of the bounding shape of the node
< naywhayare>
whereas the other code I wrote assumes that each node is a ball centered at node.Centroid() with radius node.FurthestDescendantDistance()
< naywhayare>
the runtime of most (all?) single-tree and dual-tree algorithms is influenced quite heavily by how tight MinDistance() and MaxDistance() are to the true minimum distance and maximum distance
< naywhayare>
where the true minimum distance is min_{p in descendants of node N} d(q, p)
< naywhayare>
of course, the true minimum distance takes a very long time to calculate, which is why MinDistance() provides a rough bound that is calculable very quickly
< naywhayare>
maybe I have clarified? or maybe I wandered off into some other explanation?
< andrewmw94>
I think that's what I was trying to say. Unless I misunderstood you
< naywhayare>
ok. I thought you were trying to say that it shouldn't matter whether the bounding ball or bounding hyperrectangle is used for pruning
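A simplified sketch of the Score()-style pruning described above for single-tree nearest neighbor search; the names are illustrative and this is not the actual mlpack NeighborSearchRules code:

    #include <limits>

    template<typename TreeType, typename VecType>
    double Score(const VecType& queryPoint,
                 const TreeType& node,
                 const double bestDistanceSoFar)
    {
      // MinDistance() is provided by the tree/bound type, so it can use the
      // actual bounding shape (e.g. a hyperrectangle) instead of the looser
      // ball of radius FurthestDescendantDistance().
      const double minDistance = node.MinDistance(queryPoint);

      // If even the closest possible descendant cannot beat the current best
      // neighbor, the whole node can be pruned.
      if (minDistance > bestDistanceSoFar)
        return std::numeric_limits<double>::max(); // "Prune this node."

      return minDistance; // Otherwise recurse; smaller scores are visited first.
    }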
< andrewmw94>
I just don't see the purpose of having both furthestPointDistance and furthestDescendantDistance rather than just furthestDescendantDistance
naywhayare changed the topic of #mlpack to: "http://www.mlpack.org/ -- We don't respond instantly... but we will respond. Give it a few minutes. Or hours. -- Channel logs: http://www.mlpack.org/irc/"
< andrewmw94>
as the code is written now, I think the only difference is that FurthestPointDistance sometimes returns 0
< naywhayare>
I see what you mean
< naywhayare>
the specific reason those are implemented is because they are used for cover trees
< naywhayare>
specifically for the B(N_q) function under Algorithm 3 (where did the page numbers go? they aren't there in this version)
< naywhayare>
looks like the fifth page
< naywhayare>
the kd-tree uses the first line of that bound, but the cover tree uses the second and third lines
< naywhayare>
it turns out you can combine these bounds to get a bound that's tighter than either, but, this means that kd-trees need to implement phi(N_q) and lambda(N_q)
< naywhayare>
which are the furthest point distance and furthest descendant distance, respectively
< naywhayare>
I can't remember if that bound is actually being used in the mlpack code. I don't think it is, because it takes a long time to calculate, but I'm not certain
< andrewmw94>
hmm. I like the idea of caching the furthest point once though
< naywhayare>
the cover tree does that, and also does it for the furthest descendant
< naywhayare>
but I'm not sure if the FurthestPointDistance() or FurthestDescendantDistance() functions are ever called in any of the mlpack dual-tree algorithms when TreeType = BinarySpaceTree<...>
< andrewmw94>
ahh. Then it doesn't really matter. Do you mind if I change bound.MaxDistance(bound) to bound.Diameter()? I think the second is easier to read
< naywhayare>
in FurthestPointDistance()? yeah, I agree, go ahead and change it
< andrewmw94>
yeah. I also changed the comment from 'if+' to 'unless'. I assume 'if+' is a typo and not some notation I'm unfamiliar with?
< naywhayare>
probably, yeah
< naywhayare>
where is "if+"? I don't see it in the code anywhere
< andrewmw94>
in the comment
< naywhayare>
ok, I don't see it, but I guess I will when you check it in
< andrewmw94>
weird. I thought I "svn up"ed right before changing that
< naywhayare>
I'd like to add your name to the list of contributors... is there a preferred email I should use? (or none at all?)
< andrewmw94>
andrewmw94@gmail.com is probably the best
< naywhayare>
ok
< naywhayare>
added into src/mlpack/core.hpp, COPYRIGHT.txt, and mlpack.org/about.html
< andrewmw94>
thanks
< naywhayare>
I always type 'andrew wm' or 'awm' instead of 'amw'; I get mixed up because there's a guy named Andrew W. Moore who was a coauthor on a lot of the early dual-tree algorithm work