naywhayare changed the topic of #mlpack to: http://www.mlpack.org/ -- We don't respond instantly... but we will respond. Give it a few minutes. Or hours. -- Channel logs: http://www.mlpack.org/irc/
jbc__ has joined #mlpack
andrewmw94 has quit [Quit: Leaving.]
jbc__ has quit [Quit: jbc__]
govg has quit [Ping timeout: 240 seconds]
govg has joined #mlpack
govg has quit [Changing host]
govg has joined #mlpack
sumedhghaisas has joined #mlpack
sumedhghaisas has quit [Ping timeout: 272 seconds]
sumedhghaisas has joined #mlpack
sumedhghaisas has quit [Ping timeout: 240 seconds]
sumedhghaisas has joined #mlpack
jbc__ has joined #mlpack
jbc__ has quit [Client Quit]
jbc__ has joined #mlpack
govg has quit [Ping timeout: 240 seconds]
govg has joined #mlpack
govg has quit [Changing host]
govg has joined #mlpack
govg has quit [Ping timeout: 255 seconds]
govg has joined #mlpack
govg has quit [Changing host]
govg has joined #mlpack
govg has quit [Ping timeout: 272 seconds]
govg has joined #mlpack
andrewmw94 has joined #mlpack
< andrewmw94>
naywhayare: are you free?
< naywhayare>
andrewmw94: yeah, I'm here
< andrewmw94>
So I have a question about the X tree.
< andrewmw94>
The paper doesn't explain some of the stuff very well
< andrewmw94>
so I looked for implementations. The only one I found was in lisp
< naywhayare>
okay; I am looking at the paper
< naywhayare>
heh, it has been a while since I've written lisp...
< andrewmw94>
the problem I'm at now it how they want you to handle the split history when you don't use the minimal overlap split
< andrewmw94>
or how to handle it when reinserting nodes. Answering either of those should answer the other
< naywhayare>
okay, let me do some quick learning
< andrewmw94>
but I can't think of a way to do it using the tree approach they describe.
< andrewmw94>
I think I have a way to do it that doesn't scale well to higher dimensions, but it should be fine for D < 100
< andrewmw94>
Is that an ok assumption to make?
< naywhayare>
generally, people think that trees are bad for higher dimensions (i.e. D > 10 or 20), but in my experience you can build trees for nearest neighbor search and still get speedup with hundreds of dimensions
< naywhayare>
I'm trying to find the relevant section of the paper for the question(s) you're asking
< andrewmw94>
yeah. It only impacts performance while building the tree, so I'm not sure how important speed is.
< andrewmw94>
I think it's section 3.3
< naywhayare>
yeah, that is what I'm reading
< naywhayare>
in general tree construction is O(N log N) with some dependence on d
< andrewmw94>
but this particular issue isn't addressed. They skip a lot of stuff. EG. they don't specify which topological splits they use (they kind of sort of implied the ones from the R* tree, so that's what I'm using currently)
< naywhayare>
this is usually about the same as the scaling characteristics of the tree-based algorithm you might use; single-tree algorithms are usually O(N log N) for N queries, and dual-tree algorithms are often O(N) (but again those proofs depend on dataset-dependent parameters...)
< naywhayare>
so where I am meandering with my discussion is to say that usually, the construction time of a kd-tree is nowhere near as large as the runtime of the algorithms (...in most cases)
< naywhayare>
I would expect about the same for the R tree, though it might be difference
< andrewmw94>
yeah
< naywhayare>
*different
< andrewmw94>
yeah, I would expect that too. My idea is basically to tie the split history to each individual node using a bitset. Then you can & them all together to see if they all share a dimension. Loop through those starting one after the last dimension used so you tend to split along different dimensions
< andrewmw94>
so it's basically adding D iterations of a loop every time you split a node
< andrewmw94>
I think that would be fairly minimal
< andrewmw94>
then their remains the problem of choosing where to divide the nodes once you know they all have been split in that dimension.
< naywhayare>
hm, I bet that you could do it with integer divide/modulo operations instead of a loop
< naywhayare>
but if you are working with bits that represent dimensions, then it will probably be quite fast
< naywhayare>
(in comparison to the other operations, such as distance calculations, etc.)
< andrewmw94>
yeah.
< andrewmw94>
I'm really curious how they did this. They claim they could encode the split history into a few bits. I encoded it into integers, but even then, there's nothing like a "few bits"
< naywhayare>
it looks like they are talking about D <= 16
< naywhayare>
which is arguably "a few bits" for a large "few"
< andrewmw94>
a very large few
< andrewmw94>
I usually think of it as 2 or 3
< naywhayare>
let me read through this section again to better understand the splitting algorithm
< andrewmw94>
My calculations indicate that you need about 4*maxNumChildren integers
< andrewmw94>
at least for my method
< andrewmw94>
I don't think it's explained very well. It may be faster for me to try.
< andrewmw94>
Or do you want to see if I missed something?
< naywhayare>
I'm going to read through it anyway and come back with an idea (...hopefully), but if you want to implement your idea too, go ahead
< naywhayare>
there's no guarantee that my thought will be anything better than what you've already come up with, especially given that you've already spent a significant amount of time on this
< andrewmw94>
ok. I'll save what I already have. It works for the minimal split thing and took a while to figure out. I just can't see how you could preserve a meaningful tree structure when moving arbitrary nodes around as required by the topological split or reinsertion of nodes
< andrewmw94>
I'm going to grab lunch. I'll be back in about 30 min
< naywhayare>
okay, hopefully I will have some idea of something by then
sumedhghaisas has quit [Ping timeout: 272 seconds]
< andrewmw94>
naywhayare: back
< naywhayare>
andrewmw94: I realized that I need to get a bit more perspective on what is going on in this paper, so I'm reading the whole thing (or, at least, the non-experimental stuff)
< andrewmw94>
ok
< naywhayare>
okay, so here is my understanding of the X tree split procedure. I'm going to write it all out just to make sure we're on the same page. you've spent more time with it, so correct me if I am wrong
< naywhayare>
when you insert a point, you find which MBR to insert it into
< naywhayare>
should that MBR be filled to capacity, you try to split that MBR into two
< naywhayare>
the first thing you try is the topological split, which is from the R* tree
< naywhayare>
if you find that the topological split produces overlapping nodes, then you try the overlap minimal split
< naywhayare>
(and I think this is where your question lies)
< naywhayare>
the overlap minimal split is always possible according to their algorithm, but it may produce unbalanced nodes
< naywhayare>
if it does produce unbalanced nodes, then you don't split -- you create a supernode
< naywhayare>
when you perform the overlap minimal split, you need the split history. currently I think this is stored implicitly: each RectangleTree node holds the dimension it has split on (...I think; correct me if I am wrong)
< naywhayare>
that means you can construct the split history quickly by an O(log N) (or, O(depth of the MBR)) traversal upwards
< naywhayare>
something gives me the feeling that with the X tree in higher dimensions (...tens to hundreds? more?), the depth of the tree is often much less than the dimensionality of the data
< naywhayare>
I guess I think I understand the full procedure, but I'm not coming to the same question that you had, which is how you handle the split history when not using the minimal overlap split
< andrewmw94>
well, the problem is that the split history is easy to handle if you do a minimum overlap split.
< andrewmw94>
But if you do a topological split, how do you preserve the tree in each of the two new nodes
< andrewmw94>
or likewise if you reinsert a node, how do you preserve the tree?
< naywhayare>
when you say "preserve the tree", you mean the split history tree, I am assuming
< andrewmw94>
yes
< naywhayare>
if you do a topological split, it splits on a certain dimension, right?
< andrewmw94>
yes
< naywhayare>
okay, then just store the split dimension as a member of RectangleTree
< naywhayare>
no need to hold on to a separate tree structure
< andrewmw94>
so just store the last one instead of all of them?
< naywhayare>
right
< naywhayare>
then when you need the actual full list of dimensions that have been split on for a minimum overlap split, you traverse upwards:
< naywhayare>
list of split dimensions = { this.splitDimension, parent.splitDimension, parent.parent.splitDimension, ... }
< andrewmw94>
I don't think that works because in the X tree the two new nodes replace what would be their parent in the binary tree
< naywhayare>
well, but at the time in which you are splitting that parent node, you need the split dimension of that parent node and all its parents, so modify what I wrote a bit, I guess
< naywhayare>
list of split dimension = { parent.splitDimension, parent.parent.splitDimension, ... }
< naywhayare>
I guess I was writing it with the context that the "parent that you are splitting" was the node "this"
< andrewmw94>
One of us seems to be misunderstanding the other.
< andrewmw94>
A node in the RectangleTree can have say 5 child nodes
< andrewmw94>
which could have been split in different ways.
< naywhayare>
okay
< andrewmw94>
Let's say it started with two. Then node A splits, creating two new nodes. These are now children of node A's parent, and node A is forgoten
< andrewmw94>
so where do you store the dimension that node A split along in order to make those two new nodes?
< andrewmw94>
If you store it in the parent of those nodes, it loses information about how it split. If you store in in those nodes, how can the parent know the split history of A' and A'' (split from A) and B?
< naywhayare>
hang on, my entire conception falls apart when trees have multiple leaves
< naywhayare>
er, sorry... not "multiple leaves", I mean "more than two children"
< naywhayare>
in this case the split history tree structure is _not_ the same as the structure of the X tree itself
< andrewmw94>
yeah
< naywhayare>
Figure 9 makes this clear, but I wasn't looking at the bottom part...
< naywhayare>
do you know how the lisp implementation does it?
< andrewmw94>
well, I'm not sure if I would say "clear" I needed to read section 3.3 several times and look at another program before I sort of understood
< andrewmw94>
I don't know lisp (other than the little bit I read to try to understand this) so I just focused on the comments
< naywhayare>
nevermind, I'm not sure the clarification I am looking for makes sense
< naywhayare>
in Figure 9, the nodes contained by node S (i.e. A'', B'', C, D, E) appear to me to be nodes of their own with their own MBRs
< andrewmw94>
yeah, I understand those to be nodes of the X tree (or possibly MBR's of non-point data in the X tree)
< naywhayare>
"In typical academic fashion, this is extremely badly explained in the paper itself; I think I've reconstructed what's necessary, but it's not obvious that it's a huge win."
< andrewmw94>
I liked that part too. :)
< naywhayare>
so it seems to me that the X tree nodes *must* also contain split tree structures
< naywhayare>
they can't be thrown away after construction time, since a user may call Insert() with new points and the information is necessary then
< andrewmw94>
well, they have to store the history somehow. I'm not quite done, but I'm close enough to be fairly certain that my idea of storing a bitfield will work
< naywhayare>
okay; can you go ahead and finish that, then I can take a look at the implementation?
< naywhayare>
this still doesn't address your original question, though, which was how to handle the split tree for a topological split
< andrewmw94>
Yeah. I don't see how it is possible to preserve the relevant binary tree structure and have nodes "randomly" reinserted in different places.
< naywhayare>
take a look at Figure 8... specifically, if (overlap(r1, r2) > MAX_OVERLAP)
< andrewmw94>
yeah?
< naywhayare>
so, that condition being true implies that the topological split failed
< naywhayare>
but if it is false, it seems like all you would need to do is insert the information about the split into the parent node's split tree
< andrewmw94>
Well, not exactly "failed". It has more than 20% overlap.
< naywhayare>
ah, yeah, "failed" is a bit harsh. but either way, it means that the topological split won't be used and the overlap minimal split will be used
< naywhayare>
but in the case where the topological split *is* used, it does split on a certain dimension, so it seems like you could add it to the split tree in the same way
< andrewmw94>
yeah. Minimal overlap is easy. You take the left "half" and the right "half" of the search history
< andrewmw94>
but when you do topological splits, there is no gaurantee that eg. B'' and D of fig. 9 will be assigned to the same node
< andrewmw94>
so what do you do with the split history then?
< naywhayare>
this is a difficult problem. I need to get lunch (death timer is down to about 10 minutes, gotta eat soon), and I will try and think about this over lunch
< andrewmw94>
I can think of at least one possible solution, but it's rather complicated, loses lots of information, and I have little indication that it is what the authors intended
< naywhayare>
one quick thought is that maybe the authors intended a topological split without random reinsertion?
< andrewmw94>
Yeah, I'm kind of assuming that at this point. But the assignment in the topological split is still "random" in that I think I can arbitrarily break up the split history tree
< andrewmw94>
by specially constructing a sequence of points to insert
< naywhayare>
let me get lunch and wrap my head around the problem further. I don't think that I am thinking about this quite correctly
< naywhayare>
I'll be back in about an hour, I think
< andrewmw94>
ok. Hopefully I'll have the other idea implemented.
< naywhayare>
okay; if you get hung up, maybe it is a good idea to move to the dual-tree traverser and start testing that in parallel with finishing the X tree
< naywhayare>
andrewmw94: sorry, I was out for a little longer than I thought...
< andrewmw94>
no problem. I just finished writing it. I'm compiling now.
< naywhayare>
ah, okay. when it works, can you check it in? then I can look through and understand your idea
< naywhayare>
I have a basic idea of how the whole thing can be stored, but it doesn't completely account for random reinsertion
< naywhayare>
although, more of my thought has gone into "how can we store the split history for an X tree node but not for an R tree node?" and as usual my answer is with templates... :)
< andrewmw94>
more templates is always a good idea :)