#mlpack on 2014-08-13 — irc logs at libera.irclog.whitequark.org

2014-05-21 16:24 naywhayare changed the topic of #mlpack to: http://www.mlpack.org/ -- We don't respond instantly... but we will respond. Give it a few minutes. Or hours. -- Channel logs: http://www.mlpack.org/irc/

01:56 jbc__ has joined #mlpack

01:57 andrewmw94 has quit [Quit: Leaving.]

02:29 jbc__ has quit [Quit: jbc__]

05:36 govg has quit [Ping timeout: 240 seconds]

06:02 govg has joined #mlpack

06:02 govg has quit [Changing host]

06:02 govg has joined #mlpack

07:10 sumedhghaisas has joined #mlpack

08:40 sumedhghaisas has quit [Ping timeout: 272 seconds]

08:41 sumedhghaisas has joined #mlpack

09:04 sumedhghaisas has quit [Ping timeout: 240 seconds]

09:17 sumedhghaisas has joined #mlpack

12:13 jbc__ has joined #mlpack

12:13 jbc__ has quit [Client Quit]

12:15 jbc__ has joined #mlpack

12:35 govg has quit [Ping timeout: 240 seconds]

13:07 govg has joined #mlpack

13:07 govg has quit [Changing host]

13:07 govg has joined #mlpack

13:29 govg has quit [Ping timeout: 255 seconds]

13:30 govg has joined #mlpack

13:30 govg has quit [Changing host]

13:30 govg has joined #mlpack

13:39 govg has quit [Ping timeout: 272 seconds]

13:41 govg has joined #mlpack

14:46 andrewmw94 has joined #mlpack

15:14 < andrewmw94> naywhayare: are you free?

15:27 < naywhayare> andrewmw94: yeah, I'm here

15:27 < andrewmw94> So I have a question about the X tree.

15:27 < andrewmw94> The paper doesn't explain some of the stuff very well

15:28 < andrewmw94> so I looked for implementations. The only one I found was in lisp

15:28 < naywhayare> okay; I am looking at the paper

15:28 < naywhayare> heh, it has been a while since I've written lisp...

15:28 < andrewmw94> the problem I'm at now is how they want you to handle the split history when you don't use the minimal overlap split

15:29 < andrewmw94> or how to handle it when reinserting nodes. Answering either of those should answer the other

15:29 < naywhayare> okay, let me do some quick learning

15:29 < andrewmw94> but I can't think of a way to do it using the tree approach they describe.

15:29 < andrewmw94> I think I have a way to do it that doesn't scale well to higher dimensions, but it should be fine for D < 100

15:30 < andrewmw94> Is that an ok assumption to make?

15:30 < naywhayare> generally, people think that trees are bad for higher dimensions (i.e. D > 10 or 20), but in my experience you can build trees for nearest neighbor search and still get speedup with hundreds of dimensions

15:31 < naywhayare> I'm trying to find the relevant section of the paper for the question(s) you're asking

15:31 < andrewmw94> yeah. It only impacts performance while building the tree, so I'm not sure how important speed is.

15:31 < andrewmw94> I think it's section 3.3

15:31 < naywhayare> yeah, that is what I'm reading

15:32 < naywhayare> in general tree construction is O(N log N) with some dependence on d

15:32 < andrewmw94> but this particular issue isn't addressed. They skip a lot of stuff. E.g., they don't specify which topological splits they use (they kind of sort of implied the ones from the R* tree, so that's what I'm using currently)

15:32 < naywhayare> this is usually about the same as the scaling characteristics of the tree-based algorithm you might use; single-tree algorithms are usually O(N log N) for N queries, and dual-tree algorithms are often O(N) (but again those proofs depend on dataset-dependent parameters...)

15:33 < naywhayare> so where I am meandering with my discussion is to say that usually, the construction time of a kd-tree is nowhere near as large as the runtime of the algorithms (...in most cases)

15:33 < naywhayare> I would expect about the same for the R tree, though it might be difference

15:33 < andrewmw94> yeah

15:33 < naywhayare> *different

15:35 < andrewmw94> yeah, I would expect that too. My idea is basically to tie the split history to each individual node using a bitset. Then you can & them all together to see if they all share a dimension. Loop through those starting one after the last dimension used so you tend to split along different dimensions

15:35 < andrewmw94> so it's basically adding D iterations of a loop every time you split a node

15:35 < andrewmw94> I think that would be fairly minimal
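
A minimal C++ sketch of the bitset idea described above, not the code andrewmw94 actually wrote. It assumes one bit per dimension per child, a compile-time upper bound `MaxDims` on the dimensionality, and hypothetical names throughout (`SplitHistory`, `CommonSplitDimension`):

```cpp
#include <bitset>
#include <cstddef>
#include <vector>

// Assumed upper bound on the number of dimensions; chosen for illustration.
constexpr std::size_t MaxDims = 128;

// Bit d is set if the node has been split along dimension d at some point.
using SplitHistory = std::bitset<MaxDims>;

// AND the histories of all children together, then scan the dimensions
// cyclically, starting one after lastDim, for a dimension that every child
// has been split along.  Returns dims if no such dimension exists.
// Requires dims <= MaxDims and lastDim < dims (when dims > 0).
std::size_t CommonSplitDimension(const std::vector<SplitHistory>& children,
                                 std::size_t dims,
                                 std::size_t lastDim)
{
  SplitHistory common;
  common.set();  // Start with all bits set so the first AND copies a history.
  for (const SplitHistory& history : children)
    common &= history;

  // At most dims iterations: the "D iterations of a loop" per split.
  for (std::size_t i = 1; i <= dims; ++i)
  {
    const std::size_t d = (lastDim + i) % dims;
    if (common.test(d))
      return d;
  }
  return dims;
}
```

Starting the scan at `lastDim + 1` is what makes successive splits tend to use different dimensions; the dimension that was just used is only tried last.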

15:36 < andrewmw94> then there remains the problem of choosing where to divide the nodes once you know they all have been split in that dimension.

15:36 < naywhayare> hm, I bet that you could do it with integer divide/modulo operations instead of a loop

15:36 < naywhayare> but if you are working with bits that represent dimensions, then it will probably be quite fast

15:36 < naywhayare> (in comparison to the other operations, such as distance calculations, etc.)

15:37 < andrewmw94> yeah.

15:38 < andrewmw94> I'm really curious how they did this. They claim they could encode the split history into a few bits. I encoded it into integers, but even then, there's nothing like a "few bits"

15:38 < naywhayare> it looks like they are talking about D <= 16

15:38 < naywhayare> which is arguably "a few bits" for a large "few"

15:38 < andrewmw94> a very large few

15:38 < andrewmw94> I usually think of it as 2 or 3

15:39 < naywhayare> let me read through this section again to better understand the splitting algorithm

15:39 < andrewmw94> My calculations indicate that you need about 4*maxNumChildren integers

15:39 < andrewmw94> at least for my method

15:40 < andrewmw94> I don't think it's explained very well. It may be faster for me to try.

15:40 < andrewmw94> Or do you want to see if I missed something?

15:40 < naywhayare> I'm going to read through it anyway and come back with an idea (...hopefully), but if you want to implement your idea too, go ahead

15:41 < naywhayare> there's no guarantee that my thought will be anything better than what you've already come up with, especially given that you've already spent a significant amount of time on this

15:42 < andrewmw94> ok. I'll save what I already have. It works for the minimal split thing and took a while to figure out. I just can't see how you could preserve a meaningful tree structure when moving arbitrary nodes around as required by the topological split or reinsertion of nodes

15:47 < andrewmw94> I'm going to grab lunch. I'll be back in about 30 min

15:47 < naywhayare> okay, hopefully I will have some idea of something by then

15:55 sumedhghaisas has quit [Ping timeout: 272 seconds]

16:18 < andrewmw94> naywhayare: back

16:24 < naywhayare> andrewmw94: I realized that I need to get a bit more perspective on what is going on in this paper, so I'm reading the whole thing (or, at least, the non-experimental stuff)

16:29 < andrewmw94> ok

16:32 < naywhayare> okay, so here is my understanding of the X tree split procedure. I'm going to write it all out just to make sure we're on the same page. you've spent more time with it, so correct me if I am wrong

16:32 < naywhayare> when you insert a point, you find which MBR to insert it into

16:33 < naywhayare> should that MBR be filled to capacity, you try to split that MBR into two

16:33 < naywhayare> the first thing you try is the topological split, which is from the R* tree

16:33 < naywhayare> if you find that the topological split produces overlapping nodes, then you try the overlap minimal split

16:33 < naywhayare> (and I think this is where your question lies)

16:34 < naywhayare> the overlap minimal split is always possible according to their algorithm, but it may produce unbalanced nodes

16:34 < naywhayare> if it does produce unbalanced nodes, then you don't split -- you create a supernode
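
The decision order naywhayare walks through above can be sketched in C++ as follows. This only captures the control flow as summarized in this conversation; the names (`SplitOutcome`, `ChooseSplit`) and the way overlap and balance are passed in as plain numbers are assumptions made for illustration, not the paper's pseudocode or mlpack's API:

```cpp
#include <cstddef>

// Hypothetical labels for the three possible results of an overflow.
enum class SplitOutcome { Topological, OverlapMinimal, Supernode };

// overlapFraction: overlap between the two MBRs produced by the topological
//                  split, as a fraction in [0, 1].
// maxOverlap:      largest overlap fraction that is still acceptable.
// smallerSide:     number of entries in the smaller half of the
//                  overlap-minimal split.
// minFanout:       fewest entries a node may hold for the split to count as
//                  balanced.
SplitOutcome ChooseSplit(double overlapFraction,
                         double maxOverlap,
                         std::size_t smallerSide,
                         std::size_t minFanout)
{
  // 1. Try the topological (R*-tree style) split first.
  if (overlapFraction <= maxOverlap)
    return SplitOutcome::Topological;

  // 2. Otherwise fall back to the overlap-minimal split, if it is balanced.
  if (smallerSide >= minFanout)
    return SplitOutcome::OverlapMinimal;

  // 3. Otherwise do not split at all; the node becomes a supernode.
  return SplitOutcome::Supernode;
}
```

For example, `ChooseSplit(0.35, 0.2, 1, 2)` returns `SplitOutcome::Supernode`: the topological split overlaps too much and the overlap-minimal split would leave one side with a single entry. The 0.2 here matches the 20% overlap figure mentioned later in this log.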

16:35 < naywhayare> when you perform the overlap minimal split, you need the split history. currently I think this is stored implicitly: each RectangleTree node holds the dimension it has split on (...I think; correct me if I am wrong)

16:36 < naywhayare> that means you can construct the split history quickly by an O(log N) (or, O(depth of the MBR)) traversal upwards

16:37 < naywhayare> something gives me the feeling that with the X tree in higher dimensions (...tens to hundreds? more?), the depth of the tree is often much less than the dimensionality of the data

16:37 < naywhayare> I guess I think I understand the full procedure, but I'm not coming to the same question that you had, which is how you handle the split history when not using the minimal overlap split

16:57 < andrewmw94> well, the problem is that the split history is easy to handle if you do a minimum overlap split.

16:57 < andrewmw94> But if you do a topological split, how do you preserve the tree in each of the two new nodes

16:58 < andrewmw94> or likewise if you reinsert a node, how do you preserve the tree?

17:03 < naywhayare> when you say "preserve the tree", you mean the split history tree, I am assuming

17:03 < andrewmw94> yes

17:03 < naywhayare> if you do a topological split, it splits on a certain dimension, right?

17:03 < andrewmw94> yes

17:03 < naywhayare> okay, then just store the split dimension as a member of RectangleTree

17:03 < naywhayare> no need to hold on to a separate tree structure

17:04 < andrewmw94> so just store the last one instead of all of them?

17:04 < naywhayare> right

17:04 < naywhayare> then when you need the actual full list of dimensions that have been split on for a minimum overlap split, you traverse upwards:

17:05 < naywhayare> list of split dimensions = { this.splitDimension, parent.splitDimension, parent.parent.splitDimension, ... }

17:05 < andrewmw94> I don't think that works because in the X tree the two new nodes replace what would be their parent in the binary tree

17:06 < naywhayare> well, but at the time in which you are splitting that parent node, you need the split dimension of that parent node and all its parents, so modify what I wrote a bit, I guess

17:06 < naywhayare> list of split dimension = { parent.splitDimension, parent.parent.splitDimension, ... }

17:07 < naywhayare> I guess I was writing it with the context that the "parent that you are splitting" was the node "this"
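
A minimal C++ sketch of the upward traversal naywhayare proposes here, using the second form of the list (parents only). `Node` is a hypothetical stand-in, not mlpack's RectangleTree, and the sketch assumes each node records exactly one split dimension, which is the assumption andrewmw94 questions in the messages that follow:

```cpp
#include <cstddef>
#include <vector>

// Hypothetical node type: a parent pointer plus the single dimension this
// node was last split on.
struct Node
{
  const Node* parent = nullptr;
  std::size_t splitDimension = 0;
};

// Collects { parent.splitDimension, parent.parent.splitDimension, ... } by
// walking up to the root.  Cost is proportional to the depth of the node.
std::vector<std::size_t> SplitDimensionsAbove(const Node& node)
{
  std::vector<std::size_t> dims;
  for (const Node* p = node.parent; p != nullptr; p = p->parent)
    dims.push_back(p->splitDimension);
  return dims;
}
```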

17:07 < andrewmw94> One of us seems to be misunderstanding the other.

17:07 < andrewmw94> A node in the RectangleTree can have say 5 child nodes

17:07 < andrewmw94> which could have been split in different ways.

17:07 < naywhayare> okay

17:08 < andrewmw94> Let's say it started with two. Then node A splits, creating two new nodes. These are now children of node A's parent, and node A is forgotten

17:08 < andrewmw94> so where do you store the dimension that node A split along in order to make those two new nodes?

17:09 < andrewmw94> If you store it in the parent of those nodes, it loses information about how it split. If you store it in those nodes, how can the parent know the split history of A' and A'' (split from A) and B?

17:10 < naywhayare> hang on, my entire conception falls apart when trees have multiple leaves

17:10 < naywhayare> er, sorry... not "multiple leaves", I mean "more than two children"

17:10 < naywhayare> in this case the split history tree structure is _not_ the same as the structure of the X tree itself

17:11 < andrewmw94> yeah

17:12 < naywhayare> Figure 9 makes this clear, but I wasn't looking at the bottom part...

17:12 < naywhayare> do you know how the lisp implementation does it?

17:12 < andrewmw94> well, I'm not sure if I would say "clear". I needed to read section 3.3 several times and look at another program before I sort of understood

17:13 < andrewmw94> I don't know lisp (other than the little bit I read to try to understand this) so I just focused on the comments

17:14 < naywhayare> hang on. one more clarification:

17:14 < andrewmw94> https://github.com/rpav/spatial-trees/blob/master/x-trees.lisp

17:14 < naywhayare> nevermind, I'm not sure the clarification I am looking for makes sense

17:15 < naywhayare> in Figure 9, the nodes contained by node S (i.e. A'', B'', C, D, E) appear to me to be nodes of their own with their own MBRs

17:16 < andrewmw94> yeah, I understand those to be nodes of the X tree (or possibly MBRs of non-point data in the X tree)

17:16 < naywhayare> "In typical academic fashion, this is extremely badly explained in the paper itself; I think I've reconstructed what's necessary, but it's not obvious that it's a huge win."

17:17 < andrewmw94> I liked that part too. :)

17:17 < naywhayare> so it seems to me that the X tree nodes *must* also contain split tree structures

17:17 < naywhayare> they can't be thrown away after construction time, since a user may call Insert() with new points and the information is necessary then

17:18 < andrewmw94> well, they have to store the history somehow. I'm not quite done, but I'm close enough to be fairly certain that my idea of storing a bitfield will work

17:19 < naywhayare> okay; can you go ahead and finish that, then I can take a look at the implementation?

17:19 < naywhayare> this still doesn't address your original question, though, which was how to handle the split tree for a topological split

17:19 < andrewmw94> Yeah. I don't see how it is possible to preserve the relevant binary tree structure and have nodes "randomly" reinserted in different places.

17:19 < naywhayare> take a look at Figure 8... specifically, if (overlap(r1, r2) > MAX_OVERLAP)

17:20 < andrewmw94> yeah?

17:20 < naywhayare> so, that condition being true implies that the topological split failed

17:21 < naywhayare> but if it is false, it seems like all you would need to do is insert the information about the split into the parent node's split tree

17:21 < andrewmw94> Well, not exactly "failed". It has more than 20% overlap.

17:22 < naywhayare> ah, yeah, "failed" is a bit harsh. but either way, it means that the topological split won't be used and the overlap minimal split will be used

17:22 < naywhayare> but in the case where the topological split *is* used, it does split on a certain dimension, so it seems like you could add it to the split tree in the same way

17:22 < andrewmw94> yeah. Minimal overlap is easy. You take the left "half" and the right "half" of the split history

17:23 < andrewmw94> but when you do topological splits, there is no guarantee that e.g. B'' and D of fig. 9 will be assigned to the same node

17:23 < andrewmw94> so what do you do with the split history then?

17:25 < naywhayare> this is a difficult problem. I need to get lunch (death timer is down to about 10 minutes, gotta eat soon), and I will try and think about this over lunch

17:25 < andrewmw94> I can think of at least one possible solution, but it's rather complicated, loses lots of information, and I have little indication that it is what the authors intended

17:26 < naywhayare> one quick thought is that maybe the authors intended a topological split without random reinsertion?

17:27 < andrewmw94> Yeah, I'm kind of assuming that at this point. But the assignment in the topological split is still "random" in that I think I can arbitrarily break up the split history tree

17:27 < andrewmw94> by specially constructing a sequence of points to insert

17:27 < naywhayare> let me get lunch and wrap my head around the problem further. I don't think that I am thinking about this quite correctly

17:27 < naywhayare> I'll be back in about an hour, I think

17:27 < andrewmw94> ok. Hopefully I'll have the other idea implemented.

17:28 < naywhayare> okay; if you get hung up, maybe it is a good idea to move to the dual-tree traverser and start testing that in parallel with finishing the X tree

18:43 < naywhayare> andrewmw94: sorry, I was out for a little longer than I thought...

18:43 < andrewmw94> no problem. I just finished writing it. I'm compiling now.

18:43 < naywhayare> ah, okay. when it works, can you check it in? then I can look through and understand your idea

18:44 < naywhayare> I have a basic idea of how the whole thing can be stored, but it doesn't completely account for random reinsertion

18:45 < naywhayare> although, more of my thought has gone into "how can we store the split history for an X tree node but not for an R tree node?" and as usual my answer is with templates... :)

18:48 < andrewmw94> more templates is always a good idea :)

18:49 < naywhayare> I think sumedh would agree :)
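
One way the template idea mentioned above could look, as a rough C++ sketch. The log does not say what design naywhayare had in mind, so every name here (`NoAuxiliaryInfo`, `SplitHistoryInfo`, `TreeNode`) is hypothetical; the only point illustrated is that a template parameter can give X tree nodes a split history while R tree nodes carry none:

```cpp
#include <cstddef>
#include <vector>

// Policy for trees that do not track split history (e.g. a plain R tree).
struct NoAuxiliaryInfo
{
  void RecordSplit(std::size_t /* dimension */) { }
  bool HasSplitOn(std::size_t /* dimension */) const { return false; }
};

// Policy for trees that need a per-node split history (e.g. an X tree).
struct SplitHistoryInfo
{
  explicit SplitHistoryInfo(std::size_t dims = 0) : splitOn(dims, false) { }

  // at() throws std::out_of_range if dimension >= dims.
  void RecordSplit(std::size_t dimension) { splitOn.at(dimension) = true; }
  bool HasSplitOn(std::size_t dimension) const { return splitOn.at(dimension); }

  std::vector<bool> splitOn;  // One flag per dimension.
};

// The node type is written once; the policy decides what extra state exists.
template<typename AuxiliaryInfo>
struct TreeNode
{
  AuxiliaryInfo aux;
};

using RTreeNode = TreeNode<NoAuxiliaryInfo>;
using XTreeNode = TreeNode<SplitHistoryInfo>;
```

Splitting code can then call `node.aux.RecordSplit(d)` unconditionally; for `RTreeNode` the call does nothing and no history is stored.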

19:05 jbc__ has quit [Quit: jbc__]

22:11 jbc__ has joined #mlpack

22:31 jbc__ has quit [Quit: jbc__]