naywhayare changed the topic of #mlpack to: http://www.mlpack.org/ -- We don't respond instantly... but we will respond. Give it a few minutes. Or hours. -- Channel logs: http://www.mlpack.org/irc/
< jenkins-mlpack>
Ryan Curtin: First pass: try to make things 80 characters (I didn't touch the big long type
< jenkins-mlpack>
lines...).
< jenkins-mlpack>
Starting build #2103 for job mlpack - svn checkin test (previous build: SUCCESS)
oldbeardo has joined #mlpack
< andrewmw94>
naywhayare: So the dual tree traverser that I wrote doesn't work correctly, and I'm pretty sure it's due to the traversalInfo stuff. Could you explain what that does and what I need to do with it?
oldbeardo has quit [Ping timeout: 246 seconds]
< naywhayare>
andrewmw94: yeah, the traversal info stuff is an idea that I had that I hadn't had time to properly document
< naywhayare>
a TraversalInfo object is held by the Rules class
< naywhayare>
and it is expected that when Score() or BaseCase() are called, the TraversalInfo object that the Rules class holds contains information about the preceding node combination
< naywhayare>
so I guess in short, when you call Score(), the Rules class will update its internally held TraversalInfo object
< naywhayare>
in the traversal, you must preserve this information, so that when Score() is called on the child combinations, the rules class has the same TraversalInfo object
< andrewmw94>
Does "preceding node combination" mean the parent nodes?
< naywhayare>
yes; for any given node combination, the "preceding node combination" is the combination "one step higher" than that combination
< naywhayare>
but that's not necessarily both parent nodes; sometimes a traversal may recurse down the query node but not the reference (or vice versa)
< naywhayare>
if I haven't done a great job of explaining, please let me know. I have actually never explained this concept to anyone at all, so I am not sure how much sense my explanations make
< andrewmw94>
I don't understand yet, but that doesn't mean your explanation is bad. Give me a few minutes to compare it too the code in BinaryTree
< andrewmw94>
ok, let's see if I can explain it back to you
< naywhayare>
yeah, that is a good idea
< andrewmw94>
When Traverse is called, the rule will have a certain traversal info. You store that.
< andrewmw94>
then for each time you call rule.Score(), you need to prefix it with rule.TraversalInfo() = traversalInfo
< andrewmw94>
(technically, you don't have to the first time)
< naywhayare>
I think that is correct, yes
< naywhayare>
you must set the rule's traversal info to be that traversal info resulting from the preceding node combination's Score()/BaseCase() calls
< naywhayare>
so you recurse into one child, then restore the traversal info, then recurse into the next child, then restore the traversal info, etc.
< andrewmw94>
after score is called, rule.TraversalInfo() will change, so you need to save that in say travInfos[i]. Then before you call Traverse(nodeCombination[i]) you need to have rule.TraversalInfo() = travInfos[i]
< naywhayare>
yeah, but I think that it is not always necessary to store a big array of traversal infos
< naywhayare>
(in some cases it may be)
< naywhayare>
(it depends on the situation and how the traversal is set up)
< andrewmw94>
hmm. So I score all of the nodes, and then sort them by lowest to highest score.
< naywhayare>
ah... in that case, holding them all is necessary
< andrewmw94>
ok, just making sure
< naywhayare>
so, a bit of background
< naywhayare>
the whole reason the traversal infos exist is because of this concept called a "parent-child prune" (or, really, you can call it whatever you want, I guess)
< naywhayare>
suppose that I am doing nearest neighbor search
< naywhayare>
and at some node combination (N_q, N_r), I can say that MinDistance(N_q, N_r) is, say, 10
< naywhayare>
I can also do a little bit of reasoning about some child N_qc of N_q, if I know the distance from the center of N_qc to the center of N_q
< naywhayare>
suppose that this tree type has ball-shaped bounds, which can be defined as just a center and a radius
< naywhayare>
so if N_q has radius r_q and N_qc has radius r_qc, then using the triangle inequality we can say that MinDistance(N_qc, N_r) is greater than or equal to (10 + r_q - r_qc)
< naywhayare>
oops, sorry, that's (10 + r_q - r_qc - distance between center of N_q and N_qc)
< naywhayare>
so this calculation is quite fast if you've cached the distance between centers (which is usually done with ParentDistance())
< naywhayare>
and it's possible that you can find that MinDistance(N_qc, N_r) is greater than MinDistance(N_q, N_r), sometimes enough so that you can make a prune
< naywhayare>
and you've made that prune without ever actually doing the O(d) MinDistance(N_qc, N_r) calculation
< naywhayare>
I hope that I've gotten the concept across somewhat clearly. it comes from John Langford's original cover tree implementation, and it isn't commented very well there
< andrewmw94>
I think I got it
< naywhayare>
at some point, I need to write up an explanation of these "little tricks"; it becomes much easier to see with a simple figure
< naywhayare>
anyway, if you're going to do this parent-child prune, you need to know MinDistance(N_q, N_r)
< naywhayare>
you can't cache it in the Rules class since that doesn't know anything about the traversal... it just has BaseCase() and Score() which can be called with basically any arbitrary ordering
< naywhayare>
you can't cache it in the StatisticType since there are O(N^2) combinations and you only have O(2N) StatisticTypes (for O(N) nodes in each tree)
< naywhayare>
you could modify the signature of Score() and BaseCase() significantly, but that's super tedious, and the needs for parent-child pruning may be different based on the problem being solved
< naywhayare>
so a TraversalInfo class that's defined by the Rules class is one way to do this
< naywhayare>
you can either pass the TraversalInfo object to Score(), thereby modifying the Score function, or you can have a traversal info stored in the rules class which gets set by the traversal
< naywhayare>
the first option results in some cases where the traversal info is unnecessarily copied (i.e. the first recursion after any Score() call)
< naywhayare>
so I haven't thought of anything better than the second option
< naywhayare>
obviously this all needs to be written up into some very long comprehensive document
< naywhayare>
I just need to find the time... (I have been saying that for years now)
< andrewmw94>
yeah. I found a few bugs based on this, but it still doesn't seem to be working (at least, it's taking longer than the single tree traverser would)
< andrewmw94>
but those bugs would have taken me hours to find without this explanation
< naywhayare>
I'll take a look through whatever you have after lunch and see if I spot anything obvious
< naywhayare>
I noticed that the single tree traverser takes a long time
< naywhayare>
but I don't think it's the traverser code that makes it take a long time... I think it's the fact that the RectangleTree is holding a local dataset (but I'm not sure; I haven't done profiling yet)
< naywhayare>
the tree-building procedure seems pretty fast; not as fast as the binary space tree, but still quite fast
< jenkins-mlpack>
* Ryan Curtin: Clarify some difficult ternary operator operations (the compiler should give
< jenkins-mlpack>
something that is resultantly the same code anyway). Also, use the new
< andrewmw94>
I think that depends on the tree type. The R tree builds quickly but the search is pretty slow
< andrewmw94>
the R* tree (latest version) builds pretty slowly, but the search isn't too bad.
< andrewmw94>
still several times slower than the BSP tree, but about on par with the cover tree IIRC
< naywhayare>
yeah; I'll have to run and get actual benchmarks
< naywhayare>
what I was looking at when I did the run was the number of BaseCase() calculations
< naywhayare>
on the corel dataset it was something like 300M for the R tree, 200M for the R* tree, and 30M for the binary space tree
< naywhayare>
saying what it *should* be is a little more difficult though
< andrewmw94>
hmm. How many dimensions is that?
< naywhayare>
32 I think
< andrewmw94>
Yeah, I think so too.
< naywhayare>
the kd-tree performs poorly as dimension increases, but I am not sure how it performs w.r.t. dimension compared to the R tree variants
< andrewmw94>
According to the X tree paper, that would present serious issues for overlap for the R tree and R* tree
< andrewmw94>
the BSP tree doesn't have overlap
< andrewmw94>
but it also has 100% coverage
< naywhayare>
what concerned me more was that the R tree does 10 times more base case calculations, but takes (in real time) about 50 to 60 times longer, not ~10
< naywhayare>
yeah, that is true
< andrewmw94>
so I'm unsure how high the number of base cases should be
< naywhayare>
your reasoning is probably right, and 300M may be the minimum that the R tree can do
< andrewmw94>
I hope not, but we'll see.
< naywhayare>
yeah; it's something we can figure out as time goes on
< andrewmw94>
in robocode at least, one of the people reported that he couldn't get the R trees to work as well as the BSP trees. I always thought that strange.
< andrewmw94>
but maybe it's just the way it is.
< naywhayare>
the lack of overlap is probably part of the issue
< andrewmw94>
yeah, bulk loading should help that enormously.
< andrewmw94>
also, how long ago did you test the R* tree.
< naywhayare>
a day or two ago
< andrewmw94>
ahh. Never mind then
< naywhayare>
I've made a few minor improvements to that code, but I never made any substantial speedup in the build time
< naywhayare>
probably 3-5% speedup (dependent on the dataset and solar flares and everything else as usual)
< andrewmw94>
indeed
< andrewmw94>
CMEs can do a lot
< naywhayare>
as time goes on I'll probably compile actual numbers and benchmarks
oldbeardo has joined #mlpack
sumedhghaisas has joined #mlpack
oldbeardo has quit [Quit: Page closed]
sumedhghaisas has quit [Ping timeout: 244 seconds]
sumedhghaisas has joined #mlpack
govg has quit [Read error: Connection reset by peer]
govg has joined #mlpack
< jenkins-mlpack>
Starting build #2104 for job mlpack - svn checkin test (previous build: SUCCESS)