#mlpack on 2014-08-20 — irc logs at libera.irclog.whitequark.org

2014-05-21 16:24 naywhayare changed the topic of #mlpack to: http://www.mlpack.org/ -- We don't respond instantly... but we will respond. Give it a few minutes. Or hours. -- Channel logs: http://www.mlpack.org/irc/

00:01 jbc_ has quit [Quit: jbc_]

00:33 sumedh_ has quit [Ping timeout: 272 seconds]

01:32 jbc_ has joined #mlpack

02:16 < jenkins-mlpack> Project mlpack - nightly matrix build build #566: ABORTED in 2 days 22 hr: http://big.cc.gt.atl.ga.us:8080/job/mlpack%20-%20nightly%20matrix%20build/566/

02:16 < jenkins-mlpack> * siddharth.950: Made Reg SVD work with CF.

02:16 < jenkins-mlpack> * Ryan Curtin: Minor formatting change, and use zeros() instead of fill().

02:16 < jenkins-mlpack> * saxena.udit: Decision Stumps modified, along with adding Classify() function to AdaBoost. Other minor changes (renaming).

02:16 < jenkins-mlpack> * andrewmw94: X tree

03:10 jbc_ has quit [Quit: jbc_]

03:16 andrewmw94 has quit [Quit: Leaving.]

07:56 govg has quit [Ping timeout: 245 seconds]

07:58 govg has joined #mlpack

07:58 govg has quit [Changing host]

07:58 govg has joined #mlpack

09:01 sumedhghaisas has joined #mlpack

11:30 jbc_ has joined #mlpack

12:09 sumedhghaisas has quit [Ping timeout: 260 seconds]

12:39 jbc_ has quit [Quit: jbc_]

14:27 andrewmw94 has joined #mlpack

14:55 < jenkins-mlpack> Project mlpack - svn checkin test build #2102: SUCCESS in 3 days 10 hr: http://big.cc.gt.atl.ga.us:8080/job/mlpack%20-%20svn%20checkin%20test/2102/

14:55 < jenkins-mlpack> Ryan Curtin: First pass: try to make things 80 characters (I didn't touch the big long type

14:55 < jenkins-mlpack> lines...).

14:55 < jenkins-mlpack> Starting build #2103 for job mlpack - svn checkin test (previous build: SUCCESS)

14:57 oldbeardo has joined #mlpack

15:03 < andrewmw94> naywhayare: So the dual tree traverser that I wrote doesn't work correctly, and I'm pretty sure it's due to the traversalInfo stuff. Could you explain what that does and what I need to do with it?

15:24 oldbeardo has quit [Ping timeout: 246 seconds]

15:56 < naywhayare> andrewmw94: yeah, the traversal info stuff is an idea that I had that I hadn't had time to properly document

15:57 < naywhayare> a TraversalInfo object is held by the Rules class

15:58 < naywhayare> and it is expected that when Score() or BaseCase() are called, the TraversalInfo object that the Rules class holds contains information about the preceding node combination

15:58 < naywhayare> so I guess in short, when you call Score(), the Rules class will update its internally held TraversalInfo object

15:59 < naywhayare> in the traversal, you must preserve this information, so that when Score() is called on the child combinations, the rules class has the same TraversalInfo object

15:59 < naywhayare> i.e.

15:59 < naywhayare> rules.Score(queryNode, referenceNode);

15:59 < naywhayare> TraversalInfo t = rules.TraversalInfo(); // Copy it to save it.

15:59 < naywhayare> // Now recurse...

16:00 < naywhayare> if (rules.Score(queryNode.Child(0), referenceNode.Child(0)) != DBL_MAX) { Traverse(queryNode.Child(0), referenceNode.Child(0)); }

16:00 < naywhayare> // Restore traversal info before next score.

16:00 < naywhayare> rules.TraversalInfo() = t;

16:01 < naywhayare> if (rules.Score(queryNode.Child(0), referenceNode.Child(1)) != DBL_MAX) { Traverse(queryNode.Child(0), referenceNode.Child(1)); }

16:01 < naywhayare> and so forth...
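
[Editor's sketch] The save-and-restore pattern naywhayare describes above could look roughly like the following. Everything here (ToyNode, ToyTraversalInfo, ToyRules, the log member) is a hypothetical stand-in invented for illustration and is not mlpack's actual code; the only thing taken from the conversation is the idea that Score() overwrites the rules' traversal info and that the traverser must restore it before scoring the next child combination.

```cpp
#include <array>
#include <cassert>
#include <cfloat>
#include <cstddef>
#include <vector>

// Hypothetical toy tree node; not mlpack's tree API.
struct ToyNode
{
  int id;
  std::vector<ToyNode> children;
};

// Hypothetical traversal info: remembers the last node combination scored.
struct ToyTraversalInfo
{
  int lastQuery = -1;
  int lastReference = -1;
};

// Hypothetical rules class. Score() reads the stored info (which should
// describe the preceding combination) and then overwrites it.
class ToyRules
{
 public:
  double Score(const ToyNode& q, const ToyNode& r)
  {
    // Record which combination the rules believed was the preceding one.
    log.push_back({info.lastQuery, info.lastReference, q.id, r.id});
    info.lastQuery = q.id;
    info.lastReference = r.id;
    return 0.0; // This toy never prunes (a prune would return DBL_MAX).
  }

  ToyTraversalInfo& TraversalInfo() { return info; }

  // Each entry: {preceding query, preceding reference, query, reference}.
  std::vector<std::array<int, 4>> log;

 private:
  ToyTraversalInfo info;
};

// Assumes the caller has just called rules.Score(q, r).
void Traverse(ToyRules& rules, const ToyNode& q, const ToyNode& r)
{
  assert(rules.TraversalInfo().lastQuery == q.id);
  assert(rules.TraversalInfo().lastReference == r.id);

  // Copy the info produced by Score(q, r) so it can be restored later.
  const ToyTraversalInfo saved = rules.TraversalInfo();

  for (const ToyNode& qc : q.children)
  {
    for (const ToyNode& rc : r.children)
    {
      // Restore before every Score(); the previous recursion changed it.
      rules.TraversalInfo() = saved;
      if (rules.Score(qc, rc) != DBL_MAX)
        Traverse(rules, qc, rc);
    }
  }
}
```

In this toy, dropping the restore line would make the second child combination see the first child combination as its "preceding" one, which is the kind of bug discussed here.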

16:01 < andrewmw94> Does "preceding node combination" mean the parent nodes?

16:01 < naywhayare> yes; for any given node combination, the "preceding node combination" is the combination "one step higher" than that combination

16:02 < naywhayare> but that's not necessarily both parent nodes; sometimes a traversal may recurse down the query node but not the reference (or vice versa)

16:02 < naywhayare> if I haven't done a great job of explaining, please let me know. I have actually never explained this concept to anyone at all, so I am not sure how much sense my explanations make

16:03 < andrewmw94> I don't understand yet, but that doesn't mean your explanation is bad. Give me a few minutes to compare it to the code in BinaryTree

16:06 < andrewmw94> ok, let's see if I can explain it back to you

16:06 < naywhayare> yeah, that is a good idea

16:06 < andrewmw94> When Traverse is called, the rule will have a certain traversal info. You store that.

16:07 < andrewmw94> then for each time you call rule.Score(), you need to prefix it with rule.TraversalInfo() = traversalInfo

16:07 < andrewmw94> (technically, you don't have to the first time)

16:07 < naywhayare> I think that is correct, yes

16:07 < naywhayare> you must set the rule's traversal info to be that traversal info resulting from the preceding node combination's Score()/BaseCase() calls

16:08 < naywhayare> so you recurse into one child, then restore the traversal info, then recurse into the next child, then restore the traversal info, etc.

16:09 < andrewmw94> after score is called, rule.TraversalInfo() will change, so you need to save that in say travInfos[i]. Then before you call Traverse(nodeCombination[i]) you need to have rule.TraversalInfo() = travInfos[i]

16:10 < naywhayare> yeah, but I think that it is not always necessary to store a big array of traversal infos

16:10 < naywhayare> (in some cases it may be)

16:10 < naywhayare> (it depends on the situation and how the traversal is set up)

16:12 < andrewmw94> hmm. So I score all of the nodes, and then sort them by lowest to highest score.

16:12 < naywhayare> ah... in that case, holding them all is necessary
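
[Editor's sketch] When all child combinations are scored first and then visited in sorted order, as andrewmw94 describes, one traversal info per combination has to be kept. A minimal sketch under that assumption follows; SketchNode, SketchTraversalInfo, SketchRules, ScoredPair and SortedTraverse are hypothetical names made up for this example, and the "score" is just a 1-D distance between node centers, not anything mlpack computes.

```cpp
#include <algorithm>
#include <cassert>
#include <cfloat>
#include <cmath>
#include <vector>

// Hypothetical toy node with a 1-D center; not mlpack's tree API.
struct SketchNode
{
  double center;
  std::vector<SketchNode> children;
};

// Hypothetical traversal info: the score of the preceding combination.
struct SketchTraversalInfo
{
  double lastScore = 0.0;
};

class SketchRules
{
 public:
  double Score(const SketchNode& q, const SketchNode& r)
  {
    // A real rule could use info.lastScore here to attempt a cheap prune.
    parentScoresSeen.push_back(info.lastScore);
    info.lastScore = std::abs(q.center - r.center);
    return info.lastScore;
  }

  SketchTraversalInfo& TraversalInfo() { return info; }

  // What each Score() call believed the preceding combination's score was.
  std::vector<double> parentScoresSeen;

 private:
  SketchTraversalInfo info;
};

// One scored child combination plus the info left behind by its Score().
struct ScoredPair
{
  const SketchNode* q;
  const SketchNode* r;
  double score;
  SketchTraversalInfo info;
};

// Assumes the caller has just called rules.Score(q, r).
void SortedTraverse(SketchRules& rules,
                    const SketchNode& q,
                    const SketchNode& r)
{
  const SketchTraversalInfo parentInfo = rules.TraversalInfo();

  // Score every child combination, restoring the parent's info each time.
  std::vector<ScoredPair> pairs;
  for (const SketchNode& qc : q.children)
  {
    for (const SketchNode& rc : r.children)
    {
      rules.TraversalInfo() = parentInfo;
      const double s = rules.Score(qc, rc);
      if (s != DBL_MAX)
        pairs.push_back({&qc, &rc, s, rules.TraversalInfo()});
    }
  }

  // Visit the most promising (lowest score) combinations first.
  std::sort(pairs.begin(), pairs.end(),
            [](const ScoredPair& a, const ScoredPair& b)
            { return a.score < b.score; });

  for (const ScoredPair& p : pairs)
  {
    // Restore the info that belongs to this combination before recursing.
    rules.TraversalInfo() = p.info;
    SortedTraverse(rules, *p.q, *p.r);
  }
}
```

The stored copy in each ScoredPair plays the role of travInfos[i] from the discussion: without it, every recursion would start from the info of whichever combination happened to be scored last.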

16:12 < andrewmw94> ok, just making sure

16:13 < naywhayare> so, a bit of background

16:14 < naywhayare> the whole reason the traversal infos exist is because of this concept called a "parent-child prune" (or, really, you can call it whatever you want, I guess)

16:14 < naywhayare> suppose that I am doing nearest neighbor search

16:15 < naywhayare> and at some node combination (N_q, N_r), I can say that MinDistance(N_q, N_r) is, say, 10

16:16 < naywhayare> I can also do a little bit of reasoning about some child N_qc of N_q, if I know the distance from the center of N_qc to the center of N_q

16:16 < naywhayare> suppose that this tree type has ball-shaped bounds, which can be defined as just a center and a radius

16:17 < naywhayare> so if N_q has radius r_q and N_qc has radius r_qc, then using the triangle inequality we can say that MinDistance(N_qc, N_r) is greater than or equal to (10 + r_q - r_qc)

16:17 < naywhayare> oops, sorry, that's (10 + r_q - r_qc - distance between center of N_q and N_qc)

16:18 < naywhayare> so this calculation is quite fast if you've cached the distance between centers (which is usually done with ParentDistance())

16:18 < naywhayare> and it's possible that you can find that MinDistance(N_qc, N_r) is greater than MinDistance(N_q, N_r), sometimes enough so that you can make a prune

16:18 < naywhayare> and you've made that prune without ever actually doing the O(d) MinDistance(N_qc, N_r) calculation
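
[Editor's sketch] The parent-child bound above can be checked with a small example. This sketch assumes ball bounds whose MinDistance is the center distance minus both radii, clamped at zero; under that assumption the bound only holds when the parent's MinDistance is strictly positive (that is, not clamped), which is the case in the example with a value of 10. Ball, CenterDistance, MinDistance and ParentChildLowerBound are hypothetical names for this illustration, not mlpack functions.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical ball bound: a center and a radius.
struct Ball
{
  std::vector<double> center;
  double radius;
};

// Euclidean distance between centers; O(d) in the dimension d.
double CenterDistance(const Ball& a, const Ball& b)
{
  assert(a.center.size() == b.center.size());
  double sum = 0.0;
  for (std::size_t i = 0; i < a.center.size(); ++i)
  {
    const double diff = a.center[i] - b.center[i];
    sum += diff * diff;
  }
  return std::sqrt(sum);
}

// Minimum distance between two balls, clamped at zero when they overlap.
double MinDistance(const Ball& a, const Ball& b)
{
  return std::max(0.0, CenterDistance(a, b) - a.radius - b.radius);
}

// O(1) lower bound on MinDistance(N_qc, N_r) from cached quantities:
//   MinDistance(N_q, N_r) + r_q - r_qc - distance(center N_q, center N_qc).
// Only valid here when parentMinDist > 0 (the parent distance is unclamped).
double ParentChildLowerBound(const double parentMinDist,
                             const double parentRadius,
                             const double childRadius,
                             const double parentDistance)
{
  assert(parentMinDist > 0.0);
  return parentMinDist + parentRadius - childRadius - parentDistance;
}
```

For instance, with N_q at the origin with radius 4, a child N_qc at (-2, 0) with radius 1, and N_r a zero-radius ball at (14, 0), the parent MinDistance is 10 and the cheap bound for the child is 10 + 4 - 1 - 2 = 11, while the exact MinDistance(N_qc, N_r) is 15. If the current best candidate distance were, say, 10.5, the child could be pruned from the bound alone.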

16:19 < naywhayare> I hope that I've gotten the concept across somewhat clearly. it comes from John Langford's original cover tree implementation, and it isn't commented very well there

16:19 < andrewmw94> I think I got it

16:19 < naywhayare> at some point, I need to write up an explanation of these "little tricks"; it becomes much easier to see with a simple figure

16:19 < naywhayare> anyway, if you're going to do this parent-child prune, you need to know MinDistance(N_q, N_r)

16:20 < naywhayare> you can't cache it in the Rules class since that doesn't know anything about the traversal... it just has BaseCase() and Score() which can be called with basically any arbitrary ordering

16:20 < naywhayare> you can't cache it in the StatisticType since there are O(N^2) combinations and you only have O(2N) StatisticTypes (for O(N) nodes in each tree)

16:21 < naywhayare> you could modify the signature of Score() and BaseCase() significantly, but that's super tedious, and the needs for parent-child pruning may be different based on the problem being solved

16:21 < naywhayare> so a TraversalInfo class that's defined by the Rules class is one way to do this

16:22 < naywhayare> you can either pass the TraversalInfo object to Score(), thereby modifying the Score function, or you can have a traversal info stored in the rules class which gets set by the traversal

16:22 < naywhayare> the first option results in some cases where the traversal info is unnecessarily copied (i.e. the first recursion after any Score() call)

16:23 < naywhayare> so I haven't thought of anything better than the second option

16:23 < naywhayare> obviously this all needs to be written up into some very long comprehensive document

16:24 < naywhayare> I just need to find the time... (I have been saying that for years now)

16:24 < andrewmw94> yeah. I found a few bugs based on this, but it still doesn't seem to be working (at least, it's taking longer than the single tree traverser would)

16:25 < andrewmw94> but those bugs would have taken me hours to find without this explanation

16:25 < naywhayare> I'll take a look through whatever you have after lunch and see if I spot anything obvious

16:25 < naywhayare> I noticed that the single tree traverser takes a long time

16:25 < naywhayare> but I don't think it's the traverser code that makes it take a long time... I think it's the fact that the RectangleTree is holding a local dataset (but I'm not sure; I haven't done profiling yet)

16:26 < jenkins-mlpack> Project mlpack - svn checkin test build #2103: SUCCESS in 1 hr 30 min: http://big.cc.gt.atl.ga.us:8080/job/mlpack%20-%20svn%20checkin%20test/2103/

16:26 < jenkins-mlpack> * andrewmw94: Dual tree traverser.

16:26 < jenkins-mlpack> * siddharth.950: Added help information in CF executable.

16:26 < jenkins-mlpack> * siddharth.950: Added Reg SVD to CF executable.

16:26 < jenkins-mlpack> * Ryan Curtin: Don't use auxiliary structures; find the best node with O(1) storage. Minor

16:26 < jenkins-mlpack> speed improvement.

16:26 < jenkins-mlpack> * Ryan Curtin: Simplify methods a little, and use int& instead of int*.

16:26 < jenkins-mlpack> * Ryan Curtin: Avoid math::Range copy, although realistically gcc should be avoiding that

16:26 < jenkins-mlpack> anyway when it recognizes that the variable is effectively const.

16:26 < jenkins-mlpack> * Ryan Curtin: arma::prod is faster, in this case.

16:26 < jenkins-mlpack> * Ryan Curtin: Inline HRectBound functions. Minor to negligible speedup, but certainly no

16:26 < jenkins-mlpack> slowdown. (at least on gcc)

16:26 < jenkins-mlpack> * Ryan Curtin: Revert most of r17065, since my testing was actually on the R* tree and not on

16:26 < jenkins-mlpack> the R tree, so my results were useless. In reality those changes I made more

16:26 < jenkins-mlpack> than doubled the tree construction time.

16:26 < jenkins-mlpack> * siddharth.950: Added QUIC-SVD code example.

16:26 < naywhayare> the tree-building procedure seems pretty fast; not as fast as the binary space tree, but still quite fast

16:26 < jenkins-mlpack> * Ryan Curtin: Clarify some difficult ternary operator operations (the compiler should give

16:26 < jenkins-mlpack> something that is resultantly the same code anyway). Also, use the new

16:27 < andrewmw94> I think that depends on the tree type. The R tree builds quickly but the search is pretty slow

16:27 < andrewmw94> the R* tree (latest version) builds pretty slowly, but the search isn't too bad.

16:27 < andrewmw94> still several times slower than the BSP tree, but about on par with the cover tree IIRC

16:28 < naywhayare> yeah; I'll have to run and get actual benchmarks

16:29 < naywhayare> what I was looking at when I did the run was the number of BaseCase() calculations

16:29 < naywhayare> on the corel dataset it was something like 300M for the R tree, 200M for the R* tree, and 30M for the binary space tree

16:29 < naywhayare> saying what it *should* be is a little more difficult though

16:29 < andrewmw94> hmm. How many dimensions is that?

16:29 < naywhayare> 32 I think

16:30 < andrewmw94> Yeah, I think so too.

16:30 < naywhayare> the kd-tree performs poorly as dimension increases, but I am not sure how it performs w.r.t. dimension compared to the R tree variants

16:30 < andrewmw94> According to the X tree paper, that would present serious issues for overlap for the R tree and R* tree

16:30 < andrewmw94> the BSP tree doesn't have overlap

16:31 < andrewmw94> but it also has 100% coverage

16:31 < naywhayare> what concerned me more was that the R tree does 10 times more base case calculations, but takes (in real time) about 50 to 60 times longer, not ~10

16:31 < naywhayare> yeah, that is true

16:31 < andrewmw94> so I'm unsure how high the number of base cases should be

16:31 < naywhayare> your reasoning is probably right, and 300M may be the minimum that the R tree can do

16:31 < andrewmw94> I hope not, but we'll see.

16:32 < naywhayare> yeah; it's something we can figure out as time goes on

16:32 < andrewmw94> in robocode at least, one of the people reported that he couldn't get the R trees to work as well as the BSP trees. I always thought that strange.

16:33 < andrewmw94> but maybe it's just the way it is.

16:33 < naywhayare> the lack of overlap is probably part of the issue

16:33 < andrewmw94> yeah, bulk loading should help that enormously.

16:33 < andrewmw94> also, how long ago did you test the R* tree?

16:34 < naywhayare> a day or two ago

16:34 < andrewmw94> ahh. Never mind then

16:34 < naywhayare> I've made a few minor improvements to that code, but I never made any substantial speedup in the build time

16:34 < naywhayare> probably 3-5% speedup (dependent on the dataset and solar flares and everything else as usual)

16:35 < andrewmw94> indeed

16:35 < andrewmw94> CMEs can do a lot

16:36 < naywhayare> as time goes on I'll probably compile actual numbers and benchmarks

16:39 oldbeardo has joined #mlpack

17:04 sumedhghaisas has joined #mlpack

17:21 oldbeardo has quit [Quit: Page closed]

17:34 sumedhghaisas has quit [Ping timeout: 244 seconds]

17:34 sumedhghaisas has joined #mlpack

17:37 govg has quit [Read error: Connection reset by peer]

17:55 govg has joined #mlpack

20:04 < jenkins-mlpack> Starting build #2104 for job mlpack - svn checkin test (previous build: SUCCESS)

20:19 govg has quit [Ping timeout: 250 seconds]

20:36 jbc_ has joined #mlpack

21:35 < jenkins-mlpack> Project mlpack - svn checkin test build #2104: SUCCESS in 1 hr 31 min: http://big.cc.gt.atl.ga.us:8080/job/mlpack%20-%20svn%20checkin%20test/2104/

21:35 < jenkins-mlpack> * Ryan Curtin: Remove trailing spaces thanks to vimrc, and fix comment correctness.

21:35 < jenkins-mlpack> * andrewmw94: dual tree traverser bug fixes.

21:35 < jenkins-mlpack> * Ryan Curtin: Formatting fixes for GMM conversion utility.

21:35 < jenkins-mlpack> Starting build #2105 for job mlpack - svn checkin test (previous build: SUCCESS)

23:08 < jenkins-mlpack> Project mlpack - svn checkin test build #2105: SUCCESS in 1 hr 32 min: http://big.cc.gt.atl.ga.us:8080/job/mlpack%20-%20svn%20checkin%20test/2105/

23:08 < jenkins-mlpack> * Ryan Curtin: Convert tabs to spaces.

23:08 < jenkins-mlpack> * Ryan Curtin: Comment normalization for doxygen, and remove no-longer-relevant comment.

23:08 < jenkins-mlpack> * andrewmw94: Dual tree traverser bug fix.

23:08 < jenkins-mlpack> * Ryan Curtin: Standardization of comments for doxygen.

23:08 < jenkins-mlpack> * Ryan Curtin: Spacing issues and remove apparent debug output.

23:08 < jenkins-mlpack> Starting build #2106 for job mlpack - svn checkin test (previous build: SUCCESS)