#mlpack on 2016-02-04 — irc logs at libera.irclog.whitequark.org

2015-01-15 23:05 verne.freenode.net changed the topic of #mlpack to: http://www.mlpack.org/ -- We don't respond instantly... but we will respond. Give it a few minutes. Or hours. -- Channel logs: http://www.mlpack.org/irc/

04:16 zoq_ has joined #mlpack

04:19 inilahs_ has quit [Ping timeout: 260 seconds]

04:19 slardar has quit [Ping timeout: 260 seconds]

04:19 zoq has quit [Ping timeout: 260 seconds]

04:19 inilahs has joined #mlpack

04:20 slardar has joined #mlpack

10:42 zoq_ is now known as zoq

14:34 < rcurtin> https://github.com/josephmisiti/awesome-machine-learning -- check out C++, "General-Purpose Machine Learning"... mlpack at the top of the list :)

18:40 < zoq> that's usually the best position :)

19:10 jerone has joined #mlpack

19:10 < jerone> Hello

19:12 < rcurtin> jerone: hello there

19:12 < jerone> Hi, I have a question

19:13 < rcurtin> sure, go ahead, I can try to answer

19:13 < jerone> I was wondering: when using the DET in a one-class setting, the density estimates of the training (normal/positive class) data often come back the same

19:13 < jerone> i.e. there are not unique values for the density estimates of the different training samples

19:13 < jerone> e.g. using MNIST digit 5 as the training set

19:14 < jerone> is that due to how the piecewise constant estimate is defined?

19:14 < jerone> estimator*

19:15 < rcurtin> the DET will return constant probability for different leaves in the tree

19:15 < rcurtin> so regions of the input space will have constant probability, yes
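A minimal sketch of why every point that lands in the same leaf receives the same estimate. It assumes a hypothetical 1-D tree whose leaf boundaries are given by hand; `leaf_densities` and `boundaries` are illustrative names, not part of mlpack. Each leaf's density is the fraction of points in it divided by its width.

```python
from bisect import bisect_right

def leaf_densities(points, boundaries):
    """Piecewise-constant density: each leaf gets (count / N) / width.

    `boundaries` is a hypothetical sorted list of leaf edges, e.g.
    [0.0, 1.0, 4.0] defines the two leaves [0, 1) and [1, 4].
    """
    n_leaves = len(boundaries) - 1
    counts = [0] * n_leaves
    leaf_of = []
    for x in points:
        # Clamp so a point on the right edge lands in the last leaf.
        i = min(bisect_right(boundaries, x) - 1, n_leaves - 1)
        counts[i] += 1
        leaf_of.append(i)
    n = len(points)
    density = [
        counts[i] / n / (boundaries[i + 1] - boundaries[i])
        for i in range(n_leaves)
    ]
    # Every point inherits the constant density of its leaf.
    return [density[i] for i in leaf_of]

points = [0.1, 0.2, 0.9, 1.5, 3.0]
two_leaves = leaf_densities(points, [0.0, 1.0, 4.0])
one_leaf = leaf_densities(points, [0.0, 4.0])
```

With a single leaf, as in the output reported later in this log, all training points share one value, so `one_leaf` contains a single distinct number.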

19:15 < rcurtin> however, I'd personally have expected a DET to have more than just one node for the MNIST digit 5

19:15 < rcurtin> I think the program prints how many nodes are in the tree? I forget

19:16 < jerone> I'll give a DET a run now, and let you know.

19:16 < rcurtin> either way, there may be knobs you can tweak... maybe you can set --max_leaf_size to something smaller

19:16 < rcurtin> the fact that it's one class shouldn't change anything though... the DET algorithm itself just models the distribution

19:17 < jerone> max leaf size - controls the maximum number of training points allowed to end at a leaf?

19:17 < rcurtin> yep

19:17 < rcurtin> I think Pari's paper might have considered the multiclass problem and using a DET to get class probability estimates, and the C++ code might do that (the command line program doesn't)

19:17 < jerone> ok, i'll give that a go, thanks!

19:18 < rcurtin> but if he did do that, then all it would have been is just calculating the empirical probability of each class at each node
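A minimal sketch of the empirical class probability at one node, assuming the labels of the training points that fall in that node are available as a plain list. The function name is hypothetical, and this is not claimed to be what the mlpack C++ code does.

```python
from collections import Counter

def class_probabilities(labels_in_node):
    """Fraction of the node's training points that carry each label."""
    counts = Counter(labels_in_node)
    total = len(labels_in_node)
    return {label: count / total for label, count in counts.items()}

# Hypothetical node holding three points labelled 5 and one labelled 3.
probs = class_probabilities([5, 5, 3, 5])
```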

19:18 < rcurtin> yeah, let me know if you have any issues

19:18 < rcurtin> I'm familiar with the algorithm at a high level, so maybe I can be helpful, but if not, I think you can always email Pari and he'll probably get back to you sooner or later

19:18 < rcurtin> (I don't know how busy he is these days)

19:19 < jerone> thank you very much. I'm sure you can help me!

19:19 < rcurtin> sure, no problem, I am happy to help :)

19:25 < jerone> Hi rcurtin

19:25 < jerone> Running a DET using the default parameters (5 min size of leaf and 10 max size)

19:25 < jerone> [INFO ] 1 leaf nodes in the tree using full dataset; minimum alpha

19:25 < jerone> so only 1 leaf node is returned after pruning

19:25 < jerone> (minus the so)

19:28 < rcurtin> is there only one leaf node before pruning, too?

19:29 < jerone> [INFO ] Performing leave-one-out cross validation. [INFO ] 1 leaf nodes in the tree using full dataset; minimum alpha: 1.79769e+308. [INFO ] 1 trees in the sequence; maximum alpha: 0. [INFO ] Optimal alpha: -1. [INFO ] 1 leaf nodes in the optimally pruned tree; optimal alpha: -1.79769e+308.

19:29 < jerone> yup, before and after pruning there is only 1 leaf node

19:30 < rcurtin> how many training points do you have? you said you were just using mnist digit 5, right?

19:30 < rcurtin> so should be 10k I think

19:30 < jerone> i have sampled with replacement 896 digit 5s

19:31 < jerone> so 896 points and dimension 784

19:31 < rcurtin> okay... can you check to make sure all of the points aren't the same?

19:32 < jerone> Yup, will check

19:34 < jerone> they're definitely not all the same

19:36 < jerone> even when i use a separate set of 3795 unique digit 5s, i still get 1 leaf in the tree prior to pruning

19:37 < rcurtin> okay... can you try one more thing?

19:37 < jerone> yup

19:37 < rcurtin> your dataset has 784 dimensions

19:37 < jerone> yes

19:37 < rcurtin> can you pick, say, only 10 of these, and then try training a DET?

19:37 < jerone> ok, will do

19:37 < rcurtin> it might be hard to pick good ones, because picking the edges of the images might be bad... they'll be all 0
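One way to avoid picking all-zero edge pixels is to keep only dimensions whose values actually vary. A minimal sketch, assuming the data is held as a list of rows with one observation per row; the helper name and the toy values are made up for illustration.

```python
def non_constant_columns(rows):
    """Return indices of columns whose values are not all identical."""
    n_cols = len(rows[0])
    return [j for j in range(n_cols) if len({row[j] for row in rows}) > 1]

# Toy data: columns 0 and 2 mimic border pixels that are always 0.
rows = [
    [0, 12, 0, 200],
    [0, 40, 0, 180],
    [0, 12, 0, 255],
]
keep = non_constant_columns(rows)
reduced = [[row[j] for j in keep] for row in rows]
```

Any 10 of the surviving columns could then be used for the reduced experiment.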

19:38 < rcurtin> the reason I suggest this is that the DET splitting criterion has to do with finding two children that increase the log-likelihood of the data given the model

19:38 < rcurtin> but this also depends on the volume of each node

19:38 < rcurtin> in 784 dimensions, volumes can explode to infinity or shrink to 0 very easily

19:38 < rcurtin> so the code tries to work in logspace to prevent this, but it's possible that somewhere maybe there is a bug that means that there's an overflow or underflow
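A minimal sketch of the overflow and underflow problem and of the log-space workaround. The side lengths are hypothetical (a full 0..255 pixel range and a very narrow range), and this illustrates the general technique, not mlpack's implementation.

```python
import math

def volume(side_lengths):
    """Naive product of side lengths of an axis-aligned box."""
    result = 1.0
    for s in side_lengths:
        result *= s
    return result

def log_volume(side_lengths):
    """Sum of logs; stays finite as long as every side length is positive."""
    return sum(math.log(s) for s in side_lengths)

dims = 784
wide = [255.0] * dims    # hypothetical: every dimension spans 0..255
narrow = [1e-3] * dims   # hypothetical: every dimension spans a tiny range

wide_volume, wide_log = volume(wide), log_volume(wide)
narrow_volume, narrow_log = volume(narrow), log_volume(narrow)
```

In double precision the naive product for `wide` overflows to infinity and the one for `narrow` underflows to zero, while both log-volumes remain ordinary finite numbers. A dimension with zero width would still need separate handling, since its logarithm is undefined.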

19:41 < jerone> again the same

19:41 < jerone> Loading 'dim105vsTrain1.txt' as CSV data. Size is 10 x 896. [INFO ] Performing 10-fold cross validation. [INFO ] 1 leaf nodes in the tree using full dataset; minimum alpha: 1.79769e+308. [INFO ] 1 trees in the sequence; maximum alpha: 0. [INFO ] Optimal alpha: -1. [INFO ] 1 leaf nodes in the optimally pruned tree; optimal alpha: -1.79769e+308.

19:41 < jerone> still 1 leaf node prior to pruning

19:42 < rcurtin> and one row corresponds to one point in dim105vsTrain1.txt?

19:43 < jerone> yeah 1 row is one observation

19:43 < rcurtin> okay

19:43 < rcurtin> I'm not 100% sure what is going on here then

19:43 < rcurtin> can I ask you to file a bug on Github and post the dataset you used?

19:43 < jerone> ok, will do

19:43 < rcurtin> I'll look into it and try to debug... I may not be able to get to it today, but I should be able to find some time for it in the next week I think

19:44 < rcurtin> great, thanks... I appreciate your time in helping debug this :)

19:49 < jerone> posted on github

19:49 < jerone> along with the text file of the data

19:49 < jerone> I appreciate your help whenever you get round to it

19:49 < jerone> Thanks

19:51 < rcurtin> sure, sorry that you are having issues

19:56 < jerone> Also, the DET seems to work fine for my other datasets

19:57 < jerone> it's just the MNIST set for which it keeps returning only 1 leaf node

20:00 < jerone> anyway, gotta go. thanks again, and have a nice evening!

20:04 jerone has quit [Ping timeout: 252 seconds]

22:41 travis-ci has joined #mlpack

22:41 < travis-ci> mlpack/mlpack#502 (master - 0c62700 : Ryan Curtin): The build was broken.

22:41 < travis-ci> Change view : https://github.com/mlpack/mlpack/compare/b344394650f1...0c62700893bb

22:41 < travis-ci> Build details : https://travis-ci.org/mlpack/mlpack/builds/107086975

22:41 travis-ci has left #mlpack []

22:45 travis-ci has joined #mlpack

22:45 < travis-ci> mlpack/mlpack#503 (master - fa88d69 : Ryan Curtin): The build was broken.

22:45 < travis-ci> Change view : https://github.com/mlpack/mlpack/compare/0c62700893bb...fa88d69db917

22:45 < travis-ci> Build details : https://travis-ci.org/mlpack/mlpack/builds/107089951

22:45 travis-ci has left #mlpack []