verne.freenode.net changed the topic of #mlpack to: http://www.mlpack.org/ -- We don't respond instantly... but we will respond. Give it a few minutes. Or hours. -- Channel logs: http://www.mlpack.org/irc/
zoq_ has joined #mlpack
inilahs_ has quit [Ping timeout: 260 seconds]
slardar has quit [Ping timeout: 260 seconds]
zoq has quit [Ping timeout: 260 seconds]
inilahs has joined #mlpack
slardar has joined #mlpack
zoq_ is now known as zoq
< rcurtin> https://github.com/josephmisiti/awesome-machine-learning -- check out C++, "General-Purpose Machine Learning"... mlpack at the top of the list :)
< zoq> that's usually the best position :)
jerone has joined #mlpack
< jerone> Hello
< rcurtin> jerone: hello there
< jerone> Hi, I have a question
< rcurtin> sure, go ahead, I can try to answer
< jerone> I was wondering: when using the DET in a one-class setting, the density estimates of the training (normal/positive class) data often come back the same
< jerone> i.e. there are not unique values for the density estimates of the different training samples
< jerone> e.g. using MNIST digit 5 as the training set
< jerone> is that due to how the piecewise constant estimator is defined?
< rcurtin> the DET will return constant probability for different leaves in the tree
< rcurtin> so regions of the input space will have constant probability, yes
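That piecewise-constant behavior is easy to see in a toy sketch (a made-up 1-D "density tree", not mlpack code; the leaf boundaries here are invented for illustration): every query point falling into the same leaf receives the identical estimate, count_in_leaf / (N * leaf_volume).

```python
import bisect

def make_leaf_densities(points, boundaries):
    """Constant density per leaf interval: count_in_leaf / (N * leaf_width)."""
    n = len(points)
    densities = []
    for lo, hi in zip(boundaries[:-1], boundaries[1:]):
        count = sum(lo <= p < hi for p in points)
        densities.append(count / (n * (hi - lo)))
    return densities

def estimate(x, boundaries, densities):
    # find the leaf interval containing x; return that leaf's constant density
    i = bisect.bisect_right(boundaries, x) - 1
    return densities[i]

points = [0.1, 0.2, 0.25, 0.7, 0.9]
boundaries = [0.0, 0.5, 1.0]   # two leaves: [0, 0.5) and [0.5, 1.0)
densities = make_leaf_densities(points, boundaries)

# 0.1 and 0.25 land in the same leaf, so their estimates are identical --
# which is exactly why many training samples come back with equal values
print(estimate(0.1, boundaries, densities) == estimate(0.25, boundaries, densities))
```

With only one leaf in the whole tree (as happens later in this conversation), every point gets the same estimate.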
< rcurtin> however, I'd personally have expected a DET to have more than just one node for the MNIST digit 5
< rcurtin> I think the program prints how many nodes are in the tree? I forget
< jerone> I'll give a DET a run now, and let you know.
< rcurtin> either way, there may be knobs you can tweak... maybe you can set --max_leaf_size to something smaller
< rcurtin> the fact that it's one class shouldn't change anything though... the DET algorithm itself just models the distribution
< jerone> max leaf size - controls the maximum number of training points allowed to end up in a leaf?
< rcurtin> yep
< rcurtin> I think Pari's paper might have considered the multiclass problem and using a DET to get class probability estimates, and the C++ code might do that (the command line program doesn't)
< jerone> ok, i'll give that a go, thanks!
< rcurtin> but if he did do that, then all it would have been is just calculating the empirical probability of each class at each node
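If class probabilities were attached to nodes as described, it would amount to a simple per-node frequency count; a hypothetical sketch (not Pari's or mlpack's actual code):

```python
from collections import Counter

def node_class_probabilities(labels_in_node):
    """Empirical class probabilities at a tree node: class count / total count."""
    counts = Counter(labels_in_node)
    total = len(labels_in_node)
    return {label: n / total for label, n in counts.items()}

# e.g. a node containing three digit-5 points and one digit-3 point
print(node_class_probabilities([5, 5, 5, 3]))  # {5: 0.75, 3: 0.25}
```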
< rcurtin> yeah, let me know if you have any issues
< rcurtin> I'm familiar with the algorithm at a high level, so maybe I can be helpful, but if not, I think you can always email Pari and he'll probably get back to you sooner or later
< rcurtin> (I don't know how busy he is these days)
< jerone> thank you very much. I'm sure you can help me!
< rcurtin> sure, no problem, I am happy to help :)
< jerone> Hi rcurtin
< jerone> Running a DET using the default parameters (min leaf size 5, max leaf size 10)
< jerone> [INFO ] 1 leaf nodes in the tree using full dataset; minimum alpha
< jerone> only 1 leaf node is returned after pruning
< rcurtin> is there only one leaf node before pruning, too?
< jerone> [INFO ] Performing leave-one-out cross validation.
< jerone> [INFO ] 1 leaf nodes in the tree using full dataset; minimum alpha: 1.79769e+308.
< jerone> [INFO ] 1 trees in the sequence; maximum alpha: 0.
< jerone> [INFO ] Optimal alpha: -1.
< jerone> [INFO ] 1 leaf nodes in the optimally pruned tree; optimal alpha: -1.79769e+308.
< jerone> yup, before and after pruning there is only 1 leaf node
< rcurtin> how many training points do you have? you said you were just using mnist digit 5, right?
< rcurtin> so should be 10k I think
< jerone> I have sampled 896 digit 5s with replacement
< jerone> so 896 points and dimension 784
< rcurtin> okay... can you check to make sure all of the points aren't the same?
< jerone> Yup, will check
< jerone> they're definitely not all the same
< jerone> even when I use a separate set of 3795 unique digit 5s, I still get 1 leaf in the tree prior to pruning
< rcurtin> okay... can you try one more thing?
< jerone> yup
< rcurtin> your dataset has 784 dimensions
< jerone> yes
< rcurtin> can you pick, say, only 10 of these, and then try training a DET?
< jerone> ok, will do
< rcurtin> it might be hard to pick good ones, because picking the edges of the images might be bad... they'll be all 0
< rcurtin> the reason I suggest this is that the DET splitting criterion has to do with finding two children that increase the log-likelihood of the data given the model
< rcurtin> but this also depends on the volume of each node
< rcurtin> in 784 dimensions, volumes can explode to infinity or shrink to 0 very easily
< rcurtin> so the code tries to work in logspace to prevent this, but it's possible that somewhere there is a bug causing an overflow or underflow
< jerone> again the same
< jerone> Loading 'dim105vsTrain1.txt' as CSV data. Size is 10 x 896.
< jerone> [INFO ] Performing 10-fold cross validation.
< jerone> [INFO ] 1 leaf nodes in the tree using full dataset; minimum alpha: 1.79769e+308.
< jerone> [INFO ] 1 trees in the sequence; maximum alpha: 0.
< jerone> [INFO ] Optimal alpha: -1.
< jerone> [INFO ] 1 leaf nodes in the optimally pruned tree; optimal alpha: -1.79769e+308.
< jerone> still 1 leaf node prior to pruning
< rcurtin> and one row corresponds to one point in dim105vsTrain1.txt?
< jerone> yeah 1 row is one observation
< rcurtin> okay
< rcurtin> I'm not 100% sure what is going on here then
< rcurtin> can I ask you to file a bug on Github and post the dataset you used?
< jerone> ok, will do
< rcurtin> I'll look into it and try to debug... I may not be able to get to it today, but I should be able to find some time for it in the next week I think
< rcurtin> great, thanks... I appreciate your time in helping debug this :)
< jerone> posted on github
< jerone> along with the text file of the data
< jerone> I appreciate your help whenever you get round to it
< jerone> Thanks
< rcurtin> sure, sorry that you are having issues
< jerone> Also, the DET seems to work fine for my other datasets
< jerone> it's just the MNIST set for which it keeps returning only 1 leaf node
< jerone> anyway, gotta go. thanks again, and have a nice evening!
jerone has quit [Ping timeout: 252 seconds]
travis-ci has joined #mlpack
< travis-ci> mlpack/mlpack#502 (master - 0c62700 : Ryan Curtin): The build was broken.
travis-ci has left #mlpack []
travis-ci has joined #mlpack
< travis-ci> mlpack/mlpack#503 (master - fa88d69 : Ryan Curtin): The build was broken.
travis-ci has left #mlpack []