verne.freenode.net changed the topic of #mlpack to: http://www.mlpack.org/ -- We don't respond instantly... but we will respond. Give it a few minutes. Or hours. -- Channel logs: http://www.mlpack.org/irc/
< jerone>
I was wondering: when using the DET in a one-class setting, the density estimates of the training (normal/positive class) data often come back the same
< jerone>
i.e. there are not unique values for the density estimates of the different training samples
< jerone>
e.g. using MNIST digit 5 as the training set
< jerone>
is that due to how the piecewise constant estimate is defined?
< jerone>
estimator*
< rcurtin>
the DET will return constant probability for different leaves in the tree
< rcurtin>
so regions of the input space will have constant probability, yes
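As a rough illustration of the piecewise-constant behaviour described above, here is a minimal sketch in plain Python. It does not use mlpack; the 1-D partition, the point counts, and the name `leaf_density` are all made up for the example. The density of a leaf is taken to be the fraction of training points in it divided by its volume, so every point that lands in the same leaf gets the same value.

```python
# Hypothetical 1-D partition: each leaf is (low, high, number of training points).
# These values are invented for illustration and do not come from mlpack.
leaves = [(0.0, 1.0, 5), (1.0, 4.0, 3), (4.0, 5.0, 2)]
total_points = sum(count for _, _, count in leaves)


def leaf_density(x):
    """Return the constant density of the leaf containing x, or 0.0 outside all leaves."""
    for low, high, count in leaves:
        if low <= x < high:
            # Fraction of training points in the leaf divided by the leaf's volume.
            return (count / total_points) / (high - low)
    return 0.0


# The first three query points fall in the same leaf, so they share one density value.
densities = [leaf_density(x) for x in (0.1, 0.5, 0.9, 2.0, 4.5)]
print(densities)
```

With a single leaf, as in a tree that never splits, every query point inside the bounding box would get the same estimate.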
< rcurtin>
however, I'd personally have expected a DET to have more than just one node for the MNIST digit 5
< rcurtin>
I think the program prints how many nodes are in the tree? I forget
< jerone>
I'll give a DET a run now, and let you know.
< rcurtin>
either way, there may be knobs you can tweak... maybe you can set --max_leaf_size to something smaller
< rcurtin>
the fact that it's one class shouldn't change anything though... the DET algorithm itself just models the distribution
< jerone>
max leaf size - controls the maximum number of training points allowed to end at a leaf?
< rcurtin>
yep
< rcurtin>
I think Pari's paper might have considered the multiclass problem and using a DET to get class probability estimates, and the C++ code might do that (the command line program doesn't)
< jerone>
ok, i'll give that a go, thanks!
< rcurtin>
but if he did do that, then all it would have been is just calculating the empirical probability of each class at each node
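The "empirical probability of each class at each node" idea can be sketched as a simple count over the labels of the training points that fall in a node. This is only an illustration in plain Python, not mlpack's C++ code; `class_probabilities` and the label list are hypothetical.

```python
from collections import Counter


def class_probabilities(labels_in_node):
    """Return the fraction of points in a node that belong to each class."""
    counts = Counter(labels_in_node)
    total = len(labels_in_node)
    return {label: count / total for label, count in counts.items()}


# Hypothetical node holding four training points: three labelled 5, one labelled 3.
probs = class_probabilities([5, 5, 3, 5])
print(probs)
```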
< rcurtin>
yeah, let me know if you have any issues
< rcurtin>
I'm familiar with the algorithm at a high level, so maybe I can be helpful, but if not, I think you can always email Pari and he'll probably get back to you sooner or later
< rcurtin>
(I don't know how busy he is these days)
< jerone>
thank you very much. I'm sure you can help me!
< rcurtin>
sure, no problem, I am happy to help :)
< jerone>
Hi rcurtin
< jerone>
Running a DET using the default parameters (min leaf size 5 and max leaf size 10)
< jerone>
[INFO ] 1 leaf nodes in the tree using full dataset; minimum alpha
< jerone>
so only 1 leaf node is returned after pruning
< jerone>
(minus the so)
< rcurtin>
is there only one leaf node before pruning, too?
< jerone>
[INFO ] Performing leave-one-out cross validation.
[INFO ] 1 leaf nodes in the tree using full dataset; minimum alpha: 1.79769e+308.
[INFO ] 1 trees in the sequence; maximum alpha: 0.
[INFO ] Optimal alpha: -1.
[INFO ] 1 leaf nodes in the optimally pruned tree; optimal alpha: -1.79769e+308.
< jerone>
yup, before and after pruning there is only 1 leaf node
< rcurtin>
how many training points do you have? you said you were just using mnist digit 5, right?
< rcurtin>
so should be 10k I think
< jerone>
i have sampled with replacement 896 digit 5s
< jerone>
so 896 points and dimension 784
< rcurtin>
okay... can you check to make sure all of the points aren't the same?
< jerone>
Yup, will check
< jerone>
they're definitely not all the same
< jerone>
even when i use a separate set of 3795 unique digit 5s, i still get 1 leaf in the tree prior to pruning
< rcurtin>
okay... can you try one more thing?
< jerone>
yup
< rcurtin>
your dataset has 784 dimensions
< jerone>
yes
< rcurtin>
can you pick, say, only 10 of these, and then try training a DET?
< jerone>
ok, will do
< rcurtin>
it might be hard to pick good ones, because picking the edges of the images might be bad... they'll be all 0
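One way to avoid picking all-zero edge pixels is to keep only the dimensions whose values actually vary across the dataset before choosing a subset. The sketch below is plain Python on made-up toy data; `varying_columns` is a hypothetical helper, and rows are assumed to be observations.

```python
# Hypothetical toy data: each row is one observation, each column one pixel.
# Columns 0 and 2 are constant (like border pixels that are always 0).
data = [
    [0, 12, 0, 200],
    [0, 40, 0, 180],
    [0, 7, 0, 220],
]


def varying_columns(rows):
    """Return the indices of columns whose values are not all identical."""
    columns = zip(*rows)
    return [i for i, col in enumerate(columns) if min(col) != max(col)]


keep = varying_columns(data)
# Keep only the varying columns in every observation.
reduced = [[row[i] for i in keep] for row in data]
print(keep, reduced)
```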
< rcurtin>
the reason I suggest this is that the DET splitting criterion has to do with finding two children that increase the log-likelihood of the data given the model
< rcurtin>
but this also depends on the volume of each node
< rcurtin>
in 784 dimensions, volumes can explode to infinity or shrink to 0 very easily
< rcurtin>
so the code tries to work in logspace to prevent this, but it's possible there is a bug somewhere that causes an overflow or underflow
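To see why high-dimensional volumes are usually handled in log-space, consider the sketch below. It is plain Python with made-up numbers (784 sides of length 0.001) and says nothing about where a bug in mlpack might be; it only shows that the direct product underflows in double precision while the sum of logs stays finite.

```python
import math

# Hypothetical bounding box: 784 dimensions, each with side length 0.001.
side_lengths = [1e-3] * 784

# Naive volume: the product of the side lengths underflows to 0.0 in double precision.
naive_volume = math.prod(side_lengths)

# Log-space volume: the sum of the log side lengths remains an ordinary finite number.
log_volume = sum(math.log(s) for s in side_lengths)

print(naive_volume, log_volume)
```

Note that a side of length exactly 0 (for example, a pixel that is constant across the dataset) has no finite logarithm, so such dimensions need separate handling in any log-space computation.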
< jerone>
again the same
< jerone>
Loading 'dim105vsTrain1.txt' as CSV data. Size is 10 x 896.
[INFO ] Performing 10-fold cross validation.
[INFO ] 1 leaf nodes in the tree using full dataset; minimum alpha: 1.79769e+308.
[INFO ] 1 trees in the sequence; maximum alpha: 0.
[INFO ] Optimal alpha: -1.
[INFO ] 1 leaf nodes in the optimally pruned tree; optimal alpha: -1.79769e+308.
< jerone>
still 1 leaf node prior to pruning
< rcurtin>
and one row corresponds to one point in dim105vsTrain1.txt?
< jerone>
yeah 1 row is one observation
< rcurtin>
okay
< rcurtin>
I'm not 100% sure what is going on here then
< rcurtin>
can I ask you to file a bug on Github and post the dataset you used?
< jerone>
ok, will do
< rcurtin>
I'll look into it and try to debug... I may not be able to get to it today, but I should be able to find some time for it in the next week I think
< rcurtin>
great, thanks... I appreciate your time in helping debug this :)
< jerone>
posted on github
< jerone>
along with the text file of the data
< jerone>
I appreciate your help whenever you get round to it
< jerone>
Thanks
< rcurtin>
sure, sorry that you are having issues
< jerone>
Also, the DET seems to work fine for my other datasets
< jerone>
it's just on the MNIST set that it keeps returning only 1 leaf node
< jerone>
anyway, gotta go. thanks again, and have a nice evening!
jerone has quit [Ping timeout: 252 seconds]
travis-ci has joined #mlpack
< travis-ci>
mlpack/mlpack#502 (master - 0c62700 : Ryan Curtin): The build was broken.