naywhayare changed the topic of #mlpack to: http://www.mlpack.org/ -- We don't respond instantly... but we will respond. Give it a few minutes. Or hours. -- Channel logs: http://www.mlpack.org/irc/
sumedhghaisas has joined #mlpack
sumedhghaisas has quit [Ping timeout: 255 seconds]
transponder has joined #mlpack
< transponder>
hello
< transponder>
How much time is the GMM estimation supposed to take?
< transponder>
I have about 10^6 data vectors of 13 dimensions
< transponder>
and am trying to train a 256-gaussian mixture model
< transponder>
running on a 2 GB RAM laptop
< naywhayare>
so, 10^6 points... 256 gaussians
< naywhayare>
did you compile without debugging symbols?
< transponder>
yes
< naywhayare>
okay, well that's good
< transponder>
256 gaussians, in 13 dimensions
< naywhayare>
it's difficult to say how long the estimation is supposed to take because it is highly dependent on the data
< naywhayare>
the algorithm that we have implemented should scale as... O(nk) per iteration, I think
< naywhayare>
but I don't know how to easily place a bound on the number of iterations
< transponder>
Ok... it's been running now for 17 hrs. What is k?
< naywhayare>
(n is the number of points, k is the number of components)
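(With n = 10^6 points and k = 256 components, O(nk) works out to roughly 2.6 x 10^8 component-density evaluations per EM iteration, each one a 13-dimensional Gaussian computation.)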
< naywhayare>
is it giving any output at all?
< transponder>
n will be 13*10^6 right
< transponder>
how do i check if its giving any output
< naywhayare>
i.e. "EMFit::Estimate(): iteration ..."
< naywhayare>
is it printing anything on the screen?
< naywhayare>
also, did you write a C++ program by hand, or are you using the 'gmm' program?
< transponder>
i am simply doing this
< transponder>
GMM<> g(256, 13);
< transponder>
arma::mat data;
< transponder>
then loading data
< transponder>
sorry
< transponder>
g.Estimate(data);
< naywhayare>
okay... that won't give output unless you turn on output from the 'mlpack::Log::Info' object by doing this... 'mlpack::Log::Info.ignoreInput = false;'
< transponder>
g.Save("output_file");
< transponder>
k i will try that
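(Putting the fragments above together, a minimal sketch of the whole program, assuming mlpack 1.x's GMM API as discussed in this conversation; the filenames are placeholders:)

    #include <mlpack/core.hpp>
    #include <mlpack/methods/gmm/gmm.hpp>

    using namespace mlpack;

    int main()
    {
      // Enable informational output so that progress messages such as
      // "EMFit::Estimate(): iteration ..." are printed.
      Log::Info.ignoreInput = false;

      // Load the dataset; by default mlpack transposes it so that each
      // point is one column.
      arma::mat data;
      data::Load("data.txt", data, true);

      // 256 gaussians in 13 dimensions.
      gmm::GMM<> g(256, 13);
      g.Estimate(data);

      g.Save("output_file");
    }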
< naywhayare>
you may also want to try with a subset of your data to start with
< transponder>
yes, i tested with a subset, works fine
< transponder>
now i've given it the complete data for training
< transponder>
btw in the [INFO] output there needs to be a timestamp
< transponder>
so that we know how much time is spent in each step
< transponder>
it's still stuck at the first step,
< transponder>
reading data
< transponder>
8 minutes
< transponder>
using ~580 mb memory
< transponder>
a correction btw, its 10^7 data vectors
< transponder>
still [INFO ] Loading..
< transponder>
after 35 mins, still says Loading 'file' size is x y
< transponder>
mem usage is almost constant over time
< naywhayare>
how big is the data file?
< transponder>
682 MB, and the exact matrix size is 13 x 5438715
< naywhayare>
in what format?
< transponder>
plain text
< naywhayare>
when the data is loaded it must be transposed, and sometimes this can take a while
< naywhayare>
5.3M points is a lot of points, but I would still expect that to load within 35 minutes
< transponder>
can i transpose it myself
< transponder>
and give it to load
< transponder>
can do it quickly in matlab i guess
< naywhayare>
if you really wanted to, you could do that, and data::Load() takes an additional argument to not transpose
< naywhayare>
but, there is nothing wrong with the transposition code
< naywhayare>
I would wager a bet that your laptop is swapping
< transponder>
k
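(If the file were pre-transposed, the extra argument mentioned above would be used like this; this assumes the mlpack 1.x data::Load() signature, where the third argument is 'fatal' and the fourth is 'transpose':)

    arma::mat data;
    // Load the file as-is (points already stored one per column),
    // skipping the usual transposition step.
    mlpack::data::Load("data_pretransposed.txt", data, true, false);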
< naywhayare>
a dataset of that size will use about 0.5 GB of RAM
< transponder>
total used free shared buffers cached
< transponder>
Mem: 3805 3349 456 0 239 691
< transponder>
-/+ buffers/cache: 2417 1388
< transponder>
Swap: 7629 1467 6162
< transponder>
output of free -m
< naywhayare>
that doesn't necessarily mean that it isn't swapping. the kernel may be swapping things out in order to provide a 0.5 GB contiguous section of memory
< naywhayare>
also, that's 4GB not 2GB
< transponder>
yes
< naywhayare>
that makes me less inclined to believe the system is swapping, then
< naywhayare>
unless you are running very many other processes
< transponder>
there is firefox and an idle instance of matlab
< transponder>
nothing else
< naywhayare>
do you have another system that you can test on?
< naywhayare>
or, are you able to get me a copy of the dataset so I can see if I can reproduce it?
< naywhayare>
it only takes 33 seconds on my system, and it does not hang
< naywhayare>
the output is this:
< naywhayare>
[INFO ] Loading '/home/ryan/Downloads/sampledata.txt' as raw ASCII formatted data. Size is 13 x 5438715.
< naywhayare>
(the 'Size is 13 x 5438715' bit doesn't show up until it is loaded, and the call to data::Load() completes)
< transponder>
ok, i thought that the [INFO] comes while its still loading
< transponder>
does it show anything more in the log?
< transponder>
it doesnt in mine
< transponder>
so i guess its doing EM
< transponder>
but no iteration complete
< naywhayare>
well, like I said, that was the end of the program, so no more output from that one
< naywhayare>
I wanted to confirm that there was no issue loading the data.
< transponder>
ok
< transponder>
could you try GMM estimate? i copied my data to another machine but will have to reinstall mlpack there
< transponder>
or do you think this is OK
< transponder>
as in an iteration may take some hours
< naywhayare>
yes, I am running that now
< naywhayare>
I would not be surprised if each iteration takes a long time
< naywhayare>
5.3M points is a lot of points, and 256 dimensions is very many
< naywhayare>
sorry... 256 components
< transponder>
yes, but i dont know whats the benchmark
< naywhayare>
you mean the benchmark on the mlpack website?
< transponder>
as in how much time does it take generally for X amount of data
< naywhayare>
that depends quite a lot on the data
< transponder>
if the number of iterations could be fixed in advance, could one estimate the time before running gmm train?
< naywhayare>
probably, but the easiest way to do that would be to run only a single iteration
< naywhayare>
however, it's also not that simple, because before an iteration of the GMM estimation algorithm can be run, the initial parameters are estimated with k-means
< naywhayare>
so unfortunately I don't think I have a good answer for how long it will take
< transponder>
ok .. is there a command to do just one iteration?
< naywhayare>
I believe the maximum number of iterations can be set through the GMM class
< naywhayare>
you can consult the API documentation
< transponder>
k
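(A sketch of capping the iteration count through the fitter, assuming the mlpack 1.x EMFit constructor takes maxIterations and tolerance as its first two arguments; check the API documentation for the exact signature:)

    #include <mlpack/methods/gmm/gmm.hpp>
    #include <mlpack/methods/gmm/em_fit.hpp>

    using namespace mlpack::gmm;

    // A fitter limited to a single EM iteration.
    EMFit<> fitter(1 /* maxIterations */, 1e-10 /* tolerance */);
    GMM<EMFit<> > g(256, 13, fitter);
    g.Estimate(data);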
< transponder>
btw thanks for your help, i'll come to chat later if i get some result in the log or otherwise ..
< transponder>
will it help if the input data is sorted in any way, or are those kinds of things taken care of / wouldn't matter
< transponder>
or maybe a small subset could be used to make an initial GMM
< transponder>
and then somehow use that GMM to train another GMM on complete data
< transponder>
?
< naywhayare>
the ordering of the input data shouldn't matter
< naywhayare>
it would be possible to do what you suggested -- train a GMM on a small subset of the data
< naywhayare>
but you will have to get your hands a little dirty with the code
< naywhayare>
note that the GMM class uses the EMFit class for the actual fitting
< naywhayare>
and then the EMFit class uses an InitialClusteringType class to do the initial clustering
< naywhayare>
you could build a class to use as the InitialClusteringType that trained a GMM on a small subset of the data, then used that as the initial GMM estimate
< transponder>
k, would that initial estimate go as input to EMFit?
< naywhayare>
no, you would have to create a new class to use as the template parameter to EMFit
< naywhayare>
you would need to spend some time with the GMM code to see how it works
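(A hypothetical sketch of such a class, assuming EMFit expects a KMeans-style Cluster(data, clusters, assignments) method and that GMM provides Classify(); both names should be verified against the actual GMM/EMFit code:)

    #include <algorithm>
    #include <mlpack/methods/gmm/gmm.hpp>

    // Estimate a GMM on a subset of the data, then label every point
    // with its most likely component to seed the full estimation.
    class SubsetGMMClustering
    {
     public:
      void Cluster(const arma::mat& data,
                   const size_t clusters,
                   arma::Col<size_t>& assignments)
      {
        // Use the first 5% of the points (random sampling would be better).
        const size_t subsetSize = std::max<size_t>(data.n_cols / 20, clusters);
        arma::mat subset = data.cols(0, subsetSize - 1);

        mlpack::gmm::GMM<> g(clusters, data.n_rows);
        g.Estimate(subset);

        // Use the subset-trained model to produce initial assignments.
        g.Classify(data, assignments);
      }
    };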
< transponder>
k..
< transponder>
will try that .. if this remains too slow
< transponder>
btw are there any alternative packages that you know of?
< transponder>
there is one, GMMBayes in matlab, but it's slow as expected
< transponder>
maybe something that could use multiple CPUs
< transponder>
there is the dlib ml package but it doesn't have EM
< transponder>
[i'll join later, going for food]
< naywhayare>
I don't know of any other packages that you could use for this
govg has joined #mlpack
govg has quit [Changing host]
govg has joined #mlpack
< transponder>
k
< transponder>
it's been 2:40 hrs since i started running, no new output in the log file
< jenkins-mlpack>
Ryan Curtin: Don't do base cases before recursing. This is slightly cleaner code and will
< jenkins-mlpack>
properly handle the case where the tree is only one node deep.
< jenkins-mlpack>
Starting build #2055 for job mlpack - svn checkin test (previous build: SUCCESS)
< udit_s>
naywhayare: Hello. Are you free now?
< naywhayare>
udit_s: yes, but I will be leaving for lunch at some point... not sure when
< udit_s>
hmm... okay - did you get a chance to go through the mail?
< udit_s>
also, great work on the release! :)
< udit_s>
I've tested on a non-linearly separable dataset - and the tests pass.
< udit_s>
The only problem I have is one I've talked about in the mail - the tolerance issue.
< naywhayare>
okay
< naywhayare>
let me look at the tolerance issue now
< naywhayare>
why not std::abs(rt - crt), on line 130?
< naywhayare>
also, why not just use a 'break' statement instead of 'i = iterations'?
< naywhayare>
also, a minor note for the tests... if you use BOOST_REQUIRE_CLOSE() or BOOST_REQUIRE_LT() as opposed to BOOST_REQUIRE(), the output will be better
< naywhayare>
for instance, you have 'BOOST_REQUIRE(hammingLoss <= a.ztAccumulator);'
< naywhayare>
if that statement fails, boost.test will output something like 'failure: BOOST_REQUIRE(hammingLoss <= a.ztAccumulator)' and then it will print the value of 'hammingLoss <= a.ztAccumulator', which will be false
< udit_s>
I've read somewhere that using breaks is, like, a bad habit and should be avoided. Instead I just used i = iterations.
< udit_s>
Okay. Got that.
< naywhayare>
but if you use BOOST_REQUIRE_LT(), then it'll print something like 'failure: BOOST_REQUIRE_LT(hammingLoss, a.ztAccumulator)' and then it will print the value of hammingLoss and also the value of a.ztAccumulator
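(Concretely, for the test above; since the original condition uses <=, the matching macro is BOOST_REQUIRE_LE, while BOOST_REQUIRE_LT is the strict < version:)

    // On failure, reports only that the predicate was false:
    BOOST_REQUIRE(hammingLoss <= a.ztAccumulator);

    // On failure, reports the values of both operands:
    BOOST_REQUIRE_LE(hammingLoss, a.ztAccumulator);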
< naywhayare>
it can make it a lot easier to debug things :)
< udit_s>
:D
< naywhayare>
do you have a reference for why break statements are bad? it's pretty commonly known that goto's are a bad idea for a handful of reasons, but breaks I've never heard...
< udit_s>
Read it somewhere in some book when in school... I think I've formed it since then that I should avoid them, except in 'switch' statements. :)
< naywhayare>
I don't see too many compelling arguments against break there
< udit_s>
and the std::abs issue, where was I going wrong? In considering that double values can be successfully compared?
< naywhayare>
actually, hang on, I thought about that wrong
< naywhayare>
I don't think a lack of std::abs() would cause it to fail to converge; if anything it would cause it to converge when it actually hadn't
< naywhayare>
ah, yeah, and that's the problem that you were saying you were having -- it converges after the first iteration
< udit_s>
yeah.
< naywhayare>
so I bet what is happening is that crt is larger than rt, so (rt - crt) < 0 < tolerance
< naywhayare>
and as a result it terminates the optimization
< udit_s>
Yikes.
< udit_s>
that's silly of me...
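(A sketch of the fix being discussed, using the rt/crt/tolerance names from the conversation; std::abs for doubles comes from <cmath>:)

    // Before: if crt > rt, then (rt - crt) is negative and always below
    // tolerance, so the loop terminates after the first iteration.
    // if (rt - crt < tolerance)
    //   i = iterations;

    // After: compare the magnitude of the change, and simply break.
    if (std::abs(rt - crt) < tolerance)
      break;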
< udit_s>
I think it'll be better to use epsilon from numeric_limits then?
< naywhayare>
that's machine epsilon, which is the smallest representable difference between two doubles
< naywhayare>
so that'll be way too small... I'd suggest 1e-5 or 1e-10
< udit_s>
yeah, but looking at the values of rt, when it starts repeating - the values are usually between 1e-16 and 1e-17. That's probably when the change in the value of the weights stops affecting the boosting.
< naywhayare>
so maybe something more like 1e-40 or 1e-50 is reasonable?
< naywhayare>
machine epsilon for doubles is about 2.2e-16; the smallest representable positive double is far smaller... something like 1e-308
< naywhayare>
you may want to consider storing log(rt) and log(crt) then
< naywhayare>
and comparing those
< udit_s>
why? I just did std::abs(rt - crt) < 1e-12 ... it works alright. any particular reason for using logs?
< naywhayare>
if rt and crt are so small that they are getting clipped to 0, then the log of those values should still be representable as nonzero values
< naywhayare>
i.e. 1e-500 can't be represented as a double, but log(1e-500) can
< naywhayare>
(probably something like -1151, for the natural log)
< naywhayare>
anyway, if std::abs(rt - crt) works fine, no reason to change it :)
< udit_s>
the values of rt aren't getting as small as that.
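(A small demonstration of the representability point: 10^-500 underflows to zero as a double, but its natural log, -500 * ln(10) ≈ -1151.29, is an ordinary double:)

    #include <cmath>
    #include <iostream>

    int main()
    {
      const double tiny = std::pow(10.0, -500.0);   // underflows to 0
      std::cout << tiny << "\n";                    // prints 0
      std::cout << -500.0 * std::log(10.0) << "\n"; // about -1151.29
    }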
< udit_s>
Okay, so can I directly incorporate the tolerance as a user input by using PARAM_DOUBLE(...)?
< naywhayare>
in the main executable, yeah, you would use PARAM_DOUBLE
< naywhayare>
but for the AdaBoost class, you can have it be a parameter to the constructor (and hold it as an internal member), or a parameter to the actual function that runs AdaBoost
< udit_s>
hmm. Okay. Let me get to that.
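(A sketch of both halves of that suggestion; the PARAM_DOUBLE arguments follow mlpack 1.x's (name, description, alias, default) convention, and the AdaBoost class below is a simplified stand-in:)

    // In the executable (e.g. adaboost_main.cpp): expose the tolerance
    // as a command-line option.
    PARAM_DOUBLE("tolerance", "Convergence tolerance for AdaBoost.", "e",
        1e-10);

    // In the AdaBoost class: accept it in the constructor and store it
    // as a member.
    class AdaBoost
    {
     public:
      AdaBoost(const double tolerance = 1e-10) : tolerance(tolerance) { }

     private:
      double tolerance;
    };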
< udit_s>
Also, about decision stumps... Is there no match for a subview_row<double> or subview_rowvec?
< naywhayare>
what do you mean?
< udit_s>
I've been trying to send the sorted weights row vector as a parameter to the calculateEntropy function - it doesn't compile.
< udit_s>
hang on, let me get the exact statement for you.
< naywhayare>
ok, I'm about to get some lunch... I'll be back in an hour or so
< udit_s>
I'll be off to sleep in a while... I'll push the updates for adaboost, and I'll talk to you about the stumps code either tomorrow or over email.