naywhayare changed the topic of #mlpack to: -- We don't respond instantly... but we will respond. Give it a few minutes. Or hours. -- Channel logs:
sumedhghaisas has joined #mlpack
sumedhghaisas has quit [Ping timeout: 255 seconds]
transponder has joined #mlpack
< transponder> hello
< transponder> How much time is the GMM estimation supposed to take?
< transponder> I have about 10^6 data vectors of 13 dimensions
< transponder> and am trying to train 256 gaussian MM
< transponder> running on 2 gb ram laptop
< naywhayare> so, 1M points... 256 gaussians
< naywhayare> did you compile without debugging symbols?
< transponder> yes
< naywhayare> okay, well that's good
< transponder> 256 gaussians, in 13 dimensions
< naywhayare> it's difficult to say how long the estimation is supposed to take because it is highly dependent on the data
< naywhayare> the algorithm that we have implemented should scale as... O(nk) per iteration, I think
< naywhayare> but I don't know how to easily place a bound on the number of iterations
< transponder> Ok...its been running now for 17 hrs. What is k ?
< naywhayare> (n is the number of points, k is the number of components)
< naywhayare> is it giving any output at all?
< transponder> n will be 13*10^6 right
< transponder> how do i check if its giving any output
< naywhayare> i.e. "EMFit::Estimate(): iteration ..."
< naywhayare> is it printing anything on the screen?
< naywhayare> also, did you write a C++ program by hand, or are you using the 'gmm' program?
< transponder> i am simply doing this
< transponder> GMM<> g(256, 13);
< transponder> arma::mat data;
< transponder> then loading data
< transponder> sorry
< transponder> g.Estimate(data);
< naywhayare> okay... that won't give output unless you turn on output from the 'mlpack::Log::Info' object by doing this... 'mlpack::Log::Info.ignoreInput = false;'
< transponder> g.Save("output_file");
< transponder> k i will try that
< naywhayare> you may also want to try with a subset of your data to start with
< transponder> yes, i tested with subset, works fine
< transponder> now gave complete data for training
< transponder> btw in the [INFO] there needs to be timestamp
< transponder> so that we know how much time spent in each step
< transponder> its still stuck at first step,
< transponder> reading data
< transponder> 8 minutes
< transponder> using ~580 mb memory
< transponder> a correction btw, its 10^7 data vectors
< transponder> still [INFO ] Loading..
< transponder> after 35 mins, still says Loading 'file' size is x y
< transponder> mem usage is almost constant over time
< naywhayare> how big is the data file?
< transponder> 682 mb, and 13 x 5438715 exact matrix size
< naywhayare> in what format?
< transponder> plain text
< naywhayare> when the data is loaded it must be transposed, and sometimes this can take a while
< naywhayare> 5.4M points is a lot of points, but I would still expect that to load within 35 minutes
< transponder> can i transpose it myself
< transponder> and give to load
< transponder> can do it quickly in matlab i guess
< naywhayare> if you really wanted to, you could do that, and data::Load() takes an additional argument to not transpose
< naywhayare> but, there is nothing wrong with the transposition code
< naywhayare> I would wager a bet that your laptop is swapping
< transponder> k
< naywhayare> a dataset of that size will use about 0.5 GB of RAM
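The 0.5 GB figure can be checked with a little arithmetic, assuming the matrix is stored densely as 8-byte doubles (as arma::mat does). This is only a sketch; the helper names are made up for illustration, and it ignores any temporary copy the loader might need while transposing.

```cpp
#include <cassert>
#include <cstdint>

// Bytes needed for a dense rows x cols matrix of doubles (hypothetical helper).
std::uint64_t denseMatrixBytes(std::uint64_t rows, std::uint64_t cols)
{
  assert(rows > 0 && cols > 0);
  return rows * cols * sizeof(double);
}

// Convert a byte count to GiB (2^30 bytes).
double toGiB(std::uint64_t bytes)
{
  return static_cast<double>(bytes) / (1024.0 * 1024.0 * 1024.0);
}
```

For the 13 x 5438715 matrix mentioned above, denseMatrixBytes(13, 5438715) is 565626360 bytes, a bit over half a GiB.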
< transponder>              total       used       free     shared    buffers     cached
< transponder> Mem:          3805       3349        456          0        239        691
< transponder> -/+ buffers/cache:       2417       1388
< transponder> Swap:         7629       1467       6162
< transponder> output of free -m
< naywhayare> that doesn't necessarily mean that it isn't swapping. the kernel may be swapping things out in order to provide a 0.5 GB contiguous section of memory
< naywhayare> also, that's 4GB not 2GB
< transponder> yes
< naywhayare> that makes me less inclined to believe the system is swapping, then
< naywhayare> unless you are running very many other processes
< transponder> there is firefox and an idle instance of matlab
< transponder> nothing else
< naywhayare> do you have another system that you can test on?
< naywhayare> or, are you able to get me a copy of the dataset so I can see if I can reproduce it?
< naywhayare> if indeed it is hanging on the load
< transponder> yes, will try on another machine
< transponder> sure i can upload the data, wait
< transponder> let me know if not able to download
< naywhayare> I created a test program that is only a couple lines long:
< naywhayare> int main() {
< naywhayare> mlpack::Log::Info.ignoreInput = false;
< naywhayare> arma::mat dataset;
< naywhayare> mlpack::data::Load("/home/ryan/Downloads/sampledata.txt", dataset);
< naywhayare> }
< naywhayare> it only takes 33 seconds on my system, and it does not hang
< naywhayare> the output is this:
< naywhayare> [INFO ] Loading '/home/ryan/Downloads/sampledata.txt' as raw ASCII formatted data. Size is 13 x 5438715.
< naywhayare> (the 'Size is 13 x 5438715' bit doesn't show up until it is loaded, and the call to data::Load() completes)
< transponder> ok, i thought that the [INFO] comes while its still loading
< transponder> does it show anything more in the log?
< transponder> it doesnt in mine
< transponder> so i guess its doing EM
< transponder> but no iteration complete
< naywhayare> well, like I said, that was the end of the program, so no more output from that one
< naywhayare> I wanted to confirm that there was no issue loading the data.
< transponder> ok
< transponder> cud you try GMM estimate, i copied my data to another machine but will have to reinstall mlpack there
< transponder> or u think this is OK
< transponder> as in iteration may take some hours
< naywhayare> yes, I am running that now
< naywhayare> I would not be surprised if each iteration takes a long time
< naywhayare> 5.4M points is a lot of points, and 256 dimensions is very many
< naywhayare> sorry... 256 components
< transponder> yes, but i dont know whats the benchmark
< naywhayare> you mean the benchmark on the mlpack website?
< transponder> as in how much time does it take generally for X amount of data
< naywhayare> that depends quite a lot on the data
< transponder> if the number of iteration could be pre-fixed, can one estimate time before running gmm train?
< naywhayare> probably, but the easiest way to do that would be to run only a single iteration
< naywhayare> however, it's also not that simple, because before an iteration of the GMM estimation algorithm can be run, the initial parameters are estimated with k-means
< naywhayare> so unfortunately I don't think I have a good answer for how long it will take
< transponder> ok .. is ther a commnd to do just one iteration?
< naywhayare> I believe the maximum number of iterations can be set through the GMM class
< naywhayare> you can consult the API documentation
< transponder> k
< transponder> btw thanks for ur help, i ll come to chat later if i get some result in log or otherwise ..
< transponder> will it help if input data is sorted in any way, or those kind of things are taken care of / wudnt matter
< transponder> or may be small subset cud be used to make an initial GMM
< transponder> and then somehow use that GMM to train another GMM on complete data
< transponder> ?
< naywhayare> the ordering of the input data shouldn't matter
< naywhayare> it would be possible to do what you suggested -- train a GMM on a small subset of the data
< naywhayare> but you will have to get your hands a little dirty with the code
< naywhayare> note that the GMM class uses the EMFit class for the actual fitting
< naywhayare> and then the EMFit class uses an InitialClusteringType class to do the initial clustering
< naywhayare> you could build a class to use as the InitialClusteringType that trained a GMM on a small subset of the data, then used that as the initial GMM estimate
< transponder> k, would that initial estimate go as input to EMFit?
< naywhayare> no, you would have to create a new class to use as the template parameter to EMFit
< naywhayare> you would need to spend some time with the GMM code to see how it works
< transponder> k..
< transponder> will try that .. if this remains too slow
< transponder> btw r ther any alternative packages that u know of?
< transponder> ther is one GMMBayes in matlab, but its slow as expected
< transponder> may be something that cud use multiple CPUs
< transponder> ther is dlib ml package but it doesnt have EM
< transponder> [i'll join ltr, going for food]
< naywhayare> I don't know of any other packages that you could use for this
govg has joined #mlpack
govg has quit [Changing host]
govg has joined #mlpack
< transponder> k
< transponder> its 2:40 Hrs since i started running, no new output in log file
govg has quit [Ping timeout: 245 seconds]
govg has joined #mlpack
govg has quit [Changing host]
govg has joined #mlpack
< transponder> hello, i tried package, it gave results in ~ 20 minutes!
govg has quit [Quit: leaving]
transponder has quit [Quit: - A hand crafted IRC client]
transponder has joined #mlpack
transponder has quit [Quit: Page closed]
andrewmw94 has joined #mlpack
< jenkins-mlpack> Starting build #2054 for job mlpack - svn checkin test (previous build: SUCCESS)
udit_s has joined #mlpack
< jenkins-mlpack> Project mlpack - svn checkin test build #2054: SUCCESS in 1 hr 30 min:
< jenkins-mlpack> Ryan Curtin: Don't do base cases before recursing. This is slightly cleaner code and will
< jenkins-mlpack> properly handle the case where the tree is only one node deep.
< jenkins-mlpack> Starting build #2055 for job mlpack - svn checkin test (previous build: SUCCESS)
< udit_s> naywhayare: Hello. Are you free now ?
< naywhayare> udit_s: yes, but I will be leaving for lunch at some point... not sure when
< udit_s> hmm... okay - did you get a chance to go through the mail ?
< udit_s> also, great work on the release ! :)
< udit_s> I've tested on a non-linearly separable dataset - and the tests pass.
< udit_s> The only problem I have is one I've talked about in the mail - the tolerance issue.
< naywhayare> okay
< naywhayare> let me look at the tolerance issue now
< naywhayare> why not std::abs(rt - crt), on line 130?
< naywhayare> also, why not just use a 'break' statement instead of 'i = iterations'?
< naywhayare> also, a minor note for the tests... if you use BOOST_REQUIRE_CLOSE() or BOOST_REQUIRE_LT() as opposed to BOOST_REQUIRE(), the output will be better
< naywhayare> for instance, you have 'BOOST_REQUIRE(hammingLoss <= a.ztAccumulator);'
< naywhayare> if that statement fails, boost.test will output something like 'failure: BOOST_REQUIRE(hammingLoss <= a.ztAccumulator)' and then it will print the value of 'hammingLoss <= a.ztAccumulator', which will be false
< udit_s> I've read somewhere that using breaks is, like, a bad habit and should be avoided. Instead I just used i = iterations.
< udit_s> Okay. Got that.
< naywhayare> but if you use BOOST_REQUIRE_LT(), then it'll print something like 'failure: BOOST_REQUIRE_LT(hammingLoss, a.ztAccumulator)' and then it will print the value of hammingLoss and also the value of a.ztAccumulator
< naywhayare> it can make it a lot easier to debug things :0
< naywhayare> *:)
< udit_s> :D
< naywhayare> do you have a reference for why break statements are bad? it's pretty commonly known that goto's are a bad idea for a handful of reasons, but breaks I've never heard...
< udit_s> Read it somewhere in some book when in school... I think I've formed it since then that I should avoid them, except in 'switch' statements. :)
< udit_s> If you recommend it, I have no problem.
< udit_s> Okay. Let me just get that.
< naywhayare> I don't see too many compelling arguments against break there
< udit_s> and the std::abs issue, where was I going wrong ? In considering that double values can be successfully compared ?
< naywhayare> actually, hang on, I thought about that wrong
< naywhayare> I don't think a lack of std::abs() would cause it to fail to converge; if anything it would cause it to converge when it actually hadn't
< naywhayare> ah, yeah, and that's the problem that you were saying you were having -- it converges after the first iteration
< udit_s> yeah.
< naywhayare> so I bet what is happening is that crt is larger than rt, so (rt - crt) < 0 < tolerance
< naywhayare> and as a result it terminates the optimization
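The early-termination problem described above can be sketched as follows. This is not the actual AdaBoost code (line 130 is not shown in the log); the function names are hypothetical, and rt and crt are assumed to be the values from two successive iterations. The loop also shows the 'break' style of leaving early.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Buggy check: when crt > rt the difference is negative, so it is always
// below a positive tolerance and the loop stops even though nothing converged.
bool convergedSigned(double rt, double crt, double tolerance)
{
  return (rt - crt) < tolerance;
}

// Fixed check: compare the magnitude of the change against the tolerance.
bool convergedAbs(double rt, double crt, double tolerance)
{
  assert(tolerance > 0.0);
  return std::abs(rt - crt) < tolerance;
}

// Counts how many iterations run before two successive values agree,
// leaving the loop with 'break' rather than by overwriting the counter.
std::size_t runUntilConverged(const std::vector<double>& rtValues,
                              double tolerance)
{
  std::size_t iterationsRun = 0;
  for (std::size_t i = 0; i < rtValues.size(); ++i)
  {
    iterationsRun = i + 1;
    if (i > 0 && convergedAbs(rtValues[i], rtValues[i - 1], tolerance))
      break;
  }
  return iterationsRun;
}
```

With rt = 0.3, crt = 0.5 and a tolerance of 1e-10, the signed check reports convergence while the absolute-value check does not.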
< udit_s> Yikes.
< udit_s> that's silly of me...
< udit_s> I think it'll be better to use epsilon from numeric_limits then ?
< naywhayare> that's machine epsilon, which is the smallest representable difference between two doubles
< naywhayare> so that'll be way too small... I'd suggest 1e-5 or 1e-10
< udit_s> yeah, but looking at the values of rt, when it starts repeating - the values are usually between 1e-16 to 1e-17. That's probably when the change in the value of the weights stops affecting the boosting.
< naywhayare> so maybe something more like 1e-40 or 1e-50 is reasonable?
< naywhayare> machine epsilon should be extremely small... something like 1e-300
< naywhayare> (I don't remember the exact number)
< udit_s> for double it's 2.22.045e-16
< udit_s> that's 2.22045e-016
< naywhayare> oh, I was way off then
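For reference, a small sketch of what std::numeric_limits<double>::epsilon() measures: the gap between 1.0 and the next representable double (about 2.22e-16), not the smallest difference between any two doubles. The number on the order of 1e-300 is closer to std::numeric_limits<double>::min(). The helper names here are made up for illustration.

```cpp
#include <cassert>
#include <cmath>
#include <limits>

// Distance from x to the next representable double above it
// (hypothetical helper; the gap shrinks as x gets closer to zero).
double gapAbove(double x)
{
  assert(std::isfinite(x));
  return std::nextafter(x, std::numeric_limits<double>::infinity()) - x;
}

// Machine epsilon: the gap just above 1.0, which is 2^-52 for IEEE doubles.
double machineEpsilon()
{
  return std::numeric_limits<double>::epsilon();
}

// Smallest positive normalized double, roughly 2.2e-308.
double smallestNormal()
{
  return std::numeric_limits<double>::min();
}
```

Since gapAbove(1e-16) is far smaller than machineEpsilon(), values around 1e-16 or 1e-17 can still differ by much less than epsilon, so a fixed tolerance such as 1e-12 is a choice about the problem rather than a limit of the type.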
< naywhayare> you may want to consider storing log(rt) and log(crt) then
< naywhayare> and comparing those
< udit_s> why ? I just did std::abs(rt - crt) < 1e-12 ... it works alright. any particular reason for using logs ?
< naywhayare> if rt and crt are so small that they are getting clipped to 0, then the log of those values should still be representable as nonzero values
< naywhayare> i.e. 1e-500 can't be represented as a double, but log(1e-500) can
< naywhayare> (roughly -1151, since log(1e-500) = -500 * log(10))
< naywhayare> anyway, if std::abs(rt - crt) works fine, no reason to change it :)
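The log-space idea can be illustrated with a tiny sketch, assuming the small quantity arises as a product of positive factors (an assumption for illustration; the log does not say how rt is computed). The function names are hypothetical.

```cpp
#include <cassert>
#include <cmath>

// Multiplies 'factor' by itself 'count' times directly; for tiny results
// the value underflows and is clipped to 0.
double directProduct(double factor, int count)
{
  double product = 1.0;
  for (int i = 0; i < count; ++i)
    product *= factor;
  return product;
}

// Tracks the same quantity as a sum of logs, which stays representable:
// log(factor^count) = count * log(factor).
double logProduct(double factor, int count)
{
  assert(factor > 0.0);
  double logSum = 0.0;
  for (int i = 0; i < count; ++i)
    logSum += std::log(factor);
  return logSum;
}
```

With factor = 1e-5 and count = 100 the true value is 1e-500: directProduct returns 0, while logProduct returns about -1151.29, so two such values can still be compared in log space.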
< udit_s> the values of rt aren't getting as small as that.
< udit_s> Okay, so can I directly incorporate the tolerance as a user input by using PARAM_DOUBLE(...) ?
< naywhayare> in the main executable, yeah, you would use PARAM_DOUBLE
< naywhayare> but for the AdaBoost class, you can have it be a parameter to the constructor (and hold it as an internal member), or a parameter to the actual function that runs AdaBoost
< udit_s> hmm. Okay. Let me get to that.
< udit_s> Also, about decision stumps... Is there no match for a subview_row<double> or subview_rowvec ?
< naywhayare> what do you mean?
< udit_s> I've been trying to send the sorted weights row vector as a parameter to the calculateEntropy function - it doesn't compile.
< udit_s> hang on, let me get the exact statement for you.
< naywhayare> ok, I'm about to get some lunch... I'll be back in an hour or so
< udit_s> I'll be off to sleep in a while... I'll push the updates for adaboost, and I'll talk to you about the stumps code either tomorrow or over email.
< naywhayare> okay
udit_s has quit [Quit: Leaving]
< jenkins-mlpack> Project mlpack - svn checkin test build #2055: SUCCESS in 1 hr 29 min:
< jenkins-mlpack> Ryan Curtin: Extremely minor changes.
< jenkins-mlpack> Starting build #2056 for job mlpack - svn checkin test (previous build: SUCCESS)
< jenkins-mlpack> Project mlpack - nightly matrix build build #536: FAILURE in 2 hr 49 min:
< jenkins-mlpack> * Ryan Curtin: Extremely minor changes.
< jenkins-mlpack> * Ryan Curtin: Don't do base cases before recursing. This is slightly cleaner code and will
< jenkins-mlpack> properly handle the case where the tree is only one node deep.
< jenkins-mlpack> Project mlpack - svn checkin test build #2056: SUCCESS in 1 hr 28 min:
< jenkins-mlpack> saxena.udit: More tests for Adaboost added, with tolerance for change in rt also provided.
< jenkins-mlpack> Starting build #2057 for job mlpack - svn checkin test (previous build: SUCCESS)
< jenkins-mlpack> Project mlpack - svn checkin test build #2057: FAILURE in 7 min 46 sec:
< jenkins-mlpack> Ryan Curtin: Refactor KMeans so that the actual Lloyd iteration step is separate, since there
< jenkins-mlpack> are many ways to do a Lloyd iteration.
< jenkins-mlpack> Starting build #2058 for job mlpack - svn checkin test (previous build: FAILURE -- last SUCCESS #2056 3 hr 53 min ago)
andrewmw94 has quit [Quit: Leaving.]