#mlpack on 2014-07-29 — irc logs at libera.irclog.whitequark.org

2014-05-21 16:24 naywhayare changed the topic of #mlpack to: http://www.mlpack.org/ -- We don't respond instantly... but we will respond. Give it a few minutes. Or hours. -- Channel logs: http://www.mlpack.org/irc/

00:31 sumedhghaisas has joined #mlpack

01:32 sumedhghaisas has quit [Ping timeout: 255 seconds]

02:14 transponder has joined #mlpack

02:14 < transponder> hello

02:15 < transponder> How much time the GMM estimation is supposed to take?

02:15 < transponder> I have about 10^6 data vectors of 13 domensions

02:15 < transponder> and am trying to train 256 gaussian MM

02:15 < transponder> running on 2 gb ram laptop

02:16 < naywhayare> so, 100k points... 256 gaussians

02:17 < naywhayare> did you compile without debugging symbols?

02:18 < transponder> yes

02:18 < naywhayare> okay, well that's good

02:18 < transponder> 256 gaussians, in 13 dimensions

02:18 < naywhayare> it's difficult to say how long the estimation is supposed to take because it is highly dependent on the data

02:18 < naywhayare> the algorithm that we have implemented should scale as... O(nk) per iteration, I think

02:18 < naywhayare> but I don't know how to easily place a bound on the number of iterations

02:19 < transponder> Ok...its been running now for 17 hrs. What is k ?

02:19 < naywhayare> (n is the number of points, k is the number of copmonents)

02:20 < naywhayare> is it giving any output at all?

02:20 < transponder> n will be 13*10^6 right

02:20 < transponder> how do i check if its giving any output

02:20 < naywhayare> i.e. "EMFit::Estimate(): iteration ..."

02:20 < naywhayare> is it printing anything on the screen?

02:20 < naywhayare> also, did you write a C++ program by hand, or are you using the 'gmm' program?

02:21 < transponder> i am simply doing this

02:21 < transponder> GMM<> g(256, 13);

02:22 < transponder> arma::mat data;

02:22 < transponder> then loading data

02:22 < transponder> sorry

02:22 < transponder> g.Estimate(data);

02:22 < naywhayare> okay... that won't give output unless you turn on output from the 'mlpack::Log::Info' object by doing this... 'mlpack::Log::Info.ignoreInput = false;'

02:22 < transponder> g.Save("output_file");

02:22 < transponder> k i will try that

02:24 < naywhayare> you may also want to try with a subset of your data to start with

02:30 < transponder> yes,l tested with subset, works fine

02:31 < transponder> now gave complete data for training

02:31 < transponder> btw in the [INFO] there needs to be timestamp

02:31 < transponder> so that we know how much time spent in each step

02:37 < transponder> its still stuck at first step,

02:37 < transponder> reading data

02:37 < transponder> 8 minutes

02:37 < transponder> using ~580 mb memory

02:39 < transponder> a correction btw, its 10^7 data vectors

02:49 < transponder> still [INFO ] Loading..

03:06 < transponder> after 35 mins, still says Loading 'file' size is x y

03:07 < transponder> mem usage is almost constant over time

03:13 < naywhayare> how big is the data file?

03:15 < transponder> 682 mb, and 13 x 5438715 exact matrix size

03:15 < naywhayare> in what format?

03:15 < transponder> plain text

03:16 < naywhayare> when the data is loaded it must be transposed, and sometimes this can take a while

03:16 < naywhayare> 5.3M points is a lot of points, but I would still expect that to load within 35 minutes

03:16 < transponder> can i transpose it myself

03:16 < transponder> and give to load

03:16 < transponder> can do it quickly in matlab i guess

03:16 < naywhayare> if you really wanted to, you could do that, and data::Load() takes an additional argument to not transpose

03:17 < naywhayare> but, there is nothing wrong with the transposition code

03:17 < naywhayare> I would wager a bet that your laptop is swapping

03:17 < transponder> k

03:17 < naywhayare> a dataset of that size will use about 0.5 GB of RAM

03:17 < transponder> total used free shared buffers cached

03:17 < transponder> Mem: 3805 3349 456 0 239 691

03:17 < transponder> -/+ buffers/cache: 2417 1388

03:17 < transponder> Swap: 7629 1467 6162

03:17 < transponder> output of free -m

03:18 < naywhayare> that doesn't necessarily mean that it isn't swapping. the kernel may be swapping things out in order to provide a 0.5 GB contiguous section of memory

03:18 < naywhayare> also, that's 4GB not 2GB

03:18 < transponder> yes

03:18 < naywhayare> that makes me less inclined to believe the system is swapping, then

03:18 < naywhayare> unless you are running very many other processes

03:19 < transponder> ther is firefox and and idle instant of matlab

03:19 < transponder> nothing else

03:20 < naywhayare> do you have another system that you can test on?

03:20 < naywhayare> or, are you able to get me a copy of the dataset so I can see if I can reproduce it?

03:20 < naywhayare> if indeed it is hanging on the load

03:20 < transponder> yes, will try on another machine

03:20 < transponder> sure i can upload the data, wait

03:39 < transponder> hello, please download from here https://drive.google.com/file/d/0BzNdoGke8tyRMUJuQk53S0VZdFE/edit?usp=sharing

03:39 < transponder> let me know if not able to download

03:47 < naywhayare> I created a test program that is only a couple lines long:

03:47 < naywhayare> int main() {

03:47 < naywhayare> mlpack::Log::Info.ignoreInput = false;

03:47 < naywhayare> arma::mat dataset;

03:47 < naywhayare> mlpack::data::Load("/home/ryan/Downloads/sampledata.txt", dataset);

03:47 < naywhayare> }

03:48 < naywhayare> it only takes 33 seconds on my system, and it does not hang

03:48 < naywhayare> the output is this:

03:48 < naywhayare> [INFO ] Loading '/home/ryan/Downloads/sampledata.txt' as raw ASCII formatted data. Size is 13 x 5438715.

03:48 < naywhayare> (the 'Size is 13 x 5438715' bit doesn't show up until it is loaded, and the call to data::Load() completes)

03:55 < transponder> ok, i thought that the [INFO] comes while its still loading

03:55 < transponder> does it show anything more in the log?

03:56 < transponder> it doesnt in mine

03:56 < transponder> so i guess its doing EM

03:56 < transponder> but no iteration complete

03:57 < naywhayare> well, like I said, that was the end of the program, so no more output from that one

03:57 < naywhayare> I wanted to confirm that there was no issue loading the data.

03:58 < transponder> ok

03:58 < transponder> cud you try GMM estimate, i copied my data to another machine but will have to reinstall mlpack there

03:58 < transponder> or u think this is OK

03:59 < transponder> as in iteration may take some hours

04:00 < naywhayare> yes, I am running that now

04:00 < naywhayare> I would not be surprised if each iteration takes a long time

04:01 < naywhayare> 5.3M points is a lot of points, and 256 dimensions is very many

04:01 < naywhayare> sorry... 256 components

04:01 < transponder> yes, but i dont know whats the benchmark

04:02 < naywhayare> you mean the benchmark on the mlpack website?

04:03 < transponder> as in how much time does it take generally for X amount of data

04:03 < naywhayare> that depends quite a lot on the data

04:03 < transponder> if the number of iteration could be pre-fixed, can one estimate time before running gmm train?

04:04 < naywhayare> probably, but the easiest way to do that would be to run only a single iteration

04:05 < naywhayare> however, it's also not that simple, because before an iteration of the GMM estimation algorithm can be run, the initial parameters are estimated with k-means

04:05 < naywhayare> so unfortunately I don't think I have a good answer for how long it will take

04:05 < transponder> ok .. is ther a commnd to do just one iteration?

04:06 < naywhayare> I believe the maximum number of iterations can be set through the GMM class

04:06 < naywhayare> you can consult the API documentation

04:06 < transponder> k

04:07 < transponder> btw thanks for ur help, i ll come to chat later if i get some result in log or otherwise ..

04:09 < transponder> will it help if input data is sorted in any way, or those kind of things are taken care of / wudnt matter

04:09 < transponder> or may be small subset cud be used to make an initial GMM

04:10 < transponder> and then somehow use that GMM to train another GMM on complete data

04:10 < transponder> ?

04:10 < naywhayare> the ordering of the input data shouldn't matter

04:10 < naywhayare> it would be possible to do what you suggested -- train a GMM on a small subset of the data

04:10 < naywhayare> but you will have to get your hands a little dirty with the code

04:11 < naywhayare> note that the GMM class uses the EMFit class for the actual fitting

04:11 < naywhayare> and then the EMFit class uses an InitialClusteringType class to do the initial clustering

04:12 < naywhayare> you could build a class to use as the InitialClusteringType that trained a GMM on a small subset of the data, then used that as the initial GMM estimate

04:13 < transponder> k, would that initial estimate go as input to EMFit?

04:13 < naywhayare> no, you would have to create a new class to use as the template parameter to EMFit

04:13 < naywhayare> you would need to spend some time with the GMM code to see how it works

04:14 < transponder> k..

04:15 < transponder> will try that .. if this remains too slow

04:15 < transponder> btw r ther any alternative packages that u know of?

04:15 < transponder> ther is one GMMBayes in matlab, but its slow as expected

04:16 < transponder> may be something that cud use multiple CPUs

04:17 < transponder> ther is dlib ml package but it doesnt have EM

04:19 < transponder> [i'll join ltr, going for food]

04:19 < naywhayare> I don't know of any other packages that you could use for this

04:40 govg has joined #mlpack

04:40 govg has quit [Changing host]

04:40 govg has joined #mlpack

05:08 < transponder> k

05:09 < transponder> its 2:40 Hrs since i started running, no new output in log file

05:15 govg has quit [Ping timeout: 245 seconds]

05:33 govg has joined #mlpack

05:33 govg has quit [Changing host]

05:33 govg has joined #mlpack

06:24 < transponder> hello, i tried https://github.com/juandavm/em4gmm package, it gave results in ~ 20 minutes!

06:42 govg has quit [Quit: leaving]

07:47 transponder has quit [Quit: http://www.kiwiirc.com/ - A hand crafted IRC client]

14:26 transponder has joined #mlpack

14:33 transponder has quit [Quit: Page closed]

14:39 andrewmw94 has joined #mlpack

15:56 < jenkins-mlpack> Starting build #2054 for job mlpack - svn checkin test (previous build: SUCCESS)

17:21 udit_s has joined #mlpack

17:26 < jenkins-mlpack> Project mlpack - svn checkin test build #2054: SUCCESS in 1 hr 30 min: http://big.cc.gt.atl.ga.us:8080/job/mlpack%20-%20svn%20checkin%20test/2054/

17:26 < jenkins-mlpack> Ryan Curtin: Don't do base cases before recursing. This is slightly cleaner code and will

17:26 < jenkins-mlpack> properly handle the case where the tree is only one node deep.

17:26 < jenkins-mlpack> Starting build #2055 for job mlpack - svn checkin test (previous build: SUCCESS)

17:34 < udit_s> naywhayare: Hello. Are you free now ?

17:39 < naywhayare> udit_s: yes, but I will be leaving for lunch at some point... not sure when

17:40 < udit_s> hmm... okay - did you get a chance to go through the mail ?

17:41 < udit_s> also, great work on the release ! :)

17:41 < udit_s> I've tested on a non-linearly separable dataset - and the tests pass.

17:42 < udit_s> The only problem I have is one I've talked about in the mail - the tolerance issue.

17:43 < naywhayare> okay

17:43 < naywhayare> let me look at the tolerance issue now

17:44 < naywhayare> why not std::abs(rt - crt), on line 130?

17:44 < naywhayare> also, why not just use a 'break' statement instead of 'i = iterations'?

17:45 < naywhayare> also, a minor note for the tests... if you use BOOST_REQUIRE_CLOSE() or BOOST_REQUIRE_LT() as opposed to BOOST_REQUIRE(), the output will be better

17:45 < naywhayare> for instance, you have 'BOOST_REQUIRE(hammingLoss <= a.ztAccumulator);'

17:46 < naywhayare> if that statement fails, boost.test will output something like 'failure: BOOST_REQUIRE(hammingLoss <= a.ztAccumulator)' and then it will print the value of 'hammingLoss <= a.ztAccumulator', which will be false

17:46 < udit_s> I've read somewhere that using breaks is, like, a bad habit and should be avoided. Instead I just used i = iterations.

17:46 < udit_s> Okay. Got that.

17:47 < naywhayare> but if you use BOOST_REQUIRE_LT(), then it'll print something like 'failure: BOOST_REQUIRE_LT(hammingLoss, a.ztAccumulator)' and then it will print the value of hammingLoss and also the value of a.ztAccumulator

17:47 < naywhayare> it can make it a lot easier to debug things :0

17:47 < naywhayare> *:)

17:47 < udit_s> :D

17:47 < naywhayare> do you have a reference for why break statements are bad? it's pretty commonly known that goto's are a bad idea for a handful of reasons, but breaks I've never heard...

17:49 < udit_s> Read it somewhere in some book when in school... I think I've formed it since then that I should avoid them, except in 'switch' statements. :)

17:49 < udit_s> If you recommend it, I have no problem.

17:50 < udit_s> Okay. Let me just get that.

17:51 < naywhayare> http://stackoverflow.com/questions/3922599/is-it-a-bad-practice-to-use-break-in-a-for-loop

17:51 < naywhayare> I don't see too many compelling arguments against break there

17:51 < udit_s> and the std::abs issue, where was I going wrong ? In considering that double values can be successfully compared ?

17:52 < naywhayare> actually, hang on, I thought about that wrong

17:52 < naywhayare> I don't think a lack of std::abs() would cause it to fail to converge; if anything it would cause it to converge when it actually hadn't

17:53 < naywhayare> ah, yeah, and that's the problem that you were saying you were having -- it converges after the first iteration

17:54 < udit_s> yeah.

17:54 < naywhayare> so I bet what is happening is that crt is larger than rt, so (rt - crt) < 0 < tolerance

17:54 < naywhayare> and as a result it terminates the optimization

18:02 < udit_s> Yikes.

18:02 < udit_s> that's silly of me...

18:03 < udit_s> I think it'll be better to use epsilon from numeric_limits then ?

18:04 < naywhayare> that's machine epsilon, which is the smallest representable difference between two doubles

18:04 < naywhayare> so that'll be way too small... I'd suggest 1e-5 or 1e-10

18:06 < udit_s> yeah, but looking at the values of rt, when it starts repeating - the values are usually between 1e-16 to 1e-17. That's probably when the change in the value of the weights stops affecting the boosting.

18:07 < naywhayare> so maybe something more like 1e-40 or 1e-50 is reasonable?

18:07 < naywhayare> machine epsilon should be extremely small... something like 1e-300

18:07 < naywhayare> (I don't remember the exact number)

18:08 < udit_s> for double it's 2.22.045e-16

18:09 < udit_s> that's 2.22045e-016

18:10 < udit_s> reference: http://msdn.microsoft.com/en-us/library/6x7575x3(v=vs.71).aspx

18:12 < naywhayare> oh, I was way off then

18:13 < naywhayare> you may want to consider storing log(rt) and log(crt) then

18:13 < naywhayare> and comparing those

18:15 < udit_s> why ? I just did std::abs(rt - crt) < 1e-12 ... it works alright. any particular reason for using logs ?

18:15 < naywhayare> if rt and crt are so small that they are getting clipped to 0, then the log of those values should still be representable as nonzero values

18:15 < naywhayare> i.e. 1e-500 can't be represented as a double, but log(1e-500) can

18:16 < naywhayare> (probably something like -2e10, not sure)

18:17 < naywhayare> anyway, if std::abs(rt - crt) works fine, no reason to change it :)

18:17 < udit_s> the values of rt aren't getting as small as that.

18:21 < udit_s> Okay, so can I directly incorporate the tolerance as a user input by using PARAM_DOUBLE(...) ?

18:21 < naywhayare> in the main executable, yeah, you would use PARAM_DOUBLE

18:22 < naywhayare> but for the AdaBoost class, you can have it be a parameter to the constructor (and hold it as an internal member), or a parameter to the actual function that runs AdaBoost

18:22 < udit_s> hmm. Okay. Let me get to that.

18:23 < udit_s> Also, about decision stumps... Is there no match for a subview_row<double> or subview_rowvec ?

18:23 < naywhayare> what do you mean?

18:24 < udit_s> I've been trying to send the sorted weights row vector as a parameter to the calculateEntropy function - it doesn't compile.

18:25 < udit_s> hang on, let me get the exact statement for you.

18:31 < naywhayare> ok, I'm about to get some lunch... I'll be back in an hour or so

18:33 < udit_s> I'll be off to sleep in a while... I'll push the updates for adaboost, and I'll talk to you about the stumps code either tomorrow or over email.

18:35 < naywhayare> okay

18:54 udit_s has quit [Quit: Leaving]

18:56 < jenkins-mlpack> Project mlpack - svn checkin test build #2055: SUCCESS in 1 hr 29 min: http://big.cc.gt.atl.ga.us:8080/job/mlpack%20-%20svn%20checkin%20test/2055/

18:56 < jenkins-mlpack> Ryan Curtin: Extremely minor changes.

19:02 < jenkins-mlpack> Starting build #2056 for job mlpack - svn checkin test (previous build: SUCCESS)

19:50 < jenkins-mlpack> Project mlpack - nightly matrix build build #536: FAILURE in 2 hr 49 min: http://big.cc.gt.atl.ga.us:8080/job/mlpack%20-%20nightly%20matrix%20build/536/

19:50 < jenkins-mlpack> * Ryan Curtin: Extremely minor changes.

19:50 < jenkins-mlpack> * Ryan Curtin: Don't do base cases before recursing. This is slightly cleaner code and will

19:50 < jenkins-mlpack> properly handle the case where the tree is only one node deep.

20:30 < jenkins-mlpack> Project mlpack - svn checkin test build #2056: SUCCESS in 1 hr 28 min: http://big.cc.gt.atl.ga.us:8080/job/mlpack%20-%20svn%20checkin%20test/2056/

20:30 < jenkins-mlpack> saxena.udit: More tests for Adaboost added, with tolerance for change in rt also provided.

22:10 < jenkins-mlpack> Starting build #2057 for job mlpack - svn checkin test (previous build: SUCCESS)

22:18 < jenkins-mlpack> Project mlpack - svn checkin test build #2057: FAILURE in 7 min 46 sec: http://big.cc.gt.atl.ga.us:8080/job/mlpack%20-%20svn%20checkin%20test/2057/

22:18 < jenkins-mlpack> Ryan Curtin: Refactor KMeans so that the actual Lloyd iteration step is separate, since there

22:18 < jenkins-mlpack> are many ways to do a Lloyd iteration.

22:55 < jenkins-mlpack> Starting build #2058 for job mlpack - svn checkin test (previous build: FAILURE -- last SUCCESS #2056 3 hr 53 min ago)

23:22 andrewmw94 has quit [Quit: Leaving.]