verne.freenode.net changed the topic of #mlpack to: http://www.mlpack.org/ -- We don't respond instantly... but we will respond. Give it a few minutes. Or hours. -- Channel logs: http://www.mlpack.org/irc/
sumedhghaisas has quit [Remote host closed the connection]
kris1 has quit [Quit: kris1]
partobs-mdp has joined #mlpack
< partobs-mdp> zoq: rcurtin: Opening the debugging session. I've noticed that objective blow-ups happen together with gradient blow-ups (Captain Obvious) - the gradient norm goes from 10~20 to 200+. The weird part: the maximum gradient element is consistently more than 5 (is clipping even working?)
< partobs-mdp> Before "blow-ups" the maximum gradient element goes to some insane values (~25-30)
< partobs-mdp> As an idea: what if doing gradient clipping inside layer gradient was a mistake? The goal is to clip the gradient w.r.t. network weights, not w.r.t. error layer inputs.
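A minimal sketch of that idea (clipping the gradient of the network weights element-wise, rather than the gradient w.r.t. the error layer's inputs), assuming the parameter gradient is available as an arma::mat; the [-5, 5] bounds here are only illustrative:

```cpp
#include <armadillo>

// Clip the gradient w.r.t. the network weights element-wise, instead of the
// gradient w.r.t. the error layer's inputs.  arma::clamp() bounds every
// element of the matrix to [minValue, maxValue].
arma::mat ClipWeightGradient(const arma::mat& gradient,
                             const double minValue = -5.0,
                             const double maxValue = 5.0)
{
  return arma::clamp(gradient, minValue, maxValue);
}
```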
kris1 has joined #mlpack
< partobs-mdp> Well, the result has improved from 35% to 50%. Next I'll try to clip the gradient to lower values ([-2, +2]) and add some dropout - *now* it looks like it's going to work :)
kris1 has quit [Quit: kris1]
< partobs-mdp> The explosions are still there. My take on it: no regularization -> the model goes to some high-likelihood, low-measure spikes -> overfitting
< partobs-mdp> The gradient clipping somewhat rectified it, because SGD steps became less sharp
< partobs-mdp> However, it still went to those "overfitting" areas
< partobs-mdp> So Dropout looks like the only good idea
kris1 has joined #mlpack
< partobs-mdp> Well, even with added Dropout it still converges to worse parameter values than MeanSquaredError on all tasks
< partobs-mdp> My take: MeanSquaredError implicitly clipped the gradient (gradient = input - target, so all its elements are <= 1 in absolute value)
< partobs-mdp> I think it's time to switch back to the MeanSquaredError
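A quick standalone illustration of that claim (these are the standard element-wise loss gradients, not mlpack's exact Backward() implementations): with predictions and targets in [0, 1], the MSE gradient stays within [-1, 1], while the cross-entropy gradient blows up as a prediction approaches 0 or 1.

```cpp
#include <armadillo>

int main()
{
  arma::vec prediction = {0.001, 0.99, 0.5};
  arma::vec target     = {1.0,   0.0,  1.0};

  // MeanSquaredError: d/dx 0.5 * (x - t)^2 = x - t; every element stays in
  // [-1, 1] when x and t are both in [0, 1].
  arma::vec mseGrad = prediction - target;

  // Cross entropy: d/dx [-t * log(x) - (1 - t) * log(1 - x)]
  //              = -t / x + (1 - t) / (1 - x); unbounded as x -> 0 or 1.
  arma::vec ceGrad = -target / prediction + (1.0 - target) / (1.0 - prediction);

  mseGrad.print("MSE gradient:");      // {-0.999, 0.99, -0.5}
  ceGrad.print("cross-entropy gradient:");  // {-1000, 100, -2}
  return 0;
}
```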
kris1 has quit [Quit: kris1]
andrzejk_ has joined #mlpack
kris1 has joined #mlpack
kris1 has quit [Quit: kris1]
kris1 has joined #mlpack
kris1 has quit [Client Quit]
kris1 has joined #mlpack
andrzejk_ has quit [Quit: Textual IRC Client: www.textualapp.com]
kris1 has quit [Quit: kris1]
< partobs-mdp> zoq: rcurtin: The weird thing is that after plugging back MeanSquaredError I get 20% (both with and without Dropout)
< partobs-mdp> It looks like the problem is not in CrossEntropyError. Can anyone run the code and see if it really yields 20% on the MSE model?
< partobs-mdp> I feel like I'm going slightly crazy, because earlier zoq reported >80% on our previous MSE model
kris1 has joined #mlpack
vivekp has quit [Ping timeout: 260 seconds]
shikhar has joined #mlpack
vivekp has joined #mlpack
vivekp has quit [Ping timeout: 260 seconds]
vivekp has joined #mlpack
< zoq> partobs-mdp: So you're saying if you revert everything you can't replicate my results?
< partobs-mdp> Yes, I still have 20% using current upstream code from my fork
< partobs-mdp> (50% using CrossEntropy)
< zoq> partobs-mdp: Okay, let me rerun the experiments.
< zoq> partobs-mdp: Almost the same results: https://gist.github.com/zoq/f265d132f7b2f52d41e3ef8daf2c70bd
< zoq> partobs-mdp: Let me drop in the cross entropy layer, I'll also test the LogSoftmax + negative log likelihood layer, which should be the same.
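For reference, the standard identity behind that equivalence (assuming the cross entropy is taken over softmax outputs with a one-hot target $t$ whose correct class is $c$, and logits $x$):

$$\mathrm{CE}\big(\mathrm{softmax}(x),\,t\big) = -\sum_i t_i \log \mathrm{softmax}(x)_i = -\log \mathrm{softmax}(x)_c = \mathrm{NLL}\big(\mathrm{LogSoftmax}(x),\,c\big),$$

so LogSoftmax followed by negative log likelihood computes the same loss.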
< zoq> partobs-mdp: cross entropy results: https://gist.github.com/zoq/08760807bc3e23aa27b50a51bf80311e
< zoq> partobs-mdp: So I can at least replicate the cross entropy results
< zoq> partobs-mdp: ah wait, I mixed up the LogSoftmax layer with the CrossEntropyError, let me rerun the script
< partobs-mdp> On MSE or on cross-entropy?
< zoq> partobs-mdp: cross
< zoq> about the same as for the mse
< partobs-mdp> Well, that's strange, because the results here are way lower: https://github.com/mlpack/models/pull/1#issuecomment-314845673
< rcurtin> partobs-mdp: time to start looking at valgrind maybe? (hopefully not, that gets tedious)
< rcurtin> let me try and replicate myself and see what I get, but I am headed to the office now so it will be ~30-45m before I can start the runs
< zoq> partobs-mdp: I think since we use random weights we see slightly different results
< zoq> rcurtin: good idea, it's strange that I get about the same results with either mse or cross entropy
< rcurtin> if the issue is the random initialization, I would expect that several runs with different seeds would produce wildly different results
< partobs-mdp> but that would be strange, because I was consistently getting much lower results
< zoq> yeah, I agree
< zoq> my results are all > 0.8
< zoq> partobs-mdp: did you test it with DEBUG=ON?
< partobs-mdp> zoq: I'll rerun it, maybe I have done something stupid while building
< zoq> I used DEBUG=OFF
< zoq> or the default settings
< partobs-mdp> zoq: I think no - but I'll check history | grep cmake later on
< zoq> okay, in the meantime let me run the add task with the cross entropy layer
< partobs-mdp> zoq: Ran SortTask on maxLen = 10, getting 7% precision
< partobs-mdp> zoq: Command: bin/mlpack_lstm_baseline -t sort -e 500 -b 2 -l 10 -r 1 -s 128 -v
< partobs-mdp> cmake -D DEBUG=ON -D PROFILE=ON ../ - yes I did use DEBUG=ON. Should I rebuild?
< zoq> hm, let's see if you get better results with DEBUG=OFF
< zoq> if that's the case we have to dig for the issue
< zoq> [INFO ] SGD: maximum iterations (64000) reached; terminating optimization.
< zoq> [INFO ] RNN::RNN(): final objective of trained model is 151.434.
< zoq> [INFO ] Running evaluation loop.
< zoq> [INFO ] Final score: 0.0234375
< zoq> Final score: 0.0234375
< zoq> but I think we should use many more samples here
< partobs-mdp> zoq: Whoa, I still get 7% even with DEBUG=OFF
< partobs-mdp> Unreproducible bug?
< partobs-mdp> zoq: Ouch, I wrote "sort" instead of "copy"
< zoq> we shouldn't expect to see exactly the same results, since we use different initial weights and also shuffle the samples
< zoq> okay, let me run copy
< partobs-mdp> By the way, that gradient explosion problem is not specific to CrossEntropyLayer - I get a similar effect with MeanSquaredError (I ran Copy)
< zoq> strange
< zoq> [INFO ] Final score: 0.703125
< zoq> I guess, we could test if the GRU layer might perform better
< rcurtin> is the training being done the same way on both systems? or is one setup training in batch on many samples, and the other training on only one sample like in the models#1 PR?
< partobs-mdp> I've got 0.75 precision on MSE
< zoq> I used the script from the models PR
< zoq> so we get basically the same results here, right?
< partobs-mdp> now - yes. But the results you got from the script were different (~84%)
< zoq> would be interesting to see if you get about the same results with CE
< partobs-mdp> Can this be explained by random initialization?
< zoq> There are multiple factors: weights, shuffle
< zoq> rcurtin: for the last results we both just used: bin/mlpack_lstm_baseline -t copy -e 500 -b 2 -l 10 -r 1 -s 128 -v
< rcurtin> ok, which if I understand right only trains on a single point?
< zoq> I think it uses the fixed representation (arma::mat), so it should train over all points
< rcurtin> ah sorry you are right
< rcurtin> ok, I'm at my desk now, let me see what I can reproduce
< rcurtin> [INFO ] RNN::RNN(): final objective of trained model is 0.264269.
< rcurtin> [INFO ] Final score: 0.804688
< rcurtin> and that's for $ bin/mlpack_lstm_baseline -t copy -e 500 -b 2 -l 10 -r 1 -s 128 -v
< rcurtin> I noticed that during optimization, I did see what I think is the same 'gradient explosion' problem
< rcurtin> the objective would get down under 1, then increase to ~100, then work its way back down
< rcurtin> in each case though, the optimizer was able to recover
< rcurtin> the objective got as low as 1e-5 (around SGD iteration 56000), but then exploded to ~500 and worked back down to the final objective of 0.264269
< rcurtin> let me try a few runs, to see what kind of variance I get
< rcurtin> in order to do that, I'll have to add 'mlpack::math::RandomSeed(std::time(NULL));' to the start of the program
shikhar has quit [Quit: WeeChat 1.7]
< zoq> rcurtin: okay, I guess we should test the GRU layer and see if we encounter the same gradient explosion
< rcurtin> I'm seeing a lot of variance, and whether or not I get good performance seems directly related to whether or not the gradient exploded right before the end of the run
< rcurtin> for instance, the last run I did, the gradient exploded at iteration ~63k and did not recover before optimization was over, and I got:
< rcurtin> [INFO ] RNN::RNN(): final objective of trained model is 206.974.
< rcurtin> [INFO ] Final score: 0.046875
< partobs-mdp> By the way, I finally managed to reproduce the results in CopyTask
< partobs-mdp> [INFO ] Final score: 0.898438
< partobs-mdp> CopyTask, maxLen = 10
< rcurtin> partobs-mdp: can you add 'mlpack::math::RandomSeed(std::time(NULL));' to the top of lstm_baseline_main.cpp and try running it again?
< rcurtin> I am wondering if the random seed that you compiled with has bad luck and just happens to cause the gradient to explode right at the end of optimization
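For reference, a minimal sketch of that change (the main() here is heavily simplified; the real lstm_baseline_main.cpp uses mlpack's CLI machinery):

```cpp
#include <mlpack/core.hpp>
#include <ctime>

int main(int argc, char** argv)
{
  // Seed the RNG from the clock so every run starts from different
  // initial weights instead of the fixed compile-time default.
  mlpack::math::RandomSeed(std::time(NULL));

  // ... rest of lstm_baseline_main.cpp as before ...
  return 0;
}
```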
< partobs-mdp> Well, I think I managed to get the experiment right. Now the only real discrepancy is AddTask, so I should only try to change the representation, right?
< partobs-mdp> zoq: But aren't we trying to reproduce the LSTM experiment? The paper reported some nice results with LSTM
< partobs-mdp> By the way, I'm rerunning AddTask on a sample of 1024 sequences
< rcurtin> partobs-mdp: zoq: here are the results for 5 runs of the copy task with different random seeds
< rcurtin> run 1: minimum objective 0.017164 (iteration 57601); final objective 0.0582562, final score 0.734375
< rcurtin> run 2: minimum objective 6.65353e-05 (iteration 48385); final objective 0.0390115; final score 0.6875
< rcurtin> run 3: minimum objective 0.0440252 (iteration 63617); final objective 206.974; final score 0.046875
< rcurtin> run 4: minimum objective 0.0096776 (iteration 64000); final objective 0.0096776; final score 0.570312
< rcurtin> run 5: minimum objective 0.00414738 (iteration 39937); final objective 1.00561; final score 0.703125
< partobs-mdp> rcurtin: CopyTask, maxLength = 10, error function - MSE?
< rcurtin> so there is a huge amount of variation based on initial conditions
< rcurtin> $ bin/mlpack_lstm_baseline -t copy -e 500 -b 2 -l 10 -r 1 -s 128 -v
< rcurtin> with what is currently in your branch, which is MSE
< rcurtin> I want to see what happens if I run training for a lot longer... I'll give results in a while (it'll take a long time, each run as it is takes 5m)
< rcurtin> I guess, my current thoughts are, optimizers can be picky and sensitive to initial conditions
< rcurtin> the gradient explosions I am seeing don't concern me *too* much, I am not sure something is wrong in the implementation---instead, I suspect that since the optimizer is able to recover, that is just kind of a "thing that happens" with this problem
< rcurtin> it seems the optimization surface is very tricky
kris1 has quit [Quit: kris1]
< partobs-mdp> I'll post my results (same task, but using CrossEntropyError) - so far only 1 run, I'll report more about next runs later
< partobs-mdp> [INFO ] Final score: 0.398438
< rcurtin> can I have a link to the paper we are comparing against? I can't find it quickly in the code, but you probably have the URL easily at hand :)
< partobs-mdp> Blow-up around 57k iters
< rcurtin> thanks
< zoq> Probably, if I remember right the NTM authors report that the model does not converge on every run.
< partobs-mdp> From the paper: "When the training is finished, we select the model parameters that gave the lowest error rate on validation batches and report the error using these parameters on fresh 2,500 random examples."
< rcurtin> I think our implementation of Adam is not minibatch currently, and minibatch could help avoid some of the bad steps that cause the explosions
< rcurtin> partobs-mdp: ah! there is what I was looking for
< rcurtin> so I guess, they are taking the best model from the 100 epochs of training then
< partobs-mdp> Well, then it looks like we've conducted the experiment correctly all this time :)
< rcurtin> I wondered if this could give better results---if you took the model just before the gradient exploded, I bet it would perform a lot better
< rcurtin> to modify the lstm_baseline_main.cpp code to do that, I guess we would need to save the model at each epoch and then evaluate on a validation set
< partobs-mdp> rcurtin: But how do we do that without modifying SGD optimizer code?
< rcurtin> partobs-mdp: hm, if Train() starts from the existing set of parameters, you can train one epoch at a time then copy the model
< rcurtin> i.e. for (size_t e = 0; e < epochs; ++e) {
< partobs-mdp> rcurtin: But even then we have to modify RNN::Train(), right?
< rcurtin> opt.MaxIterations() = numSamples;
< rcurtin> m.Train(data, labels, opt); // this is not exactly right, I am just guessing as I sketch this up
< rcurtin> oldModels.push_back(m); // for some std::vector<...> oldModels
< rcurtin> }
< partobs-mdp> or even opt.MaxIterations() = 100 * numSamples;
< partobs-mdp> And then we run Evaluate on the entire model pool, right?
< rcurtin> that would be one way to do it, but the key is that this only works if RNN::Train() saves the existing parameters and starts from there when Train() is called
< rcurtin> yeah, that could work, but I think it could be done more efficiently---
< rcurtin> after you've trained one epoch, don't save the model, just evaluate on the validation set
< rcurtin> then save the accuracy
< rcurtin> if the accuracy is the best, you can save the model itself too (in case you want to do something with the best trained model later)
< partobs-mdp> So the pipeline is like this: train for $k$ epochs on the training set, evaluate validation-set and test-set accuracy; if this model is the best in terms of validation score, save it. Repeat the above, and report the test-set accuracy of the best model
< partobs-mdp> *Or, rather, save not the model, but its validation and test scores
< rcurtin> yeah, that seems reasonable to me
< rcurtin> if you are only interested in the validation and test scores, then there is no need to save the best model
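A hedged sketch of that pipeline, built around the outline above; the model, optimizer, data matrices, and the evaluate callback are placeholders for what lstm_baseline_main.cpp already sets up, and the key assumption is that RNN::Train() continues from the current parameters when called repeatedly:

```cpp
#include <armadillo>
#include <cstddef>

// Train one epoch at a time; after every epoch, score the model on the
// validation set and remember the test score of the best validation model.
template<typename ModelType, typename OptimizerType, typename EvalFunc>
double TrainWithValidation(ModelType& model,
                           OptimizerType& opt,
                           const arma::mat& trainX, const arma::mat& trainY,
                           const arma::mat& valX, const arma::mat& valY,
                           const arma::mat& testX, const arma::mat& testY,
                           const size_t numEpochs,
                           EvalFunc evaluate)
{
  double bestValScore = 0.0;
  double testScoreOfBest = 0.0;

  // One pass over the training set per Train() call.
  opt.MaxIterations() = trainX.n_cols;

  for (size_t epoch = 0; epoch < numEpochs; ++epoch)
  {
    model.Train(trainX, trainY, opt);

    const double valScore = evaluate(model, valX, valY);
    if (valScore > bestValScore)
    {
      bestValScore = valScore;
      // Only evaluate on the test set when validation improves (or copy the
      // model here instead, if the best model itself is needed later).
      testScoreOfBest = evaluate(model, testX, testY);
    }
  }

  return testScoreOfBest;  // Test accuracy of the best-on-validation model.
}
```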
< rcurtin> they say that each epoch consists of '1000 batches of size 50'
< partobs-mdp> Wow, that's weird. They use 1000 * 50 = 50000 samples just to train on CopyTask with maxLength = 10 (there are only 1024 sequences possible in this task)
< partobs-mdp> Does it even generalize?
< rcurtin> ah, do they do copytask? I only see reverse, search, merge, sort, and add
< zoq> Probably not :)
< partobs-mdp> My bad :)
< zoq> in the NTM paper
< partobs-mdp> Okay, but the reasoning is valid for ReverseTask
< rcurtin> partobs-mdp: the reverse task is given a sequence of 10-bit vectors, so I think that there are more than just 1024 possibilities there
< rcurtin> that would be 2^(10*m) possibilities for inputs, which is >50k for m=2
< partobs-mdp> My bad again - their tasks are indeed big :)
< rcurtin> but zoq is right, the copy task is done in the NTM paper; the setup is probably a bit different there (let me read it real quick)
< rcurtin> I don't see as many details in the NTM paper about the training of the network, but for the copy task, they are copying sequences of input vectors, not just a single input vector
shikhar has joined #mlpack
< rcurtin> zoq: the only thing I am concluding from Figure 3 is that their LSTM training doesn't have the big explosions, but I suspect that has to do with the optimizer they are using, which they don't provide many details about
< rcurtin> so I am not sure anyone can hope to reproduce their work exactly, but I think the HAM paper is much more feasible to reproduce
< rcurtin> or, maybe, did I overlook something? I have only managed to give these papers a fairly quick glance
< partobs-mdp> rcurtin: I only found this (in NTM paper):
< partobs-mdp> For all experiments, the RMSProp algorithm was used for training in the form described in (Graves, 2013) with momentum of 0.9.
< partobs-mdp> Oh, and they use *much* bigger LSTMs
< partobs-mdp> (256 units - or, at least I thought so)
< partobs-mdp> The learning rates they use are much smaller
< partobs-mdp> 1e-4 to 1e-5, in contrast to 1e-3 (the Adam default)
< zoq> they also clipped the gradient [-10, 10]
< rcurtin> ah, my quick search did not find section 4.6, thank you
< zoq> I think that is one thing we could also test besides what they did in the HAM paper.
< zoq> Anyway, we shouldn't concentrate too much on the copy task and the exact results; we already see good results for the copy task, so maybe we should see whether we can "fix" the add task and move on with the HAM implementation. What do you think?
< partobs-mdp> zoq: I also think it's a nice idea - I'll try to change the representation and see what happens
< partobs-mdp> zoq: However, we will still need to implement that "validation set tuning" pipeline - we have *way* too much variance right now, which may give us wrong impressions about more serious stuff (HAM?)
< zoq> partobs-mdp: agreed
< rcurtin> partobs-mdp: I would be interested to see if RMSprop also has the same explosion problem; the results from the NTM paper seem to suggest that it does not
< rcurtin> but, in either case, agreed, the validation set tuning pipeline is necessary to accurately compare with the HAM paper
kris1 has joined #mlpack
govg has quit [Ping timeout: 248 seconds]
partobs-mdp has quit [Remote host closed the connection]
kris1 has quit [Quit: kris1]
shikhar has quit [Quit: WeeChat 1.7]
sumedhghaisas has joined #mlpack
andrzejku has joined #mlpack
vivekp has quit [Ping timeout: 255 seconds]
vivekp has joined #mlpack
< rcurtin> ironstark: I think, if you're using Jenkins to test your commits, it might be quicker to commit a bunch of changes all at once and see if anything failed, vs. committing each change individually
< rcurtin> your call in the end, I just thought I would toss the idea out there :)
< ironstark> rcurtin: I am only doing that for the current PRs because they were all written and tested online. I won't be doing this for the new ones.
< rcurtin> do you mean that you are modifying the code in an online editor, then pushing before testing?
< sumedhghaisas> zoq, rcurtin: Hey Ryan, Marcus. Kinda need your help in understanding the BPTT we perform if you are free...
< zoq> sumedhghais: Hey, I'm sorry, I can't right now; you can ask your question here and we'll get back to you once we have a chance, or we can talk tomorrow.
< sumedhghaisas> It's kind of a big question, but it's not blocking me right now, so I'll ask you whenever you get free...
< rcurtin> sumedhghaisas: unfortunately I also can't help right now, I have a "surprise paper deadline"
< rcurtin> maybe it would be better to ask the question over email then, if it is a big question?
< sumedhghaisas> rcurtin: haha ... "surprise paper deadline"? sounds interesting...
< rcurtin> yeah, it was being prepared as a technical report, but then this morning we found a conference to submit it to... only problem is the deadline is tomorrow... short notice :)
< sumedhghaisas> yeah... I will send a mail instead.
< sumedhghaisas> ohh... godspeed :)
sumedhghaisas has quit [Quit: Ex-Chat]
sumedhghaisas has joined #mlpack
andrzejku has quit [Quit: Textual IRC Client: www.textualapp.com]
conrad_ has joined #mlpack