verne.freenode.net changed the topic of #mlpack to: http://www.mlpack.org/ -- We don't respond instantly... but we will respond. Give it a few minutes. Or hours. -- Channel logs: http://www.mlpack.org/irc/
sumedhghaisas has quit [Remote host closed the connection]
kris1 has quit [Quit: kris1]
partobs-mdp has joined #mlpack
< partobs-mdp>
zoq: rcurtin: Opening the debugging session. I've noticed that the objective blow-ups happen together with gradient blow-ups (Captain Obvious) - the gradient norm goes from 10~20 to 200+. The weird part: the maximum gradient element is consistently above 5 (does clipping even work?)
< partobs-mdp>
Before the blow-ups, the maximum gradient element reaches some insane values (~25-30)
< partobs-mdp>
As an idea: what if doing the gradient clipping inside the layer's gradient computation was a mistake? The goal is to clip the gradient w.r.t. the network weights, not w.r.t. the error layer's inputs.
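A minimal sketch of the clipping idea described here, assuming it is applied to the flattened gradient w.r.t. the network weights right before the optimizer's update step; the helper name and the clip value are placeholders, not mlpack's actual clipping code:

    #include <armadillo>

    // Element-wise clipping to [-clip, clip], applied to the gradient of the
    // loss w.r.t. the network weights (the gradient the optimizer sees).
    // With this in place, no gradient element can exceed `clip` in magnitude.
    inline void ClipWeightGradient(arma::mat& gradient, const double clip = 5.0)
    {
      gradient = arma::clamp(gradient, -clip, clip);
    }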
kris1 has joined #mlpack
< partobs-mdp>
Well, the result has improved from 35% to 50%. Next I'll try clipping the gradient to smaller values ([-2, +2]) and adding some dropout - *now* it looks like it's going to work :)
kris1 has quit [Quit: kris1]
< partobs-mdp>
The explosions are still there. My take on it: no regularization -> the model goes to some high-likelihood, low-measure spikes -> overfitting
< partobs-mdp>
The gradient clipping mitigated it somewhat, because the SGD steps became less sharp
< partobs-mdp>
However, it still went to those "overfitting" areas
< partobs-mdp>
So Dropout looks like the only good idea
kris1 has joined #mlpack
< partobs-mdp>
Well, even with Dropout added, it still converges to worse parameter values than MeanSquaredError does on all tasks
< partobs-mdp>
My take: MeanSquaredError implicitly clipped the gradient (gradient = input - target, so all its elements are <= 1 in absolute value)
< partobs-mdp>
I think it's time to switch back to the MeanSquaredError
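The implicit-clipping point can be checked with a couple of lines; a standalone sketch (not mlpack's layer code), assuming a sigmoid output p in (0, 1) and a binary target t:

    // For MSE (as used here), dL/dp = p - t, so every element stays in [-1, 1].
    // For binary cross-entropy, dL/dp = (p - t) / (p * (1 - p)), which blows up
    // as p approaches 0 or 1.
    #include <armadillo>

    int main()
    {
      arma::vec p = {0.001, 0.5, 0.999};  // network outputs
      arma::vec t = {1.0,   1.0, 0.0};    // targets

      arma::vec mseGrad = p - t;                      // bounded
      arma::vec ceGrad  = (p - t) / (p % (1.0 - p));  // roughly -1000, -2, 1000 here

      mseGrad.print("MSE gradient:");
      ceGrad.print("Cross-entropy gradient:");
      return 0;
    }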
< zoq>
[INFO ] RNN::RNN(): final objective of trained model is 151.434.
< zoq>
[INFO ] Running evaluation loop.
< zoq>
[INFO ] Final score: 0.0234375
< zoq>
Final score: 0.0234375
< zoq>
but I think we should use many more samples here
< partobs-mdp>
zoq: Whoa, I still get 7% even with DEBUG=OFF
< partobs-mdp>
Unreproducible bug?
< partobs-mdp>
zoq: Ouch, I wrote "sort" instead of "copy"
< zoq>
we shouldn't expect to see the same results, since we use different initial weights and also shuffle the samples
< zoq>
okay, let me run copy
< partobs-mdp>
By the way, that gradient explosion problem is not specific to CrossEntropyLayer - I get a similar effect with MeanSquaredError (running the copy task)
< zoq>
strange
< zoq>
[INFO ] Final score: 0.703125
< zoq>
I guess we could test whether the GRU layer performs better
< rcurtin>
is the training being done the same way on both systems? or is one setup training in batch on many samples, and the other training on only one sample like in the models#1 PR?
< partobs-mdp>
I've got 0.75 on MSE
< partobs-mdp>
*precision
< zoq>
I used the script from the models PR
< zoq>
so we get basically the same results here, right?
< partobs-mdp>
now - yes. But the results you got from the script were different (~84%)
< zoq>
it would be interesting to see if you get about the same results with CE
< partobs-mdp>
Can this be explained by random initialization?
< zoq>
There are multiple factors: weights, shuffle
< zoq>
rcurtin: for the last results we both just used: bin/mlpack_lstm_baseline -t copy -e 500 -b 2 -l 10 -r 1 -s 128 -v
< rcurtin>
ok, which if I understand right only trains on a single point?
< zoq>
I think it uses the fixed representation (arma::mat), so it should train over all points
< rcurtin>
ah sorry you are right
< rcurtin>
ok, I'm at my desk now, let me see what I can reproduce
< rcurtin>
[INFO ] RNN::RNN(): final objective of trained model is 0.264269.
< rcurtin>
[INFO ] Final score: 0.804688
< rcurtin>
and that's for $ bin/mlpack_lstm_baseline -t copy -e 500 -b 2 -l 10 -r 1 -s 128 -v
< rcurtin>
I noticed that during optimization, I did see what I think is the same 'gradient explosion' problem
< rcurtin>
the objective would get down under 1, then increase to ~100, then work its way back down
< rcurtin>
in each case though, the optimizer was able to recover
< rcurtin>
the objective got as low as 1e-5 (around SGD iteration 56000), but then exploded to ~500 and worked back down to the final objective of 0.264269
< rcurtin>
let me try a few runs, to see what kind of variance I get
< rcurtin>
in order to do that, I'll have to add 'mlpack::math::RandomSeed(std::time(NULL));' to the start of the program
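For reference, the seeding mentioned here would sit at the top of lstm_baseline_main.cpp; everything besides the RandomSeed() call is a placeholder:

    #include <mlpack/core.hpp>
    #include <ctime>

    int main(int argc, char** argv)
    {
      // Seed mlpack's RNG from the clock so that repeated runs get different
      // initial weights and a different sample shuffle.
      mlpack::math::RandomSeed(std::time(NULL));

      // ... the rest of lstm_baseline_main.cpp ...
      return 0;
    }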
shikhar has quit [Quit: WeeChat 1.7]
< zoq>
rcurtin: okay, I guess we should test the GRU layer and see if we encounter the same gradient explosion
< rcurtin>
I'm seeing a lot of variance, and whether or not I get good performance seems directly related to whether or not the gradient exploded right before the end of the run
< rcurtin>
for instance, the last run I did, the gradient exploded at iteration ~63k and did not recover before optimization was over, and I got:
< rcurtin>
[INFO ] RNN::RNN(): final objective of trained model is 206.974.
< rcurtin>
[INFO ] Final score: 0.046875
< partobs-mdp>
By the way, I finally managed to reproduce the results in CopyTask
< partobs-mdp>
[INFO ] Final score: 0.898438
< partobs-mdp>
CopyTask, maxLen = 10
< rcurtin>
partobs-mdp: can you add 'mlpack::math::RandomSeed(std::time(NULL));' to the top of lstm_baseline_main.cpp and try running it again?
< rcurtin>
I am wondering if the random seed that you compiled with has bad luck and just happens to cause the gradient to explode right at the end of optimization
< partobs-mdp>
Well, I think I managed to get the experiment right. Now the only real discrepancy is AddTask, so I should only try to change the representation, right?
< partobs-mdp>
zoq: But aren't we trying to reproduce the LSTM experiment? The paper reported some nice results for LSTM
< partobs-mdp>
By the way, I'm rerunning AddTask on a sample of 1024 sequences
< rcurtin>
partobs-mdp: zoq: here are the results for 5 runs of the copy task with different random seeds
< rcurtin>
run 1: minimum objective 0.017164 (iteration 57601); final objective 0.0582562, final score 0.734375
< rcurtin>
run 2: minimum objective 6.65353e-05 (iteration 48385); final objective 0.0390115; final score 0.6875
< rcurtin>
run 3: minimum objective 0.0440252 (iteration 63617); final objective 206.974; final score 0.046875
< rcurtin>
run 4: minimum objective 0.0096776 (iteration 64000); final objective 0.0096776; final score 0.570312
< rcurtin>
run 5: minimum objective 0.00414738 (iteration 39937); final objective 1.00561; final score 0.703125
< rcurtin>
with what is currently in your branch, which is MSE
< rcurtin>
I want to see what happens if I run training for a lot longer... I'll give results in a while (it'll take a long time, each run as it is takes 5m)
< rcurtin>
I guess my current thought is that optimizers can be picky and sensitive to initial conditions
< rcurtin>
the gradient explosions I am seeing don't concern me *too* much; I am not sure something is wrong in the implementation. Instead, I suspect that since the optimizer is able to recover, this is just kind of a "thing that happens" with this problem
< rcurtin>
it seems the optimization surface is very tricky
kris1 has quit [Quit: kris1]
< partobs-mdp>
I'll post my results (same task, but using CrossEntropyError) - so far only 1 run, I'll report more about next runs later
< partobs-mdp>
[INFO ] Final score: 0.398438
< rcurtin>
can I have a link to the paper we are comparing against? I can't find it quickly in the code, but you probably have the URL easily at hand :)
< zoq>
Probably; if I remember right, the NTM authors report that the model does not converge on every run.
< partobs-mdp>
From the paper: "When the training is finished, we select the model parameters that gave the lowest error rate on validation batches and report the error using these parameters on fresh 2,500 random examples."
< rcurtin>
I think our implementation of Adam is not minibatch currently, and minibatches could help avoid some of the bad steps that cause the explosions
< rcurtin>
partobs-mdp: ah! there is what I was looking for
< rcurtin>
so I guess, they are taking the best model from the 100 epochs of training then
< partobs-mdp>
Well, then it looks like we've conducted the experiment correctly all this time :)
< rcurtin>
I wondered if this could give better results---if you took the model just before the gradient exploded, I bet it would perform a lot better
< rcurtin>
to modify the lstm_baseline_main.cpp code to do that, I guess we would need to save the model at each epoch and then evaluate on a validation set
< partobs-mdp>
rcurtin: But how do we do that without modifying SGD optimizer code?
< rcurtin>
partobs-mdp: hm, if Train() starts from the existing set of parameters, you can train one epoch at a time then copy the model
< rcurtin>
i.e. for (size_t e = 0; e < epochs; ++e) {
< partobs-mdp>
rcurtin: But even then we have to modify RNN::Train(), right?
< rcurtin>
opt.MaxIterations() = numSamples;
< rcurtin>
m.Train(data, labels, opt); // this is not exactly right, I am just guessing as I sketch this up
< rcurtin>
oldModels.push_back(m); // for some std::vector<...> oldModels
< rcurtin>
}
< partobs-mdp>
or even opt.MaxIterations() = 100 * numSamples;
< partobs-mdp>
And then we run Evaluate on the entire model pool, right?
< rcurtin>
that would be one way to do it, but the key is that this only works if RNN::Train() keeps the existing parameters and starts from there when Train() is called
< rcurtin>
yeah, that could work, but I think it could be done more efficiently---
< rcurtin>
after you've trained one epoch, don't save the model, just evaluate on the validation set
< rcurtin>
then save the accuracy
< rcurtin>
if the accuracy is the best, you can save the model itself too (in case you want to do something with the best trained model later)
< partobs-mdp>
So the pipeline is like this: train for $k$ epochs on the training set, evaluate the validation set accuracy and the test set accuracy; if this model is the best in terms of validation score, save it. Repeat the above and report the test set accuracy of the best model
< partobs-mdp>
*Or, rather, save not the model, but its validation and test scores
< rcurtin>
yeah, that seems reasonable to me
< rcurtin>
if you are only interested in the validation and test scores, then there is no need to save the best model
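A sketch of the pipeline agreed on above, assuming (as rcurtin notes) that Train() resumes from the model's current parameters and that the optimizer's MaxIterations() can be set to one pass over the training data; every name below is a placeholder rather than actual lstm_baseline code, and `evaluate` stands in for the task's scoring routine:

    #include <cstddef>

    template<typename ModelType, typename OptimizerType, typename MatType,
             typename EvaluateFn>
    double TrainWithValidation(ModelType& model, OptimizerType& opt,
                               const MatType& trainData, const MatType& trainLabels,
                               const MatType& valData, const MatType& valLabels,
                               const MatType& testData, const MatType& testLabels,
                               EvaluateFn evaluate,
                               const size_t epochs, const size_t numSamples)
    {
      double bestValScore = 0.0;
      double testScoreOfBest = 0.0;

      // One epoch per Train() call.
      opt.MaxIterations() = numSamples;

      for (size_t e = 0; e < epochs; ++e)
      {
        model.Train(trainData, trainLabels, opt);

        // Score on the validation set; keep the test score (or a copy of the
        // model) only when the validation score improves.
        const double valScore = evaluate(model, valData, valLabels);
        if (valScore >= bestValScore)
        {
          bestValScore = valScore;
          testScoreOfBest = evaluate(model, testData, testLabels);
        }
      }

      // Report the test accuracy of the best-on-validation model.
      return testScoreOfBest;
    }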
< rcurtin>
they say that each epoch consists of '1000 batches of size 50'
< partobs-mdp>
Wow, that's weird. They use 1000 * 50 = 50000 samples just to train on AddTask with maxLength = 10 (there are only 1024 sequences possible in this task)
< partobs-mdp>
*CopyTask
< partobs-mdp>
Does it even generalize?
< rcurtin>
ah, do they do copytask? I only see reverse, search, merge, sort, and add
< zoq>
Probably not :)
< partobs-mdp>
My bad :)
< zoq>
in the NTM paper
< partobs-mdp>
Okay, but the reasoning is valid for ReverseTask
< rcurtin>
partobs-mdp: the reverse task is given a sequence of 10-bit vectors, so I think that there are more than just 1024 possibilities there
< rcurtin>
that would be 2^(10*m) possibilities for inputs, which is >50k for m=2
< partobs-mdp>
My bad again - their tasks are indeed big :)
< rcurtin>
but zoq is right, the copy task is done in the NTM paper; the setup is probably a bit different there (let me read it real quick)
< rcurtin>
I don't see as many details in the NTM paper about the training of the network, but for the copy task, they are copying sequences of input vectors, not just a single input vector
shikhar has joined #mlpack
< rcurtin>
zoq: the only thing I am concluding from Figure 3 is that their LSTM training doesn't have the big explosions, but I suspect that has to do with the optimizer they are using, which they don't provide many details about
< rcurtin>
so I am not sure anyone can hope to reproduce their work exactly, but I think the HAM paper is much more feasible to reproduce
< rcurtin>
or, maybe, did I overlook something? I have only managed to give these papers a fairly quick glance
< partobs-mdp>
rcurtin: I only found this (in NTM paper):
< partobs-mdp>
For all experiments, the RMSProp algorithm was used for training in the form described in (Graves, 2013) with momentum of 0.9.
< partobs-mdp>
Oh, and they use *much* bigger LSTMs
< partobs-mdp>
(256 units - or, at least I thought so)
< partobs-mdp>
The learning rates they use are much smaller
< partobs-mdp>
1e-4 ~ 1e-5 - in contrast to 1e-3 (default of Adam)
< zoq>
they also clipped the gradient [-10, 10]
< rcurtin>
ah, my quick search did not find section 4.6, thank you
< zoq>
I think that is one thing we could also test besides what they did in the HAM paper.
< zoq>
Anyway, we shouldn't concentrate too much on the copy task and the exact results; we already see good results for the copy task, so maybe we should see if we can "fix" the add task and move on with the HAM implementation. What do you think?
< partobs-mdp>
zoq: I also think it's a nice idea - I'll try to change the representation and see what happens
< partobs-mdp>
zoq: However, we will still need to implement that "validation set tuning" pipeline - we have *way* too much variance right now, which may give us wrong impressions about more serious stuff (HAM?)
< zoq>
partobs-mdp: agreed
< rcurtin>
partobs-mdp: I would be interested to see if RMSprop also has the same explosion problem; the results from the NTM paper seem to suggest that it does not
< rcurtin>
but, in either case, agreed, the validation set tuning pipeline is necessary to accurately compare with the HAM paper
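To make the RMSProp comparison concrete, here is one reading of the update the NTM paper points to (RMSProp in the form described in Graves 2013, with momentum 0.9, a small step size, and gradients clipped to [-10, 10]); the decay constants, epsilon, and the function itself are assumptions for illustration, not code from the baseline or from mlpack:

    #include <armadillo>

    // One RMSProp-with-momentum step on a flattened weight matrix.  The state
    // matrices should be zero-initialized by the caller and have the same size
    // as `weights`.
    void RMSPropStep(arma::mat& weights,
                     arma::mat gradient,     // d(loss)/d(weights)
                     arma::mat& meanSquare,  // running mean of squared gradients
                     arma::mat& mean,        // running mean of gradients
                     arma::mat& update,      // accumulated (momentum) update
                     const double stepSize = 1e-4)
    {
      // Clip as in the NTM setup mentioned above.
      gradient = arma::clamp(gradient, -10.0, 10.0);

      meanSquare = 0.95 * meanSquare + 0.05 * arma::square(gradient);
      mean       = 0.95 * mean       + 0.05 * gradient;

      // Momentum 0.9; the small constant keeps the denominator away from zero.
      update = 0.9 * update
          - stepSize * gradient / arma::sqrt(meanSquare - arma::square(mean) + 1e-4);

      weights += update;
    }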
kris1 has joined #mlpack
govg has quit [Ping timeout: 248 seconds]
partobs-mdp has quit [Remote host closed the connection]
kris1 has quit [Quit: kris1]
shikhar has quit [Quit: WeeChat 1.7]
sumedhghaisas has joined #mlpack
andrzejku has joined #mlpack
vivekp has quit [Ping timeout: 255 seconds]
vivekp has joined #mlpack
< rcurtin>
ironstark: I think, if you're using Jenkins to test your commits, it might be quicker to commit a bunch of changes all at once and see if anything failed, vs. committing each change individually
< rcurtin>
your call in the end, I just thought I would toss the idea out there :)
< ironstark>
rcurtin: I am doing that only for the current PRs because they were all tested and made online itself. I won't be doing this for the new ones.
< rcurtin>
do you mean that you are modifying the code in an online editor, then pushing before testing?
< sumedhghaisas>
zoq, rcurtin: Hey Ryan, Marcus. Kinda need your help in understanding the BPTT we perform if you are free...
< zoq>
sumedhghais: Hey, I'm sorry, I can't right now; you can ask your question here and we'll get back to you once we have a chance, or we can talk tomorrow.
< sumedhghaisas>
it's kind of a big question. But it's not stopping me right now, so I will ask you whenever you get free...
< rcurtin>
sumedhghaisas: unfortunately I also can't help right now, I have a "surprise paper deadline"
< rcurtin>
maybe it would be better to ask the question over email then, if it is a big question?
< sumedhghaisas>
rcurtin: haha ... "surprise paper deadline"? sounds interesting...
< rcurtin>
yeah, it was being prepared as a technical report, but then this morning we found a conference to submit it to... only problem is the deadline is tomorrow... short notice :)
< sumedhghaisas>
yeah... I will send a mail instead.