verne.freenode.net changed the topic of #mlpack to: http://www.mlpack.org/ -- We don't respond instantly... but we will respond. Give it a few minutes. Or hours. -- Channel logs: http://www.mlpack.org/irc/
kris1 has left #mlpack []
kris1 has joined #mlpack
kris1 has left #mlpack []
kris1 has joined #mlpack
kris1 has quit [Client Quit]
vivekp has quit [Ping timeout: 268 seconds]
vivekp has joined #mlpack
mikeling is now known as mikeling|brb
sumedhghaisas has quit [Ping timeout: 260 seconds]
mikeling|brb is now known as mikeling
sumedhghaisas has joined #mlpack
sumedhghaisas has quit [Ping timeout: 260 seconds]
micyril has joined #mlpack
vivekp has quit [Ping timeout: 255 seconds]
vivekp has joined #mlpack
partobs-mdp has joined #mlpack
andrzejk_ has joined #mlpack
micyril has quit [Quit: Page closed]
andrzejk_ has quit [Quit: Textual IRC Client: www.textualapp.com]
kris1 has joined #mlpack
kris1 has quit [Quit: kris1]
shikhar has joined #mlpack
kris1 has joined #mlpack
kris1 has quit [Quit: kris1]
govg_ has joined #mlpack
govg_ has quit [Quit: leaving]
kris1 has joined #mlpack
shikhar has quit [Read error: Connection reset by peer]
shikhar has joined #mlpack
< partobs-mdp> zoq: I've read your comment on the task API pull request. I found a bug in the task instance generator: it was emitting numbers starting from the *most* significant bit, while the HAM paper reports results with the representation starting from the *least* significant bit.
vivekp has quit [Ping timeout: 240 seconds]
< partobs-mdp> zoq: So, in essence, this became an Add+Reverse task. As the HAM paper reports, vanilla LSTM _miserably_ failed the Reverse task.
< partobs-mdp> zoq: If this matters, the paper uses considerably more data for training - 1000 batches * 50 samples each
< partobs-mdp> zoq: Oh, and another thing: they used x-entropy loss, not MSE loss.
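(For context, a small sketch of what least-significant-bit-first emission means; the helper name and Armadillo types here are just illustrative, not code from the task generator:)

    #include <armadillo>

    // Encode a nonnegative integer as a fixed number of bits, least significant
    // bit first, e.g. 6 -> {0, 1, 1} rather than the MSB-first {1, 1, 0}.
    arma::colvec BinaryLSBFirst(size_t value, const size_t bits)
    {
      arma::colvec code(bits, arma::fill::zeros);
      for (size_t i = 0; i < bits; ++i, value >>= 1)
        code(i) = value & 1;
      return code;
    }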
< zoq> partobs-mdp: Besides starting from the *least* significant bit, I'm not sure they used a binary encoding for the special symbols (+,=) as you proposed.
< zoq> Also, I think using more data for training especially for longer sequences could improve the results, but that is easy to test.
< zoq> Perhaps a good idea to switch to cross entropy, yes.
< partobs-mdp> zoq: In progress - I'm implementing a CrossEntropyError layer right now.
< partobs-mdp> zoq: The linking is still running. In the meantime, could you take a glance at the current implementations and tell me if they look right to you?
< partobs-mdp> zoq: The error evaluation: return -arma::sum(target * arma::log(input) + (1. - target) * arma::log(1. - input));
< partobs-mdp> zoq: The gradient evaluation: output = (1. - target) / (1. - input) - target / input;
vivekp has joined #mlpack
< zoq> partobs-mdp: you can also write the gradient more compactly as (input - target) / ((1.0 - input) * input)
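(For context, a minimal sketch of such a loss layer, using element-wise products (%) and divisions; the class name and signatures are illustrative rather than the exact mlpack layer API:)

    #include <armadillo>

    // Illustrative binary cross-entropy loss; input and target are assumed to
    // hold values in (0, 1).
    class CrossEntropyError
    {
     public:
      // Forward pass: the summed cross-entropy of the predictions.
      double Forward(const arma::mat& input, const arma::mat& target)
      {
        return -arma::accu(target % arma::log(input) +
            (1.0 - target) % arma::log(1.0 - input));
      }

      // Backward pass: derivative of the loss with respect to the input, which
      // simplifies to (input - target) / (input % (1 - input)).
      void Backward(const arma::mat& input, const arma::mat& target,
                    arma::mat& output)
      {
        output = (1.0 - target) / (1.0 - input) - target / input;
      }
    };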
< partobs-mdp> zoq: Keep getting this message while trying to build mlpack/models fork:
< partobs-mdp> zoq: error: ‘CrossEntropyError’ was not declared in this scope
< partobs-mdp> No idea what happened - the declaration was more or less copy-pasted from MeanSquaredError
< partobs-mdp> Of course, I added the new files to CMakeLists.txt in the layer/ directory
< zoq> Have you added the new layer in layer_types.hpp and layer.hpp?
< zoq> ah just layer_types.hpp
< partobs-mdp> zoq: Oops :) I didn't add it there, so I'll try that and report the results in a few minutes
< zoq> okay
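(For reference, registering a new layer roughly means forward-declaring it in layer_types.hpp and adding its pointer type to the LayerTypes variant; this excerpt is illustrative and heavily abbreviated:)

    // layer_types.hpp (abbreviated): forward declaration of the new layer ...
    template<typename InputDataType, typename OutputDataType>
    class CrossEntropyError;

    // ... and its entry in the variant so the layer visitors can dispatch to it.
    using LayerTypes = boost::variant<
        Add<arma::mat, arma::mat>*,
        Linear<arma::mat, arma::mat>*,
        /* ... all other layers ... */
        CrossEntropyError<arma::mat, arma::mat>*
    >;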
< kris1> The Reset function in FFN now just initializes the parameters of the network using the initialization rule. It does not reset the individual layers.
< zoq> kris1: The NetworkInitialization class that i used for the parameter initialization calls the Reset function: https://github.com/mlpack/mlpack/blob/master/src/mlpack/methods/ann/init_rules/network_init.hpp#L94
< partobs-mdp> zoq: Implemented the CrossEntropyError layer. Getting slightly worse performance than with MeanSquaredError. The problem I noticed is that the objective sometimes "explodes", like this: 6.6 -> 6.0 -> 12.6 -> >500
< partobs-mdp> zoq: Clipping the gradient to the range [-5, +5] only makes the situation worse (at 125 epochs) - it gets only 20% vs. ~85% at maxLen = 10
< partobs-mdp> zoq: Right now testing on 500 epochs
< zoq> partobs-mdp: which task?
< partobs-mdp> zoq: CopyTask
< zoq> partobs-mdp: hm, maybe you can test another initialization method, like Gaussian
< partobs-mdp> zoq: Can you outline how to change the initialization method?
< partobs-mdp> zoq: The problem meanwhile is getting worse - even though the model gets really low objective values (~0.3), it then explodes to completely insane values (~2k, 4x the objective before optimization)
< zoq> partobs-mdp: instead of doing RNN<> you can use: RNN<NegativeLogLikelihood<>, GaussianInitialization>
< zoq> Another idea is to change the way you train - I made a comment on that in the models repo; right now you train a single sample x times in a row.
< zoq> I think an easy solution would be to provide an interface to train on arma::field or to train on arma::mat.
< partobs-mdp> zoq: But right now the code uses fixed-length arma::mat objects for training. I also use optimizer parameters to set the epoch count. Shouldn't it already train properly?
< partobs-mdp> zoq: error: ‘GaussianInitialization’ was not declared in this scope
< zoq> partobs-mdp: You have to include gaussian_init.hpp from init_rules.
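(A minimal usage sketch under those suggestions; the layer sizes and rho below are placeholders, not values from the actual task code:)

    #include <mlpack/core.hpp>
    #include <mlpack/methods/ann/layer/layer.hpp>
    #include <mlpack/methods/ann/rnn.hpp>
    #include <mlpack/methods/ann/init_rules/gaussian_init.hpp>

    using namespace mlpack::ann;

    int main()
    {
      const size_t inputSize = 2, hiddenSize = 40, outputSize = 2, rho = 10;

      // Recurrent model with Gaussian weight initialization instead of the
      // default initialization rule.
      RNN<NegativeLogLikelihood<>, GaussianInitialization> model(rho);
      model.Add<Linear<> >(inputSize, hiddenSize);
      model.Add<LSTM<> >(hiddenSize, hiddenSize, rho);
      model.Add<Linear<> >(hiddenSize, outputSize);
      model.Add<SigmoidLayer<> >();
    }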
< zoq> partobs-mdp: You already use fixed length arma::mat for training? let me check the code
< zoq> partobs-mdp: Ah I see, in this case nevermind, my bad.
< partobs-mdp> zoq: Managed to compile that, right now evaluating the new model
< zoq> okay, let's see if that improves the results
< partobs-mdp> zoq: so far nothing explodes - waiting for optimizer to get into <1 objective zone :)
< partobs-mdp> zoq: 25k iterations, the explosions are back :( Bumped from steady ~16 to >500
< zoq> you already use gradient clipping right?
< partobs-mdp> yes, I clip to [-5, +5]
< partobs-mdp> maybe I should clip to something smaller - or try to tune the LR
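(The clipping being discussed amounts to something like the following; ClipGradient is a hypothetical helper name, and where it gets called from is exactly the "dirty" part mentioned further below:)

    #include <armadillo>

    // Clamp every gradient entry to [-limit, +limit] before the parameter
    // update; limit = 5 corresponds to the [-5, +5] range above.
    inline void ClipGradient(arma::mat& gradient, const double limit)
    {
      gradient = arma::clamp(gradient, -limit, limit);
    }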
< zoq> hm, yeah, I was thinking of switching Adam for MiniBatchSGD.
< partobs-mdp> [INFO ] RNN::RNN(): final objective of trained model is 0.490519.
< partobs-mdp> [INFO ] Final score: 0.304688
< partobs-mdp> Overfit?
< partobs-mdp> If so, what if I add dropout to my model?
< zoq> Worth a test, at the end of the network?
< partobs-mdp> zoq: Or even after each ReLU?
< partobs-mdp> *after each nonlinearity?
< zoq> Maybe you can also increase the sample size?
< rcurtin> I am watching quietly, partobs-mdp: do you mean that even with the [-5,+5] gradient clipping that the objective explodes then re-converges to 0.490519?
< zoq> hm, yeah we have to be careful with the dropout rate.
< partobs-mdp> rcurtin: yes, it explodes all the way up to ~2k xent-loss, and then goes down to 0.5 (even 2-3 time)
< partobs-mdp> *times
< zoq> partobs-mdp: Maybe you can push the cross entropy layer? Maybe you missed something there?
< rcurtin> yeah, that is what I was thinking
< partobs-mdp> zoq: Sure, but it's only fair to warn you: I added the gradient clipping *there* (I know, it's horribly dirty, but we can fix that once we get something working)
< partobs-mdp> clipping to [-2, +2] seems hopeless
< partobs-mdp> Pushed. Added two dropout layers - it kind of optimizes the objective, but it's doing it *real* slow.
< partobs-mdp> Maybe learning rate = 1e-3 is too small here?
< partobs-mdp> By the way, how do I set dropout rate from model.Add()?
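(For reference: the Dropout layer's constructor takes the drop ratio, which Add() forwards, so something like the line below should do it; 0.3 is just a placeholder value:)

    // Dropout layer that zeroes roughly 30% of its inputs during training.
    model.Add<Dropout<> >(0.3);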
< rcurtin> partobs-mdp: if I am understanding this right, there have to be some restrictions on the input and target for the cross entropy layer---the input/target values must be in [0, 1], right?
< rcurtin> my best guess is that if this is happening in a later layer of the network, then 'input' is getting closer and closer to either 0 or 1 and this causes instability in the Backward() calculation
< rcurtin> that would also cause instability in Forward() I guess, because log(0) = -Inf
< zoq> yeah, maybe you could use: trunc_log
< zoq> I guess target / input is also not safe, since input could be zero? so target / (input + eps)?
< rcurtin> I would have thought that clipping to small values would solve this type of problem though---I think that adding epsilon to the denominator effectively does the same
< rcurtin> or, perhaps, is the clipping just causing the objective to explode more slowly? (in which case it would be working correctly, I guess)
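(A sketch of the guarded variants being suggested: trunc_log() avoids -inf for zero arguments, and a small eps keeps the gradient's denominators away from zero. The value of eps is arbitrary here:)

    #include <armadillo>

    const double eps = 1e-10;

    // In Forward(), arma::log(...) becomes arma::trunc_log(...); in Backward(),
    // the denominators get a small offset so they can never reach zero:
    void Backward(const arma::mat& input, const arma::mat& target,
                  arma::mat& output)
    {
      output = (1.0 - target) / (1.0 - input + eps) - target / (input + eps);
    }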
kris1 has quit [Quit: kris1]
< partobs-mdp> zoq: rcurtin: Added 1e-2 to denominator in the gradient computation, used trunc_log, still getting explosions
< rcurtin> hmm, can you try intercepting when these explosions happen and inspecting what exactly the cause of the explosion is?
< rcurtin> I guess you could do this with a debugger like gdb and catch when the objective gets very large, or when the gradient gets very large, or something like this
< rcurtin> I think the first thing to do is track down exactly what is causing the gradient to explode (unfortunately that could be very time-consuming); maybe zoq has a better idea?
kris1 has joined #mlpack
< zoq> I agree, it's time-consuming, but in the end we get some insights that could be helpful
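(One low-tech way to catch the moment things blow up, as a sketch: a hypothetical helper that flags suspicious objective or gradient values, which a breakpoint can then be set on. The threshold is arbitrary:)

    #include <cmath>
    #include <mlpack/core.hpp>

    // Returns true (and logs a warning) when the objective or gradient looks
    // like it is exploding, so the offending iteration can be inspected in gdb.
    inline bool Exploded(const double objective, const arma::mat& gradient,
                         const double threshold = 1e3)
    {
      const bool bad = !std::isfinite(objective) ||
          std::abs(objective) > threshold ||
          !gradient.is_finite() ||
          arma::abs(gradient).max() > threshold;

      if (bad)
        mlpack::Log::Warn << "objective/gradient exceeded " << threshold << "\n";

      return bad;
    }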
kris1_ has joined #mlpack
kris1 has quit [Ping timeout: 260 seconds]
kris1_ is now known as kris1
< shikhar> rcurtin: A quick question regarding templatizing the gradient parameter for parallel SGD. I found that logistic regression, regularized SVD, NCA, RNN and FFN implement the DecomposableFunctionType interface in their Gradient function.
< shikhar> Should I go about changing them all? I don't see any issues otherwise.
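(The change being discussed is roughly this kind of signature generalization, sketched for one DecomposableFunctionType; whether arma::sp_mat or something else becomes the concrete sparse type is up to the parallel SGD design:)

    // Current decomposable form: the i-th partial gradient is always dense.
    void Gradient(const arma::mat& parameters,
                  const size_t i,
                  arma::mat& gradient);

    // Templatized form: the caller picks the gradient type, so parallel SGD can
    // request a sparse gradient (e.g. arma::sp_mat) while dense callers keep
    // passing arma::mat.
    template<typename GradType>
    void Gradient(const arma::mat& parameters,
                  const size_t i,
                  GradType& gradient);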
kris1 has quit [Ping timeout: 260 seconds]
kris1 has joined #mlpack
partobs-mdp has quit [Remote host closed the connection]
kris1 has quit [Ping timeout: 260 seconds]
mikeling has quit [Quit: Connection closed for inactivity]
shikhar has quit [Quit: WeeChat 1.7]
kris1 has joined #mlpack
< kris1> FFN2 -> FFN1: I want to pass the delta and gradient from FFN1 to FFN2. So I would basically have to do ffn2.outlayertype.Delta() = ffn1.network().front().Delta(), but I am confused about how I would pass the gradient values?
< zoq> kris1: The same way should work for the Gradient.
sumedhghaisas has joined #mlpack
< kris1> I have a hard time understanding this. So if I set the gradients of the output layer of ffn2 to the gradient from ffn1's front layer and then call the ffn2.Gradient() function, will this equal backpropagation through the whole combined network ffn2 -> ffn1?
< kris1> This is a very rough implementation of the GAN idea I am talking about; have a look if possible: https://gist.github.com/kris-singh/fb455fa809634bc8f1afc2872407352d
hello_ has joined #mlpack
hello_ has quit [Client Quit]
< zoq> kris1: So if I understand you right, you would like to share the delta and gradient parameters?
< kris1> Well, yes, kind of. I want to pass the gradients and delta from ffn1 to ffn2, so this is not exactly sharing. I want to backpropagate through both FFN1 and FFN2 combined.
< zoq> But if you just want to backpropagate through FFN1 and afterwards through FFN2 using the error (delta) from FFN1, why do you need the gradients? Maybe I missed something?
< kris1> Hmm, yes, sorry. I just need to pass the errors, and those would be multiplied by the local gradients.
< kris1> Could you have a look at the gist https://gist.github.com/kris-singh/fb455fa809634bc8f1afc2872407352d. Just wanted to get your thoughts on it.
< zoq> So yes, something like generator.outputlayer.Delta() = discriminator.network.front().Delta(); should work. In generator.Gradient() we have to make sure that this delta is used, but besides that it looks good.
< zoq> ah you call the Gradient function from the FFN class right?
< zoq> In this case, it should work right away.
< kris1> Yes the Gradient function from the FFN class.
< kris1> Okay thanks i will test it out then on a simple task.
< zoq> Okay, let us know if you run into any problems.
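(A pseudocode-level sketch of the chaining agreed on above; every accessor and call name here is hypothetical shorthand for however the members are actually reached, e.g. via the ann visitors, not the exact FFN API:)

    // 1) Forward a generated sample through both networks (hypothetical calls).
    generator.Forward(noise, fakeSample);
    discriminator.Forward(fakeSample, prediction);

    // 2) Backward through the discriminator; its front layer's delta is now the
    //    error with respect to the generator's output.
    discriminator.Backward();

    // 3) Hand that delta to the generator as its output error, then compute the
    //    generator's gradients. Together this amounts to backpropagating through
    //    the combined discriminator -> generator stack.
    generator.OutputLayerDelta() = discriminator.FrontLayerDelta();
    generator.Gradient();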
andrzejk_ has joined #mlpack
shikhar has joined #mlpack
andrzejk_ has quit [Quit: Textual IRC Client: www.textualapp.com]
shikhar has quit [Quit: WeeChat 1.7]