ChanServ changed the topic of #mlpack to: "mlpack: a fast, flexible machine learning library :: We don't always respond instantly, but we will respond; please be patient :: Logs at http://www.mlpack.org/irc/"
mistiry has joined #mlpack
Hunterkll has joined #mlpack
ImQ009 has joined #mlpack
favre49 has joined #mlpack
< favre49> Having some issues while running make - getting this error: make[3]: *** No rule to make target '../src/mlpack/core/util/cli.hpp', needed by 'src/mlpack/tests/cotire/mlpack_test_CXX_prefix.hxx.gch'. Stop.
< zoq> favre49: deleting the build folder should solve the issue
< favre49> Great, that seems to have fixed it. Thanks
< KimSangYeon-DGU[> kartikdutt18: sakshamb189 Hey, are you there?
< sakshamb189[m]> yes I am here
< KimSangYeon-DGU[> Great
< kartikdutt18[m]> Hey KimSangYeon-DGU , sakshamb189 , I'm here.
< KimSangYeon-DGU[> I saw Kartik's comment
< KimSangYeon-DGU[> It seems the weights aren't compatible with mlpack, right?
< kartikdutt18[m]> Hmm, I'm not sure. If I use PyTorch's model in train mode, I get the same values as mlpack, but those values aren't correct. However, when I switch the PyTorch model to eval() it gives correct predictions.
< kartikdutt18[m]> I manually set deterministic mode for the pooling and batchnorm layers; that gave me higher-confidence outputs, but the labels still weren't correct.
< kartikdutt18[m]> Another thing I noticed is that leaky-relu in PyTorch uses 0.01 as the default slope, whereas mlpack's default is different. I switched to 0.01 as well and at least got varying labels, but still not the correct ones.
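A minimal sketch of the slope mismatch mentioned above, assuming mlpack's ann layer API; the surrounding darknet block is omitted. PyTorch's nn.LeakyReLU() defaults to a negative slope of 0.01, so the value would be passed explicitly when mirroring the PyTorch layers in mlpack rather than relying on mlpack's different default.

```cpp
// Sketch only: pass PyTorch's default LeakyReLU slope (0.01) explicitly when
// rebuilding the network with mlpack's ann layers.
#include <mlpack/core.hpp>
#include <mlpack/methods/ann/ffn.hpp>
#include <mlpack/methods/ann/layer/layer.hpp>

using namespace mlpack::ann;

int main()
{
  FFN<> model;
  // ... convolution / batchnorm layers of the darknet block go here ...
  model.Add<LeakyReLU<>>(0.01);  // PyTorch's nn.LeakyReLU() default slope
  return 0;
}
```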
< KimSangYeon-DGU[> Ok, how is the training going using the mlpack model on your local machine?
< KimSangYeon-DGU[> Did the loss decrease?
< kartikdutt18[m]> The loss is decreasing very slowly; with Imagenette resized to 224, I got an initial loss of 11.x and it is now at 9.x after 2 epochs. The learning rate was 1e-2 (to be on the safe side).
< kartikdutt18[m]> The accuracy after 1st epoch was 10% and after 2nd it was at 15%.
< KimSangYeon-DGU[> How many days have you spent?
< KimSangYeon-DGU[> The accuracy was on validation, right?
< kartikdutt18[m]> I started the training the same day as your comment. It takes more than a day for a single epoch.
< kartikdutt18[m]> Validation and train accuracy (both are nearly the same).
< sakshamb189[m]> what is the final accuracy you have kartikdutt18 ?
< kartikdutt18[m]> It's on the 3rd epoch, and after 2nd epoch accuracy was 15%.
< kartikdutt18[m]> <KimSangYeon-DGU[ "How many days do you spend?"> KimSangYeon-DGU: , This was for training or in general?
< sakshamb189[m]> more than a day for an epoch seems very very slow
< KimSangYeon-DGU[> Yes, very slow
< kartikdutt18[m]> It completes an epoch overnight, so it is probably around 16-18 hours.
< KimSangYeon-DGU[> Can we track the weight change during inference?
< KimSangYeon-DGU[> I mean layer by layer
< kartikdutt18[m]> I'm not sure I got this point. For the model which is training, we can access the weights after each iteration / epoch. For the model we are trying to transfer, I already wrote tests to check that the weights are transferred correctly, and also verified in unit tests that each layer produces the correct output.
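A rough sketch of what tracking the training model's weights epoch by epoch could look like, assuming the FFN::Train() and Parameters() accessors; the optimizer settings, epoch count, and file names are illustrative only.

```cpp
// Illustrative sketch: snapshot the model's parameters after every epoch so
// the weight change can be tracked over time.
#include <mlpack/core.hpp>
#include <mlpack/methods/ann/ffn.hpp>
#include <mlpack/methods/ann/layer/layer.hpp>
#include <ensmallen.hpp>

using namespace mlpack::ann;

void TrainAndSnapshot(FFN<>& model, arma::mat& trainX, arma::mat& trainY)
{
  // maxIterations = number of points, so each Train() call is roughly one epoch.
  ens::Adam optimizer(0.01, 32, 0.9, 0.999, 1e-8, trainX.n_cols);

  for (size_t epoch = 0; epoch < 5; ++epoch)
  {
    model.Train(trainX, trainY, optimizer);
    // Save the flat parameter vector; snapshots can then be diffed layer by layer.
    model.Parameters().save("weights_epoch_" + std::to_string(epoch) + ".csv",
                            arma::csv_ascii);
  }
}
```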
< KimSangYeon-DGU[> sakshamb189: Would it be better to keep working on the weight converter?
< sakshamb189[m]> I thought kartikdutt18 was already working on it. Is that not correct? You just had to resolve an issue with the runningMean
< kartikdutt18[m]> sakshamb189: It's already running / working. I hardcoded the indices of the batchnorm layers to load the running mean and variance.
< kartikdutt18[m]> To get a PoC.
< kartikdutt18[m]> However, the model produces the same result as PyTorch when the PyTorch model is in training mode, but not in eval mode.
< kartikdutt18[m]> To get correct predictions in PyTorch, we need to use the model in eval mode.
< KimSangYeon-DGU[> kartikdutt18: Is this related to https://github.com/pytorch/pytorch/issues/19902 ?
< sakshamb189[m]> so what issue are you facing with that? I mean using the eval mode
< kartikdutt18[m]> The issue is that if I manually set deterministic for all layers in the snippet [here](https://pastebin.com/kQQF8iXn), I still don't get the same result.
< kartikdutt18[m]> <KimSangYeon-DGU[ "kartikdutt18: Is this related to"> In eval mode translates to deterministic in mlpack and I already tried setting it manually even though while calling Predict there is a set_deterministic vistor.
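For reference, a rough sketch of what "setting deterministic manually" for every layer might look like, assuming the DeterministicSetVisitor that Predict() applies internally; this is the mlpack analogue of PyTorch's eval() mode for layers such as BatchNorm and Dropout.

```cpp
// Sketch only: force every layer into deterministic (inference) mode so
// BatchNorm uses its running mean / variance instead of batch statistics.
#include <mlpack/core.hpp>
#include <mlpack/methods/ann/ffn.hpp>
#include <mlpack/methods/ann/layer/layer.hpp>
#include <mlpack/methods/ann/visitor/deterministic_set_visitor.hpp>

using namespace mlpack::ann;

void SetEvalMode(FFN<>& model)
{
  for (auto& layer : model.Model())
    boost::apply_visitor(DeterministicSetVisitor(true), layer);
}
```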
< sakshamb189[m]> yes that is true. So have you been able to identify the difference?
< sakshamb189[m]> in the eval mode for pytorch and mlpack which is causing the issue
< kartikdutt18[m]> I haven't yet. I checked the output of all layers of darknet and verified that they produced the same output as PyTorch, and they did. I checked all the parameters. I also tried setting deterministic for all layers. Setting deterministic should ideally do the trick, but I'm not sure why it's not giving the correct output.
< sakshamb189[m]> you said you verified the outputs of all the layers. You mean during prediction right?
< sakshamb189[m]> ....and there was no difference. If there was no difference why is the result different?
< kartikdutt18[m]> I tried that in two ways: the whole output of the network (with PyTorch in train mode) and each individual layer. There is a difference with eval mode.
< sakshamb189[m]> so have you identified at which layer we first see the difference?
< kartikdutt18[m]> I'm writing an automated script to generate the output of each layer in PyTorch and then the same in mlpack; it gets tedious to do it manually.
< sakshamb189[m]> hmm.. I think we would only have to go through them one by one starting from the beginning and identify the first point of difference and so on.. what do you think?
< sakshamb189[m]> I think it might be better to stick to the weight converter route because the training is taking too long as of now.
< sakshamb189[m]> and does not seem like a realistic option IMO
< kartikdutt18[m]> Right, I was thinking we can have a loop to store the outputs in CSVs, and then break the model down and build it up layer by layer, matching the outputs with the CSVs rather than using an input of ones and zeros (which I used manually).
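A sketch of that layer-by-layer check, under the assumption that the PyTorch side dumps each layer's activations to CSV; the file names and the partial network are hypothetical. The idea is to rebuild only the first few layers in mlpack, run the same fixed input through them, and compare against the PyTorch dump.

```cpp
// Illustrative comparison of an mlpack partial network against a per-layer
// activation dump saved from PyTorch. File names are placeholders.
#include <iostream>
#include <mlpack/core.hpp>
#include <mlpack/methods/ann/ffn.hpp>
#include <mlpack/methods/ann/layer/layer.hpp>

using namespace mlpack;
using namespace mlpack::ann;

int main()
{
  arma::mat input, expected, output;
  data::Load("input.csv", input);                       // same input fed to PyTorch
  data::Load("pytorch_layer_3_output.csv", expected);   // hypothetical dump

  // Partial network containing only the first few layers under test; the
  // converted weights would be copied into partial.Parameters() before predicting.
  FFN<> partial;
  // ... partial.Add<...>(...) for the first N darknet layers ...

  partial.Predict(input, output);
  const bool match = arma::approx_equal(output, expected, "absdiff", 1e-5);
  std::cout << "Matches PyTorch layer dump: " << std::boolalpha << match << std::endl;
  return 0;
}
```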
< kartikdutt18[m]> <sakshamb189[m] "and does not seem like a realist"> Training darknet 19 and darknet 53 will take long agreed, To be realistic, we should be able to train YOLO model as an epoch takes less than 3 hours so even if it did around 9 epochs a day we could have nearly 100 epochs in 8-9 days.
< KimSangYeon-DGU[> Ok, then, we all think we need to keep working on the weight converter
< KimSangYeon-DGU[> By digging into it from the beginning
< sakshamb189[m]> <kartikdutt18[m] "Training darknet 19 and darknet "> Let just try to finish one thing first because we have been working on this for the past few weeks without much progress. I think we could work on finishing the weight converter first because that looks more promising as of now. (We have tried training models in the past without any good results)
< kartikdutt18[m]> Sure, makes sense. I will try figuring out what goes wrong by breaking down the network.
< KimSangYeon-DGU[> Great, Kartik. Thanks for the hard work.
< kartikdutt18[m]> Thanks. I will keep you guys updated in the issue that I opened.
< kartikdutt18[m]> Also, I can stop the training now then, right?
< kartikdutt18[m]> or should I let it train.
< sakshamb189[m]> the model that is taking one day per epoch?
< kartikdutt18[m]> yes.
< sakshamb189[m]> yes I think you can stop that, could take months to complete.
< kartikdutt18[m]> xD, I'll stop it for now.
< KimSangYeon-DGU[> Ok, is there anything to discuss?
< KimSangYeon-DGU[> I'm done
< kartikdutt18[m]> Nothing more from my side.
< sakshamb189[m]> yes I am done too. Let's meet next week. Have a great weekend ahead!
< KimSangYeon-DGU[> Have a nice week!
< kartikdutt18[m]> Have a great week!
< jeffin143[m]> I am back :)
< jeffin143[m]> Ryan, when are we planning the release?
< jeffin143[m]> Maybe the first video meet of next month? Another release while we are all on the meet?
< himanshu_pathak[> jeffin143: Great. My problem with the laptop is also solved now, I just bought a new machine, all thanks to the GSoC money 😇
< jeffin143[m]> Woah himanshu_pathak, congratulations
< jeffin143[m]> Which laptop ?
< himanshu_pathak[> It was a big upgrade from the old 4GB RAM with no GPU 🤣
< jeffin143[m]> That's a beast
< jeffin143[m]> In front of your old one
< jeffin143[m]> Now you don't need to ssh and train
< jeffin143[m]> 😉
< jeffin143[m]> Congratulations 🎉
< himanshu_pathak[> <jeffin143[m] "In front of your old one"> Thanks Yeah a lot of improvement from that.
a1 has left #mlpack []
favre49 has quit [Remote host closed the connection]
ImQ009 has quit [Quit: Leaving]
< nishantkr18[m]> zoq: Are you there?
< zoq> nishantkr18[m]: yes
< nishantkr18[m]> zoq: Could you have a look at https://github.com/mlpack/mlpack/pull/2454
< zoq> nishantkr18[m]: So this sounds like there is some flaw in the CartPole-v0 implementation?
< nishantkr18[m]> zoq: I'm not sure.. Is the agent able to solve using gym_tcp_api?
< nishantkr18[m]> zoq: on your machine?
< zoq> nishantkr18[m]: let me test
< nishantkr18[m]> zoq: Sure
< zoq> nishantkr18[m]: So no change to the code, just revert the last commit with the static weights?
< nishantkr18[m]> zoq: Yeah.
< nishantkr18[m]> zoq: You can increase the replayBuffer capacity to `10000` for better performance
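For context, the capacity change being suggested would look roughly like this, assuming mlpack's RandomReplay constructor takes the batch size and capacity:

```cpp
// Sketch only: replay buffer with batch size 32 and a capacity of 10000
// (instead of 2000) for the CartPole environment.
#include <mlpack/core.hpp>
#include <mlpack/methods/reinforcement_learning/environment/cart_pole.hpp>
#include <mlpack/methods/reinforcement_learning/replay/random_replay.hpp>

using namespace mlpack::rl;

int main()
{
  RandomReplay<CartPole> replayMethod(32, 10000);
  return 0;
}
```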
< zoq> nishantkr18[m]: current reward: 139 iteration: 49
< nishantkr18[m]> zoq: It does work then, right?
< zoq> nishantkr18[m]: yes :) let me rerun the experiment
< nishantkr18[m]> zoq: Wasted so much time on it. Just to find it was correct all this time!
< nishantkr18[m]> zoq: Having so many mixed feelings now 😀☹️
< zoq> nishantkr18[m]: current reward: 148 iteration: 49
< zoq> nishantkr18[m]: I wouldn't call it wasted time, but I know the feeling :(
< zoq> nishantkr18[m]: Still interesting that the other implementation can solve the env.
< zoq> I thought Categorical DQN would perform better.
< nishantkr18[m]> zoq: Yeah.. that's what is bothering me..
< nishantkr18[m]> Exactly!
< zoq> Will dive into the CartPole-v0 implementation.
< nishantkr18[m]> Yup. we should check that thoroughly
< zoq> I tested with 10000 as buffer size, but I guess 2000 works as well?
< nishantkr18[m]> yeah, but I think 10000 works better
< nishantkr18[m]> I'm trying to see if categorical is able to solve acrobot
< zoq> Sounds good, awesome to see that you narrowed the issue down, well done.
< nishantkr18[m]> :)
< zoq> nishantkr18[m]: Also, isn't it like 2am for you?
< nishantkr18[m]> zoq: 2:45am to be precise 😆
< nishantkr18[m]> zoq: This week was very important for me.. It took me 4 full days to find the bug in SAC and categorical
< zoq> nishantkr18[m]: I have a strange sleeping habit as well, currently reading "Why We Sleep" ... not sure that is a good idea
< nishantkr18[m]> zoq: 😆
< nishantkr18[m]> zoq: BTW, Earlier we used to use mlpack's CartPole implementation to train an agent, and then we used to test the agent on gym_tcp_api
< zoq> nishantkr18[m]: Right, for testing I still like to keep the envs we have.
< zoq> nishantkr18[m]: For the examples, we already switched so we should be fine there?
< zoq> nishantkr18[m]: just tested 2000 -> https://gym.kurg.org/084e338def8d4/output.webm
< zoq> nishantkr18[m]: looks good as well
< nishantkr18[m]> Yeah that is fine. But it's strange if the CartPole implementation is not correct, because using mlpack to train and gym to test, we used to get good results
< nishantkr18[m]> this only means that both the implementations should be very similar
< zoq> yes, the logic should be similar
< nishantkr18[m]> Hmm..
< zoq> This is one of the main drawbacks of the envs: we can't really see the output.
SayanGoswami[m] has joined #mlpack
< nishantkr18[m]> Yup. Btw, categorical is able to get decent results on acrobot-v1 using gym
< zoq> nishantkr18[m]: Great :)
< nishantkr18[m]> Still I feel that in all these simple tasks, simpledqn is able to converge much faster and better..
< zoq> I guess we still should apply the copy constructor changes
< nishantkr18[m]> Yeah sure..
< nishantkr18[m]> Btw, the only change we need to do is to use `targetNetwork.Parameters() = learningNetwork.Parameters()` instead of `targetNetwork = learningNetwork`, right?
< zoq> We have to call Reset before that as well.
< nishantkr18[m]> yeah, I was about to mention that
< nishantkr18[m]> nothing else other than that, right?
< zoq> yes, that worked for me
< zoq> I mean you still have to do 'targetNetwork = learningNetwork' as well, since this will copy the layer structure.
< nishantkr18[m]> yeah, of course.
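Putting this exchange together, the target-network sync would look roughly like the following; the function name is illustrative, and "Reset" is read here as FFN::ResetParameters().

```cpp
// Sketch of the discussed target-network update: copy the layer structure
// once, reset to allocate the target's own parameters, then sync the weights
// by assigning the flat parameter vectors.
#include <mlpack/core.hpp>
#include <mlpack/methods/ann/ffn.hpp>
#include <mlpack/methods/ann/layer/layer.hpp>
#include <mlpack/methods/ann/loss_functions/mean_squared_error.hpp>

using namespace mlpack::ann;

void SyncTargetNetwork(FFN<MeanSquaredError<>>& learningNetwork,
                       FFN<MeanSquaredError<>>& targetNetwork)
{
  targetNetwork = learningNetwork;   // copies the layer structure
  targetNetwork.ResetParameters();   // allocate the target's parameter matrix
  targetNetwork.Parameters() = learningNetwork.Parameters();  // sync weights
}
```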
SayanGoswami[m] is now known as say4n[m]