ChanServ changed the topic of #mlpack to: "mlpack: a fast, flexible machine learning library :: We don't always respond instantly, but we will respond; please be patient :: Logs at http://www.mlpack.org/irc/"
mistiry has joined #mlpack
Hunterkll has joined #mlpack
ImQ009 has joined #mlpack
favre49 has joined #mlpack
< favre49> Having some issues while running make - getting this error: make[3]: *** No rule to make target '../src/mlpack/core/util/cli.hpp', needed by 'src/mlpack/tests/cotire/mlpack_test_CXX_prefix.hxx.gch'. Stop.
< zoq> favre49: deleting the build folder should solve the issue
< favre49> Great, that seems to have fixed it. Thanks
< KimSangYeon-DGU[> kartikdutt18: sakshamb189 Hey, are you there?
< sakshamb189[m]> yes I am here
< KimSangYeon-DGU[> Great
< kartikdutt18[m]> Hey KimSangYeon-DGU , sakshamb189 , I'm here.
< KimSangYeon-DGU[> I saw Kartik's comment
< KimSangYeon-DGU[> It seems the weights aren't compatible with mlpack, right?
< kartikdutt18[m]> Hmm, I'm not sure. If I use PyTorch's model in train mode, I get the same values as mlpack, but those values aren't correct. However, when I switch the PyTorch model to eval() it gives correct predictions.
< kartikdutt18[m]> I manually set deterministic mode for the pooling and batchnorm layers; that gave me higher-confidence outputs, but the labels still weren't correct.
< kartikdutt18[m]> Another thing I noticed is that leaky-relu in PyTorch uses 0.01 as the default slope, whereas mlpack's default is different. I switched to 0.01 as well and at least got varying labels, but still not the correct ones.
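A minimal sketch of the slope mismatch mentioned above, assuming mlpack's ann layer API; the surrounding darknet block is omitted. PyTorch's nn.LeakyReLU() defaults to a negative slope of 0.01, so the value would be passed explicitly when mirroring the PyTorch layers in mlpack rather than relying on mlpack's different default.

```cpp
// Sketch only: pass PyTorch's default LeakyReLU slope (0.01) explicitly when
// rebuilding the network with mlpack's ann layers.
#include <mlpack/core.hpp>
#include <mlpack/methods/ann/ffn.hpp>
#include <mlpack/methods/ann/layer/layer.hpp>

using namespace mlpack::ann;

int main()
{
  FFN<> model;
  // ... convolution / batchnorm layers of the darknet block go here ...
  model.Add<LeakyReLU<>>(0.01);  // PyTorch's nn.LeakyReLU() default slope
  return 0;
}
```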
< KimSangYeon-DGU[> Ok, how is the training going using the mlpack model on your local machine?
< KimSangYeon-DGU[> Did the loss decrease?
< kartikdutt18[m]> The loss is decreasing very slowly; with Imagenette resized to 224, I got an initial loss of 11.x and it is now at 9.x after 2 epochs. The learning rate was 1e-2 (to be on the safe side).
< kartikdutt18[m]> The accuracy after 1st epoch was 10% and after 2nd it was at 15%.
< KimSangYeon-DGU[> How many days have you spent?
< KimSangYeon-DGU[> The accuracy was on validation, right?
< kartikdutt18[m]> I started the training the same day as your comment. It takes more than a day for a single epoch.
< kartikdutt18[m]> Validation and train accuracy (both are nearly the same).
< sakshamb189[m]> what is the final accuracy you have kartikdutt18 ?
< kartikdutt18[m]> It's on the 3rd epoch, and after 2nd epoch accuracy was 15%.
< kartikdutt18[m]> <KimSangYeon-DGU[ "How many days do you spend?"> KimSangYeon-DGU: , This was for training or in general?
< sakshamb189[m]> more than a day for an epoch seems very very slow
< KimSangYeon-DGU[> Yes, very slow
< kartikdutt18[m]> It completes an epoch overnight, so it is probably around 16-18 hours.
< KimSangYeon-DGU[> Can we track the weight change during inference?
< KimSangYeon-DGU[> I mean layer by layer
< kartikdutt18[m]> I'm not sure I got this point. For the model which is training, we can access the weights after each iteration / epoch. For the model we are trying to transfer, I already wrote tests to check that the weights are transferred correctly, and also verified in unit tests that each layer produces the correct output.
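A rough sketch of what tracking the training model's weights epoch by epoch could look like, assuming the FFN::Train() and Parameters() accessors; the optimizer settings, epoch count, and file names are illustrative only.

```cpp
// Illustrative sketch: snapshot the model's parameters after every epoch so
// the weight change can be tracked over time.
#include <mlpack/core.hpp>
#include <mlpack/methods/ann/ffn.hpp>
#include <mlpack/methods/ann/layer/layer.hpp>
#include <ensmallen.hpp>

using namespace mlpack::ann;

void TrainAndSnapshot(FFN<>& model, arma::mat& trainX, arma::mat& trainY)
{
  // maxIterations = number of points, so each Train() call is roughly one epoch.
  ens::Adam optimizer(0.01, 32, 0.9, 0.999, 1e-8, trainX.n_cols);

  for (size_t epoch = 0; epoch < 5; ++epoch)
  {
    model.Train(trainX, trainY, optimizer);
    // Save the flat parameter vector; snapshots can then be diffed layer by layer.
    model.Parameters().save("weights_epoch_" + std::to_string(epoch) + ".csv",
                            arma::csv_ascii);
  }
}
```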
< KimSangYeon-DGU[> sakshamb189: Would it be better to keep working on the weight converter?
< sakshamb189[m]> I thought kartikdutt18 was already working on it. Is that not correct? You just had to resolve an issue with the runningMean
< kartikdutt18[m]> sakshamb189: It's already running / working. I hardcoded the indices of the batchnorm layers to load the running mean and variance.
< kartikdutt18[m]> To get a PoC.
< kartikdutt18[m]> However, the model produces the same result as PyTorch when the PyTorch model is in training mode, but not in eval mode.
< kartikdutt18[m]> To get correct predictions in PyTorch, we need to use the model in eval mode.
< KimSangYeon-DGU[> kartikdutt18: Is this related to https://github.com/pytorch/pytorch/issues/19902 ?
< sakshamb189[m]> so what issue are you facing with that? I mean using the eval mode
< kartikdutt18[m]> The issue is that if I manually set deterministic for all layers in the snippet [here](https://pastebin.com/kQQF8iXn), I still don't get the same result.
< kartikdutt18[m]> <KimSangYeon-DGU[ "kartikdutt18: Is this related to"> In eval mode translates to deterministic in mlpack and I already tried setting it manually even though while calling Predict there is a set_deterministic vistor.
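For reference, a rough sketch of what "setting deterministic manually" for every layer might look like, assuming the DeterministicSetVisitor that Predict() applies internally; this is the mlpack analogue of PyTorch's eval() mode for layers such as BatchNorm and Dropout.

```cpp
// Sketch only: force every layer into deterministic (inference) mode so
// BatchNorm uses its running mean / variance instead of batch statistics.
#include <mlpack/core.hpp>
#include <mlpack/methods/ann/ffn.hpp>
#include <mlpack/methods/ann/layer/layer.hpp>
#include <mlpack/methods/ann/visitor/deterministic_set_visitor.hpp>

using namespace mlpack::ann;

void SetEvalMode(FFN<>& model)
{
  for (auto& layer : model.Model())
    boost::apply_visitor(DeterministicSetVisitor(true), layer);
}
```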
< sakshamb189[m]> yes that is true. So have you been able to identify the difference?
< sakshamb189[m]> in the eval mode for pytorch and mlpack which is causing the issue
< kartikdutt18[m]> I haven't yet. I checked the output of all layers of darknet and verified that they produced the same output as PyTorch, and they did. I checked all the parameters. I also tried setting deterministic for all layers. Setting deterministic should ideally do the trick, but I'm not sure why it's not giving the correct output.
< sakshamb189[m]> you said you verified the outputs of all the layers. You mean during prediction right?
< sakshamb189[m]> ....and there was no difference. If there was no difference why is the result different?
< kartikdutt18[m]> I tried that in two ways: the whole output of the network (with PyTorch in train mode) and each individual layer. There is a difference with eval mode.
< sakshamb189[m]> so have you identified at which layer we first see the difference?
< kartikdutt18[m]> I'm writing an automated script to generate the output of each layer in PyTorch and then the same in mlpack; it gets tedious to do it manually.
< sakshamb189[m]> hmm.. I think we would only have to go through them one by one starting from the beginning and identify the first point of difference and so on.. what do you think?
< sakshamb189[m]> I think it might be better to stick to the weight converter route because the training is taking too long as of now.
< sakshamb189[m]> and does not seem like a realistic option IMO
< kartikdutt18[m]> Right, I was thinking we can have a loop to store the outputs in CSVs, and then break the model down and build it up layer by layer, matching the outputs with the CSVs rather than using an input of ones and zeros (which I used manually).
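A sketch of that layer-by-layer check, under the assumption that the PyTorch side dumps each layer's activations to CSV; the file names and the partial network are hypothetical. The idea is to rebuild only the first few layers in mlpack, run the same fixed input through them, and compare against the PyTorch dump.

```cpp
// Illustrative comparison of an mlpack partial network against a per-layer
// activation dump saved from PyTorch. File names are placeholders.
#include <iostream>
#include <mlpack/core.hpp>
#include <mlpack/methods/ann/ffn.hpp>
#include <mlpack/methods/ann/layer/layer.hpp>

using namespace mlpack;
using namespace mlpack::ann;

int main()
{
  arma::mat input, expected, output;
  data::Load("input.csv", input);                       // same input fed to PyTorch
  data::Load("pytorch_layer_3_output.csv", expected);   // hypothetical dump

  // Partial network containing only the first few layers under test; the
  // converted weights would be copied into partial.Parameters() before predicting.
  FFN<> partial;
  // ... partial.Add<...>(...) for the first N darknet layers ...

  partial.Predict(input, output);
  const bool match = arma::approx_equal(output, expected, "absdiff", 1e-5);
  std::cout << "Matches PyTorch layer dump: " << std::boolalpha << match << std::endl;
  return 0;
}
```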
< kartikdutt18[m]> <sakshamb189[m] "and does not seem like a realist"> Training darknet 19 and darknet 53 will take long agreed, To be realistic, we should be able to train YOLO model as an epoch takes less than 3 hours so even if it did around 9 epochs a day we could have nearly 100 epochs in 8-9 days.
< KimSangYeon-DGU[> Ok, then, we all think we need to keep working on the weight converter
< KimSangYeon-DGU[> By digging into it from the beginning
< sakshamb189[m]> <kartikdutt18[m] "Training darknet 19 and darknet "> Let just try to finish one thing first because we have been working on this for the past few weeks without much progress. I think we could work on finishing the weight converter first because that looks more promising as of now. (We have tried training models in the past without any good results)
< kartikdutt18[m]> Sure, makes sense. I will try figuring out what goes wrong by breaking down the network.
< KimSangYeon-DGU[> Great, Kartik. Thanks for the hard work.
< kartikdutt18[m]> Thanks. I will keep you guys updated in the issue that I opened.
< kartikdutt18[m]> Also, I can stop the training now then, right?
< kartikdutt18[m]> or should I let it train.
< sakshamb189[m]> the model that is taking one day per epoch?
< kartikdutt18[m]> yes.
< sakshamb189[m]> yes I think you can stop that, could take months to complete.
< kartikdutt18[m]> xD, I'll stop it for now.
< KimSangYeon-DGU[> Ok, is there anything to discuss?
< KimSangYeon-DGU[> I'm done
< kartikdutt18[m]> Nothing more from my side.
< sakshamb189[m]> yes I am done too. Let's meet next week. Have a great weekend ahead!
< KimSangYeon-DGU[> Have a nice week!
< kartikdutt18[m]> Have a great week!
< jeffin143[m]> I am back :)
< jeffin143[m]> Ryan, when are we planning the release?
< jeffin143[m]> Maybe the first video meet of next month? Another release while we are all on the meet?
< himanshu_pathak[> jeffin143: Great. My problem with the laptop is also solved now, I just bought a new machine, all thanks to the GSoC money 😇
< jeffin143[m]> Woah himanshu_pathak, congratulations
< jeffin143[m]> Which laptop ?
< himanshu_pathak[> It was a big upgrade from the old 4GB RAM with no GPU 🤣
< jeffin143[m]> That's a beast
< jeffin143[m]> In front of your old one
< jeffin143[m]> Now you don't need to ssh and train
< jeffin143[m]> 😉
< jeffin143[m]> Congratulations 🎉
< himanshu_pathak[> <jeffin143[m] "In front of your old one"> Thanks Yeah a lot of improvement from that.
a1 has left #mlpack []
favre49 has quit [Remote host closed the connection]
ImQ009 has quit [Quit: Leaving]
< nishantkr18[m]> zoq: Are you there?
< zoq> nishantkr18[m]: yes
< nishantkr18[m]> zoq: Could you have a look at https://github.com/mlpack/mlpack/pull/2454
< zoq> nishantkr18[m]: So this sounds like there is some flaw in the CartPole-v0 implementation?
< nishantkr18[m]> zoq: I'm not sure.. Is the agent able to solve using gym_tcp_api?
< nishantkr18[m]> zoq: on your machine?
< zoq> nishantkr18[m]: let me test
< nishantkr18[m]> zoq: Sure
< zoq> nishantkr18[m]: So no change to the code, just revert the last commit with the static weights?
< nishantkr18[m]> zoq: Yeah.
< nishantkr18[m]> zoq: You can increase the replayBuffer capacity to `10000` for better performance
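For context, the capacity change being suggested would look roughly like this, assuming mlpack's RandomReplay constructor takes the batch size and capacity:

```cpp
// Sketch only: replay buffer with batch size 32 and a capacity of 10000
// (instead of 2000) for the CartPole environment.
#include <mlpack/core.hpp>
#include <mlpack/methods/reinforcement_learning/environment/cart_pole.hpp>
#include <mlpack/methods/reinforcement_learning/replay/random_replay.hpp>

using namespace mlpack::rl;

int main()
{
  RandomReplay<CartPole> replayMethod(32, 10000);
  return 0;
}
```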
< zoq> nishantkr18[m]: current reward: 139 iteration: 49
< nishantkr18[m]> zoq: It does work then, right?
< zoq> nishantkr18[m]: yes :) let me rerun the experiment
< nishantkr18[m]> zoq: Wasted so much time on it. Just to find it was correct all this time!
< nishantkr18[m]> zoq: Having so many mixed feelings now 😀☹️
< zoq> nishantkr18[m]: current reward: 148 iteration: 49
< zoq> nishantkr18[m]: I wouldn't call it wasted time, but I know the feeling :(
< zoq> nishantkr18[m]: Still interesting that the other implementation can solve the env.
< zoq> I thought Categorical DQN would perform better.
< nishantkr18[m]> zoq: Yeah.. that's what is bothering me..
< nishantkr18[m]> Exactly!
< zoq> Will dive into the CartPole-v0 implementation.
< nishantkr18[m]> Yup. we should check that thoroughly
< zoq> I tested with 10000 as buffer size, but I guess 2000 works as well?
< nishantkr18[m]> yeah, but I think 10000 works better
< nishantkr18[m]> I'm trying to see if categorical is able to solve acrobot
< zoq> Sounds good, awesome to see that you narrowed the issue down, well done.
< nishantkr18[m]> :)
< zoq> nishantkr18[m]: Also, isn't it like 2am for you?
< nishantkr18[m]> zoq: 2:45am to be precise 😆
< nishantkr18[m]> zoq: This week was very important for me.. It took me 4 full days to find the bug in SAC and categorical
< zoq> nishantkr18[m]: I have a strange sleeping habit as well, currently reading "Why We Sleep" ... not sure that is a good idea
< nishantkr18[m]> zoq: 😆
< nishantkr18[m]> zoq: BTW, Earlier we used to use mlpack's CartPole implementation to train an agent, and then we used to test the agent on gym_tcp_api
< zoq> nishantkr18[m]: Right, for testing I still like to keep the envs we have.
< zoq> nishantkr18[m]: For the examples, we already switched so we should be fine there?
< zoq> nishantkr18[m]: just tested 2000 -> https://gym.kurg.org/084e338def8d4/output.webm
< zoq> nishantkr18[m]: looks good as well
< nishantkr18[m]> Yeah that is fine. But it's strange if the CartPole implementation is not correct, because using mlpack to train and gym to test, we used to get good results
< nishantkr18[m]> this only means that both the implementations should be very similar
< zoq> yes, the logic should be similar
< nishantkr18[m]> Hmm..
< zoq> This is one of the main drawbacks of the envs: we can't really see the output.
SayanGoswami[m] has joined #mlpack
< nishantkr18[m]> Yup. Btw, categorical is able to get decent results on acrobot-v1 using gym
< zoq> nishantkr18[m]: Great :)
< nishantkr18[m]> Still I feel that in all these simple tasks, simpledqn is able to converge much faster and better..
< zoq> I guess we still should apply the copy constructor changes
< nishantkr18[m]> Yeah sure..
< nishantkr18[m]> Btw, the only change we need to do is to use `targetNetwork.Parameters() = learningNetwork.Parameters()` instead of `targetNetwork = learningNetwork`, right?
< zoq> We have to call Reset before that as well.
< nishantkr18[m]> yeah, I was about to mention that
< nishantkr18[m]> nothing else other than that, right?
< zoq> yes, that worked for me
< zoq> I mean you still have to do 'targetNetwork = learningNetwork' as well, since this will copy the layer structure.
< nishantkr18[m]> yeah, of course.
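Putting this exchange together, the target-network sync would look roughly like the following; the function name is illustrative, and "Reset" is read here as FFN::ResetParameters().

```cpp
// Sketch of the discussed target-network update: copy the layer structure
// once, reset to allocate the target's own parameters, then sync the weights
// by assigning the flat parameter vectors.
#include <mlpack/core.hpp>
#include <mlpack/methods/ann/ffn.hpp>
#include <mlpack/methods/ann/layer/layer.hpp>
#include <mlpack/methods/ann/loss_functions/mean_squared_error.hpp>

using namespace mlpack::ann;

void SyncTargetNetwork(FFN<MeanSquaredError<>>& learningNetwork,
                       FFN<MeanSquaredError<>>& targetNetwork)
{
  targetNetwork = learningNetwork;   // copies the layer structure
  targetNetwork.ResetParameters();   // allocate the target's parameter matrix
  targetNetwork.Parameters() = learningNetwork.Parameters();  // sync weights
}
```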
SayanGoswami[m] is now known as say4n[m]