<EshaanAgarwal[m]>
jonpsy: zoq: I have implemented the maze environment and pushed the code to the PR. I just need to fix a couple of things in it. Meanwhile, it would be great if you both could take a look.
<jonpsy[m]>
<EshaanAgarwal[m]> "jonpsy: zoq same results with..." <- keep 256, but increase depth
<jonpsy[m]>
256, 64, 64, like that perhaps
<EshaanAgarwal[m]>
<jonpsy[m]> "256, 64, 64 like that perhasp" <- Shouldn't last layer as 64 give index out of bounds ?
<jonpsy[m]>
i meant hidden layers
<EshaanAgarwal[m]>
jonpsy[m]: Ok I will make a custom network and then try.
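(For reference, a deeper network along the lines discussed above, with 256, 64, and 64 hidden units, might look roughly like this with mlpack's FFN API; the layer sizes, loss, initialization, and the 4-action output layer are only an illustrative sketch, not the exact configuration used in the PR.)

```cpp
#include <mlpack.hpp>

using namespace mlpack;

int main()
{
  // Sketch of a deeper feed-forward Q-network: hidden layers of 256, 64 and
  // 64 units, with one output per discrete action (4 here, e.g. the four
  // moves in a grid maze). The input size is inferred on the first Forward().
  FFN<MeanSquaredError, GaussianInitialization> network(
      MeanSquaredError(), GaussianInitialization(0, 0.001));
  network.Add<Linear>(256);
  network.Add<ReLU>();
  network.Add<Linear>(64);
  network.Add<ReLU>();
  network.Add<Linear>(64);
  network.Add<ReLU>();
  network.Add<Linear>(4);
}
```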
<EshaanAgarwal[m]>
jonpsy: zoq: our test for the maze environment is converging! I have fixed the environment.
<EshaanAgarwal[m]>
Can you please take a look at it whenever you are free ?
<jonpsy[m]>
one question
<EshaanAgarwal[m]>
jonpsy[m]: Yes
<jonpsy[m]>
what about other policies
<jonpsy[m]>
just to be sure, our env just returns `0` right?
<jonpsy[m]>
and only returns `1` when it reaches the goal (aka where `1` was set up)
<jonpsy[m]>
<EshaanAgarwal[m]> "Ok I will make a custom network..." <- also keep decreasing complexity while it still converges
<EshaanAgarwal[m]>
jonpsy[m]: It returns -1 if it's a wall or out of bounds of the maze
<EshaanAgarwal[m]>
The other replays weren't even giving a positive avg return
<jonpsy[m]>
It shouldn't "return" anything when it hits a wall
<EshaanAgarwal[m]>
But this was consistently giving a positive return of 0.9 within almost 100 episodes
<jonpsy[m]>
it should just go back
<jonpsy[m]>
Basically the wall is there to restrict movement
<jonpsy[m]>
so hitting a wall should just be an "unviable" path
<EshaanAgarwal[m]>
jonpsy[m]: But that was a wrong step, so we should give it a negative reward, right?
<EshaanAgarwal[m]>
jonpsy[m]: It does go back, and at the same time I am giving it a -1 reward for the wrong action
<EshaanAgarwal[m]>
EshaanAgarwal[m]: I have implemented this in the code
<jonpsy[m]>
you could check if the path would lead to "-1" aka wall
<jonpsy[m]>
and not go there at all
<EshaanAgarwal[m]>
jonpsy[m]: The agent has the option to choose any action, right? That's the whole point: it learns to understand that it doesn't need to move into a wall
<EshaanAgarwal[m]>
From a 0 state, if the agent chooses an action which takes it to a wall, then it will not go there, but at the same time we will give it a negative reward for the wrong action it chose.
<EshaanAgarwal[m]>
<jonpsy[m]> "you could check if the path..." <- zoq: have removed the negative reward associated with the wall ! It's still performing way better than others . I have pushed those changes.
<EshaanAgarwal[m]>
s///
<EshaanAgarwal[m]>
* zoq: have removed the negative reward associated with the wall ! It's able to solve the maze and is performing way better than others . I have pushed those changes.
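(As a rough illustration of the step logic being discussed, here is a sketch of one environment step for a grid maze; the function and variable names are hypothetical, not the PR's actual API. A wall or out-of-bounds move simply leaves the agent in place, and the reward stays binary: 1 on reaching the goal, 0 otherwise.)

```cpp
// Hypothetical single step of a 5x5 grid maze. `maze` holds 0 for free
// cells, -1 for walls and 1 for the goal; `row`/`col` is the agent position.
enum class Action { Up, Down, Left, Right };

struct StepResult
{
  double reward;
  bool terminal;
};

StepResult Step(const int maze[5][5], int& row, int& col, Action action)
{
  int newRow = row, newCol = col;
  switch (action)
  {
    case Action::Up:    --newRow; break;
    case Action::Down:  ++newRow; break;
    case Action::Left:  --newCol; break;
    case Action::Right: ++newCol; break;
  }

  // Out-of-bounds or wall: the move is blocked, the agent stays put, and no
  // extra penalty is given, keeping the reward binary.
  const bool blocked = newRow < 0 || newRow >= 5 || newCol < 0 ||
      newCol >= 5 || maze[newRow][newCol] == -1;
  if (!blocked)
  {
    row = newRow;
    col = newCol;
  }

  // Reward 1 and terminate only when the goal cell is reached.
  const bool atGoal = (maze[row][col] == 1);
  return { atGoal ? 1.0 : 0.0, atGoal };
}
```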
<jonpsy[m]>
<EshaanAgarwal[m]> "From a 0 state if the agents..." <- I get your point. But I'm tryin to keep things binary here. Win/Lose, that's it
<jonpsy[m]>
Btw, one way we could make this interesting
<jonpsy[m]>
is limiting the number of steps
<EshaanAgarwal[m]>
jonpsy[m]: I have done that now! It's performing well with that too.
<jonpsy[m]>
I think we have that feature already
<jonpsy[m]>
number of steps thing?
<EshaanAgarwal[m]>
jonpsy[m]: Yes but I guess for a simple test this should be fine.
<EshaanAgarwal[m]>
jonpsy[m]: Yes ! Can you elaborate ?
<jonpsy[m]>
Yeah, so for example, if you have a maze
<jonpsy[m]>
It's like a race basically, and you have a time limit. If you don't find the goal within that time limit, you lose
<jonpsy[m]>
that'll help with a graceful exit in case an agent gets stuck in an infinite back-and-forth
<EshaanAgarwal[m]>
jonpsy[m]: We are doing that already
<EshaanAgarwal[m]>
It's the max number of steps! I have set it to 120 for now, but we can reduce it
<jonpsy[m]>
we should defo play with that param
<EshaanAgarwal[m]>
When it takes more than the max number of steps, it loses and the game terminates
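(A minimal sketch of the step-budget idea; 120 is just the value mentioned above, and the names are hypothetical.)

```cpp
#include <cstddef>

// Run one episode under a hard step budget: the episode ends either when the
// goal is reached (win) or when maxSteps moves are used up (loss), so an
// agent bouncing back and forth between walls still terminates gracefully.
bool RunEpisode(bool (*goalReachedAfterMove)(std::size_t step),
                std::size_t maxSteps = 120)
{
  for (std::size_t step = 0; step < maxSteps; ++step)
  {
    if (goalReachedAfterMove(step))
      return true;   // Win: goal found within the budget.
  }
  return false;      // Lose: step budget exhausted.
}
```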
<EshaanAgarwal[m]>
jonpsy[m]: I will try and share the results !
<jonpsy[m]>
For now, this maze is way too easy. A generic DP can solve this
<EshaanAgarwal[m]>
jonpsy[m]: Yeah, you could say the same for the bit-flipping task too
<jonpsy[m]>
we should aim for a bigger, more constrained maze. How were the other policies faring in our current maze?
<EshaanAgarwal[m]>
jonpsy[m]: Not able to solve it most of the time! Avg returns were around 0.5, but this got 1 in almost 70 episodes
<EshaanAgarwal[m]>
Really outperformed
<jonpsy[m]>
that's weird....
<EshaanAgarwal[m]>
jonpsy[m]: Weird how ?
<EshaanAgarwal[m]>
Not 70 on all runs, but it was able to solve it! Random replay was around 0.5-0.6 and hovering around that. I am talking about the avg return over 50 episodes
<jonpsy[m]>
A generic RL agent can solve the Frozen Lake problem reasonably well.
<EshaanAgarwal[m]>
It was able to solve the game in some episodes
<jonpsy[m]>
That is, without a neural net; just a simple value-table approach
<EshaanAgarwal[m]>
jonpsy[m]: It can solve it, but with the same order of samples? I guess not
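(For context, the "value table" approach jonpsy refers to is a tabular method such as Q-learning, where a small table of state-action values stands in for the neural network. Below is a generic, self-contained sketch assuming a small discrete environment like a 4x4 Frozen Lake; all names and constants are illustrative, and this is not mlpack code.)

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <random>

// Tabular Q-learning over a small discrete MDP (16 states, 4 actions).
// EnvStep is any callable (state, action) -> Transition.
constexpr std::size_t numStates = 16, numActions = 4;

struct Transition { double reward; std::size_t nextState; bool terminal; };

template<typename EnvStep>
std::array<std::array<double, numActions>, numStates>
TabularQLearning(EnvStep step, std::size_t episodes = 5000)
{
  std::array<std::array<double, numActions>, numStates> q{};  // Zero-initialized Q-table.
  const double alpha = 0.1, gamma = 0.99, epsilon = 0.1;

  std::mt19937 rng(42);
  std::uniform_real_distribution<double> unif(0.0, 1.0);
  std::uniform_int_distribution<std::size_t> randomAction(0, numActions - 1);

  for (std::size_t e = 0; e < episodes; ++e)
  {
    std::size_t s = 0;  // Each episode starts in state 0.
    bool done = false;
    while (!done)
    {
      // Epsilon-greedy action selection directly from the table.
      std::size_t a;
      if (unif(rng) < epsilon)
      {
        a = randomAction(rng);  // Explore.
      }
      else
      {
        a = 0;                  // Exploit: argmax over Q(s, .).
        for (std::size_t i = 1; i < numActions; ++i)
          if (q[s][i] > q[s][a]) a = i;
      }

      const Transition t = step(s, a);

      // One-step Q-learning update towards r + gamma * max_a' Q(s', a').
      const double best =
          *std::max_element(q[t.nextState].begin(), q[t.nextState].end());
      q[s][a] += alpha * (t.reward + (t.terminal ? 0.0 : gamma * best) - q[s][a]);

      s = t.nextState;
      done = t.terminal;
    }
  }
  return q;
}
```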
<jonpsy[m]>
Ok, let's solidify this then.
<jonpsy[m]>
Let's go for the Frozen Lake game
<EshaanAgarwal[m]>
jonpsy[m]: Do I have to implement that ?
<jonpsy[m]>
it's available in openai gym
<EshaanAgarwal[m]>
jonpsy[m]: Oh okay ! Let me check that out
<EshaanAgarwal[m]>
Shouldn't we focus on the documentation and other stuff, as the deadline is nearing?
<jonpsy[m]>
it's not available for C++ I guess, but I can show you a link
<jonpsy[m]>
EshaanAgarwal[m]: Have you not started it already?
<jonpsy[m]>
you've commented the codes, right?
<EshaanAgarwal[m]>
jonpsy[m]: I have, but a little guidance on what you expect would help.
<jonpsy[m]>
I see, I've worked on the ensmallen documentation. Never on the mlpack documentation
<EshaanAgarwal[m]>
jonpsy[m]: Yes! I have.
<EshaanAgarwal[m]>
EshaanAgarwal[m]: I meant this with reference to the GSoC submission
<jonpsy[m]>
Oh that
<EshaanAgarwal[m]>
jonpsy[m]: Would there be a difference?
<rcurtin[m]>
I missed the context of the conversation, but the mlpack documentation should ideally someday be like the ensmallen documentation but it is not there yet 😃 needs a lot of work...
<jonpsy[m]>
hey there rcurtin , perfect timing!
<jonpsy[m]>
So I was wondering: if we add a new method, is there anywhere we need to document it (except the code comments)?
<EshaanAgarwal[m]>
jonpsy[m]: Meanwhile, can we merge HER, so that we can work on PPO or the Frozen Lake env?
<rcurtin[m]>
if you want, you could write a tutorial and add it to `doc/tutorials/`, but that's often a lot of work; for now, we probably should leave the user-facing documentation as comments in the code, and then as time goes on (maybe if we go GSoD?), we can extract all that into a Markdown file like ensmallen
<rcurtin[m]>
but at least personally that is what I'd like to see---it's much easier to maintain
<rcurtin[m]>
for ensmallen maintaining that documentation is a little easier because the scope/task of the library is so limited; it will be harder for mlpack to figure out how to organize it all
<rcurtin[m]>
but I think it can be done
<jonpsy[m]>
rcurtin[m]: Defo to be looked at during GSoD
<EshaanAgarwal[m]>
EshaanAgarwal[m]: jonpsy: can we discuss how we proceed from here ?
<jonpsy[m]>
It's still a little unsettling to me...
<EshaanAgarwal[m]>
jonpsy[m]: Unsettling in the sense ?
<jonpsy[m]>
let's stick with maze itself. Don't think we have time for anything else
<EshaanAgarwal[m]>
As for limiting the steps to solve the maze, I will do that. But I feel for testing purposes the current maze should do.
<EshaanAgarwal[m]>
jonpsy[m]: Ok and from here ?
<jonpsy[m]>
Increase complexity of maze
<EshaanAgarwal[m]>
jonpsy[m]: Should we ? By how much ?
<jonpsy[m]>
zoq: ping
<jonpsy[m]>
had a question
<jonpsy[m]>
there?
<zoq[m]>
Can you compare this with the existing RL replay policies? HER should be better, not worse.
<EshaanAgarwal[m]>
zoq[m]: HER was better! The others weren't able to perform as well when I tried RandomReplay and PrioritizedReplay
<jonpsy[m]>
they were converging though, right?
<jonpsy[m]>
just, not as often?
<EshaanAgarwal[m]>
jonpsy[m]: No, not converging on the test! For it to converge, it needs to have an avg return of 0.99 over the last 50 episodes
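(A small sketch of that convergence check; the window of 50 episodes and the 0.99 threshold are the numbers mentioned above, everything else is hypothetical and not the actual test code.)

```cpp
#include <cstddef>
#include <deque>
#include <numeric>

// Rolling convergence check: keep the returns of the most recent `window`
// episodes and report convergence once their average reaches `threshold`.
class ConvergenceCheck
{
 public:
  ConvergenceCheck(std::size_t window = 50, double threshold = 0.99) :
      window(window), threshold(threshold) { }

  // Record one finished episode's return; returns true once the average
  // return over the last `window` episodes has reached the threshold.
  bool Record(double episodeReturn)
  {
    returns.push_back(episodeReturn);
    if (returns.size() > window)
      returns.pop_front();

    const double avg =
        std::accumulate(returns.begin(), returns.end(), 0.0) / returns.size();
    return returns.size() == window && avg >= threshold;
  }

 private:
  std::size_t window;
  double threshold;
  std::deque<double> returns;
};
```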
<jonpsy[m]>
can you list the policies, with the neural net and their average return?
<EshaanAgarwal[m]>
I set the threshold high so that HER performance can be gauged
<EshaanAgarwal[m]>
jonpsy[m]: Will have to run it to give you numbers. Give me a few minutes!
<jonpsy[m]>
Cool, always track the numbers in a doc.
<EshaanAgarwal[m]>
jonpsy[m]: Will update the numbers in the doc with screenshots and ping you!