<EshaanAgarwal[m]>
<jonpsy[m]> "That bit flip one is a classic..." <- so should i go ahead with it ?
<jonpsy[m]>
Sure, as long as a test is reproducible
<EshaanAgarwal[m]>
jonpsy[m]: I have implemented the environment. Would you like to take a look at it whenever you have time? It's in the same PR for HER.
<jonpsy[m]>
sure
<jonpsy[m]>
can you walk me through exactly how it's tested?
<EshaanAgarwal[m]>
jonpsy[m]: I have not written the test yet! There are some modifications needed in q_impl.hpp that I will have to look at again so that we can plug in HER seamlessly.
<jonpsy[m]>
jonpsy[m]: or planned to be tested
<jonpsy[m]>
that bit flip environment, for example. Let's say we start from 00000 and the goal is 10011. Now what's the game?
<EshaanAgarwal[m]>
<EshaanAgarwal[m]> "Which is - Imagine a single..." <- However as mentioned here, we have a arbitrary initial state and arbitrary final vector state. Agent at a time can provide the index where it wants to flip. And agent in the end should learn to reach any goal vector from intiaol vector
<EshaanAgarwal[m]>
jonpsy[m]: Yeah, so the agent provides the index where I need to flip! Let's say it's the 0th: then we flip that 0 to 1 and 10000 is our next state. If from here it again chooses the 0th index, our next state becomes 00000 again.
<EshaanAgarwal[m]>
And so forth
<EshaanAgarwal[m]>
Finally it should learn to take us to the desired goal (10011).
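A minimal sketch of the bit flip environment described above, in C++; the class layout and method names here are illustrative assumptions, not mlpack's actual environment API:

```cpp
// Bit-flip environment: flip one bit per step, sparse reward at the goal.
// (Sketch only; names and interface are assumptions for illustration.)
#include <cstddef>
#include <random>
#include <vector>

class BitFlipping
{
 public:
  explicit BitFlipping(const size_t length) : length(length)
  {
    std::random_device rd;
    std::mt19937 gen(rd());
    std::bernoulli_distribution bit(0.5);

    state.resize(length);
    goal.resize(length);
    for (size_t i = 0; i < length; ++i)
    {
      state[i] = bit(gen);  // Arbitrary initial state, e.g. 00000.
      goal[i] = bit(gen);   // Arbitrary goal vector, e.g. 10011.
    }
  }

  // The action is the index of the bit to flip.  The reward is sparse:
  // 0 when the state matches the goal, -1 otherwise.
  double Step(const size_t actionIndex)
  {
    state[actionIndex] = !state[actionIndex];
    return (state == goal) ? 0.0 : -1.0;
  }

  bool Done() const { return state == goal; }

  const std::vector<bool>& State() const { return state; }
  const std::vector<bool>& Goal() const { return goal; }

 private:
  size_t length;
  std::vector<bool> state;
  std::vector<bool> goal;
};
```

With a reward of 0 only at the exact goal vector, a random policy almost never sees a non-negative signal for longer bit strings, which is what makes this a good sparse-reward test case for HER.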
<jonpsy[m]>
after training, how do you specify which goal to choose?
<EshaanAgarwal[m]>
jonpsy[m]: Can you elaborate, please?
<jonpsy[m]>
So, at training it should've learned how to reach multitude of goals, correct?
<jonpsy[m]>
s/at/after/
<EshaanAgarwal[m]>
jonpsy[m]: Yeah, so basically we have our one final true goal, which is the state we want to reach. Now, based on the goal selection strategy, we will store more transitions where the goal is different. I.e., let's say it reached B from A after an episode while its actual goal was C. We will then save some more copies of those transitions with the goal set to B (as if B had been our goal all along).
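A sketch of that relabeling step, assuming a simple transition struct and the "final" goal selection strategy (replay with the state actually achieved at the end of the episode). All names here are hypothetical, not the actual code in the HER PR:

```cpp
// Hindsight relabeling sketch: copy an episode's transitions into the replay
// buffer with the goal replaced by the achieved final state, recomputing the
// sparse reward against the new goal.
#include <cstddef>
#include <vector>

struct Transition
{
  std::vector<bool> state;
  size_t action;
  double reward;
  std::vector<bool> nextState;
  std::vector<bool> goal;
};

// Recompute the sparse reward against a (possibly relabeled) goal.
inline double SparseReward(const std::vector<bool>& nextState,
                           const std::vector<bool>& goal)
{
  return (nextState == goal) ? 0.0 : -1.0;
}

// Append hindsight copies of an episode's transitions, using the final
// achieved state as the substitute goal ("final" strategy).
inline void RelabelWithFinalGoal(const std::vector<Transition>& episode,
                                 std::vector<Transition>& replayBuffer)
{
  if (episode.empty())
    return;

  const std::vector<bool>& achievedGoal = episode.back().nextState;
  for (const Transition& t : episode)
  {
    Transition relabeled = t;
    relabeled.goal = achievedGoal;
    relabeled.reward = SparseReward(relabeled.nextState, achievedGoal);
    replayBuffer.push_back(relabeled);
  }
}
```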
<jonpsy[m]>
we should go for examples with sparse rewards
<jonpsy[m]>
that robot hand example, any way we can simulate it?
<EshaanAgarwal[m]>
jonpsy[m]: Well, in the paper they coded it themselves.
<EshaanAgarwal[m]>
The MuJoCo ones which are present in OpenAI Gym might not help.
<jonpsy[m]>
hence my question, can we simulate it
<EshaanAgarwal[m]>
jonpsy[m]: We can code it, but they are a little complicated. I will have to see how to do it in C++.
<EshaanAgarwal[m]>
That's why I looked at the bit flip environment, because it's simple and works as a pretty good task.