Solve Hex! and figure out hyperparams #29

Open
barakugav opened this issue Aug 18, 2022 · 5 comments

@barakugav
Collaborator

Train a two-headed network until the engine wins against us consistently.
Understand how long such training takes, and with what hyperparameters:
learning rate
games generated in self-play
temperature for the softmax, and for how many moves we should use it
model structure
whether we should take a single position from each game or more

A lot of this can be taken from lc0; a hypothetical example is sketched below.
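For concreteness, here is a minimal sketch (as a plain Python dict) of the kind of knobs listed above. Every field name and value is an illustrative placeholder for discussion, not the actual contents of train/config.json:

```python
# Hypothetical hyperparameter set for lc0-style self-play training on Hex.
# All names and values below are placeholders, not the repo's real config.
config = {
    "learning_rate": 1e-3,          # optimizer step size
    "games_num": 100,               # self-play games generated per iteration
    "mcts_simulations": 1000,       # MCTS simulations per move during self-play
    "softmax_temperature": 1.0,     # temperature applied to root visit counts
    "softmax_moves": 10,            # sample from the softmax only for the first N moves, then play greedily
    "model": "two_headed_convnet",  # network structure (shared trunk, policy head + value head)
    "positions_per_game": "all",    # alternative: sample a single position per game
}
```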

@poja
Owner

poja commented Oct 25, 2022

What is your intuition about this initial attempt?
https://github.com/poja/RL/blob/smaller-hex-2/train/config.json

@barakugav
Collaborator Author

I think the ratio of self-play to training is too high.
In each iteration we will play 100 games with ~60 positions each and 1,000 simulations per move, 6,000,000 network evaluations in total.
If we generate 6,000 positions each iteration, I think we should train on at least 20,000 entries, and we can choose them from the latest 100,000 entries.

In general, self-play is much more computationally heavy, and we want to get the most out of each data entry, so let's train A LOT!
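The same back-of-the-envelope arithmetic written out, using only the numbers quoted above (nothing here is read from the actual config):

```python
# Rough cost of one self-play iteration with the numbers quoted in this comment.
games_per_iteration = 100
positions_per_game = 60        # ~average game length assumed above
simulations_per_move = 1_000   # each MCTS simulation needs one network evaluation

network_evals = games_per_iteration * positions_per_game * simulations_per_move
new_entries = games_per_iteration * positions_per_game
print(f"{network_evals:,} network evaluations per iteration")  # 6,000,000
print(f"{new_entries:,} new training entries per iteration")   # 6,000

# Suggested training volume: ~20,000 entries per iteration,
# sampled from a window of the latest ~100,000 entries.
train_entries = 20_000
print(f"trained entries per newly generated entry: {train_entries / new_entries:.1f}")  # ~3.3
```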

@poja
Owner

poja commented Oct 26, 2022

I want to think more about this, but meanwhile see the attached results with the above config.
Notice how at some point the learning stops (for the policy before the value).
And the last row seems "lucky" to me in the value loss, i.e. the next row wouldn't necessarily be as good.

Also, importantly, this is 4x4 Hex, so it somewhat affects the numbers you mentioned and the intuition (though it is roughly the same order of magnitude).

221025_175846.txt

@barakugav
Collaborator Author

If the learning stops so fast, maybe our learning rate is too high.
What do you think about 10^-3 for the first 50 iterations and 10^-4 for the remaining 50?
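A minimal sketch of that two-phase schedule (the function name and how it would hook into the training loop are assumptions, not existing repo code):

```python
def learning_rate_for(iteration: int) -> float:
    """Two-phase schedule proposed above: 1e-3 for the first 50 iterations,
    then drop to 1e-4 for the rest."""
    return 1e-3 if iteration < 50 else 1e-4
```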

Also, a policy accuracy of 0.3 is not terrible when you have 16 options, but I suspect the network is very limited. Would you like to try ConvNetV1 instead?

And another point about the number of training entries: if latest_data_entries=1000, this basically means we only learn from the last iteration's data. I really think we should either increase latest_data_entries and iteration_data_entries, or decrease games_num to ~10 and do 1,000 iterations.
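To make that concrete, a rough look at how many iterations of data the replay window covers under each option (the ~12 positions per 4x4 game is my guess at an average game length; the other numbers are the ones discussed in this thread):

```python
def iterations_covered(latest_data_entries, games_num, positions_per_game):
    # How many iterations' worth of self-play data fit in the replay window.
    return latest_data_entries / (games_num * positions_per_game)

# Current setting: the window holds less than one iteration of data.
print(iterations_covered(1_000, games_num=100, positions_per_game=12))    # ~0.8

# Option A: enlarge the window (and iteration_data_entries with it).
print(iterations_covered(100_000, games_num=100, positions_per_game=12))  # ~83

# Option B: fewer games per iteration, many more iterations.
print(iterations_covered(1_000, games_num=10, positions_per_game=12))     # ~8.3
```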

@barakugav
Collaborator Author

Nevertheless, this is super exciting! Can't wait to play against it.
