Starcraft: It’s all about Trees

Based on last week's discussion, I am testing how accurate a model can be when it uses purely in-game data that I can obtain both from replays and from a live stream. This includes Resources (Minerals and Gas), Current Population, Current Max Population, Race, GameID (to track data across one game), Current In-Game Time, and finally who wins. No image data is being processed here.

From the replays I collected 53k snapshots over 406 games. After removing snapshots with an in-game time under 60 seconds (because the beginning of the game is not interesting), I was left with 48k snapshots.
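A minimal sketch of that filtering step, assuming each snapshot is a dictionary. The field names (`game_id`, `time_s`, `minerals`, `gas`) and the sample values are made up for illustration and are not the real column names or data.

```python
# Hypothetical snapshot records; field names and values are illustrative only.
MIN_GAME_TIME_S = 60  # snapshots earlier than this are dropped

snapshots = [
    {"game_id": 1, "time_s": 15, "minerals": 50, "gas": 0},
    {"game_id": 1, "time_s": 60, "minerals": 210, "gas": 0},
    {"game_id": 1, "time_s": 300, "minerals": 480, "gas": 125},
]


def drop_early(rows, min_time_s=MIN_GAME_TIME_S):
    """Keep only snapshots taken at or after min_time_s seconds of game time."""
    return [row for row in rows if row["time_s"] >= min_time_s]


kept = drop_early(snapshots)
print(f"kept {len(kept)} of {len(snapshots)} snapshots")
```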

Next I spent time trying to find a model that works well with the data.

Running a grid search with SVM, I found the RBF kernel with C=1000 and gamma=0.001 to be the best parameters. I achieved an accuracy of ~66%, but there is waaaaaaay too much variance at the moment. For example:

Running AdaBoost with 100 estimators did much better with ~75% accuracy:

Note:

Both graphs were created with no normalization (scaling) of the data and with the GameID column kept in the dataset for training.
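The two experiments above could be sketched roughly as follows. This assumes scikit-learn, which the post does not name, and it runs on randomly generated stand-in data rather than the replay snapshots, so the accuracies it prints mean nothing. The parameter grid is also a guess; only the winning values (RBF, C=1000, gamma=0.001) and the 100 AdaBoost estimators come from the text.

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC

# Synthetic stand-in for the snapshot table (NOT the real replay data).
rng = np.random.default_rng(0)
n = 200
X = np.column_stack([
    rng.integers(0, 3000, n),   # minerals
    rng.integers(0, 2000, n),   # gas
    rng.integers(10, 200, n),   # current population
    rng.integers(15, 200, n),   # current max population
    rng.integers(0, 3, n),      # race, encoded as 0/1/2
    rng.integers(60, 1500, n),  # in-game time in seconds
]).astype(float)
y = rng.integers(0, 2, n)       # 1 = win, 0 = loss (random labels here)

# Grid search over an RBF SVM; the grid itself is an assumption.
param_grid = {
    "kernel": ["rbf"],
    "C": [1, 10, 100, 1000],
    "gamma": [0.01, 0.001, 0.0001],
}
search = GridSearchCV(SVC(), param_grid, cv=3)
search.fit(X, y)
print("best SVM params:", search.best_params_)

# AdaBoost with 100 estimators, scored with the same 3-fold cross-validation.
ada = AdaBoostClassifier(n_estimators=100, random_state=0)
ada_scores = cross_val_score(ada, X, y, cv=3)
print("AdaBoost fold accuracies:", ada_scores)
```

Printing the per-fold accuracies rather than only the mean is one way to keep an eye on the variance mentioned above.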

Questions:

With the above models, I am including the GameID, such that a single game with, say, 10 snapshots will have the same GameID on all of them. Without this column AdaBoost performs about 10% worse. What is the justification for keeping it other than "it gives me better accuracy"?
Currently nothing in the data is normalized. Should values such as resources or population be normalized over the whole dataset? Or should they only be normalized per game?
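To make the second question concrete, here is a small sketch of the two options using z-scores (subtract the mean, divide by the standard deviation). The helper names, the `game_id` and `minerals` field names, and the sample values are all hypothetical, and z-scoring is only one possible choice of normalization.

```python
from collections import defaultdict
from statistics import mean, pstdev


def zscore(values):
    """Return (v - mean) / std for each value; all zeros if std is 0."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def normalize_global(rows, column):
    """Scale one column using statistics from the whole dataset."""
    scaled = zscore([row[column] for row in rows])
    return [{**row, column: z} for row, z in zip(rows, scaled)]


def normalize_per_game(rows, column, group_key="game_id"):
    """Scale one column using statistics from each game separately."""
    indices_by_game = defaultdict(list)
    for i, row in enumerate(rows):
        indices_by_game[row[group_key]].append(i)
    out = [dict(row) for row in rows]
    for indices in indices_by_game.values():
        scaled = zscore([rows[i][column] for i in indices])
        for i, z in zip(indices, scaled):
            out[i][column] = z
    return out


# Made-up rows: game 2 simply has much larger mineral counts than game 1.
rows = [
    {"game_id": 1, "minerals": 100},
    {"game_id": 1, "minerals": 200},
    {"game_id": 1, "minerals": 300},
    {"game_id": 2, "minerals": 1000},
    {"game_id": 2, "minerals": 3000},
]
global_rows = normalize_global(rows, "minerals")
per_game_rows = normalize_per_game(rows, "minerals")
```

One thing to keep in mind: the per-game version needs the mean and standard deviation of the entire game, which are not known while a live stream is still in progress, so it may only be usable on replays unless running statistics are used instead.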

For Next Week:

Keep working on the models
Test the model over one game and see how the prediction does
Compare to the visual data classifier and look into how the two could be combined.
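For the single-game test, one option is to hold out whole games by GameID, so that no snapshot from a test game is ever seen in training. The sketch below assumes rows are dictionaries with a hypothetical `game_id` field; `split_by_game` is an illustrative helper, not something from an existing library. If the earlier experiments split snapshots at random (the post does not say), snapshots from one game could land on both sides of the split, and GameID might then let the model look up the winner, which may be part of why that column helps so much.

```python
import random


def split_by_game(rows, test_fraction=0.2, seed=0, group_key="game_id"):
    """Split rows into train/test so that each game lands entirely on one side."""
    game_ids = sorted({row[group_key] for row in rows})
    rng = random.Random(seed)
    rng.shuffle(game_ids)
    n_test = max(1, round(len(game_ids) * test_fraction))
    test_ids = set(game_ids[:n_test])
    train = [row for row in rows if row[group_key] not in test_ids]
    test = [row for row in rows if row[group_key] in test_ids]
    return train, test


# Made-up data: 10 games with 5 snapshots each.
rows = [
    {"game_id": game, "time_s": 60 * (step + 1)}
    for game in range(10)
    for step in range(5)
]
train_rows, test_rows = split_by_game(rows)
train_games = {row["game_id"] for row in train_rows}
test_games = {row["game_id"] for row in test_rows}
print(f"{len(train_games)} training games, {len(test_games)} held-out games")
```

Setting `test_fraction` low enough that only one game is held out gives the single-game test from the list above.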
