Abstracted Replay Data Parsers#119
Abstracted Replay Data Parsers#119cole-maclean wants to merge 23 commits intogoogle-deepmind:masterfrom
Conversation
|
Ok @tewalds , do you wanna take a look and see if this is on the right track? I've separated the replay_actions (this name should probably change, I'm thinking parse_replays) from the actual replay parsers, which can be found in the new folder "replay_parsers". There's a base_parser class that implements basic stats and methods needed to parse a replay. The key parsing function is parse_step, which takes as input from the ReplayParser obs, feat and info. Other optional overrides are valid_replay for users to write their own definitions of what constitutes a valid replay, and stats merging/ printing if they want custom printed stats. If the parser parse_step function returns anything, whatever is returned is appended to a data list at each step in the replay, and this data list is saved into a json data file for that specific replay&playerId. There's an optional data_dir argument for users to pass in for where to save these files, if their parser requires it. I've also included my state scrapping script I wrote for a project as an example of how to parse state info from a replay and save them to files. Let me know your thoughts, and if you think this is the right way to go I'll work on updating documentation. Could probably use some testing as well. Thanks! |
| flags.DEFINE_integer("screen_resolution", 16, | ||
| "Resolution for screen feature layers.") | ||
| flags.DEFINE_integer("minimap_resolution", 16, | ||
| "Resolution for minimap feature layers.") |
There was a problem hiding this comment.
Added resolution args to be consistent with play.py
pysc2/bin/replay_actions.py
Outdated
| flags.DEFINE_integer("minimap_resolution", 16, | ||
| "Resolution for minimap feature layers.") | ||
| flags.mark_flag_as_required("replays") | ||
| FLAGS(sys.argv) |
There was a problem hiding this comment.
This should be dealt with by app.run and shouldn't be explicitly needed.
| interface.feature_layer.resolution.y = FLAGS.screen_resolution | ||
| interface.feature_layer.minimap_resolution.x = FLAGS.minimap_resolution | ||
| interface.feature_layer.minimap_resolution.y = FLAGS.minimap_resolution | ||
|
|
There was a problem hiding this comment.
Copied from play.py to be consistent
| self.stage = "" | ||
| self.replay = "" | ||
| self.replay_stats = ReplayStats() | ||
| self.parser = parser_cls() |
There was a problem hiding this comment.
renamed replay_stats to parser, updated throughout code
| merge_dict(self.made_actions, other.made_actions) | ||
| self.crashing_replays |= other.crashing_replays | ||
| self.invalid_replays |= other.invalid_replays | ||
|
|
There was a problem hiding this comment.
merge moved to parser class
pysc2/bin/replay_actions.py
Outdated
| return | ||
| try: | ||
| replay_name = os.path.basename(replay_path)[:10] | ||
| replay_name = os.path.basename(replay_path) |
There was a problem hiding this comment.
changed 10 character truncating to full replay name
There was a problem hiding this comment.
The reason it was just the first 10 is that I was running mainly over replays that were sha1 named, so very long and the prefix gave enough uniqueness. I'm fine with showing the full as long as it's readable.
There was a problem hiding this comment.
Okay, I've left the full replay name but I'll leave it up to you to decide on readability. The only catch is that data files are uniquely named from their replay name and will be overwritten if there are colliding names.
pysc2/bin/replay_actions.py
Outdated
| replay_name, player_id)) | ||
| self.process_replay(controller, replay_data, map_data, | ||
| player_id) | ||
| player_id,info,replay_name) |
There was a problem hiding this comment.
passing info and replay_name to process_replay to allow user to parse info data, replay_name for file naming
| for ability_id in feat.available_actions(obs.observation): | ||
| self.stats.replay_stats.valid_actions[ability_id] += 1 | ||
|
|
||
| if obs.player_result: |
There was a problem hiding this comment.
action stats parsing moved to custom ActionParser class
pysc2/bin/replay_actions.py
Outdated
| if data: | ||
| data_file = FLAGS.data_dir + replay_name + "_" + str(player_id) + '.json' | ||
| with open(data_file,'w') as outfile: | ||
| json.dump(data,outfile) |
There was a problem hiding this comment.
Added to disk file saving at end of replay if custom parser parse_step returns data
pysc2/replay_parsers/state_parser.py
Outdated
| friendly_army,enemy_army,all_features['player'].tolist(), | ||
| all_features['available_actions'].tolist(),actions,winner, | ||
| race,enemy_race] | ||
| return full_state No newline at end of file |
There was a problem hiding this comment.
Returning from this function initiates a data list for each step that will be saved to a file for this replay and player perspective
pysc2/bin/replay_actions.py
Outdated
| flags.DEFINE_integer("step_mul", 8, "How many game steps per observation.") | ||
| flags.DEFINE_string("replays", None, "Path to a directory of replays.") | ||
| flags.DEFINE_string("parser", "pysc2.replay_parsers.base_parser.BaseParser", | ||
| "Which agent to run") |
pysc2/bin/replay_actions.py
Outdated
| flags.DEFINE_string("replays", None, "Path to a directory of replays.") | ||
| flags.DEFINE_string("parser", "pysc2.replay_parsers.base_parser.BaseParser", | ||
| "Which agent to run") | ||
| flags.DEFINE_string("data_dir", "C:/", |
There was a problem hiding this comment.
Please don't set a default value that will be wrong or invalid for a large number of users. C:/ doesn't exist on linux, and any default location will be wrong for someone.
There was a problem hiding this comment.
Okay, I've changed the default to be None, and data is only saved to a file if the user supplies a data_dir. Another option would be to use os.getcwd() and use the current working directory as the default directory.
pysc2/bin/replay_actions.py
Outdated
| return | ||
| try: | ||
| replay_name = os.path.basename(replay_path)[:10] | ||
| replay_name = os.path.basename(replay_path) |
There was a problem hiding this comment.
The reason it was just the first 10 is that I was running mainly over replays that were sha1 named, so very long and the prefix gave enough uniqueness. I'm fine with showing the full as long as it's readable.
pysc2/bin/replay_actions.py
Outdated
| replay_name, player_id)) | ||
| self.process_replay(controller, replay_data, map_data, | ||
| player_id) | ||
| player_id,info,replay_name) |
There was a problem hiding this comment.
Please use the google python style guide: spaces between args.
pysc2/bin/replay_actions.py
Outdated
| self.stats_queue.put(self.stats) | ||
|
|
||
| def process_replay(self, controller, replay_data, map_data, player_id): | ||
| def process_replay(self, controller, replay_data, map_data, player_id,info,replay_name): |
There was a problem hiding this comment.
more spaces for style guide
pysc2/replay_parsers/state_parser.py
Outdated
| def calc_armies(screen): | ||
| friendly_army = [] | ||
| enemy_army = [] | ||
| unit_list = np.unique(screen[6]) |
There was a problem hiding this comment.
Do not hardcode these numbers. Look them up in features.SCREEN_FEATURES. That's both for readability and so they don't break if we add new layers.
pysc2/replay_parsers/state_parser.py
Outdated
| class StateParser(base_parser.BaseParser): | ||
| """Action statistics parser for replays.""" | ||
| def valid_replay(self,info, ping): | ||
| return True |
There was a problem hiding this comment.
ahh yeup. Removed.
pysc2/replay_parsers/state_parser.py
Outdated
| # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
| # See the License for the specific language governing permissions and | ||
| # limitations under the License. | ||
| """Action statistics parser for replays.""" |
There was a problem hiding this comment.
Looking through this file it may be better to put it in your repo. The action stats is already a little bit complicated as an example. It'd be nice to have a simpler example. Some ideas:
- distribution of steps between actions, which could be useful for finding reasonable APMs and step_mul
- distribution of APM and MMRs to graph APM vs MMR
There was a problem hiding this comment.
Agreed, updated example to just scrape the General player information layer from each step, along with the winning player. Could be used to build a winner prediction model based on resource/supply/army size data as the game progresses.
| import collections | ||
| import six | ||
|
|
||
| class BaseParser(object): |
There was a problem hiding this comment.
Have you read about MapReduce? I haven't done enough research, but consider whether that would be a good framework to use here.
There was a problem hiding this comment.
Unfortunately no, not currently in my wheelhouse, but might be a good opportunity to dig into it a little.
pysc2/bin/replay_actions.py
Outdated
| flags.DEFINE_integer("minimap_resolution", 16, | ||
| "Resolution for minimap feature layers.") | ||
| flags.mark_flag_as_required("replays") | ||
| FLAGS(sys.argv) |
There was a problem hiding this comment.
This should be dealt with by app.run and shouldn't be explicitly needed.
| # Save scraped replay data to file at end of replay if parser returns | ||
| # and data_dir provided | ||
| if data: | ||
| if FLAGS.data_dir: |
There was a problem hiding this comment.
Data only saved to file if data_dir provided from user
|
Alright, @tewalds can you take a look when you get a chance? Think I hit all your comments. I also added some documentation about replays and parsers. |
|
I haven't looked at the recent changes, but it's probably also worth renaming replay_actions.py to something more general like process_replays.py or similar. Also, thanks for adding documentation. If you have a reasonable way to do it, some unit tests would be appreciated as well. |
|
Agreed, updated to process_replays. I'll put some thought to testing. |
…and caught in path doesn't exist exception
| self._print("Empty queue, returning") | ||
| return | ||
| try: | ||
| self.load_replay(replay_path, controller, ping) |
There was a problem hiding this comment.
replay loading refactored into own method 'load_replay'
| return | ||
|
|
||
| def load_replay(self, replay_path, controller, ping): | ||
| replay_name = os.path.basename(replay_path) |
There was a problem hiding this comment.
refactored load_replay method to modularize replay processing
|
Ok @tewalds I've added some tests, replay_parser_test. Probably not comprehensive, but should provide a good framework for building up future tests if needed, and follows pretty closely to the design of other test scripts in the repo. I think at this point this should close out a first pass for this functionality. We can discover and experiment with improvements (ie. MapReduce) in a future PR. Let me know if you have any feedback or see some snags! Thanks Timo. |
| merge_dict(self.made_actions, other.made_actions) | ||
|
|
||
| def valid_replay(self, info, ping): | ||
| """Make sure the replay isn't corrupt, and is worth looking at.""" |
There was a problem hiding this comment.
Changed valid replay conditions to be very low for test replay pass. The checks can stand as an example for users who wish to make valid_replay checks more aggressive.
PR to implement an abstraction of the replay_actions module to separate running the replays and the data scrapping as per issue #51 , similar to how running agents is abstracted to allow for arbitrary user written agents. This will likely be a fairly large redesign, so PR is currently a Work In Progress to get things started. Current commit is proof of concept of abstracting a parser class from the replay setup to run the action stats scrapping of the current replay_actions script.