Upgrade to C++20, update lc0 integration, and add one-game-per-file output #1
Draft
ContradNamiseb wants to merge 24 commits into
- Upgraded C++ standard from C++17 to C++20
- Updated lc0 source file paths to match new lc0 structure:
  - Removed lc0/src/chess/bitboard.cc (no longer needed)
  - Changed lc0/src/neural/writer.cc to lc0/src/trainingdata/writer.cc
  - Added lc0/src/utils/string.cc (fixes StrSplit linker error)
- Updated include directories (simplified paths, added root include)
- Set explicit Release build type
- Added absl/ library dependency
- Upgraded TrainingData from V4 to V6:
  - Replaced V4TrainingDataHashUtil.h with V6TrainingDataHashUtil.h
  - Updated related source files for V6 compatibility
- Updated .gitmodules for lc0 submodule
Made with Gemini and Claude Opus

- Modified EnqueueChunks() to write all training positions from a single game directly to its own .gz file
- Each game now gets its own chunk file (game_XXXXXX.gz)
- Removed the batching logic that combined multiple games into one file
- This ensures training data from different games is not mixed together
Made with Gemini and Claude

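The per-game naming scheme can be illustrated with a small sketch. The helper name `GameChunkFileName` is hypothetical, and the six-digit zero-padded counter is only an assumption read off the `game_XXXXXX.gz` pattern; the actual code in EnqueueChunks() may format names differently.

```cpp
#include <cstdio>
#include <string>

// Hypothetical helper (not taken from the PR): builds a per-game chunk name.
// Assumption: "XXXXXX" stands for a zero-padded, six-digit game counter.
std::string GameChunkFileName(unsigned game_index) {
  char buf[32];
  std::snprintf(buf, sizeof(buf), "game_%06u.gz", game_index);
  return std::string(buf);
}
```

Because each game maps to exactly one name, a writer following this scheme never needs to track how many games have been packed into the current file.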
- Replaced boost::hash_range and boost::hash_combine with lc0's HashCat() from utils/hashcat.h in V6TrainingDataHashUtil.h
- Removed find_package(Boost) and Boost_INCLUDE_DIRS from CMakeLists.txt
- This eliminates the external Boost dependency for easier building
Made with Gemini and Claude

- Added CI/CD workflow for Ubuntu (gcc, clang) and Windows (cl)
- Builds on push to master and pull requests
- Automatically creates pre-release with artifacts on master push
- No Boost dependency required (uses lc0 native HashCat)
Made with Gemini and Claude

- Copied proto/net.pb.h stub from lc0 to src/proto/ (it was untracked in lc0 submodule)
- Added 'src' to include_directories so proto/ can be found
- This fixes the 'proto/net.pb.h file not found' error on CI
Made with Gemini and Claude

- Updated CMake to use system zlib (via find_package) on Unix
- Bundled zlib only used on Windows now
- Added zlib1g-dev to CI Linux dependencies
- Added proper project() declaration to fix CMake warnings
- Fixes 'call to undeclared function' errors for lseek/read/write/close
Made with Gemini and Claude

- Added _CRT_SECURE_NO_WARNINGS to suppress deprecation warnings for strcpy, sprintf, fopen, etc.
- Added /Zc:strictStrings- to allow const char[] to char* conversion (required by polyglot's getopt.h)
Made with Gemini and Claude

- Added /FIarray to force include <array> header (missing in lc0 submodule)
- Added /permissive to relax conformance rules (helps with polyglot's legacy C code)
- Maintained /Zc:strictStrings- for char* conversions
Made with Gemini and Claude

- Moved /Zc:strictStrings- to polyglot-specific file properties
- Added global warning suppressions: /wd4996, /wd4267, /wd4244, /wd4390, /wd4018
- This fixes C2440 errors in polyglot's getopt.h and cleans up the build log
Made with Gemini and Claude

- Switch Windows CI matrix to use gcc/g++ (MinGW) and Ninja generator
- Enable verbose build logging
- Add MinGW compile flags for polyglot (-fpermissive) to fix const char* errors
Made with Gemini and Claude

- Use -iquote for polyglot sources on GCC/MinGW to prevent <getopt.h> from picking up polyglot/src/getopt.h
- This fixes 'undefined reference to getopt_internal' linker error on Windows
- MSVC continues to use standard include path as it requires the local getopt.h polyfill
Made with Gemini and Claude

- Restore polyglot/src to regular include_directories (iquote wasn't working)
- For MinGW: pre-define __GETOPT_H__ to prevent local getopt.h from being included when system unistd.h includes <getopt.h>
Made with Gemini and Claude

- Changed artifact upload to only include trainingdata-tool binary
- Fixes 'Failed to upload CMakeCache.txt' error in release step
Made with Gemini and Claude

- Fixed regex patterns: [%eval ...] now correctly escaped as \[%eval ...\]
- Added try-catch around std::stof to prevent crashes on malformed input
Made with Gemini and Claude

- Add robust SAN cleaning (whitespace, move numbers, comments, annotations)
- Add poly_move_to_uci() function for UCI notation conversion
- Change input format to INPUT_CLASSICAL_112_PLANE
- Improve result parsing with strcmp and fallback for unrecognized results
- Graceful illegal move handling (continue instead of break)
- Add best_move and visits parameters to get_v6_training_data
- Store Q values directly (relative to side-to-move)

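The SAN cleaning step could look roughly like the sketch below. `CleanSan` is a hypothetical name and the rules are an assumption based only on the four categories listed (whitespace, move numbers, comments, annotations); the PR's actual cleaning may handle more cases, such as numeric annotation glyphs.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical sketch: normalizes one SAN token before move parsing.
std::string CleanSan(const std::string& token) {
  std::string s;
  int brace_depth = 0;
  // Drop {...} comments and all whitespace.
  for (char c : token) {
    if (c == '{') { ++brace_depth; continue; }
    if (c == '}') { if (brace_depth > 0) --brace_depth; continue; }
    if (brace_depth > 0) continue;
    if (std::isspace(static_cast<unsigned char>(c))) continue;
    s.push_back(c);
  }
  // Drop a leading move number such as "12." or "12...".
  std::size_t i = 0;
  while (i < s.size() && std::isdigit(static_cast<unsigned char>(s[i]))) ++i;
  if (i > 0 && i < s.size() && s[i] == '.') {
    while (i < s.size() && s[i] == '.') ++i;
    s.erase(0, i);
  }
  // Drop trailing annotation marks such as "!", "?", or "!?".
  while (!s.empty() && (s.back() == '!' || s.back() == '?')) s.pop_back();
  return s;
}
```

Check and mate suffixes (`+`, `#`) and castling (`O-O`) are left untouched in this sketch, since they are part of the SAN move itself.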
- Implement StaticEvaluator with material balance (P=100, N=320, B=330, R=500, Q=900)
- Add piece-square tables for all piece types with MG/EG king tables
- Implement pawn structure analysis (doubled, isolated, passed pawns)
- Add simplified mobility scoring
- Use tapered evaluation (MG/EG interpolation based on game phase)
- Integrate into PGNGame for non-Lichess mode
- Use sigmoid function for cp-to-Q conversion

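The last two ideas, tapered evaluation and the sigmoid cp-to-Q mapping, can be sketched as follows. The names `TaperedScore` and `CentipawnsToQ`, the phase range of 0 to 24, and the scale of 400 centipawns are all assumptions chosen for illustration; the constants used by StaticEvaluator are not stated in the commit message.

```cpp
#include <algorithm>
#include <cmath>

// Assumed phase range: 0 = pure endgame, kMaxPhase = full middlegame material.
constexpr int kMaxPhase = 24;

// Linear interpolation between a middlegame (mg) and endgame (eg) score.
int TaperedScore(int mg, int eg, int phase) {
  phase = std::clamp(phase, 0, kMaxPhase);
  return (mg * phase + eg * (kMaxPhase - phase)) / kMaxPhase;
}

// Logistic mapping from centipawns to Q in (-1, 1).
// The scale parameter (assumed 400 here) controls how quickly Q saturates.
double CentipawnsToQ(double cp, double scale = 400.0) {
  return 2.0 / (1.0 + std::exp(-cp / scale)) - 1.0;
}
```

With this shape, an even position (0 cp) maps to Q = 0 and the mapping is odd-symmetric, so flipping the sign of the score flips the sign of Q.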
…NNIndex
- Validate all MoveToNNIndex results before indexing probabilities array
- Check legal moves loop indices
- Check played_idx before writing probabilities[played_idx] = 1.0
- Check best_idx before storing in result.best_idx
- Fallback to 0 or played_idx when index >= 1858
- Add debug warning for invalid indices
Fixes crash on Windows where MSVC uninitialized memory (0xCCCC = 52428) exceeds array bounds (1858 elements).

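The validation pattern described above can be sketched as a small guard. `CheckedPolicyIndex` and `MarkPlayedMove` are hypothetical names; only the array size of 1858 and the fallback-with-warning behavior come from the commit message.

```cpp
#include <array>
#include <cstddef>
#include <cstdio>

constexpr std::size_t kPolicySize = 1858;  // number of policy outputs

// Hypothetical guard: returns idx when it is a valid policy index,
// otherwise logs a warning and returns the supplied fallback.
std::size_t CheckedPolicyIndex(std::size_t idx, std::size_t fallback) {
  if (idx < kPolicySize) return idx;
  std::fprintf(stderr, "warning: policy index %zu out of range, using %zu\n",
               idx, fallback);
  return fallback;
}

// Example use: never write outside the probabilities array.
void MarkPlayedMove(std::array<float, kPolicySize>& probabilities,
                    std::size_t played_idx) {
  probabilities[CheckedPolicyIndex(played_idx, 0)] = 1.0f;
}
```

A guard like this hides the symptom rather than the cause, which is why the later commit that fixes the square conversion still matters.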
Root cause of Windows crash: move_from() and move_to() return polyglot 0x88 format square indices, but lczero::Square::FromIdx() expects 0-63.
- Added square_to_64() conversion before creating lczero::Square
- This was causing garbage indices (0xCCCC) and assertion failures
- Fixes: move 'e4' now correctly outputs as 'e2e4' (was showing 'b8a4')

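For illustration, a conversion of this kind might look like the sketch below. It assumes the plain 0x88 layout, where a square is encoded as `16 * rank + file` with a1 = 0; polyglot's real layout may include an offset, so the PR's square_to_64() should be treated as the reference. The names `IsOnBoard0x88` and `Square0x88To64` are hypothetical.

```cpp
// Assumption: plain 0x88 encoding, sq88 = 16 * rank + file, a1 = 0x00.

// In 0x88, any square with bit 0x08 or 0x80 set lies off the board.
constexpr bool IsOnBoard0x88(int sq88) { return (sq88 & 0x88) == 0; }

// Maps a 0x88 square to the dense 0-63 index (a1 = 0, h8 = 63).
constexpr int Square0x88To64(int sq88) {
  return (sq88 >> 4) * 8 + (sq88 & 7);
}

static_assert(Square0x88To64(0x14) == 12, "e2 maps to index 12");
static_assert(Square0x88To64(0x34) == 28, "e4 maps to index 28");
static_assert(Square0x88To64(0x77) == 63, "h8 maps to index 63");
```

Passing a raw 0x88 value such as 0x34 (decimal 52) where a 0-63 index is expected selects a different square entirely, which matches the kind of wrong-move output described above.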
Lc0's internal board representation is always from white's perspective. After ApplyMove(), Position::Mirror() is called to switch perspective. The move passed to PositionHistory::Append() must use absolute board squares, not flipped coordinates.
Fixes: assertion failure 'ourpieces.intersects(BitBoard::FromSquare(move.from()))' in board.cc:597

- Added is_black_move parameter to poly_move_to_lc0_move() function
- Regular moves for black are now flipped to match LC0's white perspective
- Castling moves are left unflipped as they use perspective-independent file representation
- Prevents assertion failure when applying black's moves to LC0 board
Fixes issue where training data tool crashed when processing games with black moves.

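The flipping rule can be sketched on plain 0-63 square indices (a1 = 0, h8 = 63). `SimpleMove`, `FlipRank`, and `ToWhitePerspective` are hypothetical stand-ins, not lc0 or polyglot types; the sketch only mirrors the behavior the commit message describes, with castling passed through unchanged.

```cpp
// Hypothetical stand-in for a move; not an lc0 type.
struct SimpleMove {
  int from;          // 0-63, a1 = 0
  int to;            // 0-63
  bool is_castling;
};

// Mirrors a square vertically: rank r becomes rank 7 - r, file unchanged.
constexpr int FlipRank(int sq64) { return sq64 ^ 56; }

// Flips black's regular moves; leaves white's moves and castling as-is.
constexpr SimpleMove ToWhitePerspective(SimpleMove m, bool is_black_move) {
  if (!is_black_move || m.is_castling) return m;
  return SimpleMove{FlipRank(m.from), FlipRank(m.to), m.is_castling};
}
```

Under this convention, black's e7e5 (52 to 36) becomes e2e4 (12 to 28), so the moving piece is found among the side-to-move's pieces.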
Removed the release job from the CI workflow.
Restores the release job that creates pre-release on each push to master.
- Fix bug in poly_move_to_lc0_move where normal moves and en-passant were incorrectly nested inside the castling check block
- Correctly set result_d for draws (1.0) and wins/losses (0.0)
- Initialize all V6 training data fields: played_q, played_d, played_m, root_d, best_d, root_m, best_m, policy_kld
- This ensures compatibility with the lc0 rescorer tool

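The result_d rule can be sketched with a strcmp-based parser. `ResultTargets` and `ParseResult` are hypothetical names, and reporting result_q from white's point of view is an assumption; the sketch flags unrecognized result strings rather than guessing what fallback the PR applies.

```cpp
#include <cstring>

// Hypothetical container for the two result targets.
struct ResultTargets {
  float result_q;   // assumed: +1 white win, -1 black win, 0 draw
  float result_d;   // 1.0 for draws, 0.0 for wins/losses
  bool recognized;  // false when the PGN result string is unknown
};

ResultTargets ParseResult(const char* result) {
  if (std::strcmp(result, "1-0") == 0) return {1.0f, 0.0f, true};
  if (std::strcmp(result, "0-1") == 0) return {-1.0f, 0.0f, true};
  if (std::strcmp(result, "1/2-1/2") == 0) return {0.0f, 1.0f, true};
  // Unknown result (for example "*"): the caller chooses the fallback.
  return {0.0f, 0.0f, false};
}
```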
Summary
This PR modernizes the trainingdata-tool by upgrading to C++20, updating the lc0 integration to work with the latest lc0 codebase, and improving the output format to write one game per chunk file.
Changes
Build System Updates
- Added lc0/src/utils/string.cc to fix linker error for StrSplit function
lc0 Integration Updates
- Removed lc0/src/chess/bitboard.cc (no longer needed)
- Changed lc0/src/neural/writer.cc to lc0/src/trainingdata/writer.cc
- Updated .gitmodules for lc0 submodule
- Added absl/ library dependency
Training Data Format Upgrade
- Replaced V4TrainingDataHashUtil.h with V6TrainingDataHashUtil.h
Output Format Improvement
- Modified TrainingDataWriter to write one game per chunk file
- Each game is written to its own .gz file
Testing
Made with Gemini and Claude
But don't merge just yet; I still have to manually verify this code.