Upgrade to C++20, update lc0 integration, and add one-game-per-file output #1

Draft
ContradNamiseb wants to merge 24 commits into CallOn84:master from Bonan14:master
ContradNamiseb commented Jan 7, 2026

Summary

This PR modernizes the trainingdata-tool by upgrading to C++20, updating the lc0 integration to work with the latest lc0 codebase, and improving the output format to write one game per chunk file.

Changes

Build System Updates

  • Upgraded C++ standard from C++17 to C++20
  • Set explicit Release build type
  • Simplified include directories
  • Added lc0/src/utils/string.cc to fix linker error for StrSplit function

lc0 Integration Updates

  • Updated lc0 source file paths to match latest lc0 structure:
    • Removed lc0/src/chess/bitboard.cc (no longer needed)
    • Changed lc0/src/neural/writer.cc to lc0/src/trainingdata/writer.cc
  • Updated .gitmodules for lc0 submodule
  • Added absl/ library dependency

Training Data Format Upgrade

  • Upgraded from V4 to V6 training data format
  • Replaced V4TrainingDataHashUtil.h with V6TrainingDataHashUtil.h
  • Updated related source files for V6 compatibility

Output Format Improvement

  • Modified TrainingDataWriter to write one game per chunk file
  • Each game's training positions are now isolated in their own .gz file
  • Removed batching logic that previously combined multiple games into one file

Testing

  • Successfully built on Linux with GCC
  • Verified output: Converting 5 games produces 5 separate files with varying sizes reflecting different game lengths

Made with Gemini and Claude

Please don't merge just yet; I still have to verify this code manually.

- Upgraded C++ standard from C++17 to C++20
- Updated lc0 source file paths to match new lc0 structure:
  - Removed lc0/src/chess/bitboard.cc (no longer needed)
  - Changed lc0/src/neural/writer.cc to lc0/src/trainingdata/writer.cc
  - Added lc0/src/utils/string.cc (fixes StrSplit linker error)
- Updated include directories (simplified paths, added root include)
- Set explicit Release build type
- Added absl/ library dependency
- Upgraded TrainingData from V4 to V6:
  - Replaced V4TrainingDataHashUtil.h with V6TrainingDataHashUtil.h
  - Updated related source files for V6 compatibility
- Updated .gitmodules for lc0 submodule

Made with Gemini and Claude Opus
- Modified EnqueueChunks() to write all training positions from a single
  game directly to its own .gz file
- Each game now gets its own chunk file (game_XXXXXX.gz)
- Removed the batching logic that combined multiple games into one file
- This ensures training data from different games is not mixed together
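
As a rough illustration of the per-game naming above, the sketch below builds a `game_XXXXXX.gz` file name from a game counter. `ChunkFileName` is a hypothetical helper, not a function from this PR, and the six-digit zero padding is only inferred from the `game_XXXXXX.gz` pattern in the commit message.

```cpp
#include <cstddef>
#include <iomanip>
#include <sstream>
#include <string>

// Hypothetical helper (not from the PR): builds the per-game chunk file name.
// Assumes six-digit zero padding, inferred from the "game_XXXXXX.gz" pattern.
std::string ChunkFileName(std::size_t game_index) {
  std::ostringstream name;
  name << "game_" << std::setw(6) << std::setfill('0') << game_index << ".gz";
  return name.str();
}
```

With this scheme, game 42 would be written to `game_000042.gz`; indices longer than six digits are simply not padded.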

Made with Gemini and Claude
ContradNamiseb marked this pull request as draft on January 7, 2026, 22:08
- Replaced boost::hash_range and boost::hash_combine with lc0's HashCat()
  from utils/hashcat.h in V6TrainingDataHashUtil.h
- Removed find_package(Boost) and Boost_INCLUDE_DIRS from CMakeLists.txt
- This eliminates the external Boost dependency for easier building
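
To show the general shape of a Boost-free replacement for `hash_combine` / `hash_range`, here is a minimal sketch. It is not lc0's `HashCat()` implementation: `Mix`, `CombineHash`, and `HashRange` are hypothetical names, and the mixer is the well-known splitmix64 finalizer, chosen only for illustration.

```cpp
#include <cstdint>
#include <initializer_list>

// Illustrative 64-bit mixer (splitmix64 finalizer). NOT lc0's actual hash.
inline std::uint64_t Mix(std::uint64_t x) {
  x = (x ^ (x >> 30)) * 0xbf58476d1ce4e5b9ULL;
  x = (x ^ (x >> 27)) * 0x94d049bb133111ebULL;
  return x ^ (x >> 31);
}

// Folds one value into a running hash, in the spirit of boost::hash_combine.
inline std::uint64_t CombineHash(std::uint64_t seed, std::uint64_t value) {
  return seed ^ (Mix(value) + 0x9e3779b97f4a7c15ULL + (seed << 6) + (seed >> 2));
}

// Order-sensitive hash of a sequence, in the spirit of boost::hash_range.
inline std::uint64_t HashRange(std::initializer_list<std::uint64_t> values) {
  std::uint64_t seed = 0;
  for (std::uint64_t v : values) seed = CombineHash(seed, v);
  return seed;
}
```

The point of the pattern is that the same sequence always hashes to the same value while a reordered sequence almost certainly does not, which is what a position or training-record hash needs.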

Made with Gemini and Claude
- Added CI/CD workflow for Ubuntu (gcc, clang) and Windows (cl)
- Builds on push to master and pull requests
- Automatically creates pre-release with artifacts on master push
- No Boost dependency required (uses lc0 native HashCat)

Made with Gemini and Claude
- Copied proto/net.pb.h stub from lc0 to src/proto/ (it was untracked
  in lc0 submodule)
- Added 'src' to include_directories so proto/ can be found
- This fixes the 'proto/net.pb.h file not found' error on CI

Made with Gemini and Claude
- Updated CMake to use system zlib (via find_package) on Unix
- The bundled zlib is now used only on Windows
- Added zlib1g-dev to CI Linux dependencies
- Added proper project() declaration to fix CMake warnings
- Fixes 'call to undeclared function' errors for lseek/read/write/close

Made with Gemini and Claude
- Added _CRT_SECURE_NO_WARNINGS to suppress deprecation warnings for
  strcpy, sprintf, fopen, etc.
- Added /Zc:strictStrings- to allow const char[] to char* conversion
  (required by polyglot's getopt.h)

Made with Gemini and Claude
- Added /FIarray to force include <array> header (missing in lc0 submodule)
- Added /permissive to relax conformance rules (helps with polyglot's legacy C code)
- Maintained /Zc:strictStrings- for char* conversions

Made with Gemini and Claude
- Moved /Zc:strictStrings- to polyglot-specific file properties
- Added global warning suppressions: /wd4996, /wd4267, /wd4244, /wd4390, /wd4018
- This fixes C2440 errors in polyglot's getopt.h and cleans up the build log

Made with Gemini and Claude
- Switch Windows CI matrix to use gcc/g++ (MinGW) and Ninja generator
- Enable verbose build logging
- Add MinGW compile flags for polyglot (-fpermissive) to fix const char* errors

Made with Gemini and Claude
- Use -iquote for polyglot sources on GCC/MinGW to prevent <getopt.h> from picking up polyglot/src/getopt.h
- This fixes 'undefined reference to getopt_internal' linker error on Windows
- MSVC continues to use standard include path as it requires the local getopt.h polyfill

Made with Gemini and Claude
- Restore polyglot/src to regular include_directories (-iquote wasn't working)
- For MinGW: pre-define __GETOPT_H__ to prevent local getopt.h from being
  included when system unistd.h includes <getopt.h>

Made with Gemini and Claude
- Changed artifact upload to only include trainingdata-tool binary
- Fixes 'Failed to upload CMakeCache.txt' error in release step

Made with Gemini and Claude
- Fixed regex patterns: [%eval ...] now correctly escaped as \[%eval ...\]
- Added try-catch around std::stof to prevent crashes on malformed input

Made with Gemini and Claude
- Add robust SAN cleaning (whitespace, move numbers, comments, annotations)
- Add poly_move_to_uci() function for UCI notation conversion
- Change input format to INPUT_CLASSICAL_112_PLANE
- Improve result parsing with strcmp and fallback for unrecognized results
- Graceful illegal move handling (continue instead of break)
- Add best_move and visits parameters to get_v6_training_data
- Store Q values directly (relative to side-to-move)
- Implement StaticEvaluator with material balance (P=100, N=320, B=330, R=500, Q=900)
- Add piece-square tables for all piece types with MG/EG king tables
- Implement pawn structure analysis (doubled, isolated, passed pawns)
- Add simplified mobility scoring
- Use tapered evaluation (MG/EG interpolation based on game phase)
- Integrate into PGNGame for non-Lichess mode
- Use sigmoid function for cp-to-Q conversion
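
For readers unfamiliar with tapered evaluation and the cp-to-Q step, a minimal sketch follows. The names `TaperedScore` and `CpToQ`, the phase scale of 24, and the scaling constant of 400 are all assumptions made for illustration; the PR does not state the constants it uses.

```cpp
#include <algorithm>
#include <cmath>

// Assumed phase scale: 24 = full middlegame material, 0 = bare endgame.
constexpr int kMaxPhase = 24;

// Linear interpolation between a middlegame and an endgame score (centipawns).
int TaperedScore(int mg_score, int eg_score, int phase) {
  phase = std::clamp(phase, 0, kMaxPhase);
  return (mg_score * phase + eg_score * (kMaxPhase - phase)) / kMaxPhase;
}

// Logistic squashing of a centipawn score into Q in (-1, 1).
// kCpScale is an assumed constant, not a value taken from the PR.
double CpToQ(int cp) {
  constexpr double kCpScale = 400.0;
  return 2.0 / (1.0 + std::exp(-cp / kCpScale)) - 1.0;
}
```

A score of 0 cp maps to Q = 0, and large advantages saturate toward +1 or -1, which keeps the static evaluation on the same scale as the Q values stored in the training data.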
…NNIndex

- Validate all MoveToNNIndex results before indexing probabilities array
- Check legal moves loop indices
- Check played_idx before writing probabilities[played_idx] = 1.0
- Check best_idx before storing in result.best_idx
- Fallback to 0 or played_idx when index >= 1858
- Add debug warning for invalid indices

Fixes a crash on Windows where the MSVC uninitialized-memory fill pattern
(0xCCCC = 52428) is used as an index and exceeds the array bounds (1858 elements).
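
The validation described above might look roughly like this. `CheckedPolicyIndex` and `MarkPlayedMove` are hypothetical helpers, not the PR's code; only the array size of 1858 and the fallback idea come from the commit message.

```cpp
#include <array>
#include <cstddef>

constexpr std::size_t kPolicySize = 1858;  // policy array size from the commit

// Hypothetical helper: returns idx when it is a valid policy index,
// otherwise the caller-supplied fallback (for example 0 or played_idx).
inline std::size_t CheckedPolicyIndex(std::size_t idx, std::size_t fallback) {
  return idx < kPolicySize ? idx : fallback;
}

// Hypothetical helper: marks the played move only if its index is in range.
// Returns false (and leaves the array untouched) for an invalid index.
inline bool MarkPlayedMove(std::array<float, kPolicySize>& probabilities,
                           std::size_t played_idx) {
  if (played_idx >= kPolicySize) return false;
  probabilities[played_idx] = 1.0f;
  return true;
}
```

A guard like this turns an out-of-bounds write into a recoverable condition, though it only masks the symptom if the index is garbage to begin with.
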
Root cause of Windows crash: move_from() and move_to() return polyglot
0x88 format square indices, but lczero::Square::FromIdx() expects 0-63.

- Added square_to_64() conversion before creating lczero::Square
- This was causing garbage indices (0xCCCC) and assertion failures
- Fixes: the move 'e4' is now correctly output as 'e2e4' (previously it was shown as 'b8a4')
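
A square conversion of this kind can be sketched as below, assuming the textbook 0x88 layout (index = rank * 16 + file). The mailbox layout in the actual polyglot sources may use a different offset, in which case polyglot's own `square_to_64()` is the authority; `SquareTo64From0x88` is a hypothetical name used only here.

```cpp
// Textbook 0x88 layout: index = rank * 16 + file, on board iff (sq & 0x88) == 0.
constexpr bool IsValid0x88(int sq) {
  return sq >= 0 && sq < 128 && (sq & 0x88) == 0;
}

// Hypothetical helper: maps a 0x88 square to 0..63 (a1 = 0, h8 = 63).
// Returns -1 for off-board input instead of producing a garbage index.
constexpr int SquareTo64From0x88(int sq) {
  return IsValid0x88(sq) ? ((sq >> 4) * 8 + (sq & 7)) : -1;
}
```

Under this layout e2 is 0x14 and maps to 12, and e4 is 0x34 and maps to 28, so the move comes out as e2e4.
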
Lc0's internal board representation is always from white's perspective.
After ApplyMove(), Position::Mirror() is called to switch perspective.
The move passed to PositionHistory::Append() must use absolute board
squares, not flipped coordinates.

Fixes: assertion failure 'ourpieces.intersects(BitBoard::FromSquare(move.from()))' in board.cc:597
- Added is_black_move parameter to poly_move_to_lc0_move() function
- Regular moves for black are now flipped to match LC0's white perspective
- Castling moves are left unflipped as they use perspective-independent file representation
- Prevents assertion failure when applying black's moves to LC0 board
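
The flip for regular black moves can be sketched as a vertical mirror of both squares. This assumes a 0..63 indexing with a1 = 0 and h8 = 63; `FromTo`, `FlipRank`, and `ToSideToMovePerspective` are hypothetical names, and the castling exception mentioned above is not handled here.

```cpp
// Minimal move representation for the sketch: 0..63 squares, a1 = 0, h8 = 63.
struct FromTo {
  int from;
  int to;
};

// Mirrors a square vertically: rank 1 <-> rank 8, file unchanged.
constexpr int FlipRank(int sq64) { return sq64 ^ 56; }

// Hypothetical helper: flips both squares when black is to move, so the move
// is expressed from the side-to-move perspective. Castling is out of scope.
constexpr FromTo ToSideToMovePerspective(FromTo mv, bool is_black_move) {
  return is_black_move ? FromTo{FlipRank(mv.from), FlipRank(mv.to)} : mv;
}
```

For example, black's e7e5 (52 to 36) becomes 12 to 28, which is e2e4 on the mirrored board.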

Fixes an issue where the training data tool crashed when processing games with black moves.
Removed the release job from the CI workflow.
Restores the release job that creates pre-release on each push to master.
- Fix bug in poly_move_to_lc0_move where normal moves and en-passant were incorrectly nested inside the castling check block
- Correctly set result_d for draws (1.0) and wins/losses (0.0)
- Initialize all V6 training data fields: played_q, played_d, played_m, root_d, best_d, root_m, best_m, policy_kld
- This ensures compatibility with the lc0 rescorer tool
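
The `result_d` rule above can be illustrated with a small sketch. `GameResult`, `ResultTargets`, and `MakeResultTargets` are hypothetical names; the `result_d` values come from the commit message, while the sign convention for `result_q` (from the side to move) is an assumption.

```cpp
enum class GameResult { kWhiteWin, kBlackWin, kDraw };

struct ResultTargets {
  float result_q;  // assumed: +1 win, -1 loss, 0 draw, from the side to move
  float result_d;  // 1.0 for draws, 0.0 for wins/losses (per the commit)
};

// Hypothetical helper: derives the per-position result targets from the
// final game result and the side to move in that position.
inline ResultTargets MakeResultTargets(GameResult result, bool black_to_move) {
  if (result == GameResult::kDraw) return {0.0f, 1.0f};
  const bool white_won = (result == GameResult::kWhiteWin);
  const bool mover_won = (white_won != black_to_move);
  return {mover_won ? 1.0f : -1.0f, 0.0f};
}
```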