Skip to content

Fix CodeQL SM02986: char* to wchar_t* cast warning in PythonParam#86

Open
monamaki wants to merge 1 commit intomainfrom
dev/monamaki/codeQLFix100
Open

Fix CodeQL SM02986: char* to wchar_t* cast warning in PythonParam#86
monamaki wants to merge 1 commit intomainfrom
dev/monamaki/codeQLFix100

Conversation

@monamaki
Copy link
Copy Markdown
Contributor

@monamaki monamaki commented Apr 17, 2026

Summary

Addresses CodeQL alert [SM02986] (Cast from char* to wchar_t*) in PythonStringParam<wchar_t>::RetrieveValueAndStrLenInd by adding a suppression comment with justification and a compile-time safety guard.

Problem

CodeQL flags the reinterpret_cast<CharType *>(utf16str) in PythonParam.cpp as potentially unsafe — casting a byte string (char*) to a wide-character string (wchar_t*) can lead to alignment issues, incorrect termination, or buffer overruns.

Analysis

This cast is safe for the following reasons:

  1. Valid contentPyUnicode_AsUTF16String produces valid UTF-16 code units in native byte order. The data is inherently wchar_t-compatible. (CPython docs)
  2. Bounded length — The read length is determined by PyUnicode_GET_LENGTH (character count), not null termination, so there is no risk of buffer overrun.

Changes

  • Alignment: Added static_assert(sizeof(CharType) == sizeof(wchar_t), ...) to enforce at compile time that the template is only instantiated with a type matching wchar_t size.
  • Added an inline CodeQL suppression comment (// CodeQL [SM02986]: ...) with a justification.
  • Added a documentation comment block explaining why the buffer content is valid UTF-16, with a reference to the CPython docs.

Files Changed

  • language-extensions/python/src/PythonParam.cpp: comments + static_assert only; no functional change)

Testing

  • No functional code change: only comments and a compile-time assertion added.
  • Consumed the artifact of this code in the engine dev branch and tests passed. There are parameter data types test coverage that cover nvarchar for Python.

Risk

None — no runtime behavior change.

@monamaki monamaki changed the title codeql for cast Fix CodeQL SM02986: char* to wchar_t* cast warning in PythonParam Apr 18, 2026
@monamaki monamaki marked this pull request as ready for review April 18, 2026 23:45
Copilot AI review requested due to automatic review settings April 18, 2026 23:45
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Updates PythonStringParam<wchar_t>::RetrieveValueAndStrLenInd to address a CodeQL warning around reinterpreting UTF-16 bytes as a wide-character pointer by adding explanatory comments, a CodeQL suppression, and a compile-time assertion.

Changes:

  • Added explanatory comments describing PyUnicode_AsUTF16String UTF-16/BOM behavior.
  • Added a static_assert intended as a compile-time safety guard.
  • Added an inline CodeQL suppression comment for SM02986 with justification.
Comments suppressed due to low confidence (1)

language-extensions/python/src/PythonParam.cpp:357

  • PyUnicode_GET_LENGTH returns the number of Unicode code points, not the number of UTF-16 code units. For non-BMP characters (surrogate pairs), size will be smaller than the number of 16-bit units in the PyUnicode_AsUTF16String buffer, so vector<CharType>(wData + 1, wData + 1 + size) will truncate/produce invalid UTF-16. If the intent is to copy the exact UTF-16 payload, compute the element count from the returned bytes length (minus BOM and the trailing UTF-16 NUL) rather than using PyUnicode_GET_LENGTH.
					CharType *wData = reinterpret_cast<CharType *>(utf16str); // CodeQL [SM02986]: The buffer is properly aligned (divisible by 2), already contains real UTF-16 data (SQL NVARCHAR), and we know its exact length (not relying on null termination); so treating it as wchar_t* is safe.

					// Ignore 2 byte BOM at front of wData that was added by PyUnicode_AsUTF16String
					//
					m_value = vector<CharType>(wData + 1, wData + 1 + size);

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +350 to +353
// We treat it as an array of 16-bit code units (CharType expected to be the size of wchar_t).
//
CharType *wData = reinterpret_cast<CharType *>(utf16str);
static_assert(sizeof(CharType) == sizeof(wchar_t), "CharType must match wchar_t size for UTF-16 reinterpretation.");
CharType *wData = reinterpret_cast<CharType *>(utf16str); // CodeQL [SM02986]: The buffer is properly aligned (divisible by 2), already contains real UTF-16 data (SQL NVARCHAR), and we know its exact length (not relying on null termination); so treating it as wchar_t* is safe.
Copy link

Copilot AI Apr 18, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

static_assert(sizeof(CharType) == sizeof(wchar_t)) is effectively a no-op here (this template is explicitly instantiated with CharType = wchar_t), so it doesn’t enforce the UTF-16 assumption that each code unit is 2 bytes. Since the code skips the BOM via wData + 1 and treats the buffer as UTF-16, the guard should instead assert the actual required invariant (e.g., sizeof(wchar_t) == 2 / sizeof(CharType) == 2, or CharType is exactly the intended 16-bit type).

Copilot uses AI. Check for mistakes.
Comment on lines 345 to +352
char *utf16str = PyBytes_AsString(PyUnicode_AsUTF16String(tempObj.ptr()));

// Reinterpret the bytes as wchar_t *, which we will return.
// The buffer contains UTF-16 code units in native byte order. The string always starts with a BOM mark.
// (https://docs.python.org/3/c-api/unicode.html#c.PyUnicode_AsUTF16String)
// We treat it as an array of 16-bit code units (CharType expected to be the size of wchar_t).
//
CharType *wData = reinterpret_cast<CharType *>(utf16str);
static_assert(sizeof(CharType) == sizeof(wchar_t), "CharType must match wchar_t size for UTF-16 reinterpretation.");
Copy link

Copilot AI Apr 18, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PyUnicode_AsUTF16String returns a new reference; calling it inline inside PyBytes_AsString(...) leaks that PyObject* and also prevents any error handling. If PyUnicode_AsUTF16String fails (e.g., MemoryError) it returns nullptr, and PyBytes_AsString(nullptr) is undefined/crash. Store the returned bytes object in a temporary, validate it’s non-null (and that PyBytes_AsString succeeds), copy out the needed data, then Py_DECREF the temporary.

Copilot uses AI. Check for mistakes.
//
CharType *wData = reinterpret_cast<CharType *>(utf16str);
static_assert(sizeof(CharType) == sizeof(wchar_t), "CharType must match wchar_t size for UTF-16 reinterpretation.");
CharType *wData = reinterpret_cast<CharType *>(utf16str); // CodeQL [SM02986]: The buffer is properly aligned (divisible by 2), already contains real UTF-16 data (SQL NVARCHAR), and we know its exact length (not relying on null termination); so treating it as wchar_t* is safe.
Copy link

Copilot AI Apr 18, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The CodeQL suppression justification is factually shaky: “aligned (divisible by 2)” conflates buffer length with pointer alignment, and the safety argument hinges on details (Python allocator alignment, aliasing/type-punning rules) that aren’t captured here. Consider either (a) adjusting the comment to describe the real alignment guarantee you rely on, or (b) avoiding the reinterpret_cast entirely by copying/decoding the bytes into a vector<CharType> without type-punning, which would eliminate the need for the suppression.

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants