Fix CodeQL SM02986: char* to wchar_t* cast warning in PythonParam#86
Fix CodeQL SM02986: char* to wchar_t* cast warning in PythonParam#86
Conversation
There was a problem hiding this comment.
Pull request overview
Updates PythonStringParam<wchar_t>::RetrieveValueAndStrLenInd to address a CodeQL warning around reinterpreting UTF-16 bytes as a wide-character pointer by adding explanatory comments, a CodeQL suppression, and a compile-time assertion.
Changes:
- Added explanatory comments describing
PyUnicode_AsUTF16StringUTF-16/BOM behavior. - Added a
static_assertintended as a compile-time safety guard. - Added an inline CodeQL suppression comment for SM02986 with justification.
Comments suppressed due to low confidence (1)
language-extensions/python/src/PythonParam.cpp:357
PyUnicode_GET_LENGTHreturns the number of Unicode code points, not the number of UTF-16 code units. For non-BMP characters (surrogate pairs),sizewill be smaller than the number of 16-bit units in thePyUnicode_AsUTF16Stringbuffer, sovector<CharType>(wData + 1, wData + 1 + size)will truncate/produce invalid UTF-16. If the intent is to copy the exact UTF-16 payload, compute the element count from the returned bytes length (minus BOM and the trailing UTF-16 NUL) rather than usingPyUnicode_GET_LENGTH.
CharType *wData = reinterpret_cast<CharType *>(utf16str); // CodeQL [SM02986]: The buffer is properly aligned (divisible by 2), already contains real UTF-16 data (SQL NVARCHAR), and we know its exact length (not relying on null termination); so treating it as wchar_t* is safe.
// Ignore 2 byte BOM at front of wData that was added by PyUnicode_AsUTF16String
//
m_value = vector<CharType>(wData + 1, wData + 1 + size);
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| // We treat it as an array of 16-bit code units (CharType expected to be the size of wchar_t). | ||
| // | ||
| CharType *wData = reinterpret_cast<CharType *>(utf16str); | ||
| static_assert(sizeof(CharType) == sizeof(wchar_t), "CharType must match wchar_t size for UTF-16 reinterpretation."); | ||
| CharType *wData = reinterpret_cast<CharType *>(utf16str); // CodeQL [SM02986]: The buffer is properly aligned (divisible by 2), already contains real UTF-16 data (SQL NVARCHAR), and we know its exact length (not relying on null termination); so treating it as wchar_t* is safe. |
There was a problem hiding this comment.
static_assert(sizeof(CharType) == sizeof(wchar_t)) is effectively a no-op here (this template is explicitly instantiated with CharType = wchar_t), so it doesn’t enforce the UTF-16 assumption that each code unit is 2 bytes. Since the code skips the BOM via wData + 1 and treats the buffer as UTF-16, the guard should instead assert the actual required invariant (e.g., sizeof(wchar_t) == 2 / sizeof(CharType) == 2, or CharType is exactly the intended 16-bit type).
| char *utf16str = PyBytes_AsString(PyUnicode_AsUTF16String(tempObj.ptr())); | ||
|
|
||
| // Reinterpret the bytes as wchar_t *, which we will return. | ||
| // The buffer contains UTF-16 code units in native byte order. The string always starts with a BOM mark. | ||
| // (https://docs.python.org/3/c-api/unicode.html#c.PyUnicode_AsUTF16String) | ||
| // We treat it as an array of 16-bit code units (CharType expected to be the size of wchar_t). | ||
| // | ||
| CharType *wData = reinterpret_cast<CharType *>(utf16str); | ||
| static_assert(sizeof(CharType) == sizeof(wchar_t), "CharType must match wchar_t size for UTF-16 reinterpretation."); |
There was a problem hiding this comment.
PyUnicode_AsUTF16String returns a new reference; calling it inline inside PyBytes_AsString(...) leaks that PyObject* and also prevents any error handling. If PyUnicode_AsUTF16String fails (e.g., MemoryError) it returns nullptr, and PyBytes_AsString(nullptr) is undefined/crash. Store the returned bytes object in a temporary, validate it’s non-null (and that PyBytes_AsString succeeds), copy out the needed data, then Py_DECREF the temporary.
| // | ||
| CharType *wData = reinterpret_cast<CharType *>(utf16str); | ||
| static_assert(sizeof(CharType) == sizeof(wchar_t), "CharType must match wchar_t size for UTF-16 reinterpretation."); | ||
| CharType *wData = reinterpret_cast<CharType *>(utf16str); // CodeQL [SM02986]: The buffer is properly aligned (divisible by 2), already contains real UTF-16 data (SQL NVARCHAR), and we know its exact length (not relying on null termination); so treating it as wchar_t* is safe. |
There was a problem hiding this comment.
The CodeQL suppression justification is factually shaky: “aligned (divisible by 2)” conflates buffer length with pointer alignment, and the safety argument hinges on details (Python allocator alignment, aliasing/type-punning rules) that aren’t captured here. Consider either (a) adjusting the comment to describe the real alignment guarantee you rely on, or (b) avoiding the reinterpret_cast entirely by copying/decoding the bytes into a vector<CharType> without type-punning, which would eliminate the need for the suppression.
Summary
Addresses CodeQL alert [SM02986] (
Cast from char* to wchar_t*) inPythonStringParam<wchar_t>::RetrieveValueAndStrLenIndby adding a suppression comment with justification and a compile-time safety guard.Problem
CodeQL flags the
reinterpret_cast<CharType *>(utf16str)inPythonParam.cppas potentially unsafe — casting a byte string (char*) to a wide-character string (wchar_t*) can lead to alignment issues, incorrect termination, or buffer overruns.Analysis
This cast is safe for the following reasons:
PyUnicode_AsUTF16Stringproduces valid UTF-16 code units in native byte order. The data is inherentlywchar_t-compatible. (CPython docs)PyUnicode_GET_LENGTH(character count), not null termination, so there is no risk of buffer overrun.Changes
static_assert(sizeof(CharType) == sizeof(wchar_t), ...)to enforce at compile time that the template is only instantiated with a type matchingwchar_tsize.// CodeQL [SM02986]: ...) with a justification.Files Changed
language-extensions/python/src/PythonParam.cpp: comments +static_assertonly; no functional change)Testing
Risk
None — no runtime behavior change.