gh-80667: Fix Tangut ideographs names in unicodedata by wismill · Pull Request #101585 · python/cpython

wismill · 2023-02-05T19:11:34Z

Fix Tangut ideographs names in unicodedata.

Partially fixes #80667.

Issue: Bugs and inconsistencies in unicodedata #80667

ghost · 2023-02-05T19:11:59Z

All commit authors signed the Contributor License Agreement.

arhadthedev · 2023-02-05T19:17:13Z

@malemburg, @ezio-melotti as Unicode experts.

wismill · 2023-02-22T14:53:03Z

@malemburg @ezio-melotti kind reminder

wismill · 2023-03-24T16:08:04Z

@malemburg @ezio-melotti @arhadthedev kind reminder

arhadthedev · 2023-03-25T18:51:32Z

Another attempt: @corona10, as a developer who has worked on the Unicode part of Python via Asian scripts.

SnoopJ

I do not have review authority on CPython, but since this PR has been sitting for a few months, I thought I'd chime in to say it looks good to me in terms of functionality. I just discovered this PR because I just wrote some downstream code to work around the lack of support for this range.

The additional test looks a little complicated to me; it may be duplicating functionality from makeunicodedata.py. The code added to the runtime, though, pretty much follows the lead of the CJK handling that already exists.

Tools/unicode/makeunicodedata.py

Lib/test/test_unicodedata.py

SnoopJ · 2023-07-26T02:46:25Z

Modules/unicodedata.c

+        while (namelen--) {
+            v *= 16;
+            if (*name >= '0' && *name <= '9')
+                v += *name - '0';
+            else if (*name >= 'A' && *name <= 'F')
+                v += *name - 'A' + 10;
+            else
+                return 0;
+            name++;
+        }


I'm a little unsettled that this loop is duplicated from above, but I don't see a better way to do it aside from maybe some preprocessor abuse.

wismill · 2023-07-26 (reply)

Yeah, I agree. But since this PR has received very little attention, I am not going to dedicate time to this unless there is a chance of it being merged.
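One way to picture the refactoring discussed here is a single shared helper that both call sites would use. The sketch below is in Python for brevity and is not the C code from the PR; `parse_upper_hex` is a hypothetical name, and returning `None` for an empty string is an assumption (the C loop signals failure by returning 0 from the enclosing function).

```python
from typing import Optional


def parse_upper_hex(digits: str) -> Optional[int]:
    """Hypothetical helper mirroring the duplicated C loop.

    Accepts only 0-9 and uppercase A-F, like the loop in the diff.
    Returns None on any other character (assumption: also on empty input).
    """
    if not digits:
        return None
    value = 0
    for ch in digits:
        value *= 16
        if "0" <= ch <= "9":
            value += ord(ch) - ord("0")
        elif "A" <= ch <= "F":
            value += ord(ch) - ord("A") + 10
        else:
            # Lowercase hex digits and anything else are rejected.
            return None
    return value


print(parse_upper_hex("172FD"))
print(parse_upper_hex("172fd"))
```

In C the same idea would most likely be a small static function called from both branches, which avoids any preprocessor tricks.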

wismill · 2023-07-26T04:52:47Z

@SnoopJ thank you for your review, I resolved 2 items.

I am quite demotivated because no CPython reviewer has been able to review this in close to half a year.

@malemburg @ezio-melotti @arhadthedev @corona10 kind reminder 🙏

rcgale · 2023-07-26T20:41:46Z

If it wasn't already known, unicodedata.name(...) throws a ValueError for these ideographs. Seems like a worthwhile effort to merge this PR if it fixes that!

Example (tried in Python 3.10.8):

import unicodedata

try:
    tangut_ideograph_172fd = b"\xf0\x97\x81\x83".decode("utf8")
    unicodedata.name(tangut_ideograph_172fd)
except ValueError as e:
    raise ValueError(f"Couldn't handle Tangut ideograph {tangut_ideograph_172fd}") from e

Output:

Traceback (most recent call last):
  File "<input>", line 5, in <module>
ValueError: no such name

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Users/robert/miniforge3/envs/wikigit/lib/python3.10/code.py", line 90, in runcode
    exec(code, self.locals)
  File "<input>", line 7, in <module>
ValueError: Couldn't handle Tangut ideograph 𗁃

SnoopJ · 2023-07-26T21:06:38Z

> If it wasn't already known, unicodedata.name(...) throws a ValueError for these ideographs. Seems like a worthwhile effort to merge this PR if it fixes that!

Yep, this PR includes changes to _getucname() that account for this, and do fix the issue. Against b2c4e92:

$ ./python 
Python 3.13.0a0 (remotes/wismill/wip/tangut-ideographs:b2c4e92767, Jul 26 2023, 16:57:00) [GCC 9.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import unicodedata
>>> unicodedata.name("\U000172fd")
'TANGUT IDEOGRAPH-172FD'

I think 172FD is a typo in your sample, though? The given bytes look like they encode U+17043 instead, and under this PR unicodedata.name() reports it that way.
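The mismatch can be checked without a patched interpreter, since it only involves UTF-8 decoding. This is a minimal sketch; the variable names are illustrative only.

```python
# Bytes taken from the sample above.
raw = b"\xf0\x97\x81\x83"
decoded = raw.decode("utf-8")

# Format the decoded code point the way Unicode writes it.
code_point_label = f"U+{ord(decoded):04X}"
print(code_point_label)  # U+17043

# For comparison, the UTF-8 encoding of the code point named in the sample.
intended_bytes = "\U000172fd".encode("utf-8")
print(intended_bytes)  # b'\xf0\x97\x8b\xbd'
```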

wismill · 2023-07-27T07:51:11Z

> If it wasn't already known, unicodedata.name(...) throws a ValueError for these ideographs.

@rcgale I find that calling unicodedata.name(c, None) and testing the result for None is a better pattern than catching the ValueError.
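A minimal sketch of that pattern, assuming a hypothetical `describe` helper: the second argument to `unicodedata.name()` is returned instead of raising when the character has no name.

```python
import unicodedata


def describe(ch: str) -> str:
    """Hypothetical helper: label a character, tolerating missing names."""
    # Passing a default avoids the ValueError for unnamed code points.
    name = unicodedata.name(ch, None)
    if name is None:
        return f"U+{ord(ch):04X} <no name>"
    return f"U+{ord(ch):04X} {name}"


print(describe("A"))
print(describe("\ue000"))  # private-use code point, which has no name
```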

> I think 172FD is a typo in your sample, though? The given bytes look like they encode U+17043 instead

@SnoopJ I confirm that, good catch!

rcgale · 2023-08-05T01:13:18Z

> Yep, this PR includes changes to _getucname() that account for this, and do fix the issue. Against b2c4e92:

Thanks, I appreciate the confirmation! Nice to know the bug has been identified, and that there's already a fix implemented. I hope it can be released soon!

> I think 172FD is a typo in your sample, though? The given bytes look like they encode U+17043 instead

Yes, it was a typo. I was encountering the same problem with several Tangut characters, and I probably copy/pasted the output from a different example.

wismill · 2024-07-05T11:28:51Z

Really disappointed by the lack of interest of the Python devs after such a long time.

rogerbinns · 2024-08-18T15:55:37Z

It wasn't clear to me if this PR was merged or superseded. It has not been merged and Tangut codepoints still give errors. They were added in Unicode 9.0 in 2016.

Python 3.13.0rc1 (main, Aug 18 2024, 07:50:03) [GCC 13.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import unicodedata
>>> print(unicodedata.name("\U00017123"))
Traceback (most recent call last):
  File "<python-input-1>", line 1, in <module>
    print(unicodedata.name("\U00017123"))
          ~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^
ValueError: no such name
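On interpreters without the fix, downstream code can synthesize the name itself, since the names follow the `TANGUT IDEOGRAPH-<hex>` form shown earlier in this thread. The sketch below is one possible workaround, not code from the PR: `name_with_tangut_fallback` is a hypothetical helper, and the range constants are an assumption based on the Tangut ideograph range as first assigned in Unicode 9.0 (later Unicode versions extend it).

```python
import unicodedata

# Assumption: Tangut ideograph range as first assigned in Unicode 9.0.
# Later Unicode versions add code points, so adjust to the version you target.
TANGUT_FIRST = 0x17000
TANGUT_LAST = 0x187EC


def name_with_tangut_fallback(ch: str) -> str:
    """Hypothetical helper: unicodedata.name() with a Tangut fallback."""
    name = unicodedata.name(ch, None)
    if name is not None:
        # The interpreter already knows the name (for example, once fixed).
        return name
    cp = ord(ch)
    if TANGUT_FIRST <= cp <= TANGUT_LAST:
        return f"TANGUT IDEOGRAPH-{cp:04X}"
    # Keep the standard behaviour for every other unnamed code point.
    raise ValueError("no such name")


print(name_with_tangut_fallback("\U00017123"))
```

Because the helper asks `unicodedata` first, it should keep working unchanged on an interpreter that already includes the fix.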

wismill · 2024-08-19T09:01:32Z

@rogerbinns feel free to revive this PR.

serhiy-storchaka · 2026-02-13T16:41:59Z

Revived in #144789.

serhiy-storchaka · 2026-02-13T16:48:02Z

I apologize for not getting a review of this PR in time, @wismill. There are not many core developers, especially experts in Unicode. I am now determined to see the matter through to the end. Your PR was generally good; it only needed some updates because the surrounding code changed.

wismill · 2026-02-16T12:22:46Z

@serhiy-storchaka thanks for reviving this!

bedevere-bot added the awaiting review label Feb 5, 2023

bedevere-bot mentioned this pull request Feb 5, 2023: Bugs and inconsistencies in unicodedata #80667 (Open)

arhadthedev added the extension-modules (C modules in the Modules dir) and topic-unicode labels Feb 5, 2023

SnoopJ reviewed Jul 26, 2023


wismill added 3 commits July 26, 2023 06:36

unicodedata: Fix Tangut Ideograph names

0609eb2

News entry

3821e1b

Add test

b2c4e92

wismill force-pushed the wip/tangut-ideographs branch from 3657122 to b2c4e92 Compare July 26, 2023 04:43

SnoopJ mentioned this pull request Jul 26, 2023: gh-80667: fix case-sensitivity of some unicode literal escapes #107281 (Merged)

serhiy-storchaka requested a review from ezio-melotti February 1, 2024 12:02

wismill closed this Jul 5, 2024

serhiy-storchaka mentioned this pull request Feb 13, 2026: gh-80667: Fix Tangut ideographs names in unicodedata #144789 (Merged)


7 participants
