Skip to content

gh-80667: Fix Tangut ideographs names in unicodedata#101585

Closed
wismill wants to merge 3 commits intopython:mainfrom
wismill:wip/tangut-ideographs
Closed

gh-80667: Fix Tangut ideographs names in unicodedata#101585
wismill wants to merge 3 commits intopython:mainfrom
wismill:wip/tangut-ideographs

Conversation

@wismill
Copy link
Contributor

@wismill wismill commented Feb 5, 2023

Fix Tangut ideographs names in unicodedata.

Partially fixes #80667.

@ghost
Copy link

ghost commented Feb 5, 2023

All commit authors signed the Contributor License Agreement.
CLA signed

@arhadthedev arhadthedev added extension-modules C modules in the Modules dir topic-unicode labels Feb 5, 2023
@arhadthedev
Copy link
Member

@malemburg, @ezio-melotti as Unicode experts.

@wismill
Copy link
Contributor Author

wismill commented Feb 22, 2023

@malemburg @ezio-melotti kind reminder

@wismill
Copy link
Contributor Author

wismill commented Mar 24, 2023

@malemburg @ezio-melotti @arhadthedev kind reminder

@arhadthedev
Copy link
Member

Another attempt: @corona10 as a developer who worked with the unicode part of Python via Asian scripts.

Copy link
Contributor

@SnoopJ SnoopJ left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not have review authority on CPython, but since this PR has been sitting for a few months, I thought I'd chime in to say it looks good to me in terms of functionality. I just discovered this PR because I just wrote some downstream code to work around the lack of support for this range.

The additional test looks a little complicated to me, it may be duplicating functionality from makeunicodedata.py, but the code added to the runtime is pretty much following the lead of the CJK stuff that already exists.

Comment on lines +1295 to +1298
while (namelen--) {
v *= 16;
if (*name >= '0' && *name <= '9')
v += *name - '0';
else if (*name >= 'A' && *name <= 'F')
v += *name - 'A' + 10;
else
return 0;
name++;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a little unsettled that this loop is duplicated from above, but I don't see a better way to do it aside from maybe some preprocessor abuse.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I agree. But since this MR has received very little attention, I am not going to dedicate time to this if there is no opportunity to merge it.

@wismill wismill force-pushed the wip/tangut-ideographs branch from 3657122 to b2c4e92 Compare July 26, 2023 04:43
@wismill
Copy link
Contributor Author

wismill commented Jul 26, 2023

@SnoopJ thank you for your review, I resolved 2 items.

I am quite demotivated because no CPython reviewer has been able to review this in close to half a year.

@malemburg @ezio-melotti @arhadthedev @corona10 kind reminder 🙏

@rcgale
Copy link

rcgale commented Jul 26, 2023

If it wasn't already known, unicodedata.name(...) throws a ValueError for these ideographs. Seems like a worthwhile effort to merge this PR if it fixes that!

Example (tried in Python 3.10.8):

import unicodedata

try:
    tangut_ideograph_172fd = b"\xf0\x97\x81\x83".decode("utf8")
    unicodedata.name(tangut_ideograph_172fd)
except ValueError as e:
    raise ValueError(f"Couldn't handle Tangut ideograph {tangut_ideograph_172fd}") from e

Output:

Traceback (most recent call last):
  File "<input>", line 5, in <module>
ValueError: no such name

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Users/robert/miniforge3/envs/wikigit/lib/python3.10/code.py", line 90, in runcode
    exec(code, self.locals)
  File "<input>", line 7, in <module>
ValueError: Couldn't handle Tangut ideograph 𗁃
 

@SnoopJ
Copy link
Contributor

SnoopJ commented Jul 26, 2023

If it wasn't already known, unicodedata.name(...) throws a ValueError for these ideographs. Seems like a worthwhile effort to merge this PR if it fixes that!

Yep, this PR includes changes to _getucname() that account for this, and do fix the issue. Against b2c4e92:

$ ./python 
Python 3.13.0a0 (remotes/wismill/wip/tangut-ideographs:b2c4e92767, Jul 26 2023, 16:57:00) [GCC 9.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import unicodedata
>>> unicodedata.name("\U000172fd")
'TANGUT IDEOGRAPH-172FD'

I think 172FD is a typo in your sample, though? The given bytes look like they encode U+17043 instead, and under this PR unicodedata.name() reports it that way.

@wismill
Copy link
Contributor Author

wismill commented Jul 27, 2023

If it wasn't already known, unicodedata.name(...) throws a ValueError for these ideographs.

@rcgale I find that unicodedata.name(c, None) with a test is a better pattern.

I think 172FD is a typo in your sample, though? The given bytes look like they encode U+17043 instead

@SnoopJ I confirm that, good catch!

@rcgale
Copy link

rcgale commented Aug 5, 2023

Yep, this PR includes changes to _getucname() that account for this, and do fix the issue. Against b2c4e92:

Thanks, I appreciate the confirmation! Nice to know the bug has been identified, and that there's already a fix implemented. I hope it can be released soon!

I think 172FD is a typo in your sample, though? The given bytes look like they encode U+17043 instead

Yes, it was a typo. I was encountering the same problem with several Tangut characters, and I probably copy/pasted the output from a different example.

@wismill
Copy link
Contributor Author

wismill commented Jul 5, 2024

Really disappointed by the lack of interest of the Python devs after such a long time.

@wismill wismill closed this Jul 5, 2024
@rogerbinns
Copy link

It wasn't clear to me if this PR was merged or superseded. It has not been merged and Tangut codepoints still give errors. They were added in Unicode 9.0 in 2016.

Python 3.13.0rc1 (main, Aug 18 2024, 07:50:03) [GCC 13.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import unicodedata
>>> print(unicodedata.name("\U00017123"))
Traceback (most recent call last):
  File "<python-input-1>", line 1, in <module>
    print(unicodedata.name("\U00017123"))
          ~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^
ValueError: no such name

@wismill
Copy link
Contributor Author

wismill commented Aug 19, 2024

@rogerbinns feel free to revive this PR.

@serhiy-storchaka
Copy link
Member

Revived in #144789.

@serhiy-storchaka
Copy link
Member

I apologize for not getting a review of this PR in time, @wismill. There are not many core developers, especially experts in Unicode. I am now determined to see the matter through to the end. Your PR was generally good, it only needed some updates because the surrounding code changed.

@wismill
Copy link
Contributor Author

wismill commented Feb 16, 2026

@serhiy-storchaka thanks for reviving this!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Bugs and inconsistencies in unicodedata

7 participants