Need help with Python script [unicode normalization]

praj · February 28, 2024, 8:38am

Hi Nick,

Great app you’ve made! I’ve been using it daily for about 6 months with the extensions available on your site, and I was really excited to discover recently that I can make my own extensions.

I’m having trouble with my script’s output; can you please help me identify the issue? I have a feeling PopClip is having some trouble with Unicode characters (Telugu in this case).

So, I’ve developed a simple python script to transliterate text from Roman (IAST) form to Telugu, given here:

from aksharamukha import transliterate

# Define the fixed source and target languages
source_language = "IAST"
target_language = "Telugu"

# Get the user-chosen text to be transliterated
text_to_transliterate = "Kaliyugadalli - Jañjhūṭi"

transliterated_text = transliterate.process(source_language, target_language, text_to_transliterate)

print(transliterated_text, end='')

Its output is కలియుగదల్లి - జంఝూటి, as expected.

Here’s my YAML-Python extension code, which I hacked into place based on the documentation and some examples from the forum:

#!/opt/homebrew/Caskroom/miniconda/base/bin/python3
# #popclip
# name: Akṣaramukha I2T
# icon: a2అ
# after: paste-result
from aksharamukha import transliterate
import os

# Define the fixed source and target languages
source_language = "IAST"
target_language = "Telugu"

# Get the user-chosen text to be transliterated
text_to_transliterate = os.environ['POPCLIP_TEXT']

# Do the transliteration
transliterated_text = transliterate.process(source_language, target_language, text_to_transliterate)

# Print the transliterated text
print(transliterated_text, end='')

The output I get from this PopClip extension is కలియుగదల్లి - జన్ంఝూత్̣ఇ.

So, looking at them side by side to see the difference:

Expected	Received
కలియుగదల్లి - జంఝూటి	కలియుగదల్లి - జన్ంఝూత్̣ఇ

Please let me know if I’ve gone wrong somewhere, thank you!

nick · February 28, 2024, 12:53pm

I’ve run your script, both with python3 directly and within PopClip, and I see a similar result to you.

I’m sure it must be something to do with encoding, but I will need to dig deeper to figure out what.

edit: posting for future reference

INPUT
Kaliyugadalli - Jañjhūṭi

EXPECTED
కలియుగదల్లి - జంఝూటి
E0B095E0 B0B2E0B0 BFE0B0AF E0B181E0 B097E0B0 A6E0B0B2 E0B18DE0 B0B2E0B0 BF202D20 E0B09CE0 B082E0B0 9DE0B182 E0B09FE0 B0BF

ACTUAL
కలియుగదల్లి - జన్ంఝూత్̣ఇ
E0B095E0 B0B2E0B0 BFE0B0AF E0B181E0 B097E0B0 A6E0B0B2 E0B18DE0 B0B2E0B0 BF202D20 E0B09CE0 B0A8E0B1 8DE0B082 E0B09DE0 B182E0B0 A4E0B18D CCA3E0B0 87
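
For anyone who wants to read dumps like the ones above, here is a minimal sketch that decodes a UTF-8 hex dump and lists each code point with its Unicode name. The helper name `describe_hex_dump` is made up for illustration, and the sample is just the last few bytes of the ACTUAL dump.

```python
import unicodedata


def describe_hex_dump(hex_dump):
    """Decode a UTF-8 hex dump and describe each code point (hypothetical helper)."""
    # Drop the grouping spaces before converting the hex digits to bytes.
    raw = bytes.fromhex("".join(hex_dump.split()))
    text = raw.decode("utf-8")
    return [
        "U+{:04X} {}".format(ord(ch), unicodedata.name(ch, "<unnamed>"))
        for ch in text
    ]


# Tail of the ACTUAL dump (the bytes after the long vowel sign).
tail = "E0B0A4E0 B18DCCA3 E0B087"
for line in describe_hex_dump(tail):
    print(line)
```

Run on that tail, the listing includes U+0323 COMBINING DOT BELOW, a Latin-side combining mark sitting in the middle of the Telugu output.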

praj · February 28, 2024, 1:55pm

Hi Nick,

Appreciate your quick response on this issue!
I took some time to determine that it might be a PopClip issue, after making sure it wasn’t a bug in the Akṣaramukha library.
The problem is probably a general encoding issue, like you said, and I can give you more strings if it helps in testing. I observed the issue with many such cases.

Thank you,
Best,
Pramod

nick · February 28, 2024, 4:15pm

Yes, I don’t think there is anything wrong with the Python code. I’ll do a deeper investigation. I would like to get to the bottom of this.

nick · March 13, 2024, 12:13pm

Figured it out (with a lot of help from ChatGPT).

It has to do with the Unicode normalization of the input text. It seems the aksharamukha transliterate library is sensitive to normalization and expects input encoded in normalization form “C”.

When run in the terminal, the text arrives in normalization form “C”, but when run in PopClip the input comes in normalization form “D”.

(As an aside, PopClip is transparent to normalization, so it’s just passing on what it receives. It doesn’t do any explicit normalization.)

e.g. string "Jañjhūṭi", UTF-8 encoded
NFC (Normalization Form C): 4a61c3b16a68c5abe1b9ad69
NFD (Normalization Form D): 4a616ecc836a6875cc8474cca369
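
The two byte sequences above can be reproduced with the standard `unicodedata` module. This is a small sketch; the string is written with `\u` escapes so the result does not depend on how the script file itself happens to be saved.

```python
import unicodedata

# "Jañjhūṭi" written with escapes (ñ = U+00F1, ū = U+016B, ṭ = U+1E6D).
word = "Ja\u00f1jh\u016b\u1e6di"

nfc = unicodedata.normalize("NFC", word)  # precomposed letters
nfd = unicodedata.normalize("NFD", word)  # base letters + combining marks

print(nfc.encode("utf-8").hex())  # 4a61c3b16a68c5abe1b9ad69
print(nfd.encode("utf-8").hex())  # 4a616ecc836a6875cc8474cca369
print(len(nfc), len(nfd))         # 8 11
print(nfc == nfd)                 # False
```

The two forms look identical on screen but are different strings, so any library that matches on the precomposed letters will not recognise the decomposed ones.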

We can fix this in the script by normalizing the input using the unicodedata module as follows:

#!/usr/bin/env python3
# #popclip
# name: Akṣaramukha I2T
# icon: a2అ
# after: paste-result
from aksharamukha import transliterate
import os
import unicodedata

# Define the fixed source and target languages
source_language = "IAST"
target_language = "Telugu"

# Get the user-chosen text to be transliterated
text_to_transliterate = os.environ['POPCLIP_TEXT']
text_to_transliterate = unicodedata.normalize('NFC', text_to_transliterate)

# Do the transliteration
transliterated_text = transliterate.process(source_language, target_language, text_to_transliterate)

# Print the transliterated text
print(transliterated_text, end='')

This seems to resolve this issue for the example string you provided.
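
If you want to check other test strings, something like the following sketch reports whether a string is already in form C and which combining marks it contains. The function name `normalization_report` is just an illustration, not part of any library.

```python
import unicodedata


def normalization_report(text):
    """Return (is_nfc, names of combining marks found) for a string.

    Hypothetical helper for debugging; it only inspects the text.
    """
    is_nfc = unicodedata.normalize("NFC", text) == text
    marks = [
        unicodedata.name(ch, "<unnamed>")
        for ch in text
        if unicodedata.combining(ch)  # non-zero class means a combining mark
    ]
    return is_nfc, marks


# Decomposed (form D) version of "Jañjhūṭi", as it would arrive in this case.
decomposed = "Jan\u0303jhu\u0304t\u0323i"
print(normalization_report(decomposed))

# After normalizing, the marks are folded into precomposed letters.
print(normalization_report(unicodedata.normalize("NFC", decomposed)))
```

While debugging an extension, printing this report for the incoming text is a quick way to see whether the input is decomposed before it reaches the transliteration step.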

praj · March 14, 2024, 1:16am

Hi Nick,

Thank you for your help fixing this code and for sharing the chat with the AI (looks like it’s pretty good!).

I will test it in practice and let you know how it goes.

Best,
Pramod