Great app you’ve made! I’ve been using it daily for about 6 months with the extensions available on your site, and I was really excited to discover recently that I can make my own extensions.
I’m having trouble with my script’s output; can you please help me identify the issue? I have a feeling PopClip is having some trouble with Unicode characters (Telugu, in this case).
So, I’ve developed a simple Python script to transliterate text from Roman (IAST) form to Telugu, given here:
from aksharamukha import transliterate
# Define the fixed source and target languages
source_language = "IAST"
target_language = "Telugu"
# Get the user-chosen text to be transliterated
text_to_transliterate = "Kaliyugadalli - Jañjhūṭi"
transliterated_text = transliterate.process(source_language, target_language, text_to_transliterate)
print(transliterated_text, end='')
Its output is కలియుగదల్లి - జంఝూటి, as expected.
Here’s my YAML-Python extension code, which I hacked into place based on the documentation and some examples from the forum:
#!/opt/homebrew/Caskroom/miniconda/base/bin/python3
# #popclip
# name: Akṣaramukha I2T
# icon: a2అ
# after: paste-result
from aksharamukha import transliterate
import os
# Define the fixed source and target languages
source_language = "IAST"
target_language = "Telugu"
# Get the user-chosen text to be transliterated
text_to_transliterate = os.environ['POPCLIP_TEXT']
# Do the transliteration
transliterated_text = transliterate.process(source_language, target_language, text_to_transliterate)
# Print the transliterated text
print(transliterated_text, end='')
The output I get from this PopClip extension is కలియుగదల్లి - జన్ంఝూత్̣ఇ.
So, looking at them side by side to see the difference:
Expected: కలియుగదల్లి - జంఝూటి
Received: కలియుగదల్లి - జన్ంఝూత్̣ఇ
Please let me know if I’ve gone wrong somewhere, thank you!
Appreciate your quick response on this issue!
I took some time to determine that it might be a PopClip issue, after making sure it wasn’t a bug in the Akṣaramukha library.
The problem is probably a general encoding issue, like you said, and I can give you more strings if it helps in testing. I observed the issue with many such cases.
It has to do with the Unicode normalization of the input text. It seems the aksharamukha transliterate library is sensitive to normalization and expects input in normalization form “C” (NFC, where a letter and its diacritic are a single precomposed code point).
When run in the terminal, the text arrives in normalization form “C”, but when run in PopClip the input comes in normalization form “D” (NFD, where the base letter and each combining mark are separate code points).
(As an aside, PopClip is transparent to normalization, so it’s just passing on what it receives. It doesn’t do any explicit normalization.)
e.g. string "Jañjhūṭi", UTF-8 encoded
NFC (Normalization Form C): 4a61c3b16a68c5abe1b9ad69
NFD (Normalization Form D): 4a616ecc836a6875cc8474cca369
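If you want to reproduce those byte sequences yourself, here is a minimal sketch using only the standard library. The helper name `utf8_hex` is just something made up for illustration, and `unicodedata.is_normalized` needs Python 3.8 or later.

```python
import unicodedata


def utf8_hex(text: str, form: str) -> str:
    """Normalize `text` to the given form and return its UTF-8 bytes as hex.

    `utf8_hex` is a hypothetical helper for this example, not part of any library.
    """
    return unicodedata.normalize(form, text).encode("utf-8").hex()


sample = "Jañjhūṭi"

nfc_hex = utf8_hex(sample, "NFC")
nfd_hex = utf8_hex(sample, "NFD")
print("NFC:", nfc_hex)
print("NFD:", nfd_hex)

# Check which form a given string is in (Python 3.8+).
decomposed = unicodedata.normalize("NFD", sample)
decomposed_is_nfc = unicodedata.is_normalized("NFC", decomposed)
print("Decomposed string is NFC:", decomposed_is_nfc)
```

Printing `unicodedata.is_normalized('NFC', os.environ['POPCLIP_TEXT'])` from inside an extension would be one way to see which form the selected text arrives in for other test strings.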
We can fix this in the script by normalizing the input using the unicodedata module as follows:
#!/usr/bin/env python3
# #popclip
# name: Akṣaramukha I2T
# icon: a2అ
# after: paste-result
from aksharamukha import transliterate
import os
import unicodedata
# Define the fixed source and target languages
source_language = "IAST"
target_language = "Telugu"
# Get the user-chosen text to be transliterated
text_to_transliterate = os.environ['POPCLIP_TEXT']
text_to_transliterate = unicodedata.normalize('NFC', text_to_transliterate)
# Do the transliteration
transliterated_text = transliterate.process(source_language, target_language, text_to_transliterate)
# Print the transliterated text
print(transliterated_text, end='')
This seems to resolve this issue for the example string you provided.