Encoding Arabic letters with their diacritics (if they exist)
map does not seem to be the right tool for the job. You do not want to map characters to other characters, but group them together. Instead, you might try reduce (or functools.reduce in Python 3). Here, I use isalpha to test what kind of character it is; you might need something else.
>>> from functools import reduce  # not needed on Python 2, where reduce is a builtin
>>> is_diacritic = lambda x: not x.isalpha()
>>> verse = "XXA)L_I!I%M<LLL>MMQ*Q"
>>> reduce(lambda lst, x: lst + [x] if not is_diacritic(x) else lst[:-1] + [lst[-1]+x], verse, [])
['X', 'X', 'A)', 'L_', 'I!', 'I%', 'M<', 'L', 'L', 'L>', 'M', 'M', 'Q*', 'Q']
However, this is barely readable and also creates lots of intermediate lists. Better just use a boring old for loop, even if you explicitly asked for something else:
res = []
for x in verse:
    if not is_diacritic(x):
        res.append(x)
    else:
        res[-1] += x
By iterating pairs of consecutive characters, e.g. using zip(verse, verse[1:]) (i.e. (1,2), (2,3), ..., not (1,2), (3,4), ...), you could indeed also use a list comprehension, but I'd still vote for the for loop for readability.
>>> from itertools import zip_longest
>>> [x + y if is_diacritic(y) else x
... for x, y in zip_longest(verse, verse[1:], fillvalue="")
... if not is_diacritic(x)]
['X', 'X', 'A)', 'L_', 'I!', 'I%', 'M<', 'L', 'L', 'L>', 'M', 'M', 'Q*', 'Q']
You could even do the same using map and lambda, but you'd also need to filter first, with another lambda, making the whole thing orders of magnitude uglier and harder to read.
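Just to illustrate the point (this sketch is mine, not part of the original suggestion), a map/filter version over character indices could look like this, and it is indeed much harder to follow than the loop:
>>> list(map(lambda i: verse[i] + (verse[i + 1] if i + 1 < len(verse) and is_diacritic(verse[i + 1]) else ''),
...          filter(lambda i: not is_diacritic(verse[i]), range(len(verse)))))
['X', 'X', 'A)', 'L_', 'I!', 'I%', 'M<', 'L', 'L', 'L>', 'M', 'M', 'Q*', 'Q']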
I'm going to throw my hat into the ring with numpy here. You can convert a string into a usable format with
arr = np.array([verse]).view(np.uint32)
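For instance (a quick illustration of my own, assuming a typical little-endian build where 'U' strings are stored as UCS-4), the view exposes the code points directly:
>>> np.array(['abc']).view(np.uint32)
array([97, 98, 99], dtype=uint32)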
You can build a mask of the locations where the following character is a diacritic:
mask = np.empty(arr.shape, dtype=bool)
np.bitwise_and((arr[1:] > lower), (arr[1:] < upper), out=mask[:-1])
mask[-1] = False
Here, the range [lower, upper] is a made-up way of checking for diacritics. Implement the actual check however you like. In this example, I used the full-blown form of bitwise_and with empty to avoid a potentially expensive append of the last element.
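If the diacritics you care about are the eight marks U+064B through U+0652 used later in this answer, the bounds could be as simple as this (my assumption, not part of the original snippet):
lower, upper = 0x064A, 0x0653  # (arr > lower) & (arr < upper) is True exactly for 0x064B-0x0652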
Now if you have a numerical method for encoding your code points to a number, which I'm sure you can vectorize, you can do something like:
combined = combine(letters=arr[mask], diacritics=arr[1:][mask[:-1]])
To get the remaining, uncombined characters, you would have to remove both the diacritics and the characters they bind to. The easiest way I can think of doing this is smearing the mask to the right and negating it. Again, I assume that you have a vectorized method to encode the single characters as well:
smeared = mask.copy()
smeared[1:] |= mask[:-1]
single = encode(arr[~smeared])
Combining the result into a final array is conceptually simple but takes a couple of steps. The result will be np.count_nonzero(mask) elements shorter than the input, since the diacritics are being removed. We need to shift each masked index down by the number of diacritics that were removed before it. Here's one way to do it:
ind = np.flatnonzero(mask)
nnz = ind.size
ind -= np.arange(nnz)
output = np.empty(arr.size - nnz, dtype='U1')
output[ind] = combined
# mask of unmodified elements
out_mask = np.ones(output.size, dtype=bool)
out_mask[ind] = False
output[out_mask] = single
The reason I'm suggesting numpy is that it should be able to handle a few million characters in a matter of seconds this way. Getting the output back as a string should be straightforward.
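For example (my own sketch, assuming output ends up as a contiguous array of single-character strings), either of these would do:
text = ''.join(output)
# or view the whole array as one long string:
text = output.view(f'U{output.size}').item()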
Suggested Implementation
I have been pondering your question and decided to play with some timings and possible implementations. My idea was to map the unicode characters in 0x0621-0x063A, 0x0641-0x064A (26 + 10 = 36 letters) into the lower 6 bits of a uint16, and the characters 0x064B-0x0652 (8 diacritics) into the bits above that (as the values 1-8, shifted left by six), assuming these are in fact the only diacritics you need:
def encode_py(char):
    char = ord(char) - 0x0621
    if char >= 0x20:
        # second block of letters (0x0641 and up) is shifted down by 5
        char -= 5
    return char

def combine_py(char, diacritic):
    return encode_py(char) | ((ord(diacritic) - 0x064A) << 6)
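As a quick sanity check of the packing (my own example, using BEH U+0628 and FATHA U+064E, which fall in the ranges above):
>>> encode_py('\u0628')             # BEH alone -> 0x0628 - 0x0621 = 7
7
>>> combine_py('\u0628', '\u064E')  # BEH + FATHA -> 7 | ((0x064E - 0x064A) << 6) = 7 | 256
263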
In numpy terms:
def encode_numpy(chars):
    chars = chars - 0x0621
    return np.subtract(chars, 5, where=chars >= 0x20, out=chars)

def combine_numpy(chars, diacritics):
    chars = encode_numpy(chars)
    chars |= (diacritics - 0x064A) << 6
    return chars
You can choose to encode further to shorten the representation slightly, but I would not recommend it. This representation has the advantage of being verse-independent, so you can compare portions of different verses, as well as not worry about which representation you're going to get depending on how many verses you encoded together. You can even mask out the top bits of all the codes to compare the raw characters, without diacritics.
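For example, comparing two encoded arrays letter-by-letter while ignoring diacritics would just mean masking off everything above the low six bits (a sketch of my own; encoded_a and encoded_b stand for hypothetical outputs of the conversion below):
raw_a = encoded_a & 0x3F  # keep only the 6 letter bits
raw_b = encoded_b & 0x3F
same_letters = np.array_equal(raw_a, raw_b)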
So let's say that your verse is a collection of randomly generated characters in those ranges, with each letter followed by at most one randomly generated diacritic. We can generate a string of around a million characters pretty easily for comparative purposes:
import random

random.seed(0xB00B5)
alphabet = list(range(0x0621, 0x063B)) + list(range(0x0641, 0x064B))
diacritics = list(range(0x064B, 0x0653))
alphabet = [chr(x) for x in alphabet]
diacritics = [chr(x) for x in diacritics]

def sample(n=1000000, d=0.25):
    while n:
        yield random.choice(alphabet)
        n -= 1
        if n and random.random() < d:
            yield random.choice(diacritics)
            n -= 1

data = ''.join(sample())
This data has completely randomly distributed characters, with approximately a 25% chance of any character being followed by a diacritic. It takes just a few seconds to generate on my not-too-overpowered laptop.
The numpy conversion would look like this:
def convert_numpy(verse):
    arr = np.array([verse]).view(np.uint32)
    mask = np.empty(arr.shape, dtype=bool)
    mask[:-1] = (arr[1:] >= 0x064B)
    mask[-1] = False

    combined = combine_numpy(chars=arr[mask], diacritics=arr[1:][mask[:-1]])

    smeared = mask.copy()
    smeared[1:] |= mask[:-1]
    single = encode_numpy(arr[~smeared])

    ind = np.flatnonzero(mask)
    nnz = ind.size
    ind -= np.arange(nnz)

    output = np.empty(arr.size - nnz, dtype=np.uint16)
    output[ind] = combined

    # mask of unmodified elements
    out_mask = np.ones(output.size, dtype=bool)
    out_mask[ind] = False
    output[out_mask] = single

    return output
Benchmarks
And now let's %timeit to see how it goes. First, here are the other implementations. I convert everything to a numpy array or a list of integers for fair comparison. I've also made minor modifications to make the functions return lists of the same quantities to validate accuracy:
from itertools import tee, zip_longest
from functools import reduce

def is_diacritic(c):
    return ord(c) >= 0x064B

def pairwise(iterable, fillvalue):
    """ Slightly modified itertools pairwise recipe
    s -> (s0,s1), (s1,s2), (s2, s3), ...
    """
    a, b = tee(iterable)
    next(b, None)
    return zip_longest(a, b, fillvalue=fillvalue)

def combine_py2(char, diacritic):
    return char | ((ord(diacritic) - 0x064A) << 6)

def convert_FHTMitchell(verse):
    def convert(verse):
        was_diacritic = False  # variable to keep track of diacritics -- stops us checking same character twice
        # fillvalue will not be encoded but ensures last char is read
        for this_char, next_char in pairwise(verse, fillvalue='-'):
            if was_diacritic:  # last next_char (so this_char) is diacritic
                was_diacritic = False
            elif is_diacritic(next_char):
                yield combine_py(this_char, next_char)
                was_diacritic = True
            else:
                yield encode_py(this_char)
    return list(convert(verse))

def convert_tobias_k_1(verse):
    return reduce(lambda lst, x: lst + [encode_py(x)] if not is_diacritic(x) else lst[:-1] + [combine_py2(lst[-1], x)], verse, [])

def convert_tobias_k_2(verse):
    res = []
    for x in verse:
        if not is_diacritic(x):
            res.append(encode_py(x))
        else:
            res[-1] = combine_py2(res[-1], x)
    return res

def convert_tobias_k_3(verse):
    return [combine_py(x, y) if y and is_diacritic(y) else encode_py(x) for x, y in zip_longest(verse, verse[1:], fillvalue="") if not is_diacritic(x)]
Now for the timings:
%timeit result_FHTMitchell = convert_FHTMitchell(data)
338 ms ± 5.09 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit result_tobias_k_1 = convert_tobias_k_1(data)
Aborted, took > 5min to run. Appears to scale quadratically with input size: not OK!
%timeit result_tobias_k_2 = convert_tobias_k_2(data)
357 ms ± 4.94 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit result_tobias_k_3 = convert_tobias_k_3(data)
466 ms ± 4.62 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit result_numpy = convert_numpy(data)
30.2 ms ± 162 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
A comparison of the resulting arrays/lists shows that they are equal as well:
np.array_equal(result_FHTMitchell, result_tobias_k_2) # True
np.array_equal(result_tobias_k_2, result_tobias_k_3) # True
np.array_equal(result_tobias_k_3, result_numpy) # True
I'm using array_equal here because it performs all the necessary type conversions to verify the actual data.
So the moral of the story is that there are lots of ways to do this, and parsing a few million characters shouldn't be prohibitively expensive on its own, until you get into cross-referencing and other truly time-consuming tasks. The main thing to take from this is not to use reduce on lists, since you will be reallocating a lot more than you need to. Even a simple for loop will work fine for your purposes. Even though numpy is about ten times faster than the other implementations, it doesn't give a huge advantage.
Decoding
For the sake of completeness, here is a function to decode your results:
def decode(arr):
    mask = (arr > 0x3F)
    nnz = np.count_nonzero(mask)
    ind = np.flatnonzero(mask) + np.arange(nnz)

    diacritics = (arr[mask] >> 6) + 41  # 41 == 0x064A - 0x0621
    characters = (arr & 0x3F)
    characters[characters >= 27] += 5   # undo the shift of the second letter block

    output = np.empty(arr.size + nnz, dtype='U1').view(np.uint32)
    output[ind] = characters[mask]
    output[ind + 1] = diacritics

    output_mask = np.zeros(output.size, dtype=bool)
    output_mask[ind] = output_mask[ind + 1] = True
    output[~output_mask] = characters[~mask]

    output += 0x0621

    return output.base.view(f'U{output.size}').item()
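As a quick round-trip check (my addition, not from the original answer), decoding what convert_numpy produced should give back the sample string generated above:
encoded = convert_numpy(data)
assert decode(encoded) == data  # should hold with the encoding described above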
As a side note, the work I did here inspired this question: Converting numpy arrays of code points to and from strings