Encoding Arabic letters with their diacritics (if they exist)
map does not seem to be the right tool for the job. You do not want to map characters to other characters, but group them together. Instead, you might try reduce (or functools.reduce in Python 3). Here, I use isalpha to test what kind of character it is; you might need something else.
>>> from functools import reduce  # not needed on Python 2, where reduce is a builtin
>>> is_diacritic = lambda x: not x.isalpha()
>>> verse = "XXA)L_I!I%M<LLL>MMQ*Q"
>>> reduce(lambda lst, x: lst + [x] if not is_diacritic(x) else lst[:-1] + [lst[-1]+x], verse, [])
['X', 'X', 'A)', 'L_', 'I!', 'I%', 'M<', 'L', 'L', 'L>', 'M', 'M', 'Q*', 'Q']
However, this is barely readable and also creates lots of intermediate lists. Better just use a boring old for loop, even if you explicitly asked for something else:
res = []
for x in verse:
    if not is_diacritic(x):
        res.append(x)
    else:
        res[-1] += x
By iterating pairs of consecutive characters, e.g. using zip(verse, verse[1:]) (i.e. (1,2), (2,3), ..., not (1,2), (3,4), ...), you could indeed also use a list comprehension, but I'd still vote for the for loop for readability.
>>> from itertools import zip_longest
>>> [x + y if is_diacritic(y) else x
... for x, y in zip_longest(verse, verse[1:], fillvalue="")
... if not is_diacritic(x)]
['X', 'X', 'A)', 'L_', 'I!', 'I%', 'M<', 'L', 'L', 'L>', 'M', 'M', 'Q*', 'Q']
You could even do the same using map and lambda, but you'd also need to filter first, with another lambda, making the whole thing orders of magnitude uglier and harder to read.
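Just to illustrate the point (this sketch is mine, not part of the original suggestion), a map/filter version over character indices could look like this, and it is indeed much harder to follow than the loop:
>>> list(map(lambda i: verse[i] + (verse[i + 1] if i + 1 < len(verse) and is_diacritic(verse[i + 1]) else ''),
...          filter(lambda i: not is_diacritic(verse[i]), range(len(verse)))))
['X', 'X', 'A)', 'L_', 'I!', 'I%', 'M<', 'L', 'L', 'L>', 'M', 'M', 'Q*', 'Q']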
I'm going to throw my hat into the ring with numpy here. You can convert a string into a usable format with
arr = np.array([verse]).view(np.uint32)
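For instance (a quick illustration of my own, assuming a typical little-endian build where 'U' strings are stored as UCS-4), the view exposes the code points directly:
>>> np.array(['abc']).view(np.uint32)
array([97, 98, 99], dtype=uint32)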
You can build a mask of the locations where the following character is a diacritic:
mask = np.empty(arr.shape, dtype=bool)
np.bitwise_and((arr[1:] > lower), (arr[1:] < upper), out=mask[:-1])
mask[-1] = False
Here, the range [lower, upper] is a made-up way of checking for diacritics. Implement the actual check however you like. In this example, I used the full-blown form of bitwise_and with empty to avoid a potentially expensive append of the last element.
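If the diacritics you care about are the eight marks U+064B through U+0652 used later in this answer, the bounds could be as simple as this (my assumption, not part of the original snippet):
lower, upper = 0x064A, 0x0653  # (arr > lower) & (arr < upper) is True exactly for 0x064B-0x0652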
Now if you have a numerical method for encoding your code points to a number, which I'm sure you can vectorize, you can do something like:
combined = combine(letters=arr[mask], diacritics=arr[1:][mask[:-1]])
To get the remaining, uncombined characters, you would have to remove both the diacritics and the characters they bind to. The easiest way I can think of doing this is smearing the mask to the right and negating it. Again, I assume that you have a vectorized method to encode the single characters as well:
smeared = mask.copy()
smeared[1:] |= mask[:-1]
single = encode(arr[~smeared])
Combining the result into a final array is conceptually simple but takes a couple of steps. The result will be np.count_nonzero(mask) elements shorter than the input, since the diacritics are being removed. We need to shift each masked index down by the number of diacritics that were removed before it. Here's one way to do it:
ind = np.flatnonzero(mask)
nnz = ind.size
ind -= np.arange(nnz)
output = np.empty(arr.size - nnz, dtype='U1')
output[ind] = combined
# mask of unmodified elements
out_mask = np.ones(output.size, dtype=bool)
out_mask[ind] = False
output[out_mask] = single
The reason I'm suggesting numpy is that it should be able to handle a few million characters in a matter of seconds this way. Getting the output back as a string should be straightforward.
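For example (my own sketch, assuming output ends up as a contiguous array of single-character strings), either of these would do:
text = ''.join(output)
# or view the whole array as one long string:
text = output.view(f'U{output.size}').item()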
Suggested Implementation
I have been pondering your question and decided to play with some timings and possible implementations. My idea was to map the unicode characters in 0x0621-0x063A, 0x0641-0x064A (26 + 10 = 36 letters) into the lower 6 bits of a uint16, and the characters 0x064B-0x0652 (8 diacritics) into the bits above that (as the values 1-8, shifted left by six), assuming these are in fact the only diacritics you need:
def encode_py(char):
    char = ord(char) - 0x0621
    if char >= 0x20:
        # second block of letters (0x0641 and up) is shifted down by 5
        char -= 5
    return char

def combine_py(char, diacritic):
    return encode_py(char) | ((ord(diacritic) - 0x064A) << 6)
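As a quick sanity check of the packing (my own example, using BEH U+0628 and FATHA U+064E, which fall in the ranges above):
>>> encode_py('\u0628')             # BEH alone -> 0x0628 - 0x0621 = 7
7
>>> combine_py('\u0628', '\u064E')  # BEH + FATHA -> 7 | ((0x064E - 0x064A) << 6) = 7 | 256
263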
In numpy terms:
def encode_numpy(chars):
    chars = chars - 0x0621
    return np.subtract(chars, 5, where=chars >= 0x20, out=chars)

def combine_numpy(chars, diacritics):
    chars = encode_numpy(chars)
    chars |= (diacritics - 0x064A) << 6
    return chars
You can choose to encode further to shorten the representation slightly, but I would not recommend it. This representation has the advantage of being verse-independent, so you can compare portions of different verses, as well as not worry about which representation you're going to get depending on how many verses you encoded together. You can even mask out the top bits of all the codes to compare the raw characters, without diacritics.
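For example, comparing two encoded arrays letter-by-letter while ignoring diacritics would just mean masking off everything above the low six bits (a sketch of my own; encoded_a and encoded_b stand for hypothetical outputs of the conversion below):
raw_a = encoded_a & 0x3F  # keep only the 6 letter bits
raw_b = encoded_b & 0x3F
same_letters = np.array_equal(raw_a, raw_b)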
So let's say that your verse is a collection of randomly generated characters in those ranges, with each letter followed by at most one randomly generated diacritic. We can generate a string of around a million characters pretty easily for comparative purposes:
import random

random.seed(0xB00B5)
alphabet = list(range(0x0621, 0x063B)) + list(range(0x0641, 0x064B))
diacritics = list(range(0x064B, 0x0653))
alphabet = [chr(x) for x in alphabet]
diacritics = [chr(x) for x in diacritics]

def sample(n=1000000, d=0.25):
    while n:
        yield random.choice(alphabet)
        n -= 1
        if n and random.random() < d:
            yield random.choice(diacritics)
            n -= 1

data = ''.join(sample())
This data has completely randomly distributed characters, with approximately a 25% chance of any character being followed by a diacritic. It takes just a few seconds to generate on my not-too-overpowered laptop.
The numpy conversion would look like this:
def convert_numpy(verse):
    arr = np.array([verse]).view(np.uint32)
    mask = np.empty(arr.shape, dtype=bool)
    mask[:-1] = (arr[1:] >= 0x064B)
    mask[-1] = False

    combined = combine_numpy(chars=arr[mask], diacritics=arr[1:][mask[:-1]])

    smeared = mask.copy()
    smeared[1:] |= mask[:-1]
    single = encode_numpy(arr[~smeared])

    ind = np.flatnonzero(mask)
    nnz = ind.size
    ind -= np.arange(nnz)

    output = np.empty(arr.size - nnz, dtype=np.uint16)
    output[ind] = combined

    # mask of unmodified elements
    out_mask = np.ones(output.size, dtype=bool)
    out_mask[ind] = False
    output[out_mask] = single

    return output
Benchmarks
And now let's %timeit to see how it goes. First, here are the other implementations. I convert everything to a numpy array or a list of integers for fair comparison. I've also made minor modifications to make the functions return lists of the same quantities to validate accuracy:
from itertools import tee, zip_longest
from functools import reduce

def is_diacritic(c):
    return ord(c) >= 0x064B

def pairwise(iterable, fillvalue):
    """ Slightly modified itertools pairwise recipe
    s -> (s0,s1), (s1,s2), (s2, s3), ...
    """
    a, b = tee(iterable)
    next(b, None)
    return zip_longest(a, b, fillvalue=fillvalue)

def combine_py2(char, diacritic):
    return char | ((ord(diacritic) - 0x064A) << 6)

def convert_FHTMitchell(verse):
    def convert(verse):
        was_diacritic = False  # variable to keep track of diacritics -- stops us checking same character twice
        # fillvalue will not be encoded but ensures last char is read
        for this_char, next_char in pairwise(verse, fillvalue='-'):
            if was_diacritic:  # last next_char (so this_char) is diacritic
                was_diacritic = False
            elif is_diacritic(next_char):
                yield combine_py(this_char, next_char)
                was_diacritic = True
            else:
                yield encode_py(this_char)
    return list(convert(verse))

def convert_tobias_k_1(verse):
    return reduce(lambda lst, x: lst + [encode_py(x)] if not is_diacritic(x) else lst[:-1] + [combine_py2(lst[-1], x)], verse, [])

def convert_tobias_k_2(verse):
    res = []
    for x in verse:
        if not is_diacritic(x):
            res.append(encode_py(x))
        else:
            res[-1] = combine_py2(res[-1], x)
    return res

def convert_tobias_k_3(verse):
    return [combine_py(x, y) if y and is_diacritic(y) else encode_py(x) for x, y in zip_longest(verse, verse[1:], fillvalue="") if not is_diacritic(x)]
Now for the timings:
%timeit result_FHTMitchell = convert_FHTMitchell(data)
338 ms ± 5.09 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit result_tobias_k_1 = convert_tobias_k_1(data)
Aborted, took > 5min to run. Appears to scale quadratically with input size: not OK!
%timeit result_tobias_k_2 = convert_tobias_k_2(data)
357 ms ± 4.94 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit result_tobias_k_3 = convert_tobias_k_3(data)
466 ms ± 4.62 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit result_numpy = convert_numpy(data)
30.2 ms ± 162 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
A comparison of the resulting arrays/lists shows that they are equal as well:
np.array_equal(result_FHTMitchell, result_tobias_k_2) # True
np.array_equal(result_tobias_k_2, result_tobias_k_3) # True
np.array_equal(result_tobias_k_3, result_numpy) # True
I'm using array_equal here because it performs all the necessary type conversions to verify the actual data.
So the moral of the story is that there are lots of ways to do this, and parsing a few million characters shouldn't be prohibitively expensive on its own, until you get into cross-referencing and other truly time-consuming tasks. The main thing to take from this is not to use reduce on lists, since you will be reallocating a lot more than you need to. Even a simple for loop will work fine for your purposes. Even though numpy is about ten times faster than the other implementations, it doesn't give a huge advantage.
Decoding
For the sake of completeness, here is a function to decode your results:
def decode(arr):
    mask = (arr > 0x3F)
    nnz = np.count_nonzero(mask)
    ind = np.flatnonzero(mask) + np.arange(nnz)

    diacritics = (arr[mask] >> 6) + 41  # 41 == 0x064A - 0x0621
    characters = (arr & 0x3F)
    characters[characters >= 27] += 5   # undo the shift of the second letter block

    output = np.empty(arr.size + nnz, dtype='U1').view(np.uint32)
    output[ind] = characters[mask]
    output[ind + 1] = diacritics

    output_mask = np.zeros(output.size, dtype=bool)
    output_mask[ind] = output_mask[ind + 1] = True
    output[~output_mask] = characters[~mask]

    output += 0x0621

    return output.base.view(f'U{output.size}').item()
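As a quick round-trip check (my addition, not from the original answer), decoding what convert_numpy produced should give back the sample string generated above:
encoded = convert_numpy(data)
assert decode(encoded) == data  # should hold with the encoding described above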
As a side note, the work I did here inspired this question: Converting numpy arrays of code points to and from strings