Arabic tools

Alphabets

We use PyArabic constants which represents letters, instead of writting Arabic in the code.

hamza            = u'\u0621'
alef_mad         = u'\u0622'
alef_hamza_above = u'\u0623'
waw_hamza        = u'\u0624'
alef_hamza_below = u'\u0625'
yeh_hamza        = u'\u0626'
alef             = u'\u0627'
beh              = u'\u0628'
teh_marbuta      = u'\u0629'
teh              = u'\u062a'
theh             = u'\u062b'
jeem             = u'\u062c'
hah              = u'\u062d'
khah             = u'\u062e'
dal              = u'\u062f'
thal             = u'\u0630'
reh              = u'\u0631'
zain             = u'\u0632'
seen             = u'\u0633'
sheen            = u'\u0634'
sad              = u'\u0635'
dad              = u'\u0636'
tah              = u'\u0637'
zah              = u'\u0638'
ain              = u'\u0639'
ghain            = u'\u063a'
feh              = u'\u0641'
qaf              = u'\u0642'
kaf              = u'\u0643'
lam              = u'\u0644'
meem             = u'\u0645'
noon             = u'\u0646'
heh              = u'\u0647'
waw              = u'\u0648'
alef_maksura     = u'\u0649'
yeh              = u'\u064a'
madda_above      = u'\u0653'
hamza_above      = u'\u0654'
hamza_below      = u'\u0655'
alef_wasl        = u'\u0671'

Alphabetical Systems (Definitions)

Rasm: is any set of letters which are writtern in the same form, namely; they are indistinguishable in wirtting by they are distinguished from the context. For example, the letters ت ث ن ى, they can be written with only one rasm ىـ, without dots.

Alphabetical System: is a set of rasm; dynamically constructed by specifying the letters that you will treat them as one rasm. By the way, the default Arabic alphabet is a special case of the Alphabetical System where each letter is as one rasm.

Predefined systems are stored in systems object.

Default: each letter is treated as a unique rasm.
Without Dots: by removing the dots some letters will be indistinguishable; those letters are treated as one rasm. The following example shows the (Without Dots) system as a list of lists; where the sublist contains the letters which share the same rasm.
Hamazat: consider each any letter accompanied by hamaz ء as one rasm.

NOTE: You may go further and construct your system by speicying what letters you want to treat as one rasm, then you can do some statistical analysis like, count, variance, average, ...

Example:

q.systems.withoutDots
Out: 
[['ب', 'ت', 'ث', 'ن'], # Rasm 1
 ['ح', 'خ', 'ج'], # Rasm 2
 ['د', 'ذ'], # Rasm 3
 ['ر', 'ز'], # Rasm 4
 ['س', 'ش'], # Rasm 5
 ['ص', 'ض'], # Rasm 6
 ['ط', 'ظ'], # Rasm 7
 ['ع', 'غ'], # Rasm 8
 ['ف', 'ق']] # Rasm 9

Constructing a user-defined system:

system = [[alef_hamza_above, alef],
          [beh, teh]]

The previous piece of code means "Treat alef_hamza_above and alef as the same one latter, also treat beh and teh as one letter as well".

The rest of letters can be dynamically constructed using check_system()

And then, a system can be applied to some text analysis functions like counting, filtering, etc.

check_system

check_system(system, index=None)

Returns the alphabet including treated-as-one letters. If you pass the index as the second optional arguement, it returns the letter of the that index only, not the hole alphabet.

Args

system: [[char]], a list of letters, where each letter to be treated as one letter are in one sub-list, see Alphabetical Systems.
index: Optional integer, is a index of a letter in the new system.

Returns

list: full sorted system or a specific index.

Example

q.check_system([['alef', 'beh']])

>>> [['ء'],
['آ'],
['أ', 'ب'],
['ؤ'],
['إ'],
['ئ'],
['ا'],
['ة'],
['ت'],
['ث'],
['ج'],
['ح'],
['خ'],
['د'],
['ذ'],
['ر'],
['ز'],
['س'],
['ش'],
['ص'],
['ض'],
['ط'],
['ظ'],
['ع'],
['غ'],
['ف'],
['ق'],
['ك'],
['ل'],
['م'],
['ن'],
['ه'],
['و'],
['ى'],
['ي']]

The previous example prints each letter as one element in a new alphabet list, as you can see the two letters alef and beh are considered one letter.

factor_alef_mad

factor_alef_mad(sentance)

It returns the sentance having alef_mad factored into alef_hamza and alef_wasel.

Args

sentance: str, a string or list.

Returns

str: sentance having the alef_mad factored

Example

q.factor_alef_mad('آ')

>>> 'أْأَ'

grouping_letter_diacritics

grouping_letter_diacritics(sentance)

Grouping each letter with its diacritics.

Args

sentance: str

Returns

[str]: a list of x, where x is the letter accompanied with its diacritics.

Example

q.grouping_letter_diacritics('إِنَّا أَعْطَيْنَكَ الْكَوْثَرَ')

>>> ['إِ', 'نَّ', 'ا', ' ', 'أَ', 'عْ', 'طَ', 'يْ', 'نَ', 'كَ', ' ', 'ا', 'لْ', 'كَ', 'وْ', 'ثَ', 'رَ']

alphabet_excluding

alphabet_excluding(excludedLetters)

returns the alphabet excluding excludedLetters.

Args

excludedLetters: list[Char], letters to be excluded from the alphabet.

Returns

str: alphabet excluding excludedLetters.

Example

 q.alphabet_excluding([q.alef, q.beh, q.qaf, q.teh, q.dal, q.yeh, q.alef_mad])
 >>>
 ['ء',
 'ٔ',
 'أ',
 'ؤ',
 'إ',
 'ئ',
 'ة',
 'ث',
 'ج',
 'ح',
 'خ',
 'ذ',
 'ر',
 'ز',
 'س',
 'ش',
 'ص',
 'ض',
 'ط',
 'ظ',
 'ع',
 'غ',
 'ف',
 'ك',
 'ل',
 'م',
 'ن',
 'ه',
 'و',
 'ى']

strip_tashkeel

strip_tashkeel(string)

convert any letter in the listOfLetter to letter in the given text

Args

string: str, to drop tashkeel from.

Example

x = q.quran.get_verse(12, 2, with_tashkeel=True)
x
>>> 'إِنَّا أَنزَلْنَهُ قُرْءَنًا عَرَبِيًّا لَّعَلَّكُمْ تَعْقِلُونَ'

q.strip_tashkeel(x)
>>> 'إنا أنزلنه قرءنا عربيا لعلكم تعقلون'

buckwalter_transliteration

buckwalter_transliteration(string, reverse=False)

Back and forth Arabic-Bauckwalter transliteration. Revise Buckwalter

Args

string: to be transliterated.
reverse: Optional boolean. False transliterates from Arabic to Bauckwalter, True transliterates from Bauckwalter to Arabic.

Returns

str: transliterated string.

Example

q.buckwalter_transliteration('إِنَّا أَعْطَيْنَكَ الْكَوْثَرَ')

>>> <in~aA >aEoTayonaka Alokawovara