Indo-Pakistani Transliteration System [WIP 🚧]

GokulNC/Indic-PersoArabic-Script-Converter

Indic-PersoArabic-Script-Converter

Indo-Pakistani Transliteration

A Python library to convert from Indian scripts to Pakistani scripts and vice versa.

Currently supported methods

  1. Rule-based conversion
  • Faster, but does not support short vowels
  • Less accurate, especially for Perso-Arabic-to-Indic conversion
  2. Sangam Project's online transliteration API
  • Uses an online endpoint for the conversion
  • Produces much better results, but is much slower
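
Since the two methods trade accuracy for speed and availability, one option is to try the online method first and fall back to the rule-based one when it fails. The sketch below is not part of the library: `transliterate_with_fallback` is a hypothetical helper, and it assumes that failures of the online call surface as ordinary Python exceptions.

```python
from typing import Callable

# Both documented functions take (text, from_script, to_script).
Converter = Callable[[str, str, str], str]


def transliterate_with_fallback(
    text: str,
    from_script: str,
    to_script: str,
    primary: Converter,
    fallback: Converter,
) -> str:
    """Try `primary` first; use `fallback` if it raises."""
    try:
        return primary(text, from_script, to_script)
    except Exception:
        # Assumption: network or API errors are raised as exceptions.
        return fallback(text, from_script, to_script)
```

In practice one would pass `online_transliterate` as `primary` and `script_convert` as `fallback`.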

Usage

Installation

Pre-requisites:

  • Python 3.7+
  • indic_nlp_library, installed from GitHub: pip install git+https://github.com/GokulNC/indic_nlp_library

Then install the package:

pip install indo-arabic-transliteration

Using rule-based conversion

from indo_arabic_transliteration.mapper import script_convert
# Signature: script_convert(text: str, from_script: str, to_script: str)
script_convert("हैदराबाद", 'hi-IN', 'ur-PK')

Using Sangam API

from indo_arabic_transliteration.sangam_api import online_transliterate
# Signature: online_transliterate(text: str, from_script: str, to_script: str)
online_transliterate("हैदराबाद", 'hi-IN', 'ur-PK')
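
Because the online method is described as much slower, repeated inputs can be memoized. The following is a minimal sketch, not a library feature: `make_cached` is a hypothetical wrapper that accepts any function with the `(text, from_script, to_script)` interface and caches its results in memory with `functools.lru_cache`.

```python
from functools import lru_cache
from typing import Callable


def make_cached(convert: Callable[[str, str, str], str], maxsize: int = 1024):
    """Return a version of `convert` that remembers earlier results."""

    @lru_cache(maxsize=maxsize)
    def cached(text: str, from_script: str, to_script: str) -> str:
        # Only reached on a cache miss; hits are served from memory.
        return convert(text, from_script, to_script)

    return cached
```

For example, `cached_online = make_cached(online_transliterate)` would call the endpoint once per distinct input. The cache lives only for the lifetime of the process.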

Languages

We use standard BCP 47 language tags (a language subtag plus a region subtag, such as hi-IN) to refer to the language-script combinations.
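
As a rough illustration of how these codes pair up, the sketch below records the three pairs listed in this section and looks up the counterpart of a given code. The names `SCRIPT_PAIRS`, `split_tag`, and `counterpart` are hypothetical helpers for this example and are not provided by the library.

```python
# Indian-side code -> Pakistani-side code, as listed in this README.
SCRIPT_PAIRS = {
    "hi-IN": "ur-PK",  # Hindi (Devanagari) <-> Urdu (Perso-Arabic)
    "pa-IN": "pa-PK",  # Punjabi (Gurmukhi) <-> Punjabi (Shahmukhi)
    "sd-IN": "sd-PK",  # Sindhi (Devanagari) <-> Sindhi (Perso-Arabic)
}

# Make the lookup work in both directions.
COUNTERPART = {**SCRIPT_PAIRS, **{v: k for k, v in SCRIPT_PAIRS.items()}}


def split_tag(tag: str):
    """Split a two-part tag such as 'hi-IN' into (language, region)."""
    language, region = tag.split("-")  # raises ValueError for other shapes
    return language.lower(), region.upper()


def counterpart(tag: str) -> str:
    """Return the code on the other side of the pair."""
    try:
        return COUNTERPART[tag]
    except KeyError:
        raise ValueError(f"unsupported code: {tag!r}") from None
```

`split_tag` handles only the two-subtag form used here, not the full BCP 47 grammar.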

Hindi-Urdu (Hindustani)

| Language | Script | Code |
|----------|--------|------|
| Hindi | Devanagari | hi-IN |
| Urdu | Perso-Arabic | ur-PK |

Example:

# Rule-based
script_convert("हैदराबाद", 'hi-IN', 'ur-PK') # حیدرآباد
script_convert("حيدرآباد", 'ur-PK', 'hi-IN') # हीदराबाद

# Online-API
online_transliterate("حيدرآباد", 'ur-PK', 'hi-IN') # हैदराबाद
online_transliterate("हैदराबाद", 'hi-IN', 'ur-PK') # حیدرآباد
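
The rule-based examples above suggest that converting to Perso-Arabic and back need not reproduce the original spelling. A small round-trip check can make such losses visible. This is a sketch with a hypothetical helper, `round_trip`, which takes the conversion function as an argument so it works with either method.

```python
from typing import Callable, Tuple


def round_trip(
    text: str,
    from_script: str,
    to_script: str,
    convert: Callable[[str, str, str], str],
) -> Tuple[str, str, bool]:
    """Convert forward and back; report whether the original was recovered."""
    forward = convert(text, from_script, to_script)
    back = convert(forward, to_script, from_script)
    return forward, back, back == text
```

Calling `round_trip(word, 'hi-IN', 'ur-PK', script_convert)` returns both intermediate strings, so a mismatch can be inspected rather than just detected.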

Notes & Resources:

Punjabi

| Language | Script | Code |
|----------|--------|------|
| East Punjabi | Gurmukhi | pa-IN |
| West Punjabi | Shahmukhi | pa-PK |

Example:

# Rule-based
script_convert("ਸਿੰਘ", 'pa-IN', 'pa-PK') # سںگھ
script_convert("سںگھ", 'pa-PK', 'pa-IN') # ਸਂਘ

# Online-API
online_transliterate("سنگھ", 'pa-PK', 'pa-IN') # ਸਿੰਘ
online_transliterate("ਸਿੰਘ", 'pa-IN', 'pa-PK') # سِنگھ

Notes & Resources:

Sindhi

| Language | Script | Code |
|----------|--------|------|
| Indian Sindhi | Devanagari | sd-IN |
| Pakistani Sindhi | Perso-Arabic | sd-PK |

Example:

# Rule-based
script_convert("हैदराबाद", 'sd-IN', 'sd-PK') # حیدرآباد
script_convert("حيدرآباد", 'sd-PK', 'sd-IN') # हीदराबाद

# Online-API
online_transliterate("حيدرآباد", 'sd-PK', 'sd-IN') # हैदराबाद
online_transliterate("हैदराबाद", 'sd-IN', 'sd-PK') # حیدرآباد

Notes & Resources:


Other Methods

Machine-Learning-based Transliteration

  • Uses the LibIndicTrans library for its models
    • Install it with: pip install git+https://github.com/libindic/indic-trans
  • Currently supports only the Hindi-Urdu pair

API:

from indo_arabic_transliteration.ml_based import ml_transliterate
# Same interface as script_convert()
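
Because this method depends on an extra package, a caller may want to check that the package is present before importing `ml_based`. The sketch below uses the standard library's `importlib.util.find_spec`. The helper name `is_installed` is hypothetical, and the module name `indictrans` used in the usage note is an assumption about what LibIndicTrans installs; it is not stated in this README.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be found without importing it."""
    return importlib.util.find_spec(module_name) is not None
```

For example, `if is_installed("indictrans"): ...` could guard the import of `ml_transliterate` and select another method otherwise.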

Indic-to-Arabic with Diacritics

  • Indic scripts are mostly phonetic, so vowel information is available in the source. Use this method to retain it as diacritics in the Perso-Arabic output
    • Currently supports only Hindustani (Hindi to Urdu) and Punjabi (Gurmukhi to Shahmukhi)
    • Uses the Aksharamukha library

API:

from indo_arabic_transliteration.lossless_converter import convert_with_diacritics
# Same interface as script_convert()
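
To illustrate what "diacritics" means here, the sketch below removes the Arabic short-vowel marks from a string, turning a vocalized spelling such as سِنگھ into the plain سنگھ. It is independent of the library: `strip_harakat` is a hypothetical helper, and the set of code points it removes (U+064B to U+0652, plus U+0670) is a choice made for this example.

```python
# Arabic short-vowel and related marks (fathatan .. sukun) plus superscript alef.
HARAKAT = {chr(cp) for cp in range(0x064B, 0x0653)} | {"\u0670"}


def strip_harakat(text: str) -> str:
    """Drop short-vowel marks, keeping all base letters unchanged."""
    return "".join(ch for ch in text if ch not in HARAKAT)
```

Comparing a diacritic-preserving result against its stripped form shows exactly which vowel marks the conversion retained.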

Support

  • For help in using the library, please use the GitHub Issues section.
  • For script conversion errors from the online API, please write directly to the Sangam team. We are not related to them in any way, and this is not an official library.
