The written heritage of the “Islamicate” cultures that stretch from modern Bengal to Spain is as vast as it is understudied and underrepresented in the digital humanities. The sheer volume and diversity of the surviving works produced in Persian and Arabic by denizens of these lands in the premodern period make this body of texts ideal for computational forms of analysis. Efforts to utilize these new digital forms of analysis, however, have been stymied by poor OCR technology for Arabic-script languages and the lack of an open-access, standards-compliant Islamicate corpus.
The Open Islamicate Texts Initiative (OpenITI) is a multi-institutional effort to construct the first machine-actionable scholarly corpus of premodern Islamicate texts. Led by researchers at the Aga Khan University (AKU), Universität Wien (UW), and the Roshan Institute for Persian Studies at the University of Maryland (College Park), and guided by an interdisciplinary advisory board of leading digital humanists and Islamic, Persian, and Arabic studies scholars, OpenITI aims to develop the digital infrastructure necessary to achieve this goal, including improved Arabic-script OCR, Arabic-script standards for OCR output and text encoding, and platforms for collaborative corpus creation (e.g., CorpusBuilder). In the process, OpenITI will enable new synergies between Digital Humanities and the inter-related Islamicate fields of Islamic, Persian, and Arabic Studies.
Since its founding in 2016, OpenITI's work has focused on two primary areas: (1) improvement of Arabic-script OCR, and (2) corpus building. Our work on OCR, done in collaboration with Benjamin Kiessling of Universität Leipzig, has produced some of the most accurate results to date on Arabic-script texts (see full results here). Most importantly, these results were achieved on an open-source OCR engine (Kraken) that is retrainable and can be adapted for highly specific scholarly needs. In 2017 OpenITI also began collaborating with the SHARIAsource project of Harvard Law School on the creation of a digital text production pipeline called CorpusBuilder. Currently scheduled for beta release in January 2019, CorpusBuilder is a user-friendly, web-based, open-source application that allows users to upload, OCR, post-correct, annotate, and structurally tag a document. It also includes robust version control (built on the git model) and an API, both critically important features that will help facilitate the collaborative model of corpus production that OpenITI champions.
OpenITI's second focus flows out of our OCR work (literally): our ultimate goal is the creation of a machine-actionable and standards-compliant scholarly corpus of Persian and Arabic texts. (We sincerely hope to expand to Ottoman Turkish and Urdu texts in the near future too, as soon as funding permits.) After completing experimental Persian and Arabic corpus development projects over the course of 2015 (namely, the OpenArabic, KITAB (Knowledge, Information, and The Arabic Book), and Persian Digital Library (PDL) projects), OpenITI team members drafted a development plan that would bring these efforts together in one united Islamicate textual corpus containing approximately 10,000 Islamicate texts (ca. 7,000 Arabic and 3,000 Persian texts). This plan calls for us to: (1) review and format existing open-access premodern Persian and Arabic texts according to the CapiTainS canonical text services (CTS) and TEI-XML standards; (2) enrich these texts with as much verified metadata as possible; and (3) develop and execute a plan to achieve greater parity in the number, genre, and chronological coverage of both Persian and Arabic texts in the OpenITI corpus after reviewing the results of the first phase of this plan. (This need to make the existing collection of digital Persian and Arabic texts more representative of these traditions as a whole is the impetus for our work on Arabic-script OCR.)
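To give a rough sense of what CTS-style identification involves in step (1), the following minimal Python sketch splits a CTS URN into its components (namespace, text group, work, version, and optional passage). It is only an illustration under stated assumptions: the function and class names are hypothetical, the example URN is a made-up placeholder rather than an actual OpenITI identifier, and the sketch ignores the optional exemplar level and any subreference syntax.

```python
from typing import NamedTuple, Optional


class CtsUrn(NamedTuple):
    """Components of a CTS URN (hypothetical helper type for illustration)."""
    namespace: str
    textgroup: str
    work: Optional[str]
    version: Optional[str]
    passage: Optional[str]


def parse_cts_urn(urn: str) -> CtsUrn:
    """Split a URN of the form urn:cts:NAMESPACE:TEXTGROUP[.WORK[.VERSION]][:PASSAGE]."""
    parts = urn.split(":")
    if len(parts) not in (4, 5) or parts[0] != "urn" or parts[1] != "cts":
        raise ValueError(f"not a CTS URN: {urn!r}")

    namespace = parts[2]
    work_ids = parts[3].split(".")
    # This sketch handles text group, work, and version only (no exemplar level).
    if not namespace or not work_ids[0] or len(work_ids) > 3:
        raise ValueError(f"unsupported or malformed work component: {urn!r}")

    # Pad missing levels with None so the tuple always has the same shape.
    padded = work_ids + [None] * (3 - len(work_ids))
    passage = (parts[4] or None) if len(parts) == 5 else None
    return CtsUrn(namespace, padded[0], padded[1], padded[2], passage)


# Placeholder identifier, invented for this example only.
example = parse_cts_urn("urn:cts:examplens:author001.work001.edition-ara1:1.2")
print(example.textgroup, example.work, example.version, example.passage)
```

Keeping the text group, work, and version as separate fields is what allows metadata (step 2) to be attached at the appropriate level, for instance to an author, to a work, or to a particular edition.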
|Sarah Bowen Savant||Maxim Romanov||Matthew Thomas Miller|
|Associate Professor @ Institute for the Study of Muslim Civilisations, Aga Khan University, London (kitab-project.org)||Universitätsassistent für Digital Humanities @ Institut für Geschichte, Universität Wien||Assistant Professor, Persian Literature & Digital Humanities and Associate Director, Roshan Initiative in Persian Digital Humanities (PersDig@UMD), Roshan Institute for Persian Studies, University of Maryland, College Park; Affiliate, Maryland Institute for Technology in the Humanities|
|Bridget Almas||Gregory Crane||Fatemeh Keshavarz|
|Software Architect, The Alpheios Project||Alexander von Humboldt Professor of Digital Humanities, Alexander von Humboldt-Lehrstuhl für Digital Humanities Institut für Informatik, Universität Leipzig and Professor of Classics, Tufts University||Roshan Institute Chair in Persian Studies and Director, School of Languages, Literatures, and Cultures, University of Maryland, College Park|
|Ahmet T. Karamustafa|
|Professor of History, University of Maryland, College Park|
|Christopher W. Blackwell||Jamal J. Elias||Sunil Sharma||Olga Davidson|
|Louis G. Forgione University Professor of Classics, Furman University||Walter H. Annenberg Professor in the Humanities and Professor of Religious Studies and South Asia Studies, University of Pennsylvania||Professor of Persian and Indian Literatures, Boston University||Research Fellow, Institute for the Study of Muslim Societies and Civilizations, Boston University|
|Fred Donner||Beatrice Gründler||Konrad Hirschler||Matthew Jockers|
|Professor of Near Eastern History, University of Chicago||Professor, Berlin Graduate School Muslim Cultures and Societies, Freie Universität Berlin||Professor of Middle Eastern History, School of Oriental & African Studies, University of London||Susan J. Rosowski Associate Professor of English, University of Nebraska-Lincoln|
|Hugh Kennedy||Sabine Schmidtke||Paul E. Losensky||Laura Mandell|
|Professor of Arabic, School of Oriental & African Studies, University of London||Professor of Islamic Intellectual History, Institute for Advanced Study, Princeton||Professor of Central Eurasian Studies and Comparative Literature, Indiana University (Bloomington)||Professor of English and Director, Initiative for Digital Humanities, Media, and Culture, Texas A&M University|
|Intisar A. Rabb||Chase Robinson|
|Susan S. and Kenneth L. Wallach Professor, Radcliffe Institute for Advanced Study at Harvard University and Professor of Law, Harvard Law School||President and Distinguished Professor of History, The Graduate Center of the City University of New York|