Working Open Across HathiTrust: Developing Paths for Collaborative Work towards Accessibility and Discoverability of Works in Historic, Hybrid and Multilingual Languages in the HathiTrust Digital Library

31 Jul 2023 (modified: 01 Aug 2023) | InvestinOpen 2023 OI Fund Submission
Funding Area: Critical shared infrastructure / Infraestructura compartida critica
Problem Statement: HathiTrust Digital Library offers unprecedented access to digitized records, but it presents significant challenges for discovering corpora in historic and hybrid languages with nonstandard language and script combinations, including OCR errors and mislabeled metadata. The resulting inaccessibility leads to inequitable research practices, leaving these marginalized language domains significantly understudied in both computational and traditional humanities scholarship. As a test case, we have designed an open computational workflow that uses a machine learning model to discover texts in languages whose metadata histories are laden with mislabeling. Our test case is Armeno-Turkish: vernacular Turkish written in the Armenian script. In this study, we detected numerous metadata errors and introduced ways of locating works with incomplete metadata. However, there are currently no programmatic mechanisms for updating reference data across institutions. The goal of this project, therefore, is to form a bridge between these research communities and HathiTrust. Using this rich test case as an example, we propose to investigate ways of developing open infrastructure overlays on HathiTrust that would allow researchers who work in low-resource, hybrid languages and languages in non-Roman scripts to collaborate across HathiTrust and to have interoperable structures for similarly marginalized language groups.
Proposed Activities: The project period will be 12 months, starting in November 2023 and ending in November 2024. The deliverables will include documentation for a reference API specification that would connect HathiTrust to research communities at different institutions, so that novel metadata results can be collectively shared, updated, and maintained. As a first step, Hale Sirin Ryan will finish the workflow that uses a machine learning language model to detect Armeno-Turkish works on HathiTrust that are otherwise impossible to find with search tools (the current version can be found here: https://github.com/comp-int-hum/Armeno-Turkish-Collection). As the domain expert, Ali Bolcakan will examine the resulting dataset and make the necessary annotations. This workflow will be shared openly so that the research community can apply it to their own discoverability needs on HathiTrust. As the project team, we have so far addressed the discoverability challenge of Armeno-Turkish works with the workflow described above. While we propose to share this open-source workflow to aid with similar discoverability challenges, as the next step we will investigate ways of integrating these findings into the HathiTrust corpora so that researchers can collaborate across HathiTrust and openly access and contribute to the records. To do so, we will plan the process of designing the documentation for a reference API specification that would enable an interoperable infrastructure across HathiTrust, so that metadata findings like ours can be shared and contributed to by the user communities.
Specific activities to this end in this grant period will include identifying the stakeholders and user communities in the multilingual and nonstandard languages research domains and assembling a mailing list for them, as well as determining structural ways of involving the user community in the different planning and implementation stages. The aim is to discuss a reference implementation that will allow for collaboration across HathiTrust to collate metadata corrections and expansions. To achieve these goals, we will consult with different experts. The project co-leads Hale Sirin Ryan and Ali Bolcakan will initially plan a work session at UIUC, the University of Michigan, or JHU with Glen Layne-Worthey, the Associate Director for Research Support Services at HathiTrust Research Center, UIUC; Graham Dethmers, Metadata Analyst at HathiTrust; and the project consultant Tom Lippincott, the director of the JHU Center for Digital Humanities (CDH) with a joint faculty appointment in the Computer Science Department and the Center for Language and Speech Processing at JHU. Final products will include documentation for an API specification, a stakeholder community network (such as a listserv or a conference interest group), and concrete plans for user community involvement in the continued steps of this work.
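The discovery and annotation steps described above can be pictured with a minimal sketch. It is not the project's workflow, which uses a machine learning language model. Here a much simpler script-ratio heuristic stands in for that model: a volume becomes a candidate for expert review when its text is mostly in the Armenian script but its recorded language is not Armenian. The heuristic cannot tell Armenian from Armeno-Turkish inside Armenian-script text, which is where a language model would be needed. The names `CorrectionCandidate`, `flag_candidate`, the field names, the `0.8` threshold, and the language codes `"arm"` and `"tur"` are all assumptions made for illustration, not part of HathiTrust or of any existing API specification.

```python
from dataclasses import dataclass
from typing import Optional


def is_armenian(ch: str) -> bool:
    """True if ch falls in the Unicode Armenian block or Armenian ligatures."""
    cp = ord(ch)
    return 0x0530 <= cp <= 0x058F or 0xFB13 <= cp <= 0xFB17


def armenian_ratio(text: str) -> float:
    """Fraction of alphabetic characters that are in the Armenian script."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    return sum(is_armenian(ch) for ch in letters) / len(letters)


@dataclass
class CorrectionCandidate:
    """Hypothetical record that a domain expert would review and annotate."""
    volume_id: str            # illustrative identifier, not a real HathiTrust ID
    recorded_language: str    # language code as found in the existing metadata
    armenian_ratio: float     # evidence produced by the heuristic
    status: str = "needs_expert_review"


def flag_candidate(volume_id: str, recorded_language: str, text: str,
                   threshold: float = 0.8) -> Optional[CorrectionCandidate]:
    """Flag a volume whose script disagrees with its recorded language."""
    ratio = armenian_ratio(text)
    if ratio >= threshold and recorded_language != "arm":
        return CorrectionCandidate(volume_id, recorded_language, ratio)
    return None
```

For example, `flag_candidate("vol-1", "tur", page_text)` would return a candidate record when `page_text` is mostly Armenian-script, and `None` when the recorded language is already `"arm"` or the text is in another script. A shared API of the kind the project proposes to document could exchange records of roughly this shape, though the actual fields would be decided with the user communities.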
Openness: The HathiTrust digital repository, which houses 18 million volumes, is currently the largest open collaborative effort in consolidating digitized works for research purposes. The current experimental workflow for machine-learning-assisted discovery of works in hybrid languages in HathiTrust will be shared openly, throughout its developmental stages, on GitHub (the current version: https://github.com/comp-int-hum/Armeno-Turkish-Collection). The workflow uses SCons, an open-source software construction tool that allows for full replicability of the whole project. During the project period, the authors of this proposal aim to publish both their experimental workflow and the critical insights in open-access journals that make research freely available to the public, such as Journal of Open Humanities Data or Digital Humanities Quarterly. The metadata findings and the proposed documentation for the API specification will follow the Principles of Open Scholarly Infrastructure to address concerns about the sustainability of these efforts. Deliverables of this project, including the findings of the investigation into overlaid ML integration to aid in the discovery of understudied language groups, will be shared with the broader multilingual research community in venues such as the ADHO Multilingual DH Special Interest Group and the Linked Open Data Special Interest Group.
Challenges: Our goal is to find programmatic ways of contributing to improved metadata on hybrid and historic languages across HathiTrust, but a major challenge is that each language domain poses unique problems, such as transliteration and OCR errors and conflicting labeling practices. To address these challenges, we have two strategies. First, we already have a rich case with tangible results, which can be used as a concrete test case for metadata discussions. Additionally, as the core team, we will expand our work to include Greco-Turkish and Judeo-Spanish, fields for which we have the technical and domain expertise. More test cases will yield more concrete metadata information and therefore more insight into metadata problems in HathiTrust. Second, by creating channels of communication (specifically through listservs, special interest groups, conferences, and journals), we will include user communities and domain experts in the conversation throughout the process. Our collaboration with the HathiTrust team gives us active guidance about the technical and institutional possibilities of HathiTrust. These strategies are aimed at an “open by design” approach. A challenge beyond the scope of this project will be scaling this work across other digital libraries and archives that are not open access. This is a future goal for us, and we hope that the results of this project will open the way to involving such institutions.
Neglectedness: Since data-driven, computational, and digital grants prioritize domains with high-resource languages and structured data, marginalized historic and hybrid languages such as Armeno-Turkish remain neglected at all stages of research, from exploration to experimentation. In addition, these languages remain low-resource because they are historic languages for which new data is not generated online. There are some opportunities for endangered languages, such as the NEH grant Documenting Endangered Languages. However, already extinct historical languages do not fall within their scope. The Digital Humanities Advancement Grant (DHAG) and Humanities Collections and Reference Resources (HCRR) grants offered by the NEH are currently the two best-fitting grants for a project like ours, and we plan to apply for both at a further stage. We believe that the present grant will allow us to advance the groundwork for this project by supporting the desired project outcomes: 1. a machine-learning-assisted open workflow to aid in the discovery of works in historic and hybrid languages, and 2. a process for envisioning open and sustainable ways of sharing such research outcomes across HathiTrust, centered around the user communities and their needs. We hope that these outcomes will prepare us to seek grants to scale up and generalize our findings and ensure that user community needs remain central to the implementation decisions.
Success: By the end of the grant period, our goal is to develop the machine-learning-assisted open workflow to aid in the discovery of works in historic and hybrid languages. Success for this outcome means sharing this workflow, which will be entirely accessible to the public and sustainable through GitHub repositories and the SCons build system, which is free, open-source software accessible to everyone. Another measure will be producing a paper describing the methodology and findings (including successes and failures) to be published in a free, open-access journal such as Journal of Open Humanities Data. We will measure the success of the documentation development process not only by how far we get in terms of tangible implementation specifications, but also by the amount of user community representation involved in the process, measured by the number and diversity of the stakeholders we reach through the listserv and conference presentations to special interest groups. We believe that our project will also be important for research in many other historical and geographical contexts that involve different language and script combinations. This kind of research will also be useful for generalists. For example, researchers in global history can implement the workflow to give a more nuanced and ultimately fairer overview of language and script distributions. Increased representation of marginalized languages in global studies will constitute one measure of success.
Total Budget: 11,500
Budget File: pdf
Affiliations: We are applying as a team of individuals.
LMIE Carveout: While our team members reside in North America, we envision that offering data and paths of access will benefit research communities in Low-and-Middle-Income Economies, such as Armenia and Turkey as well as India, Lebanon, and Egypt, whose domains involve hybrid, historic languages with multiple scripts. The Armeno-Turkish case is conceived as a test case for later implementations in other marginalized languages, such as languages in India, where estimates point to as many as 88 distinct scripts. Academic freedom and funding offered by institutions in Europe and North America are crucial for a project like this one that focuses on hybrid, transnational languages.
Team Skills: Hale Sirin Ryan is a postdoctoral fellow at the JHU Center for Digital Humanities with an interdisciplinary background in humanities and data science. She has a PhD in Intellectual History and Comparative Literature from JHU, with a focus on German and late Ottoman thought, and a program certificate in Applied Data Science and Machine Learning from MIT PE. In this project, she developed the machine learning workflow, and she will lead the planning of the documentation on sustainable, open-access integration of such reference mechanisms. Ali Bolcakan is a researcher in Asian Languages and Cultures at the University of Michigan, where he defended his dissertation in Comparative Literature. He oversees the accessibility-focused redesign work on the Translation Networks website, and he designed and managed student work on an educational digital card game, Tower of Babel: HathiTrust Edition. His essay "Ottoman Babel: Language, Cosmopolitanism and the Novel in the long Tanzimat Period" was published in the edited volume Ottoman Culture and the Project of Modernity. Our advisors will provide the expertise on HathiTrust research infrastructure and engineering protocols: Glen Layne-Worthey, Associate Director for Research Support Services; Graham Dethmers, Metadata Analyst at HathiTrust; and Tom Lippincott, the director of the JHU Center for Digital Humanities (CDH) and faculty in Computer Science and the Center for Language and Speech Processing at JHU.
Submission Number: 98