Querido Diário: an open infrastructure to serve the Brazilian research and development ecosystem

31 Jul 2023 (modified: 01 Aug 2023) · InvestinOpen 2023 OI Fund Submission
Funding Area: Critical shared infrastructure
Problem Statement: This work proposes improving an open infrastructure that tackles the lack of data on Brazilian cities, so that it better serves the ecosystem of research and knowledge creation. In Brazil, accessing and analyzing the official decision-making acts published by municipalities is an arduous task. No centralized platform is available, so the only reliable source of information is the many PDF files of official gazettes, which are published in a closed, scattered, and unstructured way. This makes work extremely difficult for researchers, scholars, scientists, and the entire community that relies on open knowledge. Centralizing this data in one platform, making it easily accessible for both quick searches and heavy data analysis, and enriching it with other public data sources are the primary goals of Querido Diário (“Dear Diary” in English, or QD), an open infrastructure developed and maintained by Open Knowledge Brasil. Currently, QD covers 91 cities, home to 49 million people. With over 200 thousand indexed files, it already provides a significant new dataset of public acts. In line with open science principles, this project aims to develop and feed a data lake that makes it possible to cross-reference and connect the entities found in QD texts with other public datasets, enabling real-time queries. In addition, it will develop a repository based on CKAN technology, which will automatically and periodically publish the databases consolidated on the platform.
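As an illustration of what cross-referencing entities found in QD texts with another public dataset could look like, here is a minimal sketch in Python. It assumes that companies are identified in gazette text by their CNPJ number written in the usual punctuated form (XX.XXX.XXX/XXXX-XX). The function names, the sample sentence, and the one-entry registry are hypothetical and are not taken from the Querido Diário codebase or from any real dataset.

```python
import re

# Matches a CNPJ written in its punctuated form, e.g. 12.345.678/0001-90.
# Assumption: gazettes print CNPJs this way; unpunctuated numbers are ignored.
CNPJ_PATTERN = re.compile(r"\b\d{2}\.\d{3}\.\d{3}/\d{4}-\d{2}\b")


def normalize_cnpj(raw: str) -> str:
    """Strip punctuation so the CNPJ can be used as a join key."""
    return re.sub(r"\D", "", raw)


def link_companies(gazette_text: str, registry: dict[str, str]) -> list[dict[str, str]]:
    """Return the registry entries whose CNPJ appears in the gazette text.

    `registry` maps a 14-digit CNPJ to a company name. Each company is
    reported once, in order of first appearance. Check digits are not
    validated in this sketch.
    """
    linked: dict[str, str] = {}
    for match in CNPJ_PATTERN.finditer(gazette_text):
        cnpj = normalize_cnpj(match.group())
        if cnpj in registry and cnpj not in linked:
            linked[cnpj] = registry[cnpj]
    return [{"cnpj": cnpj, "name": name} for cnpj, name in linked.items()]


# Hypothetical sample data, for illustration only.
sample_text = (
    "Contrato firmado com a Empresa Exemplo Ltda, CNPJ 12.345.678/0001-90, "
    "conforme processo administrativo."
)
sample_registry = {"12345678000190": "Empresa Exemplo Ltda"}

links = link_companies(sample_text, sample_registry)
```

In a data lake, the same join would more likely run as a query over tables than over in-memory dictionaries; the sketch only shows the matching logic.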
Proposed Activities: Implementing a Data Platform based on open-source technologies requires a set of tools. We propose a lean architecture built around the most critical tasks: data ingestion (Airbyte), transformation (dbt), storage management (Delta Lake), orchestration (Apache Airflow or Dagster), and visualization (Metabase). After the Data Platform is developed and the datasets (existing and new) and data pipelines are integrated, the datasets will be published through a CKAN instance, making it easy for other projects to use the collected data. In terms of personnel, a Data Engineer Tech Lead, a Data Architect, and Software Engineers are needed to plan, implement, and validate the Platform. A Community Manager will also be needed to maintain engagement around the project's open-source code and dataset integrations. Here is a rough timeline of the proposed project: 1. Data Platform modeling (2 months); 2. Data Platform staging implementation (3 months); 3. Adaptation of Querido Diário's data pipelines (including the dataset of Brazilian registered companies) to the Data Platform (2 months); 4. Data Platform overall adjustments and move to production environment (1 month); 5. Adaptation and inclusion of other datasets and pipelines (Brazilian elections datasets, expenses of politicians datasets, proposed laws datasets); 6. Community engagement for adding other datasets (6 months); 7. Staging implementation of a CKAN instance for dataset publishing (2 months); 8. CKAN instance adjustments and move to production environment (1 month).
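To make the publishing step more concrete, the sketch below shows how a consolidated dataset might be registered on a CKAN instance through CKAN's Action API (`package_create`). It only builds the HTTP request and does not send it. The instance URL, the API token, and the dataset name, title, and description are placeholders chosen for illustration, not values from the project.

```python
import json
import urllib.request


def build_package_create_request(
    ckan_url: str, api_token: str, name: str, title: str, notes: str
) -> urllib.request.Request:
    """Build (but do not send) a request that registers a dataset on CKAN.

    Uses the CKAN Action API endpoint /api/3/action/package_create, which
    expects a JSON body and an API token in the Authorization header.
    """
    payload = {"name": name, "title": title, "notes": notes}
    return urllib.request.Request(
        url=ckan_url.rstrip("/") + "/api/3/action/package_create",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": api_token,
        },
        method="POST",
    )


# Placeholder values, for illustration only.
request = build_package_create_request(
    ckan_url="https://ckan.example.org",
    api_token="REPLACE_WITH_API_TOKEN",
    name="qd-gazettes-sample",
    title="Querido Diário gazettes (sample)",
    notes="Illustrative dataset entry; not a real published dataset.",
)

# Sending would be done with urllib.request.urlopen(request) against a real
# instance; it is left out here so the sketch runs without network access.
```

In a periodic publishing job, a call like this would typically be triggered by the orchestrator once a pipeline run has produced a new consolidated dataset.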
Openness: The project is open in several ways: 1. Its code is open, allowing collaboration and visibility into open issues and the development roadmap; 2. Its infrastructure uses only open technologies; 3. Data is available under an open license and can be accessed through open APIs, allowing the development of third-party applications; 4. It is aligned with open science principles; and, last but not least, 5. Open governance practices already exist, such as regular maintainers' meetings and open discussion channels. We know it "takes a village to manage and share data" (see Borgman & Bourne, 2022 - https://doi.org/10.1162/99608f92.42eec111), so over the past two years we have been building a vibrant community around the project, with more than a thousand people in our Discord channel and 'ambassadors' spread all over the country. Last year we launched the "Querido Diário in the Universities" program, making an open call for partnerships with researchers, professors, and students to use QD data or to carry out technical research that helps improve the infrastructure. It has been a success so far, and now we need investment to create a better infrastructure to serve this academic ecosystem.
Challenges: A significant change in the infrastructure is challenging, but we have the expertise to carry it out in the proposed time frame. Engaging the community is also demanding and critical work. Working closely with people and their dataset demands will be necessary, especially in the early stages, when the integration workflow will be defined.
Neglectedness: In our field of civil society, it is unusual to find funders who are willing to invest in infrastructure and who understand the value of investing in open infrastructure, which may require a larger initial investment but brings multiple benefits to the whole ecosystem in the medium and long term. We have managed to invest in infrastructure indirectly, through activities such as producing reports and studies and advocating for public policies. Still, more is needed to implement the technologies required for a robust infrastructure. Our only funding in this regard came from the Inter-American Development Bank, through an open call with the Latin American Initiative for Open Data, and from a contribution by the Digital Public Goods Alliance, also after an open call. The resources were primarily used to develop Querido Diário's interface and its structure for document collection and indexing.
Success: Since the community is the main driver of the project, success will be measured mainly by its adoption of and continued engagement with the data, the code, and the platform, and also by the derivative results that the improved data accessibility will enable. These indicators will be our main focus in evaluating the project's success: 1. Number of people contributing to dataset integration; 2. Number of groups interested in participating in the project; 3. Number of derivative works; 4. Number of platforms derived from or consuming the data; 5. Number of contributions to the new open-source infrastructure code; 6. Number of downloads of datasets in the CKAN platform; 7. Number of active users of Querido Diário's search platform and of the new CKAN platform; 8. Reduced time and complexity of integrating new datasets with Querido Diário data.
Total Budget: USD 18,054.00
Budget File: pdf
Affiliations: Yes. The proposal is developed and will be implemented by the executive team of Open Knowledge Brasil, an organization formally registered in the country since 2013. In addition, the Querido Diário project, on which this proposal is mostly based, is also developed and maintained by the same organization.
LMIE Carveout: Yes. All the people involved in the design and implementation of this proposal are based in Brazil, and the organization hosting the project is also formally registered in the country. In addition, the proposal is focused on freeing and connecting public data from Brazilian municipalities, in which the country’s enormous inequality is manifest. This makes the project’s user and contributor communities primarily Brazilian.
Team Skills: The authors of this proposal, as well as the executive team of Open Knowledge Brasil, have extensive experience developing strategies, methodologies, technologies, and infrastructures for releasing and connecting public data. Over the last decade, the organization has played a key role in the country in promoting open government data, as well as the necessary protection of personal data, through a combination of mobilizing citizen groups to strengthen government oversight, developing technologies and training materials for handling public data, and encouraging the free software and open source community to get involved in civic technology projects. The entire technical team has been working on the Querido Diário project for years and has in-depth knowledge of all the technologies involved in its operation. In addition, Querido Diário already works with a network of researchers and scholars through the “Querido Diário in the Universities” program. Through this initiative, supported by the Data Science Platform applied to Health (PCDaS) of the Oswaldo Cruz Foundation (Fiocruz), researchers from universities and research institutes throughout Brazil use Querido Diário’s open infrastructure to carry out practical and academic research activities in the field of data science.
How Did You Hear About This Call: Word of mouth (e.g. conversations and emails from IOI staff, friends, colleagues, etc.)
Submission Number: 146