Rebuilding the building blocks of RePEc

31 Jul 2023 (modified: 01 Aug 2023) InvestinOpen 2023 OI Fund Submission
Funding Area: Critical shared infrastructure / Infraestructura compartida critica
Problem Statement: In the early 1990s, the Internet grew to the point where it became feasible to distribute academic papers for free. At that time, two non-commercial systems dedicated to free publishing appeared, organized by academics themselves. These were xxx.lanl.gov (now arXiv), by Paul Ginsparg, and RePEc, by me, Thomas Krichel. xxx had the backing of the Los Alamos National Lab. When I started in 1992, I had no data and no server to use. But I had a vision. I understood that economists needed a decentralized system. This system would mimic the existing working paper culture. The cost of the system would be absorbed by its participants. Fast forward 30 years, and RePEc still has no source of revenue. The technology it runs on, both conceptually and technically, is now about 25 years old. The infrastructure here consists of (1) the protocols and (2) the implementation software. The fact that both have held up for so long is a sign of strength. But times have moved on. Both documents and code need maintenance. We need to make sure the code is documented, which it is not at this time. We need to prune features that are not used. This is a job that needs dedicated attention. It cannot be done piecemeal over the years. Therefore I decided to write this application for $5000. If funded, I expect to work for a year on this project.
Proposed Activities: The proposal has three aims. (1) It aims to rewrite, clarify and rethink the fundamental pillars on which RePEc rests. These are the ReDIF templates and the Guildford protocol. (2) It aims to document, clarify, prune unnecessary parts of, and better publish the ReDIF reading software ReDIF-Perl. (3) It aims to create a conceptual document for a next generation of extremely low-cost, extremely low-technology academic publishing infrastructure. Let me take these three pieces in turn. (1) RePEc rests on two protocols that I wrote around 1997. (i) ReDIF, the Research Documents Information Format, is a simple "attribute: value" format. It expresses metadata about academic documents and related elements of reality, such as persons and institutions. The overall design aim is simplicity for non-trained staff. ReDIF is at http://openlib.org/acmes/root/docu/redif_1.html (ii) The Guildford protocol is a set of instructions on how to lay out files on a disk so that the ReDIF files can be harvested via ftp or http. The Guildford protocol is at http://openlib.org/acmes/root/docu/guilp.html There have been marginal changes to these documents through the years. They need a complete reexamination. And the Guildford protocol needs a complete rethink. At present it requires space on a server that is open to harvesters. While this is still possible to get, it is getting more difficult. Security concerns of IT departments are a part of the problem. Other problems may be found by talking to archive providers to see what can be done to make life easier for them. I count this for about 25% of the time spent. It will take place in months 6 to 9. (2) ReDIF is read by ReDIF-Perl. Perl was the best language choice at the time. But the use of Perl is declining. We should rewrite the software in Python. I am reluctant to do that without first studying how the internals of ReDIF-Perl work. This is what the proposal aims at. I want to study the software completely and document it.
I will probably enhance ReDIF-Perl in small parts. For example, I want to introduce JSON output. That format did not exist at the time ReDIF-Perl was written. There are major issues of character recognition that ReDIF-Perl should handle better. This second part is where the bulk of the work is. I count this for about 60% of the time spent. This is what I will start with. I suspect this will be done in months 1 to 6. (3) One of the reasons our tools are not more widely used is that we have not done much advertising for what we do. In the third part, I want to spend time to see what we can do to further our ideas of making scholarly publication cheaper and more accessible. I plan to bring out at least one longer piece that will chart the course for the economics community. I may also bring out work that will aim at exporting RePEc beyond economics. I am not committed to this as I prefer to do this with a partner. I count this for about 15% of the time spent. This will be done after the ninth month.
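As an illustration of the JSON output idea, here is a minimal sketch in Python. It is not how ReDIF-Perl works and it does not validate anything against the ReDIF specification. It assumes only that a line starting with "Template-Type:" opens a new template, that a line starting with whitespace continues the previous value, and that attributes may repeat. The function name parse_templates, the sample record and the choice to lower-case attribute names are assumptions made for this example.

```python
import json


def parse_templates(text):
    """Turn "attribute: value" text into a list of dicts (one per template).

    Sketch only: no schema checking, no encoding handling.
    Each attribute maps to a list because attributes may repeat.
    """
    templates = []
    current = None   # the template being filled
    last_key = None  # the attribute a continuation line belongs to
    for raw in text.splitlines():
        if not raw.strip():
            continue  # skip blank lines
        if raw[0].isspace():
            # Continuation line: append to the most recent value.
            if current is not None and last_key is not None:
                current[last_key][-1] += " " + raw.strip()
            continue
        key, sep, value = raw.partition(":")  # split on the FIRST colon only
        if not sep:
            continue  # not an attribute line; a real reader would report it
        key = key.strip().lower()  # assumption: compare names case-insensitively
        if key == "template-type":
            current = {}
            templates.append(current)
        if current is None:
            continue  # ignore anything before the first template
        current.setdefault(key, []).append(value.strip())
        last_key = key
    return templates


# Hypothetical sample record, invented for this example.
SAMPLE = """\
Template-Type: ReDIF-Paper 1.0
Author-Name: Jane Doe
Author-Name: John Roe
Title: A very long title
  that continues on a second line
Handle: RePEc:xxx:wpaper:0001
"""

records = parse_templates(SAMPLE)
print(json.dumps(records, indent=2, ensure_ascii=False))
```

Splitting on the first colon only keeps values such as handles and URLs intact, since they contain colons themselves.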
Openness: The existing protocol documents and the ReDIF-Perl software have been available in open access since their first drafts. I cannot even think of a reason why we would limit access to them. As for the actual data that RePEc has collected, I have been a fan of bulk distribution via rsync. I manage our rsync collection, see http://rsync.repec.org. Note that ReDIF data can be included in non-free datasets. An example is EconLit. This is produced by the American Economic Association. They sell EconLit for a handsome amount of change. We get a grand total of 0 dollars of income from this. But we do get recognition as the official way for working paper data to go into the EconLit database. As to wider stakeholder consultation, this could hardly be done in a systematic, scientific way. However, I intend to write to all the archive maintainers. I will make some effort to reconnect with people who have maintained archives that are no longer being maintained. I want to see where the problems arise. This is important for the Guildford protocol revision. Ultimately, I may come up with a standardized intermediate protocol that we could later implement for folks who cannot manage their own servers any more.
Challenges: ReDIF-Perl is written in Perl. I have coded in Perl since 1993. It was my main language until 2018. Then I switched to Python for all new code. I still have code to maintain in Perl. Code in Perl has a nasty reputation of being “write-only”. So my biggest risk is to not understand the code, or to spend so much time on each line that I will never finish. But ultimately what determines the readability of code is its author. Here the author is a hired coder, Ivan V. Kurmanov. I read code by him from another project. I could understand it, but it is very tough. My friend Christian Zimmermann gave up on the same code. If push comes to shove I can contact him. The libraries in ReDIF-Perl contain a total of 13293 lines. You may wonder why on earth one needs that much code. Well, ReDIF-Perl does not hard-wire the ReDIF specification. Instead, it implements a bespoke schema language, and the natural-language specification in the ReDIF documentation is then encoded in that schema language. The other challenge is the subject area of text processing. Ivan wrote the software in the late 90s. At that time, the implementation of Unicode was still a work in progress. As a result there are known issues with ReDIF-Perl reading text of unknown encoding. That is apparent from the frequent appearance of double-encoded UTF-8 on RePEc web sites. They are handed this data by ReDIF-Perl. This needs fixing.
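To make the double-encoding problem concrete, here is a minimal sketch in Python. It is not taken from ReDIF-Perl. It assumes the common failure mode where UTF-8 bytes were mistaken for Latin-1 and then encoded as UTF-8 again. The function names read_text and undo_double_utf8 are hypothetical, and the repair is a heuristic: a Latin-1 text whose characters happen to form valid UTF-8 byte sequences would be altered too.

```python
def read_text(raw: bytes) -> str:
    """Decode bytes of unknown encoding.

    Try UTF-8 first, because random Latin-1 text is rarely valid UTF-8.
    Fall back to Latin-1, which can decode any byte sequence.
    (Assumption of this sketch: the only candidates are UTF-8 and Latin-1.)
    """
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode("latin-1")


def undo_double_utf8(text: str) -> str:
    """Repair text whose UTF-8 bytes were once misread as Latin-1.

    If the round trip fails, the text was not double-encoded in this
    way, and it is returned unchanged.
    """
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


good = "Zürich"
# Simulate the bug: UTF-8 bytes wrongly decoded as Latin-1.
broken = good.encode("utf-8").decode("latin-1")

print(broken)                    # the mojibake form seen on web pages
print(undo_double_utf8(broken))  # repaired
print(undo_double_utf8(good))    # already correct, left unchanged
```

Trying UTF-8 before any single-byte encoding is the design choice that avoids creating double-encoded text in the first place; the repair function is only for data that is already damaged.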
Neglectedness: I write my funding applications with the aim to inform, rather than to impress. I have seen a lot of charlatans getting funding for work that never achieved anything. May they rest comfortably with it. I would not. I want to do things that actually work. Oftentimes, that involves dreary stuff. Dreary stuff makes for non-fancy funding applications. I am delighted this call aims at funding things that are otherwise neglected. Still, my hope of getting this proposal funded is low. As a consequence, I only ask for the minimum amount of 5000. I hope it can be squeezed in among better presented and more exciting projects. In general, funders want to fund projects that are new and exciting. The old and boring need not apply. But I did try to find funding to renew the architecture of RePEc from Donald J. Waters at the Mellon Foundation. I know him personally from the days we were on the OAI committee. I wrote to him on 7 January 2019. The mail is too long to show here. I was told we are not in Mellon's core area, as it funds the humanities. Ford would be a better match. I wrote to the person he suggested at the Ford Foundation. That person did not bother to reply. The last contributions to RePEc were from the Research Foundation of the French Central Bank, 3000 euros in 2019 and 2000 euros in 2021. This was for an archival system for RePEc. I collected over 1 million PDF files through that system. The foundation is now closed. So this funding source, however generous, is gone.
Success: Since its founding, the mission of RePEc has been to democratize access to research in economics. Economists know the price of everything, but the value of nothing. We want an alternative way to publish that is free to contribute to and free to use. Over 2000 RePEc archives have been created. There are over a million papers produced by them of which I have harvested a PDF copy. Now, surely this is a success. This application is about maintaining the tools that deliver that success. But there is another side to this story. The decentralized architecture of RePEc was the model for the OAI repository initiative. I was invited to join the technical committee. Sadly they did not listen to me. The OAI-PMH protocol ended up, in my view, over-engineered and therefore too expensive. Yes, I am an economist... But the fact remains that none of our smaller RePEc contributors could implement it. Looking at COAR Notify, it seems the trend goes further towards sophisticated systems. That is what funders want to fund. I suggest that this approach creates publishing systems for the few rather than the many. The effort for a single organization to set up such systems is growing. I am not saying these systems are wrong. But there need to be alternative approaches for low-resource environments. My hope is to extend my contribution to such alternative approaches. That would be a real success.
Total Budget: 5000
Budget File: pdf
Affiliations: RePEc
LMIE Carveout: I cannot pretend that we fit into this. First, we are not based anywhere. Second, we make all we have freely available to everybody. Given that we give everything away, we cannot track how it is being used, so we cannot make the case that we are predominantly used in low-income countries. And we cannot require users to track it for us. Then we would no longer be truly open access.
Team Skills: The funds in this proposal will be paid to me, Thomas Krichel. I am the only person who will have to work on this proposal. I am the prime founder of RePEc. Basically RePEc goes back to a gopher server I ran in 1993. In that year, I published the first ever online economics paper. In 1994 I converted the gopher to a web server. Over the years, I managed to get data from some of the larger providers. From 1997 to 1999 I got 129000 pounds of JISC funding. Among other things this allowed me to fund Ivan Kurmanov to write ReDIF-Perl. However, this does not mean that I do this alone. Clearly RePEc could never have taken off had I not managed to recruit a group of disciples. I found most of them in the 1990s. Since then, they have formed a loosely defined team that maintains RePEc services and coordinates data providers. RePEc is a meritocracy. Everybody has a set of responsibilities. Tasks are carried out with a minimum of co-ordination. One point to mention here is the organizational nature of RePEc. RePEc is not a legal entity. It has features of an organization. But it is mainly a set of data. Much of that data is collected from a set of about 2000 providers spread over the globe. When a legal representation of RePEc is required, RePEc uses the Open Library Society, Inc. (OLS). In these cases RePEc appears as a project of the OLS. See http://governance.repec.org for details. On request the RePEc board can make a statement to say that the proposal comes from RePEc.
Submission Number: 112