Track: Web mining and content analysis
Keywords: decentralized web, replication, centralization, deduplication, web data management
Abstract: The InterPlanetary File System (IPFS) is a pioneering effort
for Web 3.0, well known for its decentralized infrastructure.
However, recent studies have shown that IPFS exhibits
a high degree of centralization and has integrated centralized
components for better performance. While this shift contradicts
the core decentralized ethos of IPFS and risks lowering the data
replication level and thus availability, it also opens opportunities
for better data management and cost savings through deduplication.
To explore these challenges and opportunities, we start
by collecting an extensive dataset of IPFS internal traffic
spanning the last three years, comprising more than 20 billion
messages. By analyzing this long-term trace, we obtain a more
complete and accurate view of how centralization evolves
over an extended period. In particular: (1) IPFS shows a low
replication level in general, with only about 2.71% of data
files replicated more than 5 times; while increasing replication
enhances lookup performance and data availability, it adversely
affects downloading throughput due to the overhead of managing
peer connections. (2) There is a clear growing trend of
centralization within IPFS over the last three years: just 5% of
peers now host over 80% of the content, down significantly from
21.38% three years ago, a shift largely driven by the increase
of cloud nodes. (3) The default IPFS deduplication strategy,
Fixed-Size Chunking (FSC), is largely inefficient, especially
with the current 256KB chunk size, achieving nearly zero
deduplication. Although Content-Defined Chunking (CDC) with
smaller chunks could save significant storage (about 1.8 PB) and
cost, it could negatively impact user performance. We thus
design and evaluate a new metadata format that optimizes
deduplication without compromising performance.
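To make the FSC/CDC distinction concrete, the sketch below contrasts the two chunking styles. It is a minimal illustration, not IPFS's actual implementation: the gear-table rolling hash follows the general FastCDC style, and all parameters (chunk sizes, mask bits, the `dedup_ratio` helper) are illustrative assumptions. FSC splits at fixed byte offsets, so a small insertion shifts every later chunk boundary and defeats deduplication; CDC picks boundaries from the content itself, so boundaries resynchronize after the edit.

```python
import hashlib

# Gear table: one deterministic pseudo-random 32-bit value per byte.
_GEAR = [int.from_bytes(hashlib.sha256(bytes([i])).digest()[:4], "big")
         for i in range(256)]

def fsc_chunks(data, size=4096):
    """Fixed-Size Chunking: split at fixed byte offsets."""
    return [data[i:i + size] for i in range(0, len(data), size)]

def cdc_chunks(data, mask_bits=12, min_size=1024, max_size=16384):
    """Content-Defined Chunking with a gear-style rolling hash.

    A boundary is declared when the low mask_bits of the hash are all
    zero, giving an average chunk size of roughly 2**mask_bits bytes.
    """
    mask = (1 << mask_bits) - 1
    chunks, start, h = [], 0, 0
    for i, b in enumerate(data):
        h = ((h << 1) + _GEAR[b]) & 0xFFFFFFFF
        length = i - start + 1
        if (length >= min_size and (h & mask) == 0) or length >= max_size:
            chunks.append(data[start:i + 1])
            start, h = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])
    return chunks

def dedup_ratio(chunks):
    """Fraction of bytes saved by storing each distinct chunk once."""
    total = sum(len(c) for c in chunks)
    unique = sum(len(c) for c in
                 {hashlib.sha256(c).digest(): c for c in chunks}.values())
    return 1 - unique / total
```

For example, prepending a few bytes to a file and chunking both versions leaves `dedup_ratio` near zero under FSC (every boundary shifts) but close to 0.5 under CDC (almost all chunks are shared), which is the effect driving the storage savings discussed above.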
Submission Number: 1546