Evolution of Blockchain Data Indexing Technology: From Nodes to AI-Empowered Full-Chain Data Services
1 Introduction
Since the first wave of dApps emerged in 2017, blockchain applications have flourished across finance, gaming, social media, and many other fields. But when we discuss decentralized applications, have we ever stopped to ask where these dApps get their data?
In 2024, AI and Web3 have become hot topics. In the field of artificial intelligence, data is like the source of life, critical to the growth and evolution of AI systems. Just as plants need sunlight and water to thrive, AI systems rely on massive amounts of data to continuously "learn" and "think". Without data, even the most sophisticated AI algorithms cannot deliver their intended intelligence and effectiveness.
This article analyzes how blockchain data indexing has evolved alongside the industry, viewed from the perspective of data accessibility. We also compare traditional data indexing protocols with emerging blockchain data service protocols, focusing on how newer protocols that incorporate AI differ in their data services and product architecture.
2 The Complexity and Simplicity of Data Indexing: From Blockchain Nodes to Full-Chain Databases
2.1 Data Source: Blockchain Node
Blockchain is often described as a decentralized ledger. Blockchain nodes are the foundation of the entire network, responsible for recording, storing, and disseminating all on-chain transaction data. Each node maintains a complete copy of the blockchain data, ensuring the decentralized nature of the network. However, for the average user, building and maintaining a node is no easy task: it requires specialized skills and carries high hardware and bandwidth costs. Ordinary nodes also offer limited query capabilities, so they struggle to meet developers' needs. As a result, although in theory anyone can run a node, in practice users tend to rely on third-party services.
To solve this problem, RPC node providers have emerged. These providers bear the cost and management of running nodes and offer data access through RPC endpoints, so users can reach blockchain data without operating their own nodes. Public RPC endpoints are free but rate-limited, which can degrade the user experience of dApps. Private RPC endpoints perform better, yet they remain inefficient for complex queries and are hard to scale or make compatible across networks. Nevertheless, the standardized API interfaces offered by node providers lower the barrier to accessing on-chain data and lay the groundwork for subsequent data parsing and application.
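As a minimal sketch of what such access looks like in practice, the snippet below fetches the latest block through a JSON-RPC endpoint. The endpoint URL is a placeholder; a real dApp would use its own RPC provider.

```typescript
// Minimal sketch: fetching the latest block over JSON-RPC.
// The endpoint URL is a hypothetical placeholder; substitute your own RPC provider.
const RPC_URL = "https://eth-mainnet.example-rpc.com";

async function getLatestBlock(): Promise<void> {
  const response = await fetch(RPC_URL, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      jsonrpc: "2.0",
      id: 1,
      method: "eth_getBlockByNumber",
      params: ["latest", false], // false = return transaction hashes only
    }),
  });
  const { result } = await response.json();
  // Block numbers and timestamps come back as hex-encoded strings in the raw response.
  console.log("block number:", parseInt(result.number, 16));
  console.log("timestamp   :", new Date(parseInt(result.timestamp, 16) * 1000));
}

getLatestBlock().catch(console.error);
```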
2.2 Data Parsing: From Raw Data to Usable Data
The data obtained from blockchain nodes is usually raw data that has been encoded and serialized. While this format preserves the integrity and security of the blockchain, it also makes the data harder to parse. For ordinary users or developers, working directly with this raw data demands considerable technical knowledge and computing resources.
The data parsing process is therefore especially important. By converting complex raw data into a more understandable and usable format, parsing lets users comprehend and work with the data far more intuitively. The quality of data parsing directly determines how efficiently and effectively blockchain data can be applied, making it a critical step in the entire data indexing pipeline.
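To make the parsing step concrete, here is a small illustrative sketch that decodes a raw ERC-20 Transfer log into readable fields by hand. Production indexers would instead rely on the contract ABI and a decoding library; the sample values below are placeholders.

```typescript
// Minimal sketch: hand-decoding a raw ERC-20 Transfer log for illustration only.
// Real pipelines would use the contract ABI with a decoding library instead.

interface RawLog {
  topics: string[]; // topics[0] = event signature hash, [1] = from, [2] = to
  data: string;     // ABI-encoded non-indexed fields (here: the uint256 value)
}

function decodeTransfer(log: RawLog) {
  // Indexed address parameters are left-padded to 32 bytes;
  // the address is the last 20 bytes (40 hex characters).
  const from = "0x" + log.topics[1].slice(-40);
  const to = "0x" + log.topics[2].slice(-40);
  // The transfer amount is a single 32-byte big-endian integer.
  const value = BigInt(log.data);
  return { from, to, value };
}

// Example raw log (addresses and amount are illustrative placeholders).
const sampleLog: RawLog = {
  topics: [
    "0xddf252ad1be2c89b69c2b068fc378daa952ba7f163c4a11628f55a4df523b3ef", // Transfer(address,address,uint256)
    "0x000000000000000000000000a0b86991c6218b36c1d19d4a2e9eb0ce3606eb48",
    "0x000000000000000000000000dac17f958d2ee523a2206206994597c13d831ec7",
  ],
  data: "0x0000000000000000000000000000000000000000000000000de0b6b3a7640000", // 1e18
};

console.log(decodeTransfer(sampleLog)); // { from: "0xa0b8…", to: "0xdac1…", value: 1000000000000000000n }
```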
2.3 The Evolution of Data Indexers
As the volume of blockchain data grows, so does the demand for data indexers. Indexers play a crucial role in organizing on-chain data and loading it into databases for querying. By indexing blockchain data and exposing it through query interfaces such as GraphQL APIs, indexers make the data readily available. With a unified query interface, developers can quickly and accurately retrieve the information they need using a standardized query language, greatly simplifying the process.
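As an illustration, the sketch below queries a hypothetical subgraph-style GraphQL endpoint for the largest transfers. The URL and the `transfers` entity are assumed names; actual fields depend on each subgraph's schema.

```typescript
// Minimal sketch: querying an indexer's GraphQL endpoint.
// The endpoint URL and the `transfers` entity are hypothetical placeholders;
// real field names depend on the subgraph's schema.
const GRAPHQL_URL = "https://api.example-indexer.com/subgraphs/name/example/erc20";

const query = `
  {
    transfers(first: 5, orderBy: value, orderDirection: desc) {
      from
      to
      value
    }
  }
`;

async function queryIndexer(): Promise<void> {
  const res = await fetch(GRAPHQL_URL, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ query }),
  });
  const { data } = await res.json();
  // The five largest transfers, already filtered and sorted by the indexer.
  console.log(data.transfers);
}

queryIndexer().catch(console.error);
```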
Different types of indexers optimize data retrieval in different ways.
Currently, an Ethereum archive node occupies about 13.5 TB of storage in the Geth client and about 3 TB in the Erigon client. As the blockchain grows, archive node storage will keep increasing. Faced with such volumes of data, mainstream indexing protocols not only support multi-chain indexing but also offer customizable data parsing frameworks for different application needs.
The emergence of indexers has greatly improved the efficiency of indexing and querying data. Compared with traditional RPC endpoints, indexers can efficiently index large amounts of data and support high-speed queries. Users can run complex queries, easily filter data, and analyze it after extraction. Some indexers also aggregate data sources from multiple blockchains, sparing multi-chain dApps from having to integrate several separate APIs. And by operating in a distributed manner, indexers offer stronger security and performance, reducing the risk of outages that centralized RPC providers can introduce.
In contrast to raw RPC access, indexers let users retrieve exactly the information they need through a predefined query language, without having to handle the complex underlying data. This mechanism significantly improves the efficiency and reliability of data retrieval and is an important innovation in blockchain data access.
2.4 Full-Chain Databases: Aligning with the Stream-First Paradigm
Querying data through index nodes usually means the API becomes the sole entry point for on-chain data. But once a project enters its scaling phase, it often needs more flexible data sources than a standardized API can provide. As application requirements grow more complex, mainstream data indexers and their standardized index formats gradually fall short of increasingly diverse query needs such as search, cross-chain access, and off-chain data mapping.
In modern data pipeline architectures, the "stream-first" approach has emerged as a remedy for the limitations of traditional batch processing, enabling real-time data ingestion, processing, and analysis. This paradigm shift lets organizations react to incoming data immediately, drawing insights and making decisions in near real time. Blockchain data service providers are moving in the same direction: traditional indexing providers have successively launched products that deliver real-time blockchain data as streams.
These services aim to address the demand for real-time analysis of blockchain transactions and provide more comprehensive query capabilities. Just as the "stream-first" architecture innovates the data processing methods in traditional data pipelines by reducing latency and enhancing responsiveness, these blockchain data stream providers also hope to support the development of more applications and assist on-chain data analysis through more advanced and mature data sources.
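As a rough sketch of the stream-first idea applied to on-chain data, the snippet below subscribes to new Ethereum block headers over a WebSocket JSON-RPC endpoint instead of polling. The endpoint URL is a placeholder, and the `ws` package is assumed for a Node.js environment.

```typescript
// Minimal sketch: consuming blockchain data as a stream instead of polling.
// Subscribes to new block headers over a WebSocket JSON-RPC endpoint.
// The endpoint URL is a hypothetical placeholder.
import WebSocket, { RawData } from "ws";

const WS_RPC_URL = "wss://eth-mainnet.example-rpc.com/ws";
const socket = new WebSocket(WS_RPC_URL);

socket.on("open", () => {
  // eth_subscribe with "newHeads" pushes every new block header to this socket.
  socket.send(
    JSON.stringify({ jsonrpc: "2.0", id: 1, method: "eth_subscribe", params: ["newHeads"] })
  );
});

socket.on("message", (raw: RawData) => {
  const msg = JSON.parse(raw.toString());
  // Subscription pushes arrive as eth_subscription notifications.
  if (msg.method === "eth_subscription") {
    const header = msg.params.result;
    // In a stream-first pipeline, this event would be transformed and loaded
    // into whatever store or analytics system the application needs.
    console.log("new block:", parseInt(header.number, 16));
  }
});
```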
Redefining the challenges of on-chain data through the lens of modern data pipelines allows us to view the management, storage, and provision of on-chain data from a new perspective. When we start to see indexers like Subgraph and Ethereum ETL as data flows within the data pipeline rather than final outputs, we can envision a possible world where high-performance datasets can be tailored to any business use case.
3 In-Depth Comparison of The Graph, Chainbase, and Space and Time
3.1 The Graph
The Graph network provides multi-chain data indexing and query services through a decentralized network of nodes, making it easier for developers to index blockchain data and build decentralized applications. Its main product models are a data query execution market and a data indexing cache market, both of which ultimately serve users' query needs.
Subgraphs are the fundamental data structure in The Graph network; they define how to extract data from the blockchain and transform it into a queryable format. Anyone can create a subgraph, and multiple applications can reuse it, which improves data reusability and utilization efficiency.
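For a sense of what a subgraph defines, here is a rough sketch of an event-handler mapping written in AssemblyScript (the TypeScript subset used by The Graph). The `Transfer` event class and `TransferRecord` entity are hypothetical names that would normally be generated from the subgraph's manifest and schema.

```typescript
// Rough sketch of a subgraph mapping handler in AssemblyScript.
// `Transfer` and `TransferRecord` are hypothetical generated types; in a real
// subgraph they come from `graph codegen` based on the manifest and schema.
import { Transfer } from "../generated/Token/Token";
import { TransferRecord } from "../generated/schema";

export function handleTransfer(event: Transfer): void {
  // Use the transaction hash plus log index as a unique entity ID.
  let id = event.transaction.hash.toHex() + "-" + event.logIndex.toString();
  let record = new TransferRecord(id);
  record.from = event.params.from;
  record.to = event.params.to;
  record.value = event.params.value;
  record.blockNumber = event.block.number;
  // Persisting the entity makes it queryable through the subgraph's GraphQL API.
  record.save();
}
```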
The Graph network consists of four key roles: indexers, curators, delegators, and developers, who together provide data support for Web3 applications. The Graph has now shifted to a fully decentralized subgraph hosting service, with economic incentives circulating among the different participants to keep the system running.
The Graph's products are also evolving rapidly amid the AI wave. AutoAgora, the Allocation Optimizer, and AgentC, tools developed by Semiotic Labs, improve the ecosystem's performance in different areas and let The Graph make the system more intelligent and user-friendly with AI assistance.
3.2 Chainbase
Chainbase is a full-chain data network that integrates data from all blockchains onto a single platform, making it easier for developers to build and maintain applications.
Chainbase's AI model, Theia, is the key feature that distinguishes it from other data service protocols. Theia is based on NVIDIA's DORA model and combines on-chain and off-chain data with spatio-temporal activity to learn and analyze crypto patterns, respond through causal reasoning, and mine the latent value and patterns in on-chain data, providing users with more intelligent data services.
3.3 Space and Time
Space and Time (SxT) aims to build a verifiable compute layer that scales zero-knowledge proofs over a decentralized data warehouse, providing trustworthy data processing for smart contracts, large language models, and enterprises.
SxT introduces Proof of SQL technology, an innovative zero-knowledge proof technique that ensures SQL queries executed on decentralized data warehouses are tamper-proof and verifiable. Proof of SQL generates cryptographic proofs that verify the integrity and accuracy of query results, allowing any verifier to independently confirm that the data has not been tampered with during processing.
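The sketch below is a purely conceptual model of that flow, not the actual Space and Time API: a prover executes the SQL query over committed data and returns a proof, and the client accepts the result only if the proof verifies against the table commitment.

```typescript
// Conceptual sketch only — NOT the actual Space and Time API.
// Models the roles in a verifiable SQL query: a prover executes the query over
// committed data and emits a proof; the verifier checks the proof against the
// table commitment before the result is trusted.

interface TableCommitment {
  tableId: string;
  commitment: Uint8Array; // cryptographic commitment to the table contents
}

interface VerifiableQueryResult {
  sql: string;        // the query that was executed
  rows: unknown[];    // the claimed result set
  proof: Uint8Array;  // zero-knowledge proof of correct execution
}

type Prover = (sql: string, table: TableCommitment) => Promise<VerifiableQueryResult>;
type Verifier = (result: VerifiableQueryResult, table: TableCommitment) => boolean;

// The client accepts the prover's answer only if the proof verifies, so the
// data warehouse operator never has to be trusted blindly.
async function queryWithVerification(
  prove: Prover,
  verify: Verifier,
  table: TableCommitment,
  sql: string
): Promise<unknown[]> {
  const result = await prove(sql, table);
  if (!verify(result, table)) {
    throw new Error("Proof of SQL verification failed: result may have been tampered with");
  }
  return result.rows;
}
```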
SxT has been working closely with Microsoft's AI Innovation Lab to accelerate the development of generative AI tools that make it easier for users to work with blockchain data through natural language. In Space and Time Studio, users can enter a natural-language query, which the AI automatically converts into SQL, executes, and returns as the results the user needs.
4 Conclusion and Outlook
Blockchain data indexing technology has evolved step by step, from raw node data sources, through data parsing and indexers, to AI-empowered full-chain data services. The continuous evolution of these technologies has not only increased the efficiency and accuracy of data access but also given users an unprecedentedly intelligent experience.
Looking ahead, as new technologies such as AI and zero-knowledge proofs continue to mature, blockchain data services will become more intelligent and more secure. As essential infrastructure, they will keep playing a vital role, providing strong support for the industry's progress and innovation.