Evolution of Blockchain Data Indexing Technology: From Nodes to AI-Empowered Full-Chain Data Services
1 Introduction
Since the first wave of dApps emerged in 2017, blockchain applications have flourished across finance, gaming, social media, and many other fields. But when we discuss decentralized applications, have we ever stopped to ask where these dApps get their data?
In 2024, AI and Web3 have become hot topics. In the field of artificial intelligence, data is like the source of life, critical to the growth and evolution of AI systems. Just as plants need sunlight and water to thrive, AI systems rely on massive amounts of data to continuously "learn" and "think". Without data, even the most sophisticated AI algorithms cannot deliver their intended intelligence and effectiveness.
This article analyzes how blockchain data indexing has evolved alongside the industry, viewed from the perspective of data accessibility. We also compare traditional data indexing protocols with emerging blockchain data service protocols, focusing on how newer protocols that incorporate AI differ in their data services and product architecture.
2 The Complexity and Simplicity of Data Indexing: From Blockchain Nodes to Full-Chain Databases
2.1 Data Source: Blockchain Node
Blockchain is often described as a decentralized ledger. Blockchain nodes are the foundation of the entire network, responsible for recording, storing, and disseminating all on-chain transaction data. Each node maintains a complete copy of the blockchain data, ensuring the decentralized nature of the network. However, for the average user, building and maintaining a node is no easy task: it requires specialized skills and carries high hardware and bandwidth costs. Ordinary nodes also offer limited query capabilities, so they struggle to meet developers' needs. As a result, although in theory anyone can run a node, in practice users tend to rely on third-party services.
To solve this problem, RPC node providers have emerged. These providers bear the cost and management of running nodes and offer data access through RPC endpoints, so users can reach blockchain data without operating their own nodes. Public RPC endpoints are free but rate-limited, which can degrade the user experience of dApps. Private RPC endpoints perform better, yet they remain inefficient for complex queries and are hard to scale or make compatible across networks. Nevertheless, the standardized API interfaces offered by node providers lower the barrier to accessing on-chain data and lay the groundwork for subsequent data parsing and application.
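As a minimal sketch of what such access looks like in practice, the snippet below fetches the latest block through a JSON-RPC endpoint. The endpoint URL is a placeholder; a real dApp would use its own RPC provider.

```typescript
// Minimal sketch: fetching the latest block over JSON-RPC.
// The endpoint URL is a hypothetical placeholder; substitute your own RPC provider.
const RPC_URL = "https://eth-mainnet.example-rpc.com";

async function getLatestBlock(): Promise<void> {
  const response = await fetch(RPC_URL, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      jsonrpc: "2.0",
      id: 1,
      method: "eth_getBlockByNumber",
      params: ["latest", false], // false = return transaction hashes only
    }),
  });
  const { result } = await response.json();
  // Block numbers and timestamps come back as hex-encoded strings in the raw response.
  console.log("block number:", parseInt(result.number, 16));
  console.log("timestamp   :", new Date(parseInt(result.timestamp, 16) * 1000));
}

getLatestBlock().catch(console.error);
```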
2.2 Data Parsing: From Raw Data to Usable Data
The data obtained from blockchain nodes is usually raw data that has been encoded and serialized. While this format preserves the integrity and security of the blockchain, it also makes the data harder to parse. For ordinary users or developers, working directly with this raw data demands considerable technical knowledge and computing resources.
The data parsing process is therefore especially important. By converting complex raw data into a more understandable and usable format, parsing lets users comprehend and work with the data far more intuitively. The quality of data parsing directly determines how efficiently and effectively blockchain data can be applied, making it a critical step in the entire data indexing pipeline.
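To make the parsing step concrete, here is a small illustrative sketch that decodes a raw ERC-20 Transfer log into readable fields by hand. Production indexers would instead rely on the contract ABI and a decoding library; the sample values below are placeholders.

```typescript
// Minimal sketch: hand-decoding a raw ERC-20 Transfer log for illustration only.
// Real pipelines would use the contract ABI with a decoding library instead.

interface RawLog {
  topics: string[]; // topics[0] = event signature hash, [1] = from, [2] = to
  data: string;     // ABI-encoded non-indexed fields (here: the uint256 value)
}

function decodeTransfer(log: RawLog) {
  // Indexed address parameters are left-padded to 32 bytes;
  // the address is the last 20 bytes (40 hex characters).
  const from = "0x" + log.topics[1].slice(-40);
  const to = "0x" + log.topics[2].slice(-40);
  // The transfer amount is a single 32-byte big-endian integer.
  const value = BigInt(log.data);
  return { from, to, value };
}

// Example raw log (addresses and amount are illustrative placeholders).
const sampleLog: RawLog = {
  topics: [
    "0xddf252ad1be2c89b69c2b068fc378daa952ba7f163c4a11628f55a4df523b3ef", // Transfer(address,address,uint256)
    "0x000000000000000000000000a0b86991c6218b36c1d19d4a2e9eb0ce3606eb48",
    "0x000000000000000000000000dac17f958d2ee523a2206206994597c13d831ec7",
  ],
  data: "0x0000000000000000000000000000000000000000000000000de0b6b3a7640000", // 1e18
};

console.log(decodeTransfer(sampleLog)); // { from: "0xa0b8…", to: "0xdac1…", value: 1000000000000000000n }
```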
2.3 The Evolution of Data Indexers
As the volume of blockchain data grows, so does the demand for data indexers. Indexers play a crucial role in organizing on-chain data and loading it into databases for querying. By indexing blockchain data and exposing it through query interfaces such as GraphQL APIs, indexers make the data readily available. With a unified query interface, developers can quickly and accurately retrieve the information they need using a standardized query language, greatly simplifying the process.
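As an illustration, the sketch below queries a hypothetical subgraph-style GraphQL endpoint for the largest transfers. The URL and the `transfers` entity are assumed names; actual fields depend on each subgraph's schema.

```typescript
// Minimal sketch: querying an indexer's GraphQL endpoint.
// The endpoint URL and the `transfers` entity are hypothetical placeholders;
// real field names depend on the subgraph's schema.
const GRAPHQL_URL = "https://api.example-indexer.com/subgraphs/name/example/erc20";

const query = `
  {
    transfers(first: 5, orderBy: value, orderDirection: desc) {
      from
      to
      value
    }
  }
`;

async function queryIndexer(): Promise<void> {
  const res = await fetch(GRAPHQL_URL, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ query }),
  });
  const { data } = await res.json();
  // The five largest transfers, already filtered and sorted by the indexer.
  console.log(data.transfers);
}

queryIndexer().catch(console.error);
```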
Different types of indexers optimize data retrieval in different ways.
Currently, an Ethereum archive node occupies about 13.5 TB of storage in the Geth client and about 3 TB in the Erigon client. As the blockchain grows, archive node storage will keep increasing. Faced with such volumes of data, mainstream indexing protocols not only support multi-chain indexing but also offer customizable data parsing frameworks for different application needs.
The emergence of indexers has greatly improved the efficiency of indexing and querying data. Compared with traditional RPC endpoints, indexers can efficiently index large amounts of data and support high-speed queries. Users can run complex queries, easily filter data, and analyze it after extraction. Some indexers also aggregate data sources from multiple blockchains, sparing multi-chain dApps from having to integrate several separate APIs. And by operating in a distributed manner, indexers offer stronger security and performance, reducing the risk of outages that centralized RPC providers can introduce.
In contrast to raw RPC access, indexers let users retrieve exactly the information they need through a predefined query language, without having to handle the complex underlying data. This mechanism significantly improves the efficiency and reliability of data retrieval and is an important innovation in blockchain data access.
2.4 Full-Chain Databases: Aligning with the Stream-First Paradigm
Querying data through index nodes usually means the API becomes the sole entry point for on-chain data. But once a project enters its scaling phase, it often needs more flexible data sources than a standardized API can provide. As application requirements grow more complex, mainstream data indexers and their standardized index formats gradually fall short of increasingly diverse query needs such as search, cross-chain access, and off-chain data mapping.
In modern data pipeline architectures, the "stream-first" approach has emerged as a remedy for the limitations of traditional batch processing, enabling real-time data ingestion, processing, and analysis. This paradigm shift lets organizations react to incoming data immediately, drawing insights and making decisions in near real time. Blockchain data service providers are moving in the same direction: traditional indexing providers have successively launched products that deliver real-time blockchain data as streams.
These services aim to address the demand for real-time analysis of blockchain transactions and provide more comprehensive query capabilities. Just as the "stream-first" architecture innovates the data processing methods in traditional data pipelines by reducing latency and enhancing responsiveness, these blockchain data stream providers also hope to support the development of more applications and assist on-chain data analysis through more advanced and mature data sources.
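As a rough sketch of the stream-first idea applied to on-chain data, the snippet below subscribes to new Ethereum block headers over a WebSocket JSON-RPC endpoint instead of polling. The endpoint URL is a placeholder, and the `ws` package is assumed for a Node.js environment.

```typescript
// Minimal sketch: consuming blockchain data as a stream instead of polling.
// Subscribes to new block headers over a WebSocket JSON-RPC endpoint.
// The endpoint URL is a hypothetical placeholder.
import WebSocket, { RawData } from "ws";

const WS_RPC_URL = "wss://eth-mainnet.example-rpc.com/ws";
const socket = new WebSocket(WS_RPC_URL);

socket.on("open", () => {
  // eth_subscribe with "newHeads" pushes every new block header to this socket.
  socket.send(
    JSON.stringify({ jsonrpc: "2.0", id: 1, method: "eth_subscribe", params: ["newHeads"] })
  );
});

socket.on("message", (raw: RawData) => {
  const msg = JSON.parse(raw.toString());
  // Subscription pushes arrive as eth_subscription notifications.
  if (msg.method === "eth_subscription") {
    const header = msg.params.result;
    // In a stream-first pipeline, this event would be transformed and loaded
    // into whatever store or analytics system the application needs.
    console.log("new block:", parseInt(header.number, 16));
  }
});
```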
Redefining the challenges of on-chain data through the lens of modern data pipelines allows us to view the management, storage, and provision of on-chain data from a new perspective. When we start to see indexers like Subgraph and Ethereum ETL as data flows within the data pipeline rather than final outputs, we can envision a possible world where high-performance datasets can be tailored to any business use case.
3 In-Depth Comparison of The Graph, Chainbase, and Space and Time
3.1 The Graph
The Graph network provides multi-chain data indexing and query services through a decentralized network of nodes, making it easier for developers to index blockchain data and build decentralized applications. Its main product models are a data query execution market and a data indexing cache market, both of which ultimately serve users' query needs.
Subgraphs are the fundamental data structure in The Graph network; they define how to extract data from the blockchain and transform it into a queryable format. Anyone can create a subgraph, and multiple applications can reuse it, which improves data reusability and utilization efficiency.
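For a sense of what a subgraph defines, here is a rough sketch of an event-handler mapping written in AssemblyScript (the TypeScript subset used by The Graph). The `Transfer` event class and `TransferRecord` entity are hypothetical names that would normally be generated from the subgraph's manifest and schema.

```typescript
// Rough sketch of a subgraph mapping handler in AssemblyScript.
// `Transfer` and `TransferRecord` are hypothetical generated types; in a real
// subgraph they come from `graph codegen` based on the manifest and schema.
import { Transfer } from "../generated/Token/Token";
import { TransferRecord } from "../generated/schema";

export function handleTransfer(event: Transfer): void {
  // Use the transaction hash plus log index as a unique entity ID.
  let id = event.transaction.hash.toHex() + "-" + event.logIndex.toString();
  let record = new TransferRecord(id);
  record.from = event.params.from;
  record.to = event.params.to;
  record.value = event.params.value;
  record.blockNumber = event.block.number;
  // Persisting the entity makes it queryable through the subgraph's GraphQL API.
  record.save();
}
```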
The Graph network consists of four key roles: indexers, curators, delegators, and developers, who together provide data support for Web3 applications. The Graph has now shifted to a fully decentralized subgraph hosting service, with economic incentives circulating among the different participants to keep the system running.
The Graph's products are also evolving rapidly amid the AI wave. AutoAgora, the Allocation Optimizer, and AgentC, tools developed by Semiotic Labs, improve the ecosystem's performance in different areas and let The Graph make the system more intelligent and user-friendly with AI assistance.
3.2 Chainbase
Chainbase is a full-chain data network that integrates data from all blockchains onto a single platform, making it easier for developers to build and maintain applications.
Chainbase's AI model, Theia, is the key feature that distinguishes it from other data service protocols. Theia is based on NVIDIA's DORA model and combines on-chain and off-chain data with spatio-temporal activity to learn and analyze crypto patterns, respond through causal reasoning, and mine the latent value and patterns in on-chain data, providing users with more intelligent data services.
3.3 Space and Time
Space and Time (SxT) aims to build a verifiable compute layer that scales zero-knowledge proofs over a decentralized data warehouse, providing trustworthy data processing for smart contracts, large language models, and enterprises.
SxT introduces Proof of SQL technology, an innovative zero-knowledge proof technique that ensures SQL queries executed on decentralized data warehouses are tamper-proof and verifiable. Proof of SQL generates cryptographic proofs that verify the integrity and accuracy of query results, allowing any verifier to independently confirm that the data has not been tampered with during processing.
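The sketch below is a purely conceptual model of that flow, not the actual Space and Time API: a prover executes the SQL query over committed data and returns a proof, and the client accepts the result only if the proof verifies against the table commitment.

```typescript
// Conceptual sketch only — NOT the actual Space and Time API.
// Models the roles in a verifiable SQL query: a prover executes the query over
// committed data and emits a proof; the verifier checks the proof against the
// table commitment before the result is trusted.

interface TableCommitment {
  tableId: string;
  commitment: Uint8Array; // cryptographic commitment to the table contents
}

interface VerifiableQueryResult {
  sql: string;        // the query that was executed
  rows: unknown[];    // the claimed result set
  proof: Uint8Array;  // zero-knowledge proof of correct execution
}

type Prover = (sql: string, table: TableCommitment) => Promise<VerifiableQueryResult>;
type Verifier = (result: VerifiableQueryResult, table: TableCommitment) => boolean;

// The client accepts the prover's answer only if the proof verifies, so the
// data warehouse operator never has to be trusted blindly.
async function queryWithVerification(
  prove: Prover,
  verify: Verifier,
  table: TableCommitment,
  sql: string
): Promise<unknown[]> {
  const result = await prove(sql, table);
  if (!verify(result, table)) {
    throw new Error("Proof of SQL verification failed: result may have been tampered with");
  }
  return result.rows;
}
```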
SxT has been working closely with Microsoft's AI Innovation Lab to accelerate the development of generative AI tools that make it easier for users to work with blockchain data through natural language. In Space and Time Studio, users can enter a natural-language query, which the AI automatically converts into SQL, executes, and returns as the results the user needs.
4 Conclusion and Outlook
Blockchain data indexing technology has evolved step by step, from raw node data sources, through data parsing and indexers, to AI-empowered full-chain data services. The continuous evolution of these technologies has not only increased the efficiency and accuracy of data access but also given users an unprecedentedly intelligent experience.
Looking ahead, as new technologies such as AI and zero-knowledge proofs continue to mature, blockchain data services will become more intelligent and more secure. As essential infrastructure, they will keep playing a vital role, providing strong support for the industry's progress and innovation.