Viewpoint: The future of AI will be dominated by models, and their importance cannot be overstated

Source: Geek Park

Author: Xing Fu

Original title: "Behind the 'Jiang Ziya' Model: The Evolution of a Professional AI Team"

Since scientists developed the first checkers-playing AI program in 1956, AI has been evolving for nearly 70 years. The field has gone through several ebbs and flows, but one main thread runs through them all: "modeling", the ever-growing share of the "model" within AI systems. This trend peaked with the emergence of the large language model ChatGPT.

"We firmly believe that the future of AI is the world of models, and we cannot overemphasize models."

Zhang Jiaxing, chair scientist of cognitive computing and natural language at IDEA Research Institute (the Guangdong-Hong Kong-Macao Greater Bay Area Digital Economy Research Institute), said this on July 22 at the AGI Playground conference hosted by Geek Park.

In 2021, Zhang Jiaxing led IDEA Research Institute's CCNL Fengshenbang team to build "Fengshenbang", the largest Chinese open-source pre-trained model system, making them early movers in the model space. They have witnessed first-hand the "paradigm shift" brought about by large models.

Zhang Jiaxing believes this shift comes down to two keywords: "disappearance" and "formation". "Disappearance" means that with the arrival of general-purpose large models like ChatGPT, the specialized models that used to handle information extraction, question answering, and text generation are disappearing. "Formation" means that the engineering capabilities behind large models are forming new ecological niches along the chain from a model's creation, through fine-tuning, to deployment.

IDEA Research Institute's CCNL team is also staking out its place in this new ecosystem.

The Fengshenbang team is not only developing a full-capability model: it has built "Jiang Ziya" (Ziya), a general-purpose large model based on LLaMA, which has already been applied in scenarios such as digital humans and copywriting. About a month ago, the team also trained a series of expert models, including multimodal, code, writing, and dialogue models. The writing model, for instance, can help users produce articles, new-media copy, livestream scripts, promotional posters, and even online novels.

Zhang Jiaxing believes that within this vast ecosystem, entrepreneurs can decide which niche to occupy based on their own strengths. "Anyone who is interested in getting into the field of large models can find their place in it," he said.

The following is the full text of Zhang Jiaxing’s speech at the AGI Playground Conference, edited by Geek Park:

Photo: Zhang Jiaxing delivering his speech at the AGI Playground conference hosted by Geek Park

01. Large Model Era: New Paradigm and New Ecology

This year, when we talk about large models and AGI, we tend to take it for granted that large models are central to AI. But look back: as recently as 1997, the landmark event was "Deep Blue" defeating Kasparov, and that AI system contained no deep learning model at all.

AI's development began in 1956, nearly 70 years ago. Although the field has experienced several ebbs and flows, its development has followed one consistent line: the "modeling" of AI, in which the proportion of the model within AI systems keeps growing. Today we firmly believe that the future of AI will be dominated by models, and their importance cannot be overstated.

Picture: Zhang Jiaxing talks about the "modeling" process of AI

We all say that large models represent a shift in the "technical paradigm", one that can be summed up in two keywords: "disappearance" and "formation".

"Disappear" refers to the disappearance of the type. Half a year ago, the entire AI field was flooded with different types of AI structures and tasks. For example, in terms of structure, there are various model structures such as BERT and T5. For example, in terms of tasks, there are various tasks such as classification, information extraction, writing summaries, and questions and answers. However, with the advent of the era of general-purpose large models, this diversity is disappearing.

Now the dominant model structure is GPT, and the only task is text in, text out. Earlier AI concepts such as sentence parsing and keyword extraction have gradually faded from view. Moreover, how a model is used is no longer decided by the provider of the technology, but by the customer who uses it.
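
To make this collapse concrete, the sketch below shows how tasks that once required separate specialized models can all be phrased as plain text prompts to one general-purpose model; the `ask` function and the prompts are purely illustrative placeholders, not any particular API.

```python
# A minimal, purely illustrative sketch: tasks that once needed separate
# specialized models (classification, extraction, QA, summarization) all
# become plain text prompts to one general-purpose model. `ask` is a
# placeholder for whatever chat/completion interface the model exposes.

def ask(model, prompt: str) -> str:
    """Placeholder for a call to an instruction-following large model."""
    raise NotImplementedError

review = "The battery lasts two days, but the screen scratches easily."

prompts = {
    "classification": f"Classify the sentiment of this review as positive, negative, or mixed:\n{review}",
    "extraction": f"List the product aspects mentioned in this review:\n{review}",
    "question_answering": f"Based on this review, how long does the battery last?\n{review}",
    "summarization": f"Summarize this review in one sentence:\n{review}",
}
# Every former "task type" is now just a different prompt to the same model.
```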

And "formation" refers to the formation of the production chain. The construction of a model requires a huge investment of resources, and almost no one can complete this task from start to finish alone. It requires a huge team and a lot of computing power behind it to polish it. From the initial conception of the model, to the fine-tuning of various stages in the middle, and to the final landing practice, this constitutes a complete production chain.

From the "disappearance" and "formation", we can see the "paradigm shift" of the big model. Sometimes, technological progress is relentless, independent of individual will, and new technological paradigms will replace old technological paradigms.

So what value do large models bring as a new technological paradigm? In my opinion, four entirely new kinds of value:

1. A New Kind of Understanding

In terms of natural language understanding, the current large model far exceeds all previous models. It seems to really understand the meaning of our every word. While the answers may not be entirely accurate, a whole new level of understanding emerges.

2. A New Kind of Tool

It is not only an efficiency tool that frees people from heavy labor; it is also a creative tool that can produce things humans could not create on their own. Last year's diffusion models, for example, demonstrated text-to-image capabilities.

3. A New Interface

In the past, we had to write programs to access data and APIs. Now it seems we no longer need to write cumbersome code: we simply describe what we want in natural language, and the large model can generate the code automatically.
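
As a rough sketch of this kind of interface, the example below asks a code-capable model to generate a function from a natural-language description, using the Hugging Face `transformers` text-generation pipeline; the model name is a placeholder, not a specific recommendation.

```python
# A hedged sketch: asking a large model to turn a natural-language request
# into code, instead of hand-writing the data/API access code.
from transformers import pipeline

# Placeholder name; substitute any locally available, code-capable model.
generator = pipeline("text-generation", model="YOUR-CODE-MODEL")

request = (
    "Write a Python function that downloads a JSON document from a URL "
    "and returns the value of its 'price' field."
)
result = generator(request, max_new_tokens=200)
print(result[0]["generated_text"])
```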

4. A New Engine

The large model is not just a single point capability; it can serve as an engine driving information retrieval, dialogue generation, and even story creation.

Large models also bring a new ecology, namely the question of how to integrate them with industry and put them into practice.

We believe large models are not just plain APIs or immutable artifacts. We emphasize that after the upstream company produces a model, downstream customers need to carry out further training and run the last mile, so that the model can be embedded in each customer's own scenario. As the model performs better, more data is collected, which in turn strengthens the model. This is what will really drive the development of the whole industry.
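
One common way downstream teams run this "last mile" today is parameter-efficient fine-tuning such as LoRA. The sketch below uses the `peft` and `transformers` libraries with placeholder model and module names; it illustrates the general pattern rather than how any particular partner actually does it.

```python
# A minimal LoRA sketch of customer-side "last mile" fine-tuning.
# Model name and target modules are placeholders/assumptions.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("YOUR-BASE-MODEL")
tokenizer = AutoTokenizer.from_pretrained("YOUR-BASE-MODEL")

lora = LoraConfig(
    r=8,                                  # low-rank adapter dimension
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],  # typical for LLaMA-style attention
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora)        # only adapter weights are trainable
model.print_trainable_parameters()

# From here, train on the customer's private, in-scenario data with any
# standard trainer; the frozen base model itself stays untouched.
```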

In this new ecology, the most upstream position belongs to the companies that build base models. Below them are many teams focusing on models for specific capabilities or domains. Further downstream, these teams cooperate with solution providers, cloud vendors, and hardware manufacturers to create a variety of products, which finally serve the enterprises and governments that deploy them.

Picture: The new ecology of the large model described by Zhang Jiaxing

The chain from base model to real deployment involves many links, and it has given rise to many new ecological niches. I think everyone can weigh their own strengths and decide which position in this ecosystem they want to occupy. In fact, anyone willing to devote themselves to the field of large models can find their place in it.

02. Behind the "Jiang Ziya" Large Model

Our team has existed for two years, and our own experience shows clearly how this paradigm shift has affected us.

Until the end of last year, we were developing a large number of open-source models covering different model structures and task types. In just one year we open-sourced 98 models, a record in the Chinese-language field.

At the end of last year, however, text-to-image models suddenly became the hot product. So we pivoted and built the first open-source Chinese Stable Diffusion model, which we call the "Taiyi" model. We want to keep pace with the paradigm shifts in large-model technology.

Now, in the era of general-purpose large models, our team is working overtime to train the best open-source base model for Chinese, building on LLaMA 2. So far we have trained it on 20B tokens. Compared with the previously trained "Ziya-LLaMA-13B" model, training speed has increased by 38%, and we have completely solved the instability of "training flight" (abnormal training behavior) during the training process.

Figure: After training on 20B tokens, the LLaMA 2-based model avoids the unstable "training flight" problem during training

Once trained, this model will be completely open source, with no restrictions on commercial use. We also promise to keep training it, in the hope of providing the best open-source, commercially usable base model for the entire large-model community.

Under the current technological paradigm, the launch of ChatGPT this year excited many people, who said that general-purpose large models would disrupt every industry. As time passed, though, we calmed down and realized that large models mostly refine and optimize existing scenarios. We therefore see many remaining possibilities and opportunities for applying large models to vertical industries, domains, and capabilities.

So about a month ago, our team produced a series of expert models: multimodal models, code models, writing models, dialogue models, and so on. Many of them have already been released and are at the best level in their field.

We have just open-sourced our Chinese writing model, called "Ziya-Writing". We hope it can become an out-of-the-box assistant that helps enterprises and individuals work more efficiently. For example, government staff can ask Ziya-Writing to draft a disaster report or a leader's speech for an opening ceremony, because it matches the style of policy documents very well.

It can also free up creators, operators, and marketers in the Chinese-language community, helping them write articles of all kinds, marketing copy, and advertorials, and even produce excellent short stories or an ancient-fantasy web novel. Its handling of chapter structure, logic, and storyline is very good.

We also developed a retrieval toolkit that uses a model of only 100 million parameters. In both the legal and financial domains it outperforms some current solutions, even the best open-source vector (embedding) models. The toolkit can also serve as a small assistant in the finance industry, supporting researchers and analysts.
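
For context, a dense-retrieval setup built around a small embedding model typically looks like the sketch below. It uses the `sentence-transformers` library with a placeholder model name and sample documents; it is only illustrative and not the team's actual toolkit.

```python
# A hedged sketch of dense retrieval with a small embedding model.
# The model name and documents are placeholders, not the team's toolkit.
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("YOUR-SMALL-EMBEDDING-MODEL")

documents = [
    "Article 12: The borrower shall repay principal and interest on schedule.",
    "Q2 revenue grew 14% year over year, driven by wealth-management fees.",
    "The guarantor bears joint liability if the debtor defaults.",
]
doc_vecs = encoder.encode(documents, convert_to_tensor=True)

query = "Who is liable when the borrower fails to repay?"
query_vec = encoder.encode(query, convert_to_tensor=True)

scores = util.cos_sim(query_vec, doc_vecs)[0]  # cosine similarity to each document
best = int(scores.argmax())
print(documents[best])
```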

Why can we produce so many high-quality models?

Behind this lie our many accumulations: a three-stage training system (pre-training PT, supervised fine-tuning SFT, and reinforcement learning from human feedback RLHF), a large store of high-quality data, and several self-developed algorithms, all distilled into our training system.
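
Schematically, such a three-stage system can be thought of as the pipeline sketched below. The function bodies are placeholders; this is an outline of the standard PT → SFT → RLHF recipe, not IDEA's internal code.

```python
# A schematic outline of a three-stage training pipeline (PT -> SFT -> RLHF).
# Function bodies are placeholders; each real stage has its own data,
# objectives, and infrastructure.

def pretrain(model, corpus):
    """Stage 1 (PT): next-token prediction over a large unlabeled corpus."""
    ...

def supervised_finetune(model, instruction_pairs):
    """Stage 2 (SFT): imitate human-written (instruction, response) pairs."""
    ...

def rlhf(model, reward_model, prompts):
    """Stage 3 (RLHF): optimize against a reward model trained on human
    preference comparisons, typically with a policy-gradient method like PPO."""
    ...

def build_chat_model(model, corpus, sft_data, reward_model, prompts):
    pretrain(model, corpus)
    supervised_finetune(model, sft_data)
    rlhf(model, reward_model, prompts)
    return model
```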

Each of our models is available in both an open-source and a commercial version, and we authorize partners to carry out continued training and fine-tuning, so that they can train privately for their own scenarios.

In a small way, the changes within our one team reflect the broader shift in the technical paradigm of the large-model field.

03. On-site Questions

Figure: The IDEA team accepts on-site questions

**Q: How do you view the future hardware architecture for inference? Will hardware remain "integrated for training and inference" for a long time, or will there be opportunities for dedicated inference chips?**

Zhang Jiaxing: We used to have two types of chips, one for training and one for inference, but today's inference chips clearly cannot cope with current large models.

So at present, given hardware constraints, "training-inference integration" is the more common approach. Its great advantage is that computing power can be reused: inference is not always running at full load, so the trough periods can be used for training, which also makes economic sense.

In the future, inference chips will still have their place. Some scenarios, such as mobile devices, edge computing, or in-vehicle systems, still require specially customized inference chips. Even in the cloud and in servers, if inference chips can be further optimized for low power consumption or other properties, they remain worthwhile. I think there will still be dedicated chips for specialized work in the future.

**Q: For vertical applications, from what angles should we collect data? How do we build a high-quality dataset?**

Zhang Jiaxing: Our data was in fact collected gradually. At the very beginning there were only 20 or 30 datasets. As training went on we found, for example, which capability was still missing, collected that kind of data in a targeted way, and at the same time accumulated our own experience, for instance in data processing.

Finally, if the data we need simply does not exist, we construct it ourselves. For multi-party conversations, for example, we built several different types of datasets.
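
As a simple illustration of constructing such data, the sketch below assembles a multi-party conversation into a role-tagged training sample of the kind commonly used for supervised fine-tuning; the field names and file format are assumptions, not the team's actual schema.

```python
# A hedged sketch: constructing a multi-party dialogue sample for fine-tuning.
# The roles, keys, and JSONL format are assumptions for illustration only.
import json

conversation = [
    {"speaker": "user_a", "text": "Has the quarterly report been finalized?"},
    {"speaker": "user_b", "text": "Not yet, legal still needs to review section 3."},
    {"speaker": "assistant", "text": "I can draft a revision of section 3 for legal to review."},
]

sample = {
    "messages": conversation,
    # The final assistant turn is what the model is trained to produce.
    "target": conversation[-1]["text"],
}

with open("dialogue_sft.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(sample, ensure_ascii=False) + "\n")
```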

**Q: Why are there so many specialized-capability models? Why not strengthen all of these capabilities in the same model at once?**

Zhang Jiaxing: We have several considerations. The first is that we fix the model size in advance; once the size is chosen, deciding which capabilities the model should have becomes a problem under fixed constraints. Working at a fixed size also gives a very large cost advantage.

We would like to put all capabilities into one large model, but the capabilities conflict with each other in both "space" and "time". In terms of space, some capabilities are mutually exclusive: for example, logical-reasoning tasks such as math problems conflicted with writing tasks. There is also a conflict in time: at a given moment one capability may be at its strongest while the others are not.

Since downstream scenarios usually require only a single capability, we simply select specific datasets to train specific tasks, and those become our dedicated models.

**Q: You mentioned that the unstable "training flight" problem was solved. How was it solved?**

Zhang Jiaxing: There is one key point here: we adjusted the training itself, making changes at the source-code level of our distributed-training stack, and training stability improved considerably. When we trained Ziya-LLaMA-13B, the training curve was stable. We are a large-model team that pays close attention to training technology, and that is what guarantees we can keep producing good models.
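
The answer does not detail those source-level changes. For orientation only, one common generic mitigation for loss spikes ("training flight") is sketched below: clip gradients and skip the update when the loss jumps far above its recent average. This is illustrative and not necessarily what the Fengshenbang team did.

```python
# A generic, illustrative loss-spike mitigation (not the team's actual fix):
# clip gradients, and skip the optimizer step when the loss jumps well above
# its recent moving average.
import torch

def train_step(model, batch, optimizer, loss_history, spike_factor=3.0, max_norm=1.0):
    loss = model(**batch).loss                    # assumes a HF-style model output
    recent = sum(loss_history) / len(loss_history) if loss_history else float("inf")

    if loss.item() > spike_factor * recent:
        optimizer.zero_grad()                     # likely a spike or bad batch: skip it
        return recent

    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
    optimizer.step()
    optimizer.zero_grad()

    loss_history.append(loss.item())
    if len(loss_history) > 100:                   # keep a short moving window
        loss_history.pop(0)
    return loss.item()
```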

**Q: On the question of public versus privatized large models: must a model be privatized? For example, if I want to build a to-C application, can I skip private deployment?**

Zhang Jiaxing: First, we found that our partners have data-security compliance and privacy requirements, so their data cannot be used to train public models. Second, they have very deep scenario-specific and customization needs. Whether the product is to-B or to-C, they all want to use the model within their own scenario.

In these cases the public large model, or the general base-model foundation, cannot fully meet all of their needs, so private training and private deployment become a must.
