Baidu data crowdsourcing platform


Baidu and Tsinghua team up on the world's first billion-pixel dataset!

2021-04-15 Baidu data crowdsourcing

Recently, Baidu Intelligent Cloud Data Crowdsourcing began a project with Tsinghua University to advance the construction of PANDA, the world's first billion-pixel video dataset, in support of future research and applications in public security, smart cities, virtual reality and other fields.

The project has completed the annotation of more than 720 billion-pixel images and more than 1.06 million slices, including nearly 20,000 groups of object relationships, nearly 200,000 interactions and nearly 300,000 groups of moving-object trajectory points, as well as semantic segmentation and instance annotation of billions of 3D point cloud data points. This work has greatly enriched the existing PANDA dataset and provided data support for the GigaVision (billion-pixel machine vision) challenge subsequently held by Tsinghua University.

Tsinghua University GigaVision Challenge
PANDA dataset video presentation

In recent years, computer vision tasks such as pedestrian detection, trajectory tracking, action recognition, anomaly detection and attribute recognition have been widely used in fields such as autonomous driving, intelligent security and smart cities. Behind every application of an AI algorithm, large quantities of high-quality annotated data are indispensable. As the largest AI data service provider in China, Baidu Data Crowdsourcing integrates data acquisition, standardization, storage, management and training, and focuses on enabling the development and application of artificial intelligence.

It is understood that PANDA is the world's first billion-pixel video data platform. It breaks through the resolution limit of human vision, gives visual computing higher-quality, more realistic and more comprehensive source data, fills the international gap in "wide field of view, multi-object, high-resolution" data platforms, and provides an indispensable data foundation for research on next-generation intelligent processing technology.

(Comparison of mainstream image and video datasets)

Fang Lu, associate professor in the Department of Electronic Engineering at Tsinghua University and leader of the PANDA dataset project, said that the Tsinghua team had previously organized the GigaVision 2020 Challenge around the PANDA dataset at ECCV 2020, a top international computer vision conference, where it attracted wide attention. The team is now preparing the GigaVision challenge for the ACM MM 2021 conference and the related track of the Global AI Technology Innovation Contest.

Studying the complex behaviors and interaction patterns of large crowds in the real world helps AI systems better understand human behavior and intentions, and thus improves their intelligent decision-making. The PANDA data platform makes it possible to model and analyze large scenes, multiple objects and complex relationships. In the future, Baidu will continue to work with Tsinghua University and support the construction and development of the PANDA data platform with its technology.

As a domestic AI leader, Baidu is also the only AI platform company in China to have built advantages across intelligent interaction, intelligent infrastructure and industrial intelligence. Drawing on Baidu's years of experience in AI data, Baidu Data Crowdsourcing focuses on empowering others with data intelligence, is committed to providing high-quality data services, and works with partners such as governments, enterprises, colleges and universities to promote the high-quality development of the new generation of AI.


Baidu and Shanxi government cooperate again: create a data trading platform to release the value of data elements

2021-04-07 Baidu data crowdsourcing

"The accumulated trading volume has exceeded 50 million yuan in the first half year since the launch!" Recently, the first data trading platform in Shanxi Province handed over a beautiful "report card".

According to the WeChat official account of the People's Government of Shanxi Province, the Shanxi Data Trading Platform, jointly built by Baidu Intelligent Cloud Data Crowdsourcing and the Shanxi government, has brought in more than 1,100 data service providers during more than half a year of trial operation since its launch in July 2020. After data desensitization, 169 AI datasets have been listed and 147 API data interfaces connected, with a total data volume exceeding 130 million, covering data scenarios such as speech recognition, text recognition, face recognition, autonomous driving and natural language processing. Since launch, the platform's total transaction volume has exceeded 50 million yuan.

It is understood that the Shanxi Data Trading Platform is the first data trading platform in Shanxi Province. With AI data as its distinguishing feature, it aims to become the largest AI data trading center in China, to build a data fusion ecosystem, and to cultivate a market for the circulation of Shanxi's data elements. It provides a full-stack data service that integrates data collection, cleaning, labeling, trading and application.

Relying on Baidu's AI, big data, secure computing and other product technologies, as well as its ecosystem resources, the platform has formed four core capabilities: transaction services, functional innovation, resource construction and transaction compliance.

In terms of transaction services, the platform has built service capabilities covering the whole process of business consultation, solution customization, resource coordination, project management and control, and after-sales service. In terms of functional innovation, the platform has built-in features such as visual management of AI data and automatic cleaning of intelligent-driving annotation data. In terms of resource construction, the platform brings in AI data resources from multiple industries and scenarios and integrates data resources of many types, including government, enterprise and social data. In terms of transaction compliance, the platform has formulated and implemented strategies such as tiered data security management and transaction process security management, in line with current laws, regulations and industry norms, to keep data transactions secure and compliant.

In the future, the platform will also actively explore the integration of blockchain, secure multi-party computation (MPC), trusted execution environments (TEE) and other cutting-edge technologies, provide complete solutions for data registration, data security and data fusion, and create a new "use as trade" model.

It is reported that this is the second cooperation between the Shanxi government and Baidu.

In July 2017, the Shanxi Provincial Government and Baidu signed a Strategic Cooperation Framework Agreement. Under the agreement, the two sides would rely on Shanxi's existing policy resources and industrial foundation, make full use of Baidu's technological strengths in artificial intelligence, big data and cloud computing, and carry out all-round, in-depth strategic cooperation to support economic development, industrial upgrading, urban management, and scientific and technological innovation.

Focusing on the Digital Shanxi strategy and following the principle of "government guidance and market leadership", the Shanxi Comprehensive Transformation Reform Demonstration Zone (the "Shanxi Comprehensive Reform Zone") has cooperated with Baidu Intelligent Cloud Data Crowdsourcing many times. Taking the data labeling industry as the entry point, the two sides have focused on building a basic data service system that integrates data collection, cleaning, labeling, trading and application. In September 2018, the two sides cooperated for the first time to jointly build the Baidu (Shanxi) Artificial Intelligence Basic Data Industrial Base, which has since become the largest single data annotation base in China by headcount and output value.

Relying on the data service capability of the base, the Shanxi Data Trading Platform will further integrate data service industry resources, bring in enterprises from the data ecosystem, connect the data service industry chain, activate upstream and downstream players, and promote the formation and development of a new regional big data industry ecosystem.

In the future, Baidu will continue to deepen government-enterprise cooperation, extend the Baidu-Shanxi cooperation model to more regions of the country, work with local governments to accelerate the construction of data element markets, release the value and dividends of data, and boost the transformation of regional digital economies and the development of intelligent industries.


The 7th China International Big Data Conference: Baidu Smart Cloud helps cultivate regional industrial ecology

2021-04-01 Baidu data crowdsourcing

A new round of scientific and technological innovation and industrial transformation is sweeping the world. The digital economy has reshaped social productivity, restructured the supply of production factors, and is profoundly changing how people produce and live. On March 30, the 7th China International Big Data Conference was held in Beijing. With data-driven development at its core, the conference focused on digitalization, networking and intelligence, aiming to further promote the deep integration of big data with the real economy and to deepen exchange and cooperation across the big data industry.

(Chen Shangyi, Chairman of Baidu Technical Committee)

Chen Shangyi, chairman of the Baidu Technical Committee, said in his speech at the conference that data elements will become the cornerstone of the shift from old growth drivers to new ones and of the development of the digital economy, and will become an important strategic resource. For now, releasing the value of data elements still faces many difficulties. As an enterprise that has worked deeply in AI technology for many years, Baidu holds technology as its core belief and has made many innovative explorations in releasing the value of data.

Improving data collection and annotation capabilities to release the value of data elements

For enterprise users, Baidu uses its leading AI capabilities to provide data standardization solutions and labeling services for multiple scenarios.

Baidu has industry-leading data collection resources. Its collection subjects cover more than 40 countries and regions around the world and almost all age groups. It was the first in the industry to establish a complete privacy compliance process that conforms to the data regulations of countries around the world, and this process has been highly recognized by customers' security departments.

Baidu has also made considerable progress in data annotation services. Its platform has a crowdsourcing resource ecosystem of more than 20 million, and its intelligent algorithms improve annotation efficiency by up to 60%. At present, Baidu's intelligent task distribution for data annotation can support tasks at the million scale and the management of hundreds of thousands of users, and its annotation tools cover more than 70 different scenarios, giving customers a wide range of labeling services.

Helping local governments improve digital technology innovation and cultivate a digital industry ecosystem

Building on its services for enterprises, Baidu has further developed a digital economy solution centered on an AI data service industrial base and a trading platform, to help local governments cultivate a digital industry ecosystem.

At the end of 2018, Baidu and the Shanxi Comprehensive Reform Zone agreed to cooperate and jointly established the Baidu (Shanxi) Artificial Intelligence Basic Data Industrial Base. In July 2020, the two cooperated again to jointly build a data trading platform with AI as its distinguishing feature. Through this "base + platform" data service practice, Baidu AI data services have formed an innovative and replicable model of government-enterprise cooperation that is rooted in one region, reaches across the country, and helps governments develop regional digital ecosystems.

At this conference, the Baidu (Shanxi) Artificial Intelligence Basic Data Industrial Base also won the "Industry Influence" award. So far, the base has nearly 3,000 AI data annotators, an accumulated output value of more than 300 million yuan, and 35 resident enterprises. Baidu announced that it will train 50,000 AI data annotators at the base over the next five years and bring in more AI partners. By extending this cooperation model to more provinces and cities, Baidu will provide more AI jobs and support the development of regional data industries.

In 2020, the first central document on the market-oriented allocation of production factors, the Opinions of the CPC Central Committee and the State Council on Building a More Perfect System and Mechanism for Market-based Allocation of Factors, listed data as a new factor of production and raised it to the level of national strategy. The Outline of the 14th Five-Year Plan, adopted at this year's NPC and CPPCC sessions, shows that the digital economy, with data as its key element, will become an important strategic vehicle for innovation-driven national development. Faced with the huge demand for technology and services created by national policy and market change, Baidu, as an enterprise that has worked deeply in artificial intelligence for many years, will draw on its distinctive "cloud-intelligence integration" strengths to become a significant force in the country's creation of new forms and models of the digital economy.


Baidu went public in Hong Kong and continued to write new science and technology stories with AI

2021-03-23 Baidu data crowdsourcing

On March 23, at Baidu Science Park, several gong-strikers sounded a "code gong" featuring Baidu's autonomous driving chip, the Baidu Kunlun chip and the Baidu Honghu chip, announcing Baidu Group's listing in Hong Kong.

In addition to Li Yanhong, chairman and CEO of Baidu, and other company executives, the gong-strikers included a Baidu data annotator, a "5G Cloud Drive" safety officer and a Baidu developer.

Baidu is no stranger to the public. "Baidu it" was a defining phrase of the search engine era. After the arrival of the mobile Internet era, however, some people came to think of Baidu as "falling behind". Now, with the sound of a gong, AI is becoming Baidu's underlying logic and reshaping the company's value chain.

"It is not that we are smarter than others, but that we are more focused, and we are more willing to invest in the long-term and future. Because only by maintaining continuous investment in technological innovation, can we seize the huge market opportunities in Baidu's cloud services, intelligent transportation, intelligent driving and other artificial intelligence fields." Li Yanhong said at the scene.

The Story of Artificial Intelligence

Guo Mei, one of the gong-strikers, once worked at the Changzhi coal mine in her home province of Shanxi, where "she looked up and saw mountains, and looked down and saw coal". Now she represents a new profession: data annotator.

The job of a data annotator is to teach AI to understand data, so that AI can perceive, think and make decisions as people do. During the epidemic, the Baidu Shanxi Data Annotation Base supported the implementation of many "technology-based epidemic prevention" projects across the country. For example, it annotated images of faces wearing masks, so that people can have their temperature measured accurately or pass through face-recognition gates without removing their masks.

Guo Mei's "re-employment" is a prominent example of Baidu's deep commitment to AI, and a vivid illustration of AI enabling emerging industries and driving industrial transformation.

Dong Liang, deputy director of the Administrative Committee of the Shanxi Transformation and Comprehensive Reform Zone, believes that the Baidu Shanxi Data Annotation Base has laid a good foundation for the development of the artificial intelligence industry in Shanxi Province. To date, the base has more than 2,000 AI data annotators and 35 resident enterprises, and has achieved operating income of more than 100 million yuan.

The Baidu Shanxi Data Annotation Base has become the largest single data annotation base in China by headcount and output value. In the future, the Baidu-Shanxi cooperation model will be extended to more provinces and cities to support the development of local technology industries.

Since the start of the 21st century, a new round of scientific and technological revolution and industrial transformation has been reconstructing the global innovation landscape and reshaping the global economic structure. More than at any time in history, China needs to become a world power in science and technology. Never before has science and technology had such a profound impact on the future of the country and the well-being of its people. From the macro level down to the commercial sphere, AI is changing the old order.

IBM's research institute has proposed the term "cognitive enterprises" for enterprises that adopt new AI technology to transform their own business models. As AI, blockchain, automation, the Internet of Things and 5G become increasingly widespread, the combination of these forces is bound to reshape standard business architecture.

In the future, the first movers of the AI era will take continuous technological innovation as their main axis and become providers of AI technology and suppliers to "cognitive enterprises". A leading AI company with a strong Internet foundation may come to have a leading voice in this field in China.

Rebuilding cognition starts from the engine

With her ponytail, oval face and big, confident but slightly shy eyes, another gong-striker, Guo Jiahui, would be an ordinary 12-year-old junior high school student in every respect had she not developed an AI application.

Can a first-year junior high school student work with AI? Guo Jiahui's answer: not only is it possible, it is simple. At first, Guo Jiahui was not very interested in AI. Later, guided by her father, a programmer, she found that developing programs was not boring at all. On the Baidu EasyDL platform there is no need to write code: the user only inputs the relevant data by following the platform's guidance, and the platform's algorithm models then train and learn on their own to produce an application. She went on to develop an application that uses artificial intelligence to detect mask wearing, which attracted more than 3,000 calls after it was released on the Baidu AI market.

In recent years, the deep learning platform has gradually become an important choice for all walks of life in China to rapidly deploy AI. Behind it is the competition of enterprises for the underlying strategic technology in the AI era. EasyDL has won more developers' favor by virtue of its "zero threshold" advantage.

IDC's Deep Learning Framework and Platform Market Share report shows that, as of December 2020, the EasyDL platform ranked first by share in the machine learning platform market, holding first place for two consecutive years.

"Simple and Reliable" is Baidu's core value. To make complex things simple, technical strength is essential.

According to the listing prospectus, Baidu is one of the few companies that offer full-stack AI, with infrastructure including an AI chip (Baidu Kunlun chip), a cloud platform (Baidu Smart Cloud), a deep learning framework (EasyDL), core AI capabilities, an open AI platform, and other products and services.

From a technical perspective, only Google and Microsoft currently have a comparable full-stack layout worldwide. In terms of overall patent count, as of October 30, 2020, Baidu held 2,682 AI patents, the largest number in China, and ranked among the top five AI companies globally.

This technical strength is reflected in the financial reports: Baidu's revenue fundamentals are changing, and the share of revenue from new, AI-supported businesses is growing.

Baidu's Q4 2020 financial report showed that online marketing revenue (18.9 billion yuan) was essentially flat year on year, while non-marketing revenue (4.2 billion yuan) increased by 52% year on year. The non-marketing revenue came mainly from Baidu Smart Cloud, which grew 67% year on year to an annualized revenue of 13 billion yuan. From 2017 to 2019, cloud service revenue was 3.005 billion yuan, 6.37 billion yuan and 9.173 billion yuan respectively, a compound annual growth rate of 75%.

This is enough to show that Baidu is building a new engine through AI.
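
The compound annual growth rate quoted above can be checked from the 2017 and 2019 revenue figures. The following is a minimal sketch, assuming the standard definition of CAGR as (end / start) raised to the power 1 / years, minus 1. The function name `cagr` is illustrative and does not come from the article or from Baidu's reports.

```python
def cagr(start: float, end: float, years: int) -> float:
    """Compound annual growth rate under the standard definition."""
    if start <= 0 or years <= 0:
        raise ValueError("start and years must be positive")
    return (end / start) ** (1 / years) - 1


# Cloud service revenue in billions of yuan, as quoted in the article.
revenue = {2017: 3.005, 2018: 6.37, 2019: 9.173}

# 2017 to 2019 spans two yearly growth periods.
rate = cagr(revenue[2017], revenue[2019], years=2019 - 2017)
print(f"{rate:.1%}")
```

Under these assumptions the result is just under 75%, which is consistent with the rounded figure cited in the article.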

"Super long endurance" brought by AI

Another gong-striker on stage was Lei Jianwei, a "5G Cloud Drive" safety officer. A former driver in the Armed Police Force, he joined the Hebei Cloud Driving Project after leaving the service. Lei Jianwei said, "I am like a 'parent', watching autonomous cars grow and upgrade step by step and watching the autonomous driving industry develop. It gives me a sense of achievement." Intelligent driving is one of Baidu's innovative businesses for the future.

According to the prospectus, Baidu has built a three-tier growth engine based on AI. The three tiers are an active and stable core business, rapidly developing emerging businesses, and frontier businesses that lead the industry, which will support Baidu's growth in the near term, the medium to long term, and the more distant future respectively.

• The first growth engine is the mobile ecosystem. As the stable core business and source of cash flow, the mobile ecosystem includes more than ten apps, among them the Baidu App and Haokan Video. Baidu began using AI to improve its search and monetization capabilities in 2010, and the mobile ecosystem is now an AI-driven business.

• The second growth engine is Baidu Smart Cloud. Baidu Smart Cloud extends AI technology to business (B-end) and government (G-end) scenarios, providing customers with a range of cloud services and AI solutions.

• The third growth engine is the "distant horizon" Baidu has its eye on: high-potential businesses including autonomous driving, intelligent assistants and healthcare.

Autonomous driving, currently the main commercial application scenario for AI worldwide, is the one most valued by the capital market. Baidu Apollo has invested in autonomous driving for more than eight years and has successively reached cooperation agreements with Changsha, Guangzhou, Nanjing, Shanghai, Beijing and other cities. This represents the large-scale deployment of AI and also marks the opening of room for large-scale commercialization.

In smart speakers, Canalys data shows that in the first half of 2020, Xiaodu smart speakers ranked first in China by shipments across all categories. According to official data released by Baidu, monthly voice interactions with the Xiaodu assistant totaled 6.2 billion in December 2020. For Baidu, the greater significance of the Xiaodu smart speaker lies in leading the era of intelligent voice search.

Turning technology into product applications that the public can understand and use requires not only years of accumulated technology but also the continued efforts of AI newcomers such as Lei Jianwei, Guo Jiahui and Guo Mei.

Innovation depends on talent. According to the Analysis Report on the Employment Prosperity of Artificial Intelligence Engineering Technicians, released by the Ministry of Human Resources and Social Security in April 2020, China's AI talent gap will exceed 10 million by 2025.

In 2020, Baidu's R&D incentive expenses were 4.47 billion yuan, accounting for 66% of its equity incentive expenses. To date, Baidu's training has reached more than 420 colleges and universities nationwide and more than 1,000 front-line AI teachers, has empowered more than 5,000 enterprise developers in total, and has produced nearly 100 chief AI architects. (This article is reprinted from Xinhuanet.)


Several agents were recognized as high-tech enterprises, and Baidu Shanxi Data Labeling Base launched the third phase of partner recruitment

2021-03-15 Baidu data crowdsourcing

Recently, the agents of Baidu Shanxi Data Labeling Base are a bit busy.

The Baidu (Shanxi) Artificial Intelligence Basic Data Industrial Base (hereinafter the "Baidu Shanxi Data Labeling Base"), located in Taiyuan, Shanxi Province, has recently opened bidding for places in its new Phase III industrial zone, recruiting new agents from among all public beta partners. Many "old" agents that settled in Phase I and Phase II are also busy preparing their own Phase III expansion plans.

The Baidu Shanxi Data Labeling Base was jointly built by Baidu and the Shanxi government and officially went into operation in September 2018. After more than two years of development, it has become the largest single data annotation base in China by headcount and output value, covering data annotation scenarios such as autonomous driving, speech recognition, face recognition and content review.


For the settled agents, the base adopts a unified standard management mode, and has established a complete enterprise support policy, including project diversion, enterprise operation cost reduction, enterprise management cost reduction, enterprise brand operation support and other aspects, to help enterprises rapidly achieve scale expansion, business capability improvement, management efficiency optimization, etc.

At present, 35 enterprises have settled in Phase I and Phase II of the Baidu Shanxi Data Labeling Base. With the base's comprehensive training and policy support, these enterprises have made great progress in headcount, business ability, management and other respects. The base now has a total staff of nearly 3,000 and a cumulative output value of more than 200 million yuan. In addition, by the end of 2020, the Base had applied for and been recognized as a national high-tech enterprise.

Shanxi Linnuo Network Technology Co., Ltd. (hereinafter "Linnuo Company") was recognized as a high-tech enterprise last year. It was also one of the first agents to settle in the Baidu Shanxi Data Labeling Base, in the second half of 2018. Li Yingwei, the head of the company, entered the data annotation industry at the end of 2017 and came into contact with the Baidu public beta platform. After entering Phase I of the base in 2018, Li Yingwei began to build his own data annotation team, which now has nearly 200 people. Guo Mei, a data annotator born in the 1980s, was interviewed and featured by CCTV's News Network as a representative of successful transition out of a traditional industry.


As construction of Phase III of the base proceeds, Li Yingwei is also actively planning his team's expansion. It is understood that the company's first batch of 15 people has already moved into the Phase III industrial zone, and plans for further staff are progressing in an orderly manner: "including the first batch, at least 65 people have been confirmed to settle in Phase III".

When it comes to the changes before and after entering Baidu Shanxi Data Labeling Base, Li Yingwei's biggest feeling is that "it was difficult to fight alone; after entering the base, it has been supported by Baidu, and the enterprise has grown rapidly".

As Li Yingwei recalls, there were thousands of data labeling companies, large and small, in the market around 2017 and 2018, but most of the ones he knew have since "disappeared". "This industry seems to have a low threshold, but it is not easy to 'survive' once you are really in it. Without a stable client, without sufficiently professional labeling ability, and without scale, you are in a weak position in the market, unable to win orders and unable to support your staff," Li Yingwei said.

The "project diversion" policy that the Baidu Shanxi Data Labeling Base offers its resident enterprises has solved the problem of sustaining their operations. The projects channeled to them cover all types of data annotation work, including 2D, 3D, voice and text, giving enterprises a relatively stable, high-yield and large-scale source of projects.

Chong Shaowei, the head of "Yayu Network Technology Service Co., Ltd. in Tanghuai Park, Shanxi Transformation and Comprehensive Reform Demonstration Zone" (hereinafter "Yayu Company"), said that before entering the base, his biggest worry was securing projects. "Now we don't have to worry about finding projects outside, which is great!" Chong Shaowei entered the data annotation industry in 2016, and his team entered the base at the same time as Linnuo.

Last year, Yayu also successfully applied for recognition as a high-tech enterprise. "Settling in the base and cooperating with Baidu have significantly helped our business growth and management ability. Relying on the base, we have also developed and contributed a great deal in areas such as computer software copyright applications and local employment. These are good 'endorsements' for our high-tech enterprise application," Chong Shaowei said.

In addition to project diversion, the Baidu Shanxi Data Labeling Base also gives agents strong support in reducing operating costs, reducing management costs, and operating their brands. Resident enterprises enjoy free office space, free administrative, property and security services, and free use of the human resource management platform and production management platform. The base also provides a full range of management and operation services for resident enterprises, including recruitment, talent training, efficiency optimization, award evaluation and brand promotion.


Shanxi Tiance Technology Co., Ltd. (hereinafter referred to as "Tiance Company"), which entered the base in the first half of 2019, is a company that has grown from "zero" thanks to various support policies of the base. "It is Baidu and the base that have made the sky survey," said Song Xiangdong, the head of the company.

Song Xiangdong had been engaged in the education industry and human resources industry for a long time before. It was an accidental opportunity. Because he participated in the human recruitment work of Baidu Shanxi Data Labeling Base, Song Xiangdong contacted the relevant person in charge of Baidu's public beta resource group, and also learned about the data labeling industry for the first time. Without much hesitation, based on the principle of "finding the right person and following the right thing", Song Xiangdong and the company's partners at that time "put together", and Tiance Company was established.

As a "new recruit" in the industry, Song Xiangdong was under pressure when he first entered the base. However, he and his team received strong support and help from Baidu and the other agent partners in the base, and the company's business has developed steadily. In 2020, Tiance Company was rated an "Excellent Partner" by Baidu public beta for its excellent performance.

"For a start-up with no accumulation at all, Baidu and the base have given us great help. We just need to organize our manpower and materials well and don't have to worry about anything else!" Song Xiangdong said. He added that the atmosphere at the base is also very good: "The agents I have dealt with all help each other with an open mind whenever someone needs it."

It is understood that the teams of Yayu Company and Tiance Company have now reached nearly 200 people, and both companies have begun to move forward with their personnel settlement plans for Phase III of the base. Song Xiangdong said that Tiance plans to settle 60 people in Phase III before May 30 this year. Chong Shaowei also said that Yayu plans to expand its entire team to 300 people this year.

At present, the construction of Baidu Shanxi Data Labeling Base Phase III Industrial Zone and the bidding for new agents to settle in are in full swing.

Baidu officials said that the bidding is open to all public beta partners and includes online pre-registration, a site visit to the base, periodic assessment and other steps. Registration and assessment progress will be published regularly so that the whole process is open, fair and transparent. The assessment will be based on the base's own management model and business needs and will cover three dimensions: business ability, recruitment ability and management ability.

Among them, the indicator for recruitment ability is the "qualified number", that is, the number of qualified personnel organized by an agent during the evaluation period; the indicator for management ability is the "qualified proportion", that is, the share of qualified persons in the agent's total headcount (including those who have resigned). For the final selection, agents whose qualified proportion is ≥ 80% will be ranked by qualified number, and the top 10 to 20 will be selected. For the specific announcement, log into the Baidu public beta platform - public beta college - the latest announcement, and search for "Baidu Shanxi base agent bidding" to learn more.
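
The screening rule described above can be sketched in a few lines of Python. This is only an illustrative sketch, not Baidu's actual assessment system: the `Agent` record, the `shortlist` function, the parameter names and the sample figures are all hypothetical, and the default cut-off of 10 is simply one assumed value within the stated 10-to-20 range.

```python
from dataclasses import dataclass


@dataclass
class Agent:
    name: str
    qualified: int    # "qualified number" in the assessment period
    total_staff: int  # total headcount, including people who resigned


def shortlist(agents, min_percent=80, top_n=10):
    """Keep agents at or above the qualified-proportion threshold,
    rank them by qualified number (highest first), and return the top_n."""
    eligible = [
        a for a in agents
        # Integer comparison avoids floating-point rounding at exactly 80%.
        if a.total_staff > 0 and a.qualified * 100 >= min_percent * a.total_staff
    ]
    ranked = sorted(eligible, key=lambda a: a.qualified, reverse=True)
    return ranked[:top_n]


# Made-up sample data for illustration only.
agents = [
    Agent("A", 90, 100),   # 90% qualified
    Agent("B", 120, 200),  # 60% qualified, filtered out despite the largest count
    Agent("C", 40, 50),    # exactly 80%, still eligible
    Agent("D", 95, 100),   # 95% qualified
]
print([a.name for a in shortlist(agents)])  # ['D', 'A', 'C']
```

Note that under this rule the proportion acts only as a gate: an agent with the largest qualified number is still excluded if fewer than 80% of its total staff qualify.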

Baidu was selected as the "talent capability evaluation support organization" of the Ministry of Industry and Information Technology to provide AI talent support for China's smart economy development

2021-02-08 Baidu data crowdsourcing

Recently, the Talent Exchange Center of the Ministry of Industry and Information Technology announced the 2021 list of talent capability evaluation institutions in key fields of industry and informatization, and Baidu was selected as a support institution for talent capability evaluation in the fields of artificial intelligence and big data.

It is reported that this selection focuses on emerging industries, covering more than 10 key fields such as artificial intelligence, big data, industrial Internet, blockchain, and intelligent manufacturing. A total of 58 enterprises, 18 institutions and 25 professional institutions were selected into the directory of "talent capability evaluation support institutions".

As a selected institution, Baidu will support talent evaluation in the future, including professional services, ability coaching, organization and implementation, and application promotion, based on the actual needs of industrial talents in the fields of artificial intelligence and big data, and accelerate the formation of a talent evaluation system oriented by industrial needs and based on position and ability.

In recent years, emerging industries such as artificial intelligence and big data have developed rapidly, and the cultivation of industrial talents faces challenges such as large social demand, high complexity, and an urgent need for talent capability evaluation. According to the Talent Development Report of Artificial Intelligence Industry (2019-2020) issued by the Talent Exchange Center of the Ministry of Industry and Information Technology, the current effective talent gap in China's AI industry is estimated at up to 300,000.

Baidu is the world's leading AI platform company and has long been committed to cultivating AI industry talents. As the first domestic technology enterprise to deploy AI, Baidu was named by Harvard Business Review in 2020 as one of the "2019 Global AI Top Five Companies", becoming the only Chinese enterprise on the list. By virtue of its strengths in AI, big data and other fields, Baidu continues to actively promote the construction and development of industrial talent capacity.

To address the explosive growth of talent demand in AI-related fields, Baidu announced earlier that it expects to cultivate 5 million AI talents over the next five years, providing AI talent support for the development of China's smart economy and smart society.

As the application of artificial intelligence continues to accelerate, some emerging application-oriented occupations have been "born at the right time". In February 2020, "AI trainer" officially became a new occupation and was included in the national occupational classification directory. It covers two types of work: data annotator and AI algorithm tester.

In 2018, Baidu cooperated with the Shanxi government to build the "Baidu (Shanxi) Artificial Intelligence Basic Data Industrial Base" to cultivate a professionally competent data annotation team. The base has since become the largest single data annotation base in China by headcount and output value, helping 3,000 people successfully achieve career transformation and skill improvement.

Over ten years of deep work in the AI data field, Baidu has established a complete support system and a clear development plan for cultivating industrial talents. Baidu plans to train 50,000 AI data annotators in Shanxi over the next five years, providing them with channels for skills training, ability improvement and career development. In addition, Baidu will continue to share its knowledge and experience of talent cultivation in AI, big data and other fields, contributing to the cultivation of talents in relevant industries across society.

At present, artificial intelligence has become an important driving force for a new round of scientific and technological revolution and industrial transformation. As a leading domestic AI enterprise, Baidu will continue to increase its investment, accelerate the pace of talent training, and promote the high-quality development of China's AI industry based on talents and technology.

To solve the enterprise pain point of "data asset management", Baidu Data Crowdsourcing was selected as a "Galaxy" case by CAICT

2021-01-18 Baidu data crowdsourcing

Recently, the results of the 2020 big data "Galaxy" case selection, organized by the China Academy of Information and Communications Technology (CAICT) and other organizations, were released, and the Baidu Intelligent Cloud Data Crowdsourcing intelligent driving data asset management practice project was selected as an "excellent case of data asset management".

It is reported that the case collection was jointly organized by CAICT and the Big Data Technical Standards Promotion Committee of the China Communications Standardization Association (CCSA TC601), and covered three major directions: industrial big data applications, data asset management, and privacy computing.

With the in-depth layout of the national "new infrastructure", the artificial intelligence industry has ushered in a broader development opportunity. Automatic driving and intelligent transportation are important tracks. As a leading enterprise in intelligent driving in China, Baidu has accumulated profound technologies, capabilities and resources in the field of intelligent driving.

Based on years of data experience in the intelligent driving industry, Baidu Intelligent Cloud Data Crowdsourcing (hereinafter referred to as "Baidu Data Crowdsourcing") has created a "data asset management practice plan", which can provide complete process supporting products and services such as data collection, labeling, storage, management, training, cleaning, evaluation, etc.

On December 30, 2020, the Ministry of Transport issued the Guiding Opinions on Promoting the Development and Application of Road Traffic Automatic Driving Technology, strongly promoting the further development of the domestic automatic driving industry. Baidu Data Crowdsourcing is committed to accelerating the application of intelligent driving technology through excellent intelligent driving data asset management practices, helping the government solve traffic efficiency problems, and promoting enterprises to achieve intelligent transformation.

In the selected case practice of Baidu, a scientific and technological innovation enterprise focusing on intelligent driving research needs to optimize the algorithm and improve the automatic driving ability from L2 to L4. However, in terms of intelligent driving algorithm training, the enterprise lacks road data covered by multiple scenes, high-quality standard data, and perfect data set management process, which results in slow research and development progress. Therefore, the enterprise chose to cooperate with Baidu Data Crowdsourcing.

After fully considering the internal data resources and business application status of the enterprise, Baidu Data Crowdsourcing has provided it with a full process data asset management solution.

The project has the characteristics of large amount of data, multiple scenarios, high accuracy requirements, and puts forward high requirements for data asset management. In this regard, Baidu initiated the establishment of a special committee for automated driving data asset management, and proposed a set of targeted organizational management and implementation measures, including asset management organizational structure, data standard evaluation methods, data asset management processes, audit inspection and evaluation methods, and data security measures.

In terms of data collection, the project collected data on 2,000 km of roads across Beijing and Shanghai, and finally delivered 1.5 km of point cloud segmentation results, 70,000 frames of lane line data, and 800,000 frames of obstacle data, with an acceptance accuracy of more than 99%.

In terms of data annotation, relying on its professional annotation workforce and platform, the team completed data cleaning and annotation efficiently and to a high standard. Tens of thousands of corner case scenarios help check whether there are gaps in the scene database built through collection and annotation, and help customers accelerate algorithm upgrades for L4 autonomous driving.

In terms of data management, the data management platform lets customers manage data hierarchically, visualize processed data, and retrieve data by specific tags. This helps them build a sound unstructured data governance and management system, so that they can use data more effectively, improve model training and algorithm iteration efficiency, and accelerate the deployment of their autonomous driving models.

In the process of the project, relying on Baidu's experience in collecting millions of kilometers, Baidu Data Crowdsourcing provides customers with methods for collecting route planning and screening data to be labeled, which improves the efficiency of road collection and reduces waste of invalid collection and labeling. In addition, access to intelligent algorithms suitable for different scenarios, including automatic pre labeling technology, intelligent auxiliary algorithms and automatic quality inspection algorithms, has greatly improved data processing efficiency and data delivery quality.

The process of data capitalization will bring changes to the enterprise, which will be disruptive and innovative, and even bring about "rebirth" to the enterprise. However, at present, enterprises still face many pain points in AI data collection, data annotation, data management, etc., including the difficulties in high-quality data collection, multi scene data annotation, and multi type data management.

How to establish a data asset management system that conforms to its own data characteristics and combines with its own business is the core issue that enterprises need to focus on at present and in the future.

Baidu Data Crowdsourcing relies on Baidu's 10 years of AI data experience, leading product and technology capabilities, and the largest AI data annotation base in China, the Baidu (Shanxi) Artificial Intelligence Basic Data Industry Base. On this foundation, it is committed to providing customers with one-stop AI data governance and asset management solutions, helping enterprises standardize and streamline data asset management, and letting the added value of data bring economic and social benefits to enterprises.

Driven by technology, Baidu Intelligent Cloud Data Crowdsourcing focuses on being an "AI enabler"

2021-01-08 Baidu data crowdsourcing

With the arrival of the digital economy era, data is playing an increasingly important role in providing momentum for the intelligent transformation of all walks of life.

On December 25, Baidu Smart Cloud TechDay and Baidu Technology Open Day - Data Crowdsourcing Special Session with the theme of "technology driven, releasing the value of data elements" was held in Beijing.

As a company that has been deeply engaged in AI technology for many years, Baidu is also a pioneer and driver of AI data collection and tagging. Baidu Intelligent Cloud Data Crowdsourcing is providing AI data services to hundreds of leading enterprises and accelerating industrial upgrading, relying on Baidu's 10 years of AI data experience, leading product technology capabilities and the industry's largest data tagging base.

Data is the "fuel" for the development of AI technology. This year, "data" was included in the production factors by the central government for the first time, which means that the construction of digital China has been accelerated again.

Chen Shangyi, chairman of Baidu Technical Committee, said: "At the beginning of 2010, Baidu began to deploy AI. It is the leading AI enterprise with the earliest investment, the strongest technology and the most complete layout in China, and also the enterprise with the best understanding of data. At present, Baidu intelligent cloud data crowdsourcing has become the largest AI data service provider with the strongest brand and technology in the industry, which can provide the most professional, high-quality one-stop data collection and tagging services for AI developers. "

(Chen Shangyi, Chairman of Baidu Technical Committee)

The data crowdsourcing mode is a centralized embodiment of swarm intelligence. Professor Sun Hailong from the School of Computer Science of Beijing University of Aeronautics and Astronautics shared the opportunities and challenges of swarm intelligence for big data industry.

He said that swarm intelligence is one of the core contents of the national new generation AI development plan and provides important theoretical and technical support for the development of big data intelligence industry.

In particular, data crowdsourcing is widely used for big data perception, collection and analysis, and has become an important form of group intelligence to support big data intelligent industry. However, it still faces many technical challenges such as group intelligence resource management, task scheduling and distribution, and result convergence. To solve these challenging problems, it is urgent to deepen cooperation between academia and industry.

(Sun Hailong, professor and doctoral supervisor of School of Computer Science, Beijing University of Aeronautics and Astronautics)

One-stop data annotation service, leading the development of the data industry

The AI data annotation platform built by Baidu intelligent cloud data crowdsourcing enables one-stop management of data from collection, access, cleaning, annotation to quality management, delivery and other processes.

In terms of data collection, Baidu intelligent cloud collection resources cover more than 40 countries and regions, as well as 8 major dialect regions in China. Baidu intelligent cloud data crowdsourcing has achieved the fastest portrait collection speed in the industry, with 30000 portraits and 50000 voices collected every week.

In terms of data annotation, Baidu's intelligent cloud data crowdsourcing has formed four key capabilities: data annotation tools that support the whole scene, process platform management capabilities that support the whole process, intelligent annotation technology and huge resource support capabilities, which can provide one-stop AI data services for the data needs of various AI application scenarios.

Baidu intelligent cloud data crowdsourcing has built up more than 70 kinds of data annotation capability. Over the past decade it has provided nearly 50000 AI data services for more than 200 Baidu product lines and hundreds of industry-leading customers, with accuracy as high as 99.99%.

At the meeting, Baidu's intelligent cloud data crowdsourcing team revealed the core technology of the AI data annotation platform. The data annotation platform consists of tool platform, resource management platform and task distribution management platform:

• The tool platform meets the needs of customers' voice, pictures, videos, text, 3D point clouds and other full type, full scene data annotation, supports the drag and drop configuration of multiple elements such as points, lines, boxes, and regions, and supports thousands of different rule projects every year;

• The resource management platform and task distribution management platform create a whole process support system from data access, task allocation, resource scheduling, quality audit, task settlement, etc., to achieve real-time management of millions of tasks and hundreds of thousands of users.

With the help of machine decision-making, the tagging process realizes the automatic flow of personnel and data, gets rid of manual intervention, and gives consideration to efficiency and fairness.

The system automatically builds comprehensive, accurate and multi-dimensional user portraits, recommends the most appropriate annotators and reviewers for each data annotation project, and ensures that the best-matched personnel are used to release the maximum value of data for customers, balancing efficiency with quality.

The data annotation platform is based on Baidu's intelligent cloud AI, big data, cloud computing and other capabilities, and is based on the domain driven microservice architecture and plug-in microkernel architecture to ensure the rapid and efficient operation of the platform and ensure the creation of large-scale high-quality data annotation services for customers.

It is worth mentioning that Baidu's intelligent cloud data crowdsourcing continues to explore cutting-edge intelligent tagging technologies, building strong algorithm capabilities from 0 to 1.

At present, AI algorithm has run through the whole process before, during and after annotation, and is widely used in pre annotation, auxiliary annotation, quality inspection, personnel portrait and other links. It has improved the annotation efficiency by more than 60%, and the automatic detection of annotation errors accounts for 70%, greatly improving the efficiency and quality of annotation.

After the introduction of AI assisted intelligent labeling, the overall efficiency of human skeleton point labeling has been improved by 71%, the efficiency of OCR assisted labeling has been improved by 20%, and the efficiency of 3D continuous frame obstacle pre recognition single frame has been improved by 28.8%.

In addition, cutting-edge annotation technologies such as 3D point cloud based on deep learning, which are jointly developed by data crowdsourcing and Baidu Research Institute, continue to stimulate the potential of AI data, and have made great progress in the field of automatic driving.

The first data service and asset management platform, improving AI algorithm model iteration

As a highlight of this activity, Baidu Intelligent Cloud released the industry's first data service and asset management platform in the field of intelligent driving, providing integrated intelligent data service solutions for intelligent driving enterprise users.

The data service and asset management platform covers the whole life cycle of AI development of "data collection, data annotation, data management, model training, and model evaluation", helps enterprise users build AI pipelines around data, improves the iteration efficiency of AI algorithm models, and enables data to better drive model development.

The data service and asset management platform will build AI data closed-loop for customers with leading data services to accelerate the realization of customer data value.

In the era of digital economy, data has become a key factor of production. The experts on site agreed that the future data quality, data governance, talent training, process standards, etc. will become the key drivers for the further development of AI data services, and promote the large-scale application of AI technology.

As a pioneer in industry practice, Baidu Intelligent Cloud Data Crowdsourcing will rely on the professional annotation workforce of "Baidu (Shanxi) Artificial Intelligence Basic Data Industry Base", empower all industries with industry-leading technical strength, and continue to release the deep value of data elements.

Promoting the development of the intelligent data industry, Baidu's Chen Shangyi won the "Big Data Science and Technology Communication Award"

2020-12-13 Baidu data crowdsourcing

On December 13, the "2020 Big Data Science and Technology Communication and Application Summit Forum" was held in Hengyang, Hunan, with the theme of "hundreds of years of changes win the future". The forum is an international and authoritative platform for exchanging achievements in the industry. It was co-sponsored by the Chinese Society of Science and Technology Journalism, Hunan Association for Science and Technology, and Hengyang Municipal People's Government. At the forum, the "Big Data Science and Technology Communication Award" was announced. This year, 78 scientists and entrepreneurs won nine awards in four categories: special contribution award, work award, group award and individual award. More than 10 academicians, including Li Lanjuan, Li Deren, Chai Tianyou, Chu Junhao, Gu Guobiao and Liu Yunjie, and more than 200 experts and scholars in the field of big data communication and application were invited to attend. Chen Shangyi, Chairman of Baidu Technical Committee, won the "Big Data Science and Technology Communication Award Leader Award".

(Left 6: Chen Shangyi, Chairman of Baidu Technical Committee)

The "Big Data Science and Technology Communication Award" is the first award established by the Chinese Society of Science and Technology Journalism in 2018, which aims to commend groups and individuals who have made outstanding contributions to the application, communication and promotion of big data science and technology, and promote the development of a new generation of information technology and industry represented by big data.

Chen Shangyi has been working in the field of big data for many years. In the era of big data and intelligence, he and the Baidu team have actively promoted the development of the intelligent data industry and made outstanding contributions. He once emphasized the importance of intelligent data governance in the process of industrial intelligence: "Industrial intelligence cannot be separated from data governance, and data governance also determines the process of industrial intelligence to a large extent." In addition, as a new industry of the intelligent economy, data collection, annotation and trading have created new professions: data annotator and artificial intelligence trainer. In this field, Baidu has continuously optimized the annotation algorithms, platforms and tools used in these emerging data industries, and signed an in-depth cooperation agreement with Shanxi Province to establish a Shanxi data annotation base. The base has not only helped the local government cultivate a digital industry ecosystem, but has also trained many data annotators. It is estimated that it will generate 500 million yuan of direct economic benefits and employ 50000 people over the next three years. At present, Chen Shangyi is leading his team to promote this new industrial model nationwide, making greater contributions to the development of China's big data and artificial intelligence industries.

(2020 Big Data Science and Technology Communication Award announced; Chen Shangyi wins the "Leader Award")

In addition to big data, Chen Shangyi has also made important contributions to promoting scientific and technological progress. As an expert in major national science and technology projects, he has participated in the formulation and implementation of national science and technology policies for many consecutive years, and has repeatedly proposed that relevant national departments formulate policies and introduce laws and regulations to break through key "chokepoint" technologies. For this work he has been commended by the Ministry of Science and Technology, the All China Federation of Returned Overseas Chinese and the Chinese Institute of Electronics.

As the general manager of Baidu Xiong'an Company, Chen Shangyi also actively promoted the strategic cooperation between Baidu and Xiong'an New Area, participated in the planning and formulation of Xiong'an New Area, and cooperated with Baidu's business teams to deeply participate in the construction of unmanned vehicles, smart towns, and the creation of smart living experience halls, making outstanding contributions to the intelligent construction of Xiong'an New Area.

Baidu is also actively promoting the efficient dissemination of technology, enabling the whole industry and society. Baidu has opened the Baidu Institute of Technology, opening up to the whole industry the internal technologies it has accumulated over more than ten years, including big data, artificial intelligence, deep learning and intelligent driving. In addition, Baidu Technology Open Day was established to promote the dissemination, sharing and collaborative innovation of knowledge among government, industry, universities and research institutions.

Chen Shangyi, who won the "Big Data Science and Technology Communication Award Leader Award" this time, said that Baidu hopes to help society enter the big data era faster and make life better through science and technology communication. He also hopes that friends from all walks of life, especially in the field of science and technology communication, will work together to let science and technology lead us into a better era.

In the summit dialogue, Chen Shangyi also shared his views on the application of big data in technology enterprises. He said that the development of big data has given great impetus to AI breakthroughs, and that sufficient data, computing power and algorithms have brought new opportunities to all walks of life. Taking Baidu as an example, the integration of big data and artificial intelligence is enabling sectors such as manufacturing, agriculture, finance and medical care to accelerate their intelligent transformation. It is worth noting that strengthening security standards for big data applications is also a direction in which major enterprises need to work hard. Only legal and compliant use of data resources can make data flow better, make transportation more convenient, make pharmaceutical development simpler, make urban management more efficient, and make intelligent life accessible.

Focusing on high-quality data services, Baidu Data Crowdsourcing won two awards of "China Data Quality Management"

2020-09-17 Baidu data crowdsourcing

Recently, Baidu Intelligent Cloud Data Crowdsourcing won the "2020 Data Quality Excellence Practice Award" and the "2020 Data Quality Excellence Product Award" with a high level of data quality management in the "DQMIS2020 Second China Data Quality Management Award" (hereinafter referred to as the "Award") selection activity.

(Baidu Intelligent Cloud Data Crowdsourcing won two awards of "China Data Quality Management Award")

The award aims to select China's outstanding data quality achievements and industrial practices, and promote the innovative development of China's data quality management industry. The selection activity was jointly organized by the Data Quality Management Think Tank (DQpro) and the organizing committee of the Data Quality Management International Summit (DQMIS) (led by Peking University, State Grid Global Energy Internet Research Institute, Huaju Consulting and other institutions).

Data quality is the core of data management and the basis of data value realization. High quality data plays an important role in industrial development and upgrading. Baidu Data Crowdsourcing, relying on Baidu's 10 years of AI data experience, leading product technology capabilities and the industry's largest data annotation base, is committed to providing AI enterprises with professional, high-quality AI data collection and annotation services.

"AI Data Annotation Platform": one-stop data management guarantees high quality

The "Baidu Intelligent Cloud AI Data Labeling Platform" (hereinafter referred to as the "Platform"), independently developed by the Baidu Data Crowdsourcing team, won the "2020 Data Quality Excellence Product Award" in this selection.

(Baidu Intelligent Cloud Data Crowdsourcing won the "Data Quality Excellence Product Award")

As a whole process management platform for basic data services, it can realize one-stop management of data from access, cleaning, labeling, quality management, delivery and other processes.

The platform has industry-leading intelligent auxiliary annotation technology, automatic quality inspection algorithm, and mature data quality management system, which can guarantee the quality of delivered data. Among them, using AI technology to provide data aided annotation can greatly improve the efficiency of annotation, and plays an important role in the effective organization and processing of unstructured data.

At present, the service of the platform has covered many fields, including AI enterprises, mobile phone manufacturers, automobile manufacturers and the Internet industry. It can deliver standardized and structured available data for customers, help customers train algorithm models, carry out machine learning, and improve their competitiveness in the AI field.

So far, the platform has collected more than 150 million standard 2D/3D data frames in the field of intelligent driving, with an accuracy rate of more than 99%. Voice data delivery has reached tens of thousands of hours, and tens of millions of texts have also been delivered.

Intelligent driving data: "Integrated quality management of standard collection" helps technology land

In this selection, Baidu Data Crowdsourcing won the "2020 Data Quality Excellence Practice Award" with the "Intelligent Driving Data Standardization Integration Quality Management Practice".

(Baidu Intelligent Cloud Data Crowdsourcing won the "Excellence in Data Quality Practice Award")

Vehicle intelligence is widely regarded in the industry as an important part of the future intelligent transportation architecture, and major car manufacturers have made strategic plans for L4 automatic driving.

In this award-winning case, the customer is a technology innovation enterprise committed to manufacturing safe, reliable and excellent intelligent cars. To bring its various intelligent driving models into deployment, the enterprise's demand for data collection and data annotation soared.

The large amount of data required for this project, the variety of scenarios and the high accuracy requirements pose a great challenge to data quality management. Most annotation teams on the market can handle only a few single-scene annotation tasks, lack scientific project management processes, and cannot meet customers' data requirements.

Baidu Data crowdsourcing team provides the solution of "Integrated Project Quality Management of Intelligent Driving Standards Collection", which adopts hierarchical organization and personnel management, and has sound and complete project system specifications, high professional data quality control standards, and intelligent and safe data quality management implementation process.

The first phase of the project has completed data collection on 2000 km of roads. Relying on the professional labeling workforce of the "Baidu (Shanxi) Artificial Intelligence Basic Data Industrial Base" and on industry-leading continuous frame ID prediction and normalization algorithms, the project greatly improved labeling efficiency and data quality. The data accuracy rate is up to 99%, and the efficient, high-quality service has won unanimous praise from customers.

In the era of digital economy, data has become a key factor of production. In the future, with the large-scale application of AI technology, data quality will become an important factor in the application of new technologies and the development of enterprises, as well as the "last mile" that affects the efficiency of data analysis and utilization. As a pioneer in industry practice, Baidu Data Crowdsourcing will continue to focus on data quality management, provide AI enterprises with professional and high-quality AI data services, empower all industries with technical strength, and accelerate the development of industrial intelligence.

"Data service" boosts industrial intelligence. Chen Shangyi: release data value and jointly build industrial ecology

2020-09-16 Baidu data crowdsourcing

"The acceleration of industrial intelligence is inseparable from data governance, and data also determines AI's intelligent process to a large extent." On the afternoon of September 15, at the intelligent cloud sub forum of "Everything Intelligence - Baidu World 2020", held online, Chen Shangyi, chairman of Baidu Technical Committee, explained the important role of intelligent data services in the process of industrial intelligence from the perspective of "data intelligence", and shared Baidu Intelligent Cloud's exploration in data collection, labeling and governance.

(Chen Shangyi, Chairman of Baidu Technical Committee: Intelligent data services play an increasingly important role in promoting industrial intelligence)

Chen Shangyi said that data plays a vital role in AI intelligence, but enterprises often face many difficulties, such as data acquisition and processing difficulties. To this end, Baidu has provided the industry with a comprehensive data standardization solution for multiple scenarios and types of customers to help customers release data value.

At the same time, on the basis of serving enterprises, Baidu has further explored a digital economy solution with data annotation base and trading platform as the core to help local governments cultivate digital industry ecology.

Chen Shangyi introduced that Baidu's smart cloud data standardization solution is in the leading position in the industry.

In terms of data collection capability, Baidu has industry-leading collection resources, with collection subjects covering more than 40 countries and regions around the world; Domestic voice data collection covers eight major dialect areas and people of all ages in China.

From the perspective of data annotation capability, the team has annotation tools that support the whole scene, an efficient process management platform, and intelligent annotation algorithms. At the same time, it has built a huge annotation resource to support project implementation, which can provide high-quality and customized data annotation services.

In the process of data collection and annotation, data security and data quality are the most concerned topics in the industry. In terms of ensuring data security, Baidu was the first in the industry to establish a complete privacy compliance process that conforms to data regulations of countries around the world, and was highly recognized by customer security departments. In terms of improving data quality, Baidu has set up a dual process of intelligent audit and manual quality inspection, leading the industry in accuracy. In addition, the team innovatively introduced pre annotation algorithm and auxiliary annotation algorithm, which greatly improved the efficiency and accuracy of annotation.

These capabilities enable Baidu to meet the collection needs of almost all scenes, covering voice, picture, video, text, 3D and other annotation types. At present, in typical scenes, data can be collected from up to 30000 people per week for portraits and from up to 50000 people per week for voice.

On the other hand, in addition to the support of advanced intelligent technology, industrial development still needs strong human resources support in the face of huge data processing volume. Chen Shangyi said that Baidu Smart Cloud has built a labeling human resources system with the largest number of people and the most professional in the industry through online crowdsourcing and offline self built labeling bases.

"At present, there are more than 200000 online crowdsourcing personnel, more than 300 contracted offline labeling agents, and 20000 professional labeling personnel." Chen Shangyi introduced, "In addition, Baidu and the Shanxi government established a Shanxi data annotation base in 2018, with 2300 full-time annotators who are stable and professional and can undertake difficult annotation tasks such as automatic driving, voice, image and portrait."

"The huge annotation resources provide us with the strongest annotation capability in the industry. Today, we annotate more than 500 hours of voice data every day, more than 20000 pieces of image data, and more than 40000 pieces of automatic driving road data." Chen Shangyi said.

Following the joint construction of the data annotation base, Baidu is now working with the Shanxi government again to build the "AI data trading platform of Shanxi Comprehensive Reform Zone". This is the first big data trading platform in Shanxi Province.

"We hope to build a data trading platform featuring artificial intelligence unstructured data, accelerate regional data circulation and open sharing, and release the value of data elements." Chen Shangyi said, "We are committed to building the data open platform into a new infrastructure for regional digital economy development, and using data as a new incubator for regional innovation and entrepreneurship."

Liu Yong, Deputy Director General of the Department of Industry and Information Technology of Shanxi Province, attended the sub forum and recognized the achievements of the cooperation between the two sides. He said that in recent years, Shanxi Province has vigorously implemented the big data strategy. Lou Yangsheng, the secretary of the provincial Party committee, and Lin Wu, the governor of Shanxi Province, have made great progress in the development of Shanxi's big data industry.

"Next, we will take the labeling industry as the traction, gather the development potential of artificial intelligence, focus on building a basic data service system integrating data collection, cleaning, labeling, trading and application, and take the lead in developing a new path in transformation." Liu Yong said that he sincerely welcomes Baidu and other enterprises to join hands with Shanxi to create and share a bright future of big data innovation and development.

Chen Shangyi said that in the future, Baidu will work with local governments and enterprises to gather the superior resources of both sides and cultivate data service capabilities. The aim is to solve the problems faced in the development of the regional digital economy, such as the lack of a digital environment, the difficulty of circulating data elements, and the difficulty of mining data value; to promote the openness, sharing and circulation of data; to lower the threshold for enterprise technological innovation; and to build new infrastructure for the development of the digital industry.

"The digital economy, with data as the key element, will become an important strategic carrier driven by national innovation. In the wave of digital economy development, Baidu Smart Cloud will work with peers to build a data ecology and promote intelligent development of the industry." Chen Shangyi said.

Baidu and Shanxi Government Sign Cooperation Agreement Again to Promote the Landing of Data Economy

2020-06-22 Baidu data crowdsourcing

The establishment of AI data trading platform will continue to expand Baidu's business in Shanxi, help Shanxi's data service enterprises expand their business scope, and promote the opening and sharing of data resources.

On June 6, Baidu Intelligent Cloud Data Crowdsourcing reached a cooperation agreement with the Shanxi Provincial Government. The two sides will further deepen cooperation, accelerate the construction of major transformation projects in Shanxi Province, and jointly build an AI data trading platform for Shanxi Comprehensive Reform Demonstration Zone. Lou Yangsheng, secretary of Shanxi Provincial Party Committee, attended the signing ceremony and delivered an important speech.

On the same day, Shanxi Satellite TV reported remarks by Lou Yangsheng, secretary of the Shanxi Provincial Party Committee, who said: "We should adhere to the application orientation, actively strive for the layout of national key laboratories and major scientific research equipment and devices, pursue the goal of 'first-class' with courage and wisdom, boldly climb scientific peaks, and attract first-class talents and teams with first-class platforms and first-class topics. We should adhere to the achievement orientation, innovate systems and mechanisms, break down the seniority system, implement the system of 'unveiling the list and taking the lead' for key scientific research projects, and dare to take the lead in major fields, subdivisions, and future industries. We should give full play to the main role of enterprises and scientific research institutions, strengthen platform construction, attach importance to the transformation of scientific and technological achievements, turn achievements into products, turn products into industries, and make them the pillar of transformation and development."

(Lou Yangsheng, Secretary of the CPC Shanxi Provincial Committee)

As a leading data service provider in the industry, Baidu Intelligent Cloud Data Crowdsourcing is committed to providing AI enterprises with a series of professional data services such as AI data collection, governance, annotation, and data set optimization. The person in charge of Baidu Intelligent Cloud Data Crowdsourcing said that the business has a large number of customer deployment cases and rich industry experience. Helping to build the AI data trading platform in the Shanxi Comprehensive Reform Demonstration Zone is an important chapter in its effort to accelerate industrial intelligence.

In order to implement the spirit of the National "Two Sessions", further promote the digital economy of Shanxi, and realize the transformation of Shanxi from coal resources to data resources, the government of Shanxi Province proposed being bold in innovation, accelerating the construction of major transformation projects, and providing strong support for pioneering a new path in transformation and development. As an important carrier of data trading, the data trading platform can promote the integration of data resources, standardize trading behavior, reduce transaction costs, and enhance data mobility. It has become one of the important measures taken by the Shanxi government to develop the digital economy.

Leading Baidu intelligent cloud data crowdsourcing: welcome key opportunities again under the new infrastructure

2020-06-18 Author | Zhenting, Produced by | Xinmang X

How far is AI from us?

Two years ago, the rate would have felt out of reach. But today, the process may be beyond your imagination.

"Now, one in 10 enterprises uses 10 or more AI applications," said MMC Ventures, a British organization.

According to Salesforce Research, 83% of IT leaders said AI&ML was changing customer engagement, while 69% said they were changing their business.

In particular, during the epidemic, the application of various AI enabled devices and products maximized the landing of AI.

AI is vigorously changing life and work in all aspects, which has become a strong consensus.

Behind the strong development of AI, there is a key role that cannot be bypassed, that is, data.

The importance of data to AI is self-evident: data has been described as the "fuel" of AI algorithms and as the "oil" and "soul" of the AI era, among many other images.

Furthermore, the commercial and social value of enterprises that provide basic data services for AI, supplying enough good data, is becoming ever more prominent.

Baidu Intelligent Cloud Data Crowdsourcing is one such enterprise. As the largest AI data service provider in China, it has worked hard in this field, sparing no effort to contribute its professional ability and value, and has continued to stand out in taking on the social responsibility of addressing employment problems.

In the new era background of accelerating new infrastructure construction, AI, as an important component, has contributed to the rapid growth of the data crowdsourcing industry.

Baidu's smart cloud data crowdsourcing, which has already taken the lead, welcomes another key opportunity.

Open the "beautiful new world" of AI data services

"A lot of data is really too important for AI."

This is a point made by Li Ming, a senior product operator of Baidu Smart Cloud crowdsourcing, on Baidu Smart Cloud TechDay.

If the three elements of artificial intelligence (algorithms, computing power and data) had to be ranked, in his opinion data is the most important.

Because the foundation of artificial intelligence is training, a large number of scenes and data are needed for the artificial intelligence algorithm to learn. Only through a large number of training, can the neural network better summarize the rules, apply them to new samples, and then make intelligent judgments and answers.

The significance of high quality and rich multidimensional data for AI is obvious both in business and in the process of dimension upgrading to AI.

According to the latest White Paper on China's Artificial Intelligence Basic Data Service Industry by iResearch, the rise of the AI economy provides a long-term basis for basic data services. The industry has entered a growth period and the pattern is gradually clear.

One set of figures makes this tangible: the market size of AI basic data services will exceed 10 billion in 2025, and the industry's compound annual growth rate will be 23.5%. In terms of overall growth, the industry is relatively stable, and the continuous development of the downstream AI industry will be a long-term positive.
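
The growth figures quoted here follow the standard compound-growth relation, final = base × (1 + rate)^years. The following is a minimal sketch of that arithmetic only. The function names are illustrative, and the number of years is an assumption, since the article does not state the base year or the base-year market size.

```python
def project_market_size(base_size: float, cagr: float, years: int) -> float:
    """Compound a starting market size forward at a constant annual rate."""
    return base_size * (1.0 + cagr) ** years


def implied_base_size(target_size: float, cagr: float, years: int) -> float:
    """Invert the compound-growth relation to recover the starting size."""
    return target_size / (1.0 + cagr) ** years


if __name__ == "__main__":
    CAGR = 0.235        # compound annual growth rate quoted in the article
    TARGET = 10.0       # the "10 billion" threshold quoted for 2025
    ASSUMED_YEARS = 5   # assumption: the article does not give the base year

    base = implied_base_size(TARGET, CAGR, ASSUMED_YEARS)
    print(f"Implied starting size under these assumptions: {base:.2f} billion")
```

Changing `ASSUMED_YEARS` shows how sensitive the implied starting size is to the choice of base year, which is why the figure should be read together with the original iResearch report.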

The industry began with a low threshold, with many players of mixed quality swarming in. Now that AI has reached the deployment stage, vertical scene data has become the main demand, the requirements for data type and quality have risen significantly, and the strength of leading enterprises has gradually become prominent.

Judged by either its own attributes or the development trend of the industry, data services are undoubtedly on the rise. At a time of increasing uncertainty in the economic environment, the sector stands out as a bright spot, like a "beautiful new world" being gradually opened.

"Leading Big Brother" Baidu Data Crowdsourcing

"No. 1 in market share for three consecutive years"

"Annual revenue growth rate exceeds 50%"

"Serving more than 220 product lines within the company"

"Full coverage of top customers in intelligent driving, mobile phones, the Internet, and AI developers"

According to iResearch: Research Report on China's Artificial Intelligence Basic Data Service Industry in 2019, Baidu Intelligent Cloud Data Crowdsourcing has now become the largest AI data service provider in China.

This series of achievements is beyond the reach of most enterprises.

It is the report card of Baidu Intelligent Cloud Data Crowdsourcing, the "No. 1" in AI data services, an industry now in the spotlight.

As a professional and high-quality AI data service provider in the industry, Baidu Intelligent Cloud Data Crowdsourcing has provided data services to Baidu's internal and external customers since 2011.

Such proud achievements were certainly not easy to come by. What factors have contributed to the leading position of Baidu Intelligent Cloud Data Crowdsourcing in an industry full of fierce competition and imagination?

We try to explore its core variables.

Baidu intelligent cloud data crowdsourcing's hard core capability

The book "Blitzscaling" offers this view: huge new opportunities are usually created because technological innovation creates new markets or disrupts existing markets. The achievements of Baidu Intelligent Cloud Data Crowdsourcing fit this view.

"In fact, the most important thing is the accumulation and innovation of our technology", said Li Ming, a senior product operator of Baidu Intelligent Cloud Data Crowdsourcing.

It is understood that it has built an AI basic data annotation and collection service platform that ranks first in the industry in brand, scale and technology.

This new world must be described in concrete details. Specifically, you can feel the confidence and strength from the leading position.

In terms of annotation and collection service capacity, the self-built base has 2300 full-time annotators; the channel agent resource pool spans the whole country and 22 countries around the world, with more than 50000 annotation and collection personnel; in addition, there are 20 million crowdsourcing Internet users. Together they cover the mainstream labeling scenarios in the market and meet more than 95% of the market's labeling needs.

In addition, it has an industry-leading tool platform, realizing process standardization and intelligent tools. Even customized services are standardized for them.

At the same time, in the whole annotation process, the algorithm was added, and then invalid data was screened through an automated algorithm, which greatly improved the efficiency and quality of the entire annotation and audit.

This is also in line with iResearch's prediction: data processing platforms will strengthen their continuous learning ability, with machines continuously learning from manual annotation, raising the rate at which pre-annotation and automatic annotation replace manual work.

Baidu data crowdsourcing has also been fully considered and deployed in the data security construction that cannot be bypassed. The security and compliance of data are mainly guaranteed from four aspects: data compliance, customer compliance, user and resource compliance, and privacy compliance.

For example, a series of detailed measures, such as confidentiality agreements signed by full-time employees, dedicated line connections, restriction of external network access, computer USB encryption, video monitoring, and regular patrols by personnel, provide full-process control to ensure data security and data compliance.

Building on the maturity of these comprehensive capabilities, the customers of Baidu Intelligent Cloud Data Crowdsourcing now include the top customers in four major fields: smart driving, the mobile phone industry, the Internet, and AI developers.

Taking automatic driving as an example, the industry urgently needs a dedicated data platform with abundant and diverse data. For this reason, Baidu Intelligent Cloud Data Crowdsourcing and the Intelligent Driving Lab cooperated to complete the annotation of hundreds of thousands of high-resolution images, covering semantic annotation, dense point clouds, three-dimensional images, and three-dimensional panoramic images, across complex environment, weather and traffic conditions. As a result, Baidu Apollo Scape has the world's most complex high-precision data set for automated driving, providing richer and more complex data application scenarios for global automated driving developers to train, learn and evaluate.

In addition to open source data sets, Baidu Intelligent Cloud Data Crowdsourcing can also provide customized data services for vertical industries.

On May 28, 2020, in response to the needs of Shanghai International Automobile City, Baidu Intelligent Cloud Data Crowdsourcing launched the scheme of "private tagging platform+dedicated team of the base". The scheme was custom-developed around the automatic driving tagging scenes and organizational management needs of the International Automobile City, and the capabilities of Baidu's leading tagging platform were extracted and deployed privately.

These comprehensive and systematic deployment, continuous innovation and iteration, as well as years of focus and accumulation, have contributed to Baidu's position in the world of intelligent cloud data crowdsourcing.

Coincident with new infrastructure, another great opportunity

This year, new infrastructure has become a high-frequency word.

With the acceleration of this new infrastructure, the AI industry is bound to enter a period of rapid development.

The market's basic demand for massive data will become larger and stronger in the process of AI accelerated application, which will further stimulate the growth of the market's basic data demand and usher in a new good development opportunity for the further development of Baidu's intelligent data crowdsourcing.

We can all understand that once a standardized enterprise occupies the commanding height of its ecosystem, the surrounding stakeholders will recognize its leadership, and talents and capital will flow in.

Like a snowball, Baidu's strong AI gene, coupled with the demand for unlimited expansion of the entire industry, will help its future development and imagination.

Enlarge the sense of achievement: the way of employment support at key nodes

In addition to the unlimited expansion and amplification of its own business capability and business value, we also see another key role of Baidu's intelligent cloud data crowdsourcing, that is, its role in corporate social responsibility.

Because of its business nature and scale effect brought by its own size, it has created many employment opportunities. Under the epidemic situation, it is even more rare and valuable to directly solve social problems.

In the first quarter of 2020, it helped more than 120 enterprises and more than 3300 labeling staff resume production online. This kept a large number of labeling personnel employed while keeping the business stable and meeting customer needs on time.

Among them, Baidu Intelligent Cloud's data annotation base in Shanxi has more than 2000 full-time annotators, helping 2000 local people, including recent graduates and people moving over from other industries, to find jobs.

Facing the future, it is estimated that in five years, through the leading and exemplary role of Shanxi's labeling base, it will provide more than 50000 jobs for the local people and drive the AI basic data related industries to gather in Shanxi.

We even came across this remark from an ordinary annotator: "Data annotation makes me feel that I can keep pace with the world."

We can clearly see such an image: through our own business capabilities, through various ways, we spare no effort to promote the development of public welfare and solve social problems.

That's what Xinmang X said

The acceleration of new infrastructure construction, the rapid development of the overall AI industry, the landing of AI applications, and the rise of emerging AI application scenarios.

As a leader with deep and dedicated accumulation, Baidu intelligent cloud data crowdsourcing has also ushered in an unprecedented development opportunity. Continuous technological innovation, taking advantage of the trend, has naturally become a deterministic event with high probability.

Baidu intelligent cloud data crowdsourcing, safer and higher quality data capability, creating intelligent "eyes" for automatic driving

2020-05-25 Baidu data crowdsourcing

In recent years, the automatic driving technology has attracted much attention from the capital and industry market, and more and more automobile enterprises, parts suppliers and solution suppliers are participating in it. With the two-way support of funds and policies, the industry has developed rapidly and almost started a prairie fire.

The biggest technical bottleneck is undoubtedly the perception ability. In addition to the support of algorithms and hardware, the quality of training data also plays a decisive role - whether the data volume is large enough, whether the labeling quality is good enough, and whether the covered scenes are comprehensive enough have become one of the important criteria to indirectly measure the technical quality of an autonomous driving company.

It is against this backdrop that Baidu Intelligent Cloud Data Crowdsourcing has taken the lead in launching the overall AI data solution of "private tagging platform+base tagging team" for the automatic driving industry to help platform service-oriented enterprises build complete data infrastructure services, of which "Shanghai International Automobile City" is a typical representative.

The policy is favorable, and the data and platform capabilities should also keep up

In recent years, local governments have continued to increase investment in the construction of infrastructure for autonomous driving. Through policy support, autonomous driving has been launched, creating an automobile industry ecology and improving urban competitiveness.

In Shanghai, an international city of automobiles, the policy layout of automatic driving has already taken some measures. In 2018, the Administrative Measures of Shanghai on Road Testing of Intelligent Connected Vehicles (Trial) was officially released, making Shanghai the first city in China for open road testing of automated driving, providing important infrastructure for the automated driving test of SAIC, BMW and other enterprises. In 2019, the "AI+Traffic Scenario Plan" was launched in Shanghai International Automobile City, aiming to build a semi open demonstration area for automatic driving normalization operation with Shanghai Auto Expo Park as the carrier, and provide support for industrial development in infrastructure construction and test scenarios.

As the first industrial demonstration area to carry out the demonstration and promotion of ICV in China, its planning starts from the perception and decision-making levels to create an overall solution for hardware, software, data and road testing. Among them, the decision-making level is the most critical but also the most complex. Algorithm training requires a series of supporting construction, including training data and scene database evaluation data at the data level, as well as the software level deep learning data annotation platform and management training platform. However, due to the high precision and magnitude of autopilot data, complex labeling rules, and the high difficulty in research and development of software platforms with business scenario applicability characteristics, the industry often chooses professional AI data companies to provide data and platform services.

How to provide platform capacity building based on business characteristics, while ensuring the quality and safety of data annotation, and achieve intelligent "double eyes" of automatic driving has become a difficult problem for the auto city and even the entire automatic driving industry.

Supporting Industrial Park of Shanghai International Automobile City

Give consideration to data security and quality

Baidu Intelligent Cloud Data Crowdsourcing is the best choice for Shanghai International Automobile City.

As a professional and high-quality AI data service provider in the industry, Baidu smart cloud data crowdsourcing has provided data services for Baidu's internal and external customers since 2011. Especially in the field of automatic driving, it has successfully annotated hundreds of millions of frames of data and accumulated rich industry experience. According to iResearch: Research Report on China's Artificial Intelligence Basic Data Service Industry in 2019, Baidu Intelligent Cloud Data Crowdsourcing has now become the largest AI data service provider in China.

After Shanghai International Automobile City approached Baidu Smart Cloud data crowdsourcing, the two parties hit it off and soon established the direction of cooperation: build software capabilities starting from the deep learning data annotation platform, and achieve data security and high-quality annotation through the platform and the Baidu annotation base. "Among many service providers, we chose to cooperate with Baidu Smart Cloud data crowdsourcing mainly because of Baidu Smart Cloud's data experience and product technology capabilities in this area, and because the data annotation security scheme it provides meets our needs well," said Li Lin, deputy chief engineer of Shanghai International Automobile City.

In response to the demand of Shanghai International Automobile City, Baidu Smart Cloud Data Crowdsourcing launched the "private labeling platform + dedicated base team" scheme. The scheme is custom-developed around the automatic driving labeling scenes and organizational management needs of the International Automobile City, and the capabilities of Baidu's leading labeling platform are extracted and deployed privately.

Among them, the "private labeling platform" of Baidu Intelligent Cloud Data Crowdsourcing supports dozens of annotation scenes such as 2D, 3D, continuous frames, and fusion annotation, and introduces AI pre-annotation and automatic quality inspection algorithms. Validated across tens of thousands of Baidu projects, its labeling efficiency is 20% ahead of the industry, and it also has comprehensive task, data, and labeling personnel management functions, effectively supporting enterprises in labeling management. At the same time, because the platform is privately deployed, data cannot be exported, which ensures data security.

How is data security ensured? For the "dedicated base team", Baidu and the Shanxi government have jointly built the largest data annotation base in the industry, with more than 2000 annotators who have received years of professional training. In line with Baidu's data security level regulations, the base applies a variety of strict security controls, such as signed confidentiality agreements, closed working rooms, real-time camera monitoring and sealed USB ports, to ensure data security at the level of personnel while achieving high-quality and efficient delivery. In this regard, Shi Jialiang, head of the Baidu Intelligent Cloud Data Crowdsourcing business, said: "Data security has always been our concern, and it is also the development of the entire AI industry

Baidu Intelligent Cloud Data Crowdsourcing Platform Security Label Scheme

Baidu Shanxi artificial intelligence data annotation base introduction video

Empowerment and co-building accelerate industrial upgrading

At present, the cooperation between the two sides on the platform and data has been implemented. The deployment of the deep learning annotation platform has strengthened the software capabilities of the International Automobile City. The "platform deployment + base annotation" model has greatly improved the Automobile City's data processing capability while ensuring data security. A large volume of high-quality data for automatic driving scenarios is continuously output from the Baidu Shanxi annotation base and, through the Automobile City platform, supports the maturing of industry algorithms.

At the same time, Baidu Intelligent Cloud Data Crowdsourcing is continuing to open up its automatic driving data collection and annotation capabilities, and is building a full set of product capabilities covering data annotation, storage, management, training, cleaning and evaluation according to industry needs. It has carried out in-depth cooperation on AI data with several local governments to help transform and upgrade local industries.

Shanghai International Automobile City has platform resources such as the National Intelligent Connected Vehicle Pilot Demonstration Zone and numerous public laboratories, providing more learning, communication, research, testing and data analysis opportunities for autonomous driving enterprises. The two sides cooperate and communicate, and continue to innovate and empower the industry in terms of products and ecology. There is no doubt that with the joint efforts of the industry, the intelligent future of the automobile industry is coming.


2019 China AI Basic Data Service Industry White Paper

2019-09-16 IResearch Consulting

Core summary:

After a period of unregulated growth, the AI basic data service industry has entered a growth period, and the industry landscape has gradually become clear. Upstream of the AI basic data service providers are data production and outsourcing providers, and downstream are AI algorithm research and development units. AI basic data service providers deliver an overall data resource service through their data processing and project management capabilities. However, AI algorithm R&D units and AI middle platforms can also provide some data processing tools, so the upstream and downstream of the industry widely overlap.

In 2018, the market size of China's AI basic data services was 2.586 billion yuan, of which data resource customization services accounted for 86%. The market size is expected to exceed 11.3 billion yuan in 2025. Supply comes mainly from AI basic data service providers and from the labeling teams that algorithm R&D units build themselves or contract directly, with the service providers being the main supporting force of the industry.

Data security, collection and annotation capability, data quality, management capability and service capability remain the pain points of demanders. AI basic data service providers are required to have a clear and specific security management process, understand algorithm labeling requirements in depth, provide focused and high-quality services, cooperate actively, and respond quickly to the demander's requirements.

With the growing demand for algorithms, manual annotation alone cannot meet market demand. The trend is therefore to strengthen the continuous learning ability of data processing platforms, let machines keep learning from manual annotation, and raise the share of manual work replaced by pre-annotation and automatic annotation. In the long term, more and more long-tail, low-probability events will generate additional data demand, and machine-simulated or machine-generated data will be a good way to meet it. Early research and development of the corresponding technologies will also become a moat for AI basic data service providers.

Overview of AI basic data service industry

Definition of AI basic data service: it refers to providing services in the form of data collection and annotation for AI algorithm training and optimization

AI basic data services are the data collection, cleaning, information extraction, annotation and other services provided for AI algorithm training and optimization, with collection and annotation as the focus. When the concept of artificial intelligence first took off, algorithms, computing power and data were the three most important elements. As the field entered the landing stage, applications such as intelligent interaction, face recognition and unmanned driving became the most popular, and AI companies began to compete on their ability to combine technology with industry. Data, as the "fuel" of AI algorithms, is a necessary condition for this ability. AI basic data services, which provide data collection, annotation and other services for training and optimizing machine learning algorithms, have therefore become an indispensable part of this AI boom. If computer engineers are AI's teachers, then basic data services are the teachers' teaching materials.

Development history of AI basic data services

The industry has entered the growth period, and the industry pattern is gradually clear

With the outbreak of the domestic AI boom, a large number of AI companies obtained financing. To keep improving algorithm accuracy, demand for data annotation surged to unprecedented levels, which for a time made the industry prosper. However, the early threshold for AI basic data services was low and the players were of mixed quality, which left industry standards vague and service quality uneven. As competition intensified, AI companies' quality requirements for training data kept rising, and once industry landing became the main theme, customized data collection and annotation for vertical scenarios became the mainstream demand. Many small AI basic data service companies could not meet the requirements for data quality and collection and annotation capability, and were either eliminated or attached themselves to large platforms. The industry landscape is gradually becoming clear, and the strength of leading companies increasingly stands out. With the growing demand for algorithms, the current approach of manual annotation as the mainstay with machine assistance needs to be improved: the continuous learning and self-learning capabilities of data processing platforms must be enhanced, the dimensions that machines can annotate increased, and the accuracy of machine data processing improved. Having machines undertake the main annotation work will become the focus of the industry in the next stage. In the future, more and more long-tail, low-probability events will generate additional data demand for which human-machine collaborative annotation is not cost-effective enough. Machine-simulated or machine-generated data will be a good way to solve this problem, and early research and development of the corresponding technologies will also become a moat for AI basic data service providers.

Industrial value of AI basic data services

At present, supervised deep learning is the mainstream, and labeled data is the basis of its learning

Artificial intelligence is a science that studies how to simulate human cognitive abilities with machines, and machine learning is the main means of realizing it at this stage. Machine learning methods usually learn patterns or decision rules from known data and build prediction models. Among them, deep learning forms more abstract high-level attribute categories by combining low-level features, and automatically learns effective features from information and classifies them without manual feature selection. With the advantages of automatic feature extraction, neural network structure and end-to-end learning, deep learning achieves the best learning results in the image and voice fields and is the most popular algorithm architecture today. In practical applications, deep learning algorithms mostly adopt supervised learning, that is, they need labeled data to give feedback on the learning results. Trained on a large amount of data, an algorithm's error rate can be greatly reduced. Today's applications such as face recognition, automatic driving and voice interaction are all trained this way, so there is huge demand for all kinds of annotated data. It can be said that data resources determine the height AI can reach today. Because the demand of supervised AI algorithms for labeled data far exceeds existing labeling efficiency and budgets, unsupervised learning, and weakly supervised and few-shot learning that require only a small amount of labeled data, have become directions of scientific exploration. At present, however, in terms of both learning effect and range of use, they cannot effectively replace supervised learning, and AI basic data services will continue to provide basic support for AI.

Main product forms of AI basic data services

Customized services are the main form of service, and dataset products focus on the voice segment

At present, domestic AI basic data services mainly take the form of dataset products and data resource customization services. Dataset products are usually standard datasets that AI basic data service providers produce from their own accumulated data; they are mainly voice datasets, chiefly Mandarin, English and dialect speech. To preserve their algorithmic advantage, customers more often use customized services: customers put forward specific needs, and data service providers either annotate the data provided by customers directly or collect and annotate the data themselves. To ensure data security, large demanders often provide their own web-based annotation platforms to the executing parties so as to control the overall project. Some AI basic data service providers also offer customers private platform construction services, or make their own platforms compatible with the customer's systems. Beyond these two forms, some AI basic data service providers are also expanding into algorithm services, providing algorithm training, model building and other services.

Development background of AI basic data service

The rise of AI economy provides long-term good fundamentals for basic data services

In 2010, major breakthroughs were made in speech recognition and computer vision, and the concept of AI began to emerge in China. By 2015, China had ushered in an AI entrepreneurial boom, with unicorns emerging and financing records constantly broken. From 2012 to August 2019, there were 2787 investment and financing events in the field of artificial intelligence, with a total financing amount of 474 billion yuan, making artificial intelligence the hottest financing area. Baidu, Alibaba, Tencent, JD, Huawei and other technology enterprises have also joined in. Since 2017, industrial landing has become the mainstream of the AI industry, and the AI-enabled real economy has maintained rapid development across security, finance, retail, transportation, education, medical care, marketing, industry, agriculture, enterprise services and many other fields. The explosive growth of the downstream has provided sound long-term fundamentals for the development of AI basic data services.

The amount of data grows exponentially, and the application of unstructured data depends on cleaning and labeling

The rise of the PC, the Internet and consumer-grade mobile devices announced the arrival of the data era. The development of the Internet of Things has made it possible to collect the large amount of data generated by offline businesses, and the amount of data is growing exponentially. According to IDC statistics, the amount of data produced annually in the world will surge from 16.1 ZB in 2016 to 163 ZB in 2025, of which 80% to 90% is unstructured data. In the past, computers mainly processed structured data, whereas artificial intelligence models are good at processing unstructured data. In China, more than 2 million hours of voice data and hundreds of millions of pictures need to be annotated every year.
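The growth implied by the IDC figures above can be checked with a short calculation. This is an illustrative sketch only: the function name `implied_annual_growth` is hypothetical, and it assumes growth compounds at a constant annual rate between 2016 and 2025, which the report does not state.

```python
def implied_annual_growth(start: float, end: float, years: int) -> float:
    """Constant annual growth rate that takes `start` to `end` in `years` years."""
    return (end / start) ** (1.0 / years) - 1.0


# Figures quoted from IDC in the text (zettabytes produced per year).
DATA_2016_ZB = 16.1
DATA_2025_ZB = 163.0

# Assumption: a constant compound rate over the 9 years from 2016 to 2025.
growth = implied_annual_growth(DATA_2016_ZB, DATA_2025_ZB, 2025 - 2016)

# Range of unstructured data in 2025, using the 80% to 90% share from the text.
unstructured_low_zb = DATA_2025_ZB * 0.80
unstructured_high_zb = DATA_2025_ZB * 0.90

print(f"Implied annual growth: {growth:.1%}")
print(f"Unstructured data in 2025: {unstructured_low_zb:.1f} to {unstructured_high_zb:.1f} ZB")
```

Under that assumption the rate comes out at a little over 29% per year, which corresponds to roughly a tenfold increase over nine years.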

Current situation of AI basic data service market

AI basic data service industry chain

AI basic data service provider is the core link of the industry

Artificial intelligence basic data service industry map

The upstream and downstream industries generally intersect

Upstream of the AI basic data service providers are data production and outsourcing providers, and downstream are AI algorithm research and development units. AI basic data service providers deliver an overall data resource service through their data processing and project management capabilities. They fall into two broad categories. The first has its own labeling base or full-time labeling team; these enterprises also take part in the upstream of the industry by directly providing production capacity. The second relies on crowdsourcing or outsourcing and focuses on developing data products and implementing projects. Some downstream AI companies have their own annotation tools, and can also obtain general-purpose annotation tools through AI middle platforms. At the same time, some enterprises with large data needs have incubated their own data service teams. On the whole, the upstream and downstream of the industry overlap.

Investment and financing of artificial intelligence basic data service industry

Financing is concentrated at the tens-of-millions level, and most projects are in early rounds

In terms of financing scale, most financing in the AI basic data service market is at the level of tens of millions of yuan. Over time, the amount of financing obtained by AI basic data service providers was relatively high in 2015, marking the beginning of the industry and its recognition by capital. In terms of the number of enterprises financed, few players have obtained financing so far, and the capital market is not active. In terms of rounds, most financing is still concentrated in early rounds. At present the only listed company is Datatang, on the NEEQ (not counting basic data service providers incubated within technology companies). Gross margins for AI basic data services are generally high, but to keep pace with cutting-edge algorithms in the AI market, providers must invest heavily in research and development to upgrade data processing platforms and tools, so the industry still depends strongly on financing.

Business model of AI basic data service industry

Production, customer acquisition and deployment drive development

The AI basic data service industry is a typical ToB business with a relatively stable business model. In terms of production, operations are mainly carried out through self-built annotation bases or annotation teams, crowdsourcing platforms, purchased business process outsourcing (BPO) services and other modes. Most enterprises mainly adopt crowdsourcing and outsourcing. Baidu Data Crowdsourcing, Besay and other enterprises have self-built annotation bases or full-time annotation teams, which helps in training high-quality staff and improving team management. In terms of customer acquisition, providers mainly enter the market through word of mouth, academic conferences and exhibitions, agency channels and other modes, which places high demands on sales personnel's familiarity with market trends and customer needs. In terms of implementation and delivery, there are two types, private deployment and public deployment, which can flexibly respond to customers' individual needs for data security, delivery cycle and cost.

Market scale of AI basic data services

In 2025, the market size will exceed 10 billion yuan, and the annual compound growth rate of the industry will be 23.5%

In 2018, the market size of China's AI basic data services was 2.586 billion yuan, of which data resource customization services accounted for 86.2%, dataset products for 12.9%, and other data resource application services for 0.9%. The compound annual growth rate of the industry is 23.5%, and the market size is expected to exceed 11 billion yuan in 2025. In terms of overall growth, the industry is relatively stable, and the continued development of the downstream AI industry will be a long-term positive.
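The forecast above follows from compounding the 2018 base at the stated growth rate. The sketch below reproduces that arithmetic; the function and variable names are hypothetical, and it assumes the 23.5% rate applies uniformly to each of the seven years from 2019 to 2025, which the report implies but does not spell out.

```python
def project_market_size(base: float, cagr: float, years: int) -> float:
    """Compound a base-year value forward by `years` at annual rate `cagr`."""
    return base * (1.0 + cagr) ** years


BASE_2018 = 2.586  # billion yuan, 2018 market size from the report
CAGR = 0.235       # 23.5% compound annual growth rate from the report

# Assumption: the same rate holds for every year from 2019 through 2025.
size_2025 = project_market_size(BASE_2018, CAGR, 2025 - 2018)

# 2018 segment sizes implied by the reported shares.
shares = {
    "data resource customization": 0.862,
    "dataset products": 0.129,
    "other application services": 0.009,
}
segments_2018 = {name: BASE_2018 * share for name, share in shares.items()}

print(f"Projected 2025 market size: {size_2025:.2f} billion yuan")
for name, value in segments_2018.items():
    print(f"2018 {name}: {value:.3f} billion yuan")
```

Compounding 2.586 billion yuan at 23.5% for seven years gives roughly 11.3 billion yuan, in line with the forecast quoted in the report.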

AI basic data service breakdown structure

Annotation-only services are the main part, and service providers account for 79% of supply

In 2018, China's AI basic data service market across voice, vision and NLP consisted mainly of annotation-only services, while combined collection and annotation services accounted for a relatively small proportion. This is because more of the data is provided by the demanders themselves, but it does not mean that market demand for data collection is weak. On the contrary, as AI technology is implemented, a large number of data needs have emerged in new vertical fields where data is hard to collect, and suppliers who can provide the relevant collection tools and services will gain a competitive advantage. Supply comes mainly from teams that enterprises build themselves or outsource directly, and from service providers; service providers are the main supporting force of the industry, accounting for 79%.

Market pattern of AI basic data services

Industry concentration will rise; CR5 currently accounts for 26% of the market share

At present, the CR5 of the artificial intelligence basic data service industry accounts for 26% of the market, so industry concentration is moderate: it is neither an oligopoly nor a fully competitive market. On the one hand, Baidu Data Crowdsourcing, Speechocean (Haitian Ruisheng), Datatang and other enterprises entered the market early and accumulated more customer resources. On the other hand, downstream enterprises previously trained their models on open datasets, and demand for highly accurate data has a short history. Because this effect is transmitted through the ecosystem with a lag, the market threshold is not yet significant, and small and medium-sized enterprises with weak capital and R&D strength still have fertile ground to develop. In the future, however, as downstream enterprises develop, using outsourced teams directly will be cheap and keep data security under control, and downstream enterprises will meet some basic needs themselves. The existing market of external data service providers therefore faces decline, and they must take on difficult, cutting-edge and unique tasks. This requires them to invest in the research and development of high-precision, specialized data processing tools and in basic research on artificial intelligence algorithms, in order to grasp customer needs and develop incremental markets. Capital and R&D strength thus become a high industry threshold. At the same time, with the capital market cooling in recent years, a number of small and medium-sized vendors face business contraction. Moreover, some vendors, such as Besay, have begun mergers within the industry. Judging by the development of the overseas data service market (the overseas industry giant Appen has acquired other enterprises many times), M&A will also become a market trend. Under the influence of these factors, industry concentration will increase.

Analysis of AI basic data service scenarios

Current status of the vision basic data service market

Portrait and OCR data are the mainstream of vision basic data services

Excluding automatic driving, the vision basic data service market reached 660 million yuan in 2018. Portrait and OCR data are the mainstream: portrait data accounts for 42.9% of the market and OCR for 27%. Data for other new scenarios, such as human body recognition, commodity identification, industrial quality inspection and medical imaging, is scattered and accounts for 30.1% of the market in total.

Technology trends in vision basic data services

Judge the data demand according to the direction of algorithm research and development, and mine the incremental market

By direction of use, data demand falls into three categories: building and developing new algorithm models, adding new modules to existing algorithms, and customized optimization during solution delivery. Data requirements for the first two can be judged and predicted from the cutting-edge research and development directions of the corresponding machine vision algorithms. In the smart city scenario, for example, face recognition and video structuring for Han people are relatively mature, but in actual applications the algorithms also need to be optimized for ethnic minorities and other ethnic groups to improve overall accuracy. In addition, cross-camera tracking has become a research and development hotspot for this scenario, and how the corresponding cross-camera data is labeled will have a great impact on algorithm training. Depth cameras can help computers understand surveillance video in three dimensions and better solve the problem of collecting vision data under complex lighting conditions, so they will also become an important research and development direction. To sum up, collection and annotation services for multi-ethnic and multi-racial data, cross-camera data and 3D data will bring incremental space for the vision basic data service market. In the same way, other fields such as OCR, mobile phones and retail can tap incremental markets by following the direction of algorithm research and development.

Application scenario of automated driving basic data service

The algorithms are not yet mature, there is long-term demand for data, and the gap remains

An automatic driving system above level L3 mainly comprises five parts: perception, positioning, prediction, decision-making and control. Its demand for computer vision technology is much higher than that of ADAS. The system needs to extract, process and fuse the point cloud and image data collected by sensors such as radar and cameras, build a model of the vehicle's driving environment, and provide a basis for prediction and decision-making, which is a great test of the accuracy and real-time performance of the algorithms. At present, the vision technology of automatic driving mainly applies supervised deep learning, which derives the functional relationship between known input variables and dependent variables and requires a large amount of annotated data to train and tune the model. In world-class driverless competitions, the organizers often provide nearly 100 million pictures and hundreds of thousands of labeled pictures for training by the participating teams. During road testing or real road driving, complex conditions such as mixed pedestrian and vehicle traffic, dense distribution and variable behavior require massive real road data to continuously optimize the algorithms and keep driverless vehicles usable. Automatic driving is developing rapidly in China, with many participants such as AI companies, technology companies, high-precision map vendors and car manufacturers. Data collection and annotation for this field has become one of the main items of AI basic data services. Automatic driving algorithms still need to be optimized in application, so the data demand gap remains and the market is far from saturated.

Current situation of basic data service market for automated driving

In 2025, the collection and annotation market will exceed 2.4 billion yuan, and technology companies and car manufacturers are the main demanders

The basic data of automatic driving mainly includes road traffic images, obstacle images, vehicle driving environment images and so on. The demand side is dominated by technology companies, automobile manufacturers and high-precision map manufacturers, which account for 49%, 47.2% and 3.8% respectively. In 2018, the basic data service scale of the automatic driving industry was 576 million yuan, and it is expected to exceed 2.4 billion yuan in 2025. The total task volume of industry data exceeds 100 million, and the ratio of 2D image annotation tasks to 3D point cloud annotation tasks is roughly 2:1. Among the demanders, high-precision map manufacturers have relatively mature algorithms and can annotate about 90% of their data automatically, so they have less outsourcing demand. Automatic driving technology companies, represented by Baidu and TuSimple, have always been the main buyers of basic data services in this field; on average, each company's cumulative demand for algorithm training image data exceeds 10 million images. As implementation projects accelerate, there will be more demand for segmented scenarios. In recent years, automobile manufacturers have invested significantly in ADAS and automatic driving, with SAIC, Geely and other manufacturers each investing hundreds of millions of yuan annually, and their demand for data collection and labeling is increasing year by year. Automobile manufacturers are expected to become the main source of demand in the next three years.

Current situation of intelligent interactive basic data service market

Far field voice interaction has become the mainstream demand, and Chinese data still occupies the core of the market

In 2018, the market size of voice interaction related data services reached 1.35 billion yuan. Voice interaction is mainly divided into near-field, mid-field and far-field interaction. Demand for data services for mid-field and far-field products such as smart audio-visual home devices, interactive robots and in-vehicle units accounts for 68% of the basic data services for intelligent interaction, making it the mainstream demand. Low-noise environment services for far-field voice interaction therefore have strong development potential and bargaining power. In terms of service languages, Chinese (including dialects) accounts for 71% of the market. Foreign language resources are relatively scarce, harder to collect and label, and more costly, and currently account for 29% of the market.

Technical trend of intelligent interactive basic data service

Realize compound data annotation across voice recognition and semantic understanding

At present, in building intelligent interaction systems, enterprises have relatively complete technical capabilities for plain voice recognition or synthesis, and more pain points in context understanding, multi-round dialogue, emotion recognition, fuzzy semantic recognition, intention judgment and so on. Iterating and designing NLP data products that match the needs of the algorithms, in step with the development of intelligent interaction system algorithms, helps promote intelligent interactive systems from the data level. In particular, the effect of a dialogue system depends heavily on the quality and scale of the annotated data. Constrained by both annotated data and model capabilities, however, dialogue systems still cannot connect the entire interaction process across voice and semantics. Composite data annotation that spans voice recognition and semantic understanding can help reduce information loss between voice and text and improve the effect of the whole dialogue process, which opens up further possibilities for intelligent interactive basic data services.

Analysis of AI basic data service requirements

AI basic data service customer positioning

Customers are divided into four categories: AI companies, technology companies, scientific research institutions and industrial enterprises

On the demand side, AI companies and technology companies account for the main share. AI companies focus more on one type of basic data service, such as vision or voice, whereas technology companies draw on the advantages of their group and invest in AI as a whole, so their different departments have multiple types of data needs. Scientific research institutions have relatively small demand. In addition, traditional industry enterprises, such as automobile manufacturers, mobile phone brands and security vendors, have begun to develop technology around their own businesses and to generate demand for AI basic data. This demand is gradually increasing and will release more market space in the future.

Types of core requirements for AI basic data services

The three stages of AI application generate differentiated demands for basic data services

Enterprises applying AI algorithms go through three stages: R&D, training and implementation. Each stage has different requirements for AI basic data services. R&D demand is the data demand generated when new algorithms are developed and expanded, and is generally large in volume; standard dataset products are used for training in the initial stage, while professional data customization and collection and annotation services are required in the middle and later stages. Training demand, which aims to improve the accuracy, robustness and other capabilities of existing algorithms through labeled data, is the main demand in the market; it centers on customized services and places high requirements on the accuracy of algorithms. In the landing stage, the algorithms are relatively mature, and the data collection and annotation involved is tied more closely to specific businesses, such as paint identification data in aircraft maintenance. This places strong requirements on annotation capability and on the supplier's willingness to proactively propose optimization suggestions.

Pain points of AI basic data service demand

Five pain points of demand determine the service standards of AI basic data service providers

At present, when selecting data services, demanders often encounter pain points in data security, collection and annotation capability, data quality, management capability and service capability. For data security, demanders want the basic data service provider to have a clear and specific security management process, and pay particular attention to data transmission, storage, and data destruction after project closure. For collection and annotation capability, demanders' algorithms are getting closer to the business, so they want the data service provider to be able to collect and annotate data in fields with a certain threshold, such as automatic driving and industry, and to understand the customer's intention, cooperate on labeling and even put forward labeling suggestions. For data quality, market feedback indicates that when most data service companies deliver a project for the first time, data accuracy is generally low and one or two rounds of rework are required, so demanders prefer companies with less invalid data and higher accuracy. For execution efficiency, AI basic data service providers can generally finish within the project cycle, but companies with weak management ability can hardly concentrate on serving each customer well when handling multiple projects; the quality and reputation of the execution team are also important factors. Service awareness is a soft power, which requires AI basic data service providers to cooperate actively and respond quickly to the demander's requirements.

Trends and suggestions of AI basic data services

The awareness transition of enterprises from passive execution to active service

Data collection and labeling driven solely by the demands of customers' individual projects is passive execution, with little initiative and limited industry boundaries. The products and services of the various companies tend to be homogeneous and competition is stuck, which restricts the development of AI basic data services. Research on demanders shows that, beyond the core concerns of security, quality and efficiency, more and more of them want active service from data service companies. They hope that data companies can better understand the algorithm technology and the demand scenarios, even participate in the research and development of algorithms, and give optimization suggestions on data collection and annotation. This also gives data service providers an opportunity to compete through differentiation. Especially in the AI landing stage, an integrated AI basic data solution covering research, consultation, design, collection and labeling can be formed in vertical scenarios, achieving breakthroughs in revenue and business boundaries.

2019 Baidu Cloud Intelligence Summit: Data Intelligence Boosts Industrial Upgrading

2019-08-30 Baidu data crowdsourcing

On August 29, ABC SUMMIT 2019 Baidu Cloud Summit was grandly opened in Beijing National Convention Center. As the most influential industry conference in the field of ABC, the conference, with the theme of "AI industrialization, accelerating industrial intelligence", showcased the transformation of Baidu ABC from 1.0 to 3.0, and the transformation of AI from standardization, process, and scale to industrial exploration and practice.

The sub forum of data intelligence ecology was full

Data intelligence to promote industrial AI upgrading

Data is fundamental to the development and empowerment of AI. At the conference's data intelligence ecological sub forum, industry experts thoroughly interpreted data intelligence and its development in industrial ecology, shared the application of AI basic data services in multiple typical vertical scenarios such as automatic driving, intelligent environment, intelligent terminals, and provided reference for data intelligence enabling industry ecology.

Gao Guorong, General Manager of Baidu Smart Cloud Data Intelligence, gave a keynote speech

Gao Guorong, general manager of Baidu Intelligent Cloud Data Intelligence, pointed out in the keynote speech "Data Intelligence Drives Industrial Transformation and Upgrading" that the practice of data intelligence in cities, industry, transportation, manufacturing and finance has fully proved that there is only as much AI as there is data intelligence. Data intelligence is a refinery in the era of artificial intelligence, and has become the core driving force for industrial intelligence upgrading. Baidu Intelligent Cloud draws on advanced technologies such as artificial intelligence, cloud computing and big data, together with its Internet data advantages, to dig into the "data dilemma" in industrial intelligence. It addresses the bottlenecks that AI applications face, such as difficult data acquisition, weak management, low security and complex application scenarios, and enables the urban economic brain, industry marketing reform, and industry restructuring and upgrading. In this way it steadily promotes the penetration of AI into various industries, helps improve efficiency across fields and industries, and improves people's experience.

Data intelligence becomes the core driving force of industrial intelligence upgrading

Finally, Gao Guorong emphasized that Baidu Smart Cloud has opened the whole life cycle of AI data services, and realized one-stop data intelligent services for AI industry scenarios from data processing, data development, data applications and other links, helping AI industrialization and accelerating the process of industrial intelligence in China.

On-site guests listening attentively to the speech

AI basic data service, supporting the development of AI industry

Shi Jialiang, head of Baidu's smart cloud data crowdsourcing business, explained in his speech that in an era when data is king, efficient and secure access to massive structured data has become another core competitive strength of AI enterprises, after technical barriers such as algorithms and computing power. Baidu's intelligent cloud data collection resources cover 40 countries and regions around the world, the eight major dialect areas in China, and people of all ages from 15 to 60. The collection process combines automatic intelligent audit with three rounds of manual quality inspection to meet the data delivery requirements of different customers. At present, Baidu intelligent cloud data crowdsourcing can collect more than 30000 portraits per week and 50000 hours of voice per week. The collection service is highly customizable, and its customer reputation ranks first in the industry.

Shi Jialiang, Baidu Intelligent Cloud Data Crowdsourcing Business Leader, talks about data collection

Shi Jialiang emphasized that Baidu Intelligent Cloud Data Crowdsourcing has four magic weapons in data annotation: the most comprehensive annotation tools, the most efficient process platform, the most intelligent automatic annotation and the richest resource capability. The crowdsourcing platform has more than 200000 active labeling users and 20000 professional labelers, and Baidu (Shanxi) data labeling base was built in 2018. Through professional training and centralized management of labeling personnel, a group of labeling teams with rich experience and the ability to tackle difficult tasks has been selected on the basis of production capacity. At present, the base has 2000 full-time professional labeling personnel, and its labeling scenarios cover key AI fields such as intelligent driving, computer vision and speech recognition, with an annotation accuracy rate above 98% in vertical scenarios.

Shi Jialiang said that as the AI industry develops, the requirements for data quality and scenarios can be expected to become more stringent and complex, but Baidu Intelligent Cloud Data Crowdsourcing is confident that, relying on its comprehensive strength as first in brand, scale and technology in the AI basic data industry, it can continue to contribute to the AI industry.

Baidu (Shanxi) Artificial Intelligence Basic Data Industrial Base Phase II Launching Ceremony

2019 Baidu (Shanxi) artificial intelligence data annotation base Phase I Award for Outstanding Agents

Pay equal attention to quality and safety, and comprehensively help AI development

At the meeting, Yang Fei, chairman of the Technical Committee of Baidu Intelligent Cloud Quality Department, described the practice plan for intelligent driving data integration and said that landing intelligent driving scenarios requires a large amount of high-quality data. The integrated intelligent driving data plan provided by Baidu Intelligent Cloud covers the four links of data "collection", "annotation", "management" and "training". Its products and services, including data collection, data annotation, data management, data training and defect mining, ensure data quality, improve data management efficiency, shorten the model training cycle, and realize data-driven model iteration.

Sharing of intelligent driving data integration practice scheme

Data security is also a focus of the industry. Shen Jian, senior product manager of Baidu Smart Cloud, pointed out when interpreting the relevant laws and regulations on data security that, while data plays its role in AI, attention must be paid to the compliance and legality of data acquisition and data processing, and to protecting the security of private information. The data security mechanism provided by Baidu Smart Cloud applies layer-upon-layer technical and process controls across data collection, data flow and data handling, assuring the security of customer data from the source so that data can be safely used by AI.

Data security specification interpretation and practice sharing

The future is promising, and data intelligence is accelerating

AI is moving towards a new stage of industrialization. Data is the fuel of AI, but it must be collected and labeled before its value can be awakened. The iResearch consulting report pointed out that since 2017, the AI-enabled real economy has maintained a rapid development trend. Data intelligence has effectively integrated technology, business and data, accelerated the development of industrial intelligence, promoted the innovation of enterprise models, and promoted the implementation and application of AI technology in security, finance, retail, transportation, education, medical care, marketing, industry, agriculture and other fields. It is estimated that the market size of the AI basic data service industry will exceed 6 billion yuan in 2022.

Prospect analysis of AI basic data service industry

Perhaps in 30 years' time, looking back, AI will be another technology that has a profound impact on human beings no less than the Internet, and its power will once again completely change human production and life. However, at present, AI is still full of unknown exploration, and the road is long. But all participants and builders of data intelligence are working hard and looking forward to this day with confidence.

Builders and participants of artificial intelligence

Expert column | Basic data service, the key to AI's intelligence

2019-08-07 Baidu data crowdsourcing

Today, artificial intelligence has gone deep into daily life, bringing convenience to people. People can't help sighing that the promotion of AI from concept to product to daily life is too fast! What is behind the rapid development? Are engineers burning brain cells and fast developing algorithms? All right, but don't forget the basis of AI - data.

This article will reveal how Baidu's intelligent cloud data crowdsourcing service has become the cornerstone of AI, and do a good job in data collection, labeling and quality control for it. At the same time, it reveals how the data crowdsourcing team started from scratch and has gradually become the first brand, scale and technology in the AI basic data industry.

Data is the foundation of AI development

It is often said in the industry that "there is as much intelligence as there is manual labor". To build an algorithm model, massive amounts of labeled data must be fed in to train the machine, so that the machine can learn and become "intelligent". The "data collection and annotation" business of the data crowdsourcing team serves this need.

Data annotation helps the machine learn to recognize the characteristics of data. For example, if we want to develop a face recognition product, we should first let the machine "know" what a face is, but the machine cannot recognize a face image that is simply handed to it. We need to label the face image first, marking its facial features. After the machine has been fed a large number of labeled images for learning, we can give it a face image and it will know that this is a human face.

Data is the foundation of AI development. In the words of Shi Jialiang (Baidu Intelligent Cloud Data Crowdsourcing Business Leader), "AI was the same as babies at the beginning". AI needs data for growth, just as babies need food. However, these "food" cannot be directly consumed by AI and needs to be processed later. What the crowdsourcing team is doing is helping babies get food and process food.

Multi mode development, the largest scale in the industry

Generally speaking, there are two business models of crowdsourcing platforms, crowdsourcing model and outsourcing model. The crowdsourcing model has the advantage of fast response. Once the platform task is released, someone responds to the order immediately, and there is no middleman to earn the price difference, so the cost is low. However, the crowdsourcing model has an obvious disadvantage, that is, it is difficult to control the quality, and poorly trained personnel will inevitably have the possibility of "random labeling". The outsourcing mode is to outsource the annotation task to a special data annotation team, which can ensure high data quality. However, compared with the crowdsourcing model, the response speed is slow and the cost is high.

Single use of any business model has obvious drawbacks and is not feasible. To this end, the crowdsourcing team has signed a large number of downstream suppliers on the one hand, and has built its own data annotation base to cultivate professional crowdsourcing personnel on the other hand. Both modes ensure the activity and quality of labeling personnel.

At present, there are more than 500 downstream agents signed with the crowdsourcing team, and more than 20 million crowdsourcing users on the platform. Among them, there are 100000 to 200000 professional labelers. The ability of such downstream agents is hard to surpass in the industry, and even many competitive products in the industry are downstream of the crowdsourcing team.

In addition, in 2018, the data crowdsourcing team established its own labeling base in Shanxi, and now there are more than 1500 people. It is estimated that the number of people will exceed 2000 by the end of the year. The crowdsourcing team will be fully managed by itself, and supervise the labeling quality and efficiency in the whole process.

The huge crowdsourcing workforce and its upstream position in the industry give the crowdsourcing team an absolute advantage in cost performance. Shi Jialiang said, "The high cost performance is inseparable from our internal product accumulation and scale of development. Technology and management are also key: we have a complete set of online management systems that can schedule users sensibly, so that our products help customers reduce costs while ensuring quality."

Technology plus management, equal emphasis on quality and efficiency

Of course, alongside cost performance, providing users with high-value data services is also a primary consideration.

Customers who need AI basic data processing are all enterprises in the AI field, and their development depends mainly on three capabilities: computing power, algorithms and data. For computing power, the market is basically free of barriers, and hardware is universal. For algorithms, each company has its own strengths and weaknesses, but in the short term a company's algorithms will not change qualitatively, and it is impossible to greatly improve or transform them. This makes data the focus of competition for each company. Obtaining larger-scale and higher-quality data is the value that the crowdsourcing team brings to customers, ultimately improving the effect of their AI applications, including accuracy and recall.

Data crowdsourcing has its own set of product mechanisms to ensure the quality of data services. During labeling, the crowdsourcing team supervises the whole process, and its self-developed system automatically analyzes the behavior of labeling personnel. For example, when a face is being labeled in a photo, the system monitors the total labeling time, the time interval between marks, the movement track of the mouse and other details, so as to judge and predict whether the labeling of this photo is correct and whether anything has been omitted.
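The behavior monitoring described above can be illustrated with a minimal sketch. The article does not disclose how Baidu's system actually scores a labeling session, so the `LabelingSession` fields, the three rules and every threshold below are hypothetical, chosen only to show how total time, mark intervals and mouse travel could feed a simple rule-based check.

```python
from dataclasses import dataclass
from statistics import mean
from typing import List


@dataclass
class LabelingSession:
    """Behavior log for one labeled photo (all fields are hypothetical)."""
    total_seconds: float          # time spent labeling the whole photo
    mark_intervals: List[float]   # seconds between consecutive marks
    mouse_path_px: float          # total mouse travel, in pixels


def suspicion_reasons(session: LabelingSession,
                      min_seconds: float = 3.0,
                      min_mean_interval: float = 0.2,
                      min_path_px: float = 50.0) -> List[str]:
    """Return the names of the heuristic rules that the session violates.

    The thresholds are illustrative assumptions, not values from the article.
    """
    reasons = []
    if session.total_seconds < min_seconds:
        reasons.append("too_fast")
    if session.mark_intervals and mean(session.mark_intervals) < min_mean_interval:
        reasons.append("rapid_marks")
    if session.mouse_path_px < min_path_px:
        reasons.append("mouse_barely_moved")
    return reasons


careful = LabelingSession(20.0, [1.2, 0.9, 1.5], 800.0)
rushed = LabelingSession(1.0, [0.05, 0.05], 10.0)
print(suspicion_reasons(careful))  # no rule violated
print(suspicion_reasons(rushed))   # all three rules violated
```

A production system would more likely learn such signals from historical data than hard-code them. Sessions flagged this way could then be routed to the manual quality inspection rounds.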

Data quality also depends on multiple rounds of quality inspection in the later stage. Labeled data is not handed to the customer directly. It first goes through two or three rounds of quality inspection, including automated sampling and automated plus manual sampling, which greatly helps ensure data quality.
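The multi-round sampling inspection can be sketched as follows. This is only an illustration under stated assumptions: the function name `run_inspection_rounds`, the sample size and the acceptance threshold are hypothetical, and each round's checker (automated or manual) is modeled as a plain function that returns whether an item is labeled correctly.

```python
import random
from typing import Callable, List, Sequence, Tuple


def run_inspection_rounds(items: Sequence[dict],
                          checkers: List[Callable[[dict], bool]],
                          sample_size: int,
                          min_accuracy: float,
                          seed: int = 0) -> Tuple[bool, int, float]:
    """Run one sampling round per checker and stop at the first failure.

    Returns (passed, last_round_number, accuracy_of_that_round).
    """
    if not items or not checkers:
        raise ValueError("need at least one item and one checker")
    rng = random.Random(seed)           # fixed seed keeps the sketch reproducible
    k = min(sample_size, len(items))
    accuracy = 0.0
    for round_no, is_correct in enumerate(checkers, start=1):
        sample = rng.sample(list(items), k)   # draw a fresh sample each round
        accuracy = sum(1 for it in sample if is_correct(it)) / k
        if accuracy < min_accuracy:
            return False, round_no, accuracy  # batch goes back for rework
    return True, len(checkers), accuracy


batch = [{"id": i, "ok": True} for i in range(100)]
auto_check = lambda item: item["ok"]      # stands in for an automated check
manual_check = lambda item: item["ok"]    # stands in for a human reviewer
print(run_inspection_rounds(batch, [auto_check, manual_check], 20, 0.98))
```

Stopping at the first failed round mirrors the idea that a batch is reworked before it reaches later, more expensive manual rounds.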

Open up upstream and downstream, and develop business in depth

At present, there is still a question in the industry. If AI gradually matures, will the demand for data services gradually decrease?

"For a long time to come, data services will remain a rigid demand. For example, demand for the crowdsourcing team's data services has kept growing this year in the two key areas of smart home and driverless driving. In addition, Baidu is an AI company, and the strength of the brand gives customers greater confidence in Data Crowdsourcing's data security, data privacy and project delivery time," Shi Jialiang said.

Once business volume is secured, Data Crowdsourcing will consider developing the business in depth, extending from labor-intensive data labeling to upstream and downstream. The upstream extension is data collection, especially vehicle and road information collection. As one of the few enterprises with a mapping qualification in China, Baidu has obvious advantages over other leading Internet companies: "Baidu is almost the only company in this market". The downstream extension is to provide software and platform services in data management, data model training, data application and data iteration.

Since 2010, the crowdsourcing team has focused on collecting disorderly and chaotic data, cleaning and labeling, and supporting the birth of numerous precision intelligent products. In addition to supporting Baidu's internal business, it has also externally enabled data processing capabilities to benchmark customers in various industries such as the Internet of Things, unmanned vehicles, and intelligent voice, and its service capabilities have won unanimous praise.

In the future, Baidu Intelligent Cloud Data Crowdsourcing will open up the whole life cycle of AI data services, realize one-stop data support services for AI commercialization scenarios from data acquisition, processing, model training and other links, and help AI enterprises improve product competitiveness.

Baidu (Shanxi) Artificial Intelligence Basic Data Industry Project was listed as the key promotion project in the big data field of Shanxi Province in 2019

2019-07-29 Baidu data crowdsourcing

Baidu (Shanxi) Artificial Intelligence Basic Data Industry Project is a professional, centrally managed AI data annotation base built by the Baidu Intelligent Cloud Data Crowdsourcing Team (Baidu public beta) with the support of Shanxi Comprehensive Reform Demonstration Zone. At present, the base has nearly 10000 square meters of office space and 1500 professional annotators and auditors, a number expected to increase to 2000 in 2019. At that point, the base will become the largest single site for data annotation in professional fields in China.

Baidu (Shanxi) data annotation base is located in Tanghuai Industrial Park, Shanxi Comprehensive Reform Demonstration Zone

At present, the base's business comprehensively covers labeling and processing services for unmanned vehicle, voice, face, image, NLP, mapping and other data types, and it has mature methods for personnel management, project management and quality management. For key customers, the base can assign an exclusive labeling team of 10 to 200 people, providing long-term, stable, professional and high-quality services in a closed site with an exclusive network environment, to ensure customer data security and on-schedule, high-quality project delivery.

Relying on the stable team of professional annotators at Baidu (Shanxi) data annotation base and its industry-leading quality assurance mechanism, Baidu's intelligent cloud data crowdsourcing business has been able to continuously serve the industry and internal product lines, provide high-quality data annotation and cleaning services, and assist in the improvement of AI algorithms. Its service capability, which attaches equal importance to efficiency, quality and safety, has been highly recognized by internal and external customers.

Interior view of Baidu (Shanxi) data annotation base (part)

Since its establishment in 2018, the base has received visits and guidance from many leaders including the governor of Shanxi Province. As the brand window of Baidu in Shanxi, Baidu (Shanxi) AI data annotation base actively responds to the relevant policies of Shanxi Province to accelerate the development of data annotation industry, attracts young talents to employment, cultivates multi-level data annotation talents, builds the advantages of AI development in Shanxi Province, and drives the comprehensive transformation and upgrading of industries related to industry, medical care, transportation, etc.

On July 21, 2019, Luo Huining, Secretary of Shanxi Provincial Party Committee, and his delegation visited the base

On June 28, 2019, the delegation led by academician Zhou Ji of the Chinese Academy of Engineering visited the base

Wang Ligang, Executive Vice Mayor of Taiyuan, Shanxi, and his delegation visited the base on June 20, 2019

On May 7, 2019, Lou Yangsheng, Governor of Shanxi Province, and his delegation visited the base

On March 26, 2019, Wang Yixin, Vice Governor of Shanxi Province, and his delegation visited the base

In the future, the Base will launch further cooperation programs with Shanxi Transformation and Comprehensive Reform Demonstration Zone. With the support of Shanxi Provincial Government, Baidu will lead the construction of professional data annotation industrial park, build online data trading platform, build professional data sets such as unmanned vehicles and dialect voice, and continue to help the development of Shanxi's data annotation industry.

How does intelligent data crowdsourcing play a catalytic role in the process of industrial intelligent upgrading?

2019-07-04 Baidu data crowdsourcing

On July 3, 2019, Baidu AI Developers Conference opened in Beijing National Convention Center. The conference set up a main forum entitled "Baidu Intelligent Cloud ABC+X, Accelerating Industrial Intelligent Development", and dozens of sub forums represented by "Intelligent Cloud and Internet of Things Forum", which will last until the 4th.

The important role of data in the AI era

With the continuous advancement of new technologies such as the Internet of Things and 5G, China's big data industry market has maintained a rapid growth trend and gradually penetrated into all walks of life, promoting China to become an intelligent power. There is no doubt that data is the fuel of the AI era, which determines the use effect of AI applications and is an important basis for accelerating the intelligent upgrading of industries. As a leading AI data service platform in China, Baidu Data Crowdsourcing is committed to creating first-class and complete AI data services to meet the personalized needs of customers in various industries and help the intelligent upgrading of various industries in China.

In the face of the data dilemma that enterprises cannot solve by themselves in the process of industrial intelligent upgrading, Baidu Data Crowdsourcing can provide one-stop, customized design and implementation of data acquisition and processing schemes according to customer needs in specific fields and scenarios, and deliver standardized, structured and usable data to customers. Its data types cover a comprehensive range of application scenarios, including text, image, audio, video and web page data.

Three-step optimization catalyzes the birth of usable AI data

In the "Intelligent Cloud and Internet of Things Forum", Gao Guorong, the head of Baidu's intelligent cloud data crowdsourcing business, delivered a keynote speech entitled "Intelligent data crowdsourcing accelerates the intelligent upgrading of the industry", deeply analyzed the intelligent optimization of Baidu's data crowdsourcing in data collection, data annotation and data use, and effectively catalyzed the birth of available AI data.

Gao Guorong, head of Baidu smart cloud data crowdsourcing business, delivered a keynote speech

Data acquisition is the first step in the birth of AI data. Baidu Data crowdsourcing adopts seamless collection of multi-dimensional multimedia data and matches the most stringent privacy compliance mechanism, which meets the requirements of data regulations in various countries and has been highly recognized by many customer security departments. But what is more worth highlighting is the more intelligent and efficient quality detection steps in the data acquisition process. Baidu Data crowdsourcing firmly believes that quality is the lifeline of AI data. Before conducting three rounds of manual audit on the collected data, it introduced the intelligent pre audit technology independently developed. In this way, it not only effectively saves manpower and improves efficiency, but also makes the accuracy of the final collection results as high as 100%.

Baidu and Shanxi government jointly build the process of data annotation base

Following data collection, Baidu's data crowdsourcing annotation business is characterized by full scenario coverage, high quality, high efficiency and strong professionalism, and can provide high-quality, fast and professional full-scenario annotation services for the intelligent needs of various industries. According to Gao Guorong, Baidu Data Crowdsourcing has the strongest annotation resources in the industry, reflected mainly in the combination of crowdsourcing resources and self-built annotation bases. Baidu Data Crowdsourcing cooperated with Shanxi Provincial Government to build a data annotation base, which was listed as a "key promotion project in 2019" by Shanxi Provincial Department of Industry and Information Technology. When inspecting the data annotation base, Wang Yixin, vice governor of Shanxi Province, proposed that the base construction fund should be no less than 100 million yuan. On this highly valued government-enterprise cooperation annotation platform, Baidu Data Crowdsourcing strongly supports intelligent auxiliary annotation technologies such as automatic prediction of continuous frames and automatic edge fitting for object segmentation, breaking through the blind spots and bottlenecks of traditional visual annotation and improving annotation efficiency significantly, by 20 times or more.

Baidu Intelligent Auxiliary Annotation Technology

From the perspective of data annotation quality, Baidu 2D visual inspection algorithm, 3D point cloud detection algorithm and other automatic quality inspection algorithms effectively ensure the annotation quality. According to valid data, the proportion of label errors detected by this series of automatic monitoring is about 70%.

Baidu automatic quality inspection algorithm

After data acquisition and processing, applying data to drive model iteration is also very effective. Baidu Data Crowdsourcing uses intelligent data mining to evaluate the built model, discover its obvious defects in time, and effectively guide model iteration. Gao Guorong explained this in detail with the example of face recognition: by drilling down into the evaluation results across the collected data sources, Baidu Data Crowdsourcing found the main defect of the current model, namely that overall recognition accuracy in dark scenes was insufficient. The annotators therefore added more face annotation data from dark scenes to the iteration process. In the end this long-tail problem was effectively solved and a satisfactory landing effect was achieved.
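The drill-down evaluation in the face recognition example amounts to slicing accuracy by a scene attribute and looking for the weakest slice. The sketch below assumes each evaluation record carries a hypothetical `lighting` attribute and a boolean `correct` field. The article does not describe the real evaluation schema.

```python
from collections import defaultdict
from typing import Dict, Iterable


def accuracy_by_group(records: Iterable[dict], key: str) -> Dict[str, float]:
    """Compute recognition accuracy separately for each value of `key`."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for record in records:
        group = record[key]
        totals[group] += 1
        hits[group] += int(record["correct"])
    return {group: hits[group] / totals[group] for group in totals}


def weakest_group(records: Iterable[dict], key: str) -> str:
    """Return the group with the lowest accuracy, i.e. the likely model defect."""
    accuracy = accuracy_by_group(list(records), key)
    return min(accuracy, key=accuracy.get)


# Toy evaluation results; the values are made up for illustration.
results = [
    {"lighting": "bright", "correct": True},
    {"lighting": "bright", "correct": True},
    {"lighting": "bright", "correct": True},
    {"lighting": "bright", "correct": False},
    {"lighting": "dark", "correct": True},
    {"lighting": "dark", "correct": False},
]
print(accuracy_by_group(results, "lighting"))
print(weakest_group(results, "lighting"))
```

The weakest slice then indicates which kind of additional annotated data, dark-scene faces in the article's example, should be added to the next training iteration.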

Evaluation results show that face recognition is not accurate in dark environment

Intelligent data crowdsourcing catalyzes industrial intelligence upgrading

To sum up, Baidu Data Crowdsourcing is a leading data platform for AI developers in China. Across the whole chain of AI data development, it integrates online crowdsourcing resources, offline agent resources and the data ecological industrial park, and combines them with efficient, high-quality and professional data collection and labeling to form a complete, standardized data platform covering data management, model management, model training, model evaluation and resource scheduling. By making use of data and computing resources and accelerating model iteration, the platform can effectively shorten the AI development cycle and accelerate the intelligent upgrading of all walks of life.

Closed loop labeling scheme of urban roads based on pure vision

2019-06-26 Yang Xue

At the just-concluded CVPR 2019, the world's top computer vision conference, Wang Liang, chairman of the Baidu Apollo Technical Committee, disclosed Baidu Apollo Lite, a pure vision closed-loop solution for urban roads. The solution uses 10 cameras to achieve 360-degree real-time environmental perception.

Compared with sensor fusion schemes that require lidar, millimeter-wave radar, on-board cameras and so on, the pure vision closed-loop scheme has the following advantages. First, the data obtained is the most similar to the real world as perceived by human eyes. Second, the camera installation cost is low, and the problem of failing vehicle inspection is avoided. Third, the video data collected by the cameras contains more information. In this issue, we will talk about how the 10 channels of basic structured data required for this scheme are generated.

Baidu has 500 professional collection vehicles equipped with intelligent devices, which can cover the roads of major cities in China. After the video data collected by a vehicle is transmitted back to the platform, the data is first cleaned and frames are then extracted at 10-200 frames/second.
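Frame extraction at a target rate can be sketched as picking evenly spaced frame indices from the source video. The function name `frame_indices` and the integer frame rates used below are assumptions for illustration. The article does not say how the platform selects frames.

```python
from typing import List


def frame_indices(total_frames: int, source_fps: int, target_fps: int) -> List[int]:
    """Pick evenly spaced frame indices so the output runs at about target_fps.

    Integer arithmetic is used throughout to avoid floating-point drift.
    """
    if source_fps <= 0 or target_fps <= 0:
        raise ValueError("frame rates must be positive")
    if target_fps >= source_fps:
        return list(range(total_frames))      # nothing to drop
    # Number of output frames, rounded up so a trailing frame is not lost.
    n_out = -(-(total_frames * target_fps) // source_fps)
    return [(k * source_fps) // target_fps for k in range(n_out)]


# One second of 30 fps video reduced to 10 fps keeps every third frame.
print(frame_indices(30, 30, 10))
```

The selected indices can then be passed to whatever video decoder the pipeline uses to save the corresponding images for annotation.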

The extracted frames are passed to the data annotation stage, where obstacles, positioning elements, traffic lights and other elements are annotated. The labeling process is divided into two steps. The first step labels the extracted frames of each single camera continuously, segment by segment. Adjacent segments are linked by overlapping frames, which preserves the continuity of the frame sequence while reducing the difficulty of labeling. With the intelligent prediction algorithm, only the first frame is labeled manually and the algorithm automatically recognizes the subsequent frames, which greatly improves labeling efficiency and accuracy. After annotation is completed, the results are normalized according to the overlapping frames and then passed to the second step.
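The segment-by-segment labeling with overlapping frames can be sketched in two parts: splitting a frame sequence into segments that share a frame, and normalizing per-segment object IDs through that shared frame. The data layout (frame index mapped to `{local_id: box}`) and the rule of matching objects by identical boxes are simplifying assumptions. A real pipeline would more plausibly match by box overlap.

```python
from typing import Dict, List, Tuple

Box = Tuple[int, int, int, int]
SegmentLabels = Dict[int, Dict[str, Box]]   # frame index -> {object id: box}


def split_with_overlap(n_frames: int, seg_len: int, overlap: int = 1) -> List[List[int]]:
    """Split frames 0..n_frames-1 into segments sharing `overlap` frames."""
    if not 0 <= overlap < seg_len:
        raise ValueError("overlap must be in [0, seg_len)")
    segments: List[List[int]] = []
    start = 0
    while start < n_frames:
        end = min(start + seg_len, n_frames)
        segments.append(list(range(start, end)))
        if end >= n_frames:
            break
        start += seg_len - overlap
    return segments


def merge_segments(segments: List[SegmentLabels]) -> SegmentLabels:
    """Normalize per-segment IDs into global IDs using the overlapping frames."""
    merged: SegmentLabels = {}
    for seg_idx, seg in enumerate(segments):
        mapping: Dict[str, str] = {}
        # In frames already merged, objects with identical boxes are the same object.
        for frame in seg.keys() & merged.keys():
            by_box = {box: gid for gid, box in merged[frame].items()}
            for local_id, box in seg[frame].items():
                if box in by_box:
                    mapping[local_id] = by_box[box]
        for frame, objects in seg.items():
            out = merged.setdefault(frame, {})
            for local_id, box in objects.items():
                # Unmatched objects get a new global ID derived from the segment.
                out[mapping.get(local_id, f"s{seg_idx}_{local_id}")] = box
    return merged


print(split_with_overlap(10, 4))
seg0 = {0: {"a": (0, 0, 10, 10)}, 1: {"a": (1, 0, 11, 10)}}
seg1 = {1: {"x": (1, 0, 11, 10)}, 2: {"x": (2, 0, 12, 10)}}
print(merge_segments([seg0, seg1]))
```

In the example, object `x` in the second segment is recognized as the same object as `a` from the first because both share the same box in overlapping frame 1, so the track keeps one ID across segments.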

The second step performs association annotation across the data of the 10 cameras. The number of cameras associated at a time, generally 2-4, is determined by the road scene and the complexity of the annotation requirements. Association annotation not only ensures 360-degree surround-view perception, but also re-verifies the annotation quality of the first step. In this step, the intelligent vision algorithm identifies the same elements across the associated cameras for associated pre-labeling, and the pre-labeling results are checked and corrected manually.

Each annotation stage has corresponding prior/posterior algorithms which, together with the three-person fitting strategy in the audit phase, guarantee the quality of the annotation data in multiple ways. After all annotation stages are completed, the annotation data can be exported in multiple formats to meet the needs of different customer algorithms.
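The article does not define the three-person fitting strategy. One plausible reading, assumed here, is that three reviewers judge the same item independently and a result is accepted only when enough of them agree. The sketch below implements that majority-vote interpretation. The function name `consensus` and the `min_agree` parameter are hypothetical.

```python
from collections import Counter
from typing import Hashable, Optional, Sequence


def consensus(labels: Sequence[Hashable], min_agree: int = 2) -> Optional[Hashable]:
    """Return the label chosen by at least `min_agree` reviewers, else None."""
    if not labels:
        return None
    label, count = Counter(labels).most_common(1)[0]
    return label if count >= min_agree else None


print(consensus(["car", "car", "truck"]))   # two of three agree
print(consensus(["car", "truck", "bus"]))   # no agreement, needs escalation
```

Items without agreement would typically be escalated to a senior auditor rather than accepted by default.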

Baidu data crowdsourcing - intelligent driving data solution

A collection team and equipment with Class A surveying and mapping qualification, able to complete data collection in designated urban road scenes;
Support for multiple types of annotation services, such as obstacle framing, tracking, semantic segmentation and 2D/3D fusion annotation, providing long-term, stable, high-concurrency and high-quality training data;
A self-developed data development and management platform for unmanned vehicles, which stores, manages and applies intelligent driving datasets, supports iterative model training, and proposes corresponding solutions according to model defects.


AI data service system architecture changes

2019-06-19 Wang Guanghao

Overview

Baidu public beta (http://zhongbao.baidu.com/), the largest AI data annotation platform in China, was established in 2011 and has now been running for 8 years. As the business has grown and expanded, the entire site architecture has changed beyond recognition. Drawing on the experience accumulated over the years, this article describes the history of the platform's architecture changes in detail.

Only by constantly summarizing can we find the way forward. This article stays down to earth: it looks back over that long history while looking up at the stars.

Stage 1: Single-machine website architecture

In the early days of a typical website, a single machine commonly hosts all applications and databases. Frankly, this configuration is not recommended if conditions permit. Of course, when machines are in short supply, applications and databases sometimes end up deployed on the same machine. What is the cost?

The inevitable fate of downtime!

Applications commonly need to run scripts, and some scripts leak memory or simply use a lot of it. Databases are heavy memory users themselves. Once the machine runs out of memory, the Linux OOM killer may well pick the database process to kill, leaving you at a loss.

Therefore, for the sake of disaster recovery, it is recommended to at least deploy the database and the applications on separate machines.

As for deployment, the classic LAMP stack has been covered before. When the site was first built, Docker containerization was not yet mature, so machines were set up with scripts. Today, building with Docker is without doubt convenient, fast and easy to manage, and it is much less prone to compilation and debugging failures caused by system version differences. Even so, it is worth installing the web components you use by hand at least once, if possible, to get a basic understanding of what the various build options do, in case of emergencies.

The current overall architecture can be shown as follows:

Stage 2: Database read/write separation

Applications that can run are good applications, but the machine will inevitably have problems, so disaster recovery of the database itself is particularly important.

As the business grows, the database will inevitably run into data errors or even physical downtime, caused by buggy code or operator mistakes. Database disaster recovery is therefore the top priority.

The mysqldump tool that ships with MySQL makes it easy to export data for recovery. If possible, also back up the binlog, which allows recovery to second-level granularity. Note, however, that mysqldump can lock tables while it runs; with a single database, your service will go down with it.

This is where MySQL's built-in master-slave replication comes in!

What are the benefits of slave databases?

In brief, there are two:

The slave database mainly serves reads, which greatly reduces the load on the master database.
A dedicated backup slave can be used to back up data safely.

However, introducing slaves brings a number of fiddly problems:

First, you need a read/write proxy service in front of the database. We use dbproxy, a component developed in-house, which is completely transparent to the application. Among open-source options, you can use Mycat. Some frameworks also support configuring masters and slaves directly.

Second, once master and slave are introduced, replication lag between them becomes something business code must take into account. A common failure scenario is reading data immediately after writing it to the master: because the slave is momentarily behind, the read does not find the record. Business code should avoid this pattern where possible, but some special scenarios cannot avoid it. In those cases, wrap the operations in a transaction or force the read onto the master connection.
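
The routing idea can be sketched in a few lines. This is only a toy model under stated assumptions, not how dbproxy or Mycat work internally: the class name `RoutingSession`, the `force_master` context manager and the string connection names are all hypothetical, and reads are detected simply by checking whether the statement starts with SELECT.

```python
import contextlib


class RoutingSession:
    """Toy read/write router: writes go to the master, reads to a slave,
    unless the caller explicitly forces the master."""

    def __init__(self, master, slaves):
        self.master = master
        self.slaves = list(slaves)
        self._force_master = False
        self._next = 0  # round-robin cursor over the slaves

    @contextlib.contextmanager
    def force_master(self):
        """Inside this block every statement, reads included, uses the master."""
        previous = self._force_master
        self._force_master = True
        try:
            yield self
        finally:
            self._force_master = previous

    def pick(self, sql):
        """Return the connection a statement should be sent to."""
        is_read = sql.lstrip().lower().startswith("select")
        if not is_read or self._force_master or not self.slaves:
            return self.master
        target = self.slaves[self._next % len(self.slaves)]
        self._next += 1
        return target


session = RoutingSession("master", ["slave-1", "slave-2"])
print(session.pick("INSERT INTO task (name) VALUES ('demo')"))  # master
print(session.pick("SELECT * FROM task"))                       # a slave
with session.force_master():
    # Read-your-own-write: the read goes to the master, so lag cannot hide the row.
    print(session.pick("SELECT * FROM task WHERE name = 'demo'"))
```

Forcing the master only around the few read-after-write paths keeps most read traffic on the slaves, which is the point of the split in the first place.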

With slave databases in place, the system feels increasingly stable:

Stage 3: Load balancing + multiple application servers

As traffic keeps growing, a single server can hardly meet demand. The usual choice is to add machines, trading money for stability. However, machines cannot simply be added, because of the following problems:

1. The first question is what technology is used for load balancing:

A reverse proxy server is the first choice. The reverse proxy forwards each request to a specific server according to a scheduling algorithm. Both Apache and nginx can be configured with rules to forward to other machines. Deployment is quite simple, but the proxy server may become a performance bottleneck, and it is also a single point of failure.

Another, lower-level solution is IP-layer load balancing. When a request arrives, the load balancer forwards it by rewriting the request's destination IP address. Overall performance is better than a reverse proxy, but it too is a single point of failure.

In more complex cases, DNS and other methods are also used for load balancing; they are not covered here.

2. The second question is which cluster scheduling algorithm to choose.

First, the most common are round-robin (rr) scheduling and weighted round-robin (wrr) scheduling, both simple and practical.

Second, requests can be forwarded by hash. The user's IP or similar information is commonly used as the hash input, so that a given user always reaches the same server.

Finally, requests can be distributed by connection count. The basic form is least connections (lc), which prefers the server with the fewest connections. Weighted least connections (wlc) adds a weight to each server on top of lc. The value computed for each server is (number of active connections * 256 + number of inactive connections) ÷ weight, and the server with the smallest value is preferred.

Of course, there are more complex algorithms, which are not covered here.
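
The three families of schedulers described above can be sketched as follows. This is a minimal illustration, not the implementation of any particular load balancer: the `Server` class and function names are made up for the example, the weighted round-robin is the naive "repeat by weight" variant (weights are assumed to be at least 1), and SHA-256 is just one possible choice of hash.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class Server:
    name: str
    weight: int = 1    # assumed to be >= 1
    active: int = 0    # active connections
    inactive: int = 0  # inactive connections


def weighted_round_robin(servers):
    """Yield servers forever, each one `weight` times per cycle."""
    while True:
        for server in servers:
            for _ in range(server.weight):
                yield server


def ip_hash(servers, client_ip):
    """Map a client IP to a fixed server so repeat visits land on the same one."""
    digest = hashlib.sha256(client_ip.encode("utf-8")).digest()
    return servers[int.from_bytes(digest[:4], "big") % len(servers)]


def wlc_overhead(server):
    """The value from the text: (active * 256 + inactive) / weight."""
    return (server.active * 256 + server.inactive) / server.weight


def weighted_least_connections(servers):
    """Prefer the server with the smallest computed value."""
    return min(servers, key=wlc_overhead)


servers = [
    Server("a", weight=2, active=3, inactive=10),
    Server("b", weight=1, active=1, inactive=200),
]
rr = weighted_round_robin(servers)
print([next(rr).name for _ in range(6)])
print(ip_hash(servers, "10.0.0.1").name)
print(weighted_least_connections(servers).name)
```

In the sample data, server "a" carries more active connections but also has twice the weight, so its computed value (389) is lower than that of "b" (456) and it is chosen by the wlc rule.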

3. Finally, unlike with a single server, session sharing needs to be considered.

Most frameworks provide session sharing backed by Redis or a database, usable after simple configuration. Note, however, that under heavy traffic a single Redis instance or a single database may run out of connections, and further capacity expansion is needed.

In practice, we usually use BLB from the open cloud platform directly. It offers load balancing at both the HTTP layer and the TCP layer, and supports the wrr method. It also performs heartbeat checks, which effectively remove failed services from rotation.

So far, a cluster has begun to take shape:

Stage 4: Database splitting

At this stage, two problems may appear: a single database now holds hundreds of tables and has become very large; and a single table has reached the 10-million-row level, so queries have performance problems. For these two situations, vertical splitting and horizontal splitting are introduced:

Vertical splitting means moving different business data into different databases. For example, we split scenarios such as annotation and questionnaires into separate databases, so that a performance problem in one database does not drag down the whole site.

The new problem is how to handle cross-database transactions. At present we generally handle this in code: important logic supports its own independent rollback in each database.

Horizontal data splitting is to split the data in the same table into two or more databases. It is generally used to solve the performance problem of a single table that is too large, and to facilitate capacity expansion.

How to split, however, needs careful design. Today, components such as Mycat can route SQL to the right database according to configuration, which achieves the split.

When our business started, these components were only just emerging. We initially used a simple month-based sharding design: each task is placed in the database for the month in which it was released. Once data passes its expiration time, the cold data is moved into a read-only database to reduce storage.

However, as business volume grew exponentially, the capacity of a single database gradually got out of control, so we adjusted the sharding strategy further. We now use a finer-grained strategy based on a task-to-database mapping table: when a task is created, a sharding algorithm allocates a database to it, and all CRUD operations for the rest of the task's life cycle go to that database.
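
The two sharding schemes can be contrasted with a small sketch. Everything concrete here is an assumption for illustration: the database naming pattern, the `TaskShardMap` class, and in particular the allocation rule (pick the least-loaded database), since the article does not say which sharding algorithm is actually used.

```python
from datetime import date


def month_shard(release_date):
    """Earlier scheme: one database per release month, e.g. task_2019_06."""
    return f"task_{release_date.year:04d}_{release_date.month:02d}"


class TaskShardMap:
    """Later scheme: a task -> database mapping fixed when the task is created."""

    def __init__(self, databases):
        self.load = {db: 0 for db in databases}  # rough size estimate per database
        self.mapping = {}                        # stands in for the mapping table

    def allocate(self, task_id, expected_rows=1):
        """Assign a database at task creation (assumed rule: least loaded)."""
        if task_id in self.mapping:
            return self.mapping[task_id]
        db = min(self.load, key=self.load.get)
        self.load[db] += expected_rows
        self.mapping[task_id] = db
        return db

    def lookup(self, task_id):
        """All later CRUD for the task is routed to the database found here."""
        return self.mapping[task_id]


print(month_shard(date(2019, 6, 19)))
shards = TaskShardMap(["db_a", "db_b"])
print(shards.allocate(1, expected_rows=100))
print(shards.allocate(2, expected_rows=10))
print(shards.lookup(1))
```

The mapping table decouples a task's location from its release date, so one very large month no longer forces all of its tasks into the same database.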

With database and table sharding in place, our architecture is as follows:

Stage 5: Application and module splitting

In the last stage, we have split the database. In fact, the splitting of business code should be carried out simultaneously with the splitting of the database.

As with the database, we split the code by business module into several modules, including questionnaires and annotation. The business code of each module is different, so this split is quite natural. The pain point is that the modules share a lot of common logic, such as common string and array processing. One suggestion is to put this general logic into framework components so that it can be shared.

In addition, some common service modules (such as user information) would ideally be deployed and maintained independently. As a transitional measure, considering the development workload, we copy the common modules to each cluster at deployment time and plan to split them out in a later stage.

After business splitting, we have multiple subsystems:

Stage 6: Data cache

As the system grows more complex, MySQL gradually becomes a poor fit for many scenarios, such as the following:

Users often submit verification codes and similar information. Storing this large volume of short-lived data in the database is like using a sledgehammer to crack a nut;

Some complex paging results are hard to compute directly in the database and have to be computed by combining data in memory. Paging is unavoidable in these cases, so large amounts of data have to be read from the database into memory again and again.

For these problems, introducing a NoSQL cache is much more comfortable. At present, Redis is the common choice.

Verification codes can be stored in Redis directly as key-value pairs, with an expiration time set on each key so that Redis does not accumulate cold data.

For complex paging, the IDs for the pages can be stored in Redis. When the user changes pages, the paging information is read directly from Redis without being recomputed.
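
Both uses can be sketched without a running Redis server. The `ExpiringStore` class below is an in-memory stand-in written for this example, not a Redis client; the key names, the TTL values and the helper functions are likewise assumptions. With a real Redis instance, the same effect comes from setting the key together with an expiry (for example the SET command with its EX option).

```python
import time


class ExpiringStore:
    """In-memory stand-in for a key-value store with per-key expiry."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock  # injectable so expiry can be demonstrated
        self._data = {}

    def set(self, key, value, ttl_seconds):
        self._data[key] = (value, self._clock() + ttl_seconds)

    def get(self, key):
        item = self._data.get(key)
        if item is None:
            return None
        value, expires_at = item
        if self._clock() >= expires_at:
            del self._data[key]  # expired keys are dropped lazily on read
            return None
        return value


def cache_page_ids(store, query_key, ordered_ids, ttl_seconds=600):
    """Store the full ordered ID list once, after the expensive computation."""
    store.set(f"page_ids:{query_key}", list(ordered_ids), ttl_seconds)


def get_page(store, query_key, page, page_size):
    """Return the IDs for one page, or None if the cache entry is gone."""
    ids = store.get(f"page_ids:{query_key}")
    if ids is None:
        return None  # the caller recomputes and caches again
    start = (page - 1) * page_size
    return ids[start:start + page_size]


store = ExpiringStore()
store.set("captcha:user-1", "8341", ttl_seconds=60)  # short-lived verification code
cache_page_ids(store, "tasks:open", range(1, 8))
print(store.get("captcha:user-1"))
print(get_page(store, "tasks:open", page=2, page_size=3))
```

Changing pages then costs one cache read and a slice instead of another round of database reads and in-memory computation.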

Redis can also serve as a message queue, a session store and a data cache. It is an essential layer of the data storage design.

After the introduction of Redis, the modules are roughly as follows:

Stage 7: Microservices

As business splitting proceeds, we find that how modules are organized and divided becomes particularly important; this is the stage we are currently facing and working through. A common design approach is the microservice architecture: each service in the system runs in its own process and communicates through lightweight mechanisms, and can be deployed on one or more machines to scale out quickly.

An excellent microservice system will have the following characteristics:

Loose coupling: because services are autonomous, each has a clear encapsulation boundary, and services interact through published interfaces. Callers do not need to care how a service is implemented.

It is easy to test, can be developed in parallel, and has high reliability and good scalability.

How to build a microservice system cannot be covered in a sentence or two; readers who need it should study the topic in depth. The microservice architecture we are currently implementing is shown below:

It looks very similar to the previous stage. The main difference is that in the previous stage every cluster actually deployed the full codebase, and requests were directed to different clusters only by routing. For example, when the external test service needed a function of the annotation service, it could simply call that code directly.

In the microservice stage, the code of each service should be as simple as possible, and services should almost never reach into one another. Calls between services go through interfaces.

Summary

Website architecture has been evolving for a long time, and today's advanced technology will sooner or later become outdated. So when building an architecture, we should stay down to earth and understand the causes and consequences of each design; only with a solid foundation can we look up at the stars.

Never upgrade just for the sake of an advanced architecture. If you have not thought through how to split and design, acting on a burst of enthusiasm will leave you badly bruised.

References:

On the Evolution of Web Site Architecture: https://www.cnblogs.com/xiaoMzjm/p/5223799.html

Technical Architecture of Large Websites: Core Principles and Case Analysis, by Li Zhihui

Mycat Authoritative Guide


3D point cloud annotation in unmanned driving data scene

2019-06-06 Han Peigen

In driverless technology, the environment perception system acts as the "eye" of the vehicle. It acquires information about the external environment through the sensors mounted on the vehicle, models it, and accurately and quickly transmits the vehicle's geographic and obstacle information to the computer control system.

The driverless system is usually equipped with a variety of sensors, including lidar, millimeter-wave radar and on-board cameras, as shown below:

[LIDAR] Lidar

Lidar is a sensor that accurately obtains three-dimensional position information; its role in the machine is comparable to that of the human eye. A high-frequency laser can capture a large number of position points (on the order of 10^6 to 10^7), called a point cloud, in one second. Lidar has a long detection range and can model the surrounding environment accurately in real time, but its cost is relatively high.

[RADAR] Millimeter-wave radar

Millimeter-wave radar works mainly by detecting the electromagnetic waves reflected by the target. It penetrates fog, smoke and dust well and copes with severe weather such as sandstorms and fog. It is cheaper than lidar and is currently widely used in automatic emergency braking systems. However, its detection range is directly limited by frequency-band loss, and its perception of pedestrians is weak.

[CAMERA] On-board camera

The on-board camera captures information around the vehicle. It generally works as follows: 1) image processing, which converts pictures into two-dimensional data; 2) pattern recognition, which identifies objects such as vehicles, pedestrians, lane lines and traffic signs through image matching; 3) estimation of the relative distance and relative speed between the target object and the vehicle, using the object's motion pattern or binocular positioning.

At present, unmanned driving relies mainly on lidar to build a 3D model of the vehicle's surroundings, which provides the basis for the vehicle's driving decisions.

This issue focuses on the annotation of 3D point cloud images generated by lidar.

3D point cloud image annotation

3D point cloud annotation marks target objects with 3D boxes in the 3D images collected by lidar. Target objects include vehicles, pedestrians, advertising signs and trees, as shown below:

When the lidar is paired with an on-board camera, 2D images corresponding to the point cloud can be generated for comparison and reference.

Baidu public beta currently has a 3D annotation tool set that supports scenarios including 3D point clouds, 2D-3D fusion and 3D continuous frames. The annotation tool is divided into three main modules, namely the 2D image, the point cloud window and the three-view of the annotation box, as shown below:

2D image: Map the box marked in the point cloud to the 2D image.

Point cloud window: 3D point cloud image annotation operation window.

Box three view: Map the selected box in the point cloud to the three views to display more detailed information.

Annotation rules

After a box is dragged out in the top view of the point cloud (left figure below), the algorithm automatically generates a 3D box (right figure below). The annotator then fine-tunes the size and direction of that box until it meets the requirements.

Box requirements:

1. Box fitting: the six faces of the box should fit the annotated object tightly. There should be no gap of more than 3px inside the box, and no point belonging to the object should lie outside it.

2. Box direction: the box should be parallel to the direction of the vehicle body, with attention paid to which way the front of the vehicle faces.

3. Box type: when a 3D box is annotated, the corresponding position is automatically boxed in the 2D image, which can be used to confirm the object's type and the direction of its front.
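
The fitting requirement lends itself to a simple automatic check. The sketch below is an illustration under assumptions, not the tool's actual algorithm: the box is described by a center, a size and a yaw angle about the vertical axis, the function names are invented, and `max_gap` is expressed in the same units as the point coordinates, whereas the rule above states its tolerance in screen pixels.

```python
import math


def to_box_frame(point, center, yaw):
    """Express a 3D point in the box's local frame (yaw = rotation about z)."""
    dx, dy, dz = (point[0] - center[0], point[1] - center[1], point[2] - center[2])
    c, s = math.cos(yaw), math.sin(yaw)
    return (c * dx + s * dy, -s * dx + c * dy, dz)


def check_box_fit(points, center, size, yaw, max_gap):
    """Check that a box encloses all object points with no large empty gap.

    Returns (ok, problems), where problems is a list of short messages.
    """
    if not points:
        raise ValueError("points must not be empty")
    half = [d / 2 for d in size]
    local = [to_box_frame(p, center, yaw) for p in points]
    problems = []
    eps = 1e-9  # tolerance for floating-point rounding
    if any(abs(v) > h + eps for p in local for v, h in zip(p, half)):
        problems.append("object points fall outside the box")
    for axis, name in enumerate("xyz"):
        values = [p[axis] for p in local]
        gap_high = half[axis] - max(values)  # empty space at the + face
        gap_low = min(values) + half[axis]   # empty space at the - face
        if gap_high > max_gap or gap_low > max_gap:
            problems.append(f"gap larger than {max_gap} along {name}")
    return (not problems, problems)


# Hypothetical object points and three candidate boxes.
points = [(-1, -0.5, 0), (1, 0.5, 1), (1, -0.5, 0.5), (-1, 0.5, 0.2)]
print(check_box_fit(points, (0, 0, 0.5), (2, 1, 1), 0.0, max_gap=0.1))  # tight
print(check_box_fit(points, (0, 0, 0.5), (3, 1, 1), 0.0, max_gap=0.1))  # too loose
print(check_box_fit(points, (0, 0, 0.5), (1, 1, 1), 0.0, max_gap=0.1))  # too small
```

A check of this kind only covers the geometric fit; the direction and type requirements still depend on the 2D image and the annotator's judgment.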

As the annotation data provider for Baidu's driverless business, Baidu Public Beta has an annotation tool set covering 3D point clouds, 2D-3D fusion and 3D continuous frames. It has accumulated extensive experience in 3D point cloud annotation and provides customers with high-quality training data through efficient annotation tools and a streamlined quality management system.


Baidu Data Crowdsourcing: Exploration and Practice of AI Data Quality Management (II)

2019-06-04 Zhang Xiaoxiao

In the last issue we introduced Baidu Data Crowdsourcing's five-dimensional quality control system:

Automated prior filtering
AI-powered automatic audit
Self-inspection by the executing project manager
Multi-round crowdsourced quality inspection
Spot checks by Baidu project managers and small-batch pre-delivery

Today we focus on technology-empowered automatic audit.

Automatic audit, as the name implies, is a screening process carried out automatically by programs rather than by people.

As one of the leaders in AI research and application in China, Baidu has built up many AI platform applications and interfaces internally. Baidu Data Crowdsourcing draws on this accumulated technology, and on the trend toward openness, to feed AI technology back into the data collection stage.

According to the stage at which filtering takes place, automatic audit is divided into prior filtering and posterior review.

Prior filtering

Quality control starts at the very beginning of data collection. Baidu has its own collection tool, in which filter conditions can be set flexibly before collection begins. It can combine face, device model and other information to filter out duplicate users, solving the sample overlap that the traditional crowdsourcing subcontracting model can cause. Non-target users can be filtered out using collected device information and big-data profile tags based on Baidu accounts. Even at the data submission stage, basic filtering on the validity of data parameters, data duplication and so on is performed locally. These measures not only greatly improve data quality in the collection stage but also cut redundant collection and quality inspection work by more than 20%, greatly improving the efficiency of the entire collection stage.

Posterior review

Face recognition, face duplicate checking, audio blank and truncation detection and other checks have been added to the automatic audit framework one after another to filter out obviously unqualified data and overlapping samples. This greatly improves audit efficiency, reduces the amount of manual quality inspection, and even fulfills inspection requirements that human inspectors cannot.

Automation plays different roles at different audit stages. You may also be curious about which automation and artificial intelligence technologies have been, or will be, applied in our automatic audit.

1. Face duplication check and face recognition

By calling the face recognition API of Baidu's internal platform, and trading off recall against precision within the limits of current algorithm accuracy, we automatically filter out exact duplicate faces (the same sample user) and pass other face data with medium or high similarity to manual secondary review. The algorithm currently performs well on Asian faces, and in 2018 we also trialed it on other groups, such as white Europeans. We compared the accuracy of the machine algorithm with manual judgment in duplicate checking of large-scale face data, and the experiment showed the machine algorithm to be clearly better. This gave us more confidence in feeding AI technology back into the data business. Face duplicate checking and face recognition are applied differently in the prior and posterior stages of collection. In the prior stage they directly help the executing project manager determine whether a user has taken part in the project more than once, while the posterior stage allows more flexible applications, such as gender classification.
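
The split between automatic filtering and manual secondary review amounts to a two-threshold decision. The sketch below assumes that each face has already been turned into a feature vector by some recognition service and compares vectors by cosine similarity; the function names and both threshold values are hypothetical, and the internal API mentioned above is not reproduced here.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def triage_face(new_vec, known_vecs, duplicate_at=0.95, review_at=0.75):
    """Decide what to do with a new face given the faces already collected.

    The thresholds are placeholders; raising review_at sends less to manual
    review but risks missing duplicates (lower recall), and vice versa.
    """
    best = max((cosine_similarity(new_vec, k) for k in known_vecs), default=0.0)
    if best >= duplicate_at:
        return "auto_reject_duplicate"  # same sample user, filtered automatically
    if best >= review_at:
        return "manual_review"          # medium or high similarity, a person decides
    return "accept"


known = [(1.0, 0.0)]
print(triage_face((1.0, 0.0), known))
print(triage_face((1.0, 0.5), known))
print(triage_face((0.0, 1.0), known))
```

Moving the two thresholds is how the trade-off between recall and precision described above would be tuned in a setup like this.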

2. Duplicate checking of commodity barcode

For the collection of goods on sale, we added barcode recognition and duplicate checking to the prior framework for the first time. This made it possible to collect commodity data nationwide simultaneously. It avoids the waste of collection and audit resources, and the inefficient use of project managers' time, that result from scattered collectors, difficult information synchronization, and hard-to-split, hard-to-monitor categories. Data quality also improved further.

3. Audio blank detection and truncation detection

During voice collection, users inevitably upload blank audio, or audio whose beginning or end has been cut off, because of improper operation and other reasons. Mature, highly accurate techniques can easily identify whether an audio file is blank, or whether it has been truncated so that the appropriate amount of silence is missing before or after the speech. This technique was added to the automated posterior audit at the beginning of 2018 and has done a great deal to improve audit efficiency.
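
A very simple amplitude-based version of both checks might look like the sketch below. It is only an illustration: it assumes the audio has already been decoded into a list of signed PCM sample values, and the silence threshold and the minimum padding of 100 ms are made-up defaults rather than values from the production system.

```python
def leading_silence(samples, threshold):
    """Count how many samples at the start stay at or below the threshold."""
    count = 0
    for sample in samples:
        if abs(sample) > threshold:
            break
        count += 1
    return count


def check_audio(samples, sample_rate, threshold=500, min_pad_ms=100):
    """Classify a recording as 'blank', 'truncated' or 'ok'.

    blank:     no sample ever exceeds the silence threshold
    truncated: less than min_pad_ms of silence before or after the speech
    """
    if not samples or all(abs(s) <= threshold for s in samples):
        return "blank"
    min_pad = min_pad_ms * sample_rate // 1000  # required silence, in samples
    head = leading_silence(samples, threshold)
    tail = leading_silence(list(reversed(samples)), threshold)
    if head < min_pad or tail < min_pad:
        return "truncated"
    return "ok"


# Tiny synthetic example at a 100 Hz sample rate (10 samples = 100 ms).
silence = [0] * 10
speech = [3000, -2500, 4000]
print(check_audio(silence + speech + silence, sample_rate=100))
print(check_audio(speech + silence, sample_rate=100))
print(check_audio([0] * 50, sample_rate=100))
```

Real recordings contain background noise, so a production check would more likely work on short-window energy than on single samples, but the decision logic stays the same.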

4. File parameter filtering

Crowdsourced collection is certainly harder to manage than a professional collection team. Collectors' education levels are uneven, their professionalism varies, and the equipment they use is diverse. Data parameter requirements that seem very simple to us can become hard to control once collection is crowdsourced. The size, aspect ratio, resolution, file size and format of pictures; the sampling rate, duration and loudness of audio; the duration, frame rate and format of video: compliance with all these file parameter requirements can certainly be improved by optimizing the collection tools (software). However, Android phones have complex compatibility problems, and unavoidable offline, centralized collection and retrieval also requires us to filter the uploaded data again. Imagine configuring the file parameter requirements flexibly within the posterior framework: as soon as the first file is generated after collection, the system filters out unqualified data automatically around the clock, saving project manager workload, quality inspection manpower and time, and further improving the efficiency of collection and audit.
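
Configurable parameter filtering of this kind can be modeled as a rule table checked against each file's metadata. The sketch below assumes the metadata has already been extracted into a dictionary; the rule format, the field names and the example limits are all hypothetical.

```python
def check_file(meta, rules):
    """Return a list of violations; an empty list means the file passes.

    Each rule may contain 'allowed' (a set of values), 'min' and/or 'max'.
    """
    violations = []
    for field, rule in rules.items():
        value = meta.get(field)
        if value is None:
            violations.append(f"{field}: missing")
            continue
        if "allowed" in rule and value not in rule["allowed"]:
            violations.append(f"{field}: {value!r} not in {sorted(rule['allowed'])}")
        if "min" in rule and value < rule["min"]:
            violations.append(f"{field}: {value} is below {rule['min']}")
        if "max" in rule and value > rule["max"]:
            violations.append(f"{field}: {value} is above {rule['max']}")
    return violations


# Hypothetical per-project configuration for an audio collection task.
AUDIO_RULES = {
    "format": {"allowed": {"wav"}},
    "sample_rate": {"min": 16000},
    "duration_s": {"min": 1.0, "max": 60.0},
}

print(check_file({"format": "wav", "sample_rate": 16000, "duration_s": 3.2}, AUDIO_RULES))
print(check_file({"format": "mp3", "sample_rate": 8000}, AUDIO_RULES))
```

Keeping the rules as data rather than code is what makes them configurable per project without changing the audit framework itself.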

5. Systematic support for complex audit rules

We often break complex audit rules down. In manual review, each reviewer makes only a simple, single-dimension judgment (to reduce the difficulty of manual review and the misjudgment rate), while the system combines these judgments and writes back the result of the complex rule. Our posterior framework has also begun to support plugging in custom scripts written by R&D, making it a highly flexible and extensible automated quality inspection system.

In the future, as more and more AI technologies are used in the life cycle management of AI basic data, Baidu Data crowdsourcing will continue to provide key support for AI enterprises to reduce management and operation costs, improve data quality, and maximize the value of data assets.


Baidu Data Crowdsourcing: Exploration and Practice of AI Data Quality Management (I)

2019-05-28 Zhang Xiaoxiao

In the past two years, the wave of artificial intelligence has swept the world. Traditional Internet companies have poured resources into the AI industry, and large numbers of AI start-ups have emerged. AI technology rests on three elements: computing power, algorithms and data. Computing power needs little comment: in line with Moore's Law, the performance of GPUs and TPUs has improved by leaps and bounds, and China's own chips are also on the rise, so it is difficult for domestic companies to open up a significant gap in computing power. As for algorithms, deep learning is still the most popular approach, and the accuracy of deep learning algorithms depends on training with large amounts of high-quality data. Today, any great product in the AI field is backed by a huge amount of training data; data is the basic element that lets AI shine.

At present, there are basically the following data sources:

Crawling public web resources
Purchasing (or freely accessing) datasets from academia, government, enterprises and other sources
Collection by a self-built team
Crowdsourced manual collection and labeling
Data from the company's own products

As AI products develop in depth, algorithm accuracy in simple scenarios is converging at a high level. In complex and difficult scenarios, however, significant gaps in accuracy have opened up. AI companies are paying more and more attention to training algorithms for specific scenarios, and algorithms have increasingly individual requirements for data. Many existing datasets, or data crawled from the web, can no longer meet the needs of today's enterprises. Examples include dim light, backlight, strong light and occlusion in images, and noisy environments, office environments and car interiors in audio. These are business scenarios that are difficult to cover with, or filter out of, existing data.

The data generated by products that a company has already put on the market is in a similar situation. Although it costs nothing to collect, it contains so much redundant data that cleaning and re-labeling it takes tens or hundreds of times the manpower. Take some mature companies as an example: even if their databases gain hundreds of thousands or millions of pictures and audio clips every day, they are unwilling to pick useful data out of this mass for algorithm training. What is more, doing so involves privacy and other legal issues.

Self-built collection teams not only need a long preparation period but also face high labor and equipment costs and continuous management investment. In particular, as products iterate, changing data requirements place ever higher demands on the self-built team, which makes it a losing deal for most enterprises.

Crowdsourced manual collection has therefore become the obvious choice for enterprises that want large quantities of high-precision data at low cost.

It is generally acknowledged that crowdsourcing has many advantages, such as low labor cost, wide geographic distribution and rich scenario coverage. On the other hand, crowdsourcing also has drawbacks: personnel are hard to manage, support for difficult data collection is poor, and uneven personnel quality makes data quality hard to control. More and more professional data companies try to solve these problems by building their own project execution teams and training large numbers of experienced, excellent project managers. This helps somewhat, but it is far from enough to meet the accuracy that AI algorithms require of their data.

If you are reading this article and have worked on AI algorithms, you probably feel strongly about data accuracy. A difference of a few percent, or even a few tenths of a percent, may decide the success or failure of a product. Take the intelligent speech market as an example: speech recognition accuracy already exceeds 98%, so the accuracy required of its training data is higher still. We often hear customers say that they need accuracy above 99.x%.

As a crowdsourcing data company, we often ask customers why they ultimately chose Baidu Data Crowdsourcing.

"Data quality has obvious advantages". This is the answer we often hear.

Starting in the second quarter of 2019, we are launching a series of articles that reveals Baidu's efforts in quality assurance for the crowdsourced data collection business.

What quality inspection model do most data companies on the market currently use for their collection business?

After receiving a project, they subcontract it to multiple executing project managers or to other small resource companies or studios. After the data comes back, the company's internal quality inspection team conducts manual sampling or full inspection. This looks like a reasonable collection and inspection process, but it is actually a very crude and primitive method of quality control. Let's see how many pitfalls it contains!

When a project is subcontracted to other small resource channels, the collected subjects may overlap, which is difficult to eliminate or avoid. Yet the data review process only judges the accuracy of the data and cannot detect these overlapping subjects. Training an algorithm on such padded data yields half the result for twice the effort.
Relying on manual quality inspection alone raises two problems. The first is efficiency. The number of internal quality inspectors is limited, and it caps the company's maximum concurrency, so the quality inspection team is stretched to the limit when faced with large-scale collection and inspection needs or sudden, urgent business demands.
The second and most important problem is that data accuracy then depends on the manual judgment of a single quality inspector. People get tired, misunderstand, and are occasionally distracted. Manual quality inspection is precisely the kind of measure that needs to be built out in depth to assure quality, yet many quality inspection teams have only very basic business processes.

Baidu Data Crowdsourcing has been established for more than 7 years and has very rich experience in the crowdsourced data business. Unlike most traditional data companies, which started from small in-house collection and labeling teams, Baidu Data Crowdsourcing has targeted crowdsourcing since its creation. As a long-established crowdsourcing brand in China, we encountered the various difficulties of crowdsourcing early on, and we have steadily accumulated solutions, optimized business processes, built up technologies and products, and created a leading domestic crowdsourcing business system.

Take quality control of collected data as an example. Baidu Data Crowdsourcing is the first and only company in China to apply multi-dimensional quality control to collected data. The system offers a rich set of quality control measures, covers the whole process, and delivers industry-leading data quality. The measures fall into the following five areas:

Automated pre-collection filtering
Quality control starts at the very beginning of data collection. Baidu has its own collection tool, in which filter conditions can be configured flexibly before collection begins. The tool combines face information, device model, and other signals to filter out duplicate users, which addresses the sample overlap that the traditional subcontracting model can cause. Non-target users can be filtered out using collected device information and big-data profile tags tied to Baidu accounts. Even at the data submission stage, basic checks on the validity of data parameters, data duplication, and so on are performed locally.
These measures not only greatly improve data quality in the collection phase but also cut redundant collection and quality inspection work by 20%, greatly improving the efficiency of the entire collection phase.
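As a rough illustration of this kind of up-front filtering, the sketch below rejects submissions with invalid parameters and contributors who have already been seen. It is a minimal sketch under assumptions: the `Submission` fields, the `prefilter` function, and the sample-rate rule are hypothetical and do not describe Baidu's actual collection tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Submission:
    user_id: str        # hypothetical account identifier
    face_id: str        # hypothetical ID from a face-matching service
    device_model: str   # device model reported by the collection tool
    sample_rate: int    # one example of a data parameter to validate


def prefilter(submissions, allowed_sample_rates=(16000, 48000)):
    """Split submissions into accepted ones and (submission, reason) rejections."""
    seen_users, seen_faces = set(), set()
    accepted, rejected = [], []
    for sub in submissions:
        if sub.sample_rate not in allowed_sample_rates:
            rejected.append((sub, "invalid parameter"))
        elif sub.user_id in seen_users or sub.face_id in seen_faces:
            rejected.append((sub, "duplicate contributor"))
        else:
            seen_users.add(sub.user_id)
            seen_faces.add(sub.face_id)
            accepted.append(sub)
    return accepted, rejected


subs = [
    Submission("u1", "f1", "model-a", 16000),
    Submission("u2", "f1", "model-b", 16000),  # same face as u1
    Submission("u3", "f3", "model-a", 8000),   # unsupported sample rate
]
accepted, rejected = prefilter(subs)
print([s.user_id for s in accepted])  # only "u1" is accepted
```

Rejections carry a reason so that the executing project manager can see why data was dropped before any manual inspection takes place.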
AI-assisted automatic review
As one of the leaders in AI research and application in China, Baidu has accumulated many applications and interfaces built on its AI technology platforms. Baidu Data Crowdsourcing draws on this accumulated technology and its increasing openness to feed AI back into the data collection stage. For example, face recognition, face duplicate checking, and audio blank and truncation detection have been added one after another to the automatic review framework to screen out clearly unqualified data and overlapping samples. This greatly improves review efficiency, reduces the amount of manual quality inspection, and even covers inspection requirements that human inspectors cannot fulfil.
Self-inspection by the executing project manager
Again, quality inspection and control start at the very beginning of data collection. Before data actually reaches the manual quality inspection team for review, it is first seen by the project manager executing the project. The value of this step lies not only in how much invalid data the executing project manager filters out, but also in spotting data problems promptly, adjusting the execution strategy, and communicating proactively. This reduces effort spent in the wrong direction, reduces the manpower wasted on invalid execution and inspection, and improves project efficiency and data quality.
Multi-round, cross-checked crowdsourced quality inspection
Manual review is unavoidable. Anyone who has worked in the data business knows that data production, cleaning, and labeling ultimately cannot be separated from people. However far AI technology develops, further gains in accuracy still require high-precision human input. How does manual quality inspection at Baidu Data Crowdsourcing differ from that of other teams? In a great many ways, from processes to tools and from people to systems. I hope to share the details another time.
Spot checks by Baidu project managers and small-batch pre-delivery
The entire standard collection (review) process runs online, data flows promptly, and the process is transparent internally, which leaves flexibility for project delivery. Data collected on the first day can be pushed to review the next day, with review results issued as early as possible. Baidu project managers can pull small batches of data from the system at any time, spot-check their quality, and pass them to the customer online for confirmation, so that problems are found in time and later work is adjusted. This largely avoids large-scale data rework or even re-collection caused by poor communication or changing requirements, and reduces customers' waiting costs as well as losses of money and labor.

Beyond quality control of each batch, Baidu Data Crowdsourcing is also working to create a more sustainable data collection and delivery ecosystem. Evaluation data such as the quality and efficiency of each collection stays permanently on the record of the executing project manager and their channel resources, and becomes the basis for their later overall evaluation. The difficulty and scope of the projects a project manager can take on afterwards also depend on this accumulated track record. On the one hand, across the whole collection project we build a positive cycle of survival of the fittest around the project managers. On the other hand, we actively promote a business culture that values quality, performance, and communication. This will be the foundation of the long-term vitality of Baidu Data Crowdsourcing's collection business.


An intelligent bidding platform for agents: building a fair and open crowdsourced labeling ecosystem

2019-05-05 Zhong Ping

As a leading provider of basic AI data services, the Baidu Data Crowdsourcing team is committed to providing the most professional one-stop data annotation and collection services for AI industry customers in fields such as intelligent driving, computer vision, and speech recognition.

High-precision AI models rely on massive amounts of training data. The data production chain of the Baidu Data Crowdsourcing team involves more than 100 partner agents and tens of thousands of employees engaged in data annotation. With a workforce of this size, selecting and managing agents is undoubtedly Baidu's top priority in building a mature data crowdsourcing solution.

Intelligent bidding to create an open and transparent cooperation ecosystem

To build a more transparent and efficient bidding environment for agents, the Baidu Data Crowdsourcing team independently developed a fully automated intelligent bidding system for projects. When a project starts, the bidding system launches a time-limited simulation test modeled on the actual data labeling work. Agents that intend to bid can sign up and organize their employees to practice the project in the simulation system. After the simulation test, the system uses automatic review algorithms to calculate productivity and quality indicators for every participating agent, selects the agents whose indicators meet the project's preset bid-winning conditions, and calculates and allocates the labeling data quota each winning agent can take on according to its actual performance in the test.
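The selection and allocation step can be pictured with a small sketch. Everything specific here is an assumption made for illustration: the thresholds, the use of throughput multiplied by quality as the performance value, and the proportional split are hypothetical, since the article does not state the actual formula.

```python
def allocate_quota(results, total_items, min_quality=0.95, min_throughput=50):
    """Pick winning agents and split the data quota by simulated performance.

    results maps agent name -> (throughput, quality) measured in the simulation.
    The thresholds and the throughput * quality score are illustrative only.
    """
    scores = {
        agent: throughput * quality
        for agent, (throughput, quality) in results.items()
        if quality >= min_quality and throughput >= min_throughput
    }
    if not scores:
        return {}
    total_score = sum(scores.values())
    quotas = {a: int(total_items * s / total_score) for a, s in scores.items()}
    # Rounding down can leave a few items over; give them to the top performer.
    best = max(scores, key=scores.get)
    quotas[best] += total_items - sum(quotas.values())
    return quotas


simulation = {
    "agent_a": (100, 0.98),
    "agent_b": (60, 0.96),
    "agent_c": (200, 0.80),  # fast, but below the quality bar
}
print(allocate_quota(simulation, total_items=1000))
```

In this example the fastest agent wins nothing because it misses the quality condition, which matches the idea that bid-winning conditions are checked before any quota is assigned.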

Resource circulation: supporting the rapid growth of new agents

To ensure the stable growth of newly onboarded agents on the platform, the crowdsourcing team introduced a resource circulation mechanism into the project system, giving new agents as many opportunities as possible to take on projects. Once the number of agents undertaking a project reaches a certain level, the system starts a cycle and scores project performance based on indicators such as number of deliveries, delivery quality, and acceptance rate for all agents in the cycle. At the end of each cycle, the agent with the lowest score loses its qualification for the project; to keep working on it, that agent must take part in and pass the simulation test again. This mechanism ensures that projects are not monopolized by large agents and gives new agents more room to grow.
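A single round of this mechanism might look like the sketch below. The weights, the normalization of each indicator to the range 0 to 1, and the `min_agents` threshold are assumptions for illustration; the article only says that the lowest-scoring agent drops out at the end of each cycle.

```python
def run_cycle(agents, min_agents=3, weights=(0.3, 0.4, 0.3)):
    """Score agents for one cycle and drop the lowest scorer.

    agents maps name -> (delivery_score, quality_score, acceptance_rate),
    each assumed to be normalized to 0..1. Returns (remaining, eliminated).
    """
    if len(agents) < min_agents:
        return sorted(agents), None  # cycle does not start yet
    scores = {
        name: sum(w * m for w, m in zip(weights, metrics))
        for name, metrics in agents.items()
    }
    eliminated = min(scores, key=scores.get)
    remaining = sorted(name for name in agents if name != eliminated)
    return remaining, eliminated


project_agents = {
    "large_agent": (0.90, 0.95, 0.90),
    "mid_agent": (0.80, 0.90, 0.85),
    "new_agent": (0.85, 0.92, 0.90),
}
remaining, eliminated = run_cycle(project_agents)
print(remaining, eliminated)
```

The eliminated agent would then have to pass the simulation test again before rejoining, which is what frees up a slot for newer agents.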

Summary

Baidu Data Crowdsourcing has an agent resource pool of 10,000 people. Together with the government, it has built the industry's largest downstream agent ecosystem, the Baidu (Shanxi) artificial intelligence data annotation base. With standardized quality control processes and professional software and hardware facilities, it meets customers' differing levels of data security requirements, helps enterprises in vertical fields such as intelligent driving, computer vision, and speech recognition improve the quality of their algorithms, and continues to empower the artificial intelligence industry.


Expert column | Jiang Zhijian: data annotation scheduling system design

2019-04-30 Baidu data crowdsourcing

Introduction

He who gets data gets AI. The Baidu Intelligent Cloud Data Crowdsourcing Platform, established in 2012, meets customers' data needs through an efficient crowdsourcing model: it collects large amounts of raw data and, through data processing, delivers standardized, structured, usable data. This helps customers train algorithm models, carry out machine learning, and become more competitive in the AI field.

Several stages of data annotation development

Stage 1: embryonic stage

At the initial stage of Baidu Intelligent Cloud Data Crowdsourcing, it mainly undertook the evaluation of some product lines within Baidu and the accumulation of annotation data related to model training of the algorithm strategy team.

Stage 2: Development period

As each business line kept investing in machine learning, demand for data annotation kept growing. This stage lasted about three years, during which Baidu Data Crowdsourcing built up its original methodology and related technologies.

Stage 3: Outbreak

On September 1, 2016, at that year's Baidu World Conference, Robin Li announced that AI was the core of Baidu's core. As AI took its central position in the company, market expectations for and attention to AI grew ever stronger. When everyone came to believe that AI was the next big opportunity after the mobile internet, the data annotation industry, which sits at the base of AI, saw an unprecedented boom.

Stage 4: Maturity

In 2018, the total financing of Chinese AI companies exceeded 100 billion yuan, and the data acquisition market was worth about 10 billion to 30 billion yuan. As AI has gradually entered the strategic development goals of companies, whether internet firms or traditional enterprises, the data annotation industry has reached maturity.

Several Key Elements of Data Annotation

Annotators: Annotators are the primary productive force. How to improve annotators' ability and efficiency is the core problem for the whole data annotation field.

Data: How to release data, process data, and ensure data quality is another core problem for the whole data annotation field.

Annotation tools: These provide the annotation rules and interaction methods. The annotation tool is the most important means of freeing up annotators' productivity.

To sum up, the essence of data annotation is that a suitable annotator processes a piece of data according to specified rules using an annotation tool.

So how should data be distributed to annotators for processing?

Evolution of scheduling system

The annotation scheduling system connects the key elements of data annotation described above; that is, it distributes data to annotators for processing.

At different stages of data annotation development, our positioning and requirements for annotation scheduling systems are also different.

Embryonic stage

Annotation requirements and processes in the embryonic stage were very simple, consisting mainly of objective multiple-choice questions or subjective questions. A platform only needed to let annotators find data they were interested in and label it on their own initiative, while operations staff released the data manually. At this stage an annotation scheduling system was basically unnecessary.

Development period

1. Background: As data demand grew further, manual task release could no longer keep up with annotation needs. Developing a system that releases tasks automatically therefore became a technical direction in this stage, which also produced the embryonic form of the annotation scheduling system.

2. Solution: Full process automation

Outbreak period

1. Background:

Changes on the demand side:

a) With growing labeling demand in fields such as unmanned vehicles, vision, and voice, the question types and processes of labeling became more and more complex

b) As models matured, more annotation data was used to improve model performance rather than simply to accumulate raw data, so the demand side placed ever higher requirements on data quality

Changes on the annotator side:

a) As the industry's prospects became clearer, more and more new annotators poured into the sunrise industry of data annotation

2. To sum up, the main contradictions at the current stage are:

a) Management requirements for data quality

b) Management requirements for a large number of people

3. For the above problems, the business solutions are as follows:

a) Traditional data annotation generates the final result by fitting the answers of multiple people. For example, for a multiple-choice question, the system does not accept C as the correct option until three people have selected C. But bad cases still occur. Therefore, a review step is added after annotation, bringing in reviewers with stronger professional knowledge. For unqualified data, effective rework is a way to quickly improve data quality.

b) To manage a large number of people, virtual organizations are introduced by adding levels of hierarchy, similar to a "guild mechanism".

4. Solution: Audit phase and corresponding personnel management mechanism - guild.
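The multi-person fitting rule from the multiple-choice example can be sketched in a few lines. This is a minimal sketch, not the platform's implementation: the function name `fit_answer` and the decision to send ties and under-agreed items to review are assumptions; only the "three people selected C" threshold comes from the text.

```python
from collections import Counter


def fit_answer(votes, required=3):
    """Return the agreed option, or None when the item should go to review."""
    top = Counter(votes).most_common(2)
    if not top or top[0][1] < required:
        return None  # not enough annotators agree yet
    if len(top) == 2 and top[1][1] == top[0][1]:
        return None  # two options are tied, so let a reviewer decide
    return top[0][0]


print(fit_answer(["C", "C", "A", "C"]))  # C
print(fit_answer(["C", "A", "C"]))       # None
```

Items that come back as `None` are the bad cases for which the added review stage, staffed by more experienced reviewers, is useful.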

Mature period

1. Background

The business keeps scaling up, and customers depend more and more on data annotation. Data annotation has entered the customer's R&D closed loop, and the requirements for data quality have reached their peak.

Further improving annotation quality requires not only more refined control over the whole annotation process but also a solution to the uneven ability levels of annotators.

2. Solution:

a) Introduction of a data scheduling system: expand the processing stages of annotation data to achieve fine-grained management of data flow between stages

b) Introduction of a personnel scheduling system: fine-grained management of the annotator's annotation life cycle

3. It can be seen that the data processing stage of the current annotation has been refined to the following extent:

4. Data dispatching system

5. Personnel dispatching system

Main objectives and implementation means of labeling scheduling system

The evolution above gives a general picture of how the data annotation scheduling system developed. The following introduces several main goals of the current annotation scheduling system and the concrete ideas for implementing them.

Generality

1. Universality of scheduling objects

Data scheduling: supports the flow of data in all dimensions

a) Single data item: the smallest scheduling unit of the annotation system

b) Task dimension: a task is an aggregation of n data items and the smallest management unit for annotation operations

c) Batch dimension: a batch is an aggregation of n tasks and the smallest management unit in the customer dimension

2. Business model abstraction

3. Universality of circulation strategy

a) Input:

The decision data source, which can be the current online real-time database or an hourly data warehouse built offline
The original data (batch, task, or single data item)

b) Calculation: decision calculation configuration; a decision is made from the decision data plus the strategy, and the final flow direction is output

c) Output: flow configuration, preset flow according to calculation results
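The input, calculation, and output split can be made concrete with a small sketch. All names here are hypothetical: the `Unit` class, the accuracy metric used as decision data, the `min_accuracy` policy field, and the stage names in the route table are assumptions chosen to show the shape of the design, not the system's real configuration.

```python
from dataclasses import dataclass


@dataclass
class Unit:
    unit_id: str
    level: str   # "item", "task" or "batch"
    stage: str   # current processing stage


def decide(unit, decision_data, policy):
    """Calculation: combine decision data with the strategy to get an outcome."""
    accuracy = decision_data.get(unit.unit_id, 0.0)
    return "pass" if accuracy >= policy["min_accuracy"] else "fail"


def flow(unit, decision_data, policy, routes):
    """Output: look up the preset next stage for (current stage, outcome)."""
    outcome = decide(unit, decision_data, policy)
    return Unit(unit.unit_id, unit.level, routes[(unit.stage, outcome)])


# Input: decision data (for example from a real-time store) plus the unit itself.
decision_data = {"task-1": 0.97, "task-2": 0.60}
policy = {"min_accuracy": 0.90}
routes = {
    ("annotate", "pass"): "review",
    ("annotate", "fail"): "rework",
    ("review", "pass"): "deliver",
    ("review", "fail"): "rework",
}
print(flow(Unit("task-1", "task", "annotate"), decision_data, policy, routes).stage)
print(flow(Unit("task-2", "task", "annotate"), decision_data, policy, routes).stage)
```

Because the same `flow` function works for items, tasks, and batches, new processes can be expressed by changing the policy and route table rather than the code, which is one way to read the goal of a general-purpose circulation strategy.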

High availability

1. Module deployment diagram

2. High availability SLA definition

The module guarantees that 99.9% of requests are scheduled correctly and that 80% of decisions are made with a delay of less than 60 seconds.

3. Hot loading of strategies

Because the service SLA must be guaranteed, updated policies are loaded by hot update. A policy version number is used to control policy upgrades and rollbacks.

4. Building an SLA-based monitoring module

SLA-based indicator monitoring is implemented from request logs and process data, and corresponding thresholds are set to trigger simple system self-recovery.
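The two SLA figures above (99.9% of requests scheduled correctly, 80% of decisions within 60 seconds) can be checked from request logs along the following lines. This is a minimal sketch; the log format of `(scheduled_ok, delay_seconds)` tuples and the function name `sla_report` are assumptions, and a real monitor would read from the actual request logs and process data.

```python
def sla_report(requests, max_delay_s=60.0, min_success=0.999, min_fast=0.80):
    """Compute SLA indicators from (scheduled_ok, delay_seconds) log entries."""
    total = len(requests)
    if total == 0:
        return {"success_rate": 1.0, "fast_rate": 1.0, "healthy": True}
    success_rate = sum(1 for ok, _ in requests if ok) / total
    fast_rate = sum(1 for _, delay in requests if delay < max_delay_s) / total
    return {
        "success_rate": success_rate,
        "fast_rate": fast_rate,
        # Both thresholds must hold for the service to count as healthy.
        "healthy": success_rate >= min_success and fast_rate >= min_fast,
    }


# 1000 requests: one failure, 150 slow decisions.
logs = [(i != 0, 10.0 if i < 850 else 120.0) for i in range(1000)]
report = sla_report(logs)
print(report["success_rate"], report["fast_rate"], report["healthy"])
```

When `healthy` turns false, a threshold-based action such as restarting a worker or rolling back to the previous policy version could be triggered; which action fits is a design decision the article leaves open.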

Summary

With the rapid development of the labeling business, the labeling scheduling system has gradually moved from purely manual operation to full automation. At the same time, continuous architectural adjustments have strengthened its general-purpose design to cope with more complex changes in external business. Next, while continuing to meet the requirements of process changes, we will explore how to improve the efficiency of overall data delivery by optimizing the scheduling process at the micro level.


Expert column | Min Nan: building high-quality intelligent driving data set to provide "data fuel" for automatic driving

2019-04-28 New Smart Drive

Perception technology is a key part of intelligent driving. Given China's complex road conditions in particular, breakthroughs in perception cannot be achieved through algorithm iteration or technical innovation alone.

In this situation, manually annotated data with rich semantic information helps algorithms better understand and recognize the image and obstacle information delivered by vision cameras, lidars, millimeter-wave radars, and other sensors.

Every R&D team currently faces the same problem: how to turn massive amounts of raw data efficiently into annotated data with rich semantic information.

Sensors collect data from the real world, which completes the data production step. After calibration and structured or unstructured storage, the raw data must be manually annotated to produce data with labels and semantic information before algorithms can use it.

Conversely, if the sensors cannot capture enough useful data from the real world, such data has to be produced and collected deliberately to improve the algorithm's accuracy.

In theory, the more accurate the annotation results, the better the algorithm's results. Data collection and annotation are therefore very important.

Enterprises and developers generally adopt two approaches:

In-house team

An in-house approach takes a great deal of effort to maintain the annotation team. It usually also requires developing, and maintaining over the long term, a general-purpose tool or platform for data annotation. Only then can annotation work be carried out systematically over time and time-sensitive data be replenished.

Business outsourcing

Compared with an in-house team, the outsourcing model has its own difficulties. R&D and technology choices in autonomous driving keep evolving, and the professional skills required for data annotation keep rising. The industry's labeling needs have evolved from the original 2D image labeling, to 3D point cloud labeling, to full-pixel semantic segmentation, and even to labeling obstacles under multi-sensor fusion. These evolving requirements pose great challenges to the capabilities of data annotation teams.

As a result, enterprises have to keep developing new annotation tools and even keep searching for teams whose annotation capabilities evolve with them. Baidu Intelligent Cloud Data Crowdsourcing hopes to offer partners a solution that beats both of the above approaches in cost and efficiency.

About Baidu Intelligent Cloud Data Crowdsourcing

Baidu Intelligent Cloud Data Crowdsourcing was founded in 2011 with the goal of providing AI data collection and annotation services to Baidu's internal R&D and business teams.

Today, Baidu Intelligent Cloud Data Crowdsourcing handles the data annotation needs of most teams, including the Baidu Intelligent Driving Business Group. In the second half of 2017, it officially opened its experience and capabilities to the public, becoming a comprehensive training data service platform.

Through customized process management, quality management, and resource and personnel management, Baidu Intelligent Cloud Data Crowdsourcing can efficiently distribute and manage annotation tasks for large-scale data while ensuring data quality and data security.

Application of Baidu Intelligent Cloud Data Crowdsourcing in the intelligent driving industry

The data output by intelligent driving sensors generally falls into the following three types:

The first is obstacle detection, tracking, and multi-sensor obstacle fusion.

Baidu Intelligent Cloud Data Crowdsourcing has been labeling obstacles for intelligent driving since 2015. Beyond the most basic obstacle labeling for monocular, binocular, fisheye, and panoramic cameras, it can also label lidar point cloud data from sensors with 4 to 128 beams, and it can label obstacles under multi-sensor fusion, including fusion of lidar with cameras and of lidar with millimeter-wave radar. The team also has relevant experience in V2X data annotation.

The second is perception of the environment outside the vehicle and lane information.

For external environment perception and lane information, the Baidu Intelligent Cloud Data Crowdsourcing annotation platform has also accumulated a wealth of annotation schemes, handling a large volume of data types such as lane detection, parking space recognition, road information, traffic signs, positioning elements, drivable areas, and semantic segmentation (including the Apollo platform outdoor scene collection).

The third is perception of the in-vehicle environment and interaction with the driver's driving intention.

For in-vehicle perception, a typical capability of Baidu Intelligent Cloud Data Crowdsourcing is labeling for fatigue driving detection, including key point labeling and facial expression detection for the driver's face, as well as position sensing for passengers in passenger vehicles.

About capacity scale

In cooperation with the Shanxi Provincial Government, Baidu has established a large labeling center in Taiyuan. Combined with experienced online crowdsourcing workers, the Baidu Intelligent Cloud Data Crowdsourcing labeling team has more than 5,000 people. Its daily peak capacity has reached about 40,000 frames for 2D data such as obstacles and lane lines, and about 10,000 frames for point cloud obstacle labeling.

At this pace of large-scale production, keeping annotators' understanding and application of the annotation rules consistent, and thereby ensuring data quality, is a challenging problem. Baidu Intelligent Cloud Data Crowdsourcing has explored and iterated on this issue continuously.

First, Baidu Intelligent Cloud Data Crowdsourcing established standard processes such as training and examinations for annotators and reviewers. In addition, it integrates intelligent algorithms into the annotation tool. For example, the consecutive-frame annotation algorithm can predict and label the obstacle category in the next frame based on the obstacle category manually labeled in the previous frame.

Intelligent algorithms greatly relieve the pressure on annotators, who only need to make corrections on top of the algorithm's output. This greatly reduces the chance of human error introduced by manual work and subjective judgment during labeling.
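One simple way to picture consecutive-frame pre-labeling is to carry a category forward when a box in the next frame overlaps a manually labeled box in the previous frame. The sketch below uses intersection over union (IoU) matching, which is a stand-in technique chosen for illustration; the article does not say which algorithm the annotation tool actually uses, and the function names and the 0.5 threshold are assumptions.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def propagate_labels(prev_labeled, next_boxes, min_iou=0.5):
    """Suggest a category for each box in the next frame.

    prev_labeled is a list of (box, category) pairs labeled by hand.
    Boxes without a close enough match get None and are left to the annotator.
    """
    suggestions = []
    for box in next_boxes:
        best = max(prev_labeled, key=lambda p: iou(p[0], box), default=None)
        if best is not None and iou(best[0], box) >= min_iou:
            suggestions.append((box, best[1]))
        else:
            suggestions.append((box, None))
    return suggestions


previous = [((0, 0, 10, 10), "car"), ((50, 50, 60, 60), "pedestrian")]
upcoming = [(1, 0, 11, 10), (100, 100, 110, 110)]
print(propagate_labels(previous, upcoming))
```

The annotator then only confirms or corrects the suggested categories and labels the boxes marked `None`.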

At Baidu Intelligent Cloud Data Crowdsourcing, every piece of labeled data goes through manual review and automated script checks, which effectively ensures that the labeling results comply with the labeling rules.
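An automated script check of this kind can be as simple as validating each annotation against a few mechanical rules. The rules below (known category, non-empty box, box inside the image) and the dictionary format of an annotation are hypothetical examples, not the platform's actual rule set.

```python
def check_annotation(annotation, image_size, allowed_categories):
    """Return a list of rule violations for one bounding-box annotation."""
    errors = []
    x1, y1, x2, y2 = annotation["box"]
    width, height = image_size
    if annotation["category"] not in allowed_categories:
        errors.append("unknown category")
    if x2 <= x1 or y2 <= y1:
        errors.append("empty box")
    if x1 < 0 or y1 < 0 or x2 > width or y2 > height:
        errors.append("box outside image")
    return errors


categories = {"car", "pedestrian", "cyclist"}
good = {"category": "car", "box": (10, 20, 110, 80)}
bad = {"category": "truk", "box": (1900, 20, 1950, 80)}
print(check_annotation(good, (1920, 1080), categories))  # []
print(check_annotation(bad, (1920, 1080), categories))
```

Checks like these catch mechanical mistakes cheaply, leaving human reviewers to concentrate on judgments a script cannot make.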

In addition, Baidu Intelligent Cloud Data Crowdsourcing attaches great importance to data security. Beyond standard contract terms and confidentiality agreements, it uses technical safeguards: task encapsulation, data encryption, dedicated-line transmission, and patented anti-crawling protection.

For customers with special data security requirements, Baidu Intelligent Cloud Data Crowdsourcing has prepared a privately deployed annotation platform, a dedicated annotation team, and a closed annotation site.

These options serve customers with different levels of data security requirements. Throughout the project, a project manager and a business manager from Baidu Intelligent Cloud Data Crowdsourcing act as the points of contact. In general, the customer only needs to provide the labeling rules and the data to be labeled, and can accept the labeling results once the project is complete.

Summary

Data is the fuel of artificial intelligence, and its importance to intelligent driving is beyond doubt. Most enterprises attach great importance to data, yet all face the same dilemma: they lack effective channels for obtaining large amounts of high-quality data. Given China's complicated road conditions and the late start of intelligent driving in China, Baidu Intelligent Cloud Data Crowdsourcing can keep contributing new ideas for intelligent driving, drawing on years of experience, efficient management solutions, and professional software and hardware facilities built up over the years.


Baidu data crowdsourcing, China's AI data quality leader

2018-12-19 Wu Zexian

On December 14, 2018, the "AI Way to Win by Gathering Data" salon hosted by Baidu Data Crowdsourcing was held in Sanya. Several representatives from Baidu's internal product lines, leading enterprises in the industry, and the academic circle of artificial intelligence attended the salon and held in-depth discussions on the status and trends of basic data services in the AI industry.

Zeng Hongyun, general manager of the Baidu Data Crowdsourcing data business

First, Zeng Hongyun, general manager of the Baidu Data Crowdsourcing data business, delivered a speech titled "Artificial + Intelligence: Leading the New Quality Standard of the Data Industry". He said that governments and industries everywhere are actively embracing AI, that the scale of AI data demand will keep growing and usage scenarios will become more diverse, and that requirements for data quality will become ever stricter. Improving data accuracy per unit of time is the industry's core demand, and Baidu Data Crowdsourcing has unique advantages in data quality control.

Baidu Data Crowdsourcing is an AI data service platform that grew up inside Baidu. Since 2011 it has served the AI data collection and labeling needs of Baidu's internal product lines. It has served 131 internal product lines, covering mainstream AI fields such as computer vision, speech recognition, natural language processing, and knowledge graphs. With years of rich experience in internal projects and the support of Baidu's internal technical capabilities, Baidu Data Crowdsourcing can complete all kinds of projects to a high standard.

Taking portrait collection as an example, after receiving a customer's requirements Baidu Data Crowdsourcing draws up a collection plan and runs a small-batch test, feeds the test results back to the customer, and agrees on acceptance standards. Formal collection starts only after the plan has been refined repeatedly and confirmed. Before collection, Baidu Data Crowdsourcing uses Baidu's face recognition technology to photograph the people being collected, enroll them in a database, and de-duplicate them to ensure each person is unique. Baidu Data Crowdsourcing is also very strict about protecting user privacy, requiring everyone being collected to sign a data authorization agreement before collection starts. After the collected results are transmitted in real time through Baidu's micro-task app, they undergo MD5 checks to prevent repeated submission. Baidu's labeling base staff then carry out multiple rounds of comparison and quality inspection, and finally high-quality data is delivered to partners.
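The MD5 check against repeated submission can be sketched with Python's standard `hashlib` module. The `SubmissionRegistry` class is a hypothetical illustration and keeps digests in memory only; a real service would persist them.

```python
import hashlib


def md5_of(data: bytes) -> str:
    """Hex MD5 digest of a submitted file's bytes."""
    return hashlib.md5(data).hexdigest()


class SubmissionRegistry:
    """Remembers digests of accepted files and rejects exact repeats."""

    def __init__(self):
        self._seen = set()

    def accept(self, data: bytes) -> bool:
        digest = md5_of(data)
        if digest in self._seen:
            return False  # identical bytes were already submitted
        self._seen.add(digest)
        return True


registry = SubmissionRegistry()
print(registry.accept(b"photo-bytes-1"))  # True
print(registry.accept(b"photo-bytes-1"))  # False
print(registry.accept(b"photo-bytes-2"))  # True
```

A digest comparison only catches byte-for-byte duplicates. A re-encoded or cropped copy of the same photo produces a different digest, which is why the process described above also relies on face de-duplication and manual comparison.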

Baidu data crowdsourcing partners (part)

It is precisely because Baidu Data Crowdsourcing is strongly competitive in four dimensions (customized services, business scale, data quality, and data security) that it can meet 99% of the scenario needs of AI customers' real-world applications. In 2018, Baidu Data Crowdsourcing's annual revenue was 225 million yuan. Its partners include well-known mobile phone brands such as Huawei, Xiaomi, OPPO, and vivo; well-known automotive companies such as Weilai Automobile, Xiaopeng Automobile, Momenta, and Mercedes-Benz; well-known AI companies such as Kuangshi, Aibi, and Yuncong Technology; and large internet enterprises such as Tencent, NetEase, and eBay. It has supported many partners in launching new products, and it has worked with the government to establish a labeling base that has created jobs for more than 9,000 people.

Round table discussion on "industry data demand analysis and trend outlook"

In the next round table discussion, AI representatives discussed the data needs, the pain points of data accuracy, and the prospects for future data processing platforms.

Wang Wenjun, professor and doctoral supervisor of Intelligence and Computing Department of Tianjin University

On data demand, Wang Wenjun, professor and doctoral supervisor at the Intelligence and Computing Department of Tianjin University, said that because his research focuses on smart cities and public security, his data needs are met mainly by government data, operator data, and open-source data. Zhang He, senior product manager at Xiaomi AI Lab, discussed Xiaomi's current data needs from a commercial perspective: first, voice data for the Xiaoai smart speaker, and second, image data for optimizing the photo features of Xiaomi phones. As for public datasets as a data source, he believes they cannot differentiate an algorithm because the barrier to obtaining them is low. In a highly competitive market, customized data is needed to build differences in technology and products, so Xiaomi prefers to work with the Baidu Data Crowdsourcing platform for customized collection and labeling.

Yang Fei, Chairman of Baidu Technical System Technical Committee

On data accuracy, everyone agreed. Yang Fei, chairman of the Baidu Technical System Technical Committee, believes that the AI era is a data-driven era and that data quality plays a very important role in improving the accuracy of algorithm models. Citing autonomous driving as an example, he said that in past cooperation the high-precision data provided by Baidu Data Crowdsourcing did a great deal to improve Baidu's unmanned vehicle algorithm models. Xiaomi's Zhang He also said that what Xiaomi values most is data accuracy. "Xiaomi has also used other crowdsourcing platforms before, but because these platforms could not meet the quality requirements, this year Xiaomi invested most of its data budget in Baidu Data Crowdsourcing, and Baidu Data Crowdsourcing has completed many projects with high quality."

Zhang He, Senior Product Manager of Xiaomi AI Laboratory

Looking ahead to future data platforms, Zhang He hopes that, given the internationalization and rapid iteration of Xiaomi phones, the platform can offer international collection capability, keep improving collection speed, and support more customized labeling requirements. Baidu Data Crowdsourcing's collection capacity in 22 countries and its privately deployed labeling platform meet exactly these needs. Baidu's Yang Fei hopes that data platforms can significantly raise productivity through technical means, upgrading the "shovel" of the AI era into an "excavator" and thereby reducing labeling costs. Automatic labeling capability is exactly what the Baidu Data Crowdsourcing platform is now vigorously developing; in the future, it will further improve accuracy and reduce labeling costs by combining automated machine labeling with manual labeling. Professor Wang of Tianjin University, noting Baidu's cooperation with the Shanxi government on labeling bases and other areas, hopes that Baidu Data Crowdsourcing will have the opportunity to cooperate with the Tianjin government to accelerate the rollout of the artificial intelligence industry in Tianjin.

With its well-developed process management and advanced technology and platform capabilities, Baidu Data Crowdsourcing now holds a leading position in the industry. Over the next two years, it will continue to focus on the AI strategy, keep improving the platform's professionalism in AI data, commit to leading China in AI data quality, and keep fueling the AI era.

Baidu Artificial Intelligence Basic Data Industry Project Settles in Taiyuan Comprehensive Reconstruction Demonstration Zone

2018-07-02 Baidu data crowdsourcing

On June 28, Baidu and the Shanxi Comprehensive Transformation Reform Demonstration Zone signed a contract for the "Baidu (Shanxi) Artificial Intelligence Basic Data Industry Project". Gao Guorong, QA Director of EBG&TG, signed the agreement with leaders of the Comprehensive Reform Zone on behalf of the company.

The signing was witnessed by Shi Jialiang, Baidu senior technical manager and head of the crowd-testing business; Duan Chao, director of the government affairs cooperation department of Baidu's Public Affairs Department; Zhang Jinwang, deputy secretary general of the Shanxi Provincial Government and secretary of the Party Working Committee and director of the Management Committee of the Comprehensive Reform Demonstration Zone; and others. Liu Yong, Deputy Director of the Management Committee of the Comprehensive Reform Demonstration Zone, presided over the signing ceremony. After the meeting, Gao Guorong held cordial and friendly talks with Director Zhang Jinwang and arranged the next steps of the work.

As AI applications spread, the data annotation industry, a very important link in the AI industry chain, has become a supporting industry for the entire chain. Baidu, the search leader among BAT (the three giants of the Chinese Internet: Baidu, Alibaba, and Tencent), has built an AI architecture on this foundation, and its demand for data annotation is growing strongly. The data annotation industry can therefore bring economic benefits to enterprises and governments, contribute to social development, and help address regional employment.

Gao Guorong said that in the future, data annotation will develop from single-level perceptual annotation to multi-level cognitive annotation, from low-threshold general annotation to high-threshold professional annotation, and from human-driven annotation to technology-driven annotation. These trends show that labeling enterprises cannot stick to the old development model. How to transform quickly from a labor-intensive general outsourcing business into a technology-driven comprehensive labeling enterprise is a problem that currently troubles many of them. Joining the Baidu Data Labeling Industrial Base can help enterprises optimize their existing operating model, build up an operating system for data annotation, and cultivate "artificial intelligence trainers" for the AI era. Enterprises that join the base not only enjoy Baidu's exclusive business support, but also receive training, operational activity support, and initial business introductions. At the same time, they can seize the opportunity to become leaders in the future data annotation industry.

The Baidu Artificial Intelligence Basic Data Industry Project is located in Tanghuai Industrial Park in the Comprehensive Reform Demonstration Zone. The office building has a floor area of 3000 square meters and can accommodate 1000 people working at the same time. In the future, the project will establish a tiered data annotation industrial cluster ranging from ordinary annotation enterprises to professional annotation enterprises, create a data labeling industry hub for the era of artificial intelligence, and form a new business model. Phase I plans to bring in the first batch of data labeling industry alliance enterprises and build them into Baidu data labeling model enterprises. Phase II will rely on the industrial park to attract labeling industry partners to settle in, with the participation of upstream and downstream enterprises in the Internet of Things, BIM (building information modeling), and related extended industries, and will carry out incubation of innovative enterprises, so as to drive the development of the artificial intelligence industry in Shanxi Province and promote employment.
