NetEase Qiyu Service Governance Practice - OSCHINA - Chinese Open Source Technology Exchange Community

As we all know, business architecture evolves gradually. As the business and the organization grow, the architecture keeps changing, and the change usually shows up in how business domains are divided. This adjustment is a process: in general, the system is split first and governed afterwards. Splitting alone introduces dependency and coupling problems. This article focuses on the service and module boundary problems that arise as the business architecture evolves, and on how we solved them in practice.

1. Business architecture evolution

Because Qiyu is a complex product, it was designed as a microservice architecture from the start. That architecture has changed significantly as the organization and the business have grown.

Fundamentally, these changes reflect a shift in how the system is split: from "splitting by function" to "splitting by business domain".

Split by function

In the early days everyone was on a single team, so responsibilities were assigned by function. This approach is simple and intuitive, and it is consistent with the single responsibility principle.

At the same time, the code was written in a data-oriented and process-oriented style: each service assembled whatever data it needed by itself, and common logic was extracted and reused through shared Jar packages. As a result, coupling between services was low.

Single responsibility, high cohesion, and low coupling allowed Qiyu to iterate on versions and features very quickly in the early stage of the business.

[Figure: Split by function]

Split by business

As the business grew, Qiyu gradually formed several large, independently sold business lines, along with some smaller but highly independent supporting businesses.

At this stage it became clear that services divided by function could no longer meet the needs of the organization. The most typical symptom was that every business group had to change services in the basic service domain when developing its own features.

At this point the "single responsibility" principle still held, but "high cohesion and low coupling" had been broken. The result was extensive code coupling, unreasonable service dependencies, and release dependencies. These problems hurt the stability and maintainability of the online system and reduced R&D efficiency.

To solve these problems, we started the Qiyu service governance project.

[Figure: Split by business]

2. Service classification

Before we can judge whether a coupling or dependency is reasonable, we need to grade the services. Without grading, technical optimizations such as service prioritization and dependency inversion have no starting point, and there is no basis for allocating resources or planning schedules.

The distinction between core and non-core modules has long been established. Following the same idea, services can be graded by importance.

In Qiyu, we define the service levels as follows (middleware, databases, and other infrastructure are not included):

P0: System-level basic services. If one goes down, users perceive service exceptions over a wide area. There are usually only a few of them (for example, management and query of underlying shared data).
P1: Basic services and core functions of core businesses. If one goes down, the main flow of a core business becomes unavailable.
P2: Non-core business applications and non-core functions of core businesses (data reports, system notifications, and so on).
P3: Internal support businesses (operations back office, operation and maintenance back office, and so on).

After defining the service levels, we follow these basic principles (P0 is the lowest layer):

Lower-layer services must not call upper-layer services directly.
The stability and availability of a lower-layer service must not be constrained by upper-layer services.
Services at the same level should stay logically isolated from each other as far as possible.
Underlying services provide only basic capabilities and keep their models stable.
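
The first principle can be checked mechanically once every service has a level. The following is a minimal sketch, not Qiyu's actual tooling: the service names, the `LEVELS` table, and the `CALLS` graph are hypothetical, and in a real system the call graph would come from tracing or registry data.

```python
# Hypothetical service levels: 0 means P0 (the lowest, most fundamental layer).
LEVELS = {"enterprise": 0, "user": 0, "session": 1, "report": 2, "ops-admin": 3}

# Hypothetical call graph: caller -> services it calls directly.
CALLS = {
    "session": ["enterprise", "user"],
    "report": ["session", "enterprise"],
    "user": ["session"],      # a P0 service calling a P1 service
    "ops-admin": ["report"],
}


def find_reverse_dependencies(levels, calls):
    """Return (caller, callee) pairs where a lower layer calls an upper layer."""
    violations = []
    for caller, callees in calls.items():
        for callee in callees:
            if levels[caller] < levels[callee]:
                violations.append((caller, callee))
    return violations


violations = find_reverse_dependencies(LEVELS, CALLS)
for caller, callee in violations:
    print(f"P{LEVELS[caller]} {caller!r} must not call P{LEVELS[callee]} {callee!r}")
```

A check like this could run in a build pipeline so that a new reverse dependency is flagged before release.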

3. Boundary problems and solutions

Based on service grading and module division, we identified the following problems in day-to-day development:

Code coupling: We went through a transition from splitting by function to splitting by business domain. During the intermediate stage, some services carried the logic of several business domains. Although each of these services was assigned to one business group after the split, the code coupling remained, and the owner still has to modify the code to meet other groups' requirements. Code coupling causes the following problems:

The owner cannot fully control their own code and plan; responding to other groups' needs can disrupt their schedule.
When the owner has no capacity, a non-owner is asked to do the development to meet the deadline, and problems arise from unfamiliarity with the code.
Releases depend on each other, which raises questions of release permissions and release order.

Unreasonable dependency: This takes three forms: reverse dependency, circular dependency, and unreasonable strong/weak dependency.

Reverse dependency: an underlying service depends on an upper-layer service. The inverted call makes the stability of the lower-layer service subject to the upper-layer service.
Circular dependency: A depends on B and B depends on A. The cycle may also pass through intermediate services, as in A -> B -> C -> A. This makes the release order impossible to control.
Unreasonable strong/weak dependency: the dependency is weak in business terms but strong in the service call. In business terms, the failure of the service should not affect core functions, yet in practice the core functions become unavailable when it goes down.
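
Circular dependencies such as A -> B -> C -> A can be found with a depth-first search over the service call graph. The sketch below is illustrative only; the graph format (a dict from caller to callees) and the function name `find_cycle` are assumptions, not part of the original project.

```python
def find_cycle(calls):
    """Return one dependency cycle as a list of services, or None if there is none."""
    WHITE, GREY, BLACK = 0, 1, 2   # unvisited, on the current path, finished
    color = {}
    path = []

    def visit(node):
        color[node] = GREY
        path.append(node)
        for nxt in calls.get(node, []):
            state = color.get(nxt, WHITE)
            if state == GREY:
                # nxt is already on the current path, so we closed a loop.
                return path[path.index(nxt):] + [nxt]
            if state == WHITE:
                cycle = visit(nxt)
                if cycle:
                    return cycle
        path.pop()
        color[node] = BLACK
        return None

    for node in list(calls):
        if color.get(node, WHITE) == WHITE:
            cycle = visit(node)
            if cycle:
                return cycle
    return None


cyclic = {"A": ["B"], "B": ["C"], "C": ["A"]}
acyclic = {"A": ["B", "C"], "B": ["C"], "C": []}
print(find_cycle(cyclic))
print(find_cycle(acyclic))
```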

In governing the basic service boundaries of Qiyu, we used the following technical means to optimize services. A single scenario may require several of them in combination.

[Figure: Boundary governance]

Starting from concrete scenarios, we briefly introduce these technical means below.

4. Boundary governance practice

Split

Splitting falls into the following cases:

If there is no shared code, the code is generally an independent feature of each business party and can be moved out directly.
If there is shared code, there are two cases, depending on whether the shared part is a basic capability or business logic:
- Shared basic capabilities can be extracted into Jar packages or independent services.
- If the business logic is coupled:
  - If the underlying model cannot be separated, this reveals a problem in the business domain division;
  - If the coupling exists only for presentation, the code can live in an aggregation service without affecting the business domain division.

In the early days, all page interfaces were hosted in the same service, so that application had to be classified as P0. Most of these interfaces served settings and data views internal to a single business and shared no code, so they could be moved out directly.

[Figure: Split]

In addition, all pages depend on a set of basic data, such as enterprise information, customer service agent information, and permission information. This is a case where presentation needs depend globally on a basic capability. We therefore moved the query of page basic data into a separate service. Since all pages still rely on this data, the service remains P0.

With this split, the original catch-all P0 service was broken into one P0 service with a single function, simple logic, and stable code, plus a series of P1 and P2 services.

Load on demand + weak dependency degradation

Scenarios that depend on multiple business parties usually involve both strong and weak dependencies.

Weak dependency: scenario A depends on service B, but A is not strongly tied to B. If B is unavailable, A's main flow can still run.
Strong dependency: scenario A depends on service B and is strongly tied to it. If B is unavailable, A's main flow does not work.

For strong dependencies, availability must be guaranteed, and data should be loaded on demand to avoid unnecessary risk. Weak dependencies are allowed to be unavailable, but a degradation plan is needed so that users do not see unfriendly errors when they fail.

In the previous example, before the split, all the basic data a page depended on was loaded together, and a failure to load any one item could cause the whole request to fail. Although a given page does not use all of the basic data, in practice it depended strongly on all of it.

However, because much of the data is shared, writing a separate data-assembly interface for each page is not cost-effective. We therefore introduced GraphQL into the new data loading service to solve this problem.

[Figure: Load on demand]

[Figure: Weak dependency degradation]

GraphQL requires the data to be split into basic units; the client assembles a query and sends it to the server. The query specifies both the atomic data items and the shape of the desired result.

Compared with writing a separate data interface for each page to achieve on-demand loading, this has several advantages:

Query reuse of atomic data
Load on demand
Flexible and adjustable data format
Easy to extend
Rich data manipulation and assembly capabilities
Works across front-end technology stacks

We do not go into GraphQL in detail here. If you are interested, see https://graphql.org/.
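
The idea behind on-demand loading can be illustrated without a GraphQL library. The sketch below is not GraphQL and not Qiyu's data loading service; it is a plain-Python stand-in in which each atomic data item has its own resolver and a page names only the items it needs. All resolver names and returned values are hypothetical.

```python
calls = []  # records which backends were actually queried


def load_enterprise(ctx):
    calls.append("enterprise")
    return {"id": ctx["enterprise_id"], "name": "Example Corp"}


def load_agent(ctx):
    calls.append("agent")
    return {"id": ctx["agent_id"], "nick": "agent-1"}


def load_permissions(ctx):
    calls.append("permissions")
    return ["session:read", "ticket:write"]


# One resolver per atomic data item, reusable by every page.
RESOLVERS = {
    "enterprise": load_enterprise,
    "agent": load_agent,
    "permissions": load_permissions,
}


def load_page_data(requested_fields, ctx):
    """Resolve only the fields the page asked for."""
    return {field: RESOLVERS[field](ctx) for field in requested_fields}


ctx = {"enterprise_id": 1, "agent_id": 42}
data = load_page_data(["enterprise", "permissions"], ctx)
print(data)
print(calls)
```

Because the agent resolver is never invoked for this page, a failure in the agent backend cannot fail the page, which is the point of loading on demand.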

Degradation is usually implemented with Hystrix or Sentinel; we do not go into detail here.
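
To make the strong/weak distinction concrete, here is a minimal sketch of weak dependency degradation in plain Python. It does not use Hystrix or Sentinel and omits what those libraries add (failure-rate tracking, circuit opening, rate limiting); it only shows the fallback step. The decorator `with_fallback` and the announcement service are hypothetical.

```python
def with_fallback(fallback_factory):
    """Decorator for weak dependencies: on any error, return a default value."""
    def decorate(func):
        def wrapper(*args, **kwargs):
            try:
                return func(*args, **kwargs)
            except Exception:
                # Degrade: hide the failure and return a harmless default.
                return fallback_factory()
        return wrapper
    return decorate


def load_enterprise(enterprise_id):
    # Strong dependency: no fallback, so errors propagate to the caller.
    return {"id": enterprise_id, "name": "Example Corp"}


@with_fallback(list)
def load_announcements(enterprise_id):
    # Weak dependency that happens to be down in this example.
    raise TimeoutError("announcement service unavailable")


def render_home(enterprise_id):
    return {
        "enterprise": load_enterprise(enterprise_id),
        "announcements": load_announcements(enterprise_id),
    }


page = render_home(1)
print(page)
```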

Boundary change

There are often many ways to divide business domains, but the division that fits the business best is not necessarily the most practical or reliable one.

For the sake of code maintainability and online stability, it is sometimes necessary to redraw the boundaries. Some guiding principles:

Reduce the number of P0 applications.
Business logic with a stable model, high call volume, and global impact can be grouped together.
After the adjustment, the model boundary must have a clear business meaning so that it is easy to understand and maintain.

Qiyu's "enterprise information management" and "orders and service packages" were originally two separate services. In daily work, however, we found that enterprise management has a high call volume and a stable model, while order logic is complex and changes often but most of its interfaces have a low call volume. Only "service package query" has a high call volume and a stable model.

We migrated "service package query" into "enterprise information management" and conceptually renamed that module "enterprise runtime management". By changing the boundary, we turned two P0 services into one P0 and one P1, and ensured that complex, volatile business logic would not affect the stable underlying service.

[Figure: Boundary change]

Domain model optimization

If a domain model is coupled to data from other domains, the code will be strongly coupled as well. However, as long as the business domain division itself is sound, the coupling can be removed by optimizing the domain model.

A common practice is to store the associated data from other domains in a key-value (KV) table and update it asynchronously in an event-driven way, so that the current domain model does not need to care about the business meaning of the data.

When the model only stores the data without caring about its business meaning, the underlying model becomes general and stable, and reverse dependency and code coupling can be eliminated entirely.

Besides basic user information, Qiyu's user table also stores data belonging to business parties, such as the "last contact time". This coupling clearly pollutes the User model. However, business features do need this information when displaying user information.

[Figure: Domain model optimization - 1]

The user view currently needs to show the "last contact time"; later it may need to show the "last work order time", the "last SMS time", and so on. If we keep changing the code for each such requirement, code coupling follows.

We adjusted at the model level: we added a UserInfoExt table that stores extended information as key-value pairs, and business systems update their data by actively writing the K-V values. This keeps the User model layer stable, corrects the call relationship, and fully decouples the code.
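
The UserInfoExt idea can be sketched as follows. This is an in-memory stand-in, not the real table or schema: the class `UserInfoExtStore`, the event shape, and the key names are assumptions made for illustration. The point is that the user domain stores opaque key-value pairs and never interprets them.

```python
class UserInfoExtStore:
    """In-memory stand-in for a UserInfoExt table: (user_id, key) -> value."""

    def __init__(self):
        self._rows = {}

    def put(self, user_id, key, value):
        # Upsert: the latest value for a key wins.
        self._rows[(user_id, key)] = value

    def get_all(self, user_id):
        return {k: v for (uid, k), v in self._rows.items() if uid == user_id}


ext = UserInfoExtStore()


def on_business_event(event):
    """Hypothetical event handler: stores the pair without interpreting it."""
    ext.put(event["user_id"], event["key"], event["value"])


# Business systems publish updates; the user domain needs no code change
# when a new key such as "last_work_order_time" appears.
on_business_event({"user_id": 7, "key": "last_contact_time", "value": "2020-01-01T10:00:00"})
on_business_event({"user_id": 7, "key": "last_work_order_time", "value": "2020-01-02T09:30:00"})
on_business_event({"user_id": 7, "key": "last_contact_time", "value": "2020-01-03T08:00:00"})
print(ext.get_all(7))
```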

[Figure: Domain model optimization - 2]

Capability push-up

This continues from domain model optimization.

Optimizing a domain model is expensive, and in practice the resources for such a refactoring are not always available. The risk is especially high when the change touches the underlying model of a P0 application.

Consider a scenario where underlying data and upper-layer data must be displayed together. The logic for the combined display does not have to live in the underlying model; the assembly can be pushed up to an upper-layer business system, which decouples the data from the underlying model.

In the example above, User is one of the core services of the whole system, so the risk of changing it is very high. In the end we did not adopt the domain model optimization. Instead we pushed the capability up, from the P0 level to the P1 level.

The core user model drops the "last contact time", and fetching that information is pushed up to the User Gateway service. Although the User Gateway belongs to the basic business domain, it is only responsible for providing the user data that pages need. If it goes down, underlying data flows such as sessions and work orders are not affected, so it is a P1 service.
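
A rough sketch of the push-up follows. The class names, method names, and return values are hypothetical; only the division of responsibility follows the text: the P0 user service returns core fields, and a P1 gateway assembles the page view from the user service and an upper-layer business service.

```python
class UserService:
    """P0 core service: basic user data only, no business fields."""

    def get_user(self, user_id):
        return {"id": user_id, "name": "Alice"}


class SessionService:
    """Upper-layer business service that owns the contact time."""

    def last_contact_time(self, user_id):
        return "2020-01-03T08:00:00"


class UserGateway:
    """P1 service: assembles the data a page needs."""

    def __init__(self, users, sessions):
        self.users = users
        self.sessions = sessions

    def user_card(self, user_id):
        card = dict(self.users.get_user(user_id))
        try:
            card["last_contact_time"] = self.sessions.last_contact_time(user_id)
        except Exception:
            # The business field is a weak dependency of the page.
            card["last_contact_time"] = None
        return card


gateway = UserGateway(UserService(), SessionService())
card = gateway.user_card(7)
print(card)
```

The dependency direction is now gateway -> user service and gateway -> session service; the P0 user service calls nothing above it.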

[Figure: Capability push-up]

Event driven

When a business process is coupled with processes in other business domains, there are two possibilities:

It depends strongly on the result of the upper-layer business process;
It does not depend on the result of the upper-layer business.

If it does not depend on the result of the upper-layer business, the process and its key nodes can be broadcast as lifecycle events, so that upper-layer businesses complete the subsequent steps independently.

To ensure that lifecycle events are consumed successfully and trigger the business logic, the middleware layer must guarantee message delivery and idempotence. In addition, a compensation mechanism for re-executing messages is needed in case execution fails in extreme situations.
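
The delivery guarantees described above can be sketched on the consumer side. This is a simplified, in-memory illustration under stated assumptions: the class `IdempotentConsumer`, the event `id` field, and the retry count are hypothetical, and a real system would persist the processed ids and the failed events rather than keep them in memory.

```python
class IdempotentConsumer:
    """Handles each event id at most once; failed events are kept for compensation."""

    def __init__(self, handler, max_attempts=3):
        self.handler = handler
        self.max_attempts = max_attempts
        self.processed = set()    # ids of events already handled successfully
        self.dead_letters = []    # events awaiting compensation

    def consume(self, event):
        if event["id"] in self.processed:
            return "duplicate"    # redelivery is safe: nothing runs twice
        for _ in range(self.max_attempts):
            try:
                self.handler(event)
            except Exception:
                continue          # retry
            self.processed.add(event["id"])
            return "ok"
        self.dead_letters.append(event)
        return "failed"


attempts = {"count": 0}
initialized = []


def init_business(event):
    attempts["count"] += 1
    if attempts["count"] == 1:
        raise RuntimeError("transient failure")
    initialized.append(event["enterprise_id"])


consumer = IdempotentConsumer(init_business)
event = {"id": "evt-1", "enterprise_id": 100}
first = consumer.consume(event)     # fails once, then succeeds on retry
second = consumer.consume(event)    # same event delivered again
print(first, second, initialized)
```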

Originally, enterprise registration in Qiyu was a simple serial process. If any intermediate setup step failed, the enterprise could not finish initialization and the registration failed.

Failing in this way is not worthwhile. Even if one business fails to initialize, the customer can still try the other businesses, and we do not lose a potential customer entirely.

[Figure: Event driven]

We modeled the enterprise registration lifecycle, broadcast an "Enterprise Created" event, and used events to drive the entire registration process. The advantages are:

Initialization for a new business does not touch the existing official website code, so the code is decoupled.
An initialization failure in one business does not affect the overall registration process, which removes the strong dependency on any single business.
As a P1 service, the official website directly calls only the P0 enterprise and customer service management services, so there is no reverse dependency.
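
The registration flow described above can be sketched with a tiny in-process event bus. In production this role would be played by message middleware; the `EventBus` class, the event name `EnterpriseCreated`, and the three business initializers are hypothetical and exist only to show that one failing subscriber does not fail registration.

```python
class EventBus:
    def __init__(self):
        self._subscribers = {}

    def subscribe(self, event_type, name, handler):
        self._subscribers.setdefault(event_type, []).append((name, handler))

    def publish(self, event_type, payload):
        """Deliver to every subscriber; one failure does not stop the others."""
        failures = []
        for name, handler in self._subscribers.get(event_type, []):
            try:
                handler(payload)
            except Exception as exc:
                failures.append((name, str(exc)))
        return failures


bus = EventBus()
initialized = []


def init_online_chat(payload):
    initialized.append(("online_chat", payload["enterprise_id"]))


def init_call_center(payload):
    raise RuntimeError("quota service down")


def init_ticket(payload):
    initialized.append(("ticket", payload["enterprise_id"]))


bus.subscribe("EnterpriseCreated", "online_chat", init_online_chat)
bus.subscribe("EnterpriseCreated", "call_center", init_call_center)
bus.subscribe("EnterpriseCreated", "ticket", init_ticket)


def register_enterprise(name):
    enterprise = {"enterprise_id": 100, "name": name}   # core P0 step
    failures = bus.publish("EnterpriseCreated", enterprise)
    return enterprise, failures   # failures go to compensation, not to the user


enterprise, failures = register_enterprise("Example Corp")
print(enterprise, failures, initialized)
```

Adding a new business means adding one more `subscribe` call in that business's own code; `register_enterprise` does not change.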

Asynchronous call

This continues from the event-driven approach.

When an underlying process depends on the result of an upper-layer business, there are two ways to handle the dependency:

Transform the domain model to remove the strong dependency. This is thorough but often expensive.
Call the upper-layer service directly and depend on its result. This creates a reverse dependency.

Asynchronous invocation addresses the code coupling and reverse dependency caused by direct calls. The result of the upper-layer business is obtained in an event-driven way instead of through a direct call. Unlike ordinary message-driven processing, an asynchronous call depends on the returned result; unlike a direct call, it does not depend on the callee's interface.

In Qiyu, before a customer service agent is deleted, we must check whether there are unfinished calls, conversations, work orders, and so on. Deleting an agent is part of basic customer service management, so it lives in a P0 service. To verify the business information, it would have to call P1 services.

With the domain model reconstruction approach, the business layer would notify "customer service management" of whether an agent can be deleted and keep that state updated in real time as business processes run, which removes the reverse dependency. However, this intrudes into the core processes of every business party and requires changing existing business logic, so the cost is too high.

[Figure: Asynchronous call]

In Qiyu, we designed and implemented an asynchronous calling component. Suppose agent deletion cares about the results of three services, A, B, and C. A, B, and C each register for the deletion event with the registry. Each deletion process first fetches the list of registered subscribers and then broadcasts the deletion message. After receiving the message, each business party returns its result to the registry. Agent deletion relies on the registry's notification mechanism to obtain the results and decide whether to complete the deletion.

Because the process is asynchronous, it waits with a timeout. There are two modes: in strong dependency mode, a timeout means the operation fails; in weak dependency mode, the operation can still succeed after a timeout.
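
The component itself is not shown in this article, so the following is only a rough in-process approximation of the flow described above. It replaces the message broadcast and the registry's notification mechanism with a thread pool and futures; the class `AsyncCallRegistry`, the event name `AgentDeleting`, and the checker functions are all hypothetical.

```python
import time
from concurrent.futures import ThreadPoolExecutor, wait


class AsyncCallRegistry:
    def __init__(self):
        self._subscribers = {}   # event type -> {service name: checker}
        self._pool = ThreadPoolExecutor(max_workers=8)

    def register(self, event_type, service, checker):
        self._subscribers.setdefault(event_type, {})[service] = checker

    def call(self, event_type, payload, timeout, strong=True):
        """Broadcast to all registered services; return True if the operation may proceed."""
        checkers = self._subscribers.get(event_type, {}).values()
        futures = [self._pool.submit(checker, payload) for checker in checkers]
        done, not_done = wait(futures, timeout=timeout)
        if not_done and strong:
            return False    # strong mode: a timeout fails the operation
        # Weak mode: late replies are ignored; any received veto still blocks.
        return all(f.result() for f in done)


def slow_tickets(payload):
    time.sleep(0.3)         # simulates a business party that replies too late
    return True


registry = AsyncCallRegistry()
registry.register("AgentDeleting", "calls", lambda p: True)
registry.register("AgentDeleting", "sessions", lambda p: p["agent_id"] != 7)
registry.register("AgentDeleting", "tickets", slow_tickets)

strong_result = registry.call("AgentDeleting", {"agent_id": 42}, timeout=0.05, strong=True)
weak_result = registry.call("AgentDeleting", {"agent_id": 42}, timeout=0.05, strong=False)
vetoed = registry.call("AgentDeleting", {"agent_id": 7}, timeout=0.05, strong=False)
print(strong_result, weak_result, vetoed)
```

The caller depends only on the registry and the event name, never on the interfaces of the calls, sessions, or tickets services, which is what removes the reverse dependency.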

The advantages are:

The code is decoupled. To add a new business party D that needs verification, D simply registers for the deletion event in the D service.
The change has no impact on existing business logic or core processes, and its scope is limited.
There is no direct call, so there is no reverse dependency.

Anti-corruption layer

This continues from the event-driven approach and asynchronous calls.

Ideally, each business party responds to the events emitted by the core system and completes its own business process.

In practice, however, other teams are outside our control: their development schedules are uncontrollable and their motivation to make the change is weak. To keep our own optimization moving, the code that depends on business parties needs to be encapsulated together and separated from the P0 service, so that it does not pollute the core model and core processes.

For registration decoupling, business parties need to respond to the enterprise registration lifecycle events. For decoupling agent deletion, business parties need to integrate the asynchronous calling component.

This makes our development depend on other business teams. We therefore added a separate anti-corruption layer application and moved into it the logic that responds to lifecycle events and integrates the asynchronous calling component, which let us complete the transformation.

This way our development could be completed on schedule. At the same time, the coupling with business systems is confined to a single application, which limits the scope of the corruption. Migrating the logic to the business parties later becomes very easy.
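
As an illustration of the anti-corruption layer, the sketch below puts an adapter between the core events and a business system that has not adopted them yet. Everything here is hypothetical, including the legacy system's methods and its identifier formats; the point is only that the translation lives in one place, outside both the P0 service and the business system.

```python
class LegacyTicketSystem:
    """Stand-in for a business system that does not yet consume lifecycle events."""

    def __init__(self):
        self.workspaces = []

    def create_workspace(self, corp_code, title):
        self.workspaces.append((corp_code, title))

    def count_open_tickets(self, staff_no):
        return 2 if staff_no == "S-7" else 0


class TicketAntiCorruptionLayer:
    """Translates core events and checks into the legacy system's vocabulary."""

    def __init__(self, tickets):
        self.tickets = tickets

    def on_enterprise_created(self, event):
        # Responds to the registration lifecycle event on the business's behalf.
        self.tickets.create_workspace(
            corp_code=f"C-{event['enterprise_id']}", title=event["name"]
        )

    def can_delete_agent(self, event):
        # Answers the asynchronous deletion check on the business's behalf.
        return self.tickets.count_open_tickets(f"S-{event['agent_id']}") == 0


tickets = LegacyTicketSystem()
acl = TicketAntiCorruptionLayer(tickets)
acl.on_enterprise_created({"enterprise_id": 100, "name": "Example Corp"})
print(tickets.workspaces)
print(acl.can_delete_agent({"agent_id": 7}), acl.can_delete_agent({"agent_id": 8}))
```

When the business team is ready, the two adapter methods can move into its own service without any change to the core.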

5. Summary

Splitting, load on demand, weak dependency degradation, boundary change, and event-driven design are the starting points of governance. As governance deepens, many problems cannot be solved by simple splitting and boundary changes. Only domain model transformation can solve the service boundary problems thoroughly.

However, model transformation is often very expensive. In practice we have to use the anti-corruption layer, capability push-up, asynchronous calls, and similar means to make sure the transformation can actually proceed, rather than getting stuck in endless scheduling and testing.

After the boundary relationships are sorted out, upper-layer services may still affect the stability of lower-layer services. The main scenario is system pressure caused by uncontrolled calls. That belongs to the topic of circuit breaking, rate limiting, and degradation, and is not discussed in detail here.
