Drawing on the customer's business characteristics, the best practices of Alibaba Cloud's microservice governance and stability governance system, and earlier customer cases, the Alibaba Cloud service team helped the customer develop plans for lossless application publishing, grayscale publishing, capacity planning, and rate limiting and degradation, and then worked with the customer to implement them.
Lossless application publishing avoids front-end exceptions during a release, improves the user experience, and makes changes to the system smoother. To speed up adoption and reduce workload, the plan makes full use of ACK and MSE capabilities: applications shut down gracefully through a configured preStop hook and MSE lossless offline, and start up safely through configured health checks and MSE's online traffic warm-up function, which effectively avoids the traffic loss caused by a release. A cloud efficiency pipeline was customized for the customer based on its current situation. This release scheme, which combines ACK with MSE microservice governance capabilities, achieves lossless online and offline during application release without additional manual operations, reducing the burden on technicians.
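The Kubernetes side of this setup can be sketched as follows. This is a minimal illustration, not the customer's actual configuration: the container name, image, port, `/health` path, and the 30-second drain time are all assumed values, and the MSE-specific lossless offline and warm-up settings are not shown. The sketch only builds a pod spec that carries a preStop hook and a readiness probe.

```python
import json


def build_container_spec(name: str, image: str, port: int = 8080,
                         drain_seconds: int = 30) -> dict:
    """Return a Kubernetes container spec with a preStop hook and a readiness probe."""
    return {
        "name": name,
        "image": image,
        "ports": [{"containerPort": port}],
        # preStop runs before the container receives SIGTERM; the sleep leaves
        # time for callers to stop routing to this instance (assumed value).
        "lifecycle": {
            "preStop": {"exec": {"command": ["sh", "-c", f"sleep {drain_seconds}"]}}
        },
        # The pod receives Service traffic only after this probe succeeds.
        "readinessProbe": {
            "httpGet": {"path": "/health", "port": port},  # hypothetical endpoint
            "initialDelaySeconds": 10,
            "periodSeconds": 5,
        },
    }


def build_pod_spec(container: dict, drain_seconds: int = 30) -> dict:
    """Wrap the container in a pod spec whose grace period outlasts the preStop sleep."""
    return {
        # The grace period includes the preStop time, so it must be longer
        # than the sleep or the container is killed before it can drain.
        "terminationGracePeriodSeconds": drain_seconds + 30,
        "containers": [container],
    }


# Hypothetical application name and image, for illustration only.
container = build_container_spec("demo-app", "registry.example.com/demo-app:1.0.0")
pod_spec = build_pod_spec(container)
print(json.dumps(pod_spec, indent=2))
```

The printed JSON can be placed under `spec.template.spec` of a Deployment manifest. Keeping the termination grace period longer than the preStop delay is the main design point: the grace period covers the hook as well as the shutdown that follows it.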
Grayscale publishing allows traffic to be switched back quickly when a major exception occurs after an application goes live, and lets a new version be tried with a small share of traffic. With MSE providing agent-based access, traffic is colored according to a field in the HTTP header, and Kubernetes declarative deployment is used to attach a grayscale identifier to the application version, so that only colored traffic can enter the version marked with that identifier. This realizes a full-link grayscale capability based on a logical isolation mechanism and enables small-scale traffic verification of the business after a new version is released. If any problem is found in the new version, it can be rolled back promptly to minimize the impact on the business.
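The routing rule behind this kind of logical isolation can be illustrated with a short sketch. It is not MSE's implementation: the header name `x-gray-tag`, the `Instance` type, and the fallback to the base version when a service has no grayscale instance are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Dict, List

GRAY_HEADER = "x-gray-tag"  # hypothetical header name, assumed lower-cased
BASE_TAG = "base"           # tag carried by the normal (non-grayscale) version


@dataclass(frozen=True)
class Instance:
    service: str
    address: str
    tag: str = BASE_TAG


def pick_candidates(instances: List[Instance], headers: Dict[str, str]) -> List[Instance]:
    """Colored requests go to instances with the matching tag; all others go to base."""
    tag = headers.get(GRAY_HEADER, BASE_TAG)
    matching = [i for i in instances if i.tag == tag]
    if matching:
        return matching
    # No instance of this service carries the requested tag: fall back to base
    # so services without a grayscale version can still serve the request.
    return [i for i in instances if i.tag == BASE_TAG]


instances = [
    Instance("order", "10.0.0.1:8080"),
    Instance("order", "10.0.0.2:8080", tag="gray"),
]
print(pick_candidates(instances, {"x-gray-tag": "gray"}))  # grayscale instance only
print(pick_candidates(instances, {}))                      # base instance only
```

For the isolation to hold across the full link, each service has to forward the coloring header on its outgoing calls, so that downstream services apply the same rule.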
Full-link pressure testing is the best way to find system bottlenecks and determine system capacity. The customer's technical team and the Alibaba Cloud service team sorted out more than 50 pressure testing scenarios for the five core systems on the cloud, and designed a full-link pressure testing scheme, data model, and scripts. More than a dozen tuning measures, including Hologres SQL optimization, table partition key and shard optimization, index adjustment, asynchronous application logging, JVM parameter optimization, code optimization, Redis caching, and MQ architecture transformation, improved system stability, raised overall system performance by more than 50 times, significantly improved the utilization of cloud resources, and reduced the overall resource cost by more than 5%.
Rate limiting and degradation give the system a reliable safety rope. There were many slow SQL statements and too little time to optimize them all, and a seriously slow SQL statement could drag down the entire database and put online business at risk of interruption. Bosiden therefore used MSE's database governance capability: peak traffic was estimated from the pressure test results, SQL rate limiting rules were configured, and when database traffic is too high, selected SQL statements either wait or fail fast. This turns uncertain traffic into deterministic traffic and protects the stability of the database and of the entire business.
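The "wait or fail fast" behavior can be shown with a small application-level sketch. It illustrates the idea only and is not how MSE's database governance is implemented or configured: the class names, the concurrency limit, and the wait time are assumptions, and in practice the limit would be derived from the pressure test results.

```python
import threading
from contextlib import contextmanager


class SqlLimitExceeded(RuntimeError):
    """Raised when a statement is rejected by the limiter."""


class SqlConcurrencyLimiter:
    """Caps how many statements run at once; extra callers wait or fail fast."""

    def __init__(self, max_concurrent: int, wait_seconds: float = 0.0):
        self._slots = threading.BoundedSemaphore(max_concurrent)
        self._wait_seconds = wait_seconds  # 0 means fail fast

    @contextmanager
    def slot(self):
        if self._wait_seconds > 0:
            # Wait mode: queue for a free slot, but only up to the timeout.
            acquired = self._slots.acquire(timeout=self._wait_seconds)
        else:
            # Fail-fast mode: reject immediately when no slot is free.
            acquired = self._slots.acquire(blocking=False)
        if not acquired:
            raise SqlLimitExceeded("too many concurrent statements")
        try:
            yield
        finally:
            self._slots.release()


# Usage: with a limit of 1, a second statement issued while the first is
# still running is rejected instead of piling up on the database.
limiter = SqlConcurrencyLimiter(max_concurrent=1)
with limiter.slot():
    try:
        with limiter.slot():
            pass  # the statement would run here
    except SqlLimitExceeded as exc:
        print("rejected:", exc)
```

Fail-fast suits statements whose callers can degrade or retry later, while a bounded wait suits statements that must complete but can tolerate some latency.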