Detailed description of websocket fast reconnection mechanism - truly stable personal space of Netease Yunxin - OSCHINA - Chinese open source technology exchange community

On the fast reconnection mechanism of WebSocket

By Ma Yingying, Web Front-End Development Engineer, Netease Smart Enterprise

Introduction

In a well-built instant messaging application, WebSocket is a critical link. It gives the client and server of a web application a full-duplex communication channel. However, because neither WebSocket itself nor the underlying TCP connection is fully reliable, developers have to design a complete set of keep-alive, liveness-detection, and reconnection schemes around it to keep the application responsive and highly available in practice. Reconnection speed in particular directly affects the "immediacy" and user experience of upper-layer applications. Imagine that WeChat still could not send or receive messages a minute after the network came back: would that not drive you crazy?

Therefore, quickly restoring WebSocket availability when the network changes is particularly important.

A quick look at WebSocket

WebSocket was born in 2008 and became an international standard in 2011. All major browsers now support it. It is an application-layer protocol, a true full-duplex communication protocol designed specifically for web clients and servers.

You can understand the WebSocket protocol by analogy with HTTP. Their differences:

The protocol identifier of HTTP is http, and that of WebSocket is ws
HTTP requests can only be initiated by the client, so the server cannot actively push messages to the client; with WebSocket it can
HTTP requests are subject to the same-origin restriction, so communication between different origins requires cross-origin handling, while WebSocket has no same-origin restriction

Similarities:

Both are application layer communication protocols
The default ports are the same, 80 or 443
Both can be used for communication between browser and server
Both are based on TCP

Relationship between the two protocols and TCP:

[Figure: HTTP and WebSocket in relation to TCP]

Breaking down the reconnection process

First, consider the question: when is reconnection required?

The most obvious case is that the WebSocket connection is broken: to keep sending and receiving messages, we need to open another connection. In many scenarios, however, the connection is not broken but is in fact unusable, for example when the device switches networks, a router fails somewhere along the link, or the server is persistently too overloaded to respond. In these scenarios the WebSocket is not disconnected, yet the upper layer cannot send or receive data normally. So before reconnecting, we need a mechanism that senses whether the connection and the service are available, and it must sense this quickly so that we can recover quickly from the unavailable state.

Once the connection is perceived to be unavailable, we can discard the old connection and then initiate a new one. These two steps look simple, but doing them fast is not so easy.

The first step is disconnecting the old connection. How can the client disconnect quickly? The protocol stipulates that the client must negotiate with the server before closing a WebSocket connection. But when the client cannot reach the server and therefore cannot negotiate, how can it disconnect and recover quickly?

The second step is quickly initiating a new connection. "Fast" here has a different meaning: it does not mean connecting immediately, which can have an unpredictable impact on the server. Reconnection usually uses a backoff algorithm and is initiated after a delay. But how do we trade off the reconnection interval against resource consumption, and how do we initiate the connection quickly at the "right point in time"?

With these questions in mind, let's take a closer look at the three processes.

Quickly sensing when reconnection is required

Scenarios that require reconnection fall into three types: the connection is broken, the connection is not broken but is unusable, and the service at the other end is unavailable.

The first scenario is simple. If the connection is broken, it must be re-established.

For the latter two, whether the connection or the service is unavailable, the effect on the upper-layer application is the same: instant messages can no longer be sent or received. From this perspective, a simple and crude way to sense when reconnection is needed is a heartbeat timeout: send a heartbeat packet, and if no reply has come back from the server after a specific time, consider the service unavailable, as shown by the scheme on the left in the figure below. This method is the most direct. To make detection faster with it, the only option is to send heartbeats more often. However, heartbeats that are too frequent consume too much mobile data and battery power. So this method alone cannot achieve fast detection; it is best used as a fallback mechanism for checking the connection and the service.

To detect an unusable connection, besides heartbeats you can also watch the network status. Network loss, Wi-Fi switching, and switching between networks are the most direct causes of an unusable connection, so when the network status changes from offline to online, reconnection is needed in most cases, but not always. WebSocket sits on top of TCP, and the TCP connection does not necessarily notice network changes seen at the application layer, so a brief network interruption sometimes leaves the WebSocket connection unaffected, and it can still communicate normally once the network is back. Therefore, when the network goes from disconnected to connected, immediately send a heartbeat packet to check whether the connection is usable. If the server's heartbeat reply arrives normally, the connection is still usable; if no reply arrives before the timeout, you need to reconnect, as shown on the right in the figure above. The advantage of this method is speed: right after the network recovers you know whether the connection is usable and, if it is not, can recover quickly. However, it only covers the case where WebSocket becomes unusable because of a network change visible at the application layer.

To sum up, periodic heartbeat detection is stable and covers all scenarios but is not fast, while detection based on network status is fast and more responsive, with no need to wait for the heartbeat interval, but covers limited scenarios. We can therefore combine the two: send heartbeats periodically at a moderate frequency, such as once every 40 s or 60 s depending on the application scenario, and also send a heartbeat immediately when the network status changes from offline to online to check whether the current connection is usable, recovering at once if it is not. In this way, in most cases the upper-layer communication recovers quickly from the unavailable state. For the few remaining cases, the periodic heartbeat serves as a fallback, and recovery still happens within one heartbeat cycle.
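The combined scheme can be sketched as a small state machine. This is only an illustration under stated assumptions, not the author's implementation: the class name HeartbeatMonitor, its method names, and the 5 s reply timeout are hypothetical, and the caller is assumed to feed it the current time, the server's heartbeat replies, and the platform's offline-to-online event (for example the browser's online event).

```typescript
// Hypothetical sketch: all names and the timeout value are assumptions for illustration.
type HeartbeatAction = "none" | "send-heartbeat" | "reconnect";

class HeartbeatMonitor {
  private intervalMs: number;               // gap between scheduled heartbeats
  private timeoutMs: number;                // how long to wait for the server's reply
  private lastSentAt: number | null = null; // send time of the heartbeat still awaiting a reply
  private nextDueAt: number;                // when the next scheduled heartbeat should go out

  constructor(intervalMs: number, timeoutMs: number, now: number) {
    this.intervalMs = intervalMs;
    this.timeoutMs = timeoutMs;
    this.nextDueAt = now + intervalMs;
  }

  /** Call periodically (for example once per second) with the current time in ms. */
  tick(now: number): HeartbeatAction {
    if (this.lastSentAt !== null) {
      // A heartbeat is in flight: keep waiting, or give up once the timeout has passed.
      return now - this.lastSentAt >= this.timeoutMs ? "reconnect" : "none";
    }
    return now >= this.nextDueAt ? this.markSent(now) : "none";
  }

  /** Call when the network status changes from offline to online. */
  onNetworkOnline(now: number): HeartbeatAction {
    // Probe immediately instead of waiting for the next scheduled heartbeat.
    return this.markSent(now);
  }

  /** Call when the server's heartbeat reply arrives. */
  onHeartbeatReply(now: number): void {
    this.lastSentAt = null;
    this.nextDueAt = now + this.intervalMs;
  }

  private markSent(now: number): HeartbeatAction {
    this.lastSentAt = now;
    return "send-heartbeat";
  }
}

// Usage: 40 s scheduled heartbeat, 5 s reply timeout (illustrative values).
const monitor = new HeartbeatMonitor(40_000, 5_000, 0);
const firstAction = monitor.tick(40_000); // scheduled heartbeat is due
```

The caller acts on the returned value: it sends a heartbeat frame on "send-heartbeat" and starts the reconnection flow on "reconnect". Keeping the timing logic free of timers and sockets makes the two triggers (scheduled and network-recovery) easy to reason about.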

Quickly disconnecting the old connection

In general, if the old connection still exists before the next connection is initiated, it should be disconnected first: first, to release client and server resources, and second, to avoid mistakenly sending or receiving data over the old connection.

We know that WebSocket transmits data over TCP, with the server and the client at the two ends of the connection. In TCP, the side that closes the connection first holds the TIME_WAIT state, and the WebSocket protocol prefers that the server hold it. Therefore, in most normal cases, the server rather than the client should initiate the close of the underlying TCP connection. That is, when a WebSocket connection is to be closed: if the server receives the instruction to close the WebSocket, it should immediately start closing the TCP connection; if the client receives the instruction, it should signal the server and then wait until the server closes the underlying TCP connection or a timeout occurs.

When the client wants to disconnect the old WebSocket, there are two cases: the old connection is usable or unusable. When it is usable, the client can send a close signal directly to the server, and the server then initiates the disconnection. When it is unusable, for example after the client switches Wi-Fi, the client sends a close signal but the server never receives it, and the client can only wait for the timeout to expire before it is allowed to disconnect. Disconnecting by timeout takes relatively long. Is there a faster way?

The upper-layer application cannot change the protocol-level rule that the server should initiate the disconnection, so the only place to start is application logic. For example, the business logic in the upper layer can ensure that the old connection is completely dead, simulate a disconnection, and then initiate a new connection to resume communication. This amounts to treating the old connection as already disconnected once it has failed, so the next step can begin right away. When using this method, you must make sure that the old connection is completely dead as far as business logic is concerned: all data subsequently received from the old connection is discarded, the old connection does not hinder the establishment of the new one, and the old connection's eventual timeout and disconnection do not affect the new connection or the upper-layer business logic.
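One possible way to make the old connection "completely dead" in business logic is to tag each connection with a generation number and ignore every callback that belongs to an older generation. The sketch below is an assumption-laden illustration, not the author's code: ConnectionManager, ConnectionHandlers, Transport, and the injected connect function are hypothetical names, and a real client would wrap the browser WebSocket inside connect.

```typescript
// Hypothetical sketch: every name here is an assumption introduced for illustration.
interface ConnectionHandlers {
  onMessage(data: string): void;
  onClose(): void;
}

interface Transport {
  close(): void; // starts the normal close handshake; may take a long time to finish
}

type Connect = (handlers: ConnectionHandlers) => Transport;

class ConnectionManager {
  private generation = 0;                   // bumped whenever the current connection is abandoned
  private current: Transport | null = null;
  private connect: Connect;
  private deliver: (data: string) => void;

  constructor(connect: Connect, deliver: (data: string) => void) {
    this.connect = connect;
    this.deliver = deliver;
  }

  /** Open a new connection, abandoning any previous one first. */
  open(): void {
    this.abandon();
    const myGeneration = this.generation;
    this.current = this.connect({
      onMessage: (data) => {
        // Data arriving on an abandoned connection is dropped.
        if (myGeneration === this.generation) this.deliver(data);
      },
      onClose: () => {
        // A late close of an abandoned connection must not affect the new one.
        if (myGeneration === this.generation) this.current = null;
      },
    });
  }

  /** Treat the current connection as closed right away, without waiting for the close handshake. */
  abandon(): void {
    this.generation += 1; // invalidates every callback of the old connection
    const old = this.current;
    this.current = null;
    if (old !== null) {
      try {
        old.close(); // best effort; its eventual timeout is ignored by the generation check
      } catch {
        // Closing an already broken transport may throw; nothing more to do.
      }
    }
  }

  isConnected(): boolean {
    return this.current !== null;
  }
}
```

With this shape, the upper layer can call open() as soon as the old connection is judged unusable: the old transport is still asked to close, but nothing it does afterwards (late data, a delayed close event) reaches the business logic or disturbs the new connection.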

Quickly initiating a new connection

Readers with IM development experience will know that when reconnection is caused by the network, you should never initiate a new connection immediately. Otherwise, when network jitter occurs, all devices initiate connections to the server at the same moment, which is no different from a denial-of-service attack in which attackers exhaust network bandwidth with a flood of requests. This is a disaster for the server. Reconnection therefore usually uses a backoff algorithm and is initiated after a delay, as shown in the flow on the left in the figure below.

What if you want to connect quickly? The most direct way is to shorten the retry interval: the shorter the interval, the sooner communication resumes after the network is restored. However, retrying too frequently seriously wastes performance, bandwidth, and battery power. How can we strike a better balance?

A more reasonable approach is, on the one hand, to increase the retry interval gradually as the number of retries grows and, on the other hand, to monitor network changes. When the network status changes from offline to online, reconnection is more likely to succeed, so the reconnection interval can be reduced appropriately, as shown on the right of the figure above (the interval still grows with the number of retries). The two methods are used together.

In addition, the interval can be adjusted according to how likely reconnection is to succeed, based on business logic. For example, the interval can be increased when the network is offline or the application is in the background, and decreased when the network is normal, to speed up reconnection.
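The interval rules above can be sketched as a single function. The article does not specify a concrete algorithm, so everything here is an assumption for illustration: the function name nextReconnectDelay, the 1 s base and 60 s cap, the scaling factors, and the added random jitter (a common companion to backoff that keeps many clients from retrying in lockstep).

```typescript
// Hypothetical sketch: names, constants, and the jitter step are assumptions, not from the article.
interface ReconnectContext {
  attempt: number;         // retries already made since the connection was lost (0 for the first)
  online: boolean;         // current network status as reported by the platform
  justCameOnline: boolean; // this retry was triggered by an offline-to-online change
  inBackground: boolean;   // the application is not in the foreground
}

const BASE_DELAY_MS = 1_000;
const MAX_DELAY_MS = 60_000;

function nextReconnectDelay(
  ctx: ReconnectContext,
  random: () => number = Math.random,
): number {
  // The interval grows exponentially with the number of retries, up to a cap.
  let delay = Math.min(MAX_DELAY_MS, BASE_DELAY_MS * Math.pow(2, ctx.attempt));
  // Reconnection is more likely to succeed right after the network recovers: retry sooner.
  if (ctx.justCameOnline) delay *= 0.25;
  // Reconnection is unlikely to succeed or not urgent: retry later.
  if (!ctx.online || ctx.inBackground) delay *= 2;
  // Jitter: pick a value between 50% and 100% of the computed delay.
  const jittered = delay * (0.5 + random() / 2);
  return Math.round(Math.min(MAX_DELAY_MS, jittered));
}

// Usage: fourth retry, triggered by the network coming back online.
const delayMs = nextReconnectDelay({
  attempt: 3,
  online: true,
  justCameOnline: true,
  inBackground: false,
});
```

The result would typically be passed to setTimeout before opening the next connection. Passing the random source as a parameter is only a convenience so the function can be checked deterministically.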

Conclusion

This article began by dividing WebSocket disconnection and reconnection into three steps: determining when to reconnect, disconnecting the old connection, and initiating a new connection. It then analyzed how to complete each step quickly under different WebSocket and network states. First, detect whether the current connection is usable by sending heartbeat packets periodically, and also monitor network recovery events, sending a heartbeat immediately after recovery to sense the current state quickly and decide whether reconnection is required. Second, under normal circumstances the server disconnects the old connection; when contact with the server is lost, the old connection is simply discarded and the upper layer simulates a disconnection to achieve a quick disconnect. Finally, when initiating a new connection, use a backoff algorithm to delay the connection for a period of time. To balance resource consumption against reconnection speed, increase the reconnection interval when the network is offline and reduce it when the network is normal or has just changed from offline to online, so that the connection is restored as quickly as possible.

