Technical preparations for websites with millions of visitors

2012/08/17 09:34

From a purely technical standpoint, the growth of open source has made building a small website simple and cheap, so many people start their businesses with Internet applications. Most of them either know little about technology or are not proficient in it, and knowledge about website development and maintenance is scattered and costly to learn. This article therefore brings those knowledge points together and describes systematically what problems may arise as a small website grows from a few thousand visits a day to one or two million visits a day, and how to do enough work at the beginning to avoid them.

Through hard work, your website's traffic gradually increases, and as it does, problems may begin to appear. Bandwidth growth, hardware expansion, and hiring bring obvious cost increases, and a considerable part of the cost comes from code refactoring, architecture refactoring, and even replacing the underlying development language. The worst case is data loss, which wastes all the effort. Most of these costs can be avoided at the start: lay a good foundation and you will save a great deal of energy and worry later.

The choice of technical route depends on the initial budget. Assume the website is still just an idea, and the plan is to spend about 50,000 yuan on server hardware and bandwidth in the first year. Within this budget there are many options: renting a virtual host, renting a dedicated server, using the currently popular private cloud, or colocating your own server. With the first two options, the website has to be migrated once it reaches a certain scale, and redoing the planning at that point clearly has a larger impact. Because a colocated server is configured by you and fully under your control, most websites of some scale adopt this model. For a website running on its own colocated servers, the following points deserve attention from the start:

1、 Development Language

Generally, programmers choose the language they know best, based on their own technical background. But they cannot write all the code alone forever, so the choice of language deserves some thought. First, whatever language is used, final code quality comes down to management, so the analysis here looks at early development cost. The languages currently popular for websites in China include Java, PHP, .NET, Python, and Ruby. Python and Ruby became popular in China relatively late, so recruiting is still relatively difficult. The .NET platform has relatively many developers, but solving performance problems at a later stage demands relatively high skill. That leaves Java and PHP, which arguably have the most users. Java and PHP cannot be compared at the language level, but for a website whose early application is carried almost entirely by the front end, PHP is easy to learn and fast to write, which gives it a relatively large advantage. As for back-end work such as behavior analysis, bank interfaces, and asynchronous message processing, choose a language to suit each business requirement when the need actually arises.

2、 Code version management

Any website of even modest scale needs code version management. Its two greatest benefits are that it makes collaboration convenient and that it keeps a history that can be queried and compared. There are many version management tools, such as vss/cvs/svn/hg; among those currently popular in China, svn is still widely used.

Supposing svn is selected, there are several considerations. The first is the tree structure. At the beginning there may be only one trunk; later, branches are needed, such as a development branch and an online branch, and eventually perhaps one branch per team. With few people, it is recommended to start with two branches, development and online. Each feature is committed to the development branch after it passes local testing, then tested as a whole and merged into the online branch at release time. If everyone creates their own branch, merging wastes a lot of energy, which costs too much time for a WEB application that is modified several times a day.

Code can be deployed to the server manually or automatically. Manual deployment is relatively simple: generally, either run svn update directly on the server, or svn checkout into a new directory and then point the web root at it with ln -s. The more complex the application, the more complex the deployment, and there is no unified standard. Just do not deploy by ftp upload. First, files that reference each other are inconsistent while the upload is in progress, which raises the error rate. Second, the developer's copy easily drifts from the online version, so that what was meant to fix a typo ends up rolling back other changes. With multiple servers, automated deployment is still recommended: the machine being updated temporarily leaves the service pool and rejoins it after the update completes.
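
As a rough illustration of the checkout-then-symlink approach above, the following Python sketch switches the web root to a new release directory. The function name switch_release and any paths passed to it are hypothetical, and the sketch assumes a POSIX system where the web root is itself a symbolic link. It is one possible way to do the switch, not a standard procedure.

```python
import os


def switch_release(release_dir: str, web_root: str) -> None:
    """Point the symlink web_root at release_dir in a single rename.

    Assumption: web_root either does not exist yet or is already a
    symbolic link (not a real directory).
    """
    tmp_link = web_root + ".tmp"
    # Clear a leftover temporary link from an interrupted earlier run.
    if os.path.lexists(tmp_link):
        os.remove(tmp_link)
    # Build the new link under a temporary name first ...
    os.symlink(release_dir, tmp_link)
    # ... then rename it over the old link. On POSIX the rename replaces
    # the old link itself in one step.
    os.replace(tmp_link, web_root)
```

Because the last step is a single rename, a request sees either the old release or the new one, never a half-updated tree, which is the main weakness of uploading files one by one.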

3、 Server hardware

In every machine room there are countless websites run on a single server. However, if funds allow, at least three standard servers are recommended, one each for web processing, database, and backup. The web server needs at least 8G of memory and dual SATA disks in RAID 1; if the budget is a little looser, or there are many static files or pictures, use 15k SAS disks in RAID 10. The database server should have at least 16G of memory and 15k SAS disks in RAID 10. The backup server should be configured the same as the database server. Hardware can be fully branded, white-box (compatible) machines, or half branded and half self-assembled, depending on budget. This is of course a typical combination; for some types of applications the performance bottleneck appears first on the web tier, which needs separate analysis.

The web server can run both the application and the memory cache, while the database server runs only the primary database (if it is MySQL). The backup server carries more responsibilities: its web, cache, and database configuration should match the other two, so that if either the WEB server or the database server has a problem, the backup server can easily be switched in as a temporary replacement until the problem is solved. Note that hardware may fail at any time, especially hard disks. It is therefore better to put the WEB server and the database on the same machine than to go without a backup. Backups must include an off-site copy and an asynchronous copy, because a power failure or an operator mistake can destroy all the data on a machine. There are many open source backup schemes to choose from; the simplest is rsync, run from crontab to synchronize on a schedule. Test backup and switchover thoroughly, choose the scheme that is safest and best suited to the business, and keep off-site backups wherever possible.

4、 Machine room

Try to avoid these kinds of machine rooms (data centers): China Unicom machine rooms with extremely slow access, and China Mobile or China Railcom machine rooms. Visit and test as many data centers as possible to find one with good network quality and strict management. The data center is very important because it directly determines how fast the website can be accessed, and access speed directly affects user experience. A website that loads very slowly will find it hard to win users.

5、 Architecture

Broadly, the well-known architecture is web load balancing + database master-slave + cache + distributed storage + queue. From the start, design and write code with scalability in mind. Give extra thought to the avalanche effect when the cache fails, data consistency and the time lag in master-slave replication, the stability of the queue and the retry strategy after failure, the efficiency and backup method of file storage, and other unexpected situations. Cache failures, interrupted database replication, queue write errors, and power supply failures occur often in real operations. If you do not pay attention to these problems in advance, recovery may take far longer than expected.
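
One common way to soften the cache avalanche mentioned above is to avoid having many keys expire at the same moment. The Python sketch below adds random jitter to a base expiry time. The name jittered_ttl and its parameters are illustrative and do not belong to any particular cache library; the result would be passed as the expiry when writing to whatever cache is in use.

```python
import random


def jittered_ttl(base_ttl: int, spread: float = 0.1, rng=random) -> int:
    """Return base_ttl (in seconds) shifted by up to +/- spread * base_ttl.

    Keys written at the same time then expire at slightly different
    times, so they do not all fall through to the database together.
    """
    delta = base_ttl * spread
    # Never return less than one second, even for tiny base values.
    return max(1, int(base_ttl + rng.uniform(-delta, delta)))
```

With a base of 300 seconds and the default spread, expiry times fall between 270 and 330 seconds. Jitter only spreads out expiry; it does not help if the cache server itself goes down, which still needs a fallback plan.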

6、 Server software

For the operating system, Linux is very popular. Without professional operations staff, prefer a distribution with many users, an active community, convenient configuration, and easy upgrades, such as the RH series, Debian, or Ubuntu Server. Choose hardware and operating system together and check that suitable drivers exist. If you decide to use a commercial software product or solution, find out in advance which operating system it supports best. Among the three web servers Apache, nginx, and lighttpd, Apache has the largest share, but tuning it to perform well still takes expertise, whereas nginx and lighttpd achieve good performance without much adjustment. Whatever software you choose, unless you have modified its source or your program is genuinely incompatible with the new version, newer is generally better: a newer version usually brings more features, fewer bugs, and better performance. For a typical PHP website, most people have never changed any server software source code, so in most cases they can upgrade smoothly. Upgrades that change a lot, like jdk5 to jdk6 or python2 to python3, are relatively rare. Read the ChangeLog and the upgrade notes, then evaluate and test against your own situation. The earlier you upgrade the better; the later you upgrade, the higher the cost. For software packages, use the distribution's built-in package management tools where possible. Unless there are special requirements, compiling software yourself is not recommended, because it makes future operations and maintenance harder.

7、 Database

Almost all operations end up in the database, which is the hardest component to scale (storage is also very hard). The common ways to scale a database are replication and sharding (fragmentation). During design, consider how the data of each application would be replicated and sharded; in practice this consideration is usually deferred to the technical design stage. When designing the initial database structure, consider whether to split databases and partition tables according to business type and expected growth, and try to avoid join queries and auto-increment IDs so that sharding is easier. Replication delay and data consistency between master and slave can be checked with scripts you write yourself or with existing operations tools.
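
To make the point about IDs and sharding (fragmentation) concrete, here is a minimal Python sketch, under stated assumptions, of generating IDs in the application rather than in the database and routing each key to a shard by hash. The shard names, new_id, and shard_for are all hypothetical, and hash-modulo routing is only one of several possible schemes.

```python
import hashlib
import uuid

# Hypothetical shard names; a real setup would map these to connections.
SHARDS = ["db0", "db1", "db2", "db3"]


def new_id() -> str:
    """Generate an ID in the application, so no single database has to
    hand out the next auto-increment value."""
    return uuid.uuid4().hex


def shard_for(key: str, shards=SHARDS) -> str:
    """Map a key to a shard with a stable hash, so every web server
    routes the same key to the same database."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return shards[int.from_bytes(digest[:8], "big") % len(shards)]
```

A caller would create a record ID with new_id() and pick the database with shard_for(record_id). Note that plain modulo routing moves most keys when the number of shards changes, so the shard count is worth deciding with growth in mind.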

Stored procedures make scaling difficult. They are often introduced by developers who come from traditional C/S systems, especially OA systems. A low-cost website does not run on one or two minicomputers with a single database handling all business; it runs on a large number of inexpensive machines. Ease of horizontal scaling matters far more than the parsing time and network traffic that stored procedures save.

In addition, NoSQL is a popular concept, which can be understood as databases that are not traditional relational databases. In practice, websites face increasingly intensive write operations, reads over hundreds of millions of simple relational records, hot standby, and so on, none of which traditional relational databases handle well. As a result, many non-relational databases have emerged, such as Redis, TC&TT, MongoDB, and Memcachedb. In tests, almost all of these reach at least 10000 write operations per second, and the in-memory ones even exceed 50000. During design, decide whether to use such databases according to business characteristics and performance requirements. For example, MongoDB can be configured in a few lines to form an environment with replication + automatic sharding + failover, and document storage also simplifies the traditional workflow of designing the schema before developing. But when you decide to adopt a technology, you must really understand its strengths and weaknesses. For example, the technology you choose may not support the transaction and data consistency guarantees you need.

8、 File storage

Distributing storage is almost as difficult as scaling the database, but with only about a million PVs, disk IO is generally not a major problem. One or two machines with SATA disks in striped RAID can cope, although doing asynchronous backup yourself is more complicated because there are so many small files. If only one machine is used for storage, simple optimizations are possible, for example putting the smallest thumbnails on one partition and the medium thumbnails on another, and setting each partition's block size according to the average file size. Plan the storage directory structure well; otherwise maintenance becomes complex once the number of files grows, which hinders expansion. Future capacity expansion should also be considered, for example by using LVM or by hashing files to different machines according to different rules. Disks are more likely to fail under heavy IO, so make good backups, and replace a broken disk immediately when one is found. In many cases, once one disk fails, others fail one after another.
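
One way to plan the directory structure, sketched here in Python under stated assumptions, is to derive nested subdirectories from a hash of the file name so that no single directory accumulates too many files. The function name storage_path, the two-level layout, and the two-character directory names are illustrative choices, not something prescribed above.

```python
import hashlib
import os


def storage_path(root: str, filename: str, levels: int = 2, width: int = 2) -> str:
    """Build a path like root/ab/cd/filename from a hash of the name.

    With two levels of two hex characters there are 256 * 256 leaf
    directories, which keeps each directory small as files accumulate.
    """
    digest = hashlib.sha256(filename.encode("utf-8")).hexdigest()
    parts = [digest[i * width:(i + 1) * width] for i in range(levels)]
    return os.path.join(root, *parts, filename)
```

The same hash prefix could later decide which machine holds a file, which fits the idea of hashing files to different machines as capacity grows.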

To prepare for serving pictures from a CDN in the future, give pictures a separate domain name from the start and do not use the main domain name. Many websites set their cookie on .domain.ltd; if images are served under this domain too, the cookie is likely to defeat caching and add unnecessary traffic, and the browser's limit on concurrent connections per host may also slow access.
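
The cookie issue can be seen with a simplified model of how a browser decides whether to attach a cookie to a request. The Python sketch below is an approximation only: it ignores paths, IP addresses, and cookies set without a domain, and the host names in the usage note are made up for illustration.

```python
def cookie_sent_to(cookie_domain: str, host: str) -> bool:
    """Approximate browser domain matching for a cookie.

    A cookie set on ".domain.ltd" is attached to requests for that
    domain and for every subdomain of it.
    """
    domain = cookie_domain.lstrip(".").lower()
    host = host.lower()
    return host == domain or host.endswith("." + domain)
```

Under this model, cookie_sent_to(".domain.ltd", "img.domain.ltd") is True, so every image request carries the cookie, while a host on a separate domain such as a hypothetical "img.domain-static.ltd" gives False and stays cookie-free.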

9、 Program

Under given hardware conditions, how much traffic an application can carry depends in large part on how the program is written. A poorly written program may fail to carry even 10000 visits; a well-written one may carry millions of PVs on one or two machines. The more complex and real-time the application, the harder it is to optimize. For ordinary websites, though, there is one unifying idea: push work toward the front as far as possible, reduce database operations, and reduce disk IO. Pushing work toward the front means that, without affecting function or experience, whatever can run in the browser should not run on the server; whatever the cache server can return directly should not go to the application server; results the program can compute itself should not be fetched externally; data available on the local machine should not be fetched remotely; data available in memory should not be read from disk; and data available in the cache should not be queried from the database. Reducing database operations means reducing the number of updates, caching results to reduce queries, and doing as much as possible in your own program (join queries, for example). Reducing disk IO means avoiding the file system as a cache wherever possible and reducing the number of file reads and writes. Program optimization always means optimizing the slow parts; merely changing syntax does not count as optimization.
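
The rule that data already in the cache should not be queried from the database is often implemented as a read-through lookup. The Python sketch below uses an in-process dictionary as a stand-in for a real cache server. The class name CacheAside, its loader argument (any function that fetches a value from the database), and the default expiry are assumptions made for illustration.

```python
import time


class CacheAside:
    """Serve reads from memory and call the loader only on a miss."""

    def __init__(self, loader, ttl: float = 60.0, clock=time.monotonic):
        self._loader = loader   # function that does the real (slow) query
        self._ttl = ttl         # seconds a cached value stays valid
        self._clock = clock     # injectable so expiry can be tested
        self._store = {}        # key -> (value, expires_at)

    def get(self, key):
        entry = self._store.get(key)
        now = self._clock()
        if entry is not None and entry[1] > now:
            return entry[0]                 # hit: no database work
        value = self._loader(key)           # miss or expired: query once
        self._store[key] = (value, now + self._ttl)
        return value

    def invalidate(self, key) -> None:
        """Drop a key after the underlying row is updated."""
        self._store.pop(key, None)
```

Repeated reads of the same key within the expiry window then cost one database query instead of one per request. The update path should call invalidate so readers do not keep seeing stale data until the entry expires.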

However, programming should focus not on optimization but on scalability. Requirements for today's WEB applications change very quickly, and no architecture can adapt to every requirement. Scalability efforts should concentrate on the parts that interact with the lower layers, such as the access rules for persistent data and cache, and on common services such as user information. Get the parts that do not change right first, and it becomes easy to concentrate on business logic for the rest.
