
Solutions to Linux failures

 

Like Windows, the Linux operating system suffers from its share of problems and failures. Many Linux novices are afraid of failures and feel helpless when problems occur; some even give up Linux altogether. In fact, there is no need to fear problems: learning is a process of finding and solving them. As long as we master the basic approach to troubleshooting and have solid background knowledge, every failure can be solved.

As a qualified Linux system administrator, you must have a clear and definite troubleshooting methodology, so that when problems occur you can quickly locate and solve them. Here is a general approach to dealing with problems:

——Pay attention to the error message: every failure produces an error message, and that message generally goes a long way toward locating the problem, so pay close attention to it. If you ignore error messages, the problem will never be solved.

——Check the log files: sometimes the error message only shows the surface of the problem. To understand it more deeply, you must check the corresponding log files, which are divided into system log files (under /var/log) and application log files. Combining the two, you can generally locate the problem (see the example after this list).

——Analyze and locate the problem: this step is relatively complicated. Based on the error message, combined with the log files and other relevant conditions, the root cause of the problem is finally found.

——Solve the problem: once the cause has been found, solving the problem is usually the easy part.
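For example, a quick first look at the logs often starts with the commands below (a minimal sketch; the system log file name depends on the distribution, e.g. /var/log/messages on RHEL/CentOS or /var/log/syslog on Debian/Ubuntu):

# tail -n 50 /var/log/messages

# dmesg | tail -n 20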

As this process shows, solving a problem is essentially the process of analyzing and locating it: once the cause is determined, the fault is as good as fixed.

Now let's look at some common failures and how to solve them:

Problem 1: "Read-only file system" error and solution


Analysis: there are many possible causes for this problem; it may be triggered by inconsistent data blocks in the file system or by a disk failure. The mainstream ext3/ext4 file systems have strong self-healing mechanisms, and for simple errors they can generally repair themselves. When a fatal error cannot be repaired, the file system temporarily blocks write operations in order to guarantee data consistency and safety; that is, the file system becomes read-only, and the "read-only file system" phenomenon described above appears.
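Before attempting a repair, you can confirm that the partition has indeed been remounted read-only by checking its mount flags (a quick check, assuming the affected file system is mounted at /www/data as in the repair below; look for the ro flag in the output):

# mount | grep /www/data

# grep /www/data /proc/mounts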

The command used to repair file system errors manually is fsck. Before repairing a file system, it is best to unmount the disk partition it resides on:

# umount /www/data

umount: /www/data: device is busy

The prompt indicates that the partition cannot be unmounted; a process is probably still using files on this disk. Check as follows:

# fuser -m /dev/sdb1

/dev/sdb1: 8800

Then check which process PID 8800 corresponds to:

# ps -ef | grep 8800

The check shows that apache has not been shut down, so stop apache:

# /usr/local/apache2/bin/apachectl stop

# umount /www/data

# fsck -V -a /dev/sdb1

# mount /dev/sdb1 /www/data

Problem 2: "Argument list too long" error and solution


# crontab -e

After editing, saving and exiting reports the error "no space left on device".

According to this error, the disk space is full, so first check the disk space:

# df -h

It is found that the /var partition has reached 100% usage. So far, the problem is located: the /var disk is full. When crontab saves, it writes the file information into the /var directory, but this partition has no space left, so an error is reported.

Then use the command du -sh * to check the size of all files and directories under /var. It turns out that the /var/spool/clientmqueue directory accounts for 90% of the size of the whole /var partition. The next questions are how the files under /var/spool/clientmqueue are generated and whether they can be deleted; they are basically mail messages and can be deleted.
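The check just described might look like this (a sketch; piping through sort -hr and head to show the biggest entries first is an assumption added for readability, not part of the original session):

# cd /var

# du -sh * | sort -hr | head -5

Then, from inside the /var/spool/clientmqueue directory, try deleting everything: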

# rm *

/bin/rm: Argument list too long

When too many arguments are passed to a command in a Linux system, the error "Argument list too long" appears. This is a long-standing limitation of the Linux system. To view the limit, use the command getconf ARG_MAX:

# getconf ARG_MAX

# more /etc/issue    (view the release version)

Solutions:

1. Delete the files in batches using wildcards:

# rm [a-n]* -rf

# rm [o-z]* -rf

2. Use the find command to delete them (a faster xargs-based variant is sketched after this list):

# find /var/spool/clientmqueue -type f -print -exec rm -f {} \;

3. Through shell script

#!/bin/bash

RM_DIR='/var/spool/clientmqueue'

cd $RM_DIR

for i in `ls`

do

rm -f "$i"

done

4. Recompile the kernel

You need to manually increase the number of memory pages that the kernel allocates to command-line arguments. Open the file include/linux/binfmts.h in the kernel source and find the following line:

#define MAX_ARG_PAGES 32

Change 32 to a larger value, such as 64 or 128, and then recompile the kernel
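As mentioned in method 2 above, piping the output of find to xargs is a faster variant, because it avoids spawning one rm process per file (a sketch; the -print0/-0 pair handles file names containing spaces):

# find /var/spool/clientmqueue -type f -print0 | xargs -0 rm -f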

Problem 3: Application failure caused by inode exhaustion


A customer's Oracle database server was powered off unexpectedly and restarted; afterwards the Oracle listener could not be started, and the error "Linux Error: No space left on device" was prompted.

From the output information, the listener cannot start because the disk appears to be exhausted. Because Oracle needs to create a listener log file when the listener starts, first check the disk space usage:

# df -h

The disk output shows that all partitions still have plenty of space left. The path where the Oracle listener writes its logs is under the /var partition, and the /var partition has sufficient space.

Solution:

Since the error message is related to disk space, we investigate disk space in depth. The disk space used in a Linux system is divided into three parts: the first is physical disk space, the second is the space occupied by inode nodes, and the third is the space Linux uses to store semaphores; physical disk space is the one we deal with most often. Since physical disk space is not the problem here, check whether the inodes are exhausted by executing the command df -i to view the available inode nodes. The output shows that the inodes are exhausted, so no new files can be written.
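The inode check mentioned above is simply:

# df -i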

You can view the total number of inodes in a disk partition with the following command:

# dumpe2fs -h /dev/sda3 | grep 'Inode count'

Each inode has a number; the operating system uses inode numbers to distinguish different files. You can view the inode number corresponding to a file name with the ls -i command.
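For example, using the install.log file examined below (the file name is just the sample from this case):

# ls -i install.log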

If you want more detailed inode information about the file, use the stat command:

# stat install.log

Solution:

# find /var/spool/clientmqueue/ -name "*" -exec rm -rf {} \;

Problem 4: The file has been deleted, but the space has not been freed


The operation and maintenance monitoring system sent a notice that a server's disk was full. Logging in to the server confirmed that the root partition was indeed full. This server has a particular deletion policy: since Linux has no Recycle Bin, all files to be deleted on production servers are first moved to the system's /tmp directory, and the data under /tmp is then cleared periodically. There is nothing wrong with this policy itself, but inspection revealed that the system partitioning of this server did not give /tmp a separate partition, so the data under /tmp actually occupies space in the root partition. Now that the cause is found, delete some of the larger data files under /tmp.

# du -sh /tmp/* | sort -nr | head -3

This command found a 66GB file access_log in the /tmp directory. This file should be an access log generated by Apache; judging from its size, it is an Apache log that has not been cleaned up for a long time. It is basically certain that this file filled up the root partition. After confirming that the file can be deleted, execute the following delete command:

# rm /tmp/access_log

# df -h

From the output, the root partition space still has not been released. What is going on?

Generally speaking, the space is released as soon as a file is deleted, but there are exceptions, for example when the file is held open by a process, or when a process keeps writing data to it. To understand this problem, you need to know the file storage mechanism and storage structure of Linux.

A file stored in the file system is divided into two parts: the data part and the pointer part. The pointer lives in the file system's metadata. When a file is deleted, the pointer is cleared from the metadata while the data part remains on disk; only after the pointer has been cleared can the space occupied by the data be overwritten with new content. The reason the space was not released after access_log was deleted is that the httpd process was still writing to the file. Although access_log had been deleted, the file was still held open by the process, so the pointer was not cleared from the metadata. Since the pointer was not deleted, the system kernel considers the file not fully removed, and the space reported by the df command is therefore not freed.

Troubleshooting:

Now that we have an explanation, let's see whether any process is still writing data to the access_log file. Here we need the lsof command, which can list deleted files still held open by applications:

# lsof | grep deleted

The output shows that /tmp/access_log is held open by the httpd process, which is still writing log data to it. The "deleted" status in the last column indicates that the log file has been deleted, but because the process is still writing to it, the space has not been freed.
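A handy variant is lsof +L1, which lists only open files whose on-disk link count is less than 1, i.e. exactly these deleted-but-still-open files:

# lsof +L1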

Solution:

At this point the cause is clear. There are many ways to solve this kind of problem: the simplest is to stop or restart the httpd process, and restarting the operating system would also work. However, these are not the best methods. To release the disk space occupied by the file, the best way is to truncate the file online, which can be done with the following command:

# echo " " > /tmp/access_log

With this method, the disk space is released immediately, and the process can continue writing logs to the file online. This method is often used to clean up log files generated online by web services such as apache/tomcat/nginx.
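Two equivalent ways to truncate a file in place (both keep the inode, so the writing process is unaffected; truncate is part of GNU coreutils and may be absent on very old systems):

# cat /dev/null > /tmp/access_log

# truncate -s 0 /tmp/access_log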

Problem 5: "too many open files" error and solution


Phenomenon: this is a Java-based web application system. When adding data in the background, it prompts that the data cannot be added. Logging in to the server and checking the tomcat log reveals the following exception: java.io.IOException: Too many open files

From this error message, it can be judged that the file descriptors available to the process are insufficient. Since the tomcat service was started by the system's www user, log in to the system as www and check the maximum number of file descriptors that can be opened with the ulimit -n command. The output is as follows:

$ ulimit -n

65535

It can be seen that the maximum number of file descriptors that can be opened on this server is 65535. This value should be enough, so why does this error still appear?

Solution: this case involves the use of the ulimit command.

There are several ways to use ulimit:

1. Add it to the user's environment variables

If you use bash, you can add "ulimit -u 128" to the .bashrc or .bash_profile file in the user's home directory to limit the maximum number of processes the user can create.

2. Add it to the application's startup script

If the application is tomcat, you can add "ulimit -n 65535" to tomcat's startup script startup.sh to limit the maximum number of file descriptors the user can open.

3. Execute the ulimit command directly in a shell terminal

With this method, the resource limit only takes effect in the terminal where the command was executed. After exiting or closing the terminal, the setting becomes invalid, and it does not affect other shell terminals.
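Note that ulimit distinguishes soft and hard limits; both can be inspected for file descriptors (standard bash builtin options):

$ ulimit -Sn

$ ulimit -Hn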

Solution:

With this ulimit background, back to the case. Since the ulimit value itself looks fine, the error must be caused by the setting not taking effect. First check whether a ulimit restriction was added to the environment variables of the www user that starts tomcat: there is none. Then check whether a ulimit restriction was added to tomcat's startup script startup.sh: after checking, none was found either. Finally, check whether the restriction was added to the limits.conf file, as follows:

# cat /etc/security/limits.conf | grep www

www soft nofile 65535

www hard nofile 65535

From the output, the ulimit limit was indeed added to the limits.conf file. Since the limit was added and the configuration is correct, why is the error still reported? After some thought, there is only one possibility: tomcat was started earlier than the ulimit resource limit was added. So first check when tomcat was started, as follows:

# uptime

up 283 days ...

# pgrep -f tomcat

4667

# ps -eo pid,lstart,etime | grep 4667

4667 Sat Jul 6 09:33:39 2013 77-05:26:02

The output shows that the server has been up for 283 days without a restart, and that tomcat was started at 09:33 on July 6, 2013 and has been running for about 77 days. Then check the modification time of the limits.conf file:

# stat /etc/security/limits.conf

The stat command makes it clear: the last modification time of the limits.conf file is July 12, 2013, later than tomcat's start time. Once this is understood, the solution is simple: restart tomcat.
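After the restart, you can verify that the running process has picked up the new limit by reading its limits file under /proc (4667 is the PID from the example above; yours will differ, and the kernel must be new enough to expose /proc/<pid>/limits):

# cat /proc/4667/limits | grep 'open files'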

Problem 6: (a common Apache failure case) "No space left on device" error and solution


Error phenomenon: the customer reports that executing "apachectl start" to start Apache produces no error message, but the web pages still cannot be accessed. The customer's website is an online trading platform based on Apache+PHP+MySQL. On hearing the phenomenon described by the customer, the first reaction is that the firewall is blocking the HTTP port, or that it is a selinux problem. So log in to the server and check: the output shows that all firewall policies are in the open state with no restrictions, and selinux is also disabled, so the firewall should not be the cause. Since it is not the firewall, check whether the httpd process exists and whether the httpd port has started normally:

# ps -ef | grep httpd | grep -v "grep" | wc -l

0

# netstat -nultp | grep 80

# /usr/local/apache2/bin/apachectl start

# ps -ef | grep httpd | grep -v "grep" | wc -l

0

These operations first checked the httpd processes on the server and found none running; the port 80 corresponding to httpd was not listening either. So Apache was restarted, with no error reported during startup, yet afterwards there was still no httpd process running. From this point of view, there should be a problem inside Apache itself.

Solution:

Having judged it to be an Apache problem, the first thing to look at is the Apache startup log. Looking at the Apache error log revealed a suspicious line:

No space left on device: mod_rewrite: could not create rewrite_log_lock. Configuration failed

Seeing this error, the first thought was that disk space was exhausted, so all the disk partitions of the system were quickly checked, and all of them still had plenty of free space, which was strange. As described in detail in the earlier case, disk space in Linux is consumed in three ways: physical disk space, inode node space, and semaphore space. Checking the server's physical disk space showed plenty left, so physical space was ruled out. Then the available inodes were checked with the df -i command, and every partition had many inodes free, so inode exhaustion was also ruled out. The problem should therefore be exhaustion of semaphore resources.

Here is a brief introduction to Linux semaphores. A semaphore is a locking mechanism used to coordinate mutually exclusive access to critical resources between processes, ensuring that a shared resource is not accessed by multiple processes at the same time. Linux semaphores are used for interprocess communication and have two standard implementations, POSIX and System V. Most Linux systems implement both standards; both can be used for communication between threads, but the system call interfaces differ slightly.

System V semaphores are created with the semget system call; the System V semaphores and shared memory used for interprocess communication can be displayed with the Linux command ipcs.

POSIX semaphores can be used for communication between threads and between processes, and come in two types, named and unnamed, the difference being whether the semaphore is saved on disk.

Solution:

# cat /proc/sys/kernel/sem

# ipcs -s | grep daemon

Here daemon is the user that starts the Apache processes; the default is daemon or nobody, depending on the actual environment. The solution to semaphore exhaustion is simple: clear them with the ipcrm command. The simplest method is to execute the following command combination (an equivalent awk-based variant is shown right after it):

# ipcs -s | grep nobody | perl -e 'while (<STDIN>) { @a=split(/\s+/); print `ipcrm sem $a[1]` }'
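An equivalent combination using awk and xargs instead of perl (this assumes the second field of the ipcs -s output is the semaphore ID, which it is on most Linux systems; verify on yours before running):

# ipcs -s | grep nobody | awk '{print $2}' | xargs -n 1 ipcrm sem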

Problem 7: Solution to Linux system failure to start


This is among the most common Linux faults. The system may fail to start after a power failure, a configuration change, a software upgrade, or a kernel upgrade. There are many possible causes, including the following:

1) The file system is damaged. Usually it is the root partition file system of Linux that is damaged, causing the system to fail to start. This is typically caused by a sudden power failure or an unclean shutdown.

2) Improper file system configuration, for example an incorrect or missing configuration in the /etc/inittab or /etc/fstab file, results in system errors and failure to start. This is generally caused by human error during configuration updates.

3) The Linux kernel file is lost or corrupted, so the system cannot boot. This may be caused by a kernel upgrade error or a kernel bug.

4) The system boot loader has problems, for example grub is lost or damaged, so the system cannot boot. This is generally caused by erroneous manual modification or by file system failures.

5) System hardware failures, such as problems with the motherboard, power supply, hard disk, etc., cause the Linux system to fail to start normally. This situation is basically caused by server hardware problems.

Problem 8: The file system is damaged and the system cannot be started


Phenomenon: during boot, the following messages appear on the console:

Checking root filesystem

/dev/sda6 contains a file system with errors, check forced

An error occurred during the file system check

From this error, there is a problem with the file system on the /dev/sda6 partition. This problem occurs quite often, and it is usually caused by a sudden power failure that leaves the file system structure inconsistent. The usual solution is to force a repair with the fsck command:

# umount /dev/sda6

# fsck.ext3 -y /dev/sda6

Problem 9: Access rights


When some services cannot be accessed, be sure to check whether they are blocked by iptables, the native Linux firewall. You can view the configured iptables policies with the iptables -L -n command:

# iptables -L -n

# iptables -A INPUT -i eth0 -p tcp --dport 80 -j ACCEPT

# iptables -A OUTPUT -p tcp --sport 80 -m state --state ESTABLISHED -j ACCEPT
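Note that rules added with iptables -A live only in memory and are lost on reboot. On RHEL/CentOS systems of this era they can be persisted as follows (an assumption about the distribution; on other systems, redirect the output of iptables-save to the appropriate rules file instead):

# service iptables save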

