Detailed explanation of the pure (Chunzhen) IP database format

Posted 2011-04-27

Ever since IP databases appeared, the IP display feature of QQ plug-ins has come along with them. My experience is rather narrow, so I don't know whether there are other applications, but an IP database really is a handy thing. I think the most popular IP database on the net today is the pure (Chunzhen) edition (don't beat me up if I'm wrong). So far it holds close to 30000 IP records, and for some IPs it is accurate down to the floor of a building, which is quite a delight. In April and May 2004, when work on LumaQQ was just breaking ground, I adopted the pure IP database too, in order to add the IP display feature that everyone likes, though nobody seems to know why they like it. Its advantages are a large number of records and fast queries. It keeps all records in a single file, QQWry.dat, which makes it easy to embed in other programs and easy to upgrade.

Basic structure

The QQWry.dat file is structurally divided into three parts: the file header, the record area, and the index area. To look up an IP address, we first find the record offset in the index area and then read the information from the record area. Because the records in the record area vary in length, it is impossible to search the record area directly. Because there are so many records, traversing the index area linearly would be rather slow. Instead we can search the index area with binary search, which is several orders of magnitude faster than a linear traversal. Figure 1 shows the file structure of QQWry.dat.

Figure 1. QQWry.dat File Structure

Note that QQWry.dat uses little-endian byte order.

1. Understanding the file header

The file header of QQWry.dat is only 8 bytes long, and its structure is very simple: the first four bytes are the absolute offset of the first index entry, and the last four bytes are the absolute offset of the last index entry.
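As a minimal sketch of reading these two pointers, the header can be decoded with a little-endian ByteBuffer. The class name QQWryHeader and its fields are hypothetical and are not taken from LumaQQ. The entry count assumes the 7-byte index entries described in the index area section, and assumes the second pointer refers to the start of the final entry rather than to the byte after it.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

/** Hypothetical helper (not part of LumaQQ): decodes the 8-byte QQWry.dat header. */
final class QQWryHeader {
    /** Size of one index entry: 4-byte start IP + 3-byte record offset. */
    static final int INDEX_ENTRY_SIZE = 7;

    final long firstIndexOffset;
    final long lastIndexOffset;

    QQWryHeader(byte[] header) {
        // The file is little-endian, while ByteBuffer defaults to big-endian.
        ByteBuffer buf = ByteBuffer.wrap(header, 0, 8).order(ByteOrder.LITTLE_ENDIAN);
        // Mask to keep the values unsigned when widening to long.
        firstIndexOffset = buf.getInt() & 0xFFFFFFFFL;
        lastIndexOffset = buf.getInt() & 0xFFFFFFFFL;
    }

    /** Number of index entries; the last offset points at the final entry, hence the +1. */
    long indexCount() {
        return (lastIndexOffset - firstIndexOffset) / INDEX_ENTRY_SIZE + 1;
    }
}
```

With a RandomAccessFile opened on QQWry.dat, the eight header bytes could be obtained with readFully on an 8-byte array and passed to this constructor.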

2. Understanding the record area

Each IP record consists of a country name and a region name. "Country" and "region" are not very accurate terms here, because a lookup may return something like "Tsinghua University, Department of Computer Science", in which case Tsinghua University becomes the country name. What ends up in each part depends on how the IP database was compiled. The record format is therefore a bit like a QName, made up of a global part and a local part. We will nevertheless keep calling them the country name and the region name.

So we might imagine that the format of a record is: [IP address] [country name] [region name]. That is fine, but it is only the simplest case. Obviously, country names and region names are duplicated many times over, and saving a complete copy of the name in every record is far from ideal, so redirection is used to save space. A country name or region name can therefore take one of two forms: the first is a direct string, and the second is a 4-byte structure whose first byte indicates the redirection mode and whose last three bytes are the actual offset of the country name or region name. For the country name the situation can be more complicated, because there may be up to two such redirects in a chain.

So what is the redirection mode? As described above, the format of a record is [IP address] [country record] [region record]. If the country record is redirected, the region record may be absent at that position, so there are two cases, which I call mode 1 and mode 2. The figures below illustrate these formats:

Figure 2. The simplest form of IP records

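Figure 2 shows the simplest IP record format: the IP address followed by the country name and the region name, each stored as a string ending in a 0 byte. I don't think it needs any further explanation.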

Figure 3. Redirect Mode 1

Figure 3 illustrates redirection mode 1. In mode 1 the region record follows the country name at the redirect target. After the IP address there is only the four-byte country record: its first byte is the identification byte of mode 1, 0x01, and its next three bytes form a pointer to the actual country name, which is immediately followed by the region name.

Figure 4. Redirection Mode 2

Figure 4 illustrates redirection mode 2 (identification byte 0x02). In mode 2 the region record does not follow the country name at the redirect target, so the region record sits right after the four-byte country record instead. By now the difference between the two modes should be clear: in mode 1 no region record follows the country record in place, while in mode 2 the region record does follow the country record in place. Now let's look at a more complex situation.

Figure 5. Mixing 1

Figure 5 illustrates a more complex situation that can occur when the country record is in mode 1: the location the redirect points to is itself a redirect, and this second redirect is in mode 2. Don't worry, there is no mode 3, and there are at most two redirects in a chain. If a second redirect occurs, it must be mode 2, and this only happens for country records. For region records, mode 1 and mode 2 mean the same thing, and a region record is never redirected twice. The picture can get more complex still, as shown in Figure 6:

Figure 6. Mixing 2

Figure 6 is the most complicated mixed case in mode 1, but it should still be easy to follow: the only difference is that the region record is redirected as well. One reminder: if the redirect offset is 0, the region name is unknown.

To summarize: an IP record consists of [IP address], [country record], and [region record]. A country record has three representations: string form, redirection mode 1, and redirection mode 2. A region record has two representations: string form and redirection. One more rule is that a country record in redirection mode 1 cannot be followed in place by a region record. Every possible IP record is some valid combination of these forms.
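The rules summarized above can be sketched as a small in-memory decoder. This is an illustration under stated assumptions, not the LumaQQ implementation: the class RecordDecoder and its method names are hypothetical, the whole file is assumed to be loaded into a byte array, and an unknown region (redirect offset 0) is returned as an empty string. The charset is passed in by the caller; the LumaQQ fragment later in this article decodes names as GBK.

```java
import java.nio.charset.Charset;

/** Hypothetical in-memory decoder for one IP record; all names are illustrative. */
final class RecordDecoder {
    static final int MODE_1 = 0x01;
    static final int MODE_2 = 0x02;

    private final byte[] data;
    private final Charset charset;

    RecordDecoder(byte[] data, Charset charset) {
        this.data = data;
        this.charset = charset;
    }

    /** Returns {country, region} for the record starting at recordOffset. */
    String[] decode(int recordOffset) {
        int pos = recordOffset + 4;               // skip the 4-byte IP address
        int flag = data[pos] & 0xFF;
        String country;
        int areaPos;
        if (flag == MODE_1) {
            // Mode 1: both country and region are found at the redirect target.
            int target = readOffset3(pos + 1);
            if ((data[target] & 0xFF) == MODE_2) {
                // Second redirect (always mode 2): region follows the 4-byte structure.
                country = readString(readOffset3(target + 1));
                areaPos = target + 4;
            } else {
                country = readString(target);
                areaPos = endOfString(target);
            }
        } else if (flag == MODE_2) {
            // Mode 2: only the country is redirected; region follows in place.
            country = readString(readOffset3(pos + 1));
            areaPos = pos + 4;
        } else {
            // Plain string form: country string, then the region record.
            country = readString(pos);
            areaPos = endOfString(pos);
        }
        return new String[] { country, readArea(areaPos) };
    }

    /** A region record is either a string or a single redirect (flag 1 or 2). */
    private String readArea(int pos) {
        int flag = data[pos] & 0xFF;
        if (flag == MODE_1 || flag == MODE_2) {
            int target = readOffset3(pos + 1);
            return target == 0 ? "" : readString(target);  // offset 0 means unknown
        }
        return readString(pos);
    }

    /** Reads a 3-byte little-endian offset. */
    private int readOffset3(int pos) {
        return (data[pos] & 0xFF)
                | ((data[pos + 1] & 0xFF) << 8)
                | ((data[pos + 2] & 0xFF) << 16);
    }

    /** Returns the index just past the 0 byte that ends the string at pos. */
    private int endOfString(int pos) {
        while (data[pos] != 0) pos++;
        return pos + 1;
    }

    private String readString(int pos) {
        int end = pos;
        while (data[end] != 0) end++;
        return new String(data, pos, end - pos, charset);
    }
}
```

Working on a byte array keeps the sketch free of file-pointer bookkeeping; a file-based reader has to seek to the same positions instead.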

Reason for design

Before we continue to understand the structure of the index area, let's first understand why the structure of the record area should be designed like this. I think you may have the answer: string reuse. Yes, under this structure, I only need to save a country name and region name once. Let's give an example. For convenience, we use lowercase letters to represent IP records, C represents the country name, and A represents the region name:

  • There are two records a (C1, A1) and b (C2, A2). If C1=C2 and A1=A2, we can use the structure shown in Figure 3 to achieve reuse

  • There are three records a (C1, A1), b (C2, A2), c (C3, A3). If C1=C2, A2=A3, now we want to store record b, we can use the structure in Figure 6 to achieve reuse

  • There are two records a (C1, A1) and b (C2, A2). If C1=C2, now we want to store record b, we can use mode 2 to represent C2 and string to represent A2

You can come up with more cases yourself, and you will find that under this structure each distinct string only needs to be stored once.

Understanding the index area

In the section on the file header, we explained that the header is really just two pointers, holding the absolute offsets of the first index entry and the last index entry respectively, as shown in Figure 8:

Figure 8. Diagram of file header pointing to index area

It really is simple, isn't it? From the file header you can locate the index area, and then you can start searching for an IP address. Each index entry is 7 bytes long: the first 4 bytes are the starting IP address, and the last 3 bytes point to the IP record. Some concepts need explaining here: what is the starting IP, and is there an ending IP? Suppose there is a record 166.111.0.0 - 166.111.255.255. Then 166.111.0.0 is the starting IP address and 166.111.255.255 is the ending IP address, and the ending IP address is stored as the first four bytes of the IP record, which you have already seen. Thus each index entry together with its record defines an IP range. To look up 166.111.138.138, you find that it falls within the range 166.111.0.0 - 166.111.255.255, and then follow that index entry to read the country and region names. Here is the most detailed diagram of all:
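The lookup just described can be sketched as a binary search over the 7-byte index entries. This is a hedged illustration, not code from LumaQQ: the class IndexSearch and its methods are hypothetical, the file is assumed to be loaded into a byte array, and each IP is assumed to be stored as a 4-byte little-endian integer whose most significant byte is the first octet of the address.

```java
/** Hypothetical sketch: binary search over index entries held in memory. */
final class IndexSearch {
    static final int ENTRY_SIZE = 7;   // 4-byte start IP + 3-byte record offset

    /** Converts dotted-quad text to an unsigned 32-bit value held in a long. */
    static long parseIp(String dotted) {
        long ip = 0;
        for (String part : dotted.split("\\.")) {
            ip = (ip << 8) | Integer.parseInt(part);
        }
        return ip;
    }

    /** Reads a 4-byte little-endian unsigned integer. */
    static long readUInt32(byte[] d, int pos) {
        return (d[pos] & 0xFFL)
                | ((d[pos + 1] & 0xFFL) << 8)
                | ((d[pos + 2] & 0xFFL) << 16)
                | ((d[pos + 3] & 0xFFL) << 24);
    }

    /** Reads a 3-byte little-endian unsigned integer. */
    static int readUInt24(byte[] d, int pos) {
        return (d[pos] & 0xFF) | ((d[pos + 1] & 0xFF) << 8) | ((d[pos + 2] & 0xFF) << 16);
    }

    /**
     * Returns the offset of the record whose range contains ip, or -1 if none.
     * firstIndex and lastIndex are the two offsets from the file header.
     */
    static int find(byte[] data, int firstIndex, int lastIndex, long ip) {
        int low = 0;
        int high = (lastIndex - firstIndex) / ENTRY_SIZE;
        int found = -1;
        // Find the last entry whose starting IP is <= ip.
        while (low <= high) {
            int mid = (low + high) >>> 1;
            long start = readUInt32(data, firstIndex + mid * ENTRY_SIZE);
            if (start <= ip) {
                found = mid;
                low = mid + 1;
            } else {
                high = mid - 1;
            }
        }
        if (found < 0) {
            return -1;
        }
        int record = readUInt24(data, firstIndex + found * ENTRY_SIZE + 4);
        long end = readUInt32(data, record);   // ending IP: first 4 bytes of the record
        return ip <= end ? record : -1;
    }
}
```

The returned offset is the start of the IP record, so the country and region records begin four bytes after it.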

Figure 9. Detailed structure of document

Now everything is clear, isn't it? Perhaps you are still wondering where the version information of QQWry.dat is stored. The answer is that the last IP record is actually the version information. When displayed, the last record looks like this: 255.255.255.0 255.255.255.255, "Pure Network" (the Chunzhen publisher name), "IP data of June 25, 2004". OK, you should know everything by now.

Demo

Finally, here is a program fragment that reads IP records. It is extracted from the LumaQQ source file edu.tsinghua.lumaqq.IPSeeker.java. If you are interested, you can download the source code for a closer look.

    /**
     * Given the offset of an IP country/region record, returns an IPLocation structure.
     * @param offset start offset of the record
     * @return IPLocation object, or null if reading the file failed
     */
    private IPLocation getIPLocation(long offset) {
        try {
            // Skip the 4-byte IP
            ipFile.seek(offset + 4);
            // Read the first byte and check whether it is a flag byte
            byte b = ipFile.readByte();
            if(b == REDIRECT_MODE_1) {
                // Read the country offset
                long countryOffset = readLong3();
                // Jump to the offset
                ipFile.seek(countryOffset);
                // Check the flag byte again, because this location may itself be a redirect
                b = ipFile.readByte();
                if(b == REDIRECT_MODE_2) {
                    loc.country = readString(readLong3());
                    ipFile.seek(countryOffset + 4);
                } else
                    loc.country = readString(countryOffset);
                // Read the region record
                loc.area = readArea(ipFile.getFilePointer());
            } else if(b == REDIRECT_MODE_2) {
                loc.country = readString(readLong3());
                loc.area = readArea(offset + 8);
            } else {
                loc.country = readString(ipFile.getFilePointer() - 1);
                loc.area = readArea(ipFile.getFilePointer());
            }
            return loc;
        } catch (IOException e) {
            return null;
        }
    }

    /**
     * Parses the bytes starting at the offset and reads a region name.
     * @param offset start offset of the region record
     * @return region name string
     * @throws IOException if reading the file fails
     */
    private String readArea(long offset) throws IOException {
        ipFile.seek(offset);
        byte b = ipFile.readByte();
        if(b == REDIRECT_MODE_1 || b == REDIRECT_MODE_2) {
            long areaOffset = readLong3(offset + 1);
            if(areaOffset == 0)
                return LumaQQ.getString("unknown.area");
            else
                return readString(areaOffset);
        } else
            return readString(offset);
    }

    /**
     * Reads three bytes at the offset position as a long. Because Java is
     * big-endian, there is no choice but to use a function like this for the conversion.
     * @param offset start offset of the integer
     * @return the long value read; -1 indicates that reading the file failed
     */
    private long readLong3(long offset) {
        long ret = 0;
        try {
            ipFile.seek(offset);
            ipFile.readFully(b3);
            ret |= (b3[0] & 0xFF);
            ret |= ((b3[1] << 8) & 0xFF00);
            ret |= ((b3[2] << 16) & 0xFF0000);
            return ret;
        } catch (IOException e) {
            return -1;
        }
    }

    /**
     * Reads 3 bytes from the current position and converts them to a long.
     * @return the long value read; -1 indicates that reading the file failed
     */
    private long readLong3() {
        long ret = 0;
        try {
            ipFile.readFully(b3);
            ret |= (b3[0] & 0xFF);
            ret |= ((b3[1] << 8) & 0xFF00);
            ret |= ((b3[2] << 16) & 0xFF0000);
            return ret;
        } catch (IOException e) {
            return -1;
        }
    }

    /**
     * Reads a string ending in 0 from the offset.
     * @param offset start offset of the string
     * @return the string read; an empty string on error
     */
    private String readString(long offset) {
        try {
            ipFile.seek(offset);
            int i;
            for(i = 0, buf[i] = ipFile.readByte(); buf[i] != 0; buf[++i] = ipFile.readByte());
            if(i != 0)
                return Utils.getString(buf, 0, i, "GBK");
        } catch (IOException e) {
            log.error(e.getMessage());
        }
        return "";
    }

The code is not complicated. getIPLocation is the main method. It checks the format of the country record and handles the string form, mode 1, and mode 2 with separate code paths. readArea is simpler, because it only has to handle the string form and redirection.

Summary

The structure of the pure IP database makes looking up an IP address easy and fast, but it makes editing the file troublesome. I think a special tool is needed to generate the QQWry.dat file, because the file format makes it hard to add IP records directly. Still, being able to look up an IP is enough to make me happy. I hope the pure IP database keeps gaining more and more records.