0x01 Previously

At present, we are working on an unstructured data analysis project, and the largest share of the unstructured data is in PDF format.

During parsing, pymupdf is used for a preliminary pass that extracts the text and images in each PDF.

[Figure: structure of the dict returned by pymupdf]

The API returns a dictionary with a structure similar to the one shown above.

The hierarchy is doc → page → block → line → span → char
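For context, a minimal sketch of how this preliminary dict can be obtained with pymupdf; the file name here is just a placeholder, and the real parsing pipeline is of course much more involved:

```python
import fitz  # pymupdf

doc = fitz.open("sample.pdf")  # placeholder path, not a real project file
page_dicts = []
for page in doc:
    # "rawdict" goes all the way down to the char level:
    # page dict -> blocks -> lines -> spans -> chars
    page_dicts.append(page.get_text("rawdict"))

print(len(page_dicts), "pages parsed")
print(page_dicts[0].keys())  # typically dict_keys(['width', 'height', 'blocks'])
```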

Our parsing logic analyzes the preliminary dict produced by pymupdf and generates an intermediate-state dict.

However, this intermediate dict carries a lot of process data and blows up badly in size: in general it is about 10 times the size of the source PDF, and in extreme cases up to 30 times.

[Figures: size comparison between source PDFs and their parsed intermediate dicts]

At present, the batch jobs run on Spark: roughly 10 million PDFs are sliced into about 300,000 tasks, so each task handles around 30 PDFs. After memory blew up in the initial tests (first 10 cores / 40 GB, then 5 cores / 100 GB per executor), I resigned myself to the extravagant configuration of 1 core / 40 GB just so the job would at least start, and the JVM still blew up.

```
24/01/14 04:31:12 ERROR SparkUncaughtExceptionHandler: [Container in shutdown] Uncaught exception in thread Thread[stdout writer for /share/dataproc/envs/py3.10/bin/python,5,main]
java.lang.OutOfMemoryError: Requested array size exceeds VM limit
	at java.lang.StringCoding.encode(StringCoding.java:350)
	at java.lang.String.getBytes(String.java:941)
	at org.apache.spark.unsafe.types.UTF8String.fromString(UTF8String.java:139)
	at org.apache.spark.sql.execution.python.EvaluatePython$$anonfun$$nestedInanonfun$makeFromJava$11$1.applyOrElse(EvaluatePython.scala:149)
	at org.apache.spark.sql.execution.python.EvaluatePython$.nullSafeConvert(EvaluatePython.scala:213)
	at org.apache.spark.sql.execution.python.EvaluatePython$.$anonfun$makeFromJava$11(EvaluatePython.scala:148)
	at org.apache.spark.sql.execution.python.EvaluatePython$$$Lambda$851/617702869.apply(Unknown Source)
	at org.apache.spark.sql.execution.python.EvaluatePython$$anonfun$$nestedInanonfun$makeFromJava$16$1.applyOrElse(EvaluatePython.scala:195)
	at org.apache.spark.sql.execution.python.EvaluatePython$.nullSafeConvert(EvaluatePython.scala:213)
	at org.apache.spark.sql.execution.python.EvaluatePython$.$anonfun$makeFromJava$16(EvaluatePython.scala:182)
	at org.apache.spark.sql.execution.python.EvaluatePython$$$Lambda$929/208188683.apply(Unknown Source)
	at org.apache.spark.sql.SparkSession.$anonfun$applySchemaToPythonRDD$2(SparkSession.scala:802)
	at org.apache.spark.sql.SparkSession$$Lambda$930/21820570.apply(Unknown Source)
	at scala.collection.Iterator$$anon$10.next(Iterator.scala:461)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:760)
	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
	at org.apache.spark.ContextAwareIterator.hasNext(ContextAwareIterator.scala:39)
	at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.next(SerDeUtil.scala:89)
	at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.next(SerDeUtil.scala:80)
	at scala.collection.Iterator.foreach(Iterator.scala:943)
	at scala.collection.Iterator.foreach$(Iterator.scala:943)
	at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.foreach(SerDeUtil.scala:80)
	at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:320)
	at org.apache.spark.api.python.PythonRunner$$anon$2.writeIteratorToStream(PythonRunner.scala:734)
	at org.apache.spark.api.python.BasePythonRunner$WriterThread.$anonfun$run$1(PythonRunner.scala:440)
	at org.apache.spark.api.python.BasePythonRunner$WriterThread$$Lambda$841/158147681.apply(Unknown Source)
	at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:2088)
	at org.apache.spark.api.python.BasePythonRunner$WriterThread.run(PythonRunner.scala:274)
```

Monitoring showed a single task peaking at about 28 GB of memory. That is within the 40 GB allocation, yet it still blew up. After asking GPT, I learned that Java stores a string's contents in a byte array, and the maximum length of that array is Integer.MAX_VALUE, i.e. 2^31 - 1, roughly the size of a 2 GB text file.

I re-tested the PDF files that triggered the java.lang.OutOfMemoryError. They parse fine locally with the Python program, producing intermediate dicts in the 900 MB to 1.5 GB range. Considering that a single task holds 30-odd PDFs, it is not so surprising that the Java string limit was blown.

0x02 Solution selection

In this situation, it is hard to pack more cores into one executor to raise task parallelism.

I discussed with the other developers whether we could simply do without the intermediate dict since it is so big, but no, we can't~

But how could a problem like this stump the clever myhloli?

Searching for JSON compression online turned up two schemes, cjson and json.hpack (jsonh), but after testing, both have limitations and are not suitable for JSON with an unusually complex logical structure.

Then I simply right-clicked the file in Explorer and compressed it, and unexpectedly the result was pretty good.

[Figures: results of compressing the intermediate JSON with a desktop archiver]

But desktop compression (360 Zip) is a bit too slow, although the compression built into Windows is quite fast.

So let's try gzip compression directly in code.

[Figure: gzip compression test run from code]

By default, gzip uses level 9; 914 MB compresses down to 107 MB in about 18 s. That seems workable.

After some discussion, we preliminarily decided on a lossless scheme: compress the JSON string directly and store it as base64 text.
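A minimal sketch of that idea, here with the built-in gzip (which codec to use is decided in the next sections); the variable names are mine, not the project's:

```python
import base64
import gzip
import json

middle = {"pages": [{"blocks": []}]}  # stand-in for a real intermediate dict

# json string -> gzip bytes -> base64 text (safe to store as a plain string column)
packed = base64.b64encode(gzip.compress(json.dumps(middle).encode("utf-8"))).decode("ascii")

# and back again
restored = json.loads(gzip.decompress(base64.b64decode(packed)).decode("utf-8"))
assert restored == middle
```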

0x03 Compression scheme selection

Having settled on compression, the scientific way to proceed is to pick an excellent compression algorithm before going any further.

An excellent compression algorithm should have two characteristics:

1. A high compression ratio, i.e. the compressed file should be as small as possible

2. Speed: at the same compression ratio, compression should be as fast as possible

The advantage of gzip is that the Python library is built in, so no third-party dependency is needed. Python also ships a built-in lzma library.

The two libraries have their own characteristics: gzip is faster but compresses less, while lzma is slower but compresses better.
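To make the trade-off concrete, a quick sketch that runs both built-in libraries over the same stand-in payload (the exact numbers will of course depend on the data):

```python
import gzip
import json
import lzma
import time

# a repetitive JSON-ish payload, standing in for the real intermediate dict
payload = json.dumps(
    [{"text": "hello", "bbox": [i, i, i + 1, i + 1]} for i in range(200_000)]
).encode("utf-8")

for name, compress in [
    ("gzip (level 9)", lambda d: gzip.compress(d, compresslevel=9)),
    ("lzma (preset 6)", lambda d: lzma.compress(d, preset=6)),
]:
    t0 = time.time()
    out = compress(payload)
    print(f"{name}: ratio {len(out) / len(payload):.3f}, {time.time() - t0:.2f} s")
```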

Besides the two built-in compression libraries, two modern and widely used compression libraries were added to the comparison after some research.

Brotli:

https://github.com/google/brotli

Brotli is a lossless compression algorithm developed by Google, originally designed for HTTP content encoding. It strikes a good balance between compression efficiency and speed, especially for text and HTML content, and offers multiple compression levels so users can trade off speed against compression effect. Thanks to its broad support, Brotli can be used in a wide range of applications, including web servers and browsers.

Zstandard:

https://github.com/facebook/zstd

Zstandard (zstd) is a lossless compression algorithm developed by Facebook, designed for both high compression ratio and high speed. It offers a wide range of compression levels, letting users trade off speed against compression effect. Compared with Brotli, zstd is usually faster at both compression and decompression, especially on large data sets.
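Both expose a simple one-shot API in Python (pip packages brotli and zstandard); a minimal sketch with a stand-in payload:

```python
import brotli
import zstandard

data = b'{"blocks": []}' * 100_000  # stand-in payload

# Brotli: quality ranges from 0 (fastest) to 11 (smallest output)
b_out = brotli.compress(data, quality=6)
assert brotli.decompress(b_out) == data

# Zstandard: levels range roughly from -5 (fastest) to 22 (smallest output)
z_out = zstandard.ZstdCompressor(level=6).compress(data)
assert zstandard.ZstdDecompressor().decompress(z_out) == data

print(len(data), len(b_out), len(z_out))
```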

0x04 Code writing

AI is developing so fast these days that simple test code is no match for it. As an engineer working in the AI field, writing the code by hand yourself would be hopelessly uncool, so I left it all to the AI.

```python
import gzip
import time
import brotli
import zstandard
import lzma
import json
import os
import base64


# Define the generic compression function
def compress_with_method(input_str, output_file, compression_method, **kwargs):
    start = time.time()
    compressed = compression_method(input_str.encode(), **kwargs)
    with open(output_file, 'wb') as f_out:
        f_out.write(base64.b64encode(compressed))
    end = time.time()
    compressed_size = os.path.getsize(output_file)
    original_size = len(input_str.encode())
    compression_ratio = compressed_size / original_size
    return end - start, compression_ratio


def compress_with_gzip(input_str, output_file, compresslevel):
    return compress_with_method(input_str, output_file, gzip.compress, compresslevel=compresslevel)


def compress_with_brotli(input_str, output_file, quality):
    return compress_with_method(input_str, output_file, brotli.compress, quality=quality)


def compress_with_zstandard(input_str, output_file, level):
    cctx = zstandard.ZstdCompressor(level=level)
    return compress_with_method(input_str, output_file, cctx.compress)


def compress_with_lzma(input_str, output_file, preset):
    return compress_with_method(input_str, output_file, lzma.compress, preset=preset)


# Define the test function
def test_compression(compression_name, compression_func, param_range, file, json_str, results):
    for param in param_range:
        output_file = f'test_{compression_name}_{param}_{file}.compressed'
        elapsed_time, compression_ratio = compression_func(json_str, output_file, param)
        print(f"{compression_name} level {param} for {file}:")
        print(f"\tCompressed size: {os.path.getsize(output_file) / 1024 / 1024} MB")
        print(f"\tCompression time: {elapsed_time} seconds")
        print(f"\tCompression ratio: {compression_ratio}")
        results.append([file, f'{compression_name}_{param}', elapsed_time, os.path.getsize(output_file), compression_ratio])


# Define the file list
files = ["simple1.json", "simple2.json"]
results = []

# Compress each file
for file in files:
    # Convert the test data to a JSON string
    with open(file, 'r', encoding='utf-8') as f_in:
        json_str = json.dumps(f_in.read())
    # Calculate the raw data size
    original_size = len(json_str.encode('utf-8'))
    # Test brotli quality (0~11)
    test_compression('brotli', compress_with_brotli, range(0, 12), file, json_str, results)
    # Test zstandard level (-5~22)
    test_compression('zstandard', compress_with_zstandard, range(-5, 23), file, json_str, results)
    # Test gzip level (0~9)
    test_compression('gzip', compress_with_gzip, range(0, 10), file, json_str, results)
    # Test lzma preset (0~9)
    test_compression('lzma', compress_with_lzma, range(0, 10), file, json_str, results)

# Import the csv module
import csv

# Open 'compression_results.csv' for writing
with open('compression_results.csv', 'w', encoding='utf-8', newline='') as f_out:
    # Create a csv writer object
    writer = csv.writer(f_out)
    # Write the header
    writer.writerow(['file', 'algorithm', 'time', 'size', 'ratio'])
    # Write all results
    writer.writerows(results)
```

Writing all the code took less than 10 minutes, and it ran straight away.

0x05 Statistical analysis

The test uses two samples: simple1.json (914 MB) and simple2.json (1.49 GB).

Test machine: Windows 11 / i7-11700 @ 4.6 GHz / Python 3.11 / Brotli 1.1.0 / zstandard 0.22.0

After running everything, I noticed something odd: the output gzip produced at level 0 is larger than the original content. In hindsight that is expected, since level 0 applies no compression at all, so the gzip framing plus the base64 encoding only add overhead.

[Figure: gzip level 0 output is larger than the original file]

For the statistics, I dropped gzip level 0 and plotted the rest first.

[Figure: compression time vs. ratio for all algorithms and levels, gzip level 0 excluded]

The chart shows that as the compression level rises, the compression ratio approaches its limit while the compression time grows roughly exponentially.

It also shows that even at its highest level, gzip's compression ratio is worse than any level of the other three algorithms, so gzip can be ruled out straight away.
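For reference, the charts in this section are simply compression_results.csv plotted as compression time versus ratio; a rough sketch of how to reproduce them, assuming pandas and matplotlib are installed:

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("compression_results.csv")
df = df[df["algorithm"] != "gzip_0"]  # level 0 only inflates the data, so drop it

fig, ax = plt.subplots()
for fname, group in df.groupby("file"):
    ax.scatter(group["ratio"], group["time"], label=fname, s=12)
ax.set_xlabel("compression ratio (compressed size / original size)")
ax.set_ylabel("compression time (s)")
ax.legend()
plt.show()

# the later charts just add more filters, e.g.:
# df = df[~df["algorithm"].str.startswith("gzip")]  # exclude gzip entirely
# df = df[df["time"] <= 100]                        # drop schemes slower than 100 s
```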

Next, also eliminate the levels whose runtime is unacceptably long and plot again.

[Figure: compression time vs. ratio with gzip and the slowest levels excluded]

This is more intuitive: the limiting compression ratio for the target JSON files is around 0.04.

Close to that limit, brotli_9 stands out as good value for the time spent.

In this chart the time axis still reaches fairly high values, so let's filter out some of the time-consuming schemes.

[Figure: compression time vs. ratio for schemes faster than 100 seconds]

Filtering out every scheme that takes more than 100 seconds leaves far fewer candidates; it feels like we have reached the finals.

Several odd spikes appear to come from lzma, so remove lzma first, then also drop the two higher levels zstandard_13 and zstandard_14.

[Figure: compression time vs. ratio for the remaining finalist schemes]

The remaining results are much clearer:

zstandard_0/zstandard_3/zstandard_4 reach a compression ratio of about 0.06 at very low cost

zstandard_5/zstandard_6/brotli_4 reach a ratio of around 0.05 at low cost

For an even higher compression ratio, brotli_6/brotli_7 offer the better value

Because the test samples are extreme 1 GB+ JSON files, two other points matter in actual use: first, most of the JSON files are nowhere near this large, often only a fraction of the size; second, parsing a single PDF takes far longer than compressing its output, so compression time is a sensitive parameter, but not that sensitive.

After weighing everything up, I initially leaned towards brotli_7 but finally settled on brotli_6 as the compression scheme for the JSON data.
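Put together, the chosen scheme looks roughly like this sketch (the helper names are mine and the production code will differ, but the pipeline is JSON -> brotli -> base64 text):

```python
import base64
import json

import brotli


def compress_middle_json(middle: dict) -> str:
    """Serialize the intermediate dict, compress it with brotli (quality per the choice above), return base64 text."""
    raw = json.dumps(middle, ensure_ascii=False).encode("utf-8")
    return base64.b64encode(brotli.compress(raw, quality=6)).decode("ascii")


def decompress_middle_json(packed: str) -> dict:
    """Reverse of compress_middle_json."""
    raw = brotli.decompress(base64.b64decode(packed))
    return json.loads(raw.decode("utf-8"))


middle = {"pages": [{"blocks": []}]}  # stand-in for a real intermediate dict
assert decompress_middle_json(compress_middle_json(middle)) == middle
```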

love loli, love live!