Compression and selection of large dictionaries during PDF parsing

0x01 Background

We are currently working on an unstructured data analysis project, and PDF files make up the largest share of that unstructured data. During parsing, we use PyMuPDF for a first pass over the text and images in each PDF. The API returns a dictionary similar to the structure in the figure above, nested as doc → page → block → line → span → char. Our parsing logic code will be based on the pym