More and more people publish their resumes on the Internet, and PDF is a widely adopted format for resume documents, which contain valuable information for recruitment, personal profile mining, and similar applications. However, only a few studies have been done in this direction. This paper therefore focuses on the task of information extraction from resume documents in PDF format and proposes a hierarchical extraction method. First, the method segments a page into blocks according to heuristic rules. Then each block is classified by a Conditional Random Field (CRF) model. To take advantage of the structure and layout information of PDF documents, the classification model employs two kinds of features: content-based features and layout-based features, which are parsed from the PDF documents. The experimental results demonstrate the effectiveness of the proposed method. In particular, the layout-based features prove to be very useful for the task, improving the average F1-score by more than 20 percent in the experiments.
Jiaze Chen, Liangcai Gao, Zhi Tang, "Information Extraction from Resume Documents in PDF Format" in Proc. IS&T Int’l. Symp. on Electronic Imaging: Document Recognition and Retrieval XXIII, 2016, https://doi.org/10.2352/ISSN.2470-1173.2016.17.DRR-064
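
The two stages summarized in the abstract, heuristic block segmentation followed by per-block feature extraction for a sequence classifier, can be illustrated with a minimal Python sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the `Line` record, the vertical-gap rule and its `gap_ratio` threshold, and the particular content-based and layout-based features. The CRF itself is omitted; the resulting feature dictionaries are only the kind of per-block input a linear-chain CRF toolkit might consume.

```python
import re
from dataclasses import dataclass


@dataclass
class Line:
    """One text line parsed from a PDF page (hypothetical record, not the paper's)."""
    text: str
    x: float          # left edge, in points
    y: float          # top edge, in points, increasing down the page
    height: float     # line height, in points
    font_size: float
    bold: bool = False


def segment_blocks(lines, gap_ratio=1.5):
    """Group lines into blocks with one assumed heuristic: start a new block
    when the vertical gap to the previous line exceeds gap_ratio times the
    previous line's height."""
    blocks = []
    for line in sorted(lines, key=lambda l: l.y):
        if blocks:
            prev = blocks[-1][-1]
            gap = line.y - (prev.y + prev.height)
            if gap <= gap_ratio * prev.height:
                blocks[-1].append(line)
                continue
        blocks.append([line])
    return blocks


def block_features(block, page_width):
    """Build one feature dict per block, mixing content and layout cues."""
    text = " ".join(l.text for l in block)
    tokens = text.split()
    return {
        # Content-based features (illustrative choices).
        "has_email": bool(re.search(r"\S+@\S+\.\S+", text)),
        "has_year": bool(re.search(r"\b(19|20)\d{2}\b", text)),
        "n_tokens": len(tokens),
        "first_word": tokens[0].lower() if tokens else "",
        # Layout-based features (illustrative choices).
        "max_font_size": max(l.font_size for l in block),
        "any_bold": any(l.bold for l in block),
        "rel_x": round(block[0].x / page_width, 2),
        "n_lines": len(block),
    }


# Toy page: a name/contact block followed by an education block.
demo_lines = [
    Line("Jane Doe", 72, 50, 20, 18, bold=True),
    Line("jane@example.com", 72, 72, 10, 10),
    Line("Education", 72, 120, 12, 12, bold=True),
    Line("Example University, 2010-2014", 72, 134, 10, 10),
]
demo_blocks = segment_blocks(demo_lines)
demo_features = [block_features(b, page_width=612.0) for b in demo_blocks]
```

In this sketch the feature dictionaries are kept in reading order, so the whole page forms one sequence. That ordering is what would let a sequence model such as a CRF use the label of a neighboring block when classifying the current one.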