Information Extraction from Resume Documents in PDF Format

Jiaze Chen; Liangcai Gao; Zhi Tang

doi:10.2352/ISSN.2470-1173.2016.17.DRR-064

More and more people publish their resumes on the Internet, and PDF is a widely adopted format for resume documents, which contain a great deal of valuable information for recruitment, personal profile mining, etc. However, only a few studies have been done in this direction. This paper therefore focuses on the task of information extraction from resume documents in PDF format and proposes a hierarchical extraction method. First, the method segments a page into blocks according to heuristic rules. Then each block is classified by a Conditional Random Field (CRF) model. To take advantage of the structure and layout information of PDF documents, the classification model employs two kinds of features: content-based features and layout-based features, which are parsed from PDF documents. The experimental results show the effectiveness of the proposed method. In particular, the layout-based features prove to be very useful for the task, improving the average F1-score by more than 20 percent in the experiments.