Volume: 28 | Article ID: art00013
Information Extraction from Resume Documents in PDF Format
DOI: 10.2352/ISSN.2470-1173.2016.17.DRR-064 | Published Online: February 2016

More and more people now release their resumes through the Internet, and PDF is a widely adopted format for resume documents, which contain a great deal of valuable information for recruitment, personal profile mining, etc. However, only a few studies have been done in this direction. Therefore, this paper focuses on the task of information extraction from resume documents in PDF format and proposes a hierarchical extraction method. First, the method segments a page into blocks according to heuristic rules. Then each block is classified by a Conditional Random Field (CRF) model. To take advantage of the structure and layout information of PDF documents, the classification model employs two kinds of features, both parsed from the PDF documents: content-based features and layout-based features. The experimental results show the effectiveness of the proposed method. In particular, the layout-based features prove to be very useful for the task, improving the average F1-score by more than 20 percent in the experiments.
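
The two-stage pipeline described above (heuristic block segmentation, then per-block features for a sequence classifier) can be illustrated with a minimal Python sketch. The abstract does not specify the actual heuristic rules or feature set, so everything here is an assumption made for illustration: the `Line` structure, the vertical-gap rule in `segment_blocks`, the `gap_factor` threshold, and the particular content-based and layout-based features in `block_features` are all hypothetical. The CRF itself is not implemented; the feature dictionaries are only the kind of per-block input a CRF toolkit could consume.

```python
import re
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Line:
    """One text line as a PDF parser might report it (hypothetical structure)."""
    text: str
    top: float        # distance from the top of the page, in points
    left: float       # left edge of the line, in points
    font_size: float  # font size of the line, in points


def segment_blocks(lines: List[Line], gap_factor: float = 1.0) -> List[List[Line]]:
    """Group lines into blocks using an assumed vertical-gap heuristic.

    A new block starts when the white space between the bottom of the
    previous line and the top of the current line exceeds
    gap_factor * (previous line's font size).
    """
    blocks: List[List[Line]] = []
    for line in sorted(lines, key=lambda ln: ln.top):
        if blocks:
            prev = blocks[-1][-1]
            gap = line.top - (prev.top + prev.font_size)
            if gap <= gap_factor * prev.font_size:
                blocks[-1].append(line)
                continue
        blocks.append([line])
    return blocks


def block_features(block: List[Line], page_width: float = 612.0) -> Dict[str, object]:
    """Build illustrative content-based and layout-based features for one block."""
    text = " ".join(ln.text for ln in block)
    return {
        # Content-based features (assumed examples).
        "has_email": bool(re.search(r"[\w.+-]+@[\w-]+\.[\w.]+", text)),
        "has_year": bool(re.search(r"\b(19|20)\d{2}\b", text)),
        "n_tokens": len(text.split()),
        # Layout-based features (assumed examples).
        "max_font_size": max(ln.font_size for ln in block),
        "rel_left": round(min(ln.left for ln in block) / page_width, 2),
        "n_lines": len(block),
    }


# Made-up sample page; the values are not taken from the paper.
lines = [
    Line("Jane Doe", top=50, left=72, font_size=18),
    Line("jane.doe@example.com", top=72, left=72, font_size=10),
    Line("Education", top=120, left=72, font_size=14),
    Line("2010-2014 B.S. in Computer Science", top=140, left=90, font_size=10),
    Line("Example University", top=152, left=90, font_size=10),
]

blocks = segment_blocks(lines)
# One feature dict per block, in reading order: the sequence a CRF would label.
features = [block_features(b) for b in blocks]

for feats in features:
    print(feats)
```

On this sample page the gap rule yields a header block and an education block. Keeping the features in reading order matters because a CRF labels the blocks of a page jointly as a sequence rather than one at a time.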

  Cite this article 

Jiaze Chen, Liangcai Gao, Zhi Tang, "Information Extraction from Resume Documents in PDF Format," in Proc. IS&T Int’l. Symp. on Electronic Imaging: Document Recognition and Retrieval XXIII, 2016.

  Copyright statement 
Copyright © Society for Imaging Science and Technology 2016
Electronic Imaging
Society for Imaging Science and Technology