Identifying document start and end pages
1 point by virafb 2081 days ago | discuss
I receive PDF files that contain scanned images, and I need to classify the documents within these files. So far I have been able to classify the first page of a subset of the documents of interest, using a classifier trained on the OCR-generated text of those pages.

However, the current implementation is limited: it only recognizes first pages, whereas I need to identify and extract each complete document within the PDF file.

What methods can I use to identify the boundaries of a document (start and end pages) within the file?
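
To make the goal concrete, here is a minimal sketch of the output I am after. It assumes (hypothetically) that a per-page classifier already gives one boolean per page saying whether that page starts a new document; the function name `split_into_documents` and the `is_first_page` list are made up for illustration and are not part of my current code. It only shows the simplest possible rule, where a document runs until the page before the next detected start.

```python
from typing import List, Tuple


def split_into_documents(is_first_page: List[bool]) -> List[Tuple[int, int]]:
    """Turn per-page "starts a new document" flags into page ranges.

    Returns a list of (start, end) page indices, 0-based and inclusive.
    Pages that come before the first detected start are left unassigned.
    """
    boundaries: List[Tuple[int, int]] = []
    start = None  # index of the first page of the document currently open

    for page, starts_here in enumerate(is_first_page):
        if starts_here:
            # A new document begins, so the previous one ended on the page before.
            if start is not None:
                boundaries.append((start, page - 1))
            start = page

    # Close the last open document at the final page of the file.
    if start is not None:
        boundaries.append((start, len(is_first_page) - 1))

    return boundaries


# Hypothetical flags for a six-page PDF containing three documents.
flags = [True, False, False, True, True, False]
print(split_into_documents(flags))  # [(0, 2), (3, 3), (4, 5)]
```

This rule depends entirely on the first-page classifier: a missed first page merges two documents, and a false positive splits one, which is why I am looking for better ways to find the boundaries.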



