| I receive PDF files containing scanned images and need to classify the documents within each file. I have successfully classified the first page of a subset of the documents of interest by training a classifier on the OCR-generated text of those pages. However, this approach is limited: a single PDF can contain several documents, and I need to identify and extract each one. What methods can I use to identify the boundaries (start and end pages) of each document within the file? |
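
To make the desired output concrete, here is a minimal sketch of how per-page predictions could be turned into document boundaries. It assumes a first-page classifier has been run on every page of the PDF and yields one boolean per page; the function name `split_into_documents` and the `flags` list are hypothetical and only illustrate the target format, not an existing implementation.

```python
from typing import List, Tuple


def split_into_documents(is_first_page: List[bool]) -> List[Tuple[int, int]]:
    """Convert per-page "starts a new document" flags into page ranges.

    Returns (start, end) pairs, 1-indexed and inclusive.
    """
    if not is_first_page:
        return []
    # A document starts wherever the flag is set. Page 1 always starts a
    # document, even if the classifier missed it (an assumption of this sketch).
    starts = [
        page
        for page, flag in enumerate(is_first_page, start=1)
        if flag or page == 1
    ]
    # Each document ends on the page before the next one starts;
    # the last document runs to the final page of the file.
    ends = [start - 1 for start in starts[1:]] + [len(is_first_page)]
    return list(zip(starts, ends))


# Hypothetical predictions for a six-page PDF.
flags = [True, False, False, True, True, False]
print(split_into_documents(flags))  # [(1, 3), (4, 4), (5, 6)]
```

With this framing, the boundary problem reduces to deciding, for each page, whether it begins a new document, so the quality of the ranges depends entirely on how reliable those per-page flags are.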