PDF Importer Problems
Incorrect PDF parsing
Some PDF documents (e.g. some OCR-ed documents, some LaTeX-generated documents, as well as some improperly encoded documents) may appear perfectly formatted onscreen, but text can't be extracted correctly.
The reason is that PDF parsing is not an exact science, as printed text may have to be reconstructed from individual glyphs drawn on a page, not necessarily in the logical reading order.
In some cases, completely garbled data is extracted; in others, the extracted text contains extraneous spaces inside words or is missing spaces between words. Either way, there are two consequences:
- these documents can't be found when searching for a string that actually occurs in their content
- if you have a large number of such documents (or a few very large ones), indexing can become very slow
See “Avoid large non-linguistic textual data” to identify these files and work around the problem.
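As a rough illustration of the symptoms described above, the following sketch summarizes the word lengths in a piece of extracted text. It is not part of FoxTrot: `token_stats` is a hypothetical helper, and it assumes you have saved the extracted text as a plain-text file. A high share of one-character words hints at extraneous spaces inside words, while an unusually high average word length hints at missing spaces between words.

```shell
# Hypothetical helper (not part of FoxTrot): reads plain text on stdin and
# prints three numbers: <word count> <share of 1-character words> <average word length>
token_stats() {
  awk '
    {
      # Walk over every whitespace-separated word on the line.
      for (i = 1; i <= NF; i++) {
        n++
        len += length($i)
        if (length($i) == 1) single++
      }
    }
    END {
      if (n == 0) { print "0 0.00 0.00"; exit }
      printf "%d %.2f %.2f\n", n, single / n, len / n
    }'
}
```

For example, `printf 'T h i s is broken\n' | token_stats` prints `6 0.67 2.00`. With a saved file, run `token_stats < extracted.txt` (a placeholder file name). What counts as suspicious depends on the language of the document, so treat the numbers as a hint only.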
Partial PDF parsing
Additionally, Apple’s Spotlight importer (which is used by default by FoxTrot) stops after extracting about 5,000,000 characters in a document. Thus, PDF documents containing thousands of pages may be only partially indexed.
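To get a rough idea of whether a document is likely to hit this limit, you can count the characters in its text. The sketch below is only an estimate: `exceeds_limit` is a hypothetical helper, the `LIMIT` value simply restates the approximate figure given above, and the text you feed it (for instance from the `pdftotext` tool that ships with the Xpdf command-line tools, if you have it installed) will not match what Spotlight's importer extracts character for character.

```shell
# Assumed threshold, taken from the "about 5,000,000 characters" figure above.
LIMIT=5000000

# Hypothetical helper: succeeds (exit status 0) if the text on stdin
# contains more than LIMIT characters, as counted by `wc -m`.
exceeds_limit() {
  # `wc -m` may pad its output with spaces, so strip them before comparing.
  count=$(wc -m | tr -d ' ')
  [ "$count" -gt "$LIMIT" ]
}
```

Assuming `pdftotext` is available, `pdftotext big.pdf - | exceeds_limit && echo "may be only partially indexed"` flags a document worth checking (`big.pdf` is a placeholder name).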
Checking how PDF documents have been parsed
To see the exact text that has been extracted and indexed for a given document, option-click it in the result list (or use the “display type” popup menu in the toolbar, in FoxTrot 7).
Changing FoxTrot’s preferred PDF parser
By default, FoxTrot uses Spotlight's metadata importer to extract text from PDF documents, but an alternate parser is available: Xpdf. Xpdf gives better results than Spotlight's importer for some documents and worse results for others, and it is noticeably slower. To change which parser is used to index PDF documents:
- quit FoxTrot
- relaunch it while pressing the command and option keys
- enable the “manage third-party metadata importers” checkbox
- check “prefer Xpdf for PDF documents”, below the importer list
You will need to rebuild your index for this change to take effect. You may create a small test index and compare both parsers on a few documents, before rebuilding your main index.
Changing the PDF parser for specific files
With FoxTrot 7.5 and later, you can choose between Xpdf and Spotlight’s importer for each individual file. In the search results list, select the files that need to be re-parsed, then select “choose PDF parser” in the contextual menu. The choice is stored in an extended attribute on each file, so you need write permission to these files. To change the parser for a whole set of PDF files, search with the “all items of type” criterion, combined with “in: other indexed folder”, a date filter, or an advanced filter, then select all the found files.
You will then need to either update your index, or select “reindex selection now” in the same contextual menu.
You can also set the PDF parser to use for specific files using one of the following Terminal.app commands (the first two select Xpdf or Spotlight’s importer; the third removes the per-file attribute):
xattr -w com.ctmdev.foxtrot.extractor xpdf [file ...]
xattr -w com.ctmdev.foxtrot.extractor spotlight [file ...]
xattr -d com.ctmdev.foxtrot.extractor [file ...]
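To apply one of these commands to every PDF file under a folder, the `xattr` call can be combined with `find`. This is a minimal sketch: `set_pdf_parser` is a hypothetical helper name, and it assumes that matching on the `.pdf` file extension is good enough to identify your PDF documents.

```shell
# Hypothetical helper: set FoxTrot's per-file parser attribute on every
# PDF found under a folder (matched by file extension, case-insensitive).
# Usage: set_pdf_parser xpdf|spotlight <folder>
set_pdf_parser() {
  parser=$1
  folder=$2
  # Only accept the two values shown in the commands above.
  case "$parser" in
    xpdf|spotlight) ;;
    *) echo "usage: set_pdf_parser xpdf|spotlight <folder>" >&2; return 1 ;;
  esac
  # Hand the matching files to xattr in batches.
  find "$folder" -type f -iname '*.pdf' \
    -exec xattr -w com.ctmdev.foxtrot.extractor "$parser" {} +
}
```

For example, `set_pdf_parser xpdf ~/Documents/Scans` (a placeholder path) tags every PDF in that folder and its subfolders. You can check a single file with `xattr -p com.ctmdev.foxtrot.extractor file.pdf`. As with the contextual menu, the index must then be updated for the change to take effect.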