PDF Importer Problems
Some PDF documents (for example, certain OCR-ed documents, certain LaTeX-generated documents, and improperly encoded documents) may look perfectly formatted onscreen even though their text cannot be extracted correctly. In some cases the extracted data is completely garbled; in others the extracted text contains extraneous spaces inside words or is missing spaces between words. Either way, there are two consequences:
- these documents can't be found when searching for a string that actually occurs in their content
- if you have a large number of such documents (or a few very large ones), indexing can become very slow
See "avoid large non-linguistic textual data" to identify these files and work around the problem.
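As an illustration of the two symptoms described above (extraneous spaces inside words, missing spaces between words), here is a minimal Python sketch that flags suspicious extracted text. It is not how FoxTrot detects such files. The function names, the 25-character cutoff and both thresholds are assumptions chosen for the example, and they would need tuning for your own documents and languages.

```python
# Hypothetical heuristic, not part of FoxTrot: flag text that looks badly extracted.
LONG_TOKEN_LENGTH = 25  # assumed cutoff; tokens longer than this suggest missing spaces


def extraction_stats(text):
    """Return simple ratios describing the whitespace-separated tokens in text."""
    tokens = text.split()
    if not tokens:
        return {"tokens": 0, "single_letter_ratio": 0.0, "long_token_ratio": 0.0}
    # Many isolated letters suggest extraneous spaces inside words.
    single = sum(1 for t in tokens if len(t) == 1 and t.isalpha())
    # Very long tokens suggest missing spaces between words.
    long_tokens = sum(1 for t in tokens if len(t) > LONG_TOKEN_LENGTH)
    return {
        "tokens": len(tokens),
        "single_letter_ratio": single / len(tokens),
        "long_token_ratio": long_tokens / len(tokens),
    }


def looks_suspicious(text, max_single=0.3, max_long=0.05):
    """Return True when either ratio exceeds its (assumed) threshold."""
    stats = extraction_stats(text)
    return (stats["single_letter_ratio"] > max_single
            or stats["long_token_ratio"] > max_long)
```

You could run a check like this over text you have already extracted from a sample of PDF files, then inspect the flagged files by hand. A heuristic of this kind will produce false positives, for instance on documents that legitimately contain long identifiers or URLs.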
By default, FoxTrot uses Spotlight's metadata importer to extract text from PDF documents, but an alternative method is available: Xpdf. For some documents Xpdf gives better results than Spotlight's importer; for others, the opposite is true. To change which method is used to index PDF documents:
- quit FoxTrot
- relaunch it while holding down the Command and Option keys
- enable the "manage third-party metadata importers" checkbox
- check "prefer Xpdf for PDF documents", below the importer list
You will need to rebuild your index for this change to take effect. Before rebuilding your main index, you may want to create a small test index and compare both methods on a few documents.
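If you have the text produced by each method for the same document available as plain strings (how you obtain them is outside this sketch), a comparison along the following lines can show which words only one method recovered. This is a hypothetical helper, not a FoxTrot feature, and the function names are invented for the example.

```python
import re


def words(text):
    """Return the set of lowercase words (two or more letters) found in text."""
    return set(re.findall(r"[^\W\d_]{2,}", text.lower()))


def compare_extractions(text_a, text_b):
    """Compare the vocabularies of two extractions of the same document."""
    words_a, words_b = words(text_a), words(text_b)
    return {
        "common": len(words_a & words_b),      # words both methods produced
        "only_a": sorted(words_a - words_b),   # words only method A produced
        "only_b": sorted(words_b - words_a),   # words only method B produced
    }
```

For example, comparing "Indexing can become very slow" with "Index ing can be come very slow" reports three common words, with "become" and "indexing" appearing only in the first text. Words that show up on only one side are the ones a search would miss in an index built with the other method, so a long one-sided list of real words is a reasonable hint about which method suits that document better.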