Text Extraction
phpdftk extracts text from PDF content streams by interpreting text operators and resolving font encodings to Unicode.
How it works
Section titled “How it works”The text extractor processes content stream operators (BT, ET, Tf, Td, Tj, TJ, etc.) and converts character codes to Unicode using:
- ToUnicode CMap (if present on the font) — most reliable
- Encoding + Differences (custom encoding vectors)
- WinAnsi + Adobe Glyph List fallback for standard fonts
Text positioning operators are used to infer spaces and line breaks.
Form XObjects
Section titled “Form XObjects”The extractor handles the Do operator to recurse into Form XObjects — text inside stamped content, form field appearances, and embedded XObjects is extracted automatically. Nested Form XObjects are supported up to 10 levels deep, with font state properly saved/restored across XObject boundaries.
Via PdfReader
Section titled “Via PdfReader”use Phpdftk\Pdf\Reader\PdfReader;
$pdf = PdfReader::fromFile('document.pdf');
// Single page (0-based index)$text = $pdf->extractText(0);
// All pages$allText = $pdf->extractAllText("\n\n");Via TextExtractor (Toolkit)
Section titled “Via TextExtractor (Toolkit)”The toolkit’s TextExtractor adds search and per-page access with 1-based page numbers:
use Phpdftk\Pdf\Toolkit\TextExtractor;
$extractor = TextExtractor::open('document.pdf');
// Single page (1-based)$text = $extractor->page(1);
// All pages$allText = $extractor->allPages();
// Per-page map$pages = $extractor->perPage();// => [1 => "page 1 text", 2 => "page 2 text", ...]Searching
Section titled “Searching”// Literal string search$results = $extractor->search('contract');echo $results->count() . " matches\n";
foreach ($results as $match) { echo "Page {$match->pageNumber}, offset {$match->offset}: {$match->text}\n";}
// First match only$first = $extractor->search('signature')->first();
// Regex search$results = $extractor->searchPattern('/\$[\d,]+\.\d{2}/');
// Quick contains checkif ($extractor->contains('confidential')) { echo "Document contains sensitive content\n";}Search results
Section titled “Search results”TextSearchResults implements IteratorAggregate and Countable:
$results = $extractor->search('term');
$results->count(); // number of matches$results->all(); // list<TextMatch>$results->first(); // ?TextMatch (null if no matches)
foreach ($results as $match) { $match->pageNumber; // 1-based $match->text; // matched text $match->offset; // character offset within page text}