Text Extractor
TextExtractor wraps PdfReader’s text extraction with a friendly, toolkit-level API. All page numbers are 1-based.
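Because TextExtractor counts pages from 1 while the lower-level PdfReader described later on this page counts from 0, code that mixes the two has to convert between them. A minimal sketch of that conversion follows; `toReaderIndex` is a hypothetical helper written for illustration and is not part of the library.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper (not part of the library): converts a 1-based
 * toolkit page number into a 0-based reader index.
 */
function toReaderIndex(int $pageNumber, int $pageCount): int
{
    if ($pageNumber < 1 || $pageNumber > $pageCount) {
        throw new OutOfRangeException(
            "Page {$pageNumber} is outside 1..{$pageCount}"
        );
    }
    return $pageNumber - 1;
}

echo toReaderIndex(1, 12), "\n";  // first page maps to index 0
echo toReaderIndex(12, 12), "\n"; // last page maps to index 11
```

Validating the range before subtracting keeps an accidental page 0 from silently turning into index -1.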
End-to-end example
use Phpdftk\Pdf\Toolkit\TextExtractor;
$extractor = TextExtractor::open('contract.pdf');
echo "Pages: {$extractor->getPageCount()}\n\n";
// Page-by-page extraction (1-based)
$firstPage = $extractor->page(1);
// Search for a literal phrase
$results = $extractor->search('indemnify');
foreach ($results as $match) {
    echo "Page {$match->pageNumber}: {$match->text}\n";
}
// Regex search: match dollar amounts
$amounts = $extractor->searchPattern('/\$[\d,]+\.\d{2}/');
foreach ($amounts as $match) {
    echo "Found amount: {$match->text}\n";
}
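To see what the dollar-amount pattern from the example above accepts and rejects, it can be tried against a plain string with PHP's own `preg_match_all`. This sketch assumes `searchPattern` takes PCRE-style patterns, as the delimiters suggest; the sample text is invented and the library is not involved.

```php
<?php
declare(strict_types=1);

// Made-up text standing in for one page of extracted text.
$pageText = 'Deposit: $1,250.00. Late fee: $35.5. Balance due: $99.99';

// Same pattern as in the end-to-end example.
$pattern = '/\$[\d,]+\.\d{2}/';

preg_match_all($pattern, $pageText, $found);
$amounts = $found[0]; // full-pattern matches, in order of appearance

foreach ($amounts as $amount) {
    echo "Found amount: {$amount}\n";
}
// Prints "$1,250.00" and "$99.99"; "$35.5" is skipped because
// the pattern requires exactly two digits after the decimal point.
```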
Opening a PDF
use Phpdftk\Pdf\Toolkit\TextExtractor;
// From file
$extractor = TextExtractor::open('report.pdf');
// From string
$extractor = TextExtractor::openString($pdfBytes);
// Encrypted PDF
$extractor = TextExtractor::open('secured.pdf', password: 'secret');
Extracting text
// Single page (1-based)
$text = $extractor->page(1);
// All pages with separator
$text = $extractor->allPages("\n---\n");
// Per-page array
$pages = $extractor->perPage();
// => [1 => "page 1 text", 2 => "page 2 text"]
Searching
Literal string
$results = $extractor->search('indemnification');
echo $results->count() . " matches\n";
foreach ($results as $match) {
    echo "Page {$match->pageNumber}: {$match->text}\n";
}
Regex pattern
$results = $extractor->searchPattern('/\d{3}-\d{2}-\d{4}/'); // SSN pattern
Quick contains check
if ($extractor->contains('CONFIDENTIAL')) {
    // handle sensitive document
}
Search results API
$results = $extractor->search('term');
$results->count(); // int
$results->all(); // list<TextMatch>
$results->first(); // ?TextMatch
// TextMatch properties
$match->pageNumber; // int (1-based)
$match->text; // string
$match->offset; // int (char offset in page text)
TextSearchResults implements IteratorAggregate and Countable, so it works with foreach and count().
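For readers unfamiliar with those two interfaces, the sketch below shows how any class gains `foreach` and `count()` support by implementing them. `SketchResults` is a hypothetical stand-in written for illustration; it is not the library's TextSearchResults and says nothing about how that class is implemented internally.

```php
<?php
declare(strict_types=1);

/** Hypothetical stand-in for a result collection; not the library's class. */
final class SketchResults implements IteratorAggregate, Countable
{
    /** @param list<string> $matches */
    public function __construct(private array $matches) {}

    // IteratorAggregate: lets foreach walk the collection.
    public function getIterator(): ArrayIterator
    {
        return new ArrayIterator($this->matches);
    }

    // Countable: lets count($results) work.
    public function count(): int
    {
        return count($this->matches);
    }
}

$results = new SketchResults(['indemnify', 'indemnification']);
echo count($results), " matches\n";
foreach ($results as $text) {
    echo $text, "\n";
}
```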
Document info
$extractor->getPageCount(); // int
Escape hatch
$reader = $extractor->getReader(); // PdfReader
Lower-level access via PdfReader
TextExtractor wraps PdfReader. If you only need text extraction without the search/per-page conveniences, you can call the reader directly. Note that PdfReader uses 0-based page indexes:
use Phpdftk\Pdf\Reader\PdfReader;
$pdf = PdfReader::fromFile('document.pdf');
// Single page (0-based index)
$text = $pdf->extractText(0);
// All pages
$allText = $pdf->extractAllText("\n\n");
How it works
The text extractor processes content stream operators (BT, ET, Tf, Td, Tj, TJ, etc.) and converts character codes to Unicode using:
- ToUnicode CMap (if present on the font) — most reliable
- Encoding + Differences (custom encoding vectors)
- WinAnsi + Adobe Glyph List fallback for standard fonts
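The three sources above can be pictured as a lookup chain. The sketch below assumes they are consulted in the order listed and that the first hit wins; the source does not spell out the exact precedence. `decodeCharCode` and the toy tables are hypothetical and far simpler than real CMaps or encoding vectors.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical decoder: tries each mapping in turn and returns the first hit.
 * Each map is a plain array of character code => UTF-8 string.
 */
function decodeCharCode(int $code, array $toUnicode, array $differences, array $fallback): string
{
    foreach ([$toUnicode, $differences, $fallback] as $map) {
        if (isset($map[$code])) {
            return $map[$code];
        }
    }
    return "\u{FFFD}"; // replacement character when no table maps the code
}

// Toy tables, invented for illustration.
$toUnicode   = [0x01 => 'fi'];        // e.g. a ligature glyph mapped to two letters
$differences = [0x41 => "\u{0391}"];  // custom vector: code 0x41 remapped to Greek Alpha
$winAnsi     = [0x41 => 'A', 0x42 => 'B'];

foreach ([0x01, 0x41, 0x42, 0x7F] as $code) {
    printf("0x%02X => %s\n", $code, decodeCharCode($code, $toUnicode, $differences, $winAnsi));
}
```

In this toy setup code 0x41 resolves through the custom vector rather than the standard table, which is the point of ordering the lookups.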
Text positioning operators are used to infer spaces and line breaks.
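One common way to infer spaces is to look at the numeric adjustments inside a TJ array. The sketch below illustrates that general heuristic only; `joinTjArray` and the threshold of 200 are assumptions made for this example, not the extractor's actual rule.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical sketch: joins the strings of a TJ array, inserting a space
 * when a numeric adjustment opens a gap wider than $gapThreshold.
 * TJ numbers are in thousandths of a text space unit and are subtracted
 * from the current position, so negative values widen the gap.
 *
 * @param list<string|int|float> $tjArray
 */
function joinTjArray(array $tjArray, float $gapThreshold = 200.0): string
{
    $out = '';
    foreach ($tjArray as $item) {
        if (is_string($item)) {
            $out .= $item;
        } elseif (-$item > $gapThreshold && $out !== '' && !str_ends_with($out, ' ')) {
            $out .= ' '; // wide gap: treat it as a word break
        }
    }
    return $out;
}

echo joinTjArray(['Hel', -15, 'lo', -320, 'Wor', 20, 'ld']), "\n"; // Hello World
```

Small adjustments such as -15 or 20 are ordinary kerning and are ignored; only the -320 gap becomes a space.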
Form XObjects
The extractor handles the Do operator to recurse into Form XObjects, so text inside stamped content, form field appearances, and embedded XObjects is extracted automatically. Nested Form XObjects are supported up to 10 levels deep, with font state properly saved/restored across XObject boundaries.
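The recursion, depth limit, and font save/restore can be sketched on a toy model. Everything here is hypothetical: content streams are modelled as plain arrays of operations, and `extractOps` is an illustration of the idea rather than the extractor's code.

```php
<?php
declare(strict_types=1);

const MAX_XOBJECT_DEPTH = 10; // nesting limit stated above

/**
 * Hypothetical model: a content stream is a list of operations, each one of
 * ['Tf', fontName], ['Tj', text] or ['Do', nestedStream].
 * Returns a list of "font:text" strings in drawing order.
 */
function extractOps(array $stream, string $font = 'none', int $depth = 0): array
{
    $out = [];
    foreach ($stream as [$op, $arg]) {
        if ($op === 'Tf') {
            $font = $arg;
        } elseif ($op === 'Tj') {
            $out[] = "{$font}:{$arg}";
        } elseif ($op === 'Do' && $depth < MAX_XOBJECT_DEPTH) {
            // $font is passed by value, so font changes made inside the
            // XObject do not leak back into the caller's state.
            array_push($out, ...extractOps($arg, $font, $depth + 1));
        }
    }
    return $out;
}

$stamp = [['Tf', 'F2'], ['Tj', 'DRAFT']];
$page  = [['Tf', 'F1'], ['Tj', 'Hello'], ['Do', $stamp], ['Tj', 'World']];
print_r(extractOps($page)); // F1:Hello, F2:DRAFT, F1:World
```

The text after the stamp is still attributed to F1, which is the effect of restoring font state at the XObject boundary.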