Text Extractor

TextExtractor wraps PdfReader’s text extraction with a friendly, toolkit-level API. All page numbers are 1-based.

End-to-end example

use Phpdftk\Pdf\Toolkit\TextExtractor;

$extractor = TextExtractor::open('contract.pdf');

echo "Pages: {$extractor->getPageCount()}\n\n";

// Page-by-page extraction (1-based)
$firstPage = $extractor->page(1);
echo $firstPage, "\n\n";

// Search for a literal phrase
$results = $extractor->search('indemnify');
foreach ($results as $match) {
    echo "Page {$match->pageNumber}: {$match->text}\n";
}

// Regex search — match dollar amounts
$amounts = $extractor->searchPattern('/\$[\d,]+\.\d{2}/');
foreach ($amounts as $match) {
    echo "Found amount: {$match->text}\n";
}


Opening a PDF

use Phpdftk\Pdf\Toolkit\TextExtractor;

// From file
$extractor = TextExtractor::open('report.pdf');

// From string
$extractor = TextExtractor::openString($pdfBytes);

// Encrypted PDF
$extractor = TextExtractor::open('secured.pdf', password: 'secret');

Extracting text

// Single page (1-based)
$text = $extractor->page(1);

// All pages with separator
$text = $extractor->allPages("\n---\n");

// Per-page array
$pages = $extractor->perPage();
// => [1 => "page 1 text", 2 => "page 2 text"]
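
Because perPage() returns an array keyed by 1-based page number, per-page processing is ordinary array work. The sketch below uses a hard-coded array in the same shape rather than a real PDF. The helper name wordCountsPerPage is hypothetical and not part of the library.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper (not a library function): counts words on each page.
 *
 * @param array<int, string> $pages 1-based page number => page text
 * @return array<int, int>          1-based page number => word count
 */
function wordCountsPerPage(array $pages): array
{
    // array_map() with a single array preserves the original keys,
    // so the 1-based page numbers survive the transformation.
    return array_map(
        static fn (string $text): int => str_word_count($text),
        $pages
    );
}

// Stand-in for $extractor->perPage()
$counts = wordCountsPerPage([
    1 => 'first page text',
    2 => 'second page has more text',
]);

foreach ($counts as $pageNumber => $words) {
    echo "Page {$pageNumber}: {$words} words\n";
}
```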

Searching

Literal string

$results = $extractor->search('indemnification');

echo $results->count() . " matches\n";

foreach ($results as $match) {
    echo "Page {$match->pageNumber}: {$match->text}\n";
}

Regex pattern

$results = $extractor->searchPattern('/\d{3}-\d{2}-\d{4}/'); // US Social Security number format (NNN-NN-NNNN)
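
To make the idea of a per-page regex search concrete, here is a minimal sketch that runs a pattern over an array of page texts and records the page number, matched text, and offset of each hit. It is not the library's implementation: the function name searchPagesSketch and the array-based result shape are assumptions for illustration. Note that PREG_OFFSET_CAPTURE reports byte offsets, which differ from character offsets for multi-byte text.

```php
<?php
declare(strict_types=1);

/**
 * Illustrative only, not the library's code.
 *
 * @param array<int, string> $pages 1-based page number => page text
 * @return list<array{pageNumber: int, text: string, offset: int}>
 */
function searchPagesSketch(array $pages, string $pattern): array
{
    $matches = [];
    foreach ($pages as $pageNumber => $text) {
        // PREG_OFFSET_CAPTURE yields [matchedText, byteOffset] pairs.
        if (preg_match_all($pattern, $text, $found, PREG_OFFSET_CAPTURE) > 0) {
            foreach ($found[0] as [$matched, $offset]) {
                $matches[] = [
                    'pageNumber' => $pageNumber,
                    'text'       => $matched,
                    'offset'     => $offset,
                ];
            }
        }
    }
    return $matches;
}

$hits = searchPagesSketch(
    [1 => 'No numbers here.', 2 => 'SSN on file: 123-45-6789.'],
    '/\d{3}-\d{2}-\d{4}/'
);

foreach ($hits as $hit) {
    echo "Page {$hit['pageNumber']} @ {$hit['offset']}: {$hit['text']}\n";
}
```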

Quick contains check

if ($extractor->contains('CONFIDENTIAL')) {
    // handle sensitive document
}

Search results API

$results = $extractor->search('term');

$results->count();   // int
$results->all();     // list<TextMatch>
$results->first();   // ?TextMatch

// TextMatch properties ($match is any element of $results)
$match->pageNumber;  // int (1-based)
$match->text;        // string
$match->offset;      // int (char offset in page text)

TextSearchResults implements IteratorAggregate and Countable, so it works with foreach and count().
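
For readers unfamiliar with these two interfaces, the following sketch shows how a results collection can support both foreach and count(). SketchMatch and SketchResults are hypothetical stand-ins written for this example. They mirror the properties and methods listed above but are not the library's actual classes.

```php
<?php
declare(strict_types=1);

// Hypothetical stand-in for TextMatch.
final class SketchMatch
{
    public function __construct(
        public int $pageNumber,
        public string $text,
        public int $offset,
    ) {}
}

// Hypothetical stand-in for TextSearchResults.
final class SketchResults implements IteratorAggregate, Countable
{
    /** @param list<SketchMatch> $matches */
    public function __construct(private array $matches) {}

    // IteratorAggregate: lets foreach walk the matches.
    public function getIterator(): ArrayIterator
    {
        return new ArrayIterator($this->matches);
    }

    // Countable: lets the built-in count() accept the object.
    public function count(): int
    {
        return count($this->matches);
    }

    /** @return list<SketchMatch> */
    public function all(): array
    {
        return $this->matches;
    }

    public function first(): ?SketchMatch
    {
        return $this->matches[0] ?? null;
    }
}

$results = new SketchResults([
    new SketchMatch(1, 'term', 12),
    new SketchMatch(3, 'term', 40),
]);

echo count($results), " matches\n";
foreach ($results as $match) {
    echo "Page {$match->pageNumber} at offset {$match->offset}\n";
}
```

Since first() returns null for an empty result set, callers should check for null before reading properties from it.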

Document info

$extractor->getPageCount(); // int

Escape hatch

$reader = $extractor->getReader(); // PdfReader

Lower-level access via PdfReader

TextExtractor wraps PdfReader. If you only need text extraction without the search/per-page conveniences, you can call the reader directly. Note that PdfReader uses 0-based page indexes:

use Phpdftk\Pdf\Reader\PdfReader;

$pdf = PdfReader::fromFile('document.pdf');

// Single page (0-based index)
$text = $pdf->extractText(0);

// All pages
$allText = $pdf->extractAllText("\n\n");
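
Mixing the two conventions is an easy source of off-by-one errors. If your code works with 1-based page numbers but calls PdfReader directly, a small conversion helper keeps the boundary explicit. The function below is a hypothetical helper, not something the library provides, and it takes the page count as a plain argument.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper (not part of the library): converts a 1-based page
 * number (TextExtractor convention) to a 0-based index (PdfReader convention).
 */
function pageNumberToIndex(int $pageNumber, int $pageCount): int
{
    if ($pageNumber < 1 || $pageNumber > $pageCount) {
        throw new OutOfRangeException(
            "Page {$pageNumber} is outside 1..{$pageCount}"
        );
    }
    return $pageNumber - 1;
}

// Page 1 of a 5-page document maps to index 0.
$index = pageNumberToIndex(1, 5);
echo "Index: {$index}\n";
```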

How it works

The text extractor processes content stream operators (BT, ET, Tf, Td, Tj, TJ, etc.) and converts character codes to Unicode using the following sources, in order of preference:

1. ToUnicode CMap (if present on the font), the most reliable source
2. Encoding + Differences (custom encoding vectors)
3. WinAnsi + Adobe Glyph List fallback for standard fonts
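
The lookup sources listed above can be pictured as a fallback chain. The sketch below is a simplified model, not the library's internals. The function name decodeCharCode and the four lookup arrays are assumptions for illustration, and the tables hold only a few sample entries.

```php
<?php
declare(strict_types=1);

/**
 * Illustrative only. All parameter shapes are assumptions.
 *
 * @param array<int, string>    $toUnicode   char code => Unicode text (ToUnicode CMap)
 * @param array<int, string>    $differences char code => glyph name (/Differences)
 * @param array<int, string>    $winAnsi     char code => glyph name (base encoding)
 * @param array<string, string> $glyphList   glyph name => Unicode text (Adobe Glyph List)
 */
function decodeCharCode(
    int $code,
    array $toUnicode,
    array $differences,
    array $winAnsi,
    array $glyphList
): ?string {
    // 1. A ToUnicode entry wins whenever one exists.
    if (isset($toUnicode[$code])) {
        return $toUnicode[$code];
    }

    // 2. A /Differences entry overrides the base encoding for this code.
    // 3. Otherwise fall back to the WinAnsi glyph name.
    $glyphName = $differences[$code] ?? $winAnsi[$code] ?? null;
    if ($glyphName === null) {
        return null; // no source knows this code
    }

    // Glyph names are resolved to Unicode through the glyph list.
    return $glyphList[$glyphName] ?? null;
}

// Tiny sample tables; real ones are far larger.
$toUnicode   = [0x01 => "\u{FB01}"];   // "fi" ligature
$differences = [0x02 => 'eacute'];
$winAnsi     = [0x41 => 'A'];
$glyphList   = ['A' => 'A', 'eacute' => "\u{00E9}"];

foreach ([0x01, 0x02, 0x41, 0x7F] as $code) {
    $decoded = decodeCharCode($code, $toUnicode, $differences, $winAnsi, $glyphList);
    echo sprintf("0x%02X => %s\n", $code, $decoded ?? '(unmapped)');
}
```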

Text positioning operators are used to infer spaces and line breaks.
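
As one example of space inference, consider the TJ operator, whose array operand mixes strings with numeric position adjustments. Per the PDF specification, those numbers are in thousandths of a text-space unit and are subtracted from the current position, so a large negative value leaves a visible gap. The sketch below inserts a space when the gap passes a threshold. The function name and the threshold of 200 are assumptions chosen for illustration. The library's actual heuristics are not documented here and may differ.

```php
<?php
declare(strict_types=1);

/**
 * Illustrative heuristic only, not the library's algorithm.
 *
 * @param list<string|int|float> $elements       decoded TJ array operand
 * @param float                  $spaceThreshold assumed gap size that counts as a space
 */
function textFromTjArray(array $elements, float $spaceThreshold = 200.0): string
{
    $out = '';
    foreach ($elements as $element) {
        if (is_string($element)) {
            $out .= $element;
            continue;
        }
        // Negative adjustments move the next glyph to the right (a gap).
        $gap = -$element;
        if ($gap >= $spaceThreshold && $out !== '' && !str_ends_with($out, ' ')) {
            $out .= ' ';
        }
    }
    return $out;
}

// Small kerning tweak (-20) is ignored; the large gap (-300) becomes a space.
$text = textFromTjArray(['Hel', -20, 'lo', -300, 'World']);
echo $text, "\n";
```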

Form XObjects

The extractor follows the Do operator into Form XObjects, so text inside stamped content, form field appearances, and embedded XObjects is extracted automatically. Nested Form XObjects are supported up to 10 levels deep, and font state is saved and restored across XObject boundaries.
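
The recursion, depth limit, and font save/restore described above can be modeled with a simplified operator list. The sketch below is not the library's code. The function collectText, the two-element operator tuples, and the exact way depth is counted are assumptions for illustration.

```php
<?php
declare(strict_types=1);

const MAX_XOBJECT_DEPTH = 10; // limit stated in the text; counting scheme assumed

/**
 * Illustrative only. Each op is [operator, operand], e.g. ['Tf', 'F1'],
 * ['Tj', 'text'], or ['Do', 'XObjectName'].
 *
 * @param list<array{0: string, 1: string}>                $ops
 * @param array<string, list<array{0: string, 1: string}>> $xobjects name => nested ops
 * @param array{font: ?string}                             $state    current text state
 * @return list<string> text runs tagged with the font active when they were shown
 */
function collectText(array $ops, array $xobjects, array &$state, int $depth = 0): array
{
    $runs = [];
    foreach ($ops as [$operator, $operand]) {
        switch ($operator) {
            case 'Tf':
                $state['font'] = $operand;
                break;
            case 'Tj':
                $runs[] = "{$state['font']}:{$operand}";
                break;
            case 'Do':
                // Stop at the depth limit; this also guards against
                // XObjects that reference themselves.
                if ($depth >= MAX_XOBJECT_DEPTH || !isset($xobjects[$operand])) {
                    break;
                }
                $saved = $state; // save font state before entering
                $runs = array_merge(
                    $runs,
                    collectText($xobjects[$operand], $xobjects, $state, $depth + 1)
                );
                $state = $saved; // restore on the way out
                break;
        }
    }
    return $runs;
}

$xobjects = ['Stamp' => [['Tf', 'F2'], ['Tj', 'DRAFT']]];
$pageOps  = [['Tf', 'F1'], ['Tj', 'Hello'], ['Do', 'Stamp'], ['Tj', 'World']];

$state = ['font' => null];
$runs  = collectText($pageOps, $xobjects, $state);
echo implode("\n", $runs), "\n";
```

Restoring the saved state after the nested call is what keeps the XObject's font change (F2) from leaking into the text that follows it on the page.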