
Text Extractor

TextExtractor wraps PdfReader’s text extraction with a friendly, toolkit-level API. All page numbers are 1-based.

use Phpdftk\Pdf\Toolkit\TextExtractor;

$extractor = TextExtractor::open('contract.pdf');
echo "Pages: {$extractor->getPageCount()}\n\n";

// Page-by-page extraction (1-based)
$firstPage = $extractor->page(1);

// Search for a literal phrase
$results = $extractor->search('indemnify');
foreach ($results as $match) {
    echo "Page {$match->pageNumber}: {$match->text}\n";
}

// Regex search: match dollar amounts
$amounts = $extractor->searchPattern('/\$[\d,]+\.\d{2}/');
foreach ($amounts as $match) {
    echo "Found amount: {$match->text}\n";
}

use Phpdftk\Pdf\Toolkit\TextExtractor;

// From file
$extractor = TextExtractor::open('report.pdf');

// From string
$extractor = TextExtractor::openString($pdfBytes);

// Encrypted PDF
$extractor = TextExtractor::open('secured.pdf', password: 'secret');

// Single page (1-based)
$text = $extractor->page(1);

// All pages with separator
$text = $extractor->allPages("\n---\n");

// Per-page array
$pages = $extractor->perPage();
// => [1 => "page 1 text", 2 => "page 2 text"]

// Literal search
$results = $extractor->search('indemnification');
echo $results->count() . " matches\n";
foreach ($results as $match) {
    echo "Page {$match->pageNumber}: {$match->text}\n";
}

// Regex search (SSN pattern)
$results = $extractor->searchPattern('/\d{3}-\d{2}-\d{4}/');

// Quick containment check
if ($extractor->contains('CONFIDENTIAL')) {
    // handle sensitive document
}
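
To make the relationship between page number, matched text, and offset concrete, here is a minimal sketch of a pattern search over a plain page-number-to-text array. It is not the toolkit's implementation: `findPatternMatches` is a hypothetical helper, and it reports the byte offsets that `preg_match_all` returns, which can differ from character offsets in multibyte text.

```php
<?php

/**
 * Hypothetical helper (not part of the toolkit): run a regex over
 * 1-based per-page text and collect page number, match text and offset.
 *
 * @param array<int, string> $pages page number => extracted text
 * @return list<array{pageNumber: int, text: string, offset: int}>
 */
function findPatternMatches(array $pages, string $pattern): array
{
    $matches = [];
    foreach ($pages as $pageNumber => $text) {
        // PREG_OFFSET_CAPTURE yields [matchedText, byteOffset] pairs.
        if (preg_match_all($pattern, $text, $found, PREG_OFFSET_CAPTURE) === false) {
            throw new InvalidArgumentException("Invalid pattern: {$pattern}");
        }
        foreach ($found[0] as [$matchText, $offset]) {
            $matches[] = [
                'pageNumber' => $pageNumber,
                'text'       => $matchText,
                'offset'     => $offset,
            ];
        }
    }
    return $matches;
}

// Sample data standing in for extracted page text.
$pages = [
    1 => 'Deposit of $1,250.00 is due at signing.',
    2 => 'No fees apply.',
    3 => 'Balance: $300.50 and $12.00 late fee.',
];

foreach (findPatternMatches($pages, '/\$[\d,]+\.\d{2}/') as $match) {
    echo "Page {$match['pageNumber']} @ {$match['offset']}: {$match['text']}\n";
}
```

Pages without a match simply contribute nothing, so the result stays a flat list ordered by page and then by position within the page.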

$results = $extractor->search('term');
$results->count(); // int
$results->all();   // list<TextMatch>
$results->first(); // ?TextMatch

// TextMatch properties
$match->pageNumber; // int (1-based)
$match->text;       // string
$match->offset;     // int (char offset in page text)

TextSearchResults implements IteratorAggregate and Countable, so it works with foreach and count().
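
For readers unfamiliar with these two interfaces, the sketch below shows what implementing them involves. `DemoSearchResults` is an illustrative stand-in written for this page, not the toolkit's class, and it holds plain strings rather than TextMatch objects. It assumes PHP 8 for constructor property promotion.

```php
<?php

/**
 * Illustrative stand-in (hypothetical, not the toolkit's class) showing
 * how IteratorAggregate and Countable enable foreach and count().
 */
final class DemoSearchResults implements IteratorAggregate, Countable
{
    /** @param list<string> $matches */
    public function __construct(private array $matches)
    {
    }

    // foreach calls getIterator() to obtain something traversable.
    public function getIterator(): ArrayIterator
    {
        return new ArrayIterator($this->matches);
    }

    // count($results) delegates to this method.
    public function count(): int
    {
        return count($this->matches);
    }

    // Null when there are no matches, mirroring a nullable first().
    public function first(): ?string
    {
        return $this->matches[0] ?? null;
    }
}

$results = new DemoSearchResults(['indemnify', 'indemnification']);

echo count($results) . " matches\n";
foreach ($results as $index => $text) {
    echo "{$index}: {$text}\n";
}
```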

$extractor->getPageCount(); // int
$reader = $extractor->getReader(); // PdfReader

TextExtractor wraps PdfReader. If you only need text extraction without the search/per-page conveniences, you can call the reader directly. Note that PdfReader uses 0-based page indexes:

use Phpdftk\Pdf\Reader\PdfReader;

$pdf = PdfReader::fromFile('document.pdf');

// Single page (0-based index)
$text = $pdf->extractText(0);

// All pages
$allText = $pdf->extractAllText("\n\n");
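
Mixing the two numbering schemes is an easy source of off-by-one errors. One way to keep the conversion in a single place is a small guard function such as the one below. `pageNumberToIndex` is a hypothetical helper for illustration and is not provided by the library.

```php
<?php

/**
 * Hypothetical helper: translate a 1-based page number into a
 * 0-based index, rejecting values outside the document.
 */
function pageNumberToIndex(int $pageNumber, int $pageCount): int
{
    if ($pageNumber < 1 || $pageNumber > $pageCount) {
        throw new OutOfRangeException(
            "Page {$pageNumber} is outside 1..{$pageCount}"
        );
    }
    return $pageNumber - 1;
}

echo pageNumberToIndex(1, 12), "\n";  // first page of a 12-page file
echo pageNumberToIndex(12, 12), "\n"; // last page of a 12-page file
```

Code that talks to the reader directly could pass the result as the page index, while user-facing messages keep the 1-based number.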

The text extractor processes content stream operators (BT, ET, Tf, Td, Tj, TJ, etc.) and converts character codes to Unicode using the following sources, in order of preference:

  1. ToUnicode CMap (if present on the font) — most reliable
  2. Encoding + Differences (custom encoding vectors)
  3. WinAnsi + Adobe Glyph List fallback for standard fonts
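
The three-step lookup above can be sketched as a chain of table lookups. This is a simplified illustration under stated assumptions, not the extractor's actual code: `decodeCharCode` and all four tables are hypothetical, the tables hold only a handful of entries, and it assumes that Differences and base-encoding entries are glyph names resolved through a glyph list.

```php
<?php

/**
 * Hypothetical sketch of a three-step character code lookup.
 *   $toUnicode    code => Unicode string (from a ToUnicode CMap)
 *   $differences  code => glyph name (from Encoding /Differences)
 *   $baseEncoding code => glyph name (e.g. a WinAnsi table)
 *   $glyphList    glyph name => Unicode string (glyph list subset)
 */
function decodeCharCode(
    int $code,
    array $toUnicode,
    array $differences,
    array $baseEncoding,
    array $glyphList
): string {
    // 1. A ToUnicode entry wins outright.
    if (isset($toUnicode[$code])) {
        return $toUnicode[$code];
    }
    // 2. A Differences override, else 3. the base encoding's glyph name.
    $glyphName = $differences[$code] ?? $baseEncoding[$code] ?? null;
    if ($glyphName !== null && isset($glyphList[$glyphName])) {
        return $glyphList[$glyphName];
    }
    // Unmappable code: emit the Unicode replacement character.
    return "\u{FFFD}";
}

// Tiny illustrative tables; real ones are far larger.
$toUnicode   = [0x01 => 'H'];
$differences = [0x02 => 'fi'];
$winAnsi     = [0x41 => 'A', 0x80 => 'Euro'];
$glyphList   = ['A' => 'A', 'Euro' => "\u{20AC}", 'fi' => "\u{FB01}"];

foreach ([0x01, 0x02, 0x80, 0x7F] as $code) {
    echo decodeCharCode($code, $toUnicode, $differences, $winAnsi, $glyphList), "\n";
}
```

Returning a replacement character for unmapped codes is a choice made for this sketch; an extractor could equally drop such codes or pass them through unchanged.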

Text positioning operators are used to infer spaces and line breaks.
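
As a rough illustration of how positions can drive whitespace, the sketch below joins positioned text runs by comparing baselines and horizontal gaps. The run structure, the `joinTextRuns` name, and the 0.5 and 0.2 thresholds are all assumptions made for this example. The extractor's actual heuristics are not documented here and may differ.

```php
<?php

/**
 * Hypothetical sketch: join positioned text runs into a string.
 * Each run is ['text' => string, 'x' => float, 'y' => float,
 * 'width' => float, 'size' => float] in page units, in content order.
 * The 0.5 and 0.2 thresholds are illustrative guesses.
 */
function joinTextRuns(array $runs): string
{
    $out = '';
    $prev = null;
    foreach ($runs as $run) {
        if ($prev !== null) {
            $lineShift = abs($run['y'] - $prev['y']);
            $gap = $run['x'] - ($prev['x'] + $prev['width']);
            if ($lineShift > 0.5 * $run['size']) {
                $out .= "\n"; // baseline moved: treat as a new line
            } elseif ($gap > 0.2 * $run['size']) {
                $out .= ' ';  // visible horizontal gap: treat as a word break
            }
        }
        $out .= $run['text'];
        $prev = $run;
    }
    return $out;
}

$runs = [
    ['text' => 'Total', 'x' => 72.0,  'y' => 700.0, 'width' => 30.0, 'size' => 12.0],
    ['text' => 'due',   'x' => 106.0, 'y' => 700.0, 'width' => 20.0, 'size' => 12.0],
    ['text' => ':',     'x' => 126.5, 'y' => 700.0, 'width' => 3.0,  'size' => 12.0],
    ['text' => '$300',  'x' => 72.0,  'y' => 686.0, 'width' => 28.0, 'size' => 12.0],
];

echo joinTextRuns($runs), "\n";
```

Scaling the thresholds by font size keeps the same rule usable for both body text and large headings.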

The extractor follows the Do operator into Form XObjects, so text inside stamped content, form field appearances, and embedded XObjects is extracted automatically. Nested Form XObjects are supported up to 10 levels deep, and font state is saved and restored across XObject boundaries.
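
The recursion, the depth limit, and the font save/restore can be sketched over a heavily simplified operator list. This is an illustration only: `collectText`, the operator tuples, and the `font:text` output format are hypothetical, and real content streams carry much more state than a font name.

```php
<?php

const MAX_XOBJECT_DEPTH = 10; // nesting limit stated above

/**
 * Hypothetical sketch over a simplified operator list:
 *   ['Tf', fontName], ['Tj', text], ['Do', xobjectName]
 * $forms maps an XObject name to its own operator list.
 * Returns "font:text" strings so the active font is visible.
 *
 * @return list<string>
 */
function collectText(array $ops, array $forms, string $font = '', int $depth = 0): array
{
    $out = [];
    foreach ($ops as [$operator, $operand]) {
        switch ($operator) {
            case 'Tf':
                $font = $operand; // changes only this call's local copy
                break;
            case 'Tj':
                $out[] = "{$font}:{$operand}";
                break;
            case 'Do':
                if ($depth < MAX_XOBJECT_DEPTH && isset($forms[$operand])) {
                    // $font is passed by value, so whatever the form sets is
                    // discarded on return: an implicit save/restore.
                    $out = array_merge(
                        $out,
                        collectText($forms[$operand], $forms, $font, $depth + 1)
                    );
                }
                break;
        }
    }
    return $out;
}

$forms = ['Stamp' => [['Tf', 'F2'], ['Tj', 'APPROVED']]];
$page  = [['Tf', 'F1'], ['Tj', 'Invoice'], ['Do', 'Stamp'], ['Tj', 'Total']];

echo implode("\n", collectText($page, $forms)), "\n";
```

The depth check also guards against a form that references itself, which would otherwise recurse without end.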