PdfReader
PdfReader parses existing PDF files into the phpdftk object model. It handles classic xref tables, cross-reference streams, object streams, incremental updates, and encrypted PDFs.
Opening a PDF
    use Phpdftk\Pdf\Reader\PdfReader;

    // From file
    $pdf = PdfReader::fromFile('document.pdf');

    // From string (e.g., HTTP response body)
    $pdf = PdfReader::fromString($bytes);

    // From stream resource
    $pdf = PdfReader::fromStream(fopen('php://stdin', 'rb'));

    // Encrypted PDF
    $pdf = PdfReader::fromFile('secured.pdf', password: 'secret');

    // Public-key encrypted PDF
    $pdf = PdfReader::fromFilePublicKey('secured.pdf', $certPem, $keyPem);

Document info
    echo $pdf->getVersion();   // "1.7"
    echo $pdf->getPageCount(); // 42

    // Linearization detection
    if ($pdf->isLinearized()) {
        $params = $pdf->getLinearizationParameters();
        echo "Web-optimized, {$params['pageCount']} pages";
    }

Accessing pages
    // All pages as raw dictionaries
    $pages = $pdf->getPages();

    // Single page by 0-based index
    $page = $pdf->getPage(0);

    // Typed Page objects (hydrated into Core\Document\Page)
    $typedPages = $pdf->getTypedPages();
    $typedPage = $pdf->getTypedPage(0);

Catalog and trailer
    $catalog = $pdf->getCatalog();    // raw PdfDictionary
    $typed = $pdf->getTypedCatalog(); // hydrated Core\Document\Catalog
    $trailer = $pdf->getTrailer();
    $info = $pdf->getInfo();

Resolving objects
    // By object number
    $obj = $pdf->getObject(42);

    // By reference
    $target = $pdf->resolveReference($ref);

    // Typed hydration of any object
    $typed = $pdf->getTypedObject(42);

Text extraction
    // Single page (0-based index)
    $text = $pdf->extractText(0);

    // All pages concatenated
    $allText = $pdf->extractAllText("\n\n");

The example below reads the page-labels showcase PDF and writes both the joined and per-page transcriptions to disk.
    use Phpdftk\Pdf\Reader\PdfReader;

    // Reuse the page-labels showcase PDF as a realistic input: it has front matter,
    // chapters, and an appendix with distinctly labelled pages.
    $inputPdf = example_output_path('writer/page-labels.pdf');
    $reader = PdfReader::fromFile($inputPdf);

    // extractAllText() concatenates every page's text with a separator.
    // extractText($i) returns one page at a time when you only need a slice.
    $allText = $reader->extractAllText("\n\n--- page break ---\n\n");

    // Persist both the per-page and full-document forms so the docs page can show both.
    file_put_contents(example_output_path('reader/text-output.txt'), $allText);

    $perPage = [];
    for ($i = 0, $n = $reader->getPageCount(); $i < $n; $i++) {
        $perPage[] = sprintf("=== page %d ===\n%s", $i + 1, $reader->extractText($i));
    }
    file_put_contents(
        example_output_path('reader/text-per-page.txt'),
        implode("\n\n", $perPage),
    );
For per-page access, search, and a friendlier 1-based API, see Working with PDFs → Text Extractor. That page also covers how text extraction works under the hood (ToUnicode CMaps, encoding fallbacks, Form XObject recursion).
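If you only need a quick search over the reader's 0-based output, the per-page strings can be scanned directly. The sketch below is a minimal illustration in plain PHP: `findPagesContaining()` is a hypothetical helper (not part of phpdftk), and the sample strings stand in for values you would collect with `extractText($i)`.

```php
<?php

/**
 * Find the pages whose extracted text contains a search term.
 *
 * @param list<string> $pageTexts One string per page, in page order.
 * @return list<int> 1-based page numbers that contain the term.
 */
function findPagesContaining(array $pageTexts, string $term): array
{
    $hits = [];
    if ($term === '') {
        return $hits; // an empty term would match every page, so treat it as no match
    }
    foreach ($pageTexts as $index => $text) {
        // Case-insensitive substring match (ASCII case folding only).
        if (stripos($text, $term) !== false) {
            $hits[] = $index + 1; // convert the 0-based index to a page number
        }
    }
    return $hits;
}

// Stand-in data; with the reader you would fill this from extractText($i).
$pageTexts = [
    'Preface and acknowledgements',
    'Chapter 1: Revenue overview',
    'Appendix: revenue tables',
];

print_r(findPagesContaining($pageTexts, 'revenue'));
```

The helper reports 1-based page numbers because those are what a reader of the document sees, while the loop index stays 0-based to match `extractText()`.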
Extracting metadata
getInfo() returns the document’s /Info dictionary. The example below builds a PDF with rich metadata, reads it back, and writes a JSON summary.
    use Phpdftk\Pdf\Core\Document\Info;
    use Phpdftk\Pdf\Core\Font\StandardFont;
    use Phpdftk\Pdf\Core\Font\Type1Font;
    use Phpdftk\Pdf\Core\PdfString;
    use Phpdftk\Pdf\Reader\PdfReader;
    use Phpdftk\Pdf\Writer\PdfWriter;

    // Step 1: write a small PDF with rich /Info metadata + a synced XMP packet.
    $writer = new PdfWriter();
    $page = $writer->addPage();
    $writer->addFont(new Type1Font(StandardFont::Helvetica));
    $writer->addContentStream($page)
        ->beginText()->setFont('F1', 18)->moveTextPosition(72, 720)
        ->showText('Document with rich metadata')->endText();

    $info = new Info();
    $info->title = new PdfString('Q4 2026 Quarterly Report');
    $info->author = new PdfString('Finance Team');
    $info->subject = new PdfString('Revenue and expense summary');
    $info->keywords = new PdfString('finance, quarterly, 2026, Q4');
    $info->creator = new PdfString('phpdftk metadata showcase');
    $info->producer = new PdfString('phpdftk');
    $info->creationDate = new PdfString('D:20260512000000Z');
    $writer->setInfo($info);
    $writer->syncInfoToMetadata();

    $inputPdf = example_output_path('reader/metadata-source.pdf');
    $writer->save($inputPdf);

    // Step 2: read the same PDF back and extract its metadata.
    $reader = PdfReader::fromFile($inputPdf);
    $infoDict = $reader->getInfo();

    $decode = static function (\Phpdftk\Pdf\Core\Serializable $value): string {
        if ($value instanceof \Phpdftk\Pdf\Core\PdfString) {
            return $value->value;
        }
        if ($value instanceof \Phpdftk\Pdf\Core\PdfName) {
            return $value->value;
        }
        return (string) $value;
    };

    $summary = [
        'pdfVersion' => $reader->getVersion(),
        'pageCount' => $reader->getPageCount(),
        'linearized' => $reader->isLinearized(),
        'info' => [],
    ];
    foreach (['Title', 'Author', 'Subject', 'Keywords', 'Creator', 'Producer', 'CreationDate', 'ModDate'] as $key) {
        if ($infoDict?->has($key)) {
            $summary['info'][$key] = $decode($infoDict->get($key));
        }
    }

    file_put_contents(
        example_output_path('reader/metadata.json'),
        json_encode($summary, JSON_PRETTY_PRINT | JSON_UNESCAPED_SLASHES) . "\n",
    );
The captured summary lives at samples/reader/metadata.json.
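The CreationDate and ModDate values come back as raw PDF date strings such as `D:20260512000000Z`. The following is a minimal sketch of converting one into a `DateTimeImmutable`. `parsePdfDate()` is a hypothetical helper, not a phpdftk API, and it only handles the full `D:YYYYMMDDHHmmSS` form with an optional `Z` or `+HH'mm'` suffix. Treating a missing zone as UTC is an assumption of this sketch.

```php
<?php

/**
 * Parse a PDF date string into a DateTimeImmutable.
 * Returns null when the string does not match the supported subset.
 */
function parsePdfDate(string $raw): ?DateTimeImmutable
{
    $pattern = '/^D:(\d{4})(\d{2})(\d{2})(\d{2})(\d{2})(\d{2})(?:(Z)|([+-])(\d{2})\'?(\d{2})\'?)?$/';
    if (preg_match($pattern, $raw, $m) !== 1) {
        return null;
    }

    // "Z" or no zone at all: use UTC (the no-zone case is an assumption).
    $offset = '+00:00';
    if (($m[8] ?? '') !== '') {
        // Explicit offset such as +02'00' becomes +02:00.
        $offset = $m[8] . $m[9] . ':' . $m[10];
    }

    $iso = sprintf('%s-%s-%sT%s:%s:%s%s', $m[1], $m[2], $m[3], $m[4], $m[5], $m[6], $offset);
    $date = DateTimeImmutable::createFromFormat('Y-m-d\TH:i:sP', $iso);

    return $date === false ? null : $date;
}

$created = parsePdfDate('D:20260512000000Z');
echo $created?->format(DATE_ATOM), "\n";
```

With the metadata example above, the input would be the decoded string stored under `$summary['info']['CreationDate']`.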
Inspecting structure
getCatalog() and getPages() expose the document’s structural skeleton. The example below walks the outline showcase PDF and emits a JSON description of its catalog and page tree.
    use Phpdftk\Pdf\Reader\PdfReader;

    // Inspect the structure of the outline showcase PDF: it has multiple pages,
    // nested bookmarks, and named destinations.
    $inputPdf = example_output_path('writer/outline.pdf');
    $reader = PdfReader::fromFile($inputPdf);

    $catalog = $reader->getCatalog();
    $pages = $reader->getPages();

    // Summarise the page tree.
    $pageSummary = [];
    foreach ($pages as $i => $page) {
        $mediaBox = null;
        if ($page->has('MediaBox')) {
            $box = $page->get('MediaBox');
            if ($box instanceof \Phpdftk\Pdf\Core\PdfArray) {
                $mediaBox = array_map(
                    fn ($n) => $n instanceof \Phpdftk\Pdf\Core\PdfNumber ? $n->value : null,
                    $box->items,
                );
            }
        }
        $pageSummary[] = [
            'index' => $i,
            'mediaBox' => $mediaBox,
            'hasContents' => $page->has('Contents'),
            'hasAnnots' => $page->has('Annots'),
        ];
    }

    // Summarise the catalog's top-level keys (only the structural ones the reader
    // can resolve without a typed schema).
    $catalogKeys = [];
    foreach (['Type', 'Version', 'Pages', 'Outlines', 'Names', 'PageLabels', 'AcroForm', 'OpenAction'] as $key) {
        $catalogKeys[$key] = $catalog->has($key) ? 'present' : 'absent';
    }

    $summary = [
        'pdfVersion' => $reader->getVersion(),
        'linearized' => $reader->isLinearized(),
        'pageCount' => $reader->getPageCount(),
        'catalog' => $catalogKeys,
        'pages' => $pageSummary,
    ];

    file_put_contents(
        example_output_path('reader/structure.json'),
        json_encode($summary, JSON_PRETTY_PRINT | JSON_UNESCAPED_SLASHES) . "\n",
    );
The summary lives at samples/reader/structure.json.
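The `mediaBox` arrays in that summary are expressed in PDF points. As a rough sketch of turning one into a physical page size, the hypothetical helper below (`mediaBoxToMillimetres()`, not part of phpdftk) assumes four numeric entries in `[llx, lly, urx, ury]` order and the default user unit of 1/72 inch. It ignores any `/UserUnit` or `/Rotate` entry on the page.

```php
<?php

/**
 * Convert a MediaBox [llx, lly, urx, ury] in PDF points to millimetres.
 *
 * @param array{0: int|float, 1: int|float, 2: int|float, 3: int|float} $mediaBox
 * @return array{widthMm: float, heightMm: float}
 */
function mediaBoxToMillimetres(array $mediaBox): array
{
    [$llx, $lly, $urx, $ury] = $mediaBox;
    $pointsToMm = 25.4 / 72; // 1 point = 1/72 inch, 1 inch = 25.4 mm

    return [
        // abs() tolerates boxes whose corners are listed in reverse order.
        'widthMm' => round(abs($urx - $llx) * $pointsToMm, 1),
        'heightMm' => round(abs($ury - $lly) * $pointsToMm, 1),
    ];
}

// A4 expressed in points.
print_r(mediaBoxToMillimetres([0, 0, 595.28, 841.89]));
```

Pages whose `mediaBox` entry is `null` in the summary would need to be skipped before calling the helper.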
Error tolerance
In lenient mode, the reader recovers from common PDF issues:

    $pdf = PdfReader::fromFile('damaged.pdf', strict: false);

    // Check what was wrong
    foreach ($pdf->getParseWarnings() as $warning) {
        echo "Warning: $warning\n";
    }

Recoverable issues include displaced headers, malformed xref tables, and missing trailers (reconstructed via object scanning).
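To make "displaced header" concrete: the `%PDF-x.y` marker is expected at byte 0, but some files carry junk bytes before it. The sketch below shows one way such a header could be located. It is an illustration in plain PHP, not phpdftk's actual recovery code; `findPdfHeader()` and the 1024-byte search window are assumptions chosen for the example.

```php
<?php

/**
 * Locate the "%PDF-x.y" header even when junk bytes precede it.
 *
 * @return array{offset: int, version: string}|null Null when no header is found.
 */
function findPdfHeader(string $bytes, int $searchWindow = 1024): ?array
{
    // Only inspect the start of the file; the window size is an assumption.
    $head = substr($bytes, 0, $searchWindow);

    if (preg_match('/%PDF-(\d\.\d)/', $head, $m, PREG_OFFSET_CAPTURE) !== 1) {
        return null;
    }

    return [
        'offset' => $m[0][1],  // byte position of the "%" character
        'version' => $m[1][0], // e.g. "1.7"
    ];
}

$header = findPdfHeader("junk\n%PDF-1.7\n1 0 obj\n");
var_dump($header);
```

A non-zero `offset` is the kind of condition a lenient parser might report as a warning while still continuing to parse.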
Encryption support
The reader automatically handles all standard encryption methods. In the table, V and R are the version and revision values from the PDF's encryption dictionary:
| Method | Version |
|---|---|
| RC4 40-bit | V=1 R=2 |
| RC4 128-bit | V=2 R=3 |
| AES-128 | V=4 R=4 |
| AES-256 | V=5 R=6 |
| Public-key (Adobe.PubSec) | AES-128/256 |
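For code that inspects an encryption dictionary directly, the V/R rows of the table can be expressed as a lookup. This is a minimal sketch that only mirrors the table above. `describeStandardEncryption()` is a hypothetical helper, not a phpdftk function, and it does not cover the public-key row or any V/R pairs the table omits.

```php
<?php

/**
 * Map the V and R values of a standard security handler to the
 * method names listed in the table. Returns null for unlisted pairs.
 */
function describeStandardEncryption(int $v, int $r): ?string
{
    $methods = [
        '1/2' => 'RC4 40-bit',
        '2/3' => 'RC4 128-bit',
        '4/4' => 'AES-128',
        '5/6' => 'AES-256',
    ];

    return $methods["$v/$r"] ?? null;
}

echo describeStandardEncryption(4, 4) ?? 'unknown', "\n";
```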