PdfReader

PdfReader parses existing PDF files into the phpdftk object model. It handles classic xref tables, cross-reference streams, object streams, incremental updates, and encrypted PDFs.

use Phpdftk\Pdf\Reader\PdfReader;

// From file
$pdf = PdfReader::fromFile('document.pdf');
// From string (e.g., HTTP response body)
$pdf = PdfReader::fromString($bytes);
// From stream resource
$pdf = PdfReader::fromStream(fopen('php://stdin', 'rb'));
// Encrypted PDF
$pdf = PdfReader::fromFile('secured.pdf', password: 'secret');
// Public-key encrypted PDF
$pdf = PdfReader::fromFilePublicKey('secured.pdf', $certPem, $keyPem);

echo $pdf->getVersion(); // "1.7"
echo $pdf->getPageCount(); // 42
// Linearization detection
if ($pdf->isLinearized()) {
    $params = $pdf->getLinearizationParameters();
    echo "Web-optimized, {$params['pageCount']} pages";
}

// All pages as raw dictionaries
$pages = $pdf->getPages();
// Single page by 0-based index
$page = $pdf->getPage(0);
// Typed Page objects (hydrated into Core\Document\Page)
$typedPages = $pdf->getTypedPages();
$typedPage = $pdf->getTypedPage(0);

$catalog = $pdf->getCatalog(); // raw PdfDictionary
$typed = $pdf->getTypedCatalog(); // hydrated Core\Document\Catalog
$trailer = $pdf->getTrailer();
$info = $pdf->getInfo();

// By object number
$obj = $pdf->getObject(42);
// By reference
$target = $pdf->resolveReference($ref);
// Typed hydration of any object
$typed = $pdf->getTypedObject(42);

// Single page (0-based index)
$text = $pdf->extractText(0);
// All pages concatenated
$allText = $pdf->extractAllText("\n\n");

The example below reads the page-labels showcase PDF and writes both the joined and per-page transcriptions to disk.

use Phpdftk\Pdf\Reader\PdfReader;
// Reuse the page-labels showcase PDF as a realistic input — it has front matter,
// chapters, and an appendix with distinctly labelled pages.
$inputPdf = example_output_path('writer/page-labels.pdf');
$reader = PdfReader::fromFile($inputPdf);
// extractAllText() concatenates every page's text with a separator.
// extractText($i) returns one page at a time when you only need a slice.
$allText = $reader->extractAllText("\n\n--- page break ---\n\n");
// Persist both the per-page and full-document forms so the docs page can show both.
file_put_contents(example_output_path('reader/text-output.txt'), $allText);
$perPage = [];
for ($i = 0, $n = $reader->getPageCount(); $i < $n; $i++) {
    $perPage[] = sprintf("=== page %d ===\n%s", $i + 1, $reader->extractText($i));
}
file_put_contents(
    example_output_path('reader/text-per-page.txt'),
    implode("\n\n", $perPage),
);

For per-page access, search, and a friendlier 1-based API, see Working with PDFs → Text Extractor. That page also covers how text extraction works under the hood (ToUnicode CMaps, encoding fallbacks, Form XObject recursion).
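When only a quick lookup is needed, the per-page loop shown earlier can feed a plain string search. The following is a minimal sketch under stated assumptions: `findPagesContaining()` is a hypothetical helper, not part of phpdftk, and the sample array stands in for text that would normally come from `extractText($i)`.

```php
<?php
/**
 * Hypothetical helper (not part of phpdftk): returns the 1-based numbers of
 * the pages whose extracted text contains $needle. The comparison uses
 * stripos(), so it is case-insensitive for ASCII letters only.
 *
 * @param list<string> $pageTexts one entry per page, in page order
 * @return list<int>
 */
function findPagesContaining(array $pageTexts, string $needle): array
{
    $hits = [];
    if ($needle === '') {
        return $hits; // an empty needle would match every page, so treat it as no match
    }
    foreach ($pageTexts as $index => $text) {
        if (stripos($text, $needle) !== false) {
            $hits[] = $index + 1; // convert the 0-based index to a page number
        }
    }
    return $hits;
}

// Stand-in data; with a real reader each entry would be $reader->extractText($i).
$pageTexts = ['Preface', 'Chapter 1: Revenue', 'Appendix: revenue tables'];
print_r(findPagesContaining($pageTexts, 'revenue')); // pages 2 and 3
```

Collecting the page texts once and searching the array avoids re-extracting a page for every query.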

getInfo() returns the document’s /Info dictionary. The example below builds a PDF with rich metadata, reads it back, and writes a JSON summary.

use Phpdftk\Pdf\Core\Document\Info;
use Phpdftk\Pdf\Core\Font\StandardFont;
use Phpdftk\Pdf\Core\Font\Type1Font;
use Phpdftk\Pdf\Core\PdfString;
use Phpdftk\Pdf\Reader\PdfReader;
use Phpdftk\Pdf\Writer\PdfWriter;
// Step 1: write a small PDF with rich /Info metadata + a synced XMP packet.
$writer = new PdfWriter();
$page = $writer->addPage();
$writer->addFont(new Type1Font(StandardFont::Helvetica));
$writer->addContentStream($page)
    ->beginText()->setFont('F1', 18)->moveTextPosition(72, 720)
    ->showText('Document with rich metadata')->endText();
$info = new Info();
$info->title = new PdfString('Q4 2026 Quarterly Report');
$info->author = new PdfString('Finance Team');
$info->subject = new PdfString('Revenue and expense summary');
$info->keywords = new PdfString('finance, quarterly, 2026, Q4');
$info->creator = new PdfString('phpdftk metadata showcase');
$info->producer = new PdfString('phpdftk');
$info->creationDate = new PdfString('D:20260512000000Z');
$writer->setInfo($info);
$writer->syncInfoToMetadata();
$inputPdf = example_output_path('reader/metadata-source.pdf');
$writer->save($inputPdf);
// Step 2: read the same PDF back and extract its metadata.
$reader = PdfReader::fromFile($inputPdf);
$infoDict = $reader->getInfo();
$decode = static function (\Phpdftk\Pdf\Core\Serializable $value): string {
    if ($value instanceof \Phpdftk\Pdf\Core\PdfString) {
        return $value->value;
    }
    if ($value instanceof \Phpdftk\Pdf\Core\PdfName) {
        return $value->value;
    }
    return (string) $value;
};
$summary = [
    'pdfVersion' => $reader->getVersion(),
    'pageCount' => $reader->getPageCount(),
    'linearized' => $reader->isLinearized(),
    'info' => [],
];
foreach (['Title', 'Author', 'Subject', 'Keywords', 'Creator', 'Producer', 'CreationDate', 'ModDate'] as $key) {
    if ($infoDict?->has($key)) {
        $summary['info'][$key] = $decode($infoDict->get($key));
    }
}
file_put_contents(
    example_output_path('reader/metadata.json'),
    json_encode($summary, JSON_PRETTY_PRINT | JSON_UNESCAPED_SLASHES) . "\n",
);

The captured summary lives at samples/reader/metadata.json.
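Date entries such as CreationDate are decoded above as raw PDF date strings (for example `D:20260512000000Z`). If a native date object is more convenient, a conversion along these lines may help. This is only a sketch: `parsePdfDate()` is a hypothetical helper rather than a phpdftk function, it accepts only the full `D:YYYYMMDDHHmmSS` form with an optional `Z` or `+HH'mm'` suffix, and it assumes UTC when no offset is present.

```php
<?php
/**
 * Hypothetical helper (not part of phpdftk): converts a full-precision PDF
 * date string into a DateTimeImmutable, or returns null if the input does
 * not match the expected shape.
 */
function parsePdfDate(string $raw): ?DateTimeImmutable
{
    // 14 digits (YYYYMMDDHHmmSS), then optionally "Z" or an offset like +05'30'.
    if (!preg_match('/^D:(\d{14})(Z|[+-]\d{2}\'\d{2}\'?)?$/', $raw, $m)) {
        return null;
    }
    // Assumption: a missing offset is treated as UTC.
    $zone = $m[2] ?? 'Z';
    $offset = $zone === 'Z'
        ? '+00:00'
        : substr($zone, 0, 3) . ':' . substr($zone, 4, 2); // +05'30' becomes +05:30
    // The leading "!" zeroes any field the format does not set (e.g. microseconds).
    $date = DateTimeImmutable::createFromFormat('!YmdHisP', $m[1] . $offset);
    return $date === false ? null : $date;
}

echo parsePdfDate('D:20260512000000Z')?->format('c'), "\n"; // 2026-05-12T00:00:00+00:00
```

Shorter date forms (year only, year and month, and so on) are deliberately rejected here and would need extra branches.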

getCatalog() and getPages() expose the document’s structural skeleton. The example below walks the outline showcase PDF and emits a JSON description of its catalog and page tree.

use Phpdftk\Pdf\Reader\PdfReader;
// Inspect the structure of the outline showcase PDF — it has multiple pages,
// nested bookmarks, and named destinations.
$inputPdf = example_output_path('writer/outline.pdf');
$reader = PdfReader::fromFile($inputPdf);
$catalog = $reader->getCatalog();
$pages = $reader->getPages();
// Summarise the page tree.
$pageSummary = [];
foreach ($pages as $i => $page) {
    $mediaBox = null;
    if ($page->has('MediaBox')) {
        $box = $page->get('MediaBox');
        if ($box instanceof \Phpdftk\Pdf\Core\PdfArray) {
            $mediaBox = array_map(
                fn ($n) => $n instanceof \Phpdftk\Pdf\Core\PdfNumber ? $n->value : null,
                $box->items,
            );
        }
    }
    $pageSummary[] = [
        'index' => $i,
        'mediaBox' => $mediaBox,
        'hasContents' => $page->has('Contents'),
        'hasAnnots' => $page->has('Annots'),
    ];
}
// Summarise the catalog's top-level keys (only the structural ones the reader
// can resolve without a typed schema).
$catalogKeys = [];
foreach (['Type', 'Version', 'Pages', 'Outlines', 'Names', 'PageLabels', 'AcroForm', 'OpenAction'] as $key) {
    $catalogKeys[$key] = $catalog->has($key) ? 'present' : 'absent';
}
$summary = [
    'pdfVersion' => $reader->getVersion(),
    'linearized' => $reader->isLinearized(),
    'pageCount' => $reader->getPageCount(),
    'catalog' => $catalogKeys,
    'pages' => $pageSummary,
];
file_put_contents(
    example_output_path('reader/structure.json'),
    json_encode($summary, JSON_PRETTY_PRINT | JSON_UNESCAPED_SLASHES) . "\n",
);

The summary lives at samples/reader/structure.json.
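The `mediaBox` arrays in that summary are in PDF user-space units, which default to points (1/72 inch). A small conversion can make them easier to read. The sketch below assumes the four-number `[llx, lly, urx, ury]` layout produced by the loop above; `mediaBoxSize()` is a hypothetical helper, not part of phpdftk, and it ignores any /UserUnit or /Rotate entry on the page.

```php
<?php
/**
 * Hypothetical helper (not part of phpdftk): derives page width and height
 * from a MediaBox given as [llx, lly, urx, ury] in points.
 *
 * @param array{0: int|float, 1: int|float, 2: int|float, 3: int|float} $mediaBox
 * @return array{widthPt: int|float, heightPt: int|float, widthMm: float, heightMm: float}
 */
function mediaBoxSize(array $mediaBox): array
{
    [$llx, $lly, $urx, $ury] = $mediaBox;
    // abs() tolerates boxes whose corners are listed in the opposite order.
    $width = abs($urx - $llx);
    $height = abs($ury - $lly);
    return [
        'widthPt' => $width,
        'heightPt' => $height,
        // 1 pt = 1/72 inch and 1 inch = 25.4 mm.
        'widthMm' => round($width * 25.4 / 72, 1),
        'heightMm' => round($height * 25.4 / 72, 1),
    ];
}

// US Letter is 612 x 792 points.
print_r(mediaBoxSize([0, 0, 612, 792]));
```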

In lenient mode, the reader recovers from common PDF issues:

$pdf = PdfReader::fromFile('damaged.pdf', strict: false);
// Check what was wrong
foreach ($pdf->getParseWarnings() as $warning) {
echo "Warning: $warning\n";
}

Recoverable issues include displaced headers, malformed xref tables, and missing trailers (reconstructed via object scanning).
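To illustrate what a displaced header looks like, the sketch below searches for the `%PDF-` marker near the start of a byte string. It is not the reader's actual recovery code: `findPdfHeaderOffset()` is a hypothetical helper, and the 1024-byte search window is an assumption borrowed from the common viewer convention of accepting a header anywhere within the first 1024 bytes.

```php
<?php
/**
 * Hypothetical helper (not part of phpdftk): returns the byte offset of the
 * "%PDF-" marker within the first 1024 bytes, or null if it is not found.
 * Offset 0 is a well-formed file; a positive offset means leading junk
 * precedes the header.
 */
function findPdfHeaderOffset(string $bytes): ?int
{
    // Assumption: only the first 1024 bytes are inspected.
    $pos = strpos(substr($bytes, 0, 1024), '%PDF-');
    return $pos === false ? null : $pos;
}

var_dump(findPdfHeaderOffset("%PDF-1.7\n"));       // int(0)
var_dump(findPdfHeaderOffset("junk\n%PDF-1.4\n")); // int(5)
var_dump(findPdfHeaderOffset('no header here'));   // NULL
```

A positive offset is one hint that a file may only load with `strict: false`.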

The reader automatically handles all standard encryption methods:

| Method | Version |
| --- | --- |
| RC4 40-bit | V=1 R=2 |
| RC4 128-bit | V=2 R=3 |
| AES-128 | V=4 R=4 |
| AES-256 | V=5 R=6 |
| Public-key (Adobe.PubSec) | AES-128/256 |
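For logging or diagnostics, the password-based rows of the table can be expressed as a lookup on the V and R values of the /Encrypt dictionary. This is a sketch only: `describeEncryption()` is a hypothetical helper, not part of phpdftk, and it mirrors the table as written. Real V=4 documents select their cipher through crypt filters, so a complete check would inspect those too.

```php
<?php
/**
 * Hypothetical helper (not part of phpdftk): maps the /V and /R values of a
 * password-based /Encrypt dictionary to the method names listed in the table.
 * Requires PHP 8.0+ for the match expression.
 */
function describeEncryption(int $v, int $r): string
{
    // match compares with ===, so each [V, R] pair must match exactly.
    return match ([$v, $r]) {
        [1, 2] => 'RC4 40-bit',
        [2, 3] => 'RC4 128-bit',
        [4, 4] => 'AES-128',
        [5, 6] => 'AES-256',
        default => 'unknown',
    };
}

echo describeEncryption(5, 6), "\n"; // AES-256
```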