PdfReader

PdfReader parses existing PDF files into the phpdftk object model. It handles classic xref tables, cross-reference streams, object streams, incremental updates, and encrypted PDFs.

Opening a PDF

use Phpdftk\Pdf\Reader\PdfReader;

// From file
$pdf = PdfReader::fromFile('document.pdf');

// From string (e.g., HTTP response body)
$pdf = PdfReader::fromString($bytes);

// From stream resource
$pdf = PdfReader::fromStream(fopen('php://stdin', 'rb'));

// Encrypted PDF
$pdf = PdfReader::fromFile('secured.pdf', password: 'secret');

// Public-key encrypted PDF
$pdf = PdfReader::fromFilePublicKey('secured.pdf', $certPem, $keyPem);

Document info

echo $pdf->getVersion();    // "1.7"
echo $pdf->getPageCount();  // 42

// Linearization detection
if ($pdf->isLinearized()) {
    $params = $pdf->getLinearizationParameters();
    echo "Web-optimized, {$params['pageCount']} pages";
}

Accessing pages

// All pages as raw dictionaries
$pages = $pdf->getPages();

// Single page by 0-based index
$page = $pdf->getPage(0);

// Typed Page objects (hydrated into Core\Document\Page)
$typedPages = $pdf->getTypedPages();
$typedPage = $pdf->getTypedPage(0);

Catalog and trailer

$catalog = $pdf->getCatalog();      // raw PdfDictionary
$typed = $pdf->getTypedCatalog();   // hydrated Core\Document\Catalog
$trailer = $pdf->getTrailer();
$info = $pdf->getInfo();

Resolving objects

// By object number
$obj = $pdf->getObject(42);

// By reference
$target = $pdf->resolveReference($ref);

// Typed hydration of any object
$typed = $pdf->getTypedObject(42);

Text extraction

// Single page (0-based index)
$text = $pdf->extractText(0);

// All pages concatenated
$allText = $pdf->extractAllText("\n\n");

The example below reads the page-labels showcase PDF and writes both the joined and per-page transcriptions to disk.

use Phpdftk\Pdf\Reader\PdfReader;

// Reuse the page-labels showcase PDF as a realistic input — it has front matter,
// chapters, and an appendix with distinctly labelled pages.
$inputPdf = example_output_path('writer/page-labels.pdf');
$reader = PdfReader::fromFile($inputPdf);

// extractAllText() concatenates every page's text with a separator.
// extractText($i) returns one page at a time when you only need a slice.
$allText = $reader->extractAllText("\n\n--- page break ---\n\n");

// Persist both the per-page and full-document forms so the docs page can show both.
file_put_contents(example_output_path('reader/text-output.txt'), $allText);

$perPage = [];
for ($i = 0, $n = $reader->getPageCount(); $i < $n; $i++) {
    $perPage[] = sprintf("=== page %d ===\n%s", $i + 1, $reader->extractText($i));
}
file_put_contents(
    example_output_path('reader/text-per-page.txt'),
    implode("\n\n", $perPage),
);


For per-page access, search, and a friendlier 1-based API, see Working with PDFs → Text Extractor. That page also covers how text extraction works under the hood (ToUnicode CMaps, encoding fallbacks, Form XObject recursion).
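
When a simple keyword search is all that is needed, the per-page strings collected in the loop above can be scanned directly. The sketch below is a hypothetical helper (findPagesContaining is not part of phpdftk) that takes page texts indexed from 0 and reports 1-based page numbers. The sample strings are stand-ins for real extractText() output.

```php
/**
 * Hypothetical helper, not part of phpdftk: returns the 1-based page numbers
 * whose text contains $needle (ASCII case-insensitive).
 *
 * @param list<string> $pageTexts page texts indexed from 0
 * @return list<int>
 */
function findPagesContaining(array $pageTexts, string $needle): array
{
    if ($needle === '') {
        return [];
    }

    $hits = [];
    foreach (array_values($pageTexts) as $index => $text) {
        if (stripos($text, $needle) !== false) {
            $hits[] = $index + 1; // convert the 0-based index to a page number
        }
    }
    return $hits;
}

// Stand-in data; in practice each entry would come from $reader->extractText($i).
$pageTexts = ['Preface', 'Chapter 1: Revenue', 'Appendix: revenue tables'];
$matches = findPagesContaining($pageTexts, 'revenue');
echo implode(', ', $matches), "\n"; // prints "2, 3"
```

Keeping the conversion to 1-based numbers in one place avoids off-by-one mistakes when results are shown to users but fed back into the 0-based reader methods.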

Extracting metadata

getInfo() returns the document’s /Info dictionary. The example below builds a PDF with rich metadata, reads it back, and writes a JSON summary.

use Phpdftk\Pdf\Core\Document\Info;
use Phpdftk\Pdf\Core\Font\StandardFont;
use Phpdftk\Pdf\Core\Font\Type1Font;
use Phpdftk\Pdf\Core\PdfString;
use Phpdftk\Pdf\Reader\PdfReader;
use Phpdftk\Pdf\Writer\PdfWriter;

// Step 1: write a small PDF with rich /Info metadata + a synced XMP packet.
$writer = new PdfWriter();
$page = $writer->addPage();
$writer->addFont(new Type1Font(StandardFont::Helvetica));
$writer->addContentStream($page)
    ->beginText()->setFont('F1', 18)->moveTextPosition(72, 720)
    ->showText('Document with rich metadata')->endText();

$info = new Info();
$info->title    = new PdfString('Q4 2026 Quarterly Report');
$info->author   = new PdfString('Finance Team');
$info->subject  = new PdfString('Revenue and expense summary');
$info->keywords = new PdfString('finance, quarterly, 2026, Q4');
$info->creator  = new PdfString('phpdftk metadata showcase');
$info->producer = new PdfString('phpdftk');
$info->creationDate = new PdfString('D:20260512000000Z');
$writer->setInfo($info);
$writer->syncInfoToMetadata();

$inputPdf = example_output_path('reader/metadata-source.pdf');
$writer->save($inputPdf);

// Step 2: read the same PDF back and extract its metadata.
$reader = PdfReader::fromFile($inputPdf);
$infoDict = $reader->getInfo();

// PdfString and PdfName both expose their text via ->value; anything else
// falls back to its string form.
$decode = static function (\Phpdftk\Pdf\Core\Serializable $value): string {
    if ($value instanceof PdfString || $value instanceof \Phpdftk\Pdf\Core\PdfName) {
        return $value->value;
    }
    return (string) $value;
};

$summary = [
    'pdfVersion' => $reader->getVersion(),
    'pageCount'  => $reader->getPageCount(),
    'linearized' => $reader->isLinearized(),
    'info'       => [],
];
foreach (['Title', 'Author', 'Subject', 'Keywords', 'Creator', 'Producer', 'CreationDate', 'ModDate'] as $key) {
    if ($infoDict?->has($key)) {
        $summary['info'][$key] = $decode($infoDict->get($key));
    }
}

file_put_contents(
    example_output_path('reader/metadata.json'),
    json_encode($summary, JSON_PRETTY_PRINT | JSON_UNESCAPED_SLASHES) . "\n",
);


The captured summary lives at samples/reader/metadata.json.
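
In the example above, CreationDate is written as a raw PDF date string (D:20260512000000Z) and decoded back as plain text. If a real date object is needed, something like the following sketch could convert it. parsePdfDate is a hypothetical helper, not part of phpdftk. It follows the D:YYYYMMDDHHmmSSOHH'mm' layout from the PDF specification and, as a simplifying assumption, treats a missing offset as UTC.

```php
/**
 * Hypothetical helper, not part of phpdftk: parses a PDF date string of the
 * form D:YYYYMMDDHHmmSSOHH'mm', where every field after the year is optional.
 * Assumption: a missing offset is treated as UTC. Returns null on no match.
 */
function parsePdfDate(string $raw): ?DateTimeImmutable
{
    $pattern = '/^(?:D:)?(\d{4})(\d{2})?(\d{2})?(\d{2})?(\d{2})?(\d{2})?'
             . '(?:Z|([+-])(\d{2})\'?(\d{2})?\'?)?$/';
    if (preg_match($pattern, trim($raw), $m, PREG_UNMATCHED_AS_NULL) !== 1) {
        return null;
    }

    // Omitted fields default to the first month/day and to midnight.
    $iso = sprintf(
        '%s-%s-%sT%s:%s:%s%s',
        $m[1],
        $m[2] ?? '01',
        $m[3] ?? '01',
        $m[4] ?? '00',
        $m[5] ?? '00',
        $m[6] ?? '00',
        isset($m[7]) ? $m[7] . $m[8] . ':' . ($m[9] ?? '00') : '+00:00',
    );

    $date = DateTimeImmutable::createFromFormat('!Y-m-d\TH:i:sP', $iso);
    return $date === false ? null : $date;
}

$created = parsePdfDate('D:20260512000000Z');
echo $created?->format(DATE_ATOM), "\n"; // prints "2026-05-12T00:00:00+00:00"
```

Returning null instead of throwing lets the caller keep the raw string in the JSON summary when a producer has written a non-conforming date.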

Inspecting structure

getCatalog() and getPages() expose the document’s structural skeleton. The example below walks the outline showcase PDF and emits a JSON description of its catalog and page tree.

use Phpdftk\Pdf\Reader\PdfReader;

// Inspect the structure of the outline showcase PDF — it has multiple pages,
// nested bookmarks, and named destinations.
$inputPdf = example_output_path('writer/outline.pdf');
$reader = PdfReader::fromFile($inputPdf);

$catalog = $reader->getCatalog();
$pages   = $reader->getPages();

// Summarise the page tree.
$pageSummary = [];
foreach ($pages as $i => $page) {
    $mediaBox = null;
    if ($page->has('MediaBox')) {
        $box = $page->get('MediaBox');
        if ($box instanceof \Phpdftk\Pdf\Core\PdfArray) {
            $mediaBox = array_map(
                fn ($n) => $n instanceof \Phpdftk\Pdf\Core\PdfNumber ? $n->value : null,
                $box->items,
            );
        }
    }
    $pageSummary[] = [
        'index'    => $i,
        'mediaBox' => $mediaBox,
        'hasContents' => $page->has('Contents'),
        'hasAnnots'   => $page->has('Annots'),
    ];
}

// Summarise the catalog's top-level keys (only the structural ones the reader
// can resolve without a typed schema).
$catalogKeys = [];
foreach (['Type', 'Version', 'Pages', 'Outlines', 'Names', 'PageLabels', 'AcroForm', 'OpenAction'] as $key) {
    $catalogKeys[$key] = $catalog->has($key) ? 'present' : 'absent';
}

$summary = [
    'pdfVersion' => $reader->getVersion(),
    'linearized' => $reader->isLinearized(),
    'pageCount'  => $reader->getPageCount(),
    'catalog'    => $catalogKeys,
    'pages'      => $pageSummary,
];

file_put_contents(
    example_output_path('reader/structure.json'),
    json_encode($summary, JSON_PRETTY_PRINT | JSON_UNESCAPED_SLASHES) . "\n",
);


The summary lives at samples/reader/structure.json.
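
The mediaBox arrays in that summary are expressed in PDF points. As a rough illustration, the sketch below converts one to millimetres. mediaBoxToMillimetres is a hypothetical helper, not part of phpdftk. It assumes four numeric entries in [llx, lly, urx, ury] order and the default user unit of 1/72 inch.

```php
/**
 * Hypothetical helper, not part of phpdftk: converts a MediaBox
 * [llx, lly, urx, ury] in PDF points (1/72 inch) to a size in millimetres.
 *
 * @param array{0: int|float, 1: int|float, 2: int|float, 3: int|float} $box
 * @return array{width: float, height: float}
 */
function mediaBoxToMillimetres(array $box): array
{
    [$llx, $lly, $urx, $ury] = $box;
    $mmPerPoint = 25.4 / 72;

    // abs() tolerates boxes whose corners are listed in the opposite order.
    return [
        'width'  => round(abs($urx - $llx) * $mmPerPoint, 1),
        'height' => round(abs($ury - $lly) * $mmPerPoint, 1),
    ];
}

$size = mediaBoxToMillimetres([0, 0, 612, 792]); // US Letter in points
printf("%.1f x %.1f mm\n", $size['width'], $size['height']); // prints "215.9 x 279.4 mm"
```

Entries that the summary recorded as null (non-numeric MediaBox values) would need to be filtered out before calling a helper like this.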

Error tolerance

In lenient mode (strict: false), the reader recovers from common PDF issues and records each one as a parse warning:

$pdf = PdfReader::fromFile('damaged.pdf', strict: false);

// Check what was wrong
foreach ($pdf->getParseWarnings() as $warning) {
    echo "Warning: $warning\n";
}

Recoverable issues include displaced headers, malformed xref tables, and missing trailers (reconstructed via object scanning).
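
To give a feel for what reconstruction via object scanning involves, here is a deliberately simplified sketch. It is not phpdftk's implementation. scanObjectOffsets is a hypothetical function that searches raw bytes for "N G obj" headers and records where each object starts. A real recovery pass also has to cope with false matches inside stream data and with locating the catalog.

```php
/**
 * Simplified illustration, not phpdftk's implementation: scans raw PDF bytes
 * for "<number> <generation> obj" headers and returns objectNumber => offset.
 * A later definition of the same object number overwrites an earlier one,
 * which mirrors how incremental updates supersede older objects.
 *
 * @return array<int, int>
 */
function scanObjectOffsets(string $bytes): array
{
    $offsets = [];
    preg_match_all(
        '/\b(\d+)\s+(\d+)\s+obj\b/',
        $bytes,
        $matches,
        PREG_SET_ORDER | PREG_OFFSET_CAPTURE,
    );

    foreach ($matches as $match) {
        // With PREG_OFFSET_CAPTURE each group is a [text, byteOffset] pair.
        [$numberText, $offset] = $match[1];
        $offsets[(int) $numberText] = $offset;
    }
    return $offsets;
}

$sample = "%PDF-1.7\n"
    . "1 0 obj\n<< /Type /Catalog /Pages 2 0 R >>\nendobj\n"
    . "2 0 obj\n<< /Type /Pages /Count 0 /Kids [] >>\nendobj\n";

$offsets = scanObjectOffsets($sample);
print_r($offsets);
```

Indirect references such as "2 0 R" are not matched because the pattern requires the obj keyword, and "endobj" is skipped because no digits precede it.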

Encryption support

The reader handles the following encryption methods automatically; supply the password or key pair when opening the file (see Opening a PDF):

Method	Version (V) and revision (R), or cipher
RC4 40-bit	V=1 R=2
RC4 128-bit	V=2 R=3
AES-128	V=4 R=4
AES-256	V=5 R=6
Public-key (Adobe.PubSec)	AES-128/256
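
When inspecting an encryption dictionary by hand, the V and R values can be mapped back to the method names in the table above. The lookup below is a hypothetical sketch, not a phpdftk API. It mirrors only the password-based rows of the table and returns null for any other combination.

```php
/**
 * Hypothetical lookup, not part of phpdftk: maps the V (version) and
 * R (revision) values of a password-based encryption dictionary to the
 * method names used in the table above.
 */
function describeStandardEncryption(int $v, int $r): ?string
{
    return match ([$v, $r]) {
        [1, 2] => 'RC4 40-bit',
        [2, 3] => 'RC4 128-bit',
        [4, 4] => 'AES-128',
        [5, 6] => 'AES-256',
        default => null, // combination not listed in the table
    };
}

echo describeStandardEncryption(5, 6) ?? 'unknown', "\n"; // prints "AES-256"
```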