phpdftk API Documentation

PdfReader

in package phpdftk

Final: Yes

PDF reader — parses existing PDF files into the phpdftk object model.

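Phase 1 targeted unencrypted PDFs with classic cross-reference tables and returned raw PdfDictionary objects, with typed hydration (into Catalog, Page, etc.) planned for a later phase. The method list below goes further: the factory methods accept a password, fromFilePublicKey() and fromStringPublicKey() read certificate-encrypted PDFs, and the getTyped*() accessors return typed objects.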

Three factory methods mirror the writer's output modes:

$pdf = PdfReader::fromFile('/path/to/document.pdf');
$pdf = PdfReader::fromString($bytes);
$pdf = PdfReader::fromStream(fopen('php://stdin', 'rb'));

Methods

extractAllText() : string: Extract text from all pages, concatenated with page separators.
extractAllTextWithPositions() : array<int, array<int, TextSpan>>: Extract text with precise positioning from all pages.
extractText() : string: Extract text from a page by index (zero-based).
extractTextWithPositions() : array<int, TextSpan>: Extract text with precise positioning from a page by index (zero-based).
fromFile() : self
fromFilePublicKey() : self: Read a public-key (certificate-based) encrypted PDF from a file.
fromStream() : self
fromString() : self
fromStringPublicKey() : self: Read a public-key (certificate-based) encrypted PDF from a string.
getCatalog() : PdfDictionary: Resolve /Root from the trailer — returns the Catalog dictionary.
getEffectiveVersion() : PdfVersion: Effective PDF version — max(header, catalog /Version).
getInfo() : PdfDictionary|null: Resolve /Info from the trailer.
getLinearizationParameters() : array{linearized: float, fileLength: int, firstPageObj: int, firstPageEnd: int, pageCount: int, xrefOffset: int}|null: Get linearization parameters if the PDF is linearized.
getObject() : Serializable: Resolve any object by number.
getPage() : PdfDictionary: Get a specific page by zero-based index.
getPageByteRange() : array{offset: int, length: int}|null: Calculate the byte range for a specific page in a linearized PDF.
getPageCount() : int: Get the total page count from /Pages -> /Count.
getPageOffsetHintTable() : PageOffsetHintTable|null: Parse the page offset hint table from a linearized PDF.
getPages() : array<int, PdfDictionary>: Get all Page dictionaries by traversing the page tree.
getParseWarnings() : array<int, string>: Return warnings accumulated during parsing.
getPdfVersion() : PdfVersion: Typed PDF version from the file header.
getResolver() : ObjectResolver: The underlying object resolver.
getTrailer() : PdfDictionary: The raw trailer dictionary.
getTypedCatalog() : Catalog: Return the document catalog as a typed Catalog object.
getTypedObject() : PdfObject|PdfDictionary: Hydrate any resolved object by object number.
getTypedPage() : Page: Return a specific page as a typed Page object.
getTypedPages() : array<int, Page>: Return all pages as typed Page objects.
getVersion() : string: PDF version string, e.g. "1.7".
isLinearized() : bool: Check whether this PDF is linearized (web-optimized).
resolveReference() : Serializable: Resolve an indirect reference to its target.
validateVersion() : array<int, string>: Scan the document for structural features inconsistent with the declared version. Returns a list of warning strings.

extractAllText()

Extract text from all pages, concatenated with page separators.


    public extractAllText([string $separator = "\n"]) : string

Parameters

$separator : string = "\n": Separator between pages (default: newline)

Return values

string
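A minimal sketch of how the $separator parameter might be used to recover per-page text from the single returned string. The helper name splitPagesText() is hypothetical, and the $reader argument is typed as object only because this page does not show the PdfReader namespace. It assumes the page text itself contains no form feed character.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: returns one string per page.
 * $reader is expected to be a PdfReader instance.
 */
function splitPagesText(object $reader): array
{
    // Use a form feed as the page separator instead of the default "\n",
    // so that newlines inside a page are not mistaken for page breaks.
    $all = $reader->extractAllText("\f");

    // Split the concatenated text back into pages (zero-based list).
    return explode("\f", $all);
}
```

If the positions of the spans matter rather than the plain text, extractAllTextWithPositions() already returns the result keyed by page index, so no separator is needed.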

extractAllTextWithPositions()

Extract text with precise positioning from all pages.


    public extractAllTextWithPositions() : array<int, array<int, TextSpan>>

Return values

array<int, array<int, TextSpan>>: Zero-based page index => spans

extractText()

Extract text from a page by index (zero-based).


    public extractText(int $pageIndex) : string

Interprets content stream operators, resolves font encodings (ToUnicode CMap, /Encoding + /Differences, WinAnsi fallback), and infers spacing from text positioning operators.

Parameters

$pageIndex : int

Return values

string
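Because the page index is zero-based, a loop over pages runs from 0 to getPageCount() - 1. The sketch below searches every page for a substring. The helper name findPagesContaining() is hypothetical, and $reader is typed as object only because the PdfReader namespace is not shown on this page. It requires PHP 8 for str_contains().

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: returns the zero-based indexes of pages whose
 * extracted text contains $needle.
 * $reader is expected to be a PdfReader instance.
 */
function findPagesContaining(object $reader, string $needle): array
{
    $hits = [];
    $count = $reader->getPageCount();

    // Valid page indexes are 0 .. $count - 1.
    for ($i = 0; $i < $count; $i++) {
        if (str_contains($reader->extractText($i), $needle)) {
            $hits[] = $i;
        }
    }

    return $hits;
}
```

The returned indexes can be passed straight back to getPage() or getTypedPage(), which use the same zero-based numbering.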

extractTextWithPositions()

Extract text with precise positioning from a page by index (zero-based).


    public extractTextWithPositions(int $pageIndex) : array<int, TextSpan>

Returns a list of TextSpan objects, each containing the text content, position (x, y in user space), dimensions (width, height), font size, and font name.

Parameters

$pageIndex : int

Return values

array<int, TextSpan>

fromFile()


    public static fromFile(string $path[, string $password = ''][, bool $strict = true]) : self

Parameters

$path : string
$password : string = ''
$strict : bool = true

Return values

self

fromFilePublicKey()

Read a public-key (certificate-based) encrypted PDF from a file.


    public static fromFilePublicKey(string $path, string $certificate, string $privateKey[, bool $strict = true]) : self

Parameters

$path : string
$certificate : string
$privateKey : string
$strict : bool = true

Return values

self

fromStream()


    public static fromStream(resource $stream[, string $password = ''][, bool $strict = true]) : self

Parameters

$stream : resource
$password : string = ''
$strict : bool = true

Return values

self

fromString()


    public static fromString(string $content[, string $password = ''][, bool $strict = true]) : self

Parameters

$content : string
$password : string = ''
$strict : bool = true

Return values

self

fromStringPublicKey()

Read a public-key (certificate-based) encrypted PDF from a string.


    public static fromStringPublicKey(string $content, string $certificate, string $privateKey[, bool $strict = true]) : self

Parameters

$content : string
$certificate : string
$privateKey : string
$strict : bool = true

Return values

self

getCatalog()

Resolve /Root from the trailer — returns the Catalog dictionary.


    public getCatalog() : PdfDictionary

Return values

PdfDictionary

getEffectiveVersion()

Effective PDF version — max(header, catalog /Version).


    public getEffectiveVersion() : PdfVersion

Per ISO 32000-1 §7.5.2 (File Header), the catalog /Version entry (PDF 1.4+) overrides the header version if it is higher.

Return values

PdfVersion
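The max(header, catalog /Version) rule can be sketched on plain version strings as follows. This is an illustration only, not the library's implementation: the function name effectiveVersion() is hypothetical, and it works on strings such as "1.7" rather than on PdfVersion values.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical illustration of the "higher version wins" rule.
 *
 * @param string      $headerVersion  Version from the %PDF-x.y header, e.g. "1.4".
 * @param string|null $catalogVersion Catalog /Version entry, or null if absent.
 */
function effectiveVersion(string $headerVersion, ?string $catalogVersion): string
{
    // No catalog entry: the header version stands.
    if ($catalogVersion === null) {
        return $headerVersion;
    }

    // The catalog entry only takes effect when it is later than the header.
    return version_compare($catalogVersion, $headerVersion, '>')
        ? $catalogVersion
        : $headerVersion;
}
```

For example, a file with header "1.4" that was incrementally updated with a catalog /Version of "1.7" has an effective version of "1.7", while a catalog entry lower than the header is ignored.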

getInfo()

Resolve /Info from the trailer.


    public getInfo() : PdfDictionary|null

Return values

PdfDictionary|null

getLinearizationParameters()

Get linearization parameters if the PDF is linearized.


    public getLinearizationParameters() : array{linearized: float, fileLength: int, firstPageObj: int, firstPageEnd: int, pageCount: int, xrefOffset: int}|null

Return values

array{linearized: float, fileLength: int, firstPageObj: int, firstPageEnd: int, pageCount: int, xrefOffset: int}|null
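One possible use of the returned array is a consistency check between the recorded file length and the real size of the file, since a linearized file that was later appended to no longer matches its own parameters. The sketch below assumes that the fileLength key carries the /L entry of the linearization dictionary, which is inferred from the key name rather than stated on this page. The helper name linearizationMatchesFile() is hypothetical.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: true when the linearization parameters are present
 * and their recorded file length equals the actual size in bytes.
 *
 * @param array|null $params     Result of getLinearizationParameters().
 * @param int        $actualSize Real size of the PDF in bytes, e.g. from filesize().
 */
function linearizationMatchesFile(?array $params, int $actualSize): bool
{
    // null means the PDF is not linearized at all.
    if ($params === null) {
        return false;
    }

    // A mismatch suggests the file changed after it was linearized.
    return $params['fileLength'] === $actualSize;
}
```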

getObject()

Resolve any object by number.


    public getObject(int $objNum) : Serializable

Parameters

$objNum : int

Return values

Serializable

getPage()

Get a specific page by zero-based index.


    public getPage(int $index) : PdfDictionary

Parameters

$index : int

Return values

PdfDictionary

getPageByteRange()

Calculate the byte range for a specific page in a linearized PDF.


    public getPageByteRange(int $pageIndex) : array{offset: int, length: int}|null

Returns an associative array with 'offset' and 'length' keys, or null if the PDF is not linearized or hints are unavailable.

Parameters

$pageIndex : int

Return values

array{offset: int, length: int}|null
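A byte range of this shape maps naturally onto an HTTP Range request, which is one way a client might fetch a single page of a remote linearized file. The sketch below only builds the header line. The function name httpRangeHeader() is hypothetical and is not part of phpdftk.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: turns an {offset, length} pair into an HTTP Range
 * header line, or returns null when no usable range is available.
 *
 * @param array|null $range Result of getPageByteRange().
 */
function httpRangeHeader(?array $range): ?string
{
    // null: the PDF is not linearized or the hints are unavailable.
    if ($range === null || $range['length'] <= 0) {
        return null;
    }

    // HTTP byte ranges are inclusive on both ends, hence the "- 1".
    $first = $range['offset'];
    $last = $first + $range['length'] - 1;

    return sprintf('Range: bytes=%d-%d', $first, $last);
}
```

A caller would need a fallback for the null case, for example downloading the whole file.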

getPageCount()

Get the total page count from /Pages -> /Count.


    public getPageCount() : int

Return values

int

getPageOffsetHintTable()

Parse the page offset hint table from a linearized PDF.


    public getPageOffsetHintTable() : PageOffsetHintTable|null

Returns null if the PDF is not linearized or the hint stream cannot be located/parsed.

Return values

PageOffsetHintTable|null

getPages()

Get all Page dictionaries by traversing the page tree.


    public getPages() : array<int, PdfDictionary>

Return values

array<int, PdfDictionary>

getParseWarnings()

Return warnings accumulated during parsing.


    public getParseWarnings() : array<int, string>

Return values

array<int, string>

getPdfVersion()

Typed PDF version from the file header.


    public getPdfVersion() : PdfVersion

Return values

PdfVersion

getResolver()

The underlying object resolver.


    public getResolver() : ObjectResolver

Return values

ObjectResolver

getTrailer()

The raw trailer dictionary.


    public getTrailer() : PdfDictionary

Return values

PdfDictionary

getTypedCatalog()

Return the document catalog as a typed Catalog object.


    public getTypedCatalog() : Catalog

Return values

Catalog

getTypedObject()

Hydrate any resolved object by object number.


    public getTypedObject(int $objNum) : PdfObject|PdfDictionary

Parameters

$objNum : int

Return values

PdfObject|PdfDictionary

getTypedPage()

Return a specific page as a typed Page object.


    public getTypedPage(int $index) : Page

Parameters

$index : int

Return values

Page

getTypedPages()

Return all pages as typed Page objects.


    public getTypedPages() : array<int, Page>

Return values

array<int, Page>

getVersion()

PDF version string, e.g. "1.7".


    public getVersion() : string

Return values

string

isLinearized()

Check whether this PDF is linearized (web-optimized).


    public isLinearized() : bool

A linearized PDF has a LinearizationParameters dictionary as the very first indirect object, containing a /Linearized key. The reader handles linearized PDFs correctly (via startxref), but does not use the hint tables for progressive loading.

Return values

bool
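The idea of "a dictionary with a /Linearized key in the very first indirect object" can be approximated on raw bytes as shown below. This is a rough heuristic for illustration, not the check the reader performs: the function name looksLinearized() is hypothetical, it relies on a regular expression instead of a real parser, and it assumes the linearization dictionary sits within the first 1024 bytes of the file.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical heuristic: true when the start of the file contains an
 * indirect object whose dictionary has a /Linearized entry.
 */
function looksLinearized(string $pdfBytes): bool
{
    // Only the beginning of the file is relevant for this check.
    $head = substr($pdfBytes, 0, 1024);

    // "N G obj << ... /Linearized <number>" with no ">" before the key,
    // so the key must belong to the dictionary that opens the object.
    $pattern = '#\d+\s+\d+\s+obj\s*<<[^>]*/Linearized\s+[\d.]+#';

    return preg_match($pattern, $head) === 1;
}
```

A true result says nothing about whether the hint tables are intact, which is why getPageOffsetHintTable() and getPageByteRange() can still return null for such a file.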

resolveReference()

Resolve an indirect reference to its target.


    public resolveReference(PdfReference $ref) : Serializable

Parameters

$ref : PdfReference

Return values

Serializable

validateVersion()

Scan the document for structural features inconsistent with the declared version. Returns a list of warning strings.


    public validateVersion() : array<int, string>

Checks top-level indicators that can be detected from raw dictionaries without full object hydration.

Return values

array<int, string>
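Since both validateVersion() and getParseWarnings() return lists of strings, a caller might merge them into one diagnostic report. The sketch below shows one way to do that. The helper name collectDiagnostics() and the "parse:" / "version:" prefixes are hypothetical, and $reader is typed as object only because the PdfReader namespace is not shown on this page.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: gathers parser and version warnings into one list,
 * tagging each entry with its origin.
 * $reader is expected to be a PdfReader instance.
 */
function collectDiagnostics(object $reader): array
{
    $report = [];

    // Warnings accumulated while the file was parsed.
    foreach ($reader->getParseWarnings() as $warning) {
        $report[] = 'parse: ' . $warning;
    }

    // Features that do not fit the declared PDF version.
    foreach ($reader->validateVersion() as $warning) {
        $report[] = 'version: ' . $warning;
    }

    return $report;
}
```

An empty result means neither method reported anything. It does not by itself prove the file is fully conformant, because validateVersion() only checks top-level indicators.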
