phpdftk API Documentation

PdfReader

Final: Yes

PDF reader — parses existing PDF files into the phpdftk object model.

Phase 1 supports unencrypted PDFs with classic cross-reference tables. Returns raw PdfDictionary objects; typed hydration (into Catalog, Page, etc.) is a future phase.

Three factory methods mirror the writer's output modes:

$pdf = PdfReader::fromFile('/path/to/document.pdf');
$pdf = PdfReader::fromString($bytes);
$pdf = PdfReader::fromStream(fopen('php://stdin', 'rb'));

Table of Contents

Methods

extractAllText()  : string
Extract text from all pages, concatenated with page separators.
extractAllTextWithPositions()  : array<int, array<int, TextSpan>>
Extract text with precise positioning from all pages.
extractText()  : string
Extract text from a page by index (zero-based).
extractTextWithPositions()  : array<int, TextSpan>
Extract text with precise positioning from a page by index (zero-based).
fromFile()  : self
fromFilePublicKey()  : self
Read a public-key (certificate-based) encrypted PDF from a file.
fromStream()  : self
fromString()  : self
fromStringPublicKey()  : self
Read a public-key (certificate-based) encrypted PDF from a string.
getCatalog()  : PdfDictionary
Resolve /Root from the trailer — returns the Catalog dictionary.
getEffectiveVersion()  : PdfVersion
Effective PDF version — max(header, catalog /Version).
getInfo()  : PdfDictionary|null
Resolve /Info from the trailer.
getLinearizationParameters()  : array{linearized: float, fileLength: int, firstPageObj: int, firstPageEnd: int, pageCount: int, xrefOffset: int}|null
Get linearization parameters if the PDF is linearized.
getObject()  : Serializable
Resolve any object by number.
getPage()  : PdfDictionary
Get a specific page by zero-based index.
getPageByteRange()  : array{offset: int, length: int}|null
Calculate the byte range for a specific page in a linearized PDF.
getPageCount()  : int
Get the total page count from /Pages -> /Count.
getPageOffsetHintTable()  : PageOffsetHintTable|null
Parse the page offset hint table from a linearized PDF.
getPages()  : array<int, PdfDictionary>
Get all Page dictionaries by traversing the page tree.
getParseWarnings()  : array<int, string>
Return warnings accumulated during parsing.
getPdfVersion()  : PdfVersion
Typed PDF version from the file header.
getResolver()  : ObjectResolver
The underlying object resolver.
getTrailer()  : PdfDictionary
The raw trailer dictionary.
getTypedCatalog()  : Catalog
Return the document catalog as a typed Catalog object.
getTypedObject()  : PdfObject|PdfDictionary
Hydrate any resolved object by object number.
getTypedPage()  : Page
Return a specific page as a typed Page object.
getTypedPages()  : array<int, Page>
Return all pages as typed Page objects.
getVersion()  : string
PDF version string, e.g. "1.7".
isLinearized()  : bool
Check whether this PDF is linearized (web-optimized).
resolveReference()  : Serializable
Resolve an indirect reference to its target.
validateVersion()  : array<int, string>
Scan the document for structural features inconsistent with the declared version. Returns a list of warning strings.

Methods

extractAllText()

Extract text from all pages, concatenated with page separators.

public extractAllText([string $separator = "\n" ]) : string
Parameters
$separator : string = "\n"

Separator between pages (default: newline)

Return values
string
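
A minimal sketch of recovering one string per page from the concatenated output. It assumes a form feed is passed as the separator and never occurs inside page text. The helper name pagesFromAllText is hypothetical, and $reader is typed as a plain object so the sketch does not depend on the library's namespace; any object exposing the documented extractAllText() works.

```php
<?php
declare(strict_types=1);

/**
 * Split the output of extractAllText() back into one string per page.
 *
 * Hypothetical helper, not part of phpdftk. Assumes $separator never
 * appears inside the extracted page text.
 *
 * @return array<int, string>
 */
function pagesFromAllText(object $reader, string $separator = "\f"): array
{
    // Join the pages with the chosen separator, then split on it again.
    // An empty result string yields a single empty entry, so an empty
    // document and a single blank page look the same here.
    return explode($separator, $reader->extractAllText($separator));
}
```

When per-page text is the goal from the start, calling extractText() once per page avoids the separator assumption entirely.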

extractAllTextWithPositions()

Extract text with precise positioning from all pages.

public extractAllTextWithPositions() : array<int, array<int, TextSpan>>
Return values
array<int, array<int, TextSpan>>

Zero-based page index => spans
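
A short sketch of consuming the nested return shape: it lists the pages that produced no spans at all, which often indicates image-only (scanned) pages. The helper name pagesWithoutText is hypothetical, and it relies only on the documented array layout, not on any TextSpan member.

```php
<?php
declare(strict_types=1);

/**
 * Return the zero-based indexes of pages that yielded no text spans.
 *
 * Hypothetical helper, not part of phpdftk.
 *
 * @param array<int, array<int, object>> $spansByPage Result of
 *        extractAllTextWithPositions(): page index => list of spans.
 * @return array<int, int>
 */
function pagesWithoutText(array $spansByPage): array
{
    // array_filter() preserves keys, so the surviving keys are page indexes.
    $empty = array_filter($spansByPage, static fn (array $spans): bool => $spans === []);

    return array_keys($empty);
}
```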

extractText()

Extract text from a page by index (zero-based).

public extractText(int $pageIndex) : string

Interprets content stream operators, resolves font encodings (ToUnicode CMap, /Encoding + /Differences, WinAnsi fallback), and infers spacing from text positioning operators.

Parameters
$pageIndex : int
Return values
string
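
A minimal sketch of page-by-page extraction, combining extractText() with getPageCount(). The helper name textByPage is hypothetical; $reader is typed as a plain object so the sketch does not depend on the library's namespace, and it is expected to be a PdfReader.

```php
<?php
declare(strict_types=1);

/**
 * Extract text from every page, keyed by zero-based page index.
 *
 * Hypothetical helper, not part of phpdftk.
 *
 * @return array<int, string>
 */
function textByPage(object $reader): array
{
    $texts = [];
    $count = $reader->getPageCount();

    // Page indexes are zero-based, so the last valid index is $count - 1.
    for ($index = 0; $index < $count; $index++) {
        $texts[$index] = $reader->extractText($index);
    }

    return $texts;
}
```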

extractTextWithPositions()

Extract text with precise positioning from a page by index (zero-based).

public extractTextWithPositions(int $pageIndex) : array<int, TextSpan>

Returns a list of TextSpan objects, each containing the text content, position (x, y in user space), dimensions (width, height), font size, and font name.

Parameters
$pageIndex : int
Return values
array<int, TextSpan>
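
A sketch of ordering spans top to bottom, then left to right. It assumes each TextSpan exposes its position as public x and y properties; those member names are a guess based on the description above and may differ in the actual class. It also assumes the default PDF user space, where y grows upward, and ignores page rotation. The helper name sortReadingOrder is hypothetical.

```php
<?php
declare(strict_types=1);

/**
 * Sort spans into a simple reading order.
 *
 * Hypothetical helper, not part of phpdftk. Assumes public ->x and ->y
 * properties on each span (an assumption about TextSpan).
 *
 * @param array<int, object> $spans
 * @return array<int, object>
 */
function sortReadingOrder(array $spans): array
{
    // In default user space y increases upward, so a larger y is higher on
    // the page: compare y descending first, then x ascending.
    usort(
        $spans,
        static fn (object $a, object $b): int => [$b->y, $a->x] <=> [$a->y, $b->x]
    );

    return $spans;
}
```

Exact y comparison only groups spans that share a baseline precisely; real documents may need a tolerance when grouping spans into lines.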

fromFile()

public static fromFile(string $path[, string $password = '' ][, bool $strict = true ]) : self
Parameters
$path : string
$password : string = ''
$strict : bool = true
Return values
self

fromFilePublicKey()

Read a public-key (certificate-based) encrypted PDF from a file.

public static fromFilePublicKey(string $path, string $certificate, string $privateKey[, bool $strict = true ]) : self
Parameters
$path : string
$certificate : string
$privateKey : string
$strict : bool = true
Return values
self

fromStream()

public static fromStream(resource $stream[, string $password = '' ][, bool $strict = true ]) : self
Parameters
$stream : resource
$password : string = ''
$strict : bool = true
Return values
self
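
fromStream() takes an open PHP stream resource. The sketch below builds one from bytes already held in memory using the built-in php://memory wrapper, which can be handy in tests. The helper name memoryStream is hypothetical, and whether fromStream() requires a seekable stream is not stated here.

```php
<?php
declare(strict_types=1);

/**
 * Wrap a byte string in a readable, rewound in-memory stream.
 *
 * Hypothetical helper, not part of phpdftk.
 *
 * @return resource
 */
function memoryStream(string $bytes)
{
    // 'w+b' opens the in-memory buffer for binary reading and writing.
    $stream = fopen('php://memory', 'w+b');
    if ($stream === false) {
        throw new RuntimeException('Unable to open php://memory');
    }

    fwrite($stream, $bytes);
    rewind($stream); // position the pointer at the start for the reader

    return $stream;
}
```

The resulting resource can then be passed where the signature above expects $stream; when the bytes are already in a string, fromString() is the more direct route.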

fromString()

public static fromString(string $content[, string $password = '' ][, bool $strict = true ]) : self
Parameters
$content : string
$password : string = ''
$strict : bool = true
Return values
self

fromStringPublicKey()

Read a public-key (certificate-based) encrypted PDF from a string.

public static fromStringPublicKey(string $content, string $certificate, string $privateKey[, bool $strict = true ]) : self
Parameters
$content : string
$certificate : string
$privateKey : string
$strict : bool = true
Return values
self

getEffectiveVersion()

Effective PDF version — max(header, catalog /Version).

public getEffectiveVersion() : PdfVersion

Per ISO 32000 §7.5.2, the catalog /Version entry (PDF 1.4+) overrides the header version if it is higher.

Return values
PdfVersion
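
The max(header, catalog /Version) rule can be sketched on plain version strings with PHP's built-in version_compare(). This illustrates the rule only and is not the library's implementation; the helper name effectiveVersion is hypothetical, and the PdfVersion type is deliberately not used because its API is not shown here.

```php
<?php
declare(strict_types=1);

/**
 * Pick the effective PDF version from the header and catalog values.
 *
 * Hypothetical helper, not part of phpdftk.
 *
 * @param string      $headerVersion  Version from the file header, e.g. "1.4".
 * @param string|null $catalogVersion Catalog /Version value, or null if absent.
 */
function effectiveVersion(string $headerVersion, ?string $catalogVersion): string
{
    // The catalog entry wins only when it is present and strictly higher.
    if ($catalogVersion !== null && version_compare($catalogVersion, $headerVersion, '>')) {
        return $catalogVersion;
    }

    return $headerVersion;
}
```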

getLinearizationParameters()

Get linearization parameters if the PDF is linearized.

public getLinearizationParameters() : array{linearized: float, fileLength: int, firstPageObj: int, firstPageEnd: int, pageCount: int, xrefOffset: int}|null
Return values
array{linearized: float, fileLength: int, firstPageObj: int, firstPageEnd: int, pageCount: int, xrefOffset: int}|null
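
A sketch of one check that can be built on the returned array. ISO 32000-1 Annex F requires the /L value of the linearization parameter dictionary to equal the actual file length, so a mismatch usually means the file was updated incrementally after linearization. The helper name linearizationIsIntact is hypothetical, and it assumes the fileLength key carries the /L value.

```php
<?php
declare(strict_types=1);

/**
 * Check whether the recorded linearized length matches the real file size.
 *
 * Hypothetical helper, not part of phpdftk. Assumes 'fileLength' holds /L.
 *
 * @param array{linearized: float, fileLength: int, firstPageObj: int,
 *              firstPageEnd: int, pageCount: int, xrefOffset: int}|null $params
 * @param int $actualFileSize Size in bytes, e.g. from filesize() or strlen().
 */
function linearizationIsIntact(?array $params, int $actualFileSize): bool
{
    if ($params === null) {
        return false; // the PDF is not linearized at all
    }

    return $params['fileLength'] === $actualFileSize;
}
```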

getPageByteRange()

Calculate the byte range for a specific page in a linearized PDF.

public getPageByteRange(int $pageIndex) : array{offset: int, length: int}|null

Returns an associative array with 'offset' and 'length' keys, or null if the PDF is not linearized or hints are unavailable.

Parameters
$pageIndex : int
Return values
array{offset: int, length: int}|null
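
One possible use of the returned range is fetching a single page over HTTP. The sketch below turns the documented array into a Range header value; byte ranges in HTTP are inclusive, hence the minus one. The helper name rangeHeaderValue is hypothetical, and whether the returned range alone is enough to render a page is not claimed here.

```php
<?php
declare(strict_types=1);

/**
 * Convert a page byte range into an HTTP Range header value.
 *
 * Hypothetical helper, not part of phpdftk.
 *
 * @param array{offset: int, length: int}|null $range Result of getPageByteRange().
 * @return string|null Header value such as "bytes=1000-1499", or null when
 *                     no usable range is available.
 */
function rangeHeaderValue(?array $range): ?string
{
    if ($range === null || $range['length'] <= 0) {
        return null;
    }

    // HTTP byte ranges include both end points.
    $first = $range['offset'];
    $last = $range['offset'] + $range['length'] - 1;

    return sprintf('bytes=%d-%d', $first, $last);
}
```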

getPageCount()

Get the total page count from /Pages -> /Count.

public getPageCount() : int
Return values
int

getPageOffsetHintTable()

Parse the page offset hint table from a linearized PDF.

public getPageOffsetHintTable() : PageOffsetHintTable|null

Returns null if the PDF is not linearized or the hint stream cannot be located/parsed.

Return values
PageOffsetHintTable|null

getParseWarnings()

Return warnings accumulated during parsing.

public getParseWarnings() : array<int, string>
Return values
array<int, string>

getTypedCatalog()

Return the document catalog as a typed Catalog object.

public getTypedCatalog() : Catalog
Return values
Catalog

getTypedPage()

Return a specific page as a typed Page object.

public getTypedPage(int $index) : Page
Parameters
$index : int
Return values
Page

getTypedPages()

Return all pages as typed Page objects.

public getTypedPages() : array<int, Page>
Return values
array<int, Page>

getVersion()

PDF version string, e.g. "1.7".

public getVersion() : string
Return values
string

isLinearized()

Check whether this PDF is linearized (web-optimized).

public isLinearized() : bool

A linearized PDF has a LinearizationParameters dictionary as the very first indirect object, containing a /Linearized key. The reader handles linearized PDFs correctly (via startxref), but does not use the hint tables for progressive loading.

Return values
bool
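
A rough heuristic for the same idea on raw bytes, useful for pre-screening files before constructing a reader. It is not how the library implements isLinearized(). It relies on the ISO 32000-1 Annex F requirement that the linearization parameter dictionary lies within the first 1024 bytes of the file; the helper name looksLinearized and the regular expression are illustrative only.

```php
<?php
declare(strict_types=1);

/**
 * Heuristically detect a linearization parameter dictionary near the start
 * of a PDF.
 *
 * Hypothetical helper, not part of phpdftk. It does not verify that the
 * matching object is the very first one, nor that /L matches the file size.
 */
function looksLinearized(string $pdfBytes): bool
{
    // Only the first 1024 bytes need to be inspected.
    $head = substr($pdfBytes, 0, 1024);

    // Match "N G obj << ... /Linearized" with no nested '>' before the key.
    return preg_match('/\d+\s+\d+\s+obj\s*<<[^>]*\/Linearized\b/', $head) === 1;
}
```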

validateVersion()

Scan the document for structural features inconsistent with the declared version. Returns a list of warning strings.

public validateVersion() : array<int, string>

Checks top-level indicators that can be detected from raw dictionaries without full object hydration.

Return values
array<int, string>
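
A small sketch that gathers both warning lists, from getParseWarnings() and validateVersion(), into one labeled report. The helper name collectDiagnostics and the bracketed labels are hypothetical; $reader is typed as a plain object so the sketch does not depend on the library's namespace.

```php
<?php
declare(strict_types=1);

/**
 * Merge parse warnings and version warnings into one labeled list.
 *
 * Hypothetical helper, not part of phpdftk.
 *
 * @return array<int, string>
 */
function collectDiagnostics(object $reader): array
{
    $lines = [];

    // Warnings accumulated while the file was parsed.
    foreach ($reader->getParseWarnings() as $warning) {
        $lines[] = '[parse] ' . $warning;
    }

    // Features inconsistent with the declared PDF version.
    foreach ($reader->validateVersion() as $warning) {
        $lines[] = '[version] ' . $warning;
    }

    return $lines;
}
```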

        