PdfReader
PDF reader — parses existing PDF files into the phpdftk object model.
Phase 1 supports unencrypted PDFs with classic cross-reference tables.
Returns raw PdfDictionary objects; typed hydration (into Catalog,
Page, etc.) is a future phase.
Three factory methods mirror the writer's output modes:
$pdf = PdfReader::fromFile('/path/to/document.pdf');
$pdf = PdfReader::fromString($bytes);
$pdf = PdfReader::fromStream(fopen('php://stdin', 'rb'));
Table of Contents
Methods
- extractAllText() : string
- Extract text from all pages, concatenated with page separators.
- extractAllTextWithPositions() : array<int, array<int, TextSpan>>
- Extract text with precise positioning from all pages.
- extractText() : string
- Extract text from a page by index (zero-based).
- extractTextWithPositions() : array<int, TextSpan>
- Extract text with precise positioning from a page by index (zero-based).
- fromFile() : self
- fromFilePublicKey() : self
- Read a public-key (certificate-based) encrypted PDF from a file.
- fromStream() : self
- fromString() : self
- fromStringPublicKey() : self
- Read a public-key (certificate-based) encrypted PDF from a string.
- getCatalog() : PdfDictionary
- Resolve /Root from the trailer — returns the Catalog dictionary.
- getEffectiveVersion() : PdfVersion
- Effective PDF version — max(header, catalog /Version).
- getInfo() : PdfDictionary|null
- Resolve /Info from the trailer.
- getLinearizationParameters() : array{linearized: float, fileLength: int, firstPageObj: int, firstPageEnd: int, pageCount: int, xrefOffset: int}|null
- Get linearization parameters if the PDF is linearized.
- getObject() : Serializable
- Resolve any object by number.
- getPage() : PdfDictionary
- Get a specific page by zero-based index.
- getPageByteRange() : array{offset: int, length: int}|null
- Calculate the byte range for a specific page in a linearized PDF.
- getPageCount() : int
- Get the total page count from /Pages -> /Count.
- getPageOffsetHintTable() : PageOffsetHintTable|null
- Parse the page offset hint table from a linearized PDF.
- getPages() : array<int, PdfDictionary>
- Get all Page dictionaries by traversing the page tree.
- getParseWarnings() : array<int, string>
- Return warnings accumulated during parsing.
- getPdfVersion() : PdfVersion
- Typed PDF version from the file header.
- getResolver() : ObjectResolver
- The underlying object resolver.
- getTrailer() : PdfDictionary
- The raw trailer dictionary.
- getTypedCatalog() : Catalog
- Return the document catalog as a typed Catalog object.
- getTypedObject() : PdfObject|PdfDictionary
- Hydrate any resolved object by object number.
- getTypedPage() : Page
- Return a specific page as a typed Page object.
- getTypedPages() : array<int, Page>
- Return all pages as typed Page objects.
- getVersion() : string
- PDF version string, e.g. "1.7".
- isLinearized() : bool
- Check whether this PDF is linearized (web-optimized).
- resolveReference() : Serializable
- Resolve an indirect reference to its target.
- validateVersion() : array<int, string>
- Scan the document for structural features inconsistent with the declared version. Returns a list of warning strings.
Methods
extractAllText()
Extract text from all pages, concatenated with page separators.
public
extractAllText([string $separator = "\n" ]) : string
Parameters
- $separator : string = "\n"
  Separator between pages (default: newline)
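The relationship between the whole-document and per-page variants can be sketched as a simple loop. This is a minimal illustration, assuming the separator is placed only between pages and not after the last one; `joinPageText()` is a hypothetical helper, not part of PdfReader, and it accepts any object that exposes `getPageCount()` and `extractText()` as documented on this page.

```php
<?php
/**
 * Hypothetical helper: concatenates per-page text with a separator.
 * $reader is any object with getPageCount(): int and extractText(int): string.
 */
function joinPageText(object $reader, string $separator = "\n"): string
{
    $pages = [];
    $count = $reader->getPageCount();
    for ($i = 0; $i < $count; $i++) {
        // Page indexes are zero-based.
        $pages[] = $reader->extractText($i);
    }

    // Assumption: the separator goes between pages only.
    return implode($separator, $pages);
}
```

Collecting the pages first and joining once keeps the separator logic in a single place, which makes it easy to swap in a form feed or a visible marker.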
Return values
string
extractAllTextWithPositions()
Extract text with precise positioning from all pages.
public
extractAllTextWithPositions() : array<int, array<int, TextSpan>>
Return values
array<int, array<int, TextSpan>>: zero-based page index => spans
extractText()
Extract text from a page by index (zero-based).
public
extractText(int $pageIndex) : string
Interprets content stream operators, resolves font encodings (ToUnicode CMap, /Encoding + /Differences, WinAnsi fallback), and infers spacing from text positioning operators.
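To make "infers spacing from text positioning operators" concrete, here is a sketch of one common heuristic for the TJ operator. It is not taken from the library's source: `tjToText()` and the 200-unit threshold are assumptions chosen for illustration, and the sketch only considers horizontal writing.

```php
<?php
/**
 * Hypothetical helper: turns the operand array of a TJ operator into text.
 * Strings are shown text; numbers are position adjustments in thousandths
 * of a text-space unit, subtracted from the current position.
 *
 * @param array<int, string|int|float> $items
 */
function tjToText(array $items, float $gapThreshold = 200.0): string
{
    $text = '';
    foreach ($items as $item) {
        if (is_string($item)) {
            $text .= $item;
            continue;
        }
        // A large negative adjustment pushes the next glyph to the right,
        // which usually signals a word gap; small values are just kerning.
        if ($item <= -$gapThreshold && $text !== '' && substr($text, -1) !== ' ') {
            $text .= ' ';
        }
    }

    return $text;
}
```

For example, the operands of `[(Hel) -20 (lo) -350 (world)] TJ` would come out as a single kerned word followed by a space and a second word. Real extractors typically scale the threshold by the font's space width rather than using a constant.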
Parameters
- $pageIndex : int
Return values
string
extractTextWithPositions()
Extract text with precise positioning from a page by index (zero-based).
public
extractTextWithPositions(int $pageIndex) : array<int, TextSpan>
Returns a list of TextSpan objects, each containing the text content, position (x, y in user space), dimensions (width, height), font size, and font name.
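Positioned spans are typically regrouped into reading order by the caller. The sketch below shows one way to do that, under stated assumptions: the accessor names of TextSpan are not given on this page, so each span is represented by a plain array with `text`, `x` and `y` keys as a stand-in, and `spansToLines()` is a hypothetical helper.

```php
<?php
/**
 * Hypothetical helper: groups positioned spans into lines of text.
 * Each span is a plain array standing in for a TextSpan.
 *
 * @param array<int, array{text: string, x: float, y: float}> $spans
 * @return array<int, string>
 */
function spansToLines(array $spans, float $tolerance = 2.0): array
{
    // Default PDF user space has y growing upward, so larger y comes first.
    usort($spans, static fn(array $a, array $b): int
        => [$b['y'], $a['x']] <=> [$a['y'], $b['x']]);

    $lines = [];
    $current = [];
    $lineY = 0.0;
    foreach ($spans as $span) {
        if ($current !== [] && abs($span['y'] - $lineY) > $tolerance) {
            $lines[] = $current;
            $current = [];
        }
        if ($current === []) {
            $lineY = $span['y']; // y of the line being started
        }
        $current[] = $span;
    }
    if ($current !== []) {
        $lines[] = $current;
    }

    return array_map(static function (array $line): string {
        // Within a line, read left to right.
        usort($line, static fn(array $a, array $b): int => $a['x'] <=> $b['x']);
        return implode(' ', array_column($line, 'text'));
    }, $lines);
}
```

The tolerance absorbs small baseline differences between spans on the same visual line; a value tied to the span's font size would likely be more robust than a fixed number.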
Parameters
- $pageIndex : int
Return values
array<int, TextSpan>
fromFile()
public
static fromFile(string $path[, string $password = '' ][, bool $strict = true ]) : self
Parameters
- $path : string
- $password : string = ''
- $strict : bool = true
Return values
self
fromFilePublicKey()
Read a public-key (certificate-based) encrypted PDF from a file.
public
static fromFilePublicKey(string $path, string $certificate, string $privateKey[, bool $strict = true ]) : self
Parameters
- $path : string
- $certificate : string
- $privateKey : string
- $strict : bool = true
Return values
self
fromStream()
public
static fromStream(resource $stream[, string $password = '' ][, bool $strict = true ]) : self
Parameters
- $stream : resource
- $password : string = ''
- $strict : bool = true
Return values
self
fromString()
public
static fromString(string $content[, string $password = '' ][, bool $strict = true ]) : self
Parameters
- $content : string
- $password : string = ''
- $strict : bool = true
Return values
self
fromStringPublicKey()
Read a public-key (certificate-based) encrypted PDF from a string.
public
static fromStringPublicKey(string $content, string $certificate, string $privateKey[, bool $strict = true ]) : self
Parameters
- $content : string
- $certificate : string
- $privateKey : string
- $strict : bool = true
Return values
self
getCatalog()
Resolve /Root from the trailer — returns the Catalog dictionary.
public
getCatalog() : PdfDictionary
Return values
PdfDictionary
getEffectiveVersion()
Effective PDF version — max(header, catalog /Version).
public
getEffectiveVersion() : PdfVersion
Per ISO 32000 §7.2.2, the catalog /Version entry (PDF 1.4+) overrides the header version if it is higher.
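The max(header, catalog /Version) rule can be sketched with plain version strings. This is an illustration only: `effectiveVersion()` is a hypothetical helper, and strings stand in for the PdfVersion type, whose API is not described on this page.

```php
<?php
/**
 * Hypothetical helper: picks the later of the header version and the
 * optional catalog /Version entry.
 */
function effectiveVersion(string $headerVersion, ?string $catalogVersion): string
{
    // The catalog entry only wins when it is present and strictly later.
    if ($catalogVersion !== null && version_compare($catalogVersion, $headerVersion, '>')) {
        return $catalogVersion;
    }

    return $headerVersion;
}
```

`version_compare()` compares the dotted components numerically, so "2.0" correctly ranks above "1.7".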
Return values
PdfVersion
getInfo()
Resolve /Info from the trailer.
public
getInfo() : PdfDictionary|null
Return values
PdfDictionary|null
getLinearizationParameters()
Get linearization parameters if the PDF is linearized.
public
getLinearizationParameters() : array{linearized: float, fileLength: int, firstPageObj: int, firstPageEnd: int, pageCount: int, xrefOffset: int}|null
Return values
array{linearized: float, fileLength: int, firstPageObj: int, firstPageEnd: int, pageCount: int, xrefOffset: int}|null
getObject()
Resolve any object by number.
public
getObject(int $objNum) : Serializable
Parameters
- $objNum : int
Return values
Serializable
getPage()
Get a specific page by zero-based index.
public
getPage(int $index) : PdfDictionary
Parameters
- $index : int
Return values
PdfDictionary
getPageByteRange()
Calculate the byte range for a specific page in a linearized PDF.
public
getPageByteRange(int $pageIndex) : array{offset: int, length: int}|null
Returns an associative array with 'offset' and 'length' keys, or null if the PDF is not linearized or hints are unavailable.
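One plausible use of the returned range is to read just those bytes from the underlying file. The sketch below assumes the caller still has a seekable stream for the same PDF; `readByteRange()` is a hypothetical helper and not part of PdfReader.

```php
<?php
/**
 * Hypothetical helper: reads the bytes described by an
 * array{offset: int, length: int} range from a seekable stream.
 *
 * @param resource $stream
 * @param array{offset: int, length: int}|null $range
 */
function readByteRange($stream, ?array $range): ?string
{
    if ($range === null) {
        // Not linearized, or no hint data available.
        return null;
    }

    // stream_get_contents() seeks to the offset before reading.
    $bytes = stream_get_contents($stream, $range['length'], $range['offset']);

    return $bytes === false ? null : $bytes;
}
```

Passing the nullable result of `getPageByteRange()` straight through keeps the "not linearized" case a single null check at the call site.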
Parameters
- $pageIndex : int
Return values
array{offset: int, length: int}|null
getPageCount()
Get the total page count from /Pages -> /Count.
public
getPageCount() : int
Return values
int
getPageOffsetHintTable()
Parse the page offset hint table from a linearized PDF.
public
getPageOffsetHintTable() : PageOffsetHintTable|null
Returns null if the PDF is not linearized or the hint stream cannot be located/parsed.
Return values
PageOffsetHintTable|null
getPages()
Get all Page dictionaries by traversing the page tree.
public
getPages() : array<int, PdfDictionary>
Return values
array<int, PdfDictionary>
getParseWarnings()
Return warnings accumulated during parsing.
public
getParseWarnings() : array<int, string>
Return values
array<int, string>
getPdfVersion()
Typed PDF version from the file header.
public
getPdfVersion() : PdfVersion
Return values
PdfVersion
getResolver()
The underlying object resolver.
public
getResolver() : ObjectResolver
Return values
ObjectResolver
getTrailer()
The raw trailer dictionary.
public
getTrailer() : PdfDictionary
Return values
PdfDictionary
getTypedCatalog()
Return the document catalog as a typed Catalog object.
public
getTypedCatalog() : Catalog
Return values
Catalog
getTypedObject()
Hydrate any resolved object by object number.
public
getTypedObject(int $objNum) : PdfObject|PdfDictionary
Parameters
- $objNum : int
Return values
PdfObject|PdfDictionary
getTypedPage()
Return a specific page as a typed Page object.
public
getTypedPage(int $index) : Page
Parameters
- $index : int
Return values
Page
getTypedPages()
Return all pages as typed Page objects.
public
getTypedPages() : array<int, Page>
Return values
array<int, Page>
getVersion()
PDF version string, e.g. "1.7".
public
getVersion() : string
Return values
string
isLinearized()
Check whether this PDF is linearized (web-optimized).
public
isLinearized() : bool
A linearized PDF has a LinearizationParameters dictionary as the very first indirect object, containing a /Linearized key. The reader handles linearized PDFs correctly (via startxref), but does not use the hint tables for progressive loading.
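The "first indirect object contains /Linearized" rule can be approximated on raw bytes. The following is a rough heuristic for illustration, not the reader's actual detection logic: `looksLinearized()` is a hypothetical helper, the 1024-byte window is an assumption, and the pattern expects a flat dictionary with no nested `>` before the key.

```php
<?php
/**
 * Hypothetical helper: cheap check for a linearization dictionary
 * near the start of the file.
 */
function looksLinearized(string $pdfBytes): bool
{
    // The linearization dictionary must be the first indirect object,
    // so only the head of the file is inspected.
    $head = substr($pdfBytes, 0, 1024);

    // "<objNum> <gen> obj << ... /Linearized" inside one flat dictionary.
    return preg_match('~\d+\s+\d+\s+obj\s*<<[^>]*/Linearized~', $head) === 1;
}
```

A byte-level check like this is only useful as a quick pre-filter; a parsed dictionary lookup, as `isLinearized()` describes, is the reliable test.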
Return values
bool
resolveReference()
Resolve an indirect reference to its target.
public
resolveReference(PdfReference $ref) : Serializable
Parameters
- $ref : PdfReference
Return values
Serializable
validateVersion()
Scan the document for structural features inconsistent with the declared version. Returns a list of warning strings.
public
validateVersion() : array<int, string>
Checks top-level indicators that can be detected from raw dictionaries without full object hydration.
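As an illustration of what a dictionary-level check might look like, the sketch below compares catalog keys against the PDF version that introduced them. Which indicators the library actually inspects is not documented on this page; the table, the `catalogVersionWarnings()` helper, and the warning wording are all assumptions.

```php
<?php
// Illustrative subset of catalog entries and the PDF version that introduced them.
const CATALOG_KEY_MIN_VERSION = [
    'AcroForm'       => '1.2',
    'StructTreeRoot' => '1.3',
    'Metadata'       => '1.4',
    'OCProperties'   => '1.5',
    'Collection'     => '1.7',
];

/**
 * Hypothetical helper: reports catalog keys newer than the declared version.
 *
 * @param array<int, string> $catalogKeys key names without the leading slash
 * @return array<int, string>
 */
function catalogVersionWarnings(string $declaredVersion, array $catalogKeys): array
{
    $warnings = [];
    foreach ($catalogKeys as $key) {
        $min = CATALOG_KEY_MIN_VERSION[$key] ?? null;
        // Keys missing from the table are not version-gated in this sketch.
        if ($min !== null && version_compare($declaredVersion, $min, '<')) {
            $warnings[] = sprintf(
                '/%s requires PDF %s, but the document declares %s',
                $key,
                $min,
                $declaredVersion
            );
        }
    }

    return $warnings;
}
```

Returning strings rather than throwing matches the documented `array<int, string>` return shape and lets callers decide how strictly to treat each finding.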