TextExtractor
Final: Yes
Extracts text content from a PDF page by interpreting content stream operators.
Tracks text state (current font, position, spacing) and converts character codes to Unicode using:
- /ToUnicode CMap (if present on the font)
- /Encoding + /Differences (if present)
- WinAnsi → GlyphList fallback (for standard fonts)
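As a rough illustration of the lookup chain above, the sketch below resolves one character code, assuming the three sources are tried in the order listed. The function `decodeCharCode()` and its array-based tables are hypothetical and are not part of the documented API; the real class may store and consult this data differently.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper, not part of TextExtractor's documented API.
 *
 * @param array<int,string>|null $toUnicode   code => UTF-8 string, from /ToUnicode
 * @param array<int,string>|null $differences code => glyph name, from /Differences
 * @param array<int,string>      $winAnsi     code => glyph name, base encoding table
 * @param array<string,string>   $glyphList   glyph name => UTF-8 string
 */
function decodeCharCode(
    int $code,
    ?array $toUnicode,
    ?array $differences,
    array $winAnsi,
    array $glyphList
): string {
    // 1. A /ToUnicode entry, when present, maps the code directly to text.
    if ($toUnicode !== null && isset($toUnicode[$code])) {
        return $toUnicode[$code];
    }

    // 2. Otherwise find a glyph name: a /Differences override first,
    //    then the WinAnsi base table.
    $glyphName = $differences[$code] ?? $winAnsi[$code] ?? null;

    // 3. Translate the glyph name through the glyph list.
    if ($glyphName !== null && isset($glyphList[$glyphName])) {
        return $glyphList[$glyphName];
    }

    // Assumption: unmapped codes become U+FFFD (replacement character).
    return "\u{FFFD}";
}

// Tiny excerpts of the real tables, for illustration only.
$winAnsi   = [0x41 => 'A', 0x80 => 'Euro'];
$glyphList = ['A' => 'A', 'Euro' => '€', 'bullet' => '•'];
```

With these sample tables, a font that supplies neither /ToUnicode nor /Differences resolves code 0x80 through WinAnsi to the glyph name `Euro`, and from there to `€`.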
Text positioning is used to insert spaces and newlines where the PDF moves the text cursor by significant amounts.
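One way such a positioning heuristic could look is sketched below. The function `separatorFor()` and both thresholds (half the font size vertically, a quarter of the font size horizontally) are assumptions chosen for illustration; the documentation does not state the values the class actually uses.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper, not part of TextExtractor's documented API.
 * Decides what to emit between the previous text run and the next one.
 *
 * @param float $prevEndX X coordinate where the previous run ended
 * @param float $prevY    baseline Y of the previous run
 * @param float $x        X coordinate where the next run starts
 * @param float $y        baseline Y of the next run
 * @param float $fontSize current font size in text space units
 */
function separatorFor(
    float $prevEndX,
    float $prevY,
    float $x,
    float $y,
    float $fontSize
): string {
    // Assumed threshold: a baseline shift of more than half the font size
    // is treated as a move to a new line.
    if (abs($y - $prevY) > 0.5 * $fontSize) {
        return "\n";
    }

    // Assumed threshold: a horizontal gap wider than a quarter of the
    // font size is treated as a word break.
    if ($x - $prevEndX > 0.25 * $fontSize) {
        return ' ';
    }

    // Small moves (for example kerning adjustments) add nothing.
    return '';
}
```

Scaling the thresholds by the font size keeps the heuristic from splitting words in large headings or merging words in small print.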
Table of Contents
Methods
- __construct() : mixed
- extractFromPage() : string
  Extract text from a page dictionary.
Methods
__construct()
public
__construct(ObjectResolver $resolver) : mixed
Parameters
- $resolver : ObjectResolver
extractFromPage()
Extract text from a page dictionary.
public
extractFromPage(PdfDictionary $page) : string
Parameters
- $page : PdfDictionary
  The page dictionary (must have /Contents and /Resources)
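A minimal usage sketch follows. It assumes `TextExtractor` and `PdfDictionary` are already imported or autoloaded, and that the caller has obtained the page dictionaries elsewhere, since this page does not document how pages are enumerated. The wrapper `extractPages()` and the blank-line page separator are hypothetical.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical wrapper, not part of the documented API: runs
 * extractFromPage() over several page dictionaries and joins the results.
 *
 * @param iterable<PdfDictionary> $pages page dictionaries, each with
 *                                       /Contents and /Resources
 */
function extractPages(TextExtractor $extractor, iterable $pages): string
{
    $chunks = [];
    foreach ($pages as $page) {
        // extractFromPage(PdfDictionary $page) : string, as documented above.
        $chunks[] = $extractor->extractFromPage($page);
    }

    // Assumption: a blank line is used to separate pages in the output.
    return implode("\n\n", $chunks);
}
```

The extractor itself would be created once with `new TextExtractor($resolver)`, where `$resolver` is an `ObjectResolver` instance, and then reused for every page.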