Arlington PDF Model

The Arlington PDF Model is a machine-readable definition of every PDF object type, maintained by the PDF Association, the industry organization that works with ISO on the ISO 32000 (PDF) standard. It provides the canonical grammar for PDF file structure: 613 dictionary specifications covering PDF 1.0 through 2.0.

License: Apache 2.0

The Arlington validator checks generated PDFs against the spec at the dictionary level:

| Check | Severity | Description |
| --- | --- | --- |
| Required keys missing | Error | Fields marked Required: TRUE that are absent from the dictionary |
| Unknown keys | Warning | Keys present in the PDF that don’t appear in the Arlington spec |
| Version constraints | Warning | Fields used that require a higher PDF version than the document declares |
| Deprecated keys | Warning | Fields used that are deprecated in the document’s PDF version |
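
The four checks in the table can be sketched as a single pass over a dictionary's keys. This is a minimal illustration, not the project's ArlingtonValidator: the function name checkDictionary, the array shapes, and the message strings are all assumptions made for the example, and treating a key as deprecated when DeprecatedIn is at or below the declared version is likewise an assumed reading.

```php
<?php
// Hypothetical sketch: names and array shapes are illustrative only,
// not the API of the project's ArlingtonValidator.

/**
 * @param string[] $presentKeys Keys found in the PDF dictionary.
 * @param array<string, array{required: bool, since: string, deprecatedIn: ?string}> $fields
 * @param string $pdfVersion Version the document declares, e.g. "1.7".
 * @return array{errors: string[], warnings: string[]}
 */
function checkDictionary(array $presentKeys, array $fields, string $pdfVersion): array
{
    $errors = [];
    $warnings = [];

    // Required keys missing (Error).
    foreach ($fields as $key => $field) {
        if ($field['required'] && !in_array($key, $presentKeys, true)) {
            $errors[] = "Missing required key: {$key}";
        }
    }

    foreach ($presentKeys as $key) {
        // Unknown keys (Warning).
        if (!isset($fields[$key])) {
            $warnings[] = "Unknown key: {$key}";
            continue;
        }
        $field = $fields[$key];

        // Version constraints (Warning): key introduced after the declared version.
        if (version_compare($field['since'], $pdfVersion, '>')) {
            $warnings[] = "{$key} requires PDF {$field['since']}";
        }

        // Deprecated keys (Warning): assumed to apply when DeprecatedIn <= declared version.
        if ($field['deprecatedIn'] !== null
            && version_compare($field['deprecatedIn'], $pdfVersion, '<=')) {
            $warnings[] = "{$key} is deprecated since PDF {$field['deprecatedIn']}";
        }
    }

    return ['errors' => $errors, 'warnings' => $warnings];
}

// Made-up field specs for the demo (not read from Catalog.tsv).
$fields = [
    'Type'    => ['required' => true,  'since' => '1.0', 'deprecatedIn' => null],
    'Pages'   => ['required' => true,  'since' => '1.0', 'deprecatedIn' => null],
    'Version' => ['required' => false, 'since' => '1.4', 'deprecatedIn' => null],
];

// A PDF 1.3 dictionary that lacks Pages, uses Version, and has an unknown key.
$result = checkDictionary(['Type', 'Version', 'Foo'], $fields, '1.3');
print_r($result);
```

In this demo the missing Pages key lands in errors, while the early use of Version and the unknown Foo key land in warnings, mirroring the severities in the table.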

The validator currently checks only Catalog and Page dictionaries, and only against unconditional rules. Conditional requirements (encoded as fn: predicates such as fn:IsRequired(fn:IsPresent(PieceInfo))) are deferred. Not yet implemented:

  • fn: predicate evaluation for conditional required fields
  • Type checking (verifying value types match the spec: name, string, integer, etc.)
  • PossibleValues enumeration checking
  • Cross-dictionary link traversal (following Link column references)
  • Font, annotation, and action dictionary validation
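
Deferring conditional requirements means the Required column has to be classified before it is acted on. The sketch below shows one possible way to do that. classifyRequired is a hypothetical helper (not part of the project) that separates TRUE, FALSE, and fn: cells, so that predicate cells can be skipped until an evaluator exists.

```php
<?php
// Hypothetical helper, not taken from the project's code.

/**
 * Classify a Required cell as 'required', 'optional', or 'conditional'.
 */
function classifyRequired(string $cell): string
{
    $cell = trim($cell);

    if ($cell === 'TRUE') {
        return 'required';
    }
    if ($cell === 'FALSE') {
        return 'optional';
    }
    // Anything starting with "fn:" is a predicate that needs evaluation.
    if (strncmp($cell, 'fn:', 3) === 0) {
        return 'conditional';
    }

    throw new InvalidArgumentException("Unrecognized Required value: {$cell}");
}

foreach (['TRUE', 'FALSE', 'fn:IsRequired(fn:IsPresent(PieceInfo))'] as $cell) {
    echo $cell, ' => ', classifyRequired($cell), "\n";
}
```

A validator limited to unconditional rules would then report only on cells classified as required and ignore the conditional ones.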

The Arlington model is included as a git submodule at vendor-data/arlington-pdf-model/. The validator reads TSV files from tsv/latest/, which contains the current definitions for all PDF versions.

# Initialize the submodule
git submodule update --init
# Verify
ls vendor-data/arlington-pdf-model/tsv/latest/ | head -5

Each TSV file defines one PDF dictionary type with 12 columns:

| Column | Description |
| --- | --- |
| Key | Property name (e.g., Type, Pages, MediaBox) |
| Type | Data types, semicolon-separated (e.g., name, dictionary, rectangle) |
| SinceVersion | PDF version when introduced (e.g., 1.0, 1.4, 2.0) |
| DeprecatedIn | PDF version when deprecated (empty if not deprecated) |
| Required | TRUE, FALSE, or fn: predicate for conditional requirement |
| IndirectReference | Whether the value must be an indirect reference |
| Inheritable | Whether the value can be inherited from a parent |
| DefaultValue | Default value if the key is absent |
| PossibleValues | Allowed values (e.g., [Catalog], [SinglePage,OneColumn]) |
| SpecialCase | Additional validity constraints |
| Link | References to other TSV files for nested dictionary types |
| Note | PDF spec table reference or GitHub issue link |

Example (Catalog.tsv, first few rows):

Key      Type        SinceVersion  Required  ...  PossibleValues
Type     name        1.0           TRUE      ...  [Catalog]
Version  name        1.4           FALSE     ...  [1.0,1.1,...,2.0]
Pages    dictionary  1.0           TRUE      ...
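
To make the 12-column layout concrete, here is a minimal sketch of turning one TSV line into an associative array keyed by column name. The constant ARLINGTON_COLUMNS and the function parseTsvRow are hypothetical names for this example, and the sample row is built by hand with most cells left empty rather than copied from Catalog.tsv.

```php
<?php
// Hypothetical sketch: not the project's ArlingtonLoader or FieldSpec.

const ARLINGTON_COLUMNS = [
    'Key', 'Type', 'SinceVersion', 'DeprecatedIn', 'Required', 'IndirectReference',
    'Inheritable', 'DefaultValue', 'PossibleValues', 'SpecialCase', 'Link', 'Note',
];

/**
 * Parse one tab-separated line into a column-name => value map.
 * The Type cell is split on ';' into a list of type names.
 */
function parseTsvRow(string $line): array
{
    $cells = explode("\t", rtrim($line, "\r\n"));

    // Trim or pad so array_combine() always receives exactly 12 values.
    $count = count(ARLINGTON_COLUMNS);
    $cells = array_pad(array_slice($cells, 0, $count), $count, '');

    $row = array_combine(ARLINGTON_COLUMNS, $cells);
    $row['Type'] = explode(';', $row['Type']);

    return $row;
}

// Hand-built sample row; empty strings stand in for cells not shown in the example above.
$line = implode("\t", ['Type', 'name', '1.0', '', 'TRUE', '', '', '', '[Catalog]', '', '', '']);

$row = parseTsvRow($line);
echo $row['Key'], ' required=', $row['Required'], ' values=', $row['PossibleValues'], "\n";
```

Splitting Type eagerly keeps later type checks simple, at the cost of treating one column differently from the other eleven.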

Arlington TSV filenames don’t always match PDF /Type values. The validator maps them:

| PDF /Type value | Arlington TSV file |
| --- | --- |
| Catalog | Catalog.tsv |
| Page | PageObject.tsv |
| Pages | PageTreeNodeRoot.tsv |
| Font | FontType1.tsv |
| ExtGState | GraphicsStateParameter.tsv |
| Outlines | Outline.tsv |
| XRef | XRefStream.tsv |
| ObjStm | ObjectStream.tsv |
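
A mapping like this can be expressed as a plain lookup table. The sketch below is illustrative only: TYPE_TO_SPEC and specNameForType are hypothetical names, and falling back to the /Type value itself when no override exists is an assumption, not documented behavior of the validator.

```php
<?php
// Hypothetical sketch of the /Type-to-spec-name mapping shown in the table.

const TYPE_TO_SPEC = [
    'Catalog'   => 'Catalog',
    'Page'      => 'PageObject',
    'Pages'     => 'PageTreeNodeRoot',
    'Font'      => 'FontType1',
    'ExtGState' => 'GraphicsStateParameter',
    'Outlines'  => 'Outline',
    'XRef'      => 'XRefStream',
    'ObjStm'    => 'ObjectStream',
];

/**
 * Return the Arlington spec name (TSV basename) for a PDF /Type value.
 */
function specNameForType(string $pdfType): string
{
    // Assumption: unmapped types use their own name as the spec name.
    return TYPE_TO_SPEC[$pdfType] ?? $pdfType;
}

echo specNameForType('Page'), ".tsv\n";
echo specNameForType('ExtGState'), ".tsv\n";
```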

ArlingtonLoader::load() scans all *.tsv files in the TSV directory, parses each into a DictionarySpec containing FieldSpec entries, and caches the result. The 613 specs are loaded once per test run.
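
The scan-parse-cache pattern described above can be sketched as follows. SpecCache is a hypothetical stand-in for illustration, not the project's ArlingtonLoader: it keeps raw cell arrays instead of DictionarySpec and FieldSpec objects, and the demo writes a tiny two-column TSV to a temporary directory rather than reading the real submodule.

```php
<?php
// Hypothetical sketch of "scan *.tsv, parse, cache". Not the project's loader.

final class SpecCache
{
    /** @var array<string, array> Parsed specs, cached per directory. */
    private static array $cache = [];

    /**
     * @return array<string, array<int, string[]>> Spec name => list of rows (cells).
     */
    public static function load(string $dir): array
    {
        // Return the cached result on repeat calls.
        if (isset(self::$cache[$dir])) {
            return self::$cache[$dir];
        }

        $specs = [];
        foreach (glob($dir . '/*.tsv') ?: [] as $path) {
            $lines = file($path, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
            if ($lines === false) {
                continue; // unreadable file: skip it in this sketch
            }
            array_shift($lines); // drop the header row
            $specs[basename($path, '.tsv')] = array_map(
                fn (string $line): array => explode("\t", $line),
                $lines
            );
        }

        return self::$cache[$dir] = $specs;
    }
}

// Demo with a throwaway directory and a made-up, two-column TSV file.
$dir = sys_get_temp_dir() . '/arlington-demo-' . uniqid();
mkdir($dir);
file_put_contents($dir . '/Catalog.tsv', "Key\tType\nType\tname\nPages\tdictionary\n");

$first = SpecCache::load($dir);
$second = SpecCache::load($dir); // served from the static cache

echo implode(', ', array_keys($first)), ': ', count($first['Catalog']), " fields\n";

// Clean up the temporary files.
unlink($dir . '/Catalog.tsv');
rmdir($dir);
```

A static cache keyed by directory means repeated calls within one PHP process reuse the parsed data, which is the effect the "loaded once per test run" note describes.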

ArlingtonValidator::validate() takes a PdfDictionary (from PdfReader), a spec name, and an optional PdfVersion, then returns a ValidationResult with errors and warnings.

The ArlingtonValidationTrait provides:

// Validate a PDF file (reads with PdfReader, validates Catalog + all Pages)
$this->assertArlingtonValid('/path/to/file.pdf');
// Validate in-memory PDF bytes
$this->assertArlingtonValidBytes($pdfBytes);

If the Arlington submodule isn’t initialized, both methods call markTestSkipped().

| File | Class | Purpose |
| --- | --- | --- |
| tests/Support/Arlington/ArlingtonLoader.php | ArlingtonLoader | Parses TSV files, caches specs |
| tests/Support/Arlington/ArlingtonValidator.php | ArlingtonValidator | Validates dictionaries against specs |
| tests/Support/Arlington/ArlingtonValidationTrait.php | ArlingtonValidationTrait | PHPUnit assertion trait |
| tests/Support/Arlington/DictionarySpec.php | DictionarySpec | One dictionary type with its fields |
| tests/Support/Arlington/FieldSpec.php | FieldSpec | One field (12-column TSV row) |
| tests/Support/Arlington/ValidationResult.php | ValidationResult | Errors + warnings container |

Arlington validation is currently applied to 5 core integration tests:

| Test file | What it validates |
| --- | --- |
| SimpleTextTest | 3-page text PDF: Catalog + 3 Pages |
| MultiPageComplexTest | 10-page PDF with Info/ViewerPreferences: Catalog + 10 Pages |
| FormFieldsTest | AcroForm with text/checkbox/choice fields: Catalog + 1 Page |
| BookmarksTest | Nested outline tree: Catalog + 6 Pages |
| DocumentFeaturesTest | OutputIntent, OCG, tagged structure: Catalog + Pages (2 test methods) |

The Arlington submodule is initialized in CI by the submodules: true checkout option:

# .github/workflows/ci.yml
- uses: actions/checkout@v4
  with:
    submodules: true

No additional installation is needed, since the validator is pure PHP.

The loader and validator can also be used directly, outside the PHPUnit trait:

use Phpdftk\Tests\Support\Arlington\ArlingtonLoader;
use Phpdftk\Tests\Support\Arlington\ArlingtonValidator;
use Phpdftk\Pdf\Reader\PdfReader;

// Load the Arlington specs and build a validator
$specs = ArlingtonLoader::load();
$validator = new ArlingtonValidator($specs);

// Open the document and read its declared PDF version
$reader = PdfReader::fromFile('my-document.pdf');
$version = $reader->getPdfVersion();

// Validate catalog
$result = $validator->validate($reader->getCatalog(), 'Catalog', $version);
echo "Errors: " . implode(', ', $result->errors) . "\n";
echo "Warnings: " . implode(', ', $result->warnings) . "\n";

// Validate each page
for ($i = 0; $i < $reader->getPageCount(); $i++) {
    $result = $validator->validate($reader->getPage($i), 'PageObject', $version);
    echo "Page {$i}: " . count($result->errors) . " errors\n";
}