Marker Converters
Complete reference for all converter types in marker-pdf.
Converter Types
PdfConverter (Default)
Full-featured PDF converter with OCR, layout detection, and table recognition.
from marker.converters.pdf import PdfConverter
from marker.models import create_model_dict
converter = PdfConverter(artifact_dict=create_model_dict())
rendered = converter("document.pdf")
Features:
- Text extraction from native PDFs
- OCR for scanned pages
- Table detection and extraction
- Equation recognition
- Image extraction
- Layout preservation
Best for: General PDF conversion, mixed content documents
TableConverter
Specialized converter for extracting tables from PDFs.
from marker.converters.table import TableConverter
from marker.models import create_model_dict
converter = TableConverter(artifact_dict=create_model_dict())
rendered = converter("document.pdf")
Features:
- Focused table detection
- Structured table output
- Cell-level extraction
- Merged cell handling
Best for: Documents with many tables, data extraction
Output format: JSON with table structures (x1, y1, x2, y2 stand for the numeric bounding-box coordinates)
{
  "tables": [
    {
      "page": 1,
      "bbox": [x1, y1, x2, y2],
      "cells": [
        {"row": 0, "col": 0, "text": "Header 1"},
        {"row": 0, "col": 1, "text": "Header 2"}
      ]
    }
  ]
}
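The flat cell list is often easier to work with as a row-major grid. The following is a minimal sketch, assuming the output really has the shape shown above; `cells_to_grid` is a hypothetical helper, not part of marker, and the bounding-box numbers in the sample are made up.

```python
import json
from typing import Dict, List


def cells_to_grid(table: Dict) -> List[List[str]]:
    """Arrange a table's flat cell list into a row-major grid of strings."""
    cells = table["cells"]
    if not cells:
        return []
    n_rows = max(c["row"] for c in cells) + 1
    n_cols = max(c["col"] for c in cells) + 1
    # Positions with no cell in the input stay as empty strings.
    grid = [["" for _ in range(n_cols)] for _ in range(n_rows)]
    for c in cells:
        grid[c["row"]][c["col"]] = c["text"]
    return grid


# Sample payload in the shape documented above (illustrative values only).
sample = json.loads("""
{"tables": [{"page": 1, "bbox": [72, 100, 540, 300],
  "cells": [{"row": 0, "col": 0, "text": "Header 1"},
            {"row": 0, "col": 1, "text": "Header 2"},
            {"row": 1, "col": 0, "text": "a"}]}]}
""")
grids = [cells_to_grid(t) for t in sample["tables"]]
```

Each grid can then be written out with the standard `csv` module or loaded into a dataframe.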
OCRConverter
Pure OCR converter without layout analysis.
from marker.converters.ocr import OCRConverter
from marker.models import create_model_dict
converter = OCRConverter(artifact_dict=create_model_dict())
rendered = converter("document.pdf")
Features:
- Raw OCR text extraction
- No layout preservation
- Fast processing
- Simple output
Best for: Scanned documents, text-only extraction, speed priority
Output: Plain text in reading order
EquationConverter
Specialized converter for mathematical equations.
from marker.converters.equation import EquationConverter
from marker.models import create_model_dict
converter = EquationConverter(artifact_dict=create_model_dict())
rendered = converter("document.pdf")
Features:
- LaTeX equation extraction
- Inline and display equations
- Symbol recognition
- Math notation preservation
Best for: Academic papers, mathematical documents
Output format: LaTeX equations
$$
E = mc^2
$$
Inline equation: $x^2 + y^2 = z^2$
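To pull the equations back out of text delimited this way, something like the following could be used. It is a rough sketch that assumes display math is wrapped in `$$ ... $$` and inline math in `$ ... $`, as in the sample above; `extract_equations` is a hypothetical helper, and escaped dollar signs (`\$`) are not handled.

```python
import re

# Display math may span several lines, so DOTALL lets "." match newlines.
DISPLAY_RE = re.compile(r"\$\$(.+?)\$\$", re.DOTALL)
# Inline math is assumed to stay on a single line.
INLINE_RE = re.compile(r"\$(.+?)\$")


def extract_equations(text):
    """Return (display_equations, inline_equations) found in `text`."""
    display = [m.strip() for m in DISPLAY_RE.findall(text)]
    # Remove display blocks first so their delimiters are not re-read as inline math.
    remainder = DISPLAY_RE.sub(" ", text)
    inline = [m.strip() for m in INLINE_RE.findall(remainder)]
    return display, inline


sample = "$$\nE = mc^2\n$$\nInline equation: $x^2 + y^2 = z^2$"
display, inline = extract_equations(sample)
```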
Converter Comparison
| Converter | Speed | Accuracy | Layout | Tables | Equations | Images |
|---|---|---|---|---|---|---|
| PdfConverter | Medium | High | ✓ | ✓ | ✓ | ✓ |
| TableConverter | Fast | High | ✗ | ✓ | ✗ | ✗ |
| OCRConverter | Fast | Medium | ✗ | ✗ | ✗ | ✗ |
| EquationConverter | Medium | High | ✗ | ✗ | ✓ | ✗ |
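The table above can be read as a lookup from required features to the narrowest converter that covers them. A small sketch of that reading follows; the capability sets are transcribed from the table, the dotted paths are the ones used elsewhere on this page, and `pick_converter` is a hypothetical helper rather than anything marker provides.

```python
# Capabilities per converter, transcribed from the comparison table.
CAPABILITIES = {
    "marker.converters.pdf.PdfConverter": {"layout", "tables", "equations", "images"},
    "marker.converters.table.TableConverter": {"tables"},
    "marker.converters.ocr.OCRConverter": set(),
    "marker.converters.equation.EquationConverter": {"equations"},
}


def pick_converter(needed):
    """Return the dotted path of the converter with the fewest capabilities that still covers `needed`."""
    needed = set(needed)
    candidates = [
        (len(caps), path)
        for path, caps in CAPABILITIES.items()
        if needed <= caps
    ]
    if not candidates:
        raise ValueError(f"no converter supports all of: {sorted(needed)}")
    # The smallest capability set is taken as the most specialized choice.
    return min(candidates)[1]


choice = pick_converter({"tables"})
```

Anything that needs more than one feature falls through to PdfConverter, which matches the "general PDF conversion" recommendation.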
Using Converters via CLI
PdfConverter (default)
marker_single document.pdf
TableConverter
marker_single document.pdf \
    --converter_cls marker.converters.table.TableConverter \
    --output_format json
OCRConverter
marker_single document.pdf \
    --converter_cls marker.converters.ocr.OCRConverter
EquationConverter
marker_single document.pdf \
    --converter_cls marker.converters.equation.EquationConverter
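When these commands are launched from a script, building the argument list in one place avoids quoting mistakes. The sketch below uses only the `marker_single` flags shown above; `build_marker_command` is a hypothetical helper, and the command is assembled but not executed.

```python
import shlex


def build_marker_command(pdf_path, converter_cls=None, output_format=None):
    """Assemble a marker_single argument list from the options shown above."""
    cmd = ["marker_single", pdf_path]
    if converter_cls:
        cmd += ["--converter_cls", converter_cls]
    if output_format:
        cmd += ["--output_format", output_format]
    return cmd


cmd = build_marker_command(
    "document.pdf",
    converter_cls="marker.converters.table.TableConverter",
    output_format="json",
)
# Shell-safe rendering of the argument list, useful for logging.
printable = shlex.join(cmd)
```

To run it, pass the list to `subprocess.run(cmd, check=True)`; passing a list rather than a string keeps file names with spaces intact.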
Custom Converters
Create custom converters by extending BaseConverter:
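# Import path as given in this reference; depending on the installed marker
# version, BaseConverter may instead be importable from marker.converters.
from marker.converters.base import BaseConverter
from marker.models import create_model_dict


class CustomConverter(BaseConverter):
    def __init__(self, artifact_dict, **kwargs):
        super().__init__(artifact_dict, **kwargs)
        # Custom initialization goes here

    def convert_page(self, page):
        # Custom page conversion logic; this placeholder returns the page unchanged
        processed_page = page
        return processed_page

    def post_process(self, pages):
        # Custom post-processing; this placeholder returns the pages unchanged
        final_output = pages
        return final_output


# Usage
converter = CustomConverter(artifact_dict=create_model_dict())
rendered = converter("document.pdf")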
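The CLI section refers to converters by dotted path (for example `marker.converters.table.TableConverter`), so a custom converter would presumably be named the same way. The sketch below shows how such a path can be resolved to a class with `importlib`. It is illustrative only and not marker's own loader; `load_class` is a hypothetical name, and the demonstration uses a standard-library class so that it runs without marker installed.

```python
import importlib


def load_class(dotted_path):
    """Resolve 'package.module.ClassName' to the class object it names."""
    module_name, _, class_name = dotted_path.rpartition(".")
    if not module_name:
        raise ValueError(f"expected 'package.module.ClassName', got {dotted_path!r}")
    module = importlib.import_module(module_name)
    obj = getattr(module, class_name)
    if not isinstance(obj, type):
        raise TypeError(f"{dotted_path!r} does not name a class")
    return obj


# Stand-in for a converter path such as "my_package.converters.CustomConverter".
cls = load_class("collections.OrderedDict")
```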
Converter Configuration
Common Options
All converters support these configuration options:
from marker.converters.pdf import PdfConverter
from marker.models import create_model_dict

config = {
    "page_range": (0, 10),        # Process pages 0-10
    "languages": ["en", "es"],    # Expected languages
    "workers": 4,                 # Parallel workers
    "output_format": "markdown",  # Output format
}
converter = PdfConverter(
    artifact_dict=create_model_dict(),
    config=config,
)
Converter-Specific Options
PdfConverter
config = {
    "use_llm": True,          # Use LLM for accuracy
    "force_ocr": False,       # Force OCR on all pages
    "extract_images": True,   # Extract embedded images
    "preserve_layout": True,  # Maintain document layout
}
TableConverter
config = {
    "min_table_confidence": 0.8,  # Minimum confidence threshold
    "merge_cells": True,          # Handle merged cells
    "extract_headers": True,      # Identify table headers
}
OCRConverter
config = {
    "ocr_confidence": 0.7,  # Minimum OCR confidence
    "denoise": True,        # Apply denoising
    "deskew": True,         # Correct page skew
}
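Since the common and converter-specific options end up in the same `config` dictionary, it can help to merge them explicitly and catch misplaced keys early. This is a minimal sketch, assuming the common option names are exactly the four listed under Common Options; `build_config` is a hypothetical helper, not a marker function.

```python
# Common option names as listed under "Common Options" on this page.
COMMON_KEYS = {"page_range", "languages", "workers", "output_format"}


def build_config(common, specific):
    """Merge common and converter-specific options into one config dict."""
    unknown = set(common) - COMMON_KEYS
    if unknown:
        raise ValueError(f"not a common option: {sorted(unknown)}")
    overlap = set(common) & set(specific)
    if overlap:
        raise ValueError(f"option set twice: {sorted(overlap)}")
    return {**common, **specific}


config = build_config(
    {"page_range": (0, 10), "workers": 4},
    {"force_ocr": False},
)
```

The merged dictionary can then be passed as `config=` in the same way as in the Common Options example.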
Chaining Converters
Process documents with multiple converters:
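from marker.converters.pdf import PdfConverter
from marker.converters.table import TableConverter
from marker.models import create_model_dict
from marker.output import text_from_rendered

# Load the models once and share them between converters
models = create_model_dict()

# Extract general content
pdf_converter = PdfConverter(artifact_dict=models)
pdf_rendered = pdf_converter("doc.pdf")
# Assumes text_from_rendered returns a (text, extension, images) tuple
text, _, images = text_from_rendered(pdf_rendered)
# Assumes the rendered output exposes its metadata as an attribute
metadata = pdf_rendered.metadata

# Extract tables separately; the output format follows the converter's
# configuration (see output_format under Common Options), not this call
table_converter = TableConverter(artifact_dict=models)
tables, _, _ = text_from_rendered(table_converter("doc.pdf"))

# Combine results
combined = {
    "text": text,
    "tables": tables,
    "metadata": metadata,
}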
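When more than two converters are chained, the same pattern can be written as a loop that keeps going if one converter fails. The sketch below is generic: `run_converters` is a hypothetical helper that accepts any callables taking a file path, so it is shown here with stand-in functions instead of real marker converters.

```python
from typing import Any, Callable, Dict, Tuple


def run_converters(
    path: str, converters: Dict[str, Callable[[str], Any]]
) -> Tuple[Dict[str, Any], Dict[str, str]]:
    """Run each converter on `path`; return (results, errors) keyed by name."""
    results: Dict[str, Any] = {}
    errors: Dict[str, str] = {}
    for name, convert in converters.items():
        try:
            results[name] = convert(path)
        except Exception as exc:  # record the failure and continue with the rest
            errors[name] = str(exc)
    return results, errors


# Stand-ins for converter instances such as pdf_converter and table_converter.
def fake_text(path):
    return f"text of {path}"


def fake_tables(path):
    raise RuntimeError("no tables found")


results, errors = run_converters("doc.pdf", {"text": fake_text, "tables": fake_tables})
```

With real converters, the dictionary values would be the converter instances themselves, since they are called with a file path in the examples above.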
See Also
- api.md - Full API reference
- output-formats.md - Output format specifications