Output Formats¶

Complete specification for all marker-pdf output formats.

Supported Formats¶

markdown - GitHub-flavored markdown (default)
json - Structured JSON with metadata
html - Semantic HTML5
chunks - Text chunks for RAG/embedding

Markdown Format¶

Overview¶

GitHub-flavored markdown with preserved formatting, tables, equations, and images.

marker_single document.pdf --output_format markdown

Structure¶

# Document Title

## Section 1

Paragraph text with **bold** and *italic* formatting.

### Subsection 1.1

- Bullet point 1
- Bullet point 2

| Header 1 | Header 2 |
|----------|----------|
| Cell 1   | Cell 2   |

![Image description](image_0.png)

$$
E = mc^2
$$

Inline equation: $x^2 + y^2 = z^2$

Features¶

Headings (H1-H6)
Bold, italic, underline
Lists (ordered, unordered)
Tables with alignment
Code blocks
Block quotes
Links
Images (embedded or referenced)
LaTeX equations (inline and display)

Image Handling¶

Images are extracted and referenced:

![Figure 1: Chart](image_0.png)

Images saved separately in output directory.

JSON Format¶

Overview¶

Structured JSON with full document metadata and content.

marker_single document.pdf --output_format json

Structure¶

{
  "metadata": {
    "pages": 10,
    "language": "en",
    "title": "Document Title",
    "author": "Author Name",
    "created": "2024-01-01",
    "table_count": 3,
    "equation_count": 5,
    "image_count": 2
  },
  "content": [
    {
      "type": "heading",
      "level": 1,
      "text": "Document Title",
      "page": 1
    },
    {
      "type": "paragraph",
      "text": "Paragraph content...",
      "page": 1,
      "bbox": [x1, y1, x2, y2]
    },
    {
      "type": "table",
      "page": 2,
      "bbox": [x1, y1, x2, y2],
      "rows": [
        ["Header 1", "Header 2"],
        ["Cell 1", "Cell 2"]
      ]
    },
    {
      "type": "equation",
      "page": 3,
      "latex": "E = mc^2",
      "display": true
    },
    {
      "type": "image",
      "page": 4,
      "filename": "image_0.png",
      "bbox": [x1, y1, x2, y2],
      "caption": "Figure 1: Chart"
    }
  ],
  "images": {
    "image_0.png": "base64_encoded_data..."
  }
}

Content Types¶

heading - Document headings
paragraph - Text paragraphs
list - Ordered/unordered lists
table - Tabular data
equation - Mathematical equations
image - Embedded images
code - Code blocks

Use Cases¶

Structured data extraction
Database import
API integration
Custom processing pipelines

HTML Format¶

Overview¶

Semantic HTML5 with CSS classes for styling.

marker_single document.pdf --output_format html

Structure¶

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>Document Title</title>
    <style>
        /* Embedded CSS */
    </style>
</head>
<body>
    <article>
        <h1>Document Title</h1>

        <section>
            <h2>Section 1</h2>
            <p>Paragraph text with <strong>bold</strong> and <em>italic</em>.</p>

            <table>
                <thead>
                    <tr>
                        <th>Header 1</th>
                        <th>Header 2</th>
                    </tr>
                </thead>
                <tbody>
                    <tr>
                        <td>Cell 1</td>
                        <td>Cell 2</td>
                    </tr>
                </tbody>
            </table>

            <figure>
                <img src="image_0.png" alt="Figure 1: Chart">
                <figcaption>Figure 1: Chart</figcaption>
            </figure>

            <div class="equation">
                $$E = mc^2$$
            </div>
        </section>
    </article>
</body>
</html>

CSS Classes¶

.heading-1 through .heading-6 - Headings
.paragraph - Paragraphs
.table - Tables
.equation - Equations (inline and display)
.image - Images
.code-block - Code blocks

Use Cases¶

Web publishing
Documentation sites
Email templates
Print-ready output

Chunks Format¶

Overview¶

Text chunks optimized for RAG (Retrieval Augmented Generation) and embedding.

marker_single document.pdf --output_format chunks

Structure¶

{
  "chunks": [
    {
      "id": "chunk_0",
      "text": "Document Title\n\nSection 1\n\nParagraph text...",
      "page": 1,
      "char_count": 150,
      "metadata": {
        "section": "Introduction",
        "has_table": false,
        "has_equation": false
      }
    },
    {
      "id": "chunk_1",
      "text": "Section 2\n\nMore content...",
      "page": 2,
      "char_count": 200,
      "metadata": {
        "section": "Methods",
        "has_table": true,
        "has_equation": true
      }
    }
  ],
  "metadata": {
    "total_chunks": 15,
    "avg_chunk_size": 175,
    "pages": 10
  }
}

Chunking Strategy¶

Semantic boundaries: Chunks split at section/paragraph boundaries
Size target: ~150-300 characters per chunk
Overlap: Optional 10% overlap between chunks
Metadata: Each chunk includes context metadata

Configuration¶

config = {
    "chunk_size": 200,        # Target chunk size
    "chunk_overlap": 20,      # Overlap between chunks
    "split_on": "paragraph"   # "paragraph", "section", or "sentence"
}

Use Cases¶

Vector database ingestion
Semantic search
RAG pipelines
Document embedding

Format Comparison¶

Format	Size	Structure	Metadata	Images	Best For
Markdown	Small	Moderate	Limited	Referenced	Documentation, notes
JSON	Large	High	Complete	Embedded	APIs, databases
HTML	Medium	High	Moderate	Embedded	Web, publishing
Chunks	Medium	Low	Moderate	None	RAG, search

Converting Between Formats¶

Python¶

from marker.converters.pdf import PdfConverter
from marker.models import create_model_dict
from marker.output import text_from_rendered

converter = PdfConverter(artifact_dict=create_model_dict())
rendered = converter("document.pdf")

# Get different formats
markdown, _, _ = text_from_rendered(rendered, output_format="markdown")
json_data, _, _ = text_from_rendered(rendered, output_format="json")
html, _, _ = text_from_rendered(rendered, output_format="html")
chunks, _, _ = text_from_rendered(rendered, output_format="chunks")

CLI¶

# Convert to all formats
for format in markdown json html chunks; do
    marker_single document.pdf --output_format $format --output_dir ./output_$format
done

Custom Output Formats¶

Create custom output formats by extending the renderer:

from marker.renderers.base import BaseRenderer

class CustomRenderer(BaseRenderer):
    def render(self, document):
        # Custom rendering logic
        return custom_output

converter = PdfConverter(
    artifact_dict=create_model_dict(),
    renderer=CustomRenderer()
)

Output Formats¶

Supported Formats¶

Markdown Format¶

Overview¶

Structure¶

Features¶

Image Handling¶

JSON Format¶

Overview¶

Structure¶

Content Types¶

Use Cases¶

HTML Format¶

Overview¶

Structure¶

CSS Classes¶

Use Cases¶

Chunks Format¶

Overview¶

Structure¶

Chunking Strategy¶

Configuration¶

Use Cases¶

Format Comparison¶

Converting Between Formats¶

Python¶

CLI¶

Custom Output Formats¶

See Also¶