Skip to content

Marker Document Conversion

When to use this skill

Convert documents (PDF, image, PPTX, DOCX, XLSX, HTML, EPUB) to markdown, JSON, chunks, or HTML. Use for extracting content from PDFs, processing academic papers, or transforming documents for analysis.

Quick start

# Simple conversion
$SKILL_PATH/scripts/convert.sh document.pdf --output_dir ./output

# Large PDFs (auto-chunking)
$SKILL_PATH/scripts/chunk-convert.sh document.pdf 20 ./output

# Batch conversion
$SKILL_PATH/scripts/batch-convert.sh ./pdfs ./output markdown

Python scripts

# Basic conversion
$SKILL_PATH/.venv/bin/python $SKILL_PATH/assets/basic-convert.py document.pdf output.md

# Batch processing
$SKILL_PATH/.venv/bin/python $SKILL_PATH/assets/batch-process.py ./pdfs ./output

# Extract images
$SKILL_PATH/.venv/bin/python $SKILL_PATH/assets/extract-images.py document.pdf ./images

# Extract metadata
$SKILL_PATH/.venv/bin/python $SKILL_PATH/assets/extract-metadata.py document.pdf metadata.json

# Safe conversion (with error handling)
$SKILL_PATH/.venv/bin/python $SKILL_PATH/assets/safe-convert.py document.pdf output.md

Direct commands

# Single document
$SKILL_PATH/.venv/bin/marker_single document.pdf --output_dir ./output

# Multiple documents
$SKILL_PATH/.venv/bin/marker /path/to/folder --workers 4

# Output formats: markdown (default), json, html, chunks
$SKILL_PATH/.venv/bin/marker_single document.pdf --output_format json

# High accuracy (slower)
$SKILL_PATH/.venv/bin/marker_single document.pdf --use_llm --force_ocr

# Page ranges
$SKILL_PATH/.venv/bin/marker_single document.pdf --page_range "0-19"

Configuration

Edit ${SKILL_PATH}/.config to customize cache locations or GPU settings:

export MARKER_CACHE_DIR="${SKILL_PATH}/cache"
export TORCH_HOME="${SKILL_PATH}/torch-cache"
export HF_HOME="${SKILL_PATH}/huggingface-cache"
export CUDA_VISIBLE_DEVICES=""  # Empty = CPU, "0" = GPU 0, "0,1" = GPU 0+1

Common issues

  • Out of memory: Use $SKILL_PATH/scripts/chunk-convert.sh or add --workers 1
  • Slow first run: Downloads ~4GB models (cached for subsequent runs)
  • Poor quality: Add --force_ocr for scanned PDFs, --use_llm for complex layouts
  • GPU errors: Set CUDA_VISIBLE_DEVICES="" in .config to force CPU mode

References