Marker Document Conversion¶
When to use this skill¶
Convert documents (PDF, image, PPTX, DOCX, XLSX, HTML, EPUB) to markdown, JSON, chunks, or HTML. Use for extracting content from PDFs, processing academic papers, or transforming documents for analysis.
Quick start¶
Helper scripts (recommended)¶
# Simple conversion
$SKILL_PATH/scripts/convert.sh document.pdf --output_dir ./output
# Large PDFs (auto-chunking)
$SKILL_PATH/scripts/chunk-convert.sh document.pdf 20 ./output
# Batch conversion
$SKILL_PATH/scripts/batch-convert.sh ./pdfs ./output markdown
Python scripts¶
# Basic conversion
$SKILL_PATH/.venv/bin/python $SKILL_PATH/assets/basic-convert.py document.pdf output.md
# Batch processing
$SKILL_PATH/.venv/bin/python $SKILL_PATH/assets/batch-process.py ./pdfs ./output
# Extract images
$SKILL_PATH/.venv/bin/python $SKILL_PATH/assets/extract-images.py document.pdf ./images
# Extract metadata
$SKILL_PATH/.venv/bin/python $SKILL_PATH/assets/extract-metadata.py document.pdf metadata.json
# Safe conversion (with error handling)
$SKILL_PATH/.venv/bin/python $SKILL_PATH/assets/safe-convert.py document.pdf output.md
Direct commands¶
# Single document
$SKILL_PATH/.venv/bin/marker_single document.pdf --output_dir ./output
# Multiple documents
$SKILL_PATH/.venv/bin/marker /path/to/folder --workers 4
# Output formats: markdown (default), json, html, chunks
$SKILL_PATH/.venv/bin/marker_single document.pdf --output_format json
# High accuracy (slower)
$SKILL_PATH/.venv/bin/marker_single document.pdf --use_llm --force_ocr
# Page ranges
$SKILL_PATH/.venv/bin/marker_single document.pdf --page_range "0-19"
Configuration¶
Edit ${SKILL_PATH}/.config to customize cache locations or GPU settings:
export MARKER_CACHE_DIR="${SKILL_PATH}/cache"
export TORCH_HOME="${SKILL_PATH}/torch-cache"
export HF_HOME="${SKILL_PATH}/huggingface-cache"
export CUDA_VISIBLE_DEVICES="" # Empty = CPU, "0" = GPU 0, "0,1" = GPU 0+1
Common issues¶
- Out of memory: Use
$SKILL_PATH/scripts/chunk-convert.shor add--workers 1 - Slow first run: Downloads ~4GB models (cached for subsequent runs)
- Poor quality: Add
--force_ocrfor scanned PDFs,--use_llmfor complex layouts - GPU errors: Set
CUDA_VISIBLE_DEVICES=""in.configto force CPU mode
References¶
- API Reference - Complete API documentation
- Converters - Available converter types
- Output Formats - Format specifications
- Performance Tips - Optimization strategies
- Troubleshooting - Common issues and solutions