mirror of
https://github.com/microsoft/markitdown.git
synced 2026-07-03 12:28:14 +08:00
feat: Add Azure Content Understanding converter (#1865)
* initial version
* improve MIME type detection
* prebuilt-image custom analyzer route to image
* enhance CU priority over DI
* fix: apply black formatting
* update cache of known prebuilt names and improve README
* add test cases, run black
* update README and derive content_type from the resolved file_type
* update README
78
README.md
@@ -107,6 +107,7 @@ At the moment, the following optional dependencies are available:
|
||||
* `[pdf]` Installs dependencies for PDF files
|
||||
* `[outlook]` Installs dependencies for Outlook messages
|
||||
* `[az-doc-intel]` Installs dependencies for Azure Document Intelligence
|
||||
* `[az-content-understanding]` Installs dependencies for Azure Content Understanding
|
||||
* `[audio-transcription]` Installs dependencies for audio transcription of wav and mp3 files
|
||||
* `[youtube-transcription]` Installs dependencies for fetching YouTube video transcription
|
||||
|
||||
@@ -158,6 +159,83 @@ If no `llm_client` is provided the plugin still loads, but OCR is silently skipp
|
||||
|
||||
See [`packages/markitdown-ocr/README.md`](packages/markitdown-ocr/README.md) for detailed documentation.
|
||||
|
||||
### Azure Content Understanding
|
||||
|
||||
[Azure Content Understanding](https://learn.microsoft.com/azure/ai-services/content-understanding/) provides higher-quality conversion with structured field extraction (YAML front matter), multi-modal support (documents, images, audio, video), and configurable analyzers.
|
||||
|
||||
Install: `pip install 'markitdown[az-content-understanding]'`
|
||||
|
||||
#### When to use Content Understanding
|
||||
|
||||
Content Understanding is ideal when you need capabilities beyond what built-in or Document Intelligence converters provide:
|
||||
|
||||
- **Audio and video files** — CU is the only option for video, and the higher-quality cloud option for audio. Built-in converters have no video support and only basic audio transcription.
|
||||
- **Structured field extraction** — [Prebuilt](https://learn.microsoft.com/azure/ai-services/content-understanding/concepts/prebuilt-analyzers) or [custom-built](https://learn.microsoft.com/azure/ai-services/content-understanding/how-to/customize-analyzer-content-understanding-studio?tabs=portal) analyzers extract domain-specific fields (invoice amounts, receipt dates, contract clauses) serialized as YAML front matter. Neither built-in nor Doc Intel integration exposes fields.
|
||||
- **Higher-quality document extraction** — Cloud-based layout analysis and OCR for scanned PDFs, complex tables, and multi-page documents.
|
||||
- **Single API for all modalities** — One `cu_endpoint` handles documents, images, audio, and video with automatic analyzer routing.
|
||||
|
||||
| Capability | Built-in converters | Azure Document Intelligence | Azure Content Understanding |
|------------|---------------------|-----------------------------|-----------------------------|
| Document conversion | Offline, format-specific extraction | Cloud layout extraction | Cloud multimodal extraction |
| Structured fields | Not available | Not exposed by this integration | YAML front matter from analyzer fields |
| Custom analyzers | Not available | Not configurable in this integration | Supported with `cu_analyzer_id` |
| Audio and video | Basic audio, no video | Not supported | Audio and video analyzers |
| Cost | Local compute only | Billable Azure API calls | Billable Azure API calls |
|
||||
|
||||
**CLI:**
|
||||
|
||||
```bash
markitdown path-to-file.pdf --use-cu --cu-endpoint "<content_understanding_endpoint>"
```
|
||||
|
||||
**Python API:**
|
||||
|
||||
```python
from markitdown import MarkItDown

# Zero-config: auto-selects analyzer per file type
md = MarkItDown(cu_endpoint="<content_understanding_endpoint>")
result = md.convert("report.pdf")   # documents → prebuilt-documentSearch
result = md.convert("meeting.mp4")  # video → prebuilt-videoSearch
result = md.convert("call.wav")     # audio → prebuilt-audioSearch
print(result.markdown)
```
|
||||
|
||||
**With a custom analyzer** (for domain-specific field extraction):
|
||||
|
||||
```python
from markitdown import MarkItDown

md = MarkItDown(
    cu_endpoint="<content_understanding_endpoint>",
    cu_analyzer_id="my-invoice-analyzer",
)
result = md.convert("invoice.pdf")
print(result.markdown)
# Output includes YAML front matter with extracted fields:
# ---
# contentType: document
# fields:
#   VendorName: CONTOSO LTD.
#   InvoiceDate: '2019-11-15'
# ---
# <!-- page 1 -->
# ...
```
|
||||
|
||||
When `cu_analyzer_id` is set, the converter automatically scopes it to compatible file types based on the analyzer's modality. Incompatible types (e.g., audio files with a document analyzer) auto-route to default prebuilt analyzers.
|
||||
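The routing rule in the paragraph above can be sketched in a few lines. This is a minimal, standalone illustration modeled on the compatibility check in the converter added by this commit, not the converter itself: `pick_analyzer` and `is_compatible` are hypothetical names, and the analyzer's modality is passed in directly rather than resolved from the service.

```python
from typing import Optional

# Default prebuilt analyzer per file modality, as listed in this commit.
DEFAULT_ANALYZERS = {
    "document": "prebuilt-documentSearch",
    "image": "prebuilt-documentSearch",
    "video": "prebuilt-videoSearch",
    "audio": "prebuilt-audioSearch",
}


def is_compatible(file_modality: str, analyzer_modality: str) -> bool:
    """A document analyzer also accepts images; otherwise modalities must match."""
    if analyzer_modality == "document":
        return file_modality in {"document", "image"}
    return file_modality == analyzer_modality


def pick_analyzer(
    file_modality: str,
    custom_id: Optional[str],
    custom_modality: Optional[str],
) -> str:
    """Use the custom analyzer when compatible, else fall back to a prebuilt one."""
    if (
        custom_id is not None
        and custom_modality is not None
        and is_compatible(file_modality, custom_modality)
    ):
        return custom_id
    return DEFAULT_ANALYZERS.get(file_modality, "prebuilt-documentSearch")
```

Under these assumptions, a document-based `my-invoice-analyzer` would handle PDFs and images, while a `.wav` file passed to the same `MarkItDown` instance would fall back to `prebuilt-audioSearch`.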
|
||||
**Cost note:** Each `convert()` call for a CU-routed format is a billable Azure API call. Use `cu_file_types` to restrict which formats route to CU:
|
||||
|
||||
```python
from markitdown import MarkItDown
from markitdown.converters import ContentUnderstandingFileType

md = MarkItDown(
    cu_endpoint="<content_understanding_endpoint>",
    cu_file_types=[ContentUnderstandingFileType.PDF],  # only PDFs use CU
)
```
|
||||
|
||||
More information about Azure Content Understanding can be found [here](https://learn.microsoft.com/azure/ai-services/content-understanding/).
|
||||
|
||||
### Azure Document Intelligence
|
||||
|
||||
To use Microsoft Document Intelligence for conversion:
|
||||
|
||||
@@ -47,6 +47,7 @@ all = [
|
||||
"SpeechRecognition",
|
||||
"youtube-transcript-api~=1.0.0",
|
||||
"azure-ai-documentintelligence",
|
||||
"azure-ai-contentunderstanding>=1.2.0b1",
|
||||
"azure-identity",
|
||||
]
|
||||
pptx = ["python-pptx"]
|
||||
@@ -58,6 +59,8 @@ outlook = ["olefile"]
|
||||
audio-transcription = ["pydub", "SpeechRecognition"]
|
||||
youtube-transcription = ["youtube-transcript-api"]
|
||||
az-doc-intel = ["azure-ai-documentintelligence", "azure-identity"]
|
||||
# >=1.2.0b1 required for to_llm_input() helper used by ContentUnderstandingConverter
|
||||
az-content-understanding = ["azure-ai-contentunderstanding>=1.2.0b1", "azure-identity"]
|
||||
|
||||
[project.urls]
|
||||
Documentation = "https://github.com/microsoft/markitdown#readme"
|
||||
|
||||
@@ -4,6 +4,7 @@
|
||||
import argparse
|
||||
import sys
|
||||
import codecs
|
||||
from typing import Any, Dict
|
||||
from textwrap import dedent
|
||||
from importlib.metadata import entry_points
|
||||
from .__about__ import __version__
|
||||
@@ -77,13 +78,22 @@ def main():
|
||||
help="Provide a hint about the file's charset (e.g, UTF-8).",
|
||||
)
|
||||
|
||||
parser.add_argument(
|
||||
cloud_group = parser.add_mutually_exclusive_group()
|
||||
cloud_group.add_argument(
|
||||
"-d",
|
||||
"--use-docintel",
|
||||
action="store_true",
|
||||
help="Use Document Intelligence to extract text instead of offline conversion. Requires a valid Document Intelligence Endpoint.",
|
||||
)
|
||||
|
||||
cloud_group.add_argument(
|
||||
"--use-cu",
|
||||
"--use-content-understanding",
|
||||
action="store_true",
|
||||
dest="use_cu",
|
||||
help="Use Azure Content Understanding to extract text. Requires --cu-endpoint.",
|
||||
)
|
||||
|
||||
parser.add_argument(
|
||||
"-e",
|
||||
"--endpoint",
|
||||
@@ -91,6 +101,24 @@ def main():
|
||||
help="Document Intelligence Endpoint. Required if using Document Intelligence.",
|
||||
)
|
||||
|
||||
parser.add_argument(
|
||||
"--cu-endpoint",
|
||||
type=str,
|
||||
help="Content Understanding Endpoint. Required if using --use-cu.",
|
||||
)
|
||||
|
||||
parser.add_argument(
|
||||
"--cu-analyzer",
|
||||
type=str,
|
||||
help="Content Understanding analyzer ID. If not specified, auto-selects by file type.",
|
||||
)
|
||||
|
||||
parser.add_argument(
|
||||
"--cu-file-types",
|
||||
type=str,
|
||||
help="Comma-separated list of file types to route to Content Understanding (e.g., pdf,jpeg,mp4). If omitted, all supported types are routed.",
|
||||
)
|
||||
|
||||
parser.add_argument(
|
||||
"-p",
|
||||
"--use-plugins",
|
||||
@@ -183,6 +211,36 @@ def main():
|
||||
markitdown = MarkItDown(
|
||||
enable_plugins=args.use_plugins, docintel_endpoint=args.endpoint
|
||||
)
|
||||
    elif args.use_cu:
        if args.cu_endpoint is None:
            _exit_with_error(
                "Content Understanding Endpoint (--cu-endpoint) is required when using --use-cu."
            )
        elif args.filename is None:
            _exit_with_error("Filename is required when using Content Understanding.")

        cu_kwargs: Dict[str, Any] = {
            "cu_endpoint": args.cu_endpoint,
        }
        if args.cu_analyzer is not None:
            cu_kwargs["cu_analyzer_id"] = args.cu_analyzer
        if args.cu_file_types is not None:
            # Parse comma-separated file types into ContentUnderstandingFileType list
            from .converters import ContentUnderstandingFileType

            type_names = [
                t.strip().lower() for t in args.cu_file_types.split(",") if t.strip()
            ]
            cu_types = []
            for name in type_names:
                # Try matching by value (e.g., "pdf", "jpeg", "mp4")
                try:
                    cu_types.append(ContentUnderstandingFileType(name))
                except ValueError:
                    _exit_with_error(f"Unknown file type: {name}")
            cu_kwargs["cu_file_types"] = cu_types

        markitdown = MarkItDown(enable_plugins=args.use_plugins, **cu_kwargs)
    else:
        markitdown = MarkItDown(enable_plugins=args.use_plugins)
|
||||
|
||||
|
||||
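The `--cu-file-types` handling in the hunk above can be tried in isolation. The sketch below is a standalone approximation: `FileType` is a hypothetical three-member stand-in for `ContentUnderstandingFileType`, and `parse_file_types` raises `ValueError` where the CLI calls `_exit_with_error`.

```python
from enum import Enum
from typing import List


class FileType(str, Enum):
    """Hypothetical stand-in for ContentUnderstandingFileType (subset only)."""

    PDF = "pdf"
    JPEG = "jpeg"
    MP4 = "mp4"


def parse_file_types(raw: str) -> List[FileType]:
    """Split a comma-separated string and map each name to an enum member by value."""
    names = [t.strip().lower() for t in raw.split(",") if t.strip()]
    parsed: List[FileType] = []
    for name in names:
        try:
            parsed.append(FileType(name))
        except ValueError:
            # The CLI exits with an error here; this sketch re-raises instead.
            raise ValueError(f"Unknown file type: {name}") from None
    return parsed
```

Because the lookup is by enum value, names must match the values exactly after lowercasing: `jpeg` is accepted, while `jpg` is rejected even though the `.jpg` extension is routed to the JPEG type elsewhere in the converter.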
@@ -38,6 +38,7 @@ from .converters import (
|
||||
ZipConverter,
|
||||
EpubConverter,
|
||||
DocumentIntelligenceConverter,
|
||||
ContentUnderstandingConverter,
|
||||
CsvConverter,
|
||||
)
|
||||
|
||||
@@ -225,6 +226,28 @@ class MarkItDown:
|
||||
DocumentIntelligenceConverter(**docintel_args),
|
||||
)
|
||||
|
||||
# Register Content Understanding converter at the top of the stack if endpoint is provided
|
||||
cu_endpoint = kwargs.get("cu_endpoint")
|
||||
if cu_endpoint is not None:
|
||||
cu_args: Dict[str, Any] = {}
|
||||
cu_args["endpoint"] = cu_endpoint
|
||||
|
||||
cu_credential = kwargs.get("cu_credential")
|
||||
if cu_credential is not None:
|
||||
cu_args["credential"] = cu_credential
|
||||
|
||||
cu_analyzer_id = kwargs.get("cu_analyzer_id")
|
||||
if cu_analyzer_id is not None:
|
||||
cu_args["analyzer_id"] = cu_analyzer_id
|
||||
|
||||
cu_file_types = kwargs.get("cu_file_types")
|
||||
if cu_file_types is not None:
|
||||
cu_args["file_types"] = cu_file_types
|
||||
|
||||
self.register_converter(
|
||||
ContentUnderstandingConverter(**cu_args),
|
||||
)
|
||||
|
||||
self._builtins_enabled = True
|
||||
else:
|
||||
warn("Built-in converters are already enabled.", RuntimeWarning)
|
||||
|
||||
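When `cu_credential` is not passed, the converter added later in this commit falls back first to the `AZURE_API_KEY` environment variable and then to `DefaultAzureCredential`. The order can be sketched without the Azure SDK installed; `resolve_credential` is a hypothetical helper that only reports which source would be chosen, whereas the real converter constructs `AzureKeyCredential` or `DefaultAzureCredential` objects.

```python
import os
from typing import Optional, Tuple


def resolve_credential(explicit: Optional[object] = None) -> Tuple[str, Optional[object]]:
    """Return (source, value) describing which credential would be used."""
    if explicit is not None:
        # 1. An explicitly supplied credential always wins.
        return ("explicit", explicit)
    api_key = os.environ.get("AZURE_API_KEY")
    if api_key is not None:
        # 2. The converter wraps this key in AzureKeyCredential.
        return ("api_key", api_key)
    # 3. Otherwise the converter instantiates DefaultAzureCredential().
    return ("default_azure_credential", None)
```

Note that the check is `is not None`, so an `AZURE_API_KEY` that is set but empty would still be treated as a key rather than triggering the `DefaultAzureCredential` fallback.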
@@ -21,6 +21,10 @@ from ._doc_intel_converter import (
|
||||
DocumentIntelligenceConverter,
|
||||
DocumentIntelligenceFileType,
|
||||
)
|
||||
from ._cu_converter import (
|
||||
ContentUnderstandingConverter,
|
||||
ContentUnderstandingFileType,
|
||||
)
|
||||
from ._epub_converter import EpubConverter
|
||||
from ._csv_converter import CsvConverter
|
||||
|
||||
@@ -43,6 +47,8 @@ __all__ = [
|
||||
"ZipConverter",
|
||||
"DocumentIntelligenceConverter",
|
||||
"DocumentIntelligenceFileType",
|
||||
"ContentUnderstandingConverter",
|
||||
"ContentUnderstandingFileType",
|
||||
"EpubConverter",
|
||||
"CsvConverter",
|
||||
]
|
||||
|
||||
570
packages/markitdown/src/markitdown/converters/_cu_converter.py
Normal file
@@ -0,0 +1,570 @@
|
||||
"""Azure Content Understanding converter for MarkItDown.
|
||||
|
||||
Converts files using Azure Content Understanding (CU) for high-quality,
|
||||
multi-modal extraction with structured field output. Supports documents,
|
||||
images, audio, and video. Fields are serialized as YAML front matter via
|
||||
the CU SDK's ``to_llm_input()`` helper.
|
||||
|
||||
Install dependencies: ``pip install markitdown[az-content-understanding]``
|
||||
"""
|
||||
|
||||
import sys
|
||||
import os
|
||||
from typing import BinaryIO, Any, List, Optional, Dict
|
||||
from enum import Enum
|
||||
|
||||
from .._base_converter import DocumentConverter, DocumentConverterResult
|
||||
from .._stream_info import StreamInfo
|
||||
from .._exceptions import MissingDependencyException
|
||||
|
||||
# Try loading optional dependencies — save error for later
|
||||
_dependency_exc_info = None
|
||||
try:
|
||||
from azure.ai.contentunderstanding import ContentUnderstandingClient, to_llm_input
|
||||
from azure.core.credentials import AzureKeyCredential, TokenCredential
|
||||
from azure.core.pipeline.policies import UserAgentPolicy
|
||||
from azure.identity import DefaultAzureCredential
|
||||
except ImportError:
|
||||
_dependency_exc_info = sys.exc_info()
|
||||
|
||||
# Stub classes for type hinting
|
||||
class AzureKeyCredential: # type: ignore[no-redef]
|
||||
pass
|
||||
|
||||
class TokenCredential: # type: ignore[no-redef]
|
||||
pass
|
||||
|
||||
class ContentUnderstandingClient: # type: ignore[no-redef]
|
||||
pass
|
||||
|
||||
class UserAgentPolicy: # type: ignore[no-redef]
|
||||
pass
|
||||
|
||||
class DefaultAzureCredential: # type: ignore[no-redef]
|
||||
pass
|
||||
|
||||
def to_llm_input(*args, **kwargs): # type: ignore[no-redef]
|
||||
pass
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# File type enum and routing tables
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
|
||||
class ContentUnderstandingFileType(str, Enum):
|
||||
"""Supported file types for Content Understanding conversion."""
|
||||
|
||||
# Documents
|
||||
PDF = "pdf"
|
||||
DOCX = "docx"
|
||||
PPTX = "pptx"
|
||||
XLSX = "xlsx"
|
||||
HTML = "html"
|
||||
TXT = "txt"
|
||||
MD = "md"
|
||||
RTF = "rtf"
|
||||
XML = "xml"
|
||||
|
||||
# Email
|
||||
EML = "eml"
|
||||
MSG = "msg"
|
||||
|
||||
# Images (document modality)
|
||||
JPEG = "jpeg"
|
||||
PNG = "png"
|
||||
BMP = "bmp"
|
||||
TIFF = "tiff"
|
||||
HEIF = "heif"
|
||||
|
||||
# Video
|
||||
MP4 = "mp4"
|
||||
M4V = "m4v"
|
||||
MOV = "mov"
|
||||
AVI = "avi"
|
||||
MKV = "mkv"
|
||||
WEBM = "webm"
|
||||
FLV = "flv"
|
||||
WMV = "wmv"
|
||||
|
||||
# Audio
|
||||
WAV = "wav"
|
||||
MP3 = "mp3"
|
||||
M4A = "m4a"
|
||||
FLAC = "flac"
|
||||
OGG = "ogg"
|
||||
AAC = "aac"
|
||||
WMA = "wma"
|
||||
|
||||
|
||||
# Extension → file type
|
||||
_EXTENSION_MAP: Dict[str, ContentUnderstandingFileType] = {
|
||||
# Documents
|
||||
".pdf": ContentUnderstandingFileType.PDF,
|
||||
".docx": ContentUnderstandingFileType.DOCX,
|
||||
".pptx": ContentUnderstandingFileType.PPTX,
|
||||
".xlsx": ContentUnderstandingFileType.XLSX,
|
||||
".html": ContentUnderstandingFileType.HTML,
|
||||
".txt": ContentUnderstandingFileType.TXT,
|
||||
".md": ContentUnderstandingFileType.MD,
|
||||
".rtf": ContentUnderstandingFileType.RTF,
|
||||
".xml": ContentUnderstandingFileType.XML,
|
||||
# Email
|
||||
".eml": ContentUnderstandingFileType.EML,
|
||||
".msg": ContentUnderstandingFileType.MSG,
|
||||
# Images
|
||||
".jpg": ContentUnderstandingFileType.JPEG,
|
||||
".jpeg": ContentUnderstandingFileType.JPEG,
|
||||
".jpe": ContentUnderstandingFileType.JPEG,
|
||||
".png": ContentUnderstandingFileType.PNG,
|
||||
".bmp": ContentUnderstandingFileType.BMP,
|
||||
".tiff": ContentUnderstandingFileType.TIFF,
|
||||
".heif": ContentUnderstandingFileType.HEIF,
|
||||
".heic": ContentUnderstandingFileType.HEIF,
|
||||
# Video
|
||||
".mp4": ContentUnderstandingFileType.MP4,
|
||||
".m4v": ContentUnderstandingFileType.M4V,
|
||||
".mov": ContentUnderstandingFileType.MOV,
|
||||
".avi": ContentUnderstandingFileType.AVI,
|
||||
".mkv": ContentUnderstandingFileType.MKV,
|
||||
".webm": ContentUnderstandingFileType.WEBM,
|
||||
".flv": ContentUnderstandingFileType.FLV,
|
||||
".wmv": ContentUnderstandingFileType.WMV,
|
||||
# Audio
|
||||
".wav": ContentUnderstandingFileType.WAV,
|
||||
".mp3": ContentUnderstandingFileType.MP3,
|
||||
".m4a": ContentUnderstandingFileType.M4A,
|
||||
".flac": ContentUnderstandingFileType.FLAC,
|
||||
".ogg": ContentUnderstandingFileType.OGG,
|
||||
".aac": ContentUnderstandingFileType.AAC,
|
||||
".wma": ContentUnderstandingFileType.WMA,
|
||||
}
|
||||
|
||||
# MIME type prefixes for each file type
|
||||
_MIME_PREFIXES: Dict[ContentUnderstandingFileType, List[str]] = {
|
||||
# Documents
|
||||
ContentUnderstandingFileType.PDF: ["application/pdf", "application/x-pdf"],
|
||||
ContentUnderstandingFileType.DOCX: [
|
||||
"application/vnd.openxmlformats-officedocument.wordprocessingml.document"
|
||||
],
|
||||
ContentUnderstandingFileType.PPTX: [
|
||||
"application/vnd.openxmlformats-officedocument.presentationml"
|
||||
],
|
||||
ContentUnderstandingFileType.XLSX: [
|
||||
"application/vnd.openxmlformats-officedocument.spreadsheetml.sheet"
|
||||
],
|
||||
ContentUnderstandingFileType.HTML: ["text/html", "application/xhtml+xml"],
|
||||
ContentUnderstandingFileType.TXT: ["text/plain"],
|
||||
ContentUnderstandingFileType.MD: ["text/markdown"],
|
||||
ContentUnderstandingFileType.RTF: ["text/rtf", "application/rtf"],
|
||||
ContentUnderstandingFileType.XML: ["text/xml", "application/xml"],
|
||||
# Email
|
||||
ContentUnderstandingFileType.EML: ["message/rfc822"],
|
||||
ContentUnderstandingFileType.MSG: ["application/vnd.ms-outlook"],
|
||||
# Images
|
||||
ContentUnderstandingFileType.JPEG: ["image/jpeg"],
|
||||
ContentUnderstandingFileType.PNG: ["image/png"],
|
||||
ContentUnderstandingFileType.BMP: ["image/bmp"],
|
||||
ContentUnderstandingFileType.TIFF: ["image/tiff"],
|
||||
ContentUnderstandingFileType.HEIF: ["image/heif", "image/heic"],
|
||||
# Video
|
||||
ContentUnderstandingFileType.MP4: ["video/mp4"],
|
||||
ContentUnderstandingFileType.M4V: ["video/x-m4v"],
|
||||
ContentUnderstandingFileType.MOV: ["video/quicktime"],
|
||||
ContentUnderstandingFileType.AVI: ["video/x-msvideo"],
|
||||
ContentUnderstandingFileType.MKV: ["video/x-matroska"],
|
||||
ContentUnderstandingFileType.WEBM: ["video/webm"],
|
||||
ContentUnderstandingFileType.FLV: ["video/x-flv"],
|
||||
ContentUnderstandingFileType.WMV: ["video/x-ms-wmv"],
|
||||
# Audio
|
||||
ContentUnderstandingFileType.WAV: ["audio/wav", "audio/x-wav"],
|
||||
ContentUnderstandingFileType.MP3: ["audio/mpeg", "audio/mp3"],
|
||||
ContentUnderstandingFileType.M4A: ["audio/mp4", "audio/m4a", "audio/x-m4a"],
|
||||
ContentUnderstandingFileType.FLAC: ["audio/flac", "audio/x-flac"],
|
||||
ContentUnderstandingFileType.OGG: ["audio/ogg"],
|
||||
ContentUnderstandingFileType.AAC: ["audio/aac"],
|
||||
ContentUnderstandingFileType.WMA: ["audio/x-ms-wma"],
|
||||
}
|
||||
|
||||
_MIME_ALIASES: Dict[str, str] = {
|
||||
"audio/x-wav": "audio/wav",
|
||||
"audio/x-flac": "audio/flac",
|
||||
"audio/x-m4a": "audio/mp4",
|
||||
"video/x-m4v": "video/mp4",
|
||||
}
|
||||
|
||||
# File type → modality category
|
||||
_DOCUMENT_TYPES = {
|
||||
ContentUnderstandingFileType.PDF,
|
||||
ContentUnderstandingFileType.DOCX,
|
||||
ContentUnderstandingFileType.PPTX,
|
||||
ContentUnderstandingFileType.XLSX,
|
||||
ContentUnderstandingFileType.HTML,
|
||||
ContentUnderstandingFileType.TXT,
|
||||
ContentUnderstandingFileType.MD,
|
||||
ContentUnderstandingFileType.RTF,
|
||||
ContentUnderstandingFileType.XML,
|
||||
ContentUnderstandingFileType.EML,
|
||||
ContentUnderstandingFileType.MSG,
|
||||
}
|
||||
|
||||
_IMAGE_TYPES = {
|
||||
ContentUnderstandingFileType.JPEG,
|
||||
ContentUnderstandingFileType.PNG,
|
||||
ContentUnderstandingFileType.BMP,
|
||||
ContentUnderstandingFileType.TIFF,
|
||||
ContentUnderstandingFileType.HEIF,
|
||||
}
|
||||
|
||||
_VIDEO_TYPES = {
|
||||
ContentUnderstandingFileType.MP4,
|
||||
ContentUnderstandingFileType.M4V,
|
||||
ContentUnderstandingFileType.MOV,
|
||||
ContentUnderstandingFileType.AVI,
|
||||
ContentUnderstandingFileType.MKV,
|
||||
ContentUnderstandingFileType.WEBM,
|
||||
ContentUnderstandingFileType.FLV,
|
||||
ContentUnderstandingFileType.WMV,
|
||||
}
|
||||
|
||||
_AUDIO_TYPES = {
|
||||
ContentUnderstandingFileType.WAV,
|
||||
ContentUnderstandingFileType.MP3,
|
||||
ContentUnderstandingFileType.M4A,
|
||||
ContentUnderstandingFileType.FLAC,
|
||||
ContentUnderstandingFileType.OGG,
|
||||
ContentUnderstandingFileType.AAC,
|
||||
ContentUnderstandingFileType.WMA,
|
||||
}
|
||||
|
||||
_PREBUILT_ANALYZERS = {
|
||||
"document": "prebuilt-documentSearch",
|
||||
"image": "prebuilt-documentSearch",
|
||||
"video": "prebuilt-videoSearch",
|
||||
"audio": "prebuilt-audioSearch",
|
||||
}
|
||||
|
||||
# All supported file types (default set when file_types is None)
|
||||
_ALL_FILE_TYPES = list(ContentUnderstandingFileType)
|
||||
|
||||
|
||||
def _get_modality(file_type: ContentUnderstandingFileType) -> str:
    """Get the modality category for a file type."""
    if file_type in _DOCUMENT_TYPES:
        return "document"
    elif file_type in _IMAGE_TYPES:
        return "image"
    elif file_type in _VIDEO_TYPES:
        return "video"
    elif file_type in _AUDIO_TYPES:
        return "audio"
    raise ValueError(f"Unknown file type: {file_type}")
|
||||
|
||||
|
||||
def _detect_file_type(
    stream_info: StreamInfo,
    file_types: Optional[List[ContentUnderstandingFileType]] = None,
) -> Optional[ContentUnderstandingFileType]:
    """Detect a supported CU file type from extension or MIME type."""
    allowed = set(file_types) if file_types is not None else None

    extension = (stream_info.extension or "").lower()
    file_type = _EXTENSION_MAP.get(extension)
    if file_type is not None and (allowed is None or file_type in allowed):
        return file_type

    mimetype = _clean_mime_type(stream_info.mimetype)
    if not mimetype:
        return None

    return _detect_file_type_from_mime(mimetype, allowed)
|
||||
|
||||
|
||||
def _clean_mime_type(mimetype: Optional[str]) -> str:
    return (mimetype or "").split(";", 1)[0].strip().lower()


def _canonical_mime_type(mimetype: Optional[str]) -> str:
    cleaned = _clean_mime_type(mimetype)
    return _MIME_ALIASES.get(cleaned, cleaned) or "application/octet-stream"
|
||||
|
||||
|
||||
def _content_type_for(
|
||||
file_type: ContentUnderstandingFileType,
|
||||
mimetype: Optional[str],
|
||||
) -> str:
|
||||
"""Resolve the content type to send to the CU API.
|
||||
|
||||
Uses the resolved ``file_type`` as the source of truth so analyzer
|
||||
routing and payload metadata stay consistent. The caller-provided
|
||||
``mimetype`` is only used when it is consistent with ``file_type``
|
||||
(e.g., to preserve subtype distinctions like ``image/heic`` vs
|
||||
``image/heif``). When ``mimetype`` disagrees with the resolved
|
||||
``file_type`` (e.g., ``.pdf`` extension with ``audio/mpeg``
|
||||
mimetype), the canonical MIME type for ``file_type`` is used.
|
||||
"""
|
||||
prefixes = _MIME_PREFIXES.get(file_type, [])
|
||||
canonical = _canonical_mime_type(mimetype)
|
||||
|
||||
# Use caller-provided MIME if it's consistent with the resolved file_type
|
||||
if prefixes and canonical != "application/octet-stream":
|
||||
for prefix in prefixes:
|
||||
if canonical.startswith(prefix):
|
||||
return canonical
|
||||
|
||||
# Fallback: derive from the resolved file_type (single source of truth)
|
||||
if prefixes:
|
||||
return _canonical_mime_type(prefixes[0])
|
||||
|
||||
return canonical
|
||||
|
||||
|
||||
def _detect_file_type_from_mime(
    mimetype: str,
    allowed: Optional[set[ContentUnderstandingFileType]],
) -> Optional[ContentUnderstandingFileType]:
    for candidate, prefixes in _MIME_PREFIXES.items():
        if allowed is not None and candidate not in allowed:
            continue
        for prefix in prefixes:
            if mimetype.startswith(prefix):
                return candidate
    return None
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Smart routing: base_analyzer_id → modality mapping
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
_BASE_TO_MODALITY: Dict[str, str] = {
|
||||
"prebuilt-document": "document",
|
||||
"prebuilt-image": "image",
|
||||
"prebuilt-audio": "audio",
|
||||
"prebuilt-video": "video",
|
||||
}
|
||||
|
||||
# Cache of known prebuilt analyzer name → modality (avoids API call)
|
||||
_KNOWN_PREBUILT_MODALITY: Dict[str, str] = {
|
||||
# Document-based prebuilts
|
||||
"prebuilt-documentSearch": "document",
|
||||
"prebuilt-layout": "document",
|
||||
"prebuilt-read": "document",
|
||||
"prebuilt-document": "document",
|
||||
"prebuilt-invoice": "document",
|
||||
"prebuilt-receipt": "document",
|
||||
"prebuilt-receipt.generic": "document",
|
||||
"prebuilt-receipt.hotel": "document",
|
||||
"prebuilt-idDocument": "document",
|
||||
"prebuilt-idDocument.generic": "document",
|
||||
"prebuilt-idDocument.passport": "document",
|
||||
"prebuilt-healthInsuranceCard.us": "document",
|
||||
"prebuilt-contract": "document",
|
||||
"prebuilt-creditCard": "document",
|
||||
"prebuilt-creditMemo": "document",
|
||||
"prebuilt-bankStatement.us": "document",
|
||||
"prebuilt-check.us": "document",
|
||||
"prebuilt-purchaseOrder": "document",
|
||||
"prebuilt-procurement": "document",
|
||||
"prebuilt-payStub.us": "document",
|
||||
"prebuilt-utilityBill": "document",
|
||||
"prebuilt-marriageCertificate.us": "document",
|
||||
"prebuilt-documentFieldSchema": "document",
|
||||
"prebuilt-documentFields": "document",
|
||||
# Tax prebuilts (all document-based)
|
||||
"prebuilt-tax.us": "document",
|
||||
"prebuilt-tax.us.w2": "document",
|
||||
"prebuilt-tax.us.w4": "document",
|
||||
"prebuilt-tax.us.1040": "document",
|
||||
# Mortgage prebuilts
|
||||
"prebuilt-mortgage.us": "document",
|
||||
"prebuilt-mortgage.us.1003": "document",
|
||||
"prebuilt-mortgage.us.closingDisclosure": "document",
|
||||
# Image-based prebuilts
|
||||
"prebuilt-image": "image",
|
||||
"prebuilt-imageSearch": "image",
|
||||
# Audio-based prebuilts
|
||||
"prebuilt-audio": "audio",
|
||||
"prebuilt-audioSearch": "audio",
|
||||
"prebuilt-callCenter": "audio",
|
||||
# Video-based prebuilts
|
||||
"prebuilt-video": "video",
|
||||
"prebuilt-videoSearch": "video",
|
||||
"prebuilt-videoSynopsis": "video",
|
||||
}
|
||||
|
||||
|
||||
def _resolve_analyzer_modality(client: Any, analyzer_id: str) -> str:
|
||||
"""Resolve analyzer modality from cache or via get_analyzer() fallback.
|
||||
|
||||
For known prebuilt-* names, returns the modality from
|
||||
``_KNOWN_PREBUILT_MODALITY`` without an API call. For unknown
|
||||
prebuilt-* names or custom analyzers, calls ``get_analyzer()``
|
||||
to inspect ``base_analyzer_id``.
|
||||
|
||||
Args:
|
||||
client: A ``ContentUnderstandingClient`` instance.
|
||||
analyzer_id: The analyzer ID to resolve.
|
||||
|
||||
Returns:
|
||||
Modality string ("document", "image", "audio", or "video").
|
||||
|
||||
Raises:
|
||||
ValueError: If ``get_analyzer()`` fails.
|
||||
"""
|
||||
# Known prebuilt — use cache, no API call
|
||||
if analyzer_id in _KNOWN_PREBUILT_MODALITY:
|
||||
return _KNOWN_PREBUILT_MODALITY[analyzer_id]
|
||||
|
||||
# Unknown prebuilt or custom analyzer — call get_analyzer()
|
||||
try:
|
||||
analyzer_info = client.get_analyzer(analyzer_id)
|
||||
except Exception as exc:
|
||||
raise ValueError(f"Failed to resolve analyzer '{analyzer_id}': {exc}") from exc
|
||||
|
||||
if analyzer_info.base_analyzer_id:
|
||||
return _BASE_TO_MODALITY.get(analyzer_info.base_analyzer_id, "document")
|
||||
return "document"
|
||||
|
||||
|
||||
def _is_analyzer_compatible(file_modality: str, analyzer_modality: str) -> bool:
    """Return True when an analyzer modality can process a file modality."""
    if analyzer_modality == "document":
        return file_modality in {"document", "image"}
    return file_modality == analyzer_modality
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Converter
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
|
||||
class ContentUnderstandingConverter(DocumentConverter):
|
||||
"""Converts files using Azure Content Understanding.
|
||||
|
||||
Provides high-quality document, image, audio, and video conversion
|
||||
with structured field extraction via YAML front matter.
|
||||
"""
|
||||
|
||||
def __init__(
|
||||
self,
|
||||
*,
|
||||
endpoint: str,
|
||||
credential: AzureKeyCredential | TokenCredential | None = None,
|
||||
analyzer_id: Optional[str] = None,
|
||||
file_types: Optional[List[ContentUnderstandingFileType]] = None,
|
||||
):
|
||||
"""Initialize the Content Understanding converter.
|
||||
|
||||
Args:
|
||||
endpoint: CU resource endpoint URL.
|
||||
credential: Explicit credential. If None, falls back to
|
||||
AZURE_API_KEY env var, then DefaultAzureCredential.
|
||||
analyzer_id: Custom analyzer for compatible file types.
|
||||
When set, the converter checks the analyzer's base modality
|
||||
(via get_analyzer() at init) and routes only compatible
|
||||
file types to it. Incompatible modalities auto-route to
|
||||
default prebuilts. If None, auto-selects by extension/MIME.
|
||||
file_types: Which file types to handle. If None, uses the
|
||||
default set (all supported formats).
|
||||
"""
|
||||
super().__init__()
|
||||
|
||||
# Raise if dependencies are missing
|
||||
if _dependency_exc_info is not None:
|
||||
raise MissingDependencyException(
|
||||
"ContentUnderstandingConverter requires the optional dependency "
|
||||
"[az-content-understanding] (or [all]) to be installed. "
|
||||
"E.g., `pip install markitdown[az-content-understanding]`"
|
||||
) from _dependency_exc_info[
|
||||
1
|
||||
].with_traceback( # type: ignore[union-attr]
|
||||
_dependency_exc_info[2]
|
||||
)
|
||||
|
||||
self._file_types = file_types if file_types is not None else _ALL_FILE_TYPES
|
||||
self._analyzer_id = analyzer_id
|
||||
self._analyzer_modality: Optional[str] = None
|
||||
|
||||
# Resolve credential
|
||||
if credential is None:
|
||||
api_key = os.environ.get("AZURE_API_KEY")
|
||||
if api_key is not None:
|
||||
credential = AzureKeyCredential(api_key)
|
||||
else:
|
||||
credential = DefaultAzureCredential()
|
||||
|
||||
# User agent for telemetry
|
||||
try:
|
||||
from ..__about__ import __version__
|
||||
except ImportError:
|
||||
__version__ = "unknown"
|
||||
user_agent = f"markitdown-cu/{__version__}"
|
||||
|
||||
# Create CU client
|
||||
self._client = ContentUnderstandingClient(
|
||||
endpoint=endpoint,
|
||||
credential=credential,
|
||||
user_agent_policy=UserAgentPolicy(user_agent=user_agent),
|
||||
)
|
||||
|
||||
# Smart routing: resolve analyzer modality at init (at most one API call)
|
||||
if self._analyzer_id is not None:
|
||||
self._analyzer_modality = _resolve_analyzer_modality(
|
||||
self._client, self._analyzer_id
|
||||
)
|
||||
|
||||
def accepts(
|
||||
self,
|
||||
file_stream: BinaryIO,
|
||||
stream_info: StreamInfo,
|
||||
**kwargs: Any,
|
||||
) -> bool:
|
||||
"""Return True if the file type is in the configured set."""
|
||||
return _detect_file_type(stream_info, self._file_types) is not None
|
||||
|
||||
    def convert(
        self,
        file_stream: BinaryIO,
        stream_info: StreamInfo,
        **kwargs: Any,
    ) -> DocumentConverterResult:
        """Convert the file using CU and return Markdown with YAML front matter."""

        # 1. Determine analyzer_id (smart routing: check modality)
        file_type = _detect_file_type(stream_info, self._file_types)
        if file_type is None:
            raise ValueError(
                "Unsupported file type for Content Understanding conversion."
            )
        file_modality = _get_modality(file_type)

        if (
            self._analyzer_id is not None
            and self._analyzer_modality is not None
            and _is_analyzer_compatible(file_modality, self._analyzer_modality)
        ):
            analyzer_id = self._analyzer_id
        else:
            analyzer_id = _PREBUILT_ANALYZERS.get(
                file_modality, "prebuilt-documentSearch"
            )

        # 2. Read file bytes and determine MIME type
        file_bytes = file_stream.read()
        content_type = _content_type_for(file_type, stream_info.mimetype)

        # 3. Call CU SDK
        poller = self._client.begin_analyze_binary(
            analyzer_id=analyzer_id,
            binary_input=file_bytes,
            content_type=content_type,
        )

        # 4. Block on result
        result = poller.result()

        # 5. Format output using to_llm_input()
        text = to_llm_input(result)

        # 6. Return
        return DocumentConverterResult(markdown=text)
|
||||
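The analyzer selection in step 1 of `convert()` can be read as a small pure function. The sketch below is a standalone approximation and not the converter's actual code: `pick_analyzer`, `is_compatible`, and `PREBUILT_BY_MODALITY` are hypothetical names, and the compatibility rule (exact modality match, or a document analyzer given an image) is inferred from the accompanying tests rather than taken from the `_is_analyzer_compatible` source.

```python
from typing import Optional

# Assumed defaults, mirroring what the tests expect for each modality.
PREBUILT_BY_MODALITY = {
    "document": "prebuilt-documentSearch",
    "image": "prebuilt-documentSearch",
    "audio": "prebuilt-audioSearch",
    "video": "prebuilt-videoSearch",
}


def is_compatible(file_modality: str, analyzer_modality: str) -> bool:
    """Assumed rule: exact match, or a document analyzer given an image."""
    if file_modality == analyzer_modality:
        return True
    return analyzer_modality == "document" and file_modality == "image"


def pick_analyzer(
    file_modality: str,
    analyzer_id: Optional[str] = None,
    analyzer_modality: Optional[str] = None,
) -> str:
    """Use the custom analyzer when it is compatible, else a prebuilt default."""
    if (
        analyzer_id is not None
        and analyzer_modality is not None
        and is_compatible(file_modality, analyzer_modality)
    ):
        return analyzer_id
    return PREBUILT_BY_MODALITY.get(file_modality, "prebuilt-documentSearch")
```

Under these assumptions, a custom document analyzer keeps PDFs and images, while audio and video fall through to the prebuilt analyzers for their modality, so one converter instance can serve mixed inputs.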
928
packages/markitdown/tests/test_cu_converter.py
Normal file
@@ -0,0 +1,928 @@
"""Tests for ContentUnderstandingConverter.

Tests accepts() routing, smart routing modality logic, and convert() via mocks.
Follows the same pattern as test_docintel_html.py.
"""

import io
import sys
from unittest.mock import MagicMock, patch

import pytest

from markitdown.converters._cu_converter import (
    ContentUnderstandingConverter,
    ContentUnderstandingFileType,
    _resolve_analyzer_modality,
    _get_modality,
    _detect_file_type,
    _canonical_mime_type,
    _content_type_for,
    _EXTENSION_MAP,
)
from markitdown._stream_info import StreamInfo

# ---------------------------------------------------------------------------
# Helper: create a converter with accepts() working but no SDK init
# ---------------------------------------------------------------------------


def _make_converter(file_types=None, analyzer_id=None, analyzer_modality=None):
    """Create a converter bypassing __init__ (no SDK deps needed)."""
    conv = ContentUnderstandingConverter.__new__(ContentUnderstandingConverter)
    conv._analyzer_id = analyzer_id
    conv._analyzer_modality = analyzer_modality

    # Set accepted file types without running SDK-dependent initialization.
    from markitdown.converters._cu_converter import (
        _ALL_FILE_TYPES,
    )

    types = file_types if file_types is not None else _ALL_FILE_TYPES
    conv._file_types = types

    return conv


# ---------------------------------------------------------------------------
# accepts() tests — extension-based
# ---------------------------------------------------------------------------


class TestAcceptsExtension:
    """Test accepts() for supported and unsupported file extensions."""

    @pytest.mark.parametrize(
        "ext",
        [
            ".pdf",
            ".docx",
            ".pptx",
            ".xlsx",
            ".html",
            ".txt",
            ".md",
            ".rtf",
            ".xml",
            ".eml",
            ".msg",
            ".jpg",
            ".jpeg",
            ".jpe",
            ".png",
            ".bmp",
            ".tiff",
            ".heif",
            ".heic",
            ".mp4",
            ".m4v",
            ".mov",
            ".avi",
            ".mkv",
            ".webm",
            ".flv",
            ".wmv",
            ".wav",
            ".mp3",
            ".m4a",
            ".flac",
            ".ogg",
            ".aac",
            ".wma",
        ],
    )
    def test_accepts_supported_extensions(self, ext):
        conv = _make_converter()
        assert conv.accepts(io.BytesIO(b""), StreamInfo(extension=ext))

    @pytest.mark.parametrize(
        "ext",
        [
            ".csv",
            ".json",
            ".zip",
            ".epub",
            ".py",
            ".rs",
        ],
    )
    def test_rejects_unsupported_extensions(self, ext):
        conv = _make_converter()
        assert not conv.accepts(io.BytesIO(b""), StreamInfo(extension=ext))


# ---------------------------------------------------------------------------
# accepts() tests — MIME-based
# ---------------------------------------------------------------------------


class TestAcceptsMime:
    """Test accepts() for MIME type matching."""

    @pytest.mark.parametrize(
        "mime",
        [
            "application/pdf",
            "image/jpeg",
            "video/mp4",
            "audio/wav",
            "audio/x-wav",
            "text/html",
            "audio/mpeg",
            "audio/x-m4a",
            "audio/x-flac",
            "video/quicktime",
            "video/webm",
            "video/x-m4v",
            "video/x-flv",
            "video/x-ms-wmv",
            "audio/aac",
            "audio/x-ms-wma",
        ],
    )
    def test_accepts_supported_mimetypes(self, mime):
        conv = _make_converter()
        assert conv.accepts(io.BytesIO(b""), StreamInfo(mimetype=mime))

    @pytest.mark.parametrize(
        "mime",
        [
            "text/csv",
            "application/json",
            "application/zip",
        ],
    )
    def test_rejects_unsupported_mimetypes(self, mime):
        conv = _make_converter()
        assert not conv.accepts(io.BytesIO(b""), StreamInfo(mimetype=mime))


# ---------------------------------------------------------------------------
# accepts() tests — cu_file_types restriction
# ---------------------------------------------------------------------------


class TestAcceptsFileTypeRestriction:
    """Test that cu_file_types restricts which formats are accepted."""

    def test_restricted_to_pdf_only(self):
        conv = _make_converter(file_types=[ContentUnderstandingFileType.PDF])
        assert conv.accepts(io.BytesIO(b""), StreamInfo(extension=".pdf"))
        assert not conv.accepts(io.BytesIO(b""), StreamInfo(extension=".mp4"))
        assert not conv.accepts(io.BytesIO(b""), StreamInfo(extension=".wav"))
        assert not conv.accepts(io.BytesIO(b""), StreamInfo(extension=".jpg"))

    def test_restricted_to_audio(self):
        conv = _make_converter(
            file_types=[
                ContentUnderstandingFileType.WAV,
                ContentUnderstandingFileType.MP3,
            ]
        )
        assert conv.accepts(io.BytesIO(b""), StreamInfo(extension=".wav"))
        assert conv.accepts(io.BytesIO(b""), StreamInfo(extension=".mp3"))
        assert not conv.accepts(io.BytesIO(b""), StreamInfo(extension=".pdf"))

    def test_webm_value_matches_cli_input(self):
        assert ContentUnderstandingFileType("webm") == ContentUnderstandingFileType.WEBM

    def test_m4v_value_matches_cli_input(self):
        assert ContentUnderstandingFileType("m4v") == ContentUnderstandingFileType.M4V


# ---------------------------------------------------------------------------
# file type detection tests
# ---------------------------------------------------------------------------


class TestDetectFileType:
    """Test extension and MIME based file type detection."""

    def test_detects_video_from_mime_without_extension(self):
        assert (
            _detect_file_type(StreamInfo(mimetype="video/mp4"))
            == ContentUnderstandingFileType.MP4
        )

    def test_detects_audio_from_mime_without_extension(self):
        assert (
            _detect_file_type(StreamInfo(mimetype="audio/mpeg"))
            == ContentUnderstandingFileType.MP3
        )

    def test_detects_audio_alias_from_mime_without_extension(self):
        assert (
            _detect_file_type(StreamInfo(mimetype="audio/x-wav"))
            == ContentUnderstandingFileType.WAV
        )

    def test_detects_video_alias_from_mime_without_extension(self):
        assert (
            _detect_file_type(StreamInfo(mimetype="video/x-m4v"))
            == ContentUnderstandingFileType.M4V
        )

    @pytest.mark.parametrize(
        ("mimetype", "expected"),
        [
            ("audio/x-wav", "audio/wav"),
            ("audio/x-flac", "audio/flac"),
            ("audio/x-m4a", "audio/mp4"),
            ("video/x-m4v", "video/mp4"),
            ("video/mp4", "video/mp4"),
            (None, "application/octet-stream"),
        ],
    )
    def test_canonical_mime_type(self, mimetype, expected):
        assert _canonical_mime_type(mimetype) == expected

    @pytest.mark.parametrize(
        ("file_type", "mimetype", "expected"),
        [
            (ContentUnderstandingFileType.PDF, None, "application/pdf"),
            (ContentUnderstandingFileType.M4V, None, "video/mp4"),
            (ContentUnderstandingFileType.FLAC, "audio/x-flac", "audio/flac"),
        ],
    )
    def test_content_type_for(self, file_type, mimetype, expected):
        assert _content_type_for(file_type, mimetype) == expected

    @pytest.mark.parametrize(
        ("file_type", "mimetype", "expected"),
        [
            # Extension/file_type wins when mimetype disagrees — the
            # resolved file_type is the single source of truth so that
            # analyzer routing and payload metadata stay consistent.
            (ContentUnderstandingFileType.PDF, "audio/mpeg", "application/pdf"),
            (ContentUnderstandingFileType.MP3, "application/pdf", "audio/mpeg"),
            (ContentUnderstandingFileType.MP4, "image/jpeg", "video/mp4"),
            (ContentUnderstandingFileType.JPEG, "video/mp4", "image/jpeg"),
            # Subtype distinctions are preserved when consistent
            # (e.g., HEIC vs HEIF both map to file_type HEIF; if the
            # caller passed image/heic explicitly, keep it).
            (ContentUnderstandingFileType.HEIF, "image/heic", "image/heic"),
            (ContentUnderstandingFileType.HEIF, "image/heif", "image/heif"),
        ],
    )
    def test_content_type_for_resolves_conflicts_to_file_type(
        self, file_type, mimetype, expected
    ):
        """When extension and mimetype disagree, file_type wins."""
        assert _content_type_for(file_type, mimetype) == expected

    def test_conflicting_extension_and_mimetype_in_convert(self):
        """End-to-end: conflicting StreamInfo routes by extension and
        sends a content_type consistent with the resolved file_type."""
        conv = _make_converter()
        conv._client = MagicMock()
        mock_poller = MagicMock()
        mock_poller.result.return_value = MagicMock(contents=[])
        conv._client.begin_analyze_binary.return_value = mock_poller

        with patch(
            "markitdown.converters._cu_converter.to_llm_input",
            return_value="ok",
        ):
            conv.convert(
                io.BytesIO(b"fake"),
                # .pdf extension but bogus audio mimetype
                StreamInfo(extension=".pdf", mimetype="audio/mpeg"),
            )

        call_kwargs = conv._client.begin_analyze_binary.call_args.kwargs
        # Routed by extension: document modality → prebuilt-documentSearch
        assert call_kwargs["analyzer_id"] == "prebuilt-documentSearch"
        # content_type derived from file_type (PDF), not the conflicting mime
        assert call_kwargs["content_type"] == "application/pdf"

    def test_file_type_restriction_applies_to_mime(self):
        assert (
            _detect_file_type(
                StreamInfo(mimetype="video/mp4"),
                [ContentUnderstandingFileType.PDF],
            )
            is None
        )


# ---------------------------------------------------------------------------
# Smart routing tests
# ---------------------------------------------------------------------------


class TestSmartRouting:
    """Test modality-aware analyzer routing."""

    def test_document_analyzer_routes_pdf_to_custom(self):
        """Document-based analyzer should be used for PDF."""
        conv = _make_converter(
            analyzer_id="my-doc-analyzer",
            analyzer_modality="document",
        )
        conv._client = MagicMock()
        mock_result = MagicMock()
        mock_result.contents = []
        mock_poller = MagicMock()
        mock_poller.result.return_value = mock_result

        conv._client.begin_analyze_binary.return_value = mock_poller

        with patch("markitdown.converters._cu_converter.to_llm_input", return_value=""):
            conv.convert(
                io.BytesIO(b"fake pdf"),
                StreamInfo(extension=".pdf", mimetype="application/pdf"),
            )

        # Should use the custom analyzer for PDF (document modality)
        call_args = conv._client.begin_analyze_binary.call_args
        assert call_args.kwargs["analyzer_id"] == "my-doc-analyzer"

    def test_document_analyzer_routes_mp3_to_prebuilt(self):
        """Document-based analyzer should auto-route MP3 to prebuilt-audioSearch."""
        conv = _make_converter(
            analyzer_id="my-doc-analyzer",
            analyzer_modality="document",
        )
        conv._client = MagicMock()
        mock_result = MagicMock()
        mock_result.contents = []
        mock_poller = MagicMock()
        mock_poller.result.return_value = mock_result

        conv._client.begin_analyze_binary.return_value = mock_poller

        with patch("markitdown.converters._cu_converter.to_llm_input", return_value=""):
            conv.convert(
                io.BytesIO(b"fake audio"),
                StreamInfo(extension=".mp3", mimetype="audio/mpeg"),
            )

        call_args = conv._client.begin_analyze_binary.call_args
        assert call_args.kwargs["analyzer_id"] == "prebuilt-audioSearch"

    def test_document_analyzer_routes_mp4_to_prebuilt(self):
        """Document-based analyzer should auto-route MP4 to prebuilt-videoSearch."""
        conv = _make_converter(
            analyzer_id="my-doc-analyzer",
            analyzer_modality="document",
        )
        conv._client = MagicMock()
        mock_result = MagicMock()
        mock_result.contents = []
        mock_poller = MagicMock()
        mock_poller.result.return_value = mock_result

        conv._client.begin_analyze_binary.return_value = mock_poller

        with patch("markitdown.converters._cu_converter.to_llm_input", return_value=""):
            conv.convert(
                io.BytesIO(b"fake video"),
                StreamInfo(extension=".mp4", mimetype="video/mp4"),
            )

        call_args = conv._client.begin_analyze_binary.call_args
        assert call_args.kwargs["analyzer_id"] == "prebuilt-videoSearch"

    def test_no_analyzer_id_uses_auto_routing(self):
        """Without analyzer_id, PDF should auto-route to prebuilt-documentSearch."""
        conv = _make_converter(analyzer_id=None, analyzer_modality=None)
        conv._client = MagicMock()
        mock_result = MagicMock()
        mock_result.contents = []
        mock_poller = MagicMock()
        mock_poller.result.return_value = mock_result

        conv._client.begin_analyze_binary.return_value = mock_poller

        with patch("markitdown.converters._cu_converter.to_llm_input", return_value=""):
            conv.convert(
                io.BytesIO(b"fake pdf"),
                StreamInfo(extension=".pdf", mimetype="application/pdf"),
            )

        call_args = conv._client.begin_analyze_binary.call_args
        assert call_args.kwargs["analyzer_id"] == "prebuilt-documentSearch"

    def test_no_analyzer_id_routes_image_to_document_search(self):
        """Default image routing should still use prebuilt-documentSearch."""
        conv = _make_converter(analyzer_id=None, analyzer_modality=None)
        conv._client = MagicMock()
        mock_result = MagicMock()
        mock_result.contents = []
        mock_poller = MagicMock()
        mock_poller.result.return_value = mock_result

        conv._client.begin_analyze_binary.return_value = mock_poller

        with patch("markitdown.converters._cu_converter.to_llm_input", return_value=""):
            conv.convert(
                io.BytesIO(b"fake image"),
                StreamInfo(extension=".jpg", mimetype="image/jpeg"),
            )

        call_args = conv._client.begin_analyze_binary.call_args
        assert call_args.kwargs["analyzer_id"] == "prebuilt-documentSearch"

    def test_document_analyzer_routes_image_to_custom(self):
        """Document-based analyzers should still handle image documents."""
        conv = _make_converter(
            analyzer_id="my-doc-analyzer",
            analyzer_modality="document",
        )
        conv._client = MagicMock()
        mock_result = MagicMock()
        mock_result.contents = []
        mock_poller = MagicMock()
        mock_poller.result.return_value = mock_result

        conv._client.begin_analyze_binary.return_value = mock_poller

        with patch("markitdown.converters._cu_converter.to_llm_input", return_value=""):
            conv.convert(
                io.BytesIO(b"fake image"),
                StreamInfo(extension=".jpg", mimetype="image/jpeg"),
            )

        call_args = conv._client.begin_analyze_binary.call_args
        assert call_args.kwargs["analyzer_id"] == "my-doc-analyzer"

    def test_image_analyzer_routes_jpeg_to_custom(self):
        """Image-based analyzers should be used for image files."""
        conv = _make_converter(
            analyzer_id="my-image-analyzer",
            analyzer_modality="image",
        )
        conv._client = MagicMock()
        mock_result = MagicMock()
        mock_result.contents = []
        mock_poller = MagicMock()
        mock_poller.result.return_value = mock_result

        conv._client.begin_analyze_binary.return_value = mock_poller

        with patch("markitdown.converters._cu_converter.to_llm_input", return_value=""):
            conv.convert(
                io.BytesIO(b"fake image"),
                StreamInfo(extension=".jpg", mimetype="image/jpeg"),
            )

        call_args = conv._client.begin_analyze_binary.call_args
        assert call_args.kwargs["analyzer_id"] == "my-image-analyzer"

    def test_image_analyzer_routes_pdf_to_document_prebuilt(self):
        """Image-based analyzers should not claim non-image document files."""
        conv = _make_converter(
            analyzer_id="my-image-analyzer",
            analyzer_modality="image",
        )
        conv._client = MagicMock()
        mock_result = MagicMock()
        mock_result.contents = []
        mock_poller = MagicMock()
        mock_poller.result.return_value = mock_result

        conv._client.begin_analyze_binary.return_value = mock_poller

        with patch("markitdown.converters._cu_converter.to_llm_input", return_value=""):
            conv.convert(
                io.BytesIO(b"fake pdf"),
                StreamInfo(extension=".pdf", mimetype="application/pdf"),
            )

        call_args = conv._client.begin_analyze_binary.call_args
        assert call_args.kwargs["analyzer_id"] == "prebuilt-documentSearch"

    @pytest.mark.parametrize(
        ("mimetype", "expected_analyzer"),
        [
            ("video/mp4", "prebuilt-videoSearch"),
            ("video/x-m4v", "prebuilt-videoSearch"),
            ("audio/mpeg", "prebuilt-audioSearch"),
            ("audio/x-wav", "prebuilt-audioSearch"),
        ],
    )
    def test_mime_only_input_uses_auto_routing(self, mimetype, expected_analyzer):
        """MIME-only streams should route to the matching modality analyzer."""
        conv = _make_converter(analyzer_id=None, analyzer_modality=None)
        conv._client = MagicMock()
        mock_result = MagicMock()
        mock_result.contents = []
        mock_poller = MagicMock()
        mock_poller.result.return_value = mock_result

        conv._client.begin_analyze_binary.return_value = mock_poller

        with patch("markitdown.converters._cu_converter.to_llm_input", return_value=""):
            conv.convert(io.BytesIO(b"fake content"), StreamInfo(mimetype=mimetype))

        call_args = conv._client.begin_analyze_binary.call_args
        assert call_args.kwargs["analyzer_id"] == expected_analyzer

    def test_mime_alias_input_uses_canonical_content_type(self):
        """Alias MIME types should be sent to CU as canonical content types."""
        conv = _make_converter(analyzer_id=None, analyzer_modality=None)
        conv._client = MagicMock()
        mock_result = MagicMock()
        mock_result.contents = []
        mock_poller = MagicMock()
        mock_poller.result.return_value = mock_result

        conv._client.begin_analyze_binary.return_value = mock_poller

        with patch("markitdown.converters._cu_converter.to_llm_input", return_value=""):
            conv.convert(io.BytesIO(b"fake video"), StreamInfo(mimetype="video/x-m4v"))

        call_args = conv._client.begin_analyze_binary.call_args
        assert call_args.kwargs["analyzer_id"] == "prebuilt-videoSearch"
        assert call_args.kwargs["content_type"] == "video/mp4"

    def test_extension_only_input_uses_file_type_content_type(self):
        """Extension-only inputs should send CU a matching content type."""
        conv = _make_converter(analyzer_id=None, analyzer_modality=None)
        conv._client = MagicMock()
        mock_result = MagicMock()
        mock_result.contents = []
        mock_poller = MagicMock()
        mock_poller.result.return_value = mock_result

        conv._client.begin_analyze_binary.return_value = mock_poller

        with patch("markitdown.converters._cu_converter.to_llm_input", return_value=""):
            conv.convert(io.BytesIO(b"fake pdf"), StreamInfo(extension=".pdf"))

        call_args = conv._client.begin_analyze_binary.call_args
        assert call_args.kwargs["analyzer_id"] == "prebuilt-documentSearch"
        assert call_args.kwargs["content_type"] == "application/pdf"


# ---------------------------------------------------------------------------
# _resolve_analyzer_modality tests
# ---------------------------------------------------------------------------


class TestResolveAnalyzerModality:
    """Test modality resolution from analyzer IDs."""

    def test_known_document_prebuilts(self):
        client = MagicMock()
        assert (
            _resolve_analyzer_modality(client, "prebuilt-documentSearch") == "document"
        )
        assert _resolve_analyzer_modality(client, "prebuilt-invoice") == "document"
        assert _resolve_analyzer_modality(client, "prebuilt-layout") == "document"
        assert _resolve_analyzer_modality(client, "prebuilt-receipt") == "document"
        assert _resolve_analyzer_modality(client, "prebuilt-tax.us.w2") == "document"
        # Known prebuilts should never call get_analyzer()
        client.get_analyzer.assert_not_called()

    def test_known_audio_prebuilts(self):
        client = MagicMock()
        assert _resolve_analyzer_modality(client, "prebuilt-audioSearch") == "audio"
        assert _resolve_analyzer_modality(client, "prebuilt-callCenter") == "audio"
        client.get_analyzer.assert_not_called()

    def test_known_video_prebuilts(self):
        client = MagicMock()
        assert _resolve_analyzer_modality(client, "prebuilt-videoSearch") == "video"
        assert _resolve_analyzer_modality(client, "prebuilt-videoSynopsis") == "video"
        client.get_analyzer.assert_not_called()

    def test_known_image_prebuilts(self):
        client = MagicMock()
        assert _resolve_analyzer_modality(client, "prebuilt-imageSearch") == "image"
        assert _resolve_analyzer_modality(client, "prebuilt-image") == "image"
        client.get_analyzer.assert_not_called()

    def test_unknown_prebuilt_falls_back_to_get_analyzer(self):
        """Unknown prebuilt-* names should call get_analyzer() for resolution."""
        client = MagicMock()
        mock_analyzer = MagicMock()
        mock_analyzer.base_analyzer_id = "prebuilt-audio"
        client.get_analyzer.return_value = mock_analyzer

        result = _resolve_analyzer_modality(client, "prebuilt-newAnalyzer")
        assert result == "audio"
        client.get_analyzer.assert_called_once_with("prebuilt-newAnalyzer")

    def test_custom_analyzer_calls_get_analyzer(self):
        """Custom analyzers should call get_analyzer() to resolve modality."""
        client = MagicMock()
        mock_analyzer = MagicMock()
        mock_analyzer.base_analyzer_id = "prebuilt-document"
        client.get_analyzer.return_value = mock_analyzer

        result = _resolve_analyzer_modality(client, "my-custom-doc-analyzer")
        assert result == "document"
        client.get_analyzer.assert_called_once_with("my-custom-doc-analyzer")

    def test_custom_analyzer_no_base_defaults_to_document(self):
        """Analyzer with no base_analyzer_id defaults to document."""
        client = MagicMock()
        mock_analyzer = MagicMock()
        mock_analyzer.base_analyzer_id = None
        client.get_analyzer.return_value = mock_analyzer

        result = _resolve_analyzer_modality(client, "my-custom-analyzer")
        assert result == "document"

    def test_get_analyzer_failure_raises_value_error(self):
        """Failed get_analyzer() should raise ValueError."""
        client = MagicMock()
        client.get_analyzer.side_effect = Exception("not found")

        with pytest.raises(ValueError, match="Failed to resolve analyzer 'bad-id'"):
            _resolve_analyzer_modality(client, "bad-id")


# ---------------------------------------------------------------------------
# _get_modality tests
# ---------------------------------------------------------------------------


class TestGetModality:
    """Test file type → modality mapping."""

    def test_document_types(self):
        assert _get_modality(ContentUnderstandingFileType.PDF) == "document"
        assert _get_modality(ContentUnderstandingFileType.DOCX) == "document"

    def test_image_types(self):
        assert _get_modality(ContentUnderstandingFileType.JPEG) == "image"
        assert _get_modality(ContentUnderstandingFileType.PNG) == "image"

    def test_video_types(self):
        assert _get_modality(ContentUnderstandingFileType.MP4) == "video"
        assert _get_modality(ContentUnderstandingFileType.MOV) == "video"

    def test_audio_types(self):
        assert _get_modality(ContentUnderstandingFileType.WAV) == "audio"
        assert _get_modality(ContentUnderstandingFileType.MP3) == "audio"


# ---------------------------------------------------------------------------
# convert() mock tests
# ---------------------------------------------------------------------------


class TestConvertMock:
    """Test convert() with mocked CU SDK."""

    def _run_convert(self, extension, mimetype, expected_output="mock output"):
        conv = _make_converter()
        conv._client = MagicMock()

        mock_result = MagicMock()
        mock_result.contents = []
        mock_poller = MagicMock()
        mock_poller.result.return_value = mock_result
        conv._client.begin_analyze_binary.return_value = mock_poller

        with patch(
            "markitdown.converters._cu_converter.to_llm_input",
            return_value=expected_output,
        ):
            result = conv.convert(
                io.BytesIO(b"fake content"),
                StreamInfo(extension=extension, mimetype=mimetype),
            )
        return result

    def test_pdf_returns_markdown(self):
        result = self._run_convert(
            ".pdf", "application/pdf", "---\ncontentType: document\n---\n# Test"
        )
        assert "contentType: document" in result.markdown

    def test_mp4_returns_markdown(self):
        result = self._run_convert(
            ".mp4", "video/mp4", "---\ncontentType: audioVisual\n---\nSpeaker 1: Hello"
        )
        assert "contentType: audioVisual" in result.markdown

    def test_wav_returns_markdown(self):
        result = self._run_convert(
            ".wav", "audio/wav", "---\ncontentType: audioVisual\n---\nSpeaker 1: Hi"
        )
        assert "audioVisual" in result.markdown

    def test_empty_result(self):
        result = self._run_convert(".pdf", "application/pdf", "")
        assert result.markdown == ""

    def test_jpeg_returns_markdown(self):
        result = self._run_convert(
            ".jpg", "image/jpeg", "---\ncontentType: document\n---\n# Photo"
        )
        assert "contentType: document" in result.markdown


# ---------------------------------------------------------------------------
# Init-time get_analyzer() error wrapping
# ---------------------------------------------------------------------------


class TestGetAnalyzerError:
    """Test that get_analyzer() failures at init produce a clear error."""

    def test_nonexistent_analyzer_raises_value_error(self):
        """A failed get_analyzer() should raise ValueError with analyzer name."""
        with patch(
            "markitdown.converters._cu_converter._dependency_exc_info", None
        ), patch(
            "markitdown.converters._cu_converter.ContentUnderstandingClient"
        ) as MockClient, patch(
            "markitdown.converters._cu_converter.DefaultAzureCredential"
        ):
            mock_client = MagicMock()
            mock_client.get_analyzer.side_effect = Exception("not found")
            MockClient.return_value = mock_client

            with pytest.raises(ValueError, match="Failed to resolve analyzer 'bad-id'"):
                ContentUnderstandingConverter(
                    endpoint="https://fake", analyzer_id="bad-id"
                )


# ---------------------------------------------------------------------------
# Registration priority test
# ---------------------------------------------------------------------------


class TestRegistrationPriority:
    """Test that CU converter is registered with higher priority than Doc Intel."""

    def test_cu_registered_before_docintel(self):
        """When both endpoints are provided, CU should appear before Doc Intel."""
        with patch(
            "markitdown.converters._cu_converter._dependency_exc_info", None
        ), patch(
            "markitdown.converters._cu_converter.ContentUnderstandingClient"
        ), patch(
            "markitdown.converters._cu_converter.DefaultAzureCredential"
        ), patch(
            "markitdown.converters._doc_intel_converter._dependency_exc_info", None
        ), patch(
            "markitdown.converters._doc_intel_converter.DocumentIntelligenceClient"
        ), patch(
            "markitdown.converters._doc_intel_converter.DefaultAzureCredential"
        ):
            from markitdown import MarkItDown
            from markitdown.converters import (
                ContentUnderstandingConverter,
                DocumentIntelligenceConverter,
            )

            md = MarkItDown(
                cu_endpoint="https://fake-cu",
                docintel_endpoint="https://fake-di",
            )

            converter_types = [type(reg.converter) for reg in md._converters]
            cu_idx = converter_types.index(ContentUnderstandingConverter)
            di_idx = converter_types.index(DocumentIntelligenceConverter)
            assert (
                cu_idx < di_idx
            ), "CU should have higher priority (lower index) than Doc Intel"
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
# CLI argument tests
# ---------------------------------------------------------------------------


class TestCLIArgs:
    """Test CLI argument parsing for CU flags."""

    def test_use_cu_without_endpoint_exits(self):
        """--use-cu without --cu-endpoint should exit with error."""
        import subprocess

        result = subprocess.run(
            [sys.executable, "-m", "markitdown", "--use-cu", "fake.pdf"],
            capture_output=True,
            text=True,
        )
        assert result.returncode != 0
        assert (
            "cu-endpoint" in result.stderr.lower()
            or "cu-endpoint" in (result.stdout or "").lower()
        )

    def test_use_cu_and_use_docintel_mutually_exclusive(self):
        """--use-cu and --use-docintel cannot be used together."""
        import subprocess

        result = subprocess.run(
            [
                sys.executable,
                "-m",
                "markitdown",
                "--use-cu",
                "--cu-endpoint",
                "https://fake",
                "--use-docintel",
                "-e",
                "https://fake-di",
                "fake.pdf",
            ],
            capture_output=True,
            text=True,
        )
        assert result.returncode != 0

    def test_cu_file_types_parsing(self):
        """--cu-file-types should parse comma-separated values into enum list."""
        from markitdown.converters import ContentUnderstandingFileType

        raw = "pdf,jpeg,mp4"
        type_names = [t.strip().lower() for t in raw.split(",") if t.strip()]
        cu_types = [ContentUnderstandingFileType(name) for name in type_names]

        assert cu_types == [
            ContentUnderstandingFileType.PDF,
            ContentUnderstandingFileType.JPEG,
            ContentUnderstandingFileType.MP4,
        ]

    def test_cu_file_types_invalid_value(self):
        """Unknown file type name should raise ValueError."""
        from markitdown.converters import ContentUnderstandingFileType

        with pytest.raises(ValueError):
            ContentUnderstandingFileType("nonsense")

    def test_cu_file_types_single_value(self):
        """Single file type (no comma) should parse correctly."""
        from markitdown.converters import ContentUnderstandingFileType

        cu_types = [
            ContentUnderstandingFileType(t.strip().lower())
            for t in "wav".split(",")
            if t.strip()
        ]
        assert cu_types == [ContentUnderstandingFileType.WAV]

    def test_use_cu_wires_kwargs_to_markitdown(self, capsys):
        """--use-cu should pass CU options through to MarkItDown."""
        import markitdown.__main__ as markitdown_cli

        markitdown_instance = MagicMock()
        markitdown_instance.convert.return_value.markdown = "converted"
        markitdown_cls = MagicMock(return_value=markitdown_instance)

        with patch.object(
            sys,
            "argv",
            [
                "markitdown",
                "--use-cu",
                "--cu-endpoint",
                "https://fake-cu",
                "--cu-analyzer",
                "custom-analyzer",
                "--cu-file-types",
                "pdf,jpeg,mp4",
                "fake.pdf",
            ],
        ), patch.object(markitdown_cli, "MarkItDown", markitdown_cls):
            markitdown_cli.main()

        markitdown_cls.assert_called_once_with(
            enable_plugins=False,
            cu_endpoint="https://fake-cu",
            cu_analyzer_id="custom-analyzer",
            cu_file_types=[
                ContentUnderstandingFileType.PDF,
                ContentUnderstandingFileType.JPEG,
                ContentUnderstandingFileType.MP4,
            ],
        )
        markitdown_instance.convert.assert_called_once_with(
            "fake.pdf", stream_info=None, keep_data_uris=False
        )
        assert capsys.readouterr().out == "converted\n"


# ---------------------------------------------------------------------------
# MissingDependencyException test
# ---------------------------------------------------------------------------


class TestMissingDependency:
    """Test that MissingDependencyException is raised when CU SDK is not installed."""

    def test_missing_deps_message(self):
        """Converter construction should surface the optional install hint."""
        import markitdown.converters._cu_converter as cu_converter_module
        from markitdown._exceptions import MissingDependencyException

        import_error = ImportError("No module named 'azure.ai.contentunderstanding'")
        dependency_exc_info = (ImportError, import_error, None)

        with patch.object(
            cu_converter_module, "_dependency_exc_info", dependency_exc_info
        ), pytest.raises(MissingDependencyException) as exc_info:
            ContentUnderstandingConverter(endpoint="https://fake-cu")

        assert "az-content-understanding" in str(exc_info.value)
        assert exc_info.value.__cause__ is import_error