feat: Add Azure Content Understanding converter (#1865)

* initial version
* improve MIME type detection
* route prebuilt-image custom analyzer to image
* prioritize CU over DI
* fix: apply black formatting
* update cache of known prebuilt names and improve README
* add test cases, run black
* update README and derive content_type from the resolved file_type
* update README
2026-07-03 12:28:14 +08:00 · 2026-05-21 21:59:41 -07:00
parent a51f725d7f
commit a01d74dda7
7 changed files with 1667 additions and 1 deletions
--- a/README.md
+++ b/README.md
@@ -107,6 +107,7 @@ At the moment, the following optional dependencies are available:
 * `[pdf]` Installs dependencies for PDF files
 * `[outlook]` Installs dependencies for Outlook messages
 * `[az-doc-intel]` Installs dependencies for Azure Document Intelligence
+* `[az-content-understanding]` Installs dependencies for Azure Content Understanding
 * `[audio-transcription]` Installs dependencies for audio transcription of wav and mp3 files
 * `[youtube-transcription]` Installs dependencies for fetching YouTube video transcription

@@ -158,6 +159,83 @@ If no `llm_client` is provided the plugin still loads, but OCR is silently skipp

 See [`packages/markitdown-ocr/README.md`](packages/markitdown-ocr/README.md) for detailed documentation.

+### Azure Content Understanding
+
+[Azure Content Understanding](https://learn.microsoft.com/azure/ai-services/content-understanding/) provides higher-quality conversion with structured field extraction (YAML front matter), multi-modal support (documents, images, audio, video), and configurable analyzers.
+
+Install: `pip install 'markitdown[az-content-understanding]'`
+
+#### When to use Content Understanding
+
+Content Understanding is ideal when you need capabilities beyond what built-in or Document Intelligence converters provide:
+
+- **Audio and video files** — CU is the only option for video, and the higher-quality cloud option for audio. Built-in converters have no video support and only basic audio transcription.
+- **Structured field extraction** — [Prebuilt](https://learn.microsoft.com/azure/ai-services/content-understanding/concepts/prebuilt-analyzers) or [custom-built](https://learn.microsoft.com/azure/ai-services/content-understanding/how-to/customize-analyzer-content-understanding-studio?tabs=portal) analyzers extract domain-specific fields (invoice amounts, receipt dates, contract clauses) serialized as YAML front matter. Neither built-in nor Doc Intel integration exposes fields.
+- **Higher-quality document extraction** — Cloud-based layout analysis and OCR for scanned PDFs, complex tables, and multi-page documents.
+- **Single API for all modalities** — One `cu_endpoint` handles documents, images, audio, and video with automatic analyzer routing.
+
+| Capability | Built-in converters | Azure Document Intelligence | Azure Content Understanding |
+|------------|---------------------|-----------------------------|-----------------------------|
+| Document conversion | Offline, format-specific extraction | Cloud layout extraction | Cloud multimodal extraction |
+| Structured fields | Not available | Not exposed by this integration | YAML front matter from analyzer fields |
+| Custom analyzers | Not available | Not configurable in this integration | Supported with `cu_analyzer_id` |
+| Audio and video | Basic audio, no video | Not supported | Audio and video analyzers |
+| Cost | Local compute only | Billable Azure API calls | Billable Azure API calls |
+
+**CLI:**
+
+```bash
+markitdown path-to-file.pdf --use-cu --cu-endpoint "<content_understanding_endpoint>"
+```
+
+**Python API:**
+
+```python
+from markitdown import MarkItDown
+
+# Zero-config — auto-selects analyzer per file type
+md = MarkItDown(cu_endpoint="<content_understanding_endpoint>")
+result = md.convert("report.pdf")   # documents → prebuilt-documentSearch
+result = md.convert("meeting.mp4")  # video → prebuilt-videoSearch
+result = md.convert("call.wav")     # audio → prebuilt-audioSearch
+print(result.markdown)
+```
+
+**With a custom analyzer** (for domain-specific field extraction):
+
+```python
+md = MarkItDown(
+    cu_endpoint="<content_understanding_endpoint>",
+    cu_analyzer_id="my-invoice-analyzer",
+)
+result = md.convert("invoice.pdf")
+print(result.markdown)
+# Output includes YAML front matter with extracted fields:
+# ---
+# contentType: document
+# fields:
+#   VendorName: CONTOSO LTD.
+#   InvoiceDate: '2019-11-15'
+# ---
+# <!-- page 1 -->
+# ...
+```
+
+When `cu_analyzer_id` is set, the converter automatically scopes it to compatible file types based on the analyzer's modality. Incompatible types (e.g., audio files with a document analyzer) auto-route to default prebuilt analyzers.
+
+**Cost note:** Each `convert()` call for a CU-routed format is a billable Azure API call. Use `cu_file_types` to restrict which formats route to CU:
+
+```python
+from markitdown.converters import ContentUnderstandingFileType
+
+md = MarkItDown(
+    cu_endpoint="<content_understanding_endpoint>",
+    cu_file_types=[ContentUnderstandingFileType.PDF],  # only PDFs use CU
+)
+```
+
+More information about Azure Content Understanding can be found [here](https://learn.microsoft.com/azure/ai-services/content-understanding/).
+
 ### Azure Document Intelligence

 To use Microsoft Document Intelligence for conversion:
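
Review sketch (not part of the commit): the custom-analyzer example in the README hunk above shows `result.markdown` starting with a YAML front matter block delimited by `---` lines. Assuming that layout holds, a caller could separate the extracted fields from the Markdown body with standard-library Python. `split_front_matter` is a hypothetical helper name, not something MarkItDown or the CU SDK provides; it returns the front matter as raw text and does not parse the YAML. The body line in the sample is illustrative.

```python
from typing import Tuple


def split_front_matter(markdown: str) -> Tuple[str, str]:
    """Split '---'-delimited front matter from the Markdown body.

    Returns (front_matter_text, body). If the text does not start with a
    '---' line, or no closing '---' line is found, the front matter is
    empty and the body is the unchanged input.
    """
    lines = markdown.splitlines()
    if not lines or lines[0].strip() != "---":
        return "", markdown
    for index in range(1, len(lines)):
        if lines[index].strip() == "---":
            front = "\n".join(lines[1:index])
            body = "\n".join(lines[index + 1:])
            return front, body
    return "", markdown


# Sample shaped like the README example; the last line is made up.
sample = "\n".join(
    [
        "---",
        "contentType: document",
        "fields:",
        "  VendorName: CONTOSO LTD.",
        "  InvoiceDate: '2019-11-15'",
        "---",
        "<!-- page 1 -->",
        "Invoice body text",
    ]
)
front, body = split_front_matter(sample)
print(front)
print(body)
```

The raw front matter text could then be handed to any YAML parser the caller already depends on.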
--- a/packages/markitdown/pyproject.toml
+++ b/packages/markitdown/pyproject.toml
@@ -47,6 +47,7 @@ all = [
  "SpeechRecognition",
  "youtube-transcript-api~=1.0.0",
  "azure-ai-documentintelligence",
+  "azure-ai-contentunderstanding>=1.2.0b1",
  "azure-identity",
 ]
 pptx = ["python-pptx"]
@@ -58,6 +59,8 @@ outlook = ["olefile"]
 audio-transcription = ["pydub", "SpeechRecognition"]
 youtube-transcription = ["youtube-transcript-api"]
 az-doc-intel = ["azure-ai-documentintelligence", "azure-identity"]
+# >=1.2.0b1 required for to_llm_input() helper used by ContentUnderstandingConverter
+az-content-understanding = ["azure-ai-contentunderstanding>=1.2.0b1", "azure-identity"]

 [project.urls]
 Documentation = "https://github.com/microsoft/markitdown#readme"
--- a/packages/markitdown/src/markitdown/__main__.py
+++ b/packages/markitdown/src/markitdown/__main__.py
@@ -4,6 +4,7 @@
 import argparse
 import sys
 import codecs
+from typing import Any, Dict
 from textwrap import dedent
 from importlib.metadata import entry_points
 from .__about__ import __version__
@@ -77,13 +78,22 @@ def main():
        help="Provide a hint about the file's charset (e.g, UTF-8).",
    )

-    parser.add_argument(
+    cloud_group = parser.add_mutually_exclusive_group()
+    cloud_group.add_argument(
        "-d",
        "--use-docintel",
        action="store_true",
        help="Use Document Intelligence to extract text instead of offline conversion. Requires a valid Document Intelligence Endpoint.",
    )

+    cloud_group.add_argument(
+        "--use-cu",
+        "--use-content-understanding",
+        action="store_true",
+        dest="use_cu",
+        help="Use Azure Content Understanding to extract text. Requires --cu-endpoint.",
+    )
+
    parser.add_argument(
        "-e",
        "--endpoint",
@@ -91,6 +101,24 @@ def main():
        help="Document Intelligence Endpoint. Required if using Document Intelligence.",
    )

+    parser.add_argument(
+        "--cu-endpoint",
+        type=str,
+        help="Content Understanding Endpoint. Required if using --use-cu.",
+    )
+
+    parser.add_argument(
+        "--cu-analyzer",
+        type=str,
+        help="Content Understanding analyzer ID. If not specified, auto-selects by file type.",
+    )
+
+    parser.add_argument(
+        "--cu-file-types",
+        type=str,
+        help="Comma-separated list of file types to route to Content Understanding (e.g., pdf,jpeg,mp4). If omitted, all supported types are routed.",
+    )
+
    parser.add_argument(
        "-p",
        "--use-plugins",
@@ -183,6 +211,36 @@ def main():
        markitdown = MarkItDown(
            enable_plugins=args.use_plugins, docintel_endpoint=args.endpoint
        )
+    elif args.use_cu:
+        if args.cu_endpoint is None:
+            _exit_with_error(
+                "Content Understanding Endpoint (--cu-endpoint) is required when using --use-cu."
+            )
+        elif args.filename is None:
+            _exit_with_error("Filename is required when using Content Understanding.")
+
+        cu_kwargs: Dict[str, Any] = {
+            "cu_endpoint": args.cu_endpoint,
+        }
+        if args.cu_analyzer is not None:
+            cu_kwargs["cu_analyzer_id"] = args.cu_analyzer
+        if args.cu_file_types is not None:
+            # Parse comma-separated file types into ContentUnderstandingFileType list
+            from .converters import ContentUnderstandingFileType
+
+            type_names = [
+                t.strip().lower() for t in args.cu_file_types.split(",") if t.strip()
+            ]
+            cu_types = []
+            for name in type_names:
+                # Try matching by value (e.g., "pdf", "jpeg", "mp4")
+                try:
+                    cu_types.append(ContentUnderstandingFileType(name))
+                except ValueError:
+                    _exit_with_error(f"Unknown file type: {name}")
+            cu_kwargs["cu_file_types"] = cu_types
+
+        markitdown = MarkItDown(enable_plugins=args.use_plugins, **cu_kwargs)
    else:
        markitdown = MarkItDown(enable_plugins=args.use_plugins)
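
Review sketch (not part of the commit): the `--cu-file-types` handling above splits on commas, trims and lowercases each entry, and looks each name up by enum value. The standalone version below mirrors that logic so it can be run without the Azure dependencies. `FileType` is a hypothetical three-member stand-in for `ContentUnderstandingFileType`, and `parse_file_types` is an illustrative name; unlike the CLI, it raises `ValueError` instead of calling `_exit_with_error`.

```python
from enum import Enum
from typing import List


class FileType(str, Enum):
    """Hypothetical stand-in for ContentUnderstandingFileType."""

    PDF = "pdf"
    JPEG = "jpeg"
    MP4 = "mp4"


def parse_file_types(raw: str) -> List[FileType]:
    """Parse a comma-separated list such as 'pdf, JPEG,mp4' into enum members."""
    # Trim, lowercase, and drop empty entries (e.g. from a doubled comma).
    names = [part.strip().lower() for part in raw.split(",") if part.strip()]
    parsed: List[FileType] = []
    for name in names:
        try:
            parsed.append(FileType(name))  # lookup by enum value
        except ValueError:
            raise ValueError(f"Unknown file type: {name}") from None
    return parsed


print(parse_file_types("pdf, JPEG,,mp4"))
```

Because the lookup is by enum value, `jpg` would be rejected here even though the converter's extension map sends `.jpg` files to `JPEG`; the accepted spelling on the command line is `jpeg`.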

--- a/packages/markitdown/src/markitdown/_markitdown.py
+++ b/packages/markitdown/src/markitdown/_markitdown.py
@@ -38,6 +38,7 @@ from .converters import (
    ZipConverter,
    EpubConverter,
    DocumentIntelligenceConverter,
+    ContentUnderstandingConverter,
    CsvConverter,
 )

@@ -225,6 +226,28 @@ class MarkItDown:
                    DocumentIntelligenceConverter(**docintel_args),
                )

+            # Register Content Understanding converter at the top of the stack if endpoint is provided
+            cu_endpoint = kwargs.get("cu_endpoint")
+            if cu_endpoint is not None:
+                cu_args: Dict[str, Any] = {}
+                cu_args["endpoint"] = cu_endpoint
+
+                cu_credential = kwargs.get("cu_credential")
+                if cu_credential is not None:
+                    cu_args["credential"] = cu_credential
+
+                cu_analyzer_id = kwargs.get("cu_analyzer_id")
+                if cu_analyzer_id is not None:
+                    cu_args["analyzer_id"] = cu_analyzer_id
+
+                cu_file_types = kwargs.get("cu_file_types")
+                if cu_file_types is not None:
+                    cu_args["file_types"] = cu_file_types
+
+                self.register_converter(
+                    ContentUnderstandingConverter(**cu_args),
+                )
+
            self._builtins_enabled = True
        else:
            warn("Built-in converters are already enabled.", RuntimeWarning)
--- a/packages/markitdown/src/markitdown/converters/__init__.py
+++ b/packages/markitdown/src/markitdown/converters/__init__.py
@@ -21,6 +21,10 @@ from ._doc_intel_converter import (
    DocumentIntelligenceConverter,
    DocumentIntelligenceFileType,
 )
+from ._cu_converter import (
+    ContentUnderstandingConverter,
+    ContentUnderstandingFileType,
+)
 from ._epub_converter import EpubConverter
 from ._csv_converter import CsvConverter

@@ -43,6 +47,8 @@ __all__ = [
    "ZipConverter",
    "DocumentIntelligenceConverter",
    "DocumentIntelligenceFileType",
+    "ContentUnderstandingConverter",
+    "ContentUnderstandingFileType",
    "EpubConverter",
    "CsvConverter",
 ]
--- a/packages/markitdown/src/markitdown/converters/_cu_converter.py
+++ b/packages/markitdown/src/markitdown/converters/_cu_converter.py
@@ -0,0 +1,570 @@
+"""Azure Content Understanding converter for MarkItDown.
+
+Converts files using Azure Content Understanding (CU) for high-quality,
+multi-modal extraction with structured field output. Supports documents,
+images, audio, and video. Fields are serialized as YAML front matter via
+the CU SDK's ``to_llm_input()`` helper.
+
+Install dependencies: ``pip install 'markitdown[az-content-understanding]'``
+"""
+
+import sys
+import os
+from typing import BinaryIO, Any, List, Optional, Dict
+from enum import Enum
+
+from .._base_converter import DocumentConverter, DocumentConverterResult
+from .._stream_info import StreamInfo
+from .._exceptions import MissingDependencyException
+
+# Try loading optional dependencies — save error for later
+_dependency_exc_info = None
+try:
+    from azure.ai.contentunderstanding import ContentUnderstandingClient, to_llm_input
+    from azure.core.credentials import AzureKeyCredential, TokenCredential
+    from azure.core.pipeline.policies import UserAgentPolicy
+    from azure.identity import DefaultAzureCredential
+except ImportError:
+    _dependency_exc_info = sys.exc_info()
+
+    # Stub classes for type hinting
+    class AzureKeyCredential:  # type: ignore[no-redef]
+        pass
+
+    class TokenCredential:  # type: ignore[no-redef]
+        pass
+
+    class ContentUnderstandingClient:  # type: ignore[no-redef]
+        pass
+
+    class UserAgentPolicy:  # type: ignore[no-redef]
+        pass
+
+    class DefaultAzureCredential:  # type: ignore[no-redef]
+        pass
+
+    def to_llm_input(*args, **kwargs):  # type: ignore[no-redef]
+        pass
+
+
+# ---------------------------------------------------------------------------
+# File type enum and routing tables
+# ---------------------------------------------------------------------------
+
+
+class ContentUnderstandingFileType(str, Enum):
+    """Supported file types for Content Understanding conversion."""
+
+    # Documents
+    PDF = "pdf"
+    DOCX = "docx"
+    PPTX = "pptx"
+    XLSX = "xlsx"
+    HTML = "html"
+    TXT = "txt"
+    MD = "md"
+    RTF = "rtf"
+    XML = "xml"
+
+    # Email
+    EML = "eml"
+    MSG = "msg"
+
+    # Images (document modality)
+    JPEG = "jpeg"
+    PNG = "png"
+    BMP = "bmp"
+    TIFF = "tiff"
+    HEIF = "heif"
+
+    # Video
+    MP4 = "mp4"
+    M4V = "m4v"
+    MOV = "mov"
+    AVI = "avi"
+    MKV = "mkv"
+    WEBM = "webm"
+    FLV = "flv"
+    WMV = "wmv"
+
+    # Audio
+    WAV = "wav"
+    MP3 = "mp3"
+    M4A = "m4a"
+    FLAC = "flac"
+    OGG = "ogg"
+    AAC = "aac"
+    WMA = "wma"
+
+
+# Extension → file type
+_EXTENSION_MAP: Dict[str, ContentUnderstandingFileType] = {
+    # Documents
+    ".pdf": ContentUnderstandingFileType.PDF,
+    ".docx": ContentUnderstandingFileType.DOCX,
+    ".pptx": ContentUnderstandingFileType.PPTX,
+    ".xlsx": ContentUnderstandingFileType.XLSX,
+    ".html": ContentUnderstandingFileType.HTML,
+    ".txt": ContentUnderstandingFileType.TXT,
+    ".md": ContentUnderstandingFileType.MD,
+    ".rtf": ContentUnderstandingFileType.RTF,
+    ".xml": ContentUnderstandingFileType.XML,
+    # Email
+    ".eml": ContentUnderstandingFileType.EML,
+    ".msg": ContentUnderstandingFileType.MSG,
+    # Images
+    ".jpg": ContentUnderstandingFileType.JPEG,
+    ".jpeg": ContentUnderstandingFileType.JPEG,
+    ".jpe": ContentUnderstandingFileType.JPEG,
+    ".png": ContentUnderstandingFileType.PNG,
+    ".bmp": ContentUnderstandingFileType.BMP,
+    ".tiff": ContentUnderstandingFileType.TIFF,
+    ".heif": ContentUnderstandingFileType.HEIF,
+    ".heic": ContentUnderstandingFileType.HEIF,
+    # Video
+    ".mp4": ContentUnderstandingFileType.MP4,
+    ".m4v": ContentUnderstandingFileType.M4V,
+    ".mov": ContentUnderstandingFileType.MOV,
+    ".avi": ContentUnderstandingFileType.AVI,
+    ".mkv": ContentUnderstandingFileType.MKV,
+    ".webm": ContentUnderstandingFileType.WEBM,
+    ".flv": ContentUnderstandingFileType.FLV,
+    ".wmv": ContentUnderstandingFileType.WMV,
+    # Audio
+    ".wav": ContentUnderstandingFileType.WAV,
+    ".mp3": ContentUnderstandingFileType.MP3,
+    ".m4a": ContentUnderstandingFileType.M4A,
+    ".flac": ContentUnderstandingFileType.FLAC,
+    ".ogg": ContentUnderstandingFileType.OGG,
+    ".aac": ContentUnderstandingFileType.AAC,
+    ".wma": ContentUnderstandingFileType.WMA,
+}
+
+# MIME type prefixes for each file type
+_MIME_PREFIXES: Dict[ContentUnderstandingFileType, List[str]] = {
+    # Documents
+    ContentUnderstandingFileType.PDF: ["application/pdf", "application/x-pdf"],
+    ContentUnderstandingFileType.DOCX: [
+        "application/vnd.openxmlformats-officedocument.wordprocessingml.document"
+    ],
+    ContentUnderstandingFileType.PPTX: [
+        "application/vnd.openxmlformats-officedocument.presentationml"
+    ],
+    ContentUnderstandingFileType.XLSX: [
+        "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet"
+    ],
+    ContentUnderstandingFileType.HTML: ["text/html", "application/xhtml+xml"],
+    ContentUnderstandingFileType.TXT: ["text/plain"],
+    ContentUnderstandingFileType.MD: ["text/markdown"],
+    ContentUnderstandingFileType.RTF: ["text/rtf", "application/rtf"],
+    ContentUnderstandingFileType.XML: ["text/xml", "application/xml"],
+    # Email
+    ContentUnderstandingFileType.EML: ["message/rfc822"],
+    ContentUnderstandingFileType.MSG: ["application/vnd.ms-outlook"],
+    # Images
+    ContentUnderstandingFileType.JPEG: ["image/jpeg"],
+    ContentUnderstandingFileType.PNG: ["image/png"],
+    ContentUnderstandingFileType.BMP: ["image/bmp"],
+    ContentUnderstandingFileType.TIFF: ["image/tiff"],
+    ContentUnderstandingFileType.HEIF: ["image/heif", "image/heic"],
+    # Video
+    ContentUnderstandingFileType.MP4: ["video/mp4"],
+    ContentUnderstandingFileType.M4V: ["video/x-m4v"],
+    ContentUnderstandingFileType.MOV: ["video/quicktime"],
+    ContentUnderstandingFileType.AVI: ["video/x-msvideo"],
+    ContentUnderstandingFileType.MKV: ["video/x-matroska"],
+    ContentUnderstandingFileType.WEBM: ["video/webm"],
+    ContentUnderstandingFileType.FLV: ["video/x-flv"],
+    ContentUnderstandingFileType.WMV: ["video/x-ms-wmv"],
+    # Audio
+    ContentUnderstandingFileType.WAV: ["audio/wav", "audio/x-wav"],
+    ContentUnderstandingFileType.MP3: ["audio/mpeg", "audio/mp3"],
+    ContentUnderstandingFileType.M4A: ["audio/mp4", "audio/m4a", "audio/x-m4a"],
+    ContentUnderstandingFileType.FLAC: ["audio/flac", "audio/x-flac"],
+    ContentUnderstandingFileType.OGG: ["audio/ogg"],
+    ContentUnderstandingFileType.AAC: ["audio/aac"],
+    ContentUnderstandingFileType.WMA: ["audio/x-ms-wma"],
+}
+
+_MIME_ALIASES: Dict[str, str] = {
+    "audio/x-wav": "audio/wav",
+    "audio/x-flac": "audio/flac",
+    "audio/x-m4a": "audio/mp4",
+    "video/x-m4v": "video/mp4",
+}
+
+# File type → modality category
+_DOCUMENT_TYPES = {
+    ContentUnderstandingFileType.PDF,
+    ContentUnderstandingFileType.DOCX,
+    ContentUnderstandingFileType.PPTX,
+    ContentUnderstandingFileType.XLSX,
+    ContentUnderstandingFileType.HTML,
+    ContentUnderstandingFileType.TXT,
+    ContentUnderstandingFileType.MD,
+    ContentUnderstandingFileType.RTF,
+    ContentUnderstandingFileType.XML,
+    ContentUnderstandingFileType.EML,
+    ContentUnderstandingFileType.MSG,
+}
+
+_IMAGE_TYPES = {
+    ContentUnderstandingFileType.JPEG,
+    ContentUnderstandingFileType.PNG,
+    ContentUnderstandingFileType.BMP,
+    ContentUnderstandingFileType.TIFF,
+    ContentUnderstandingFileType.HEIF,
+}
+
+_VIDEO_TYPES = {
+    ContentUnderstandingFileType.MP4,
+    ContentUnderstandingFileType.M4V,
+    ContentUnderstandingFileType.MOV,
+    ContentUnderstandingFileType.AVI,
+    ContentUnderstandingFileType.MKV,
+    ContentUnderstandingFileType.WEBM,
+    ContentUnderstandingFileType.FLV,
+    ContentUnderstandingFileType.WMV,
+}
+
+_AUDIO_TYPES = {
+    ContentUnderstandingFileType.WAV,
+    ContentUnderstandingFileType.MP3,
+    ContentUnderstandingFileType.M4A,
+    ContentUnderstandingFileType.FLAC,
+    ContentUnderstandingFileType.OGG,
+    ContentUnderstandingFileType.AAC,
+    ContentUnderstandingFileType.WMA,
+}
+
+_PREBUILT_ANALYZERS = {
+    "document": "prebuilt-documentSearch",
+    "image": "prebuilt-documentSearch",
+    "video": "prebuilt-videoSearch",
+    "audio": "prebuilt-audioSearch",
+}
+
+# All supported file types (default set when file_types is None)
+_ALL_FILE_TYPES = list(ContentUnderstandingFileType)
+
+
+def _get_modality(file_type: ContentUnderstandingFileType) -> str:
+    """Get the modality category for a file type."""
+    if file_type in _DOCUMENT_TYPES:
+        return "document"
+    elif file_type in _IMAGE_TYPES:
+        return "image"
+    elif file_type in _VIDEO_TYPES:
+        return "video"
+    elif file_type in _AUDIO_TYPES:
+        return "audio"
+    raise ValueError(f"Unknown file type: {file_type}")
+
+
+def _detect_file_type(
+    stream_info: StreamInfo,
+    file_types: Optional[List[ContentUnderstandingFileType]] = None,
+) -> Optional[ContentUnderstandingFileType]:
+    """Detect a supported CU file type from extension or MIME type."""
+    allowed = set(file_types) if file_types is not None else None
+
+    extension = (stream_info.extension or "").lower()
+    file_type = _EXTENSION_MAP.get(extension)
+    if file_type is not None and (allowed is None or file_type in allowed):
+        return file_type
+
+    mimetype = _clean_mime_type(stream_info.mimetype)
+    if not mimetype:
+        return None
+
+    return _detect_file_type_from_mime(mimetype, allowed)
+
+
+def _clean_mime_type(mimetype: Optional[str]) -> str:
+    return (mimetype or "").split(";", 1)[0].strip().lower()
+
+
+def _canonical_mime_type(mimetype: Optional[str]) -> str:
+    cleaned = _clean_mime_type(mimetype)
+    return _MIME_ALIASES.get(cleaned, cleaned) or "application/octet-stream"
+
+
+def _content_type_for(
+    file_type: ContentUnderstandingFileType,
+    mimetype: Optional[str],
+) -> str:
+    """Resolve the content type to send to the CU API.
+
+    Uses the resolved ``file_type`` as the source of truth so analyzer
+    routing and payload metadata stay consistent. The caller-provided
+    ``mimetype`` is only used when it is consistent with ``file_type``
+    (e.g., to preserve subtype distinctions like ``image/heic`` vs
+    ``image/heif``). When ``mimetype`` disagrees with the resolved
+    ``file_type`` (e.g., ``.pdf`` extension with ``audio/mpeg``
+    mimetype), the canonical MIME type for ``file_type`` is used.
+    """
+    prefixes = _MIME_PREFIXES.get(file_type, [])
+    canonical = _canonical_mime_type(mimetype)
+
+    # Use caller-provided MIME if it's consistent with the resolved file_type
+    if prefixes and canonical != "application/octet-stream":
+        for prefix in prefixes:
+            if canonical.startswith(prefix):
+                return canonical
+
+    # Fallback: derive from the resolved file_type (single source of truth)
+    if prefixes:
+        return _canonical_mime_type(prefixes[0])
+
+    return canonical
+
+
+def _detect_file_type_from_mime(
+    mimetype: str,
+    allowed: Optional[set[ContentUnderstandingFileType]],
+) -> Optional[ContentUnderstandingFileType]:
+    for candidate, prefixes in _MIME_PREFIXES.items():
+        if allowed is not None and candidate not in allowed:
+            continue
+        for prefix in prefixes:
+            if mimetype.startswith(prefix):
+                return candidate
+    return None
+
+
+# ---------------------------------------------------------------------------
+# Smart routing: base_analyzer_id → modality mapping
+# ---------------------------------------------------------------------------
+
+_BASE_TO_MODALITY: Dict[str, str] = {
+    "prebuilt-document": "document",
+    "prebuilt-image": "image",
+    "prebuilt-audio": "audio",
+    "prebuilt-video": "video",
+}
+
+# Cache of known prebuilt analyzer name → modality (avoids API call)
+_KNOWN_PREBUILT_MODALITY: Dict[str, str] = {
+    # Document-based prebuilts
+    "prebuilt-documentSearch": "document",
+    "prebuilt-layout": "document",
+    "prebuilt-read": "document",
+    "prebuilt-document": "document",
+    "prebuilt-invoice": "document",
+    "prebuilt-receipt": "document",
+    "prebuilt-receipt.generic": "document",
+    "prebuilt-receipt.hotel": "document",
+    "prebuilt-idDocument": "document",
+    "prebuilt-idDocument.generic": "document",
+    "prebuilt-idDocument.passport": "document",
+    "prebuilt-healthInsuranceCard.us": "document",
+    "prebuilt-contract": "document",
+    "prebuilt-creditCard": "document",
+    "prebuilt-creditMemo": "document",
+    "prebuilt-bankStatement.us": "document",
+    "prebuilt-check.us": "document",
+    "prebuilt-purchaseOrder": "document",
+    "prebuilt-procurement": "document",
+    "prebuilt-payStub.us": "document",
+    "prebuilt-utilityBill": "document",
+    "prebuilt-marriageCertificate.us": "document",
+    "prebuilt-documentFieldSchema": "document",
+    "prebuilt-documentFields": "document",
+    # Tax prebuilts (all document-based)
+    "prebuilt-tax.us": "document",
+    "prebuilt-tax.us.w2": "document",
+    "prebuilt-tax.us.w4": "document",
+    "prebuilt-tax.us.1040": "document",
+    # Mortgage prebuilts
+    "prebuilt-mortgage.us": "document",
+    "prebuilt-mortgage.us.1003": "document",
+    "prebuilt-mortgage.us.closingDisclosure": "document",
+    # Image-based prebuilts
+    "prebuilt-image": "image",
+    "prebuilt-imageSearch": "image",
+    # Audio-based prebuilts
+    "prebuilt-audio": "audio",
+    "prebuilt-audioSearch": "audio",
+    "prebuilt-callCenter": "audio",
+    # Video-based prebuilts
+    "prebuilt-video": "video",
+    "prebuilt-videoSearch": "video",
+    "prebuilt-videoSynopsis": "video",
+}
+
+
+def _resolve_analyzer_modality(client: Any, analyzer_id: str) -> str:
+    """Resolve analyzer modality from cache or via get_analyzer() fallback.
+
+    For known prebuilt-* names, returns the modality from
+    ``_KNOWN_PREBUILT_MODALITY`` without an API call.  For unknown
+    prebuilt-* names or custom analyzers, calls ``get_analyzer()``
+    to inspect ``base_analyzer_id``.
+
+    Args:
+        client: A ``ContentUnderstandingClient`` instance.
+        analyzer_id: The analyzer ID to resolve.
+
+    Returns:
+        Modality string ("document", "image", "audio", or "video").
+
+    Raises:
+        ValueError: If ``get_analyzer()`` fails.
+    """
+    # Known prebuilt — use cache, no API call
+    if analyzer_id in _KNOWN_PREBUILT_MODALITY:
+        return _KNOWN_PREBUILT_MODALITY[analyzer_id]
+
+    # Unknown prebuilt or custom analyzer — call get_analyzer()
+    try:
+        analyzer_info = client.get_analyzer(analyzer_id)
+    except Exception as exc:
+        raise ValueError(f"Failed to resolve analyzer '{analyzer_id}': {exc}") from exc
+
+    if analyzer_info.base_analyzer_id:
+        return _BASE_TO_MODALITY.get(analyzer_info.base_analyzer_id, "document")
+    return "document"
+
+
+def _is_analyzer_compatible(file_modality: str, analyzer_modality: str) -> bool:
+    """Return True when an analyzer modality can process a file modality."""
+    if analyzer_modality == "document":
+        return file_modality in {"document", "image"}
+    return file_modality == analyzer_modality
+
+
+# ---------------------------------------------------------------------------
+# Converter
+# ---------------------------------------------------------------------------
+
+
+class ContentUnderstandingConverter(DocumentConverter):
+    """Converts files using Azure Content Understanding.
+
+    Provides high-quality document, image, audio, and video conversion
+    with structured field extraction via YAML front matter.
+    """
+
+    def __init__(
+        self,
+        *,
+        endpoint: str,
+        credential: AzureKeyCredential | TokenCredential | None = None,
+        analyzer_id: Optional[str] = None,
+        file_types: Optional[List[ContentUnderstandingFileType]] = None,
+    ):
+        """Initialize the Content Understanding converter.
+
+        Args:
+            endpoint: CU resource endpoint URL.
+            credential: Explicit credential. If None, falls back to
+                AZURE_API_KEY env var, then DefaultAzureCredential.
+            analyzer_id: Custom analyzer for compatible file types.
+                When set, the converter resolves the analyzer's modality
+                at init (from the known-prebuilt cache, or via
+                get_analyzer() for any other ID) and routes only
+                compatible file types to it. Incompatible modalities
+                auto-route to default prebuilts. If None, auto-selects
+                by extension/MIME.
+            file_types: Which file types to handle. If None, uses the
+                default set (all supported formats).
+        """
+        super().__init__()
+
+        # Raise if dependencies are missing
+        if _dependency_exc_info is not None:
+            raise MissingDependencyException(
+                "ContentUnderstandingConverter requires the optional dependency "
+                "[az-content-understanding] (or [all]) to be installed. "
+                "E.g., `pip install markitdown[az-content-understanding]`"
+            ) from _dependency_exc_info[
+                1
+            ].with_traceback(  # type: ignore[union-attr]
+                _dependency_exc_info[2]
+            )
+
+        self._file_types = file_types if file_types is not None else _ALL_FILE_TYPES
+        self._analyzer_id = analyzer_id
+        self._analyzer_modality: Optional[str] = None
+
+        # Resolve credential
+        if credential is None:
+            api_key = os.environ.get("AZURE_API_KEY")
+            if api_key is not None:
+                credential = AzureKeyCredential(api_key)
+            else:
+                credential = DefaultAzureCredential()
+
+        # User agent for telemetry
+        try:
+            from ..__about__ import __version__
+        except ImportError:
+            __version__ = "unknown"
+        user_agent = f"markitdown-cu/{__version__}"
+
+        # Create CU client
+        self._client = ContentUnderstandingClient(
+            endpoint=endpoint,
+            credential=credential,
+            user_agent_policy=UserAgentPolicy(user_agent=user_agent),
+        )
+
+        # Smart routing: resolve analyzer modality at init (at most one API call)
+        if self._analyzer_id is not None:
+            self._analyzer_modality = _resolve_analyzer_modality(
+                self._client, self._analyzer_id
+            )
+
+    def accepts(
+        self,
+        file_stream: BinaryIO,
+        stream_info: StreamInfo,
+        **kwargs: Any,
+    ) -> bool:
+        """Return True if the file type is in the configured set."""
+        return _detect_file_type(stream_info, self._file_types) is not None
+
+    def convert(
+        self,
+        file_stream: BinaryIO,
+        stream_info: StreamInfo,
+        **kwargs: Any,
+    ) -> DocumentConverterResult:
+        """Convert the file using CU and return Markdown with YAML front matter."""
+
+        # 1. Determine analyzer_id (smart routing: check modality)
+        file_type = _detect_file_type(stream_info, self._file_types)
+        if file_type is None:
+            raise ValueError(
+                "Unsupported file type for Content Understanding conversion."
+            )
+        file_modality = _get_modality(file_type)
+
+        if (
+            self._analyzer_id is not None
+            and self._analyzer_modality is not None
+            and _is_analyzer_compatible(file_modality, self._analyzer_modality)
+        ):
+            analyzer_id = self._analyzer_id
+        else:
+            analyzer_id = _PREBUILT_ANALYZERS.get(
+                file_modality, "prebuilt-documentSearch"
+            )
+
+        # 2. Read file bytes and determine MIME type
+        file_bytes = file_stream.read()
+        content_type = _content_type_for(file_type, stream_info.mimetype)
+
+        # 3. Call CU SDK
+        poller = self._client.begin_analyze_binary(
+            analyzer_id=analyzer_id,
+            binary_input=file_bytes,
+            content_type=content_type,
+        )
+
+        # 4. Block on result
+        result = poller.result()
+
+        # 5. Format output using to_llm_input()
+        text = to_llm_input(result)
+
+        # 6. Return the Markdown wrapped in a DocumentConverterResult
+        return DocumentConverterResult(markdown=text)
--- a/packages/markitdown/tests/test_cu_converter.py
+++ b/packages/markitdown/tests/test_cu_converter.py
@@ -0,0 +1,928 @@
+"""Tests for ContentUnderstandingConverter.
+
+Tests accepts() routing, smart routing modality logic, and convert() via mocks.
+Follows the same pattern as test_docintel_html.py.
+"""
+
+import io
+import sys
+from unittest.mock import MagicMock, patch
+
+import pytest
+
+from markitdown.converters._cu_converter import (
+    ContentUnderstandingConverter,
+    ContentUnderstandingFileType,
+    _resolve_analyzer_modality,
+    _get_modality,
+    _detect_file_type,
+    _canonical_mime_type,
+    _content_type_for,
+    _EXTENSION_MAP,
+)
+from markitdown._stream_info import StreamInfo
+
+# ---------------------------------------------------------------------------
+# Helper: create a converter with accepts() working but no SDK init
+# ---------------------------------------------------------------------------
+
+
+def _make_converter(file_types=None, analyzer_id=None, analyzer_modality=None):
+    """Create a converter bypassing __init__ (no SDK deps needed)."""
+    conv = ContentUnderstandingConverter.__new__(ContentUnderstandingConverter)
+    conv._analyzer_id = analyzer_id
+    conv._analyzer_modality = analyzer_modality
+
+    # Set accepted file types without running SDK-dependent initialization.
+    from markitdown.converters._cu_converter import (
+        _ALL_FILE_TYPES,
+    )
+
+    types = file_types if file_types is not None else _ALL_FILE_TYPES
+    conv._file_types = types
+
+    return conv
+
+
+# ---------------------------------------------------------------------------
+# accepts() tests — extension-based
+# ---------------------------------------------------------------------------
+
+
+class TestAcceptsExtension:
+    """Test accepts() for supported and unsupported file extensions."""
+
+    @pytest.mark.parametrize(
+        "ext",
+        [
+            ".pdf",
+            ".docx",
+            ".pptx",
+            ".xlsx",
+            ".html",
+            ".txt",
+            ".md",
+            ".rtf",
+            ".xml",
+            ".eml",
+            ".msg",
+            ".jpg",
+            ".jpeg",
+            ".jpe",
+            ".png",
+            ".bmp",
+            ".tiff",
+            ".heif",
+            ".heic",
+            ".mp4",
+            ".m4v",
+            ".mov",
+            ".avi",
+            ".mkv",
+            ".webm",
+            ".flv",
+            ".wmv",
+            ".wav",
+            ".mp3",
+            ".m4a",
+            ".flac",
+            ".ogg",
+            ".aac",
+            ".wma",
+        ],
+    )
+    def test_accepts_supported_extensions(self, ext):
+        conv = _make_converter()
+        assert conv.accepts(io.BytesIO(b""), StreamInfo(extension=ext))
+
+    @pytest.mark.parametrize(
+        "ext",
+        [
+            ".csv",
+            ".json",
+            ".zip",
+            ".epub",
+            ".py",
+            ".rs",
+        ],
+    )
+    def test_rejects_unsupported_extensions(self, ext):
+        conv = _make_converter()
+        assert not conv.accepts(io.BytesIO(b""), StreamInfo(extension=ext))
+
+
+# ---------------------------------------------------------------------------
+# accepts() tests — MIME-based
+# ---------------------------------------------------------------------------
+
+
+class TestAcceptsMime:
+    """Test accepts() for MIME type matching."""
+
+    @pytest.mark.parametrize(
+        "mime",
+        [
+            "application/pdf",
+            "image/jpeg",
+            "video/mp4",
+            "audio/wav",
+            "audio/x-wav",
+            "text/html",
+            "audio/mpeg",
+            "audio/x-m4a",
+            "audio/x-flac",
+            "video/quicktime",
+            "video/webm",
+            "video/x-m4v",
+            "video/x-flv",
+            "video/x-ms-wmv",
+            "audio/aac",
+            "audio/x-ms-wma",
+        ],
+    )
+    def test_accepts_supported_mimetypes(self, mime):
+        conv = _make_converter()
+        assert conv.accepts(io.BytesIO(b""), StreamInfo(mimetype=mime))
+
+    @pytest.mark.parametrize(
+        "mime",
+        [
+            "text/csv",
+            "application/json",
+            "application/zip",
+        ],
+    )
+    def test_rejects_unsupported_mimetypes(self, mime):
+        conv = _make_converter()
+        assert not conv.accepts(io.BytesIO(b""), StreamInfo(mimetype=mime))
+
+
+# ---------------------------------------------------------------------------
+# accepts() tests — cu_file_types restriction
+# ---------------------------------------------------------------------------
+
+
+class TestAcceptsFileTypeRestriction:
+    """Test that cu_file_types restricts which formats are accepted."""
+
+    def test_restricted_to_pdf_only(self):
+        conv = _make_converter(file_types=[ContentUnderstandingFileType.PDF])
+        assert conv.accepts(io.BytesIO(b""), StreamInfo(extension=".pdf"))
+        assert not conv.accepts(io.BytesIO(b""), StreamInfo(extension=".mp4"))
+        assert not conv.accepts(io.BytesIO(b""), StreamInfo(extension=".wav"))
+        assert not conv.accepts(io.BytesIO(b""), StreamInfo(extension=".jpg"))
+
+    def test_restricted_to_audio(self):
+        conv = _make_converter(
+            file_types=[
+                ContentUnderstandingFileType.WAV,
+                ContentUnderstandingFileType.MP3,
+            ]
+        )
+        assert conv.accepts(io.BytesIO(b""), StreamInfo(extension=".wav"))
+        assert conv.accepts(io.BytesIO(b""), StreamInfo(extension=".mp3"))
+        assert not conv.accepts(io.BytesIO(b""), StreamInfo(extension=".pdf"))
+
+    def test_webm_value_matches_cli_input(self):
+        assert ContentUnderstandingFileType("webm") == ContentUnderstandingFileType.WEBM
+
+    def test_m4v_value_matches_cli_input(self):
+        assert ContentUnderstandingFileType("m4v") == ContentUnderstandingFileType.M4V
+
+
+# ---------------------------------------------------------------------------
+# file type detection tests
+# ---------------------------------------------------------------------------
+
+
+class TestDetectFileType:
+    """Test extension and MIME based file type detection."""
+
+    def test_detects_video_from_mime_without_extension(self):
+        assert (
+            _detect_file_type(StreamInfo(mimetype="video/mp4"))
+            == ContentUnderstandingFileType.MP4
+        )
+
+    def test_detects_audio_from_mime_without_extension(self):
+        assert (
+            _detect_file_type(StreamInfo(mimetype="audio/mpeg"))
+            == ContentUnderstandingFileType.MP3
+        )
+
+    def test_detects_audio_alias_from_mime_without_extension(self):
+        assert (
+            _detect_file_type(StreamInfo(mimetype="audio/x-wav"))
+            == ContentUnderstandingFileType.WAV
+        )
+
+    def test_detects_video_alias_from_mime_without_extension(self):
+        assert (
+            _detect_file_type(StreamInfo(mimetype="video/x-m4v"))
+            == ContentUnderstandingFileType.M4V
+        )
+
+    @pytest.mark.parametrize(
+        ("mimetype", "expected"),
+        [
+            ("audio/x-wav", "audio/wav"),
+            ("audio/x-flac", "audio/flac"),
+            ("audio/x-m4a", "audio/mp4"),
+            ("video/x-m4v", "video/mp4"),
+            ("video/mp4", "video/mp4"),
+            (None, "application/octet-stream"),
+        ],
+    )
+    def test_canonical_mime_type(self, mimetype, expected):
+        assert _canonical_mime_type(mimetype) == expected
+
+    @pytest.mark.parametrize(
+        ("file_type", "mimetype", "expected"),
+        [
+            (ContentUnderstandingFileType.PDF, None, "application/pdf"),
+            (ContentUnderstandingFileType.M4V, None, "video/mp4"),
+            (ContentUnderstandingFileType.FLAC, "audio/x-flac", "audio/flac"),
+        ],
+    )
+    def test_content_type_for(self, file_type, mimetype, expected):
+        assert _content_type_for(file_type, mimetype) == expected
+
+    @pytest.mark.parametrize(
+        ("file_type", "mimetype", "expected"),
+        [
+            # Extension/file_type wins when mimetype disagrees — the
+            # resolved file_type is the single source of truth so that
+            # analyzer routing and payload metadata stay consistent.
+            (ContentUnderstandingFileType.PDF, "audio/mpeg", "application/pdf"),
+            (ContentUnderstandingFileType.MP3, "application/pdf", "audio/mpeg"),
+            (ContentUnderstandingFileType.MP4, "image/jpeg", "video/mp4"),
+            (ContentUnderstandingFileType.JPEG, "video/mp4", "image/jpeg"),
+            # Subtype distinctions are preserved when consistent
+            # (e.g., HEIC vs HEIF both map to file_type HEIF; if the
+            # caller passed image/heic explicitly, keep it).
+            (ContentUnderstandingFileType.HEIF, "image/heic", "image/heic"),
+            (ContentUnderstandingFileType.HEIF, "image/heif", "image/heif"),
+        ],
+    )
+    def test_content_type_for_resolves_conflicts_to_file_type(
+        self, file_type, mimetype, expected
+    ):
+        """On conflict, file_type wins; a consistent subtype (image/heic) is kept."""
+        assert _content_type_for(file_type, mimetype) == expected
+
+    def test_conflicting_extension_and_mimetype_in_convert(self):
+        """End-to-end: conflicting StreamInfo routes by extension and
+        sends a content_type consistent with the resolved file_type."""
+        conv = _make_converter()
+        conv._client = MagicMock()
+        mock_poller = MagicMock()
+        mock_poller.result.return_value = MagicMock(contents=[])
+        conv._client.begin_analyze_binary.return_value = mock_poller
+
+        with patch(
+            "markitdown.converters._cu_converter.to_llm_input",
+            return_value="ok",
+        ):
+            conv.convert(
+                io.BytesIO(b"fake"),
+                # .pdf extension but bogus audio mimetype
+                StreamInfo(extension=".pdf", mimetype="audio/mpeg"),
+            )
+
+        call_kwargs = conv._client.begin_analyze_binary.call_args.kwargs
+        # Routed by extension: document modality → prebuilt-documentSearch
+        assert call_kwargs["analyzer_id"] == "prebuilt-documentSearch"
+        # content_type derived from file_type (PDF), not the conflicting mime
+        assert call_kwargs["content_type"] == "application/pdf"
+
+    def test_file_type_restriction_applies_to_mime(self):
+        assert (
+            _detect_file_type(
+                StreamInfo(mimetype="video/mp4"),
+                [ContentUnderstandingFileType.PDF],
+            )
+            is None
+        )
+
+
+# ---------------------------------------------------------------------------
+# Smart routing tests
+# ---------------------------------------------------------------------------
+
+
+class TestSmartRouting:
+    """Test modality-aware analyzer routing."""
+
+    def test_document_analyzer_routes_pdf_to_custom(self):
+        """Document-based analyzer should be used for PDF."""
+        conv = _make_converter(
+            analyzer_id="my-doc-analyzer",
+            analyzer_modality="document",
+        )
+        conv._client = MagicMock()
+        mock_result = MagicMock()
+        mock_result.contents = []
+        mock_poller = MagicMock()
+        mock_poller.result.return_value = mock_result
+
+        conv._client.begin_analyze_binary.return_value = mock_poller
+
+        with patch("markitdown.converters._cu_converter.to_llm_input", return_value=""):
+            conv.convert(
+                io.BytesIO(b"fake pdf"),
+                StreamInfo(extension=".pdf", mimetype="application/pdf"),
+            )
+
+        # Should use the custom analyzer for PDF (document modality)
+        call_args = conv._client.begin_analyze_binary.call_args
+        assert call_args.kwargs["analyzer_id"] == "my-doc-analyzer"
+
+    def test_document_analyzer_routes_mp3_to_prebuilt(self):
+        """Document-based analyzer should auto-route MP3 to prebuilt-audioSearch."""
+        conv = _make_converter(
+            analyzer_id="my-doc-analyzer",
+            analyzer_modality="document",
+        )
+        conv._client = MagicMock()
+        mock_result = MagicMock()
+        mock_result.contents = []
+        mock_poller = MagicMock()
+        mock_poller.result.return_value = mock_result
+
+        conv._client.begin_analyze_binary.return_value = mock_poller
+
+        with patch("markitdown.converters._cu_converter.to_llm_input", return_value=""):
+            conv.convert(
+                io.BytesIO(b"fake audio"),
+                StreamInfo(extension=".mp3", mimetype="audio/mpeg"),
+            )
+
+        call_args = conv._client.begin_analyze_binary.call_args
+        assert call_args.kwargs["analyzer_id"] == "prebuilt-audioSearch"
+
+    def test_document_analyzer_routes_mp4_to_prebuilt(self):
+        """Document-based analyzer should auto-route MP4 to prebuilt-videoSearch."""
+        conv = _make_converter(
+            analyzer_id="my-doc-analyzer",
+            analyzer_modality="document",
+        )
+        conv._client = MagicMock()
+        mock_result = MagicMock()
+        mock_result.contents = []
+        mock_poller = MagicMock()
+        mock_poller.result.return_value = mock_result
+
+        conv._client.begin_analyze_binary.return_value = mock_poller
+
+        with patch("markitdown.converters._cu_converter.to_llm_input", return_value=""):
+            conv.convert(
+                io.BytesIO(b"fake video"),
+                StreamInfo(extension=".mp4", mimetype="video/mp4"),
+            )
+
+        call_args = conv._client.begin_analyze_binary.call_args
+        assert call_args.kwargs["analyzer_id"] == "prebuilt-videoSearch"
+
+    def test_no_analyzer_id_uses_auto_routing(self):
+        """Without analyzer_id, PDF should auto-route to prebuilt-documentSearch."""
+        conv = _make_converter(analyzer_id=None, analyzer_modality=None)
+        conv._client = MagicMock()
+        mock_result = MagicMock()
+        mock_result.contents = []
+        mock_poller = MagicMock()
+        mock_poller.result.return_value = mock_result
+
+        conv._client.begin_analyze_binary.return_value = mock_poller
+
+        with patch("markitdown.converters._cu_converter.to_llm_input", return_value=""):
+            conv.convert(
+                io.BytesIO(b"fake pdf"),
+                StreamInfo(extension=".pdf", mimetype="application/pdf"),
+            )
+
+        call_args = conv._client.begin_analyze_binary.call_args
+        assert call_args.kwargs["analyzer_id"] == "prebuilt-documentSearch"
+
+    def test_no_analyzer_id_routes_image_to_document_search(self):
+        """Without analyzer_id, images should auto-route to prebuilt-documentSearch."""
+        conv = _make_converter(analyzer_id=None, analyzer_modality=None)
+        conv._client = MagicMock()
+        mock_result = MagicMock()
+        mock_result.contents = []
+        mock_poller = MagicMock()
+        mock_poller.result.return_value = mock_result
+
+        conv._client.begin_analyze_binary.return_value = mock_poller
+
+        with patch("markitdown.converters._cu_converter.to_llm_input", return_value=""):
+            conv.convert(
+                io.BytesIO(b"fake image"),
+                StreamInfo(extension=".jpg", mimetype="image/jpeg"),
+            )
+
+        call_args = conv._client.begin_analyze_binary.call_args
+        assert call_args.kwargs["analyzer_id"] == "prebuilt-documentSearch"
+
+    def test_document_analyzer_routes_image_to_custom(self):
+        """Document-based analyzers should still handle image documents."""
+        conv = _make_converter(
+            analyzer_id="my-doc-analyzer",
+            analyzer_modality="document",
+        )
+        conv._client = MagicMock()
+        mock_result = MagicMock()
+        mock_result.contents = []
+        mock_poller = MagicMock()
+        mock_poller.result.return_value = mock_result
+
+        conv._client.begin_analyze_binary.return_value = mock_poller
+
+        with patch("markitdown.converters._cu_converter.to_llm_input", return_value=""):
+            conv.convert(
+                io.BytesIO(b"fake image"),
+                StreamInfo(extension=".jpg", mimetype="image/jpeg"),
+            )
+
+        call_args = conv._client.begin_analyze_binary.call_args
+        assert call_args.kwargs["analyzer_id"] == "my-doc-analyzer"
+
+    def test_image_analyzer_routes_jpeg_to_custom(self):
+        """Image-based analyzers should be used for image files."""
+        conv = _make_converter(
+            analyzer_id="my-image-analyzer",
+            analyzer_modality="image",
+        )
+        conv._client = MagicMock()
+        mock_result = MagicMock()
+        mock_result.contents = []
+        mock_poller = MagicMock()
+        mock_poller.result.return_value = mock_result
+
+        conv._client.begin_analyze_binary.return_value = mock_poller
+
+        with patch("markitdown.converters._cu_converter.to_llm_input", return_value=""):
+            conv.convert(
+                io.BytesIO(b"fake image"),
+                StreamInfo(extension=".jpg", mimetype="image/jpeg"),
+            )
+
+        call_args = conv._client.begin_analyze_binary.call_args
+        assert call_args.kwargs["analyzer_id"] == "my-image-analyzer"
+
+    def test_image_analyzer_routes_pdf_to_document_prebuilt(self):
+        """Image-based analyzers should not claim non-image document files."""
+        conv = _make_converter(
+            analyzer_id="my-image-analyzer",
+            analyzer_modality="image",
+        )
+        conv._client = MagicMock()
+        mock_result = MagicMock()
+        mock_result.contents = []
+        mock_poller = MagicMock()
+        mock_poller.result.return_value = mock_result
+
+        conv._client.begin_analyze_binary.return_value = mock_poller
+
+        with patch("markitdown.converters._cu_converter.to_llm_input", return_value=""):
+            conv.convert(
+                io.BytesIO(b"fake pdf"),
+                StreamInfo(extension=".pdf", mimetype="application/pdf"),
+            )
+
+        call_args = conv._client.begin_analyze_binary.call_args
+        assert call_args.kwargs["analyzer_id"] == "prebuilt-documentSearch"
+
+    @pytest.mark.parametrize(
+        ("mimetype", "expected_analyzer"),
+        [
+            ("video/mp4", "prebuilt-videoSearch"),
+            ("video/x-m4v", "prebuilt-videoSearch"),
+            ("audio/mpeg", "prebuilt-audioSearch"),
+            ("audio/x-wav", "prebuilt-audioSearch"),
+        ],
+    )
+    def test_mime_only_input_uses_auto_routing(self, mimetype, expected_analyzer):
+        """MIME-only streams should route to the matching modality analyzer."""
+        conv = _make_converter(analyzer_id=None, analyzer_modality=None)
+        conv._client = MagicMock()
+        mock_result = MagicMock()
+        mock_result.contents = []
+        mock_poller = MagicMock()
+        mock_poller.result.return_value = mock_result
+
+        conv._client.begin_analyze_binary.return_value = mock_poller
+
+        with patch("markitdown.converters._cu_converter.to_llm_input", return_value=""):
+            conv.convert(io.BytesIO(b"fake content"), StreamInfo(mimetype=mimetype))
+
+        call_args = conv._client.begin_analyze_binary.call_args
+        assert call_args.kwargs["analyzer_id"] == expected_analyzer
+
+    def test_mime_alias_input_uses_canonical_content_type(self):
+        """Alias MIME types should be sent to CU as canonical content types."""
+        conv = _make_converter(analyzer_id=None, analyzer_modality=None)
+        conv._client = MagicMock()
+        mock_result = MagicMock()
+        mock_result.contents = []
+        mock_poller = MagicMock()
+        mock_poller.result.return_value = mock_result
+
+        conv._client.begin_analyze_binary.return_value = mock_poller
+
+        with patch("markitdown.converters._cu_converter.to_llm_input", return_value=""):
+            conv.convert(io.BytesIO(b"fake video"), StreamInfo(mimetype="video/x-m4v"))
+
+        call_args = conv._client.begin_analyze_binary.call_args
+        assert call_args.kwargs["analyzer_id"] == "prebuilt-videoSearch"
+        assert call_args.kwargs["content_type"] == "video/mp4"
+
+    def test_extension_only_input_uses_file_type_content_type(self):
+        """Extension-only inputs should send CU a matching content type."""
+        conv = _make_converter(analyzer_id=None, analyzer_modality=None)
+        conv._client = MagicMock()
+        mock_result = MagicMock()
+        mock_result.contents = []
+        mock_poller = MagicMock()
+        mock_poller.result.return_value = mock_result
+
+        conv._client.begin_analyze_binary.return_value = mock_poller
+
+        with patch("markitdown.converters._cu_converter.to_llm_input", return_value=""):
+            conv.convert(io.BytesIO(b"fake pdf"), StreamInfo(extension=".pdf"))
+
+        call_args = conv._client.begin_analyze_binary.call_args
+        assert call_args.kwargs["analyzer_id"] == "prebuilt-documentSearch"
+        assert call_args.kwargs["content_type"] == "application/pdf"
+
+
+# ---------------------------------------------------------------------------
+# _resolve_analyzer_modality tests
+# ---------------------------------------------------------------------------
+
+
+class TestResolveAnalyzerModality:
+    """Test modality resolution from analyzer IDs."""
+
+    def test_known_document_prebuilts(self):
+        client = MagicMock()
+        assert (
+            _resolve_analyzer_modality(client, "prebuilt-documentSearch") == "document"
+        )
+        assert _resolve_analyzer_modality(client, "prebuilt-invoice") == "document"
+        assert _resolve_analyzer_modality(client, "prebuilt-layout") == "document"
+        assert _resolve_analyzer_modality(client, "prebuilt-receipt") == "document"
+        assert _resolve_analyzer_modality(client, "prebuilt-tax.us.w2") == "document"
+        # Known prebuilts should never call get_analyzer()
+        client.get_analyzer.assert_not_called()
+
+    def test_known_audio_prebuilts(self):
+        client = MagicMock()
+        assert _resolve_analyzer_modality(client, "prebuilt-audioSearch") == "audio"
+        assert _resolve_analyzer_modality(client, "prebuilt-callCenter") == "audio"
+        client.get_analyzer.assert_not_called()
+
+    def test_known_video_prebuilts(self):
+        client = MagicMock()
+        assert _resolve_analyzer_modality(client, "prebuilt-videoSearch") == "video"
+        assert _resolve_analyzer_modality(client, "prebuilt-videoSynopsis") == "video"
+        client.get_analyzer.assert_not_called()
+
+    def test_known_image_prebuilts(self):
+        client = MagicMock()
+        assert _resolve_analyzer_modality(client, "prebuilt-imageSearch") == "image"
+        assert _resolve_analyzer_modality(client, "prebuilt-image") == "image"
+        client.get_analyzer.assert_not_called()
+
+    def test_unknown_prebuilt_falls_back_to_get_analyzer(self):
+        """Unknown prebuilt-* names should call get_analyzer() for resolution."""
+        client = MagicMock()
+        mock_analyzer = MagicMock()
+        mock_analyzer.base_analyzer_id = "prebuilt-audio"
+        client.get_analyzer.return_value = mock_analyzer
+
+        result = _resolve_analyzer_modality(client, "prebuilt-newAnalyzer")
+        assert result == "audio"
+        client.get_analyzer.assert_called_once_with("prebuilt-newAnalyzer")
+
+    def test_custom_analyzer_calls_get_analyzer(self):
+        """Custom analyzers should call get_analyzer() to resolve modality."""
+        client = MagicMock()
+        mock_analyzer = MagicMock()
+        mock_analyzer.base_analyzer_id = "prebuilt-document"
+        client.get_analyzer.return_value = mock_analyzer
+
+        result = _resolve_analyzer_modality(client, "my-custom-doc-analyzer")
+        assert result == "document"
+        client.get_analyzer.assert_called_once_with("my-custom-doc-analyzer")
+
+    def test_custom_analyzer_no_base_defaults_to_document(self):
+        """Analyzer with no base_analyzer_id defaults to document."""
+        client = MagicMock()
+        mock_analyzer = MagicMock()
+        mock_analyzer.base_analyzer_id = None
+        client.get_analyzer.return_value = mock_analyzer
+
+        result = _resolve_analyzer_modality(client, "my-custom-analyzer")
+        assert result == "document"
+
+    def test_get_analyzer_failure_raises_value_error(self):
+        """Failed get_analyzer() should raise ValueError."""
+        client = MagicMock()
+        client.get_analyzer.side_effect = Exception("not found")
+
+        with pytest.raises(ValueError, match="Failed to resolve analyzer 'bad-id'"):
+            _resolve_analyzer_modality(client, "bad-id")
+
+
+# ---------------------------------------------------------------------------
+# _get_modality tests
+# ---------------------------------------------------------------------------
+
+
+class TestGetModality:
+    """Test file type → modality mapping."""
+
+    def test_document_types(self):
+        assert _get_modality(ContentUnderstandingFileType.PDF) == "document"
+        assert _get_modality(ContentUnderstandingFileType.DOCX) == "document"
+
+    def test_image_types(self):
+        assert _get_modality(ContentUnderstandingFileType.JPEG) == "image"
+        assert _get_modality(ContentUnderstandingFileType.PNG) == "image"
+
+    def test_video_types(self):
+        assert _get_modality(ContentUnderstandingFileType.MP4) == "video"
+        assert _get_modality(ContentUnderstandingFileType.MOV) == "video"
+
+    def test_audio_types(self):
+        assert _get_modality(ContentUnderstandingFileType.WAV) == "audio"
+        assert _get_modality(ContentUnderstandingFileType.MP3) == "audio"
+
+
+# ---------------------------------------------------------------------------
+# convert() mock tests
+# ---------------------------------------------------------------------------
+
+
+class TestConvertMock:
+    """Test convert() with mocked CU SDK."""
+
+    def _run_convert(self, extension, mimetype, expected_output="mock output"):
+        conv = _make_converter()
+        conv._client = MagicMock()
+
+        mock_result = MagicMock()
+        mock_result.contents = []
+        mock_poller = MagicMock()
+        mock_poller.result.return_value = mock_result
+        conv._client.begin_analyze_binary.return_value = mock_poller
+
+        with patch(
+            "markitdown.converters._cu_converter.to_llm_input",
+            return_value=expected_output,
+        ):
+            result = conv.convert(
+                io.BytesIO(b"fake content"),
+                StreamInfo(extension=extension, mimetype=mimetype),
+            )
+        return result
+
+    def test_pdf_returns_markdown(self):
+        result = self._run_convert(
+            ".pdf", "application/pdf", "---\ncontentType: document\n---\n# Test"
+        )
+        assert "contentType: document" in result.markdown
+
+    def test_mp4_returns_markdown(self):
+        result = self._run_convert(
+            ".mp4", "video/mp4", "---\ncontentType: audioVisual\n---\nSpeaker 1: Hello"
+        )
+        assert "contentType: audioVisual" in result.markdown
+
+    def test_wav_returns_markdown(self):
+        result = self._run_convert(
+            ".wav", "audio/wav", "---\ncontentType: audioVisual\n---\nSpeaker 1: Hi"
+        )
+        assert "audioVisual" in result.markdown
+
+    def test_empty_result(self):
+        result = self._run_convert(".pdf", "application/pdf", "")
+        assert result.markdown == ""
+
+    def test_jpeg_returns_markdown(self):
+        result = self._run_convert(
+            ".jpg", "image/jpeg", "---\ncontentType: document\n---\n# Photo"
+        )
+        assert "contentType: document" in result.markdown
+
+
+# ---------------------------------------------------------------------------
+# Init-time get_analyzer() error wrapping
+# ---------------------------------------------------------------------------
+
+
+class TestGetAnalyzerError:
+    """Test that get_analyzer() failures at init produce a clear error."""
+
+    def test_nonexistent_analyzer_raises_value_error(self):
+        """A failed get_analyzer() should raise ValueError with analyzer name."""
+        with patch(
+            "markitdown.converters._cu_converter._dependency_exc_info", None
+        ), patch(
+            "markitdown.converters._cu_converter.ContentUnderstandingClient"
+        ) as MockClient, patch(
+            "markitdown.converters._cu_converter.DefaultAzureCredential"
+        ):
+            mock_client = MagicMock()
+            mock_client.get_analyzer.side_effect = Exception("not found")
+            MockClient.return_value = mock_client
+
+            with pytest.raises(ValueError, match="Failed to resolve analyzer 'bad-id'"):
+                ContentUnderstandingConverter(
+                    endpoint="https://fake", analyzer_id="bad-id"
+                )
+
+
+# ---------------------------------------------------------------------------
+# Registration priority test
+# ---------------------------------------------------------------------------
+
+
+class TestRegistrationPriority:
+    """Test that CU converter is registered with higher priority than Doc Intel."""
+
+    def test_cu_registered_before_docintel(self):
+        """When both endpoints are provided, CU should appear before Doc Intel."""
+        with patch(
+            "markitdown.converters._cu_converter._dependency_exc_info", None
+        ), patch(
+            "markitdown.converters._cu_converter.ContentUnderstandingClient"
+        ), patch(
+            "markitdown.converters._cu_converter.DefaultAzureCredential"
+        ), patch(
+            "markitdown.converters._doc_intel_converter._dependency_exc_info", None
+        ), patch(
+            "markitdown.converters._doc_intel_converter.DocumentIntelligenceClient"
+        ), patch(
+            "markitdown.converters._doc_intel_converter.DefaultAzureCredential"
+        ):
+            from markitdown import MarkItDown
+            from markitdown.converters import (
+                ContentUnderstandingConverter,
+                DocumentIntelligenceConverter,
+            )
+
+            md = MarkItDown(
+                cu_endpoint="https://fake-cu",
+                docintel_endpoint="https://fake-di",
+            )
+
+            converter_types = [type(reg.converter) for reg in md._converters]
+            cu_idx = converter_types.index(ContentUnderstandingConverter)
+            di_idx = converter_types.index(DocumentIntelligenceConverter)
+            assert (
+                cu_idx < di_idx
+            ), "CU should have higher priority (lower index) than Doc Intel"
+
+
+# ---------------------------------------------------------------------------
+# CLI argument tests
+# ---------------------------------------------------------------------------
+
+
+class TestCLIArgs:
+    """Test CLI argument parsing for CU flags."""
+
+    def test_use_cu_without_endpoint_exits(self):
+        """--use-cu without --cu-endpoint should exit with error."""
+        import subprocess
+
+        result = subprocess.run(
+            [sys.executable, "-m", "markitdown", "--use-cu", "fake.pdf"],
+            capture_output=True,
+            text=True,
+        )
+        assert result.returncode != 0
+        # With capture_output=True and text=True, stdout and stderr are always str.
+        assert (
+            "cu-endpoint" in result.stderr.lower()
+            or "cu-endpoint" in result.stdout.lower()
+        )
+
+    def test_use_cu_and_use_docintel_mutually_exclusive(self):
+        """--use-cu and --use-docintel together should exit with a non-zero status."""
+        import subprocess
+
+        result = subprocess.run(
+            [
+                sys.executable,
+                "-m",
+                "markitdown",
+                "--use-cu",
+                "--cu-endpoint",
+                "https://fake",
+                "--use-docintel",
+                "-e",
+                "https://fake-di",
+                "fake.pdf",
+            ],
+            capture_output=True,
+            text=True,
+        )
+        assert result.returncode != 0
+
+    def test_cu_file_types_parsing(self):
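+        """Comma-separated --cu-file-types values should map to enum members.
+
+        The split/strip/lower steps are repeated inline here; the CLI parser
+        itself is not invoked.
+        """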
+        from markitdown.converters import ContentUnderstandingFileType
+
+        raw = "pdf,jpeg,mp4"
+        type_names = [t.strip().lower() for t in raw.split(",") if t.strip()]
+        cu_types = [ContentUnderstandingFileType(name) for name in type_names]
+
+        assert cu_types == [
+            ContentUnderstandingFileType.PDF,
+            ContentUnderstandingFileType.JPEG,
+            ContentUnderstandingFileType.MP4,
+        ]
+
+    def test_cu_file_types_invalid_value(self):
+        """Unknown file type name should raise ValueError."""
+        from markitdown.converters import ContentUnderstandingFileType
+
+        with pytest.raises(ValueError):
+            ContentUnderstandingFileType("nonsense")
+
+    def test_cu_file_types_single_value(self):
+        """Single file type (no comma) should parse correctly."""
+        from markitdown.converters import ContentUnderstandingFileType
+
+        cu_types = [
+            ContentUnderstandingFileType(t.strip().lower())
+            for t in "wav".split(",")
+            if t.strip()
+        ]
+        assert cu_types == [ContentUnderstandingFileType.WAV]
+
+    def test_use_cu_wires_kwargs_to_markitdown(self, capsys):
+        """--use-cu should pass CU options through to MarkItDown."""
+        import markitdown.__main__ as markitdown_cli
+        from markitdown.converters import ContentUnderstandingFileType
+
+        markitdown_instance = MagicMock()
+        markitdown_instance.convert.return_value.markdown = "converted"
+        markitdown_cls = MagicMock(return_value=markitdown_instance)
+
+        with patch.object(
+            sys,
+            "argv",
+            [
+                "markitdown",
+                "--use-cu",
+                "--cu-endpoint",
+                "https://fake-cu",
+                "--cu-analyzer",
+                "custom-analyzer",
+                "--cu-file-types",
+                "pdf,jpeg,mp4",
+                "fake.pdf",
+            ],
+        ), patch.object(markitdown_cli, "MarkItDown", markitdown_cls):
+            markitdown_cli.main()
+
+        markitdown_cls.assert_called_once_with(
+            enable_plugins=False,
+            cu_endpoint="https://fake-cu",
+            cu_analyzer_id="custom-analyzer",
+            cu_file_types=[
+                ContentUnderstandingFileType.PDF,
+                ContentUnderstandingFileType.JPEG,
+                ContentUnderstandingFileType.MP4,
+            ],
+        )
+        markitdown_instance.convert.assert_called_once_with(
+            "fake.pdf", stream_info=None, keep_data_uris=False
+        )
+        assert capsys.readouterr().out == "converted\n"
+
+
+# ---------------------------------------------------------------------------
+# MissingDependencyException test
+# ---------------------------------------------------------------------------
+
+
+class TestMissingDependency:
+    """Test that MissingDependencyException is raised when CU SDK is not installed."""
+
+    def test_missing_deps_message(self):
+        """Converter construction should surface the optional install hint."""
+        import markitdown.converters._cu_converter as cu_converter_module
+        from markitdown._exceptions import MissingDependencyException
+
+        import_error = ImportError("No module named 'azure.ai.contentunderstanding'")
+        # Same shape as sys.exc_info(): (exception type, exception value, traceback).
+        dependency_exc_info = (ImportError, import_error, None)
+
+        with patch.object(
+            cu_converter_module, "_dependency_exc_info", dependency_exc_info
+        ), pytest.raises(MissingDependencyException) as exc_info:
+            ContentUnderstandingConverter(endpoint="https://fake-cu")
+
+        assert "az-content-understanding" in str(exc_info.value)
+        assert exc_info.value.__cause__ is import_error