microsoft-markitdown

mirror of https://github.com/microsoft/markitdown.git synced 2026-07-03 12:28:14 +08:00

Author	SHA1	Message	Date
Chien Yuan Chang	a01d74dda7	feat: Add Azure Content Understanding converter (#1865) * initial version * improve mime type detection * prebuilt-image custom analyzer route to image * enhance cu priority over di * fix: apply black formatting * update cache of known prebuilt name and README improvement * add test cases, run black * update readme and deriving content_type from the resolved file_type * update readme	2026-05-21 21:59:41 -07:00
jigangz	604bba13da	fix: handle deeply nested HTML that triggers RecursionError (#1644) * fix: handle deeply nested HTML that triggers RecursionError (#1636) Large HTML files with deep DOM nesting (e.g., SEC EDGAR filings) cause markdownify's recursive DOM traversal to exceed Python's default recursion limit (1000). Previously this RecursionError was caught by the top-level _convert() dispatcher, which then fell through to PlainTextConverter — silently returning the raw HTML as 'markdown' with no warning. This fix catches RecursionError in HtmlConverter.convert() and falls back to BeautifulSoup's iterative get_text() method, which handles arbitrary nesting depths. A warning is emitted so callers know the output is plain text rather than full markdown. Root cause chain: 1. HtmlConverter.convert() calls markdownify.convert_soup() (recursive) 2. Deeply nested HTML (>~400 levels) triggers RecursionError 3. _convert() catches all Exceptions, stores in failed_attempts 4. PlainTextConverter.accepts() matches text/html via 'text/' prefix 5. PlainTextConverter.convert() returns raw HTML bytes as text 6. Caller receives 'markdown' that is actually unconverted HTML * refactor: address review feedback on RecursionError fallback - Move 'import warnings' to module top level (was inside except block) - Make test environment-independent by temporarily lowering sys.setrecursionlimit(200) instead of relying on depth=500 being sufficient on all platforms; original limit restored in finally block - Add strict=True keyword argument to opt out of the plain-text fallback and let RecursionError propagate to the caller * test: use result.markdown instead of deprecated result.text_content --------- Co-authored-by: jigangz <jigangz@github.com>	2026-04-15 15:26:44 -07:00
lesyk	a6c8ac46a6	Fix O(n) memory growth in PDF conversion by calling page.close() afte… (#1612) * Fix O(n) memory growth in PDF conversion by calling page.close() after each page * Refactor PDF memory optimization tests for improved readability and consistency * Add memory benchmarking tests for PDF conversion with page.close() fix * Remove unnecessary blank lines in PDF memory optimization tests for cleaner code * Bump version to 0.1.6b2 in __about__.py * Update PDF conversion tests to include mimetype in StreamInfo	2026-03-16 10:35:24 -07:00
lesyk	c83de14a9c	[MS] Extend table support for wide tables (#1552) * feat: enhance PDF table extraction to support complex forms and add new test cases * feat: enhance PDF table extraction with adaptive column clustering and add comprehensive test cases * fix: correct formatting and improve assertions in PDF table tests	2026-02-13 10:45:39 -08:00
lesyk	7fdaefb724	Fix: PDF parsing doesn't support partially numbered lists (#1525) * Fix: PDF parsing doesn't support partially numbered lists * Refactor: Move import of PARTIAL_NUMBERING_PATTERN to the top of the test file * Refactor: Improve assertion formatting in partial numbering tests	2026-01-08 15:15:22 -08:00
lesyk	251dddcf0c	[MS] Update PDF table extraction to support aligned Markdown (#1499) * Added PDF table extraction feature with aligned Markdown (#1419) * Add PDF test files and enhance extraction tests - Added a medical report scan PDF for testing scanned PDF handling. - Included a retail purchase receipt PDF to validate receipt extraction functionality. - Introduced a multipage invoice PDF to test extraction of complex invoice structures. - Added a borderless table PDF for testing inventory reconciliation report extraction. - Implemented comprehensive tests for PDF table extraction, ensuring proper structure and data integrity. - Enhanced existing tests to validate the order and presence of extracted content across various PDF types. * fix: update dependencies for PDF processing and improve table extraction logic * Bumped version of pdfminer.six --------- Authored-by: Ashok <ashh010101@gmail.com>	2026-01-07 16:38:45 -08:00
afourney	447c047731	Test if mammoth resolves rlinks. (#1451)	2025-10-20 15:54:05 -07:00
Stefan Rink	b81a387616	fix: correctly pass custom llm prompt parameter (#1319) * fix: correctly pass custom llm prompt parameter	2025-08-26 14:51:10 -07:00
safen0s	ea1a3dfb60	Add HTML support to DocumentIntelligenceConverter (#1352)	2025-08-26 14:34:43 -07:00
t3tra	cb421cf9ea	Chore: Make linter happy (#1256) * refactor: remove unused imports * fix: replace NotImplemented with NotImplementedError * refactor: resolve E722 (do not use bare 'except') * refactor: remove unused variable * refactor: remove unused imports * refactor: ignore unused imports that will be used in the future * refactor: resolve W293 (blank line contains whitespace) * refactor: resolve F541 (f-string is missing placeholders) --------- Co-authored-by: afourney <adamfo@microsoft.com>	2025-05-21 10:02:16 -07:00
Turdıbek	8576f1d915	Add CSV to Markdown table conversion - fixes #1144 (#1176) * feat: Add CSV to Markdown table converter - Add new CsvConverter class to convert CSV files to Markdown tables - Support text/csv and application/csv MIME types - Preserve table structure with headers and data rows - Handle edge cases like empty cells and mismatched columns - Fix Azure Document Intelligence dependency handling - Register CsvConverter in MarkItDown class ---- Thanks also to @benny123tw who submitted a very similar PR in #1171	2025-04-13 09:19:00 -07:00
Sathindu	3fcd48cdfc	feat: render math equations in .docx documents (#1160) * feat: math equation rendering in .docx files * fix: import fix on .docx pre processing * test: add test cases for docx equation rendering * docs: add ThirdPartyNotices.md * refactor: reformatted with black	2025-03-28 15:36:38 -07:00
afourney	e928b43afb	convert_url renamed to convert_uri, and now handles data and file URIs (#1153)	2025-03-24 21:43:04 -07:00
Yuzhong Zhang	52432bd228	Add support for preserving base64 encoded images (#1140) * optional reserve base64 string in markdown _CustomMarkdownify and pptx * add other converter para support * fix linter * Use kwarg to pass keep_data_uri para. Add module cli vector tests * Fixed formatting, and adjusted tests.	2025-03-20 18:50:23 -07:00
afourney	c0a511ecff	Updated docx file to include an image. (#1146)	2025-03-20 12:25:56 -07:00
afourney	a93e0567e6	EPub Support. Adapted #123 to not use epublib. (#1131) * Adapted #123 to not use epublib. * Updated README.md	2025-03-17 07:48:15 -07:00
afourney	c5f70b904f	Have magika read from the stream. (#1136)	2025-03-17 07:39:19 -07:00
afourney	5c565b7d79	Fix remaining mypy errors. (#1132)	2025-03-15 23:12:48 -07:00
afourney	a78857bd43	Added epub test file. (#1130)	2025-03-15 18:34:51 -07:00
afourney	5f75e16d20	Refactored tests. (#1120) * Refactored tests. * Fixed CI errors, and included misc tests. * Omit mskanji from streaminfo test. * Omit mskanji from no hints test. * Log results of debugging in comments (linked to Magika issue) * Added docs as to when to use misc tests.	2025-03-12 11:08:06 -07:00
afourney	8f8e58c9bb	Minimize guesses when guesses are compatible. (#1114) * Minimize guesses when guesses are compatible.	2025-03-10 15:30:44 -07:00
afourney	99d8e562db	Fix exiftool in well-known paths. (#1106)	2025-03-07 21:47:20 -08:00
afourney	e921497f79	Update converter API, user streams rather than file paths (#1088) * Updated DocumentConverter interface * Updated all DocumentConverter classes * Added support for various new audio files. * Updated sample plugin to new DocumentConverter interface. * Updated project README with notes about changes, and use-cases. * Updated DocumentConverter documentation. * Move priority to outside DocumentConverter, allowing them to be reprioritized, and keeping the DocumentConverter interface simple. --------- Co-authored-by: Kenny Zhang <kzhang678@gmail.com>	2025-03-05 21:16:55 -08:00
afourney	43bd79adc9	Print and log better exceptions when file conversions fail. (#1080) * Print and log better exceptions when file conversions fail. * Added unit tests for exceptions.	2025-02-28 16:07:47 -08:00
Matthew Powers	e82e0c1372	Add Support For PPTX Shape Groups (Fix in code design to not miss out on slide content) (#331) * Adds support for Shape Groups * Update to Test PPtx for nested shape * This line was accidentally removed and is added back here	2025-02-27 23:21:51 -08:00
Nima Akbarzadeh	a394cc7c27	fix: Implement retry logic for YouTube transcript fetching and fix URL decoding issue (#1035) * fix: add error handling, refactor _findKey to use json.items() * fix: improve metadata and description extraction logic * fix: improve YouTube transcript extraction reliability * fix: implement retry logic for YouTube transcript fetching and fix URL decoding issue * fix(readme): add youtube URLs as markitdown supports	2025-02-27 23:17:54 -08:00
afourney	dbdf2c0c10	Added CLI tests. (#327)	2025-02-11 20:42:50 -08:00
afourney	935da9976c	Added priority argument to all converter constructors. (#324) * Added priority argument to all converter constructors.	2025-02-11 12:36:32 -08:00
afourney	c73afcffea	Cleanup and refactor, in preparation for plugin support. (#318) * Work started moving converters to individual files. * Significant cleanup and refactor. * Moved everything to a packages subfolder. * Added sample plugin. * Added instructions to the README.md * Bumped version, and added a note about compatibility.	2025-02-10 15:21:44 -08:00

29 Commits
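
The fallback pattern described in commit 604bba13da (catch RecursionError from a recursive traversal, warn, and fall back to an iterative one, with strict=True to opt out) can be illustrated with a minimal stdlib-only sketch. This is not the markitdown code: the names collect_recursive, collect_iterative, and convert are hypothetical, and nested Python lists stand in for a deeply nested DOM rather than using markdownify or BeautifulSoup.

```python
import warnings


def collect_recursive(node):
    """Gather leaf strings by recursion (stand-in for a recursive DOM walk)."""
    if isinstance(node, str):
        return [node]
    leaves = []
    for child in node:
        leaves.extend(collect_recursive(child))
    return leaves


def collect_iterative(node):
    """Gather the same leaf strings with an explicit stack, so depth is unbounded."""
    leaves = []
    stack = [node]
    while stack:
        current = stack.pop()
        if isinstance(current, str):
            leaves.append(current)
        else:
            # Push children in reverse so they are popped in document order.
            stack.extend(reversed(current))
    return leaves


def convert(node, strict=False):
    """Try the recursive walk; on RecursionError fall back unless strict is set."""
    try:
        return " ".join(collect_recursive(node))
    except RecursionError:
        if strict:
            # Hypothetical opt-out: let the caller see the original error.
            raise
        warnings.warn("Nesting too deep; falling back to iterative extraction.")
        return " ".join(collect_iterative(node))


# Build a structure nested far deeper than Python's default recursion limit.
deep = ["leaf"]
for _ in range(20000):
    deep = ["text", deep]

print(convert(["a", ["b", "c"]]))  # shallow input takes the recursive path: a b c
print(len(convert(deep).split()))  # deep input takes the fallback path and warns
```

Catching the error at the converter level, rather than in a top-level dispatcher, is what lets the caller receive a warning instead of silently getting output from a less suitable converter.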