* initial version
* improve mime type detection
* prebuilt-image custom analyzer route to image
* enhance cu priority over di
* fix: apply black formatting
* update cache of known prebuilt name and README improvement
* add test cases, run black
* update readme and derive content_type from the resolved file_type
* update readme
* fix: handle deeply nested HTML that triggers RecursionError (#1636)
Large HTML files with deep DOM nesting (e.g., SEC EDGAR filings) cause
markdownify's recursive DOM traversal to exceed Python's default
recursion limit (1000). Previously this RecursionError was caught by
the top-level _convert() dispatcher, which then fell through to
PlainTextConverter — silently returning the raw HTML as 'markdown'
with no warning.
This fix catches RecursionError in HtmlConverter.convert() and falls
back to BeautifulSoup's iterative get_text() method, which handles
arbitrary nesting depths. A warning is emitted so callers know the
output is plain text rather than full markdown.
Root cause chain:
1. HtmlConverter.convert() calls markdownify.convert_soup() (recursive)
2. Deeply nested HTML (>~400 levels) triggers RecursionError
3. _convert() catches all Exceptions, stores in failed_attempts
4. PlainTextConverter.accepts() matches text/html via 'text/' prefix
5. PlainTextConverter.convert() returns raw HTML bytes as text
6. Caller receives 'markdown' that is actually unconverted HTML
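The fallback pattern described in this commit can be sketched with the standard library alone. This is an illustration, not the project's code: `_TreeBuilder`, `render_recursive`, `text_iterative`, and `convert_html` are hypothetical names. A toy recursive renderer stands in for markdownify, and an explicit-stack traversal stands in for BeautifulSoup's `get_text()`.

```python
import warnings
from html.parser import HTMLParser


class _TreeBuilder(HTMLParser):
    """Builds a nested [tag, children] tree; parsing itself does not recurse."""

    def __init__(self):
        super().__init__()
        self.root = ["root", []]
        self._stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = [tag, []]
        self._stack[-1][1].append(node)
        self._stack.append(node)

    def handle_endtag(self, tag):
        # Simplification: assumes well-formed input with no void tags like <br>.
        if len(self._stack) > 1:
            self._stack.pop()

    def handle_data(self, data):
        if data.strip():
            self._stack[-1][1].append(data.strip())


def render_recursive(node):
    """Hypothetical stand-in for a recursive renderer: frames grow with depth."""
    if isinstance(node, str):
        return node
    tag, children = node
    body = " ".join(render_recursive(child) for child in children)
    return f"**{body}**" if tag in ("b", "strong") else body


def text_iterative(root):
    """Explicit-stack traversal, so nesting depth never touches the call stack."""
    out, stack = [], [root]
    while stack:
        node = stack.pop()
        if isinstance(node, str):
            out.append(node)
        else:
            stack.extend(reversed(node[1]))
    return " ".join(out)


def convert_html(html):
    builder = _TreeBuilder()
    builder.feed(html)
    try:
        return render_recursive(builder.root)
    except RecursionError:
        # Warn so the caller knows the result is plain text, not full markdown.
        warnings.warn(
            "HTML too deeply nested; falling back to plain text",
            RuntimeWarning,
        )
        return text_iterative(builder.root)
```

Catching `RecursionError` at the converter level, rather than leaving it to a generic dispatcher, is what prevents the silent fall-through to a raw-text converter.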
* refactor: address review feedback on RecursionError fallback
- Move 'import warnings' to module top level (was inside except block)
- Make test environment-independent by temporarily lowering
sys.setrecursionlimit(200) instead of relying on depth=500 being
sufficient on all platforms; original limit restored in finally block
- Add strict=True keyword argument to opt out of the plain-text
fallback and let RecursionError propagate to the caller
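A rough sketch of the two ideas in this refactor: a `strict` flag that lets `RecursionError` propagate, and a test helper that lowers the recursion limit temporarily and restores it in a `finally` block. The names `flatten` and `with_recursion_limit` are hypothetical, and a nested-list flattener stands in for the real converter.

```python
import sys
import warnings


def flatten_recursive(node):
    """Recursive traversal: one Python frame per nesting level."""
    if not isinstance(node, list):
        return [node]
    out = []
    for child in node:
        out.extend(flatten_recursive(child))
    return out


def flatten_iterative(node):
    """Explicit-stack traversal that is independent of the recursion limit."""
    out, stack = [], [node]
    while stack:
        item = stack.pop()
        if isinstance(item, list):
            stack.extend(reversed(item))
        else:
            out.append(item)
    return out


def flatten(node, strict=False):
    try:
        return flatten_recursive(node)
    except RecursionError:
        if strict:
            raise  # caller opted out of the fallback
        warnings.warn("input too deeply nested; using fallback", RuntimeWarning)
        return flatten_iterative(node)


def with_recursion_limit(limit, func, *args, **kwargs):
    """Run func under a temporary recursion limit, always restoring the old one."""
    original = sys.getrecursionlimit()
    # Raises RecursionError if the current stack is already deeper than `limit`.
    sys.setrecursionlimit(limit)
    try:
        return func(*args, **kwargs)
    finally:
        sys.setrecursionlimit(original)
```

Lowering the limit makes the failure deterministic: the test no longer depends on a particular nesting depth overflowing the default limit on every platform.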
* test: use result.markdown instead of deprecated result.text_content
---------
Co-authored-by: jigangz <jigangz@github.com>
* Fix O(n) memory growth in PDF conversion by calling page.close() after each page
* Refactor PDF memory optimization tests for improved readability and consistency
* Add memory benchmarking tests for PDF conversion with page.close() fix
* Remove unnecessary blank lines in PDF memory optimization tests for cleaner code
* Bump version to 0.1.6b2 in __about__.py
* Update PDF conversion tests to include mimetype in StreamInfo
* feat: enhance PDF table extraction to support complex forms and add new test cases
* feat: enhance PDF table extraction with adaptive column clustering and add comprehensive test cases
* fix: correct formatting and improve assertions in PDF table tests
* Fix: PDF parsing doesn't support partially numbered lists
* Refactor: Move import of PARTIAL_NUMBERING_PATTERN to the top of the test file
* Refactor: Improve assertion formatting in partial numbering tests
* Added PDF table extraction feature with aligned Markdown (#1419)
* Add PDF test files and enhance extraction tests
- Added a medical report scan PDF for testing scanned PDF handling.
- Included a retail purchase receipt PDF to validate receipt extraction functionality.
- Introduced a multipage invoice PDF to test extraction of complex invoice structures.
- Added a borderless table PDF for testing inventory reconciliation report extraction.
- Implemented comprehensive tests for PDF table extraction, ensuring proper structure and data integrity.
- Enhanced existing tests to validate the order and presence of extracted content across various PDF types.
* fix: update dependencies for PDF processing and improve table extraction logic
* Bumped version of pdfminer.six
---------
Authored-by: Ashok <ashh010101@gmail.com>
* refactor: remove unused imports
* fix: replace NotImplemented with NotImplementedError
* refactor: resolve E722 (do not use bare 'except')
* refactor: remove unused variable
* refactor: remove unused imports
* refactor: ignore unused imports that will be used in the future
* refactor: resolve W293 (blank line contains whitespace)
* refactor: resolve F541 (f-string is missing placeholders)
---------
Co-authored-by: afourney <adamfo@microsoft.com>
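For context on the `NotImplemented` fix in the cleanup above: `NotImplemented` is a sentinel value intended as a return value from binary special methods such as `__eq__`, not an exception, so `raise NotImplemented` fails with a `TypeError` instead of signalling an unimplemented method. A minimal sketch, with hypothetical class names:

```python
class BaseConverter:
    def convert(self, data: bytes) -> str:
        # NotImplementedError is the exception class meant for abstract methods.
        raise NotImplementedError("Subclasses must implement convert()")


class EchoConverter(BaseConverter):
    def convert(self, data: bytes) -> str:
        return data.decode("utf-8")
```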
* feat: Add CSV to Markdown table converter
- Add new CsvConverter class to convert CSV files to Markdown tables
- Support text/csv and application/csv MIME types
- Preserve table structure with headers and data rows
- Handle edge cases like empty cells and mismatched columns
- Fix Azure Document Intelligence dependency handling
- Register CsvConverter in MarkItDown class
----
Thanks also to @benny123tw who submitted a very similar PR in #1171
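A minimal sketch of the kind of CSV-to-Markdown conversion this commit describes, using only the standard library. It is not the actual `CsvConverter`: the function name `csv_to_markdown` is hypothetical, and padding short rows and truncating long ones to the header width is an assumed way of handling mismatched columns.

```python
import csv
import io


def csv_to_markdown(text: str) -> str:
    rows = list(csv.reader(io.StringIO(text)))
    if not rows:
        return ""
    header, *body = rows
    width = len(header)
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join(["---"] * width) + " |",
    ]
    for row in body:
        # Assumption: pad short rows with empty cells, drop extra cells.
        cells = (row + [""] * width)[:width]
        lines.append("| " + " | ".join(cells) + " |")
    return "\n".join(lines)
```

The sketch does not escape `|` characters inside cells, which a real converter would need to consider.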
* feat: math equation rendering in .docx files
* fix: import fix on .docx pre processing
* test: add test cases for docx equation rendering
* docs: add ThirdPartyNotices.md
* refactor: reformatted with black
* optionally preserve base64 strings in markdown _CustomMarkdownify and pptx
* add parameter support to other converters
* fix linter
* Use **kwargs to pass the keep_data_uri parameter.
* Add module cli vector tests
* Fixed formatting, and adjusted tests.
* Refactored tests.
* Fixed CI errors, and included misc tests.
* Omit mskanji from streaminfo test.
* Omit mskanji from no hints test.
* Log results of debugging in comments (linked to Magika issue)
* Added docs as to when to use misc tests.
* Updated DocumentConverter interface
* Updated all DocumentConverter classes
* Added support for various new audio files.
* Updated sample plugin to new DocumentConverter interface.
* Updated project README with notes about changes, and use-cases.
* Updated DocumentConverter documentation.
* Move priority to outside DocumentConverter, allowing them to be reprioritized, and keeping the DocumentConverter interface simple.
---------
Co-authored-by: Kenny Zhang <kzhang678@gmail.com>
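One way to picture moving priority outside the converter interface is a registry that stores the priority next to each converter instead of on it. This is a sketch under stated assumptions, not the project's implementation: `Registration`, `Registry`, and the toy converters are hypothetical, and "lower value is tried first" is an assumed convention.

```python
from dataclasses import dataclass
from typing import List


class UpperConverter:
    """Toy generic converter: accepts any string."""

    def accepts(self, data: str) -> bool:
        return True

    def convert(self, data: str) -> str:
        return data.upper()


class CommaConverter:
    """Toy specific converter: accepts only comma-separated input."""

    def accepts(self, data: str) -> bool:
        return "," in data

    def convert(self, data: str) -> str:
        return " | ".join(data.split(","))


@dataclass
class Registration:
    converter: object
    priority: float  # lives on the registration, not on the converter


class Registry:
    def __init__(self) -> None:
        self._registrations: List[Registration] = []

    def register(self, converter, *, priority: float = 0.0) -> None:
        self._registrations.append(Registration(converter, priority))

    def convert(self, data: str) -> str:
        # sorted() is stable, so equal priorities keep registration order.
        for reg in sorted(self._registrations, key=lambda r: r.priority):
            if reg.converter.accepts(data):
                return reg.converter.convert(data)
        raise ValueError("no converter accepted the input")
```

Because the converters only expose `accepts()` and `convert()`, the same converter can be registered at a different priority without subclassing or modifying it.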
* Work started moving converters to individual files.
* Significant cleanup and refactor.
* Moved everything to a packages subfolder.
* Added sample plugin.
* Added instructions to the README.md
* Bumped version, and added a note about compatibility.