Metadata-Version: 2.4
Name: docling-core
Version: 2.74.1
Summary: A python library to define and validate data types in Docling.
Author-email: Cesar Berrospi Ramis <ceb@zurich.ibm.com>, Panos Vagenas <pva@zurich.ibm.com>, Michele Dolfi <dol@zurich.ibm.com>, Christoph Auer <cau@zurich.ibm.com>, Peter Staar <taa@zurich.ibm.com>
Maintainer-email: Panos Vagenas <pva@zurich.ibm.com>, Michele Dolfi <dol@zurich.ibm.com>, Christoph Auer <cau@zurich.ibm.com>, Peter Staar <taa@zurich.ibm.com>, Cesar Berrospi Ramis <ceb@zurich.ibm.com>
License-Expression: MIT
Project-URL: homepage, https://github.com/docling-project
Project-URL: repository, https://github.com/docling-project/docling-core
Project-URL: issues, https://github.com/docling-project/docling-core/issues
Project-URL: changelog, https://github.com/docling-project/docling-core/blob/main/CHANGELOG.md
Keywords: docling,discovery,etl,information retrieval,analytics,database,database schema,schema,JSON
Classifier: Development Status :: 5 - Production/Stable
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: Natural Language :: English
Classifier: Operating System :: OS Independent
Classifier: Topic :: Database
Classifier: Topic :: Scientific/Engineering :: Information Analysis
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Typing :: Typed
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Programming Language :: Python :: 3.14
Requires-Python: <4.0,>=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: jsonschema<5.0.0,>=4.16.0
Requires-Dist: pydantic!=2.10.0,!=2.10.1,!=2.10.2,<3.0.0,>=2.6.0
Requires-Dist: jsonref<2.0.0,>=1.1.0
Requires-Dist: tabulate<0.11.0,>=0.9.0
Requires-Dist: pandas<4.0.0,>=2.1.4
Requires-Dist: pillow<13.0.0,>=10.0.0
Requires-Dist: pyyaml<7.0.0,>=5.1
Requires-Dist: typing-extensions<5.0.0,>=4.12.2
Requires-Dist: typer<0.25.0,>=0.12.5
Requires-Dist: latex2mathml<4.0.0,>=3.77.0
Requires-Dist: defusedxml<0.8.0,>=0.7.1
Requires-Dist: pydantic-settings>=2.14.0
Provides-Extra: chunking
Requires-Dist: semchunk<4.0.0,>=2.2.0; extra == "chunking"
Requires-Dist: tree-sitter<0.27.0,>=0.25.0; extra == "chunking"
Requires-Dist: tree-sitter-python>=0.23.6; extra == "chunking"
Requires-Dist: tree-sitter-c>=0.23.4; extra == "chunking"
Requires-Dist: tree-sitter-javascript>=0.23.1; extra == "chunking"
Requires-Dist: tree-sitter-typescript>=0.23.2; extra == "chunking"
Requires-Dist: transformers<6.0.0,>=4.34.0; extra == "chunking"
Provides-Extra: chunking-openai
Requires-Dist: semchunk<4.0.0,>=2.2.0; extra == "chunking-openai"
Requires-Dist: tree-sitter<0.27.0,>=0.25.0; extra == "chunking-openai"
Requires-Dist: tree-sitter-python>=0.23.6; extra == "chunking-openai"
Requires-Dist: tree-sitter-c>=0.23.4; extra == "chunking-openai"
Requires-Dist: tree-sitter-javascript>=0.23.1; extra == "chunking-openai"
Requires-Dist: tree-sitter-typescript>=0.23.2; extra == "chunking-openai"
Requires-Dist: tiktoken<0.13.0,>=0.9.0; extra == "chunking-openai"
Provides-Extra: examples
Requires-Dist: datasets>=4.0.0; extra == "examples"
Requires-Dist: matplotlib>=3.7.0; extra == "examples"
Requires-Dist: openpyxl>=3.1.5; extra == "examples"
Dynamic: license-file

# Docling Core

[![PyPI version](https://img.shields.io/pypi/v/docling-core)](https://pypi.org/project/docling-core/)
![Python](https://img.shields.io/badge/python-3.10%20%7C%20%203.11%20%7C%203.12%20%7C%203.13%20%7C%203.14-blue)
[![uv](https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/astral-sh/uv/main/assets/badge/v0.json)](https://github.com/astral-sh/uv)
[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)
[![Imports: isort](https://img.shields.io/badge/%20imports-isort-%231674b1?style=flat&labelColor=ef8336)](https://pycqa.github.io/isort/)
[![Checked with mypy](https://www.mypy-lang.org/static/mypy_badge.svg)](https://mypy-lang.org/)
[![Pydantic v2](https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/pydantic/pydantic/main/docs/badge/v2.json)](https://pydantic.dev)
[![pre-commit](https://img.shields.io/badge/pre--commit-enabled-brightgreen?logo=pre-commit&logoColor=white)](https://github.com/pre-commit/pre-commit)
[![License MIT](https://img.shields.io/github/license/docling-project/docling-core)](https://opensource.org/licenses/MIT)

Docling Core is a library that defines core data types and transformations in [Docling](https://github.com/docling-project/docling).

## Installation

To use Docling Core, install `docling-core` with your package manager, e.g. pip:
```bash
pip install docling-core
```

### Development setup

To develop for Docling Core, you need Python 3.10 through 3.14 and the `uv` package manager. You can then install the project and its dependencies from your local clone's root directory:
```bash
uv sync --all-extras
```

To run the pytest suite, execute:
```bash
uv run pytest -s test
```

## Main features

Docling Core provides the foundational DoclingDocument data model and API, as well as
additional APIs for tasks like serialization and chunking, which are key to developing
generative AI applications using Docling.

### DoclingDocument

Docling Core defines the DoclingDocument as a Pydantic model, allowing for advanced
data model control, customizability, and interoperability.

In addition to specifying the schema, it provides an API for building documents and for
basic operations such as exporting to Markdown, HTML, and other formats.

👉 More details:
- [Architecture docs](https://docling-project.github.io/docling/concepts/architecture/)
- [DoclingDocument docs](https://docling-project.github.io/docling/concepts/docling_document/)

### Serialization

Different users have different serialization requirements.
To address this, the Serialization API is designed for easy extension while providing
feature-rich built-in implementations, on which the corresponding DoclingDocument
helpers are based.

👉 More details:
- [Serialization docs](https://docling-project.github.io/docling/concepts/serialization/)
- [Serialization example](https://docling-project.github.io/docling/examples/serialization/)

### Chunking

Like the Serialization API, the Chunking API provides built-in chunking capabilities
together with a design that enables easy extension, so that it can meet the
customization requirements of different use cases.

👉 More details:
- [Chunking docs](https://docling-project.github.io/docling/concepts/chunking/)
- [Hybrid chunking example](https://docling-project.github.io/docling/examples/hybrid_chunking/)
- [Advanced chunking and serialization](https://docling-project.github.io/docling/examples/advanced_chunking_and_serialization/)

### Profiling

The Profiling API enables extraction of comprehensive statistics from DoclingDocument objects,
both for individual documents and collections. It provides metrics on document structure
(pages, tables, pictures, text items) along with statistical distributions (deciles, histograms)
and visualization capabilities for analyzing document collections at scale.

👉 More details:
- [Document profiling example](./examples/document_profiling.py)
- [Collection statistics visualization](./examples/visualize_collection_stats.py)

## Contributing

Please read [Contributing to Docling Core](./CONTRIBUTING.md) for details.

## References

If you use Docling Core in your projects, please consider citing the following:

```bib
@techreport{Docling,
  author = "Deep Search Team",
  month = 8,
  title = "Docling Technical Report",
  url = "https://arxiv.org/abs/2408.09869",
  eprint = "2408.09869",
  doi = "10.48550/arXiv.2408.09869",
  version = "1.0.0",
  year = 2024
}
```

## License

The Docling Core codebase is released under the MIT license.
For individual model usage, please refer to the model licenses found in the original packages.
