Metadata-Version: 2.4
Name: pycantonese
Version: 4.2.0
Classifier: Development Status :: 5 - Production/Stable
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Education
Classifier: Intended Audience :: Information Technology
Classifier: Intended Audience :: Science/Research
Classifier: Natural Language :: Chinese (Traditional)
Classifier: Natural Language :: Cantonese
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3 :: Only
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Programming Language :: Python :: 3.14
Classifier: Programming Language :: Rust
Classifier: Topic :: Scientific/Engineering
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Scientific/Engineering :: Human Machine Interfaces
Classifier: Topic :: Scientific/Engineering :: Information Analysis
Classifier: Topic :: Text Processing
Classifier: Topic :: Text Processing :: General
Classifier: Topic :: Text Processing :: Indexing
Classifier: Topic :: Text Processing :: Linguistic
Requires-Dist: rustling>=0.8.0
Requires-Dist: black>=26.3.0 ; extra == 'dev'
Requires-Dist: build>=1.4.0 ; extra == 'dev'
Requires-Dist: flake8>=7.3.0 ; extra == 'dev'
Requires-Dist: maturin>=1.0 ; extra == 'dev'
Requires-Dist: pytest>=9.0.2 ; extra == 'dev'
Requires-Dist: requests>=2.32.5 ; extra == 'dev'
Requires-Dist: sybil>=9.3.0 ; extra == 'dev'
Requires-Dist: mypy>=1.19.1 ; extra == 'dev'
Requires-Dist: twine>=6.2.0 ; extra == 'dev'
Requires-Dist: pyarrow>=18.0.0 ; extra == 'dev'
Requires-Dist: sphinx>=4.3.0 ; extra == 'docs'
Requires-Dist: sphinx-copybutton>=0.5.1 ; extra == 'docs'
Requires-Dist: sphinx-rtd-theme>=1.1.0 ; extra == 'docs'
Requires-Dist: sphinxcontrib-googleanalytics>=0.5 ; extra == 'docs'
Provides-Extra: dev
Provides-Extra: docs
License-File: LICENSE.txt
Summary: Cantonese Linguistics and NLP in Python
Keywords: computational linguistics,natural language processing,NLP,Cantonese,linguistics,corpora,speech,language,Chinese,Jyutping
Author-email: "Jackson L. Lee" <jacksonlunlee@gmail.com>
License-Expression: MIT
Requires-Python: >=3.10
Description-Content-Type: text/x-rst; charset=UTF-8
Project-URL: Homepage, https://pycantonese.org
Project-URL: Source, https://github.com/jacksonllee/pycantonese

PyCantonese: Cantonese Linguistics and NLP in Python
====================================================

.. image:: https://jacksonllee.com/logos/pycantonese-logo.png
   :width: 250px

Full Documentation: https://pycantonese.org

|

.. image:: https://img.shields.io/pypi/v/pycantonese.svg
   :target: https://pypi.org/project/pycantonese/
   :alt: PyPI version

.. image:: https://img.shields.io/conda/vn/conda-forge/pycantonese.svg
   :target: https://anaconda.org/conda-forge/pycantonese
   :alt: Conda version

|

.. start-sphinx-website-index-page

PyCantonese is a Python library for Cantonese linguistics and natural language
processing (NLP). Currently implemented features:

- Accessing and searching corpus data
- Parsing and conversion tools for Jyutping romanization
- Parsing Cantonese text
- Stop words
- Word segmentation
- Part-of-speech tagging

The design of PyCantonese prioritizes ease of use and linguistic knowledge.
It has been used by both academic and commercial organizations,
including major US tech companies.

Since v4.0.0 (March 2026), PyCantonese depends on
`Rustling <https://rustling.io>`_, a library for efficient
CHAT data handling, word segmentation, and part-of-speech tagging.

.. _download_install:

Download and Install
--------------------

Using pip::

   pip install --upgrade pycantonese

Using conda::

   conda install -c conda-forge pycantonese
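
To confirm that the install worked, one option is to ask Python's packaging
metadata which version it sees. The following is a minimal sketch that uses only
the standard library's ``importlib.metadata``; the helper name
``installed_version`` is hypothetical and is not part of PyCantonese.

.. code-block:: python

   from importlib.metadata import PackageNotFoundError, version
   from typing import Optional


   def installed_version(package: str) -> Optional[str]:
       """Return the installed version of ``package``, or None if it is absent."""
       try:
           return version(package)
       except PackageNotFoundError:
           return None


   # Prints the installed version string, or None if the package is not
   # available in the current environment.
   print(installed_version("pycantonese"))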

PyCantonese also works
`in JavaScript <https://docs.pycantonese.org/stable/quickstart.html#javascript>`_.

Ready for more?
Check out `Quickstart <https://docs.pycantonese.org/stable/quickstart.html>`_.

Links
-----

* Author: `Jackson L. Lee <https://jacksonllee.com>`_
* Source code: https://github.com/jacksonllee/pycantonese
* Social media:
  `Facebook <https://www.facebook.com/pycantonese>`_

How to Cite
-----------

Lee, Jackson L., Litong Chen, Charles Lam, Chaak Ming Lau, and Tsz-Him Tsui. 2022.
`PyCantonese: Cantonese Linguistics and NLP in Python <https://jacksonllee.com/papers/pycantonese_lrec_2022-05-06.pdf>`_.
*Proceedings of the 13th Language Resources and Evaluation Conference*.

.. code-block:: bibtex

      @inproceedings{lee-etal-2022-pycantonese,
         title = "PyCantonese: Cantonese Linguistics and NLP in Python",
         author = "Lee, Jackson L.  and
            Chen, Litong  and
            Lam, Charles  and
            Lau, Chaak Ming  and
            Tsui, Tsz-Him",
         booktitle = "Proceedings of The 13th Language Resources and Evaluation Conference",
         month = jun,
         year = "2022",
         publisher = "European Language Resources Association",
      }

License
-------

MIT License.

Please note that PyCantonese includes data from the following sources,
each distributed under the license shown in parentheses:

- Hong Kong Cantonese Corpus (CC BY)
- CantoMap (GPL-3.0)
- rime-cantonese (CC BY 4.0)
- Common Voice Cantonese (Mozilla Public License 2.0)
- Cantonese-Traditional Chinese Parallel Corpus (CC0 1.0 Universal)

For details about these datasets,
please see `their documentation <https://github.com/jacksonllee/pycantonese/tree/main/src/pycantonese/data>`_.

Logo
----

The PyCantonese logo is the Chinese character 粵, meaning "Cantonese",
with artistic design by albino.snowman (Instagram handle).

.. end-sphinx-website-index-page

