Metadata-Version: 2.4
Name: embetter
Version: 0.9.0
Summary: Just a bunch of useful embeddings to get started quickly.
Author: Vincent D. Warmerdam
License: MIT License
        
        Copyright (c) 2022 Vincent D. Warmerdam
        
        Permission is hereby granted, free of charge, to any person obtaining a copy
        of this software and associated documentation files (the "Software"), to deal
        in the Software without restriction, including without limitation the rights
        to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
        copies of the Software, and to permit persons to whom the Software is
        furnished to do so, subject to the following conditions:
        
        The above copyright notice and this permission notice shall be included in all
        copies or substantial portions of the Software.
        
        THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
        IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
        FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
        AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
        LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
        OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
        SOFTWARE.
        
Project-URL: Documentation, https://koaning.github.io/embetter/
Project-URL: Source Code, https://github.com/koaning/embetter/
Project-URL: Issue Tracker, https://github.com/koaning/embetter/issues
Classifier: Intended Audience :: Science/Research
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: MIT License
Classifier: Topic :: Scientific/Engineering
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.9
Description-Content-Type: text/markdown
License-File: LICENCE
Requires-Dist: scikit-learn>=1.0.0
Requires-Dist: pandas>=1.0.0
Requires-Dist: diskcache>=5.6.1
Requires-Dist: skops>=0.8.0
Requires-Dist: model2vec
Provides-Extra: sbert
Requires-Dist: sentence-transformers>=2.2.2; extra == "sbert"
Provides-Extra: text
Requires-Dist: sentence-transformers>=2.2.2; extra == "text"
Provides-Extra: vision
Requires-Dist: timm>=0.6.7; extra == "vision"
Provides-Extra: pytorch
Requires-Dist: torch>=1.12.0; extra == "pytorch"
Provides-Extra: openai
Requires-Dist: openai>=1.59.8; extra == "openai"
Provides-Extra: cohere
Requires-Dist: cohere>=4.11.2; extra == "cohere"
Provides-Extra: ollama
Requires-Dist: ollama>=0.5.3; extra == "ollama"
Provides-Extra: all
Requires-Dist: sentence-transformers>=2.2.2; extra == "all"
Requires-Dist: timm>=0.6.7; extra == "all"
Requires-Dist: openai>=1.59.8; extra == "all"
Provides-Extra: docs
Requires-Dist: mkdocs-material==9.6.9; extra == "docs"
Requires-Dist: mkdocstrings==0.29.0; extra == "docs"
Requires-Dist: mkdocstrings-python==1.16.0; extra == "docs"
Requires-Dist: mktestdocs==0.2.4; extra == "docs"
Provides-Extra: dev
Requires-Dist: embetter[all,docs]; extra == "dev"
Requires-Dist: interrogate>=1.5.0; extra == "dev"
Requires-Dist: pytest>=4.0.2; extra == "dev"
Requires-Dist: ruff; extra == "dev"
Requires-Dist: pre-commit>=2.2.0; extra == "dev"
Requires-Dist: mktestdocs==0.2.4; extra == "dev"
Requires-Dist: datasets==2.8.0; extra == "dev"
Requires-Dist: pyarrow==20.0.0; extra == "dev"
Requires-Dist: matplotlib; extra == "dev"
Requires-Dist: pytest-xdist; extra == "dev"
Dynamic: license-file


# embetter

> "Just a bunch of useful embeddings for scikit-learn pipelines, to get started quickly."

<img src="https://raw.githubusercontent.com/koaning/embetter/main/docs/images/icon.png" width="125" height="125" align="right" />

<br> 

Embetter implements scikit-learn compatible embeddings for computer vision and text. It should make it very easy to quickly build proofs of concept using scikit-learn pipelines and, in particular, should help with [bulk labelling](https://www.youtube.com/watch?v=gDk7_f3ovIk). It's also meant to play nicely with [bulk](https://github.com/koaning/bulk) and [scikit-partial](https://github.com/koaning/scikit-partial), and it can be used together with your favorite approximate nearest neighbor (ANN) solution like [lancedb](https://lancedb.github.io/lancedb/).

## Install 

You can install via pip.

```
python -m pip install embetter
```

Many of the embeddings rely on optional dependencies, so if you
want to pick and choose and download only the tools that you need:

```
python -m pip install "embetter[text]"
python -m pip install "embetter[vision]"
python -m pip install "embetter[all]"
```
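
If you're not sure which optional dependencies are already in your environment, a small check like the one below can help. This is only a sketch: the helper name `installed_extras` is made up for illustration, and the mapping assumes that the `text`, `vision`, `openai` and `cohere` extras pull in `sentence-transformers`, `timm`, `openai` and `cohere` respectively, as listed in the package metadata.

```python
from importlib.util import find_spec

# Import names of the packages behind some of the embetter extras.
# Assumption: this mapping mirrors the extras declared in the package metadata.
EXTRA_MODULES = {
    "text": "sentence_transformers",
    "vision": "timm",
    "openai": "openai",
    "cohere": "cohere",
}


def installed_extras():
    """Hypothetical helper: report which extras have their backing package importable."""
    return {
        extra: find_spec(module) is not None
        for extra, module in EXTRA_MODULES.items()
    }


status = installed_extras()
for extra, ok in status.items():
    print(f"embetter[{extra}]: {'available' if ok else 'missing'}")
```

Anything reported as missing can be added with the matching `python -m pip install "embetter[...]"` command.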

## API Design 

These are the components that are currently implemented.

```python
# Helpers to grab text or image from pandas column.
from embetter.grab import ColumnGrabber

# Representations/Helpers for computer vision
from embetter.vision import ImageLoader, TimmEncoder, ColorHistogramEncoder

# Representations for text
from embetter.text import SentenceEncoder, MatryoshkaEncoder, TextEncoder

# Representations from multi-modal models
from embetter.multi import ClipEncoder

# Finetuning components 
from embetter.finetune import FeedForwardTuner, ContrastiveTuner, ContrastiveLearner, SbertLearner

# External embedding providers, typically needs an API key
from embetter.external import CohereEncoder, OpenAIEncoder
```

All of these components are scikit-learn compatible, which means that you
can apply them as you would normally in a scikit-learn pipeline. Just be aware
that the encoders are stateless. They are all pretrained tools, so calling
`.fit()` on them doesn't train anything.
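
To make "stateless" a bit more concrete, here is a minimal sketch with a toy encoder. `CharCountEncoder` is hypothetical and not part of embetter; it only assumes scikit-learn and numpy, and it stands in for a pretrained encoder so that nothing needs to be downloaded.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class CharCountEncoder(BaseEstimator, TransformerMixin):
    """Hypothetical toy encoder: maps each text to [length, vowel count]."""

    def fit(self, X, y=None):
        # Nothing is learned here; fit only exists to satisfy the sklearn API.
        return self

    def transform(self, X, y=None):
        # The output depends only on the input, never on earlier calls to fit.
        return np.array(
            [[len(text), sum(char in "aeiou" for char in text.lower())] for text in X],
            dtype=float,
        )


texts = ["positive sentiment", "super negative"]
encoder = CharCountEncoder()

before = encoder.transform(texts)             # no fit needed
after = encoder.fit(texts).transform(texts)   # fit changes nothing

print(np.array_equal(before, after))
```

Because the features don't depend on `.fit()`, an encoder like this can be used on new batches of data without being refit.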

## Text Example

To run this example, make sure that you `pip install 'embetter[sbert]'`. 

```python
import pandas as pd
from sklearn.pipeline import make_pipeline 
from sklearn.linear_model import LogisticRegression

from embetter.grab import ColumnGrabber
from embetter.text import SentenceEncoder

# This pipeline grabs the `text` column from a dataframe
# which then gets fed into Sentence-Transformers' all-MiniLM-L6-v2.
text_emb_pipeline = make_pipeline(
  ColumnGrabber("text"),
  SentenceEncoder('all-MiniLM-L6-v2')
)

dataf = pd.DataFrame({
  "text": ["positive sentiment", "super negative"],
  "label_col": ["pos", "neg"]
})
X = text_emb_pipeline.fit_transform(dataf, dataf['label_col'])

# This pipeline can also be trained to make predictions, using
# the embedded features.
text_clf_pipeline = make_pipeline(
  ColumnGrabber("text"),
  SentenceEncoder('all-MiniLM-L6-v2'),
  LogisticRegression()
)
text_clf_pipeline.fit(dataf, dataf['label_col']).predict(dataf)
```

## Image Example

The goal of the API is to allow pipelines like this: 

```python
import pandas as pd
from sklearn.pipeline import make_pipeline 
from sklearn.linear_model import LogisticRegression

from embetter.grab import ColumnGrabber
from embetter.vision import ImageLoader
from embetter.multi import ClipEncoder

# This pipeline grabs the `img_path` column from a dataframe,
# turns the image paths into `PIL.Image` objects,
# and feeds them into CLIP, which can handle images as well as text.
image_emb_pipeline = make_pipeline(
  ColumnGrabber("img_path"),
  ImageLoader(convert="RGB"),
  ClipEncoder()
)

dataf = pd.DataFrame({
  "img_path": ["tests/data/thiscatdoesnotexist.jpeg"]
})
image_emb_pipeline.fit_transform(dataf)
```

## Batched Learning 

All of the encoding tools you've seen here are also compatible
with the [`partial_fit` mechanic](https://scikit-learn.org/0.15/modules/scaling_strategies.html#incremental-learning)
in scikit-learn. That means you can leverage
[scikit-partial](https://github.com/koaning/scikit-partial) to build pipelines
that can handle out-of-core datasets, i.e. datasets that don't fit in memory.
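
The idea can be sketched with plain scikit-learn. The example below is an illustration under a few assumptions: it uses scikit-learn's `HashingVectorizer` as a stand-in for a stateless embetter encoder (so no model is downloaded), the batches are made-up toy data, and the loop is written by hand rather than with the scikit-partial API.

```python
import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

# Stand-in for a stateless encoder: it can transform without ever being fit.
encoder = HashingVectorizer(n_features=2**10)
clf = SGDClassifier(random_state=0)

# Toy batches; in practice these could be chunks read from disk.
batches = [
    (["positive sentiment", "really great"], ["pos", "pos"]),
    (["super negative", "really bad"], ["neg", "neg"]),
]

# partial_fit needs to know every class up front,
# because a single batch may not contain all of them.
classes = np.array(["neg", "pos"])

for texts, labels in batches:
    X = encoder.transform(texts)                  # encode one batch at a time
    clf.partial_fit(X, labels, classes=classes)   # update the model incrementally

preds = clf.predict(encoder.transform(["really great"]))
print(preds)
```

Only one batch is held in memory at a time, which is what makes this pattern suitable for datasets that are too large to load at once.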

