CLIP Similarity Search

I came across a cool post by Drew Breunig about finding bathroom faucets with the CLIP model: https://www.dbreunig.com/2023/09/26/faucet-finder.html.

Multi-modal embedding models let you embed both text and images in the same embedding space, enabling search across both images and text. Although multimodal embedding models still seem like mostly a blank slate, there is at least one readily available: OpenAI's CLIP.
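
To make that concrete, here is a minimal sketch of what a shared embedding space buys you, using the CLIP checkpoint published through sentence-transformers (as far as I can tell, the same library the llm plugins below build on). The file name dress.jpg and the caption are just placeholders; the point is that an image vector and a text vector can be compared directly.

from PIL import Image
from sentence_transformers import SentenceTransformer, util

# clip-ViT-B-32 encodes both images and text into the same vector space
model = SentenceTransformer("clip-ViT-B-32")

image_vector = model.encode(Image.open("dress.jpg"))  # placeholder image
text_vector = model.encode("a red evening dress")     # placeholder caption

# Because both vectors live in the same space, plain cosine similarity
# tells you how well the caption matches the image
print(util.cos_sim(image_vector, text_vector))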

Simon Willison, who makes the llm CLI tool, has also made a plugin for the CLIP model, so taking CLIP out for a spin is really simple, just like Drew did.

Except I don’t have 20 000 images of faucets at hand, so I had to find something a little different. TL;DR: Drew’s writeup is good, easy to follow, and the results largely match my experience.

On to the test-drive 🔗

Installing llm and the related utilities is easy; I copy-pasted the commands from Drew's post without issue.

pip install llm
llm install llm-sentence-transformers
llm install llm-clip

OK, so far so good. But we need some images. After dismissing a few projects that would have led to yak-shaving, I found a large-ish fashion dataset on huggingface named “fashionpedia”: https://huggingface.co/datasets/detection-datasets/fashionpedia

The images are embedded in parquet files, so there is a little bit of wrangling needed to write them out as image files on disk (which is where we want them). After the parquet files were all downloaded, this little script did the trick:

import glob
import io

import pandas as pd
from PIL import Image


def save_image(row):
    # Each row carries the raw image bytes; decode them with Pillow
    # and write the result out as images/<image_id>.jpg
    pil_img = Image.open(io.BytesIO(row["image"]["bytes"]))
    pil_img.save(f"images/{row['image_id']}.jpg", "JPEG")


# Work through every parquet file downloaded into the current folder
for fnm in glob.glob("*.parquet"):
    df = pd.read_parquet(fnm)
    df.apply(save_image, axis=1)
    print(f"Done with {fnm}")
Remember to create a folder named images first, and to install the required libraries: pip install pillow pandas pyarrow.
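
As an aside, if you would rather skip the manual parquet download, the Hugging Face datasets library can probably do the same job. This is an untested sketch: I am assuming the dataset exposes the same image and image_id columns that the parquet files contain, with the image column decoded into a PIL image.

from datasets import load_dataset

ds = load_dataset("detection-datasets/fashionpedia", split="train")

for row in ds:
    # datasets hands back the image column as a PIL image
    row["image"].convert("RGB").save(f"images/{row['image_id']}.jpg", "JPEG")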

Once we have all the images in the images folder, we can go back to Drew’s notes, and create embeddings for the images.

llm embed-multi fashion --files images/ '*.jpg' --binary -m clip

The only thing I changed from Drew’s post was the name of the embedding dataset - I am working with fashion images, not faucets.
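
For the curious, here is roughly what that command is doing, expressed with llm's Python API. Treat this as an illustrative sketch rather than the actual implementation: it assumes the clip embedding model accepts raw image bytes (the --binary path), and it skips the part where llm stores the vectors in its collection database.

import glob

import llm

# The same model the CLI selects with -m clip
model = llm.get_embedding_model("clip")

# Map each image path to its embedding vector (a plain list of floats)
embeddings = {}
for path in glob.glob("images/*.jpg"):
    with open(path, "rb") as f:
        embeddings[path] = model.embed(f.read())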

I did all of this on a Microsoft Surface, with a normal processor, normal RAM and no GPU. I didn't time the embedding process, but it must have taken about an hour. Fairly impressive actually, given how compute-intensive even plain inference with these kinds of models can be.

By pure luck, I stumbled across an image of a dress on Twitter that I could use as a test.

llm similar fashion -i IMG_0200.jpeg --binary

My photoshop (MS Paint) skills are nonexistent, but below is the original image on the left, followed by the first two recommendations.

Image similarity search

Although I doubt the woman who posted the first image on Twitter would have liked any of the other dresses, there are undeniably a lot of similarities.
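
The ranking behind llm similar is nothing exotic: embed the query (an image here, plain text in the next section) and sort the stored vectors by cosine similarity. A rough sketch, reusing the embeddings dictionary from the earlier snippet; again an assumption about how the pieces fit together, not the plugin's actual code.

import numpy as np
import llm

model = llm.get_embedding_model("clip")


def cosine(a, b):
    a, b = np.asarray(a), np.asarray(b)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


# `embeddings` is the {path: vector} dict built in the earlier sketch
with open("IMG_0200.jpeg", "rb") as f:
    query = model.embed(f.read())

ranked = sorted(embeddings.items(), key=lambda kv: cosine(query, kv[1]), reverse=True)
print(ranked[:3])  # the closest matches; a text query works the same via model.embed("Bond villain")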

Because we already saw which faucets match a Bond villain and which match “Nintendo 64”, let's find some fashion to go with the same prompts.

Bond villains 🔗

llm similar fashion -c "Bond villain"

Bond villain fashion

Nintendo 64 🔗

llm similar fashion -c "Nintendo 64"

Nintendo 64

Unlike the Nintendo 64 faucets, the Nintendo 64 fashion does not look like 64-bit pieces of clothing. There does seem to be some intangible game-themed aesthetic, though: bright colors, toys and cereal.

Still, I am impressed. And with the llm library it was also really easy to do.