I came across a cool post by Drew Breunig about finding bathroom faucets with the CLIP model: https://www.dbreunig.com/2023/09/26/faucet-finder.html.
Multi-modal embedding models let you embed both text and images in the same embedding space, enabling search across both images and text. Although multimodal embedding models still seem like mostly a blank slate, there is at least one readily available: OpenAI's CLIP.
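To make this concrete: CLIP maps text and images into one shared vector space, so a caption and a picture can be compared directly with cosine similarity. Here is a minimal sketch using the CLIP checkpoint that ships with sentence-transformers (the file name and caption are placeholders, not from the dataset):

from PIL import Image
from sentence_transformers import SentenceTransformer, util

# One model handles both modalities.
model = SentenceTransformer("clip-ViT-B-32")

# Embed an image and a caption into the same space.
img_emb = model.encode(Image.open("some_dress.jpg"))
txt_emb = model.encode("a red evening dress")

# Cosine similarity tells us how well the caption matches the image.
print(util.cos_sim(img_emb, txt_emb))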
Simon Willison, who makes the llm CLI tool, has also made a plugin for the CLIP model, so taking CLIP out for a spin is really simple, just like Drew did.
Except I don’t have 20 000 images of faucets at hand, so I had to find something a little different. TL;DR: Drew’s writeup is good, easy to follow, and the results largely match my experience.
On to the test-drive
Installing llm and the related utilities is easy; I copy-pasted the commands from Drew's post without issue.
pip install llm
llm install llm-sentence-transformers
llm install llm-clip
OK, so far so good. But we need some images. After dismissing a few projects that would have led to yak-shaving, I found a large-ish fashion dataset on Hugging Face named “fashionpedia”: https://huggingface.co/datasets/detection-datasets/fashionpedia
The images are embedded in parquet files, so a little bit of wrangling is needed to write them out as image files on disk (which is where we want them). After the parquet files were all downloaded, this little script did the trick:
import glob
import io

import pandas as pd
from PIL import Image


def save_image(row):
    # Each row holds the raw image bytes plus an image_id we can use as a file name.
    pil_img = Image.open(io.BytesIO(row["image"]["bytes"]))
    pil_img.save(f"images/{str(row['image_id'])}.jpg", "JPEG")


parquetfiles = glob.glob('*.parquet')
for fnm in parquetfiles:
    df = pd.read_parquet(fnm)
    df.apply(save_image, axis=1)
    print(f"Done with {fnm}")
Remember to create a folder named images, and to install the libraries: pip install pillow pandas pyarrow.
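If you would rather not create the folder by hand, something like this at the top of the script does the same job:

import os
# Make sure the output folder exists before writing any images.
os.makedirs("images", exist_ok=True)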
Once we have all the images in the images folder, we can go back to Drew’s notes, and create embeddings for the images.
llm embed-multi fashion --files images/ '*.jpg' --binary -m clip
The only thing I changed from Drew’s post was the name of the embedding dataset - I am working with fashion images, not faucets.
I did all of this on a Microsoft Surface, with an ordinary processor, ordinary RAM, and no GPU. I didn't time the embedding process, but it must have taken about an hour. Fairly impressive, actually, given how computationally intensive inference with these kinds of models usually is.
Image search
By pure luck, I stumbled across an image of a dress on Twitter that I could use as a test.
llm similar fashion -i IMG_0200.jpeg --binary
My Photoshop (MS Paint) skills are nonexistent, but below is the original image on the left, followed by the first two recommendations.
Although I doubt the woman who posted the first image on Twitter would have liked any of the other dresses, there are undeniably a lot of similarities.
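Under the hood, llm similar ranks the stored embeddings by their similarity to the query image's embedding. The same idea in plain Python, using sentence-transformers directly rather than the llm collection, looks roughly like this (the candidate set is just the first hundred files, for illustration):

import glob
from PIL import Image
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("clip-ViT-B-32")

# Embed the query image and a batch of candidate images.
query_emb = model.encode(Image.open("IMG_0200.jpeg"))
candidates = glob.glob("images/*.jpg")[:100]
cand_embs = model.encode([Image.open(p) for p in candidates])

# Rank candidates by cosine similarity to the query and print the top two.
scores = util.cos_sim(query_emb, cand_embs)[0]
for idx in scores.argsort(descending=True)[:2]:
    print(candidates[int(idx)], float(scores[idx]))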
Text search
Because we already saw which faucets match a Bond villain and which match “Nintendo 64”, let's find some fashion to go with the same prompts.
Bond villains
llm similar fashion -c "Bond villain"
Nintendo 64
llm similar fashion -c "Nintendo 64"
Unlike the Nintendo 64 faucets, the Nintendo 64 fashion does not look like 64-bit pieces of clothing, but there does seem to be some intangible game-themed aesthetic: bright colors, toys, and cereal.
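Text search works the same way, except the query vector comes from CLIP's text encoder instead of its image encoder; the prompt and the images still live in the same space. A rough sketch of scoring a couple of images against both prompts (file names here are hypothetical):

from PIL import Image
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("clip-ViT-B-32")

# Embed the two prompts and a couple of images into the shared space.
prompts = ["Bond villain", "Nintendo 64"]
txt_embs = model.encode(prompts)
images = ["images/12345.jpg", "images/67890.jpg"]  # hypothetical file names
img_embs = model.encode([Image.open(p) for p in images])

# For each image, see which prompt it sits closer to.
scores = util.cos_sim(img_embs, txt_embs)
for path, row in zip(images, scores):
    best = int(row.argmax())
    print(path, prompts[best], float(row[best]))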
Still, I am impressed. And with the llm library, it was also really easy to do.