I started writing this post a while back, but since it has sat half-finished for several months, I’m posting what I have.
I wrote about CLIP models a while back, but from a high-level “what are they and what can they be used for” perspective. Since then I have had the chance to work with CLIP models directly in Python, and they are still impressive.
You can use CLIP models with the transformers library from Hugging Face; there are dedicated CLIPModel and CLIPProcessor classes:
from transformers import AutoTokenizer, CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
tokenizer = AutoTokenizer.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
Using the model is a tiny bit different from using a normal LLM, but you can get PyTorch vectors for some text like this:
text_vector = model.get_text_features(**tokenizer(["A bug in a rug"], truncation=True, return_tensors="pt"))
Similarly, you can create an embedding for an image (here img1 is a PIL image):
image_vector = model.get_image_features(**processor(images=img1, return_tensors="pt"))
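To make that concrete, here’s a small end-to-end sketch (the image file name and captions are made up for illustration): it embeds one image and two candidate captions with the same model as above, then compares them with cosine similarity, which is the usual way CLIP embeddings get used.

from PIL import Image
import torch
from transformers import AutoTokenizer, CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
tokenizer = AutoTokenizer.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# "cat.jpg" is a stand-in for whatever image you have on disk
img1 = Image.open("cat.jpg")

with torch.no_grad():
    text_vectors = model.get_text_features(
        **tokenizer(["a photo of a cat", "a photo of a dog"],
                    padding=True, truncation=True, return_tensors="pt"))
    image_vector = model.get_image_features(
        **processor(images=img1, return_tensors="pt"))

# Text and image embeddings live in the same 512-dimensional space for this
# model, so cosine similarity tells you how well each caption matches the image.
text_vectors = text_vectors / text_vectors.norm(dim=-1, keepdim=True)
image_vector = image_vector / image_vector.norm(dim=-1, keepdim=True)
print(image_vector @ text_vectors.T)  # higher score = better match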
Often, these embedding vectors are stored in a vector database such as Chroma or Weaviate, or in something more general-purpose like Postgres with the pgvector extension or Snowflake with its new VECTOR column type.
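As a rough sketch of what that looks like with Chroma (the collection name and IDs here are made up), using its in-memory client and the text_vector and image_vector tensors from above:

import chromadb

# In-memory client; Chroma also has a persistent, on-disk mode.
client = chromadb.Client()

# Use cosine distance, since that's how CLIP embeddings are usually compared.
collection = client.create_collection(
    name="clip_images", metadata={"hnsw:space": "cosine"})

# Store the image embedding (a plain list of floats) under a made-up ID.
collection.add(ids=["img1"], embeddings=[image_vector[0].tolist()])

# Later, search the collection with a text embedding from the same CLIP model.
results = collection.query(query_embeddings=[text_vector[0].tolist()], n_results=1)
print(results["ids"])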