# Clip image annotation filter
**Warning:** This feature is experimental and currently produces low-quality results.

You can query objects from annotated rosbags with a text prompt using OpenAI CLIP.
## Use with CLI
```bash
amber automation clip_image_annotation_filter tests/automation/clip_image_annotation_filter.yaml tests/rosbag/ford_with_annotation/read_images_and_bounding_box.yaml tests/rosbag/ford_with_annotation/bounding_box.mcap output.mcap
```
The task description YAML for the clip_image_annotation_filter is below.
```yaml
target_objects: ["passenger car."] # Target objects you want to find.
# If the width of the bounding box exceeds min_width or its height exceeds min_height,
# the bounding box is treated as a candidate.
min_width: 30 # Minimum width of the object in pixels.
min_height: 30 # Minimum height of the object in pixels.
min_area: 50 # Minimum area of the object in square pixels.
# Classification method. You can choose from two methods:
# clip_with_lvis_and_custom_vocabulary or consider_annotation_with_bert.
classify_method: consider_annotation_with_bert
# Configuration parameters for consider_annotation_with_bert.
# These values are only used when classify_method is `consider_annotation_with_bert`.
consider_annotation_with_bert_config:
  positive_nagative_ratio: 1.0 # Ratio of the cosine similarity between the positive and negative prompts. The negative prompt is "Not a photo of $target_object".
  min_clip_cosine_similarity: 0.25 # Minimum cosine similarity between the CLIP text and image embeddings.
  min_clip_cosine_similarity_with_berf: 0.3 # Minimum cosine similarity between the CLIP text and image embeddings after weighting by the BERT prompt similarity.
```
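
As a rough illustration of how these size thresholds could gate candidate bounding boxes, here is a minimal sketch; the helper name `is_candidate` and the way `min_area` is combined with the width/height rule are assumptions, not the library's exact logic.

```python
# Hypothetical helper illustrating the size thresholds above; not the actual amber code.
def is_candidate(
    width: float,
    height: float,
    min_width: float = 30,
    min_height: float = 30,
    min_area: float = 50,
) -> bool:
    # A box qualifies when it is wide enough or tall enough (as described in the YAML comment)
    # and, assumed here, its area also clears min_area.
    return (width > min_width or height > min_height) and width * height > min_area
```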
## Algorithms
### consider_annotation_with_bert
The Python snippet for consider_annotation_with_bert is shown below.
```python
# Pure CLIP cosine similarity.
clip_similarity = cosine_similarity(
    clip_embeddings / torch.sum(clip_embeddings),
    self.text_embeddings[target_object][0],
)
# CLIP cosine similarity considering BERT embeddings.
positive = cosine_similarity(
    clip_embeddings / torch.sum(clip_embeddings)
    + annotation_text_embeddings
    / torch.sum(annotation_text_embeddings)
    * self.text_encoder.cosine_similarity(
        bounding_box.object_class, target_object
    ),
    self.text_embeddings[target_object][0],
)
# Pure CLIP cosine similarity with the negative prompt.
negative = cosine_similarity(
    clip_embeddings / torch.sum(clip_embeddings),
    self.text_embeddings[target_object][1],
)
```
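
The snippet above computes three scores but does not show how they are combined with the thresholds from the YAML configuration. One plausible acceptance rule, written as a sketch, is shown below; the function `accept` is hypothetical, and the parameter `positive_negative_ratio` corresponds to the `positive_nagative_ratio` key in the configuration.

```python
# Hypothetical acceptance rule; the real filter may combine the scores differently.
def accept(
    clip_similarity: float,
    positive: float,
    negative: float,
    positive_negative_ratio: float = 1.0,  # positive_nagative_ratio in the YAML
    min_clip_cosine_similarity: float = 0.25,
    min_clip_cosine_similarity_with_bert: float = 0.3,  # min_clip_cosine_similarity_with_berf in the YAML
) -> bool:
    # The positive prompt must beat the negative prompt by the configured ratio,
    # and both the raw and BERT-weighted similarities must clear their minimums.
    return (
        positive >= negative * positive_negative_ratio
        and clip_similarity >= min_clip_cosine_similarity
        and positive >= min_clip_cosine_similarity_with_bert
    )
```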
### clip_with_lvis_and_custom_vocabulary
Embed all object categories in LVIS and append the custom vocabulary to the text embedding tensor. Then, find the nearest category.
```python
if self.lvis_text_embeddings is None:
    with torch.no_grad():
        # Make text embeddings from all LVIS objects.
        self.lvis_text_embeddings = self.model.encode_text(
            tokenize(self.lvis_prompts).to(self.device)
        )
prompts: List[str] = []
# Construct a prompt for each target object.
for text in texts:
    prompts.append("A photo of a " + text)
with torch.no_grad():
    text_embeddings = torch.cat(
        [
            self.lvis_text_embeddings,
            self.model.encode_text(tokenize(prompts).to(self.device)),
        ],
        dim=0,
    )
image_embeddings /= image_embeddings.norm(dim=-1, keepdim=True)
text_embeddings /= text_embeddings.norm(dim=-1, keepdim=True)
# Find the nearest category. This code is based on the OpenAI official implementation.
# See also https://github.com/openai/CLIP/tree/main#zero-shot-prediction
similarity = (
    image_embeddings.to(torch.float32) @ text_embeddings.to(torch.float32).T
).softmax(dim=-1)
values, indices = similarity.topk(1)
for value, index in zip(values, indices):
    if index < len(self.lvis_classes):
        return None
    else:
        return (texts[index - len(self.lvis_classes)], value.item())
```
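
For context, the `image_embeddings` consumed above can be produced with the openai/clip package roughly as follows; this is a sketch, and `crop.png` stands in for a cropped bounding-box image that the filter extracts from the rosbag internally.

```python
import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
# clip.load returns the model and its matching image preprocessing transform.
model, preprocess = clip.load("ViT-B/32", device=device)

# Encode a single cropped bounding-box image into a CLIP image embedding.
image = preprocess(Image.open("crop.png")).unsqueeze(0).to(device)
with torch.no_grad():
    image_embeddings = model.encode_image(image)
```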
## Use with Python API
```python
import os
from pathlib import Path

# ClipImageAnnotationFilter and ImagesAndAnnotationsDataset are provided by the amber package.
current_path = Path(os.path.dirname(os.path.realpath(__file__)))
filter = ClipImageAnnotationFilter(str(current_path / "automation" / "clip_image_annotation_filter.yaml"))
dataset = ImagesAndAnnotationsDataset(
    str(current_path / "rosbag" / "ford_with_annotation" / "bounding_box.mcap"),
    str(current_path / "rosbag" / "ford_with_annotation" / "read_images_and_bounding_box.yaml"),
)
annotations = filter.inference(dataset)
```
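
Assuming `inference` returns an iterable of filtered annotations, a minimal follow-up could simply iterate over the result; the loop below is illustrative only.

```python
# Sketch: inspect the filtered annotations returned by the filter.
for annotation in annotations:
    print(annotation)
```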