Experimenting with Set-of-Mark Prompting with GPT-4 Vision

Ferry Djaja
5 min read · Feb 8, 2024

I came across this challenge via a GitHub issue on the self-operating-computer repo, where GPT-4V's error rate is quite high when estimating the position of the mouse on the screen: https://github.com/OthersideAI/self-operating-computer/issues/3. From there I tried to understand how Set-of-Mark Prompting works by reading the paper and running the demo code. Unfortunately, I wasn't able to run the code on my Mac, so I decided to give it a try on Google Colab.

Here is my walkthrough.

Set-of-Mark (SoM)

Set-of-Mark (SoM) is a visual prompting technique that taps into the visual grounding capabilities of large multimodal models (LMMs) like GPT-4V. The idea is simply to add a set of visual marks on top of image regions. In this approach, the authors use readily available interactive segmentation models, such as SEEM or SAM, to segment an image into regions at various levels of granularity. These regions are then overlaid with a set of marks, such as alphanumeric labels, masks, or boxes. By feeding the marked image as input, GPT-4V becomes adept at answering questions that demand visual grounding. Their experiments show that SoM drastically unleashes the grounding capability of GPT-4V.
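The core mechanic can be sketched in a few lines of Python. This is my own minimal illustration, not the official SoM code: the masks would normally come from SEEM/SAM, but here I fabricate two synthetic boolean masks and draw a numeric id at the centroid of each region.

```python
# Minimal sketch of the "overlay marks" step of Set-of-Mark prompting.
# The helper name `overlay_marks` and the synthetic masks are my own
# illustration; real masks would come from a SEEM/SAM segmenter.
import numpy as np
from PIL import Image, ImageDraw

def overlay_marks(image: Image.Image, masks: list) -> Image.Image:
    """Draw a numeric id at the centroid of each boolean mask."""
    marked = image.convert("RGB").copy()
    draw = ImageDraw.Draw(marked)
    for idx, mask in enumerate(masks, start=1):
        ys, xs = np.nonzero(mask)
        if len(xs) == 0:
            continue  # skip empty masks
        cx, cy = int(xs.mean()), int(ys.mean())
        # White label on a black box so the id stays readable on any region
        draw.rectangle([cx - 8, cy - 8, cx + 8, cy + 8], fill="black")
        draw.text((cx - 4, cy - 6), str(idx), fill="white")
    return marked

# Synthetic demo: two rectangular "regions" on a gray canvas
img = Image.new("RGB", (64, 64), "gray")
m1 = np.zeros((64, 64), dtype=bool); m1[8:24, 8:24] = True
m2 = np.zeros((64, 64), dtype=bool); m2[40:60, 30:60] = True
marked = overlay_marks(img, [m1, m2])
# `marked` is what would be sent to GPT-4V alongside the text prompt
```

The marked image, rather than the raw one, is then passed to GPT-4V, and the prompt can refer to regions by id ("what is object 2?") without any coordinates leaking into the text input.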

Figure 1: Comparison of GPT-4V prompting techniques: (left) standard prompting and (right) the proposed Set-of-Mark Prompting. Simply overlaying ids on image regions unleashes visual grounding and yields correct answers from GPT-4V. Note that no marks are leaked into the user's text input.

Installation

I will be using Google Colab to perform the installation steps, since it provides enough resources to run the app.

Install SEEM: Segment Everything Everywhere All at Once

pip install git+https://github.com/UX-Decoder/Segment-Everything-Everywhere-All-At-Once.git@package

Install Segment Anything

pip install git+https://github.com/facebookresearch/segment-anything.git

Install Semantic-SAM: Segment and Recognize Anything at Any Granularity
