Experimenting with Set-of-Mark Prompting with GPT-4 Vision

Ferry Djaja
Feb 8, 2024

I came across this challenge in a GitHub issue on the self-operating-computer repo, where GPT-4V's error rate is quite high when estimating the position of the mouse on the screen: https://github.com/OthersideAI/self-operating-computer/issues/3. From there I tried to understand how Set-of-Mark Prompting works by reading the paper and running the demo code. Unfortunately, I wasn't able to run the code on my Mac, so I decided to give it a try on Google Colab.

Here is my walkthrough.

Set-of-Mark (SoM)

Set-of-Mark (SoM) is a visual prompting technique that taps into the visual grounding capabilities of large multimodal models (LMMs) such as GPT-4V. The idea is simply to add a set of visual marks on top of image regions. The authors use off-the-shelf interactive segmentation models, such as SEEM or SAM, to partition an image into regions at various levels of granularity, then overlay those regions with a set of marks such as alphanumeric labels, masks, or boxes. Fed the marked image as input, GPT-4V becomes adept at answering questions that demand visual grounding. Their experiments show that SoM drastically unleashes the grounding capability of GPT-4V.
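
To make the pipeline concrete, here is a minimal sketch of the idea in Python: segment the image with SAM, stamp a numeric mark on each region, and ask GPT-4V a question that refers to the marks. The checkpoint name, file names, model string, and prompt below are my own illustrative choices, not the settings of the official SoM demo, which uses a more careful mark-placement algorithm to avoid overlapping labels.

```python
# Sketch of the SoM pipeline: SAM segmentation -> numeric marks -> GPT-4V.
# Assumes the segment-anything package is installed and the checkpoint
# sam_vit_h_4b8939.pth has been downloaded; OPENAI_API_KEY is set.
import base64
import cv2
from openai import OpenAI
from segment_anything import sam_model_registry, SamAutomaticMaskGenerator

# 1. Segment the image into regions with SAM (expects an RGB uint8 array).
image = cv2.cvtColor(cv2.imread("screenshot.png"), cv2.COLOR_BGR2RGB)
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
masks = SamAutomaticMaskGenerator(sam).generate(image)

# 2. Overlay a numeric mark at the center of each region's bounding box.
marked = image.copy()
for i, m in enumerate(masks, start=1):
    x, y, w, h = m["bbox"]  # SAM reports boxes in XYWH format
    cv2.putText(marked, str(i), (x + w // 2, y + h // 2),
                cv2.FONT_HERSHEY_SIMPLEX, 0.8, (255, 0, 0), 2)  # red in RGB
cv2.imwrite("marked.png", cv2.cvtColor(marked, cv2.COLOR_RGB2BGR))

# 3. Send the marked image to GPT-4V and refer to regions by their marks.
with open("marked.png", "rb") as f:
    b64 = base64.b64encode(f.read()).decode()

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4-vision-preview",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "The image is annotated with numeric marks. "
                     "Which mark is on the Submit button?"},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{b64}"}},
        ],
    }],
)
print(response.choices[0].message.content)
```

This also hints at why SoM helps with the mouse-positioning problem from the GitHub issue above: instead of asking GPT-4V for raw pixel coordinates, which it estimates poorly, you ask it to name a mark, and the mark's region already carries exact coordinates.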
