Member-only story

Build a Screen Parsing with GPT-4o Vision and OmniParser

Ferry Djaja
7 min readDec 8, 2024

--

In this tutorial, we’ll explore building a screen parsing agent using GPT-4o and OmniParser. Our goal is to demonstrate locking a computer, specifically a MacOS device, using this agent.

Here’s a step-by-step breakdown of the process:

  1. Capture screenshot: utilize PyAutoGUI to take a screenshot of the current screen state.
  2. Parse screenshot: send the screenshot to OmniParser for analysis.
  3. Task Request: query LLM GPT-4o Vision with a specific task, such as locking the computer.
  4. Evaluate Task Completion: receive the result from LLM GPT-4o and determine if the task is complete.
  5. Conditional Action:
    - If the task is incomplete, use PyAutoGUI to move the cursor to a specific screen coordinate and simulate a click.
    - If the task is complete, terminate the flow.
  6. Iteration: If the task is not complete, repeat the process from step 1.
Flow Diagram

Parse Screenshot: Implementation for OmniParser Integration

To mitigate high memory usage issues encountered during multiple iterations, I have deployed FastAPI to host the parsing code. If you have alternative solutions or suggestions, please feel free to share!

Below is the Python code responsible for parsing with OmniParser on step 2 — Parse screenshot.

from fastapi import FastAPI, File, UploadFile, HTTPException
from fastapi.responses import JSONResponse
from PIL import Image
import base64
import uuid
import os
import asyncio
import torch
import gc
from transformers import AutoProcessor, AutoModelForCausalLM
from transformers.dynamic_module_utils import get_imports
from ultralytics import YOLO
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain import hub
from utils import (
check_ocr_box,
get_yolo_model,
get_caption_model_processor,
get_som_labeled_img,
)
from unittest.mock import patch

# Initialize FastAPI app
app = FastAPI()

def fixed_get_imports(filename: str | os.PathLike) -> list[str]:
print(filename)
if not str(filename).endswith("modeling_florence2.py"):
return get_imports(filename)
imports = get_imports(filename)
if "flash_attn" in imports: #…

--

--

No responses yet