While operating, Operator displays a miniature browser window of its actions.
However, the technology behind Operator is still relatively new and far from perfect. The model would be more effective for repetitive web tasks such as creating shopping lists or reading lists. It struggles more with unfamiliar interfaces like tables and calendars, and does poorly with editing complex text (with a 40% success rate), according to internal testing data from OpenAI.
OpenAI reported that the system achieved an 87% success rate on the WebVoyager benchmark, which tests live sites like Amazon and Google Maps. On WebArena, which uses offline testing sites to train autonomous agents, Operator’s success rate fell to 58.1%. For computer operating system tasks, CUA set an apparent record of 38.1 percent success on the OSWorld benchmark, outperforming previous models but still falling short of human performance at 72.4 percent.
With this imperfect overview of the research, OpenAI hopes to gather user feedback and refine the system’s capabilities. The company acknowledges that CUA will not work reliably in all scenarios, but plans to improve its reliability across a wider range of tasks through user testing.
Security and privacy concerns
For any AI model that can see how you use your computer and even control aspects of it, privacy and security are very important. OpenAI claims to have built several security controls into Operator, requiring user confirmation before performing sensitive actions such as sending emails or making purchases. The operator also has limits on what it can browse, set by OpenAI. It cannot access certain categories of websites, including gambling and adult content.
Traditionally, large language model-style Transformer-based AI models like Operator have been relatively easy to fool with quick jailbreaks and injections.
To detect operator hijacking attempts, which could hypothetically be integrated into websites crawled by the AI model, OpenAI claims to have implemented real-time moderation and detection systems. OpenAI reports that the system recognized all but one instance of rapid injection attempts during an initial internal red team session.