New 3D benchmark leaves AI in knots
By Louis DiPietro
Today’s artificial intelligence models can’t even tie their own shoes.
In new research that puts the latest models to the test in a 3D environment, Cornell scholars found that AI fares well at untangling basic knots but can’t quite tie knots from simple loops or convert one knot into another.
The findings suggest that, for all its value in generating text- and image-based information, AI still has a long way to go in spatial reasoning and manipulation, abilities that will prove essential in other AI-powered areas like robotics.
“With current AI, it works great with big blocks of text. Once moved to reason in the 3D world, AI breaks,” said Zoe (Zizhao) Chen, a doctoral student in the field of computer science at Cornell Tech and lead author of “Knot So Simple: A Minimalistic Environment for Spatial Reasoning,” which was presented at the Annual Conference on Neural Information Processing Systems (NeurIPS) on Dec. 5 in San Diego, California. “Most reasoning we see from AI today is text-based. That’s great, but it’s not enough.”
In the paper, Chen and Yoav Artzi, associate professor of computer science at Cornell Tech and paper co-author, present KnotGym, a 3D simulator for testing different kinds of reinforcement learning models and large language models (LLMs), like GPT-4, in a virtual environment. KnotGym is a visual generalization test, which can help researchers better understand how well AI will perform beyond its baseline training and on increasingly harder tasks. It features a “generalization ladder,” allowing researchers to crank up the difficulty of the knots presented to the models.
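To make the idea of a benchmark with a difficulty ladder concrete, here is a minimal sketch of what a gym-style knot environment could look like. All names (`KnotEnv`, `max_crossings`, the integer-valued actions) are illustrative assumptions, not KnotGym’s actual API; the real benchmark operates on 3D rope configurations, which this toy model reduces to a single crossing count.

```python
# Hypothetical sketch of a gym-style knot-manipulation interface.
# Names and mechanics are illustrative, not KnotGym's actual API.
import random


class KnotEnv:
    """Toy environment: the state is a crossing count; the agent's goal
    is to reach a target count (0 = unknotted, >0 = tie or convert)."""

    def __init__(self, max_crossings=4, target=0, seed=None):
        self.max_crossings = max_crossings  # rung on the "generalization ladder"
        self.target = target
        self.rng = random.Random(seed)
        self.crossings = None

    def reset(self):
        # Start from a random knot at or below the current difficulty rung.
        self.crossings = self.rng.randint(1, self.max_crossings)
        return self.crossings

    def step(self, action):
        # action: -1 removes a crossing, +1 adds one -- a crude stand-in
        # for the continuous 3D manipulations a real simulator would use.
        self.crossings = max(0, self.crossings + action)
        done = self.crossings == self.target
        reward = 1.0 if done else 0.0
        return self.crossings, reward, done


# Usage: an "unknot" episode at the four-crossing rung of the ladder.
env = KnotEnv(max_crossings=4, target=0, seed=0)
obs = env.reset()
done = False
while not done:
    obs, reward, done = env.step(-1)  # keep removing crossings
```

Raising `max_crossings` plays the role of climbing the generalization ladder: an agent trained at low crossing counts can be evaluated on harder rungs it never saw during training.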
“That’s missing from most computer vision tasks,” Chen said.
In KnotGym, the AI agents were shown simple illustrations of loops and various knots and prompted to either unknot, tie, or convert one knot into another.
AI did well with untying simple knots. It had a roughly 90% success rate in untying knots of up to four crossings, including a basic shoelace knot, which has three.
However, tying and converting knots proved harder for AI; the more crossings in the knot, the worse it did, researchers found. It had an 83% success rate in tying simple two-crossing knots, but plummeted to a 16% success rate for knots with three crossings. Knots with more than three crossings completely tied up the AI, researchers found. (The success rate for knot conversions was about the same as for tying knots.)
The research suggests AI doesn’t yet know how to play and discover, Chen said.
“When kids play with this,” said Chen, reaching for a nearby Rubik’s Cube, “they just start fiddling around and eventually figure out a sequence of actions that get them to certain configurations without messing up the rest. They explore. They reuse lessons, build off previous knowledge, and work toward a grander goal. That’s an ability we want to see with AI, but it’s not there yet.”
In future work, Chen aims to improve KnotGym by, for instance, running it on graphics processing units (GPUs), computer hardware originally designed for gaming that can run many computations in parallel, unlike the central processing units (CPUs) found in everyday computers. This change would speed up evaluation, she said.
This research was partly funded by the National Science Foundation, Open Philanthropy, an Nvidia Academic Grant, and the National Artificial Intelligence Research Resource (NAIRR) Pilot.
Louis DiPietro is a writer for the Cornell Ann S. Bowers College of Computing and Information Science.