AI tools to help the vision-impaired are good, but could be better

Artificial intelligence is touching nearly every aspect of life – including assistive technology for blind and low-vision (BLV) individuals.

And just like in other arenas, the AI used to assist BLV people is good – but far from perfect.

In a study involving 20 vision-impaired participants, researchers at Cornell Tech found that the large language model-enabled application they developed to help BLV individuals interpret their surroundings worked well for general “What is this?” questions, but struggled when asked to provide more detailed assistance in complex tasks, such as describing artistic pieces. The researchers also proposed nine “skills” that would improve the models powering the intelligence of these apps.

“The technology has come a long way in interpreting visuals and is really helping people. But at the same time, we want to make sure that we are continuing to improve them, so we want to identify the gaps,” said Shiri Azenkot, associate professor at Cornell Tech, the Jacobs Technion-Cornell Institute and the Cornell Ann S. Bowers College of Computing and Information Science.

Ricardo Gonzalez, a doctoral candidate in human-AI interaction, is lead author of “How Multimodal Large Language Models Support Access to Visual Information: A Diary Study With Blind and Low Vision People.” Gonzalez presented the work at the Association for Computing Machinery’s Conference on Human Factors in Computing Systems (CHI ’26), April 13-17 in Barcelona, Spain, where it received an honorable mention.

In addition to senior author Azenkot, co-authors include Crescentia Jung, M.S. ’25, a doctoral student in information science; Sharon Lin, M.S. ’25, a graduate research assistant; and Ruiying Hu, M.S. ’25, now a software engineer at Snap Inc.

Azenkot, director of the Enhancing Ability Lab, focuses her research on enabling people with disabilities to have equal access to information. Her research is also personal: She is legally blind and walks with the help of a guide dog.

She said AI’s impact on disability and access has improved in recent years. “As someone who has a visual impairment who also uses these technologies, I can feel the difference in my daily life,” she said.

For this research, Gonzalez developed a visual interpretation application – a smartphone app, he said, with which the user takes a picture of a scene, then receives feedback and interpretation.

“With the app, you can look at a page of text, you can point it at your computer, you can point it at the moving train,” Azenkot said, “and it’ll attempt to interpret or tell you what’s in the scene.”

They employed a multimodal large language model (MLLM), which combines the reasoning capabilities of LLMs with the ability to receive, interpret and respond to images, audio and video. Gonzalez said it was important that they conduct their experiment using a system they developed themselves.

“By having full control of the application’s inner workings,” he said, “we can deploy our study in the real world and collect data that truly reflects how these systems are affecting people’s lives.”

Using GPT-4o, the team developed its own iPhone app, called VisionPal, and recruited 20 BLV individuals with varying levels of vision impairment. The researchers conducted a diary study from October to December 2024, in which each participant used VisionPal for two weeks and documented their experiences.
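
For readers curious about the mechanics, the sketch below shows roughly how a visual interpretation app can pass a captured photo and a question to a multimodal model such as GPT-4o. It uses OpenAI’s Python SDK; the prompt, function name and settings are illustrative assumptions, not VisionPal’s actual implementation.

# Minimal sketch of querying a multimodal model with a photo and a question.
# Illustrative only: the prompt, function name and settings are assumptions,
# not VisionPal's actual implementation.
import base64
from openai import OpenAI

client = OpenAI()  # reads the OPENAI_API_KEY environment variable

def describe_photo(image_path: str, question: str = "What is in this scene?") -> str:
    """Send a captured photo plus a question to GPT-4o and return its answer."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": question},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content

# Example: interpret a photo the user just took, then ask a follow-up question.
print(describe_photo("scene.jpg"))
print(describe_photo("scene.jpg", "Read the cooking instructions on the package."))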

Each diary entry consisted of the participant’s captured photo, the photo description, the chat messages exchanged between the participant and the MLLM, and the participant’s responses to a short survey, which they were asked to complete at least once per day.
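
As a rough illustration of that structure, each entry can be thought of as a record like the one sketched below; the field names are hypothetical, not the study’s actual schema.

# Hypothetical sketch of the record a diary entry might map to; field names
# are illustrative assumptions, not the study's actual schema.
from dataclasses import dataclass, field

@dataclass
class DiaryEntry:
    photo_path: str          # the participant's captured photo
    photo_description: str   # the description returned by the MLLM
    chat_messages: list[str] = field(default_factory=list)          # follow-up conversation
    survey_responses: dict[str, str] = field(default_factory=dict)  # short daily survey answers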

The researchers collected hundreds of interactions and found that, while participants reported high levels of satisfaction and trust, some limitations remain. When engaging in conversation to follow up on initial visual interpretations – such as when reading cooking instructions or medication dosages – VisionPal correctly responded to only 56.6% of user queries, and 22.2% of responses included false information.

“These models are advancing at a dizzying pace,” Gonzalez said, “and while they are definitely useful, we need to continue to audit them to understand the risks they introduce in people’s lives.”

The research team proposed nine “visual assistant” skills: behaviors MLLMs should exhibit to ensure goal-relevant, reliable visual assistance. They include neutral factual communication (be objective); adaptive communication protocols (establish communication preferences early); goal-oriented collaboration (communicate only relevant information, based on the user’s setting); transparent uncertainty handling (acknowledge limitations); and graceful handoff (hand off to appropriate resources or acknowledge a lack of expertise).

“I want to emphasize that I really think we’ve come a long way with this technology,” Azenkot said. “But the first questions I always come across in trying to improve the technology are human questions: What do people need, and how can we best support those needs?”

“People should remain at the center,” Gonzalez said, “because ultimately these tools exist to serve their needs and improve their lives.”

This work was supported by funding from the National Science Foundation.

Media Contact

Becka Bowyer