The patient was a 39-year-old woman who had come to the emergency department at Beth Israel Deaconess Medical Center in Boston. Her left knee had been hurting for several days. The day before, she had run a fever of 102 degrees. The fever was gone by the time she arrived, but she still had chills. And her knee was red and swollen.
What was the diagnosis?
On a recent steamy Friday, Dr. Megan Landon, a resident, presented this real case to a room full of medical students and residents. They had gathered to learn a skill that can be difficult to teach: how to think like a doctor.
“Doctors are bad at teaching other doctors how to think,” said Dr. Adam Rodman, an internist, medical historian and organizer of the event at Beth Israel Deaconess.
But this time, they could call on an expert to help them come up with a diagnosis: GPT-4, the latest version of a chatbot released by OpenAI.
Artificial intelligence is changing many aspects of the practice of medicine, and some medical professionals are using these tools to assist them in diagnosis. Doctors at Beth Israel Deaconess, a teaching hospital affiliated with Harvard Medical School, decided to explore how chatbots could be used — and abused — in training future doctors.
Educators like Dr. Rodman hope that medical students can turn to GPT-4 and other chatbots for something akin to what doctors call a curbside consult, when they pull a colleague aside and ask for an opinion about a difficult case. The idea is to use the chatbot in the same way that doctors turn to one another for suggestions and ideas.
For more than a century, doctors have been portrayed as detectives who gather clues and use them to find the culprit. But experienced doctors actually use a different method, pattern recognition, to figure out what is wrong. In medicine, it is called an illness script: the signs, symptoms and test results that doctors put together to tell a coherent story based on similar cases they know about or have seen themselves.
If an illness script does not help, Dr. Rodman said, doctors turn to other strategies, such as assigning probabilities to the various diagnoses that might fit.
Researchers have tried for more than half a century to design computer programs to make medical diagnoses, but nothing has really worked.
Doctors say GPT-4 is different. “It will create something that is remarkably similar to an illness script,” Dr. Rodman said. In that way, he added, “it is fundamentally different from a search engine.”
Dr. Rodman and other physicians at Beth Israel Deaconess have asked GPT-4 for possible diagnoses in difficult cases. In a study released last month in the medical journal JAMA, they found that it did better than most physicians on weekly diagnostic challenges published in The New England Journal of Medicine.
But they have learned that there is an art to using the software, and that it has its drawbacks.
Medical students and residents are definitely using it, said Dr. Christopher Smith, the director of the internal medicine residency program at the medical center. But, he added, “whether they learn anything is an open question.”
The concern is that they may rely on AI to make diagnoses in the same way they rely on the calculators on their phones to do a math problem. That, Dr. Smith said, is dangerous.
Learning, he said, involves trying to figure things out: “That’s how we retain things. Part of learning is the struggle. If you outsource the learning to GPT, that struggle is gone.”
At the meeting, the students and residents broke into groups and tried to figure out what was wrong with the patient with the swollen knee. Then they turned to GPT-4.
The groups tried different approaches.
One group used GPT-4 to do an internet search, similar to the way one would use Google. The chatbot spat out a list of possible diagnoses, including trauma. But when the group members asked it to explain its reasoning, the bot was disappointing, explaining its choice by stating, “Trauma is a common cause of knee injury.”
Another group thought up possible hypotheses and asked GPT-4 to check them. The chatbot’s list lined up with the group’s: infections, including Lyme disease; arthritis, including gout, a type of arthritis that involves crystals in the joints; and trauma.
GPT-4 added rheumatoid arthritis to the top possibilities, though it was not high on the group’s list. Gout, the instructors later told the group, was improbable for this patient because she was young and female. And rheumatoid arthritis could probably be ruled out because only one joint was inflamed, and for only a couple of days.
As a curbside consultant, GPT-4 seemed to pass the test, or at least to agree with the students and residents. But in this exercise, it offered no insights and no illness script.
One reason may be that the students and residents used the bot more like a search engine than like a curbside consult.
To use the bot properly, the instructors said, they would need to start by telling GPT-4 something like, “You are a doctor seeing a 39-year-old woman with knee pain.” Then they would need to list her symptoms before asking for a diagnosis and following up with questions about the bot’s reasoning, the way they would with a medical colleague.
The instructors said this is a way to harness the power of GPT-4. But it is also important to realize that chatbots can make mistakes and “hallucinate,” providing answers with no basis in fact. Using them requires knowing when the answers are incorrect.
“It’s not wrong to use these tools,” said Dr. Byron Crowe, an internist at the hospital. “You just have to use them in the right way.”
He offered the group an analogy.
“Pilots use GPS,” Dr. Crowe said. But, he added, airlines “have a very high level of reliability.” The use of chatbots in medicine, he said, is “very tempting,” but the same high standards should apply.
“It’s a wonderful intellectual partner,” he said, “but it doesn’t replace deep mental expertise.”
As the session ended, the instructors revealed the true cause of the patient’s swollen knee.
It turned out to be a possibility that every group had considered, and that GPT-4 had suggested.
She had Lyme disease.
Olivia Allison contributed to this report.