
Since we've been able to reproduce the results of "Do LLMs Know About Hallucination?", we have begun to ask whether the authors' technique can be applied to more realistic content. The datasets in that paper are composed almost entirely of questions whose answers are single-sentence or single-word factoids. In reality, however, LLM responses can be quite wordy. The question now is: can we find the needles (confabulations) in this haystack (a paragraph)?

To test this, I took inspiration from their lists of the top 10 tokens associated with the directions of correctness and hallucination (Table 1) and attempted to apply the same idea to our new task. Specifically, I tracked how close the transition vectors were to the word "True" (based on the model's token classification layer) over the course of the generation of the answer.
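Here is a hedged sketch of that tracking step. The model name and the use of the final-layer states are my assumptions for illustration; the actual setup may project intermediate-layer transition vectors instead.

```python
# Sketch: for each generated token, how much probability does the LM head
# (the "token classification layer") assign to the token "True"?
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"  # placeholder causal LM
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

true_id = tokenizer.encode("True", add_special_tokens=False)[0]

def true_scores(prompt: str, answer: str) -> list[float]:
    """Return, for each token of the answer, the probability assigned to 'True'."""
    enc = tokenizer(prompt + answer, return_tensors="pt")
    answer_len = len(tokenizer(answer, add_special_tokens=False)["input_ids"])
    with torch.no_grad():
        out = model(**enc)
    # out.logits is each position's hidden state projected through the LM head.
    probs = torch.softmax(out.logits[0], dim=-1)      # (seq_len, vocab_size)
    return probs[-answer_len:, true_id].tolist()
```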

The dataset used is a subset of 20 SOAP notes from our EHA dataset, along with each note's respective AI-generated response. So, our "question" for each datapoint in this set is, "I want you to act as an attending physician and give feedback on the SOAP note below," followed by the note. The "answers" are each note's AI-generated feedback. However, for each note, I manually changed the feedback in the medication section to say, "I think you prescribe Benadryl." Benadryl was not prescribed in any of the unedited feedback, so in each case it is an inappropriate medication to recommend.

Above is a graph showing the "swings" in the classification value for "True" once the word "Benadryl" has been generated (in red), alongside the value for the word right before it (in blue). So, each blue token is a word that is not in any way wrong, followed by a red dot representing a medication that is clearly inappropriate for that patient. Each blue-red pair of tokens is from the same response. This visualization makes a compelling case for there being a clear "swing" away from "True" in the model's embedding space when a wrong word has been generated.
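A hedged sketch of how those blue/red pairs could be extracted and plotted, building on the `true_scores` sketch above. The helper `benadryl_pair`, the `dataset` variable, and the token-matching logic are illustrative assumptions, not the exact code behind the figure.

```python
# For each edited response, take the "True" score at the token right before
# "Benadryl" (blue) and at the first "Benadryl" token itself (red).
import matplotlib.pyplot as plt

def benadryl_pair(prompt: str, answer: str):
    scores = true_scores(prompt, answer)            # from the sketch above
    ids = tokenizer(answer, add_special_tokens=False)["input_ids"]
    target = tokenizer.encode("Benadryl", add_special_tokens=False)
    for i in range(1, len(ids) - len(target) + 1):
        if ids[i:i + len(target)] == target:
            return scores[i - 1], scores[i]         # (blue, red)
    return None

# dataset: list of (prompt, answer) tuples for the 20 edited notes (assumed).
pairs = [p for p in (benadryl_pair(q, a) for q, a in dataset) if p]
for j, (before, benadryl) in enumerate(pairs):
    plt.plot([j, j], [before, benadryl], color="lightgray")
    plt.scatter(j, before, color="blue")
    plt.scatter(j, benadryl, color="red")
plt.ylabel('Classification value for "True"')
plt.show()
```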

However, when looking at a longer slice of each response, it becomes clear that the perceived downward swing in the first chart is not actually real. Especially looking at the dark blue line representing the average, the expected "drop-off" at the end of the graph does not occur. This means that sheer similarity to "True" is not enough for us to detect a single-word confabulation. However, this is just one possible approach, and I will be investigating others going forward.

The first one.

But wait! There's more! We may be able to learn something about the way LLMs work based on 2nd place to last place.

I'll back up and say a little bit more about what exactly I mean when I ask, "Which sentence receives the most attention?" In the paper we presented at ICHI, we remarked on how in some cases it seemed like GPT was "paying lip service" to certain aspects of our prompt, not actually changing its answers to match concerns we would raise about cost, time, etc.

We'd insert into our prompt something to the effect of, "These tests can be very expensive, so do not recommend ones that are not relevant to the patient's acute condition." Then GPT would respond with the same test recommendations it did for our "base" prompt, but it would throw in a sentence about being careful not to recommend too many tests. This would suggest that, for the sentence with the actual recommendations, it was paying a lot of attention to the list of symptoms, and for the sentence after it (where it would warn against over-recommending tests), it was paying a lot of attention to the warning in our prompt.

This raises the question of whether or not we can actually observe such a pattern at the sentence level. What would an attention map, like the one below, look like grouped into sentences?

In order to investigate this question, we unfortunately have to look at a model that isn't in the GPT-4 family, since access to those models' attention values is not available. Additionally, I wanted to build a proof-of-concept that I could run on my own computer, so I picked the 7-billion-parameter version of LLaMA 2 available on HuggingFace.

An attention matrix for the input-output pair (defined below) as processed by LLaMA 2. Here, the x and y axes are the tokens of the input-output pair, running from the first token (at what would be the (0, 0) point on most other graphs) to the last token. The color scale is 0-1; this will change in other figures.

Above we can see the attention matrix for the following input and output to LLaMA 2:
Input: Tell me a fun fact about the Holy Roman Empire. Then, tell me a fun fact about nuclear fusion. Then, tell me a fun fact about the TV show Friends. Please tell me one fact for each.
Output: Fun fact about the Holy Roman Empire: The Holy Roman Empire was the first multinational state in European history. It was a political entity that existed from 962 to 1806 and included territories in modern-day Germany, Austria, Switzerland, Belgium, the Netherlands, and parts of France. Fun fact about nuclear fusion: Nuclear fusion is the process by which atomic nuclei combine to form a heavier nucleus, releasing energy in the process. It is the same process that powers the sun and other stars. Scientists have been working on developing a practical fusion reactor for decades, but it remains an elusive goal. Fun fact about the TV show Friends: The iconic Central Perk coffee shop from the show was actually filmed in a real coffee shop in Los Angeles called "The Coffee Bean & Tea Leaf." The show's creators chose this location because they wanted the coffee shop to have a cozy, inviting atmosphere that would feel like a second home to the characters.
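As a minimal sketch, assuming the HuggingFace transformers interface (with `input_text` and `output_text` holding the quoted strings above), a token-level attention matrix like this could be pulled out as follows. Averaging over layers and heads is my assumption about how the single matrix in the figure was produced.

```python
# Run the concatenated input-output text through LLaMA-2-7B and collect its
# self-attention weights.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

text = input_text + " " + output_text      # the prompt and response quoted above
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_attentions=True)

# out.attentions: one (batch, heads, seq_len, seq_len) tensor per layer.
# Average over layers and heads to get one seq_len x seq_len matrix to plot.
attn = torch.stack(out.attentions).mean(dim=(0, 2))[0].numpy()
```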

This weird prompt was chosen as a "toy" example so that we would have a response conducive to seeing a shifting focus, from fact #1 to fact #3. However, at least here at the token level, the main thing we see instead is that a lot of attention is being paid to the first token in the prompt: the start token, "<s>." In fact, this level of attention paid to "<s>" is so great that, when we combine the tokens into sentences and look at how the attention is distributed over these sentences, that is the primary feature we see.

A sentence-level attention matrix. Here, the color scale is 0-0.09, a significant decrease from the previous figure.
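A hedged sketch of the token-to-sentence grouping behind this figure. The exact aggregation used isn't stated; summing attention over the key tokens of each sentence and averaging over the query tokens of each sentence is one common choice, assumed here.

```python
import numpy as np

def sentence_attention(attn: np.ndarray, spans: list[tuple[int, int]]) -> np.ndarray:
    """attn: (seq_len, seq_len) token-level attention matrix; spans: one
    (start, end) token-index range per sentence, end exclusive."""
    n = len(spans)
    sent = np.zeros((n, n))
    for i, (qs, qe) in enumerate(spans):
        for j, (ks, ke) in enumerate(spans):
            block = attn[qs:qe, ks:ke]
            # Total attention paid to sentence j, averaged over the query
            # tokens of sentence i.
            sent[i, j] = block.sum(axis=1).mean()
    return sent
```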

Interestingly, there is a spike in attention on the sentence, "Fun fact ... history." However, this is the first sentence of the response, and even though special tokens are not shown on the axes of this graph, it also starts with the special token "<s>." Since that really skews the levels of attention, let's look at these sentences without the special tokens.

A "cleaner" version of the sentence-level attention matrix. Here, the color scale is 0-0.025. So, any changes represented here are very small.

A greater level of attention is still being given to the first sentence of the output, even when not counting the special tokens. Outside of this observation, it's hard to tell to what degree the model is placing "extra" attention on what humans would perceive to be the "most relevant" sentence of the prompt. I had hoped for much more dramatic results that might suggest that the model pays a whole bunch of attention to the most relevant sentence and almost none to the others. Perhaps a larger experiment would show a small but statistically significant difference that would be more definitive. But I don't think I'm going to go for it, mainly because I anticipate that, if the experiment did show that LLMs pay slightly more attention to more relevant sentences, I'd be left with a lingering, "So what?"


Me presenting at ICHI All4Health workshop. Picture credit: Robert Pless

Earlier this week, I attended the ALL4Health workshop at ICHI, in which I presented some early results from our ongoing pilot study with EHA Clinics. Here, I want to share some thoughts I had while listening to the day's talks.

The day reaffirmed for me how impactful the integration of LLMs in LMICs can be. An obvious potential critique of the practice is: if people are really interested in making quality care more accessible, why are they not funding efforts to train people who live there to be doctors, as opposed to implementing some weird, convoluted AI system? There are several reasons, which the workshop made apparent to me. Primarily, there are already many active healthcare institutions in these regions, as the existence of EHA Clinics and other organizations like it (many of which I learned about at the workshop) proves; upending these institutions and replacing them with something completely dependent on outside support and intervention is not ideal from a developmental standpoint. Additionally, an approach purely focused on skills development may ultimately undermine its own goals, since a skilled physician may end up leaving their home country for an opportunity to work in a wealthier one. In perhaps one of the most striking remarks of the day, one of the speakers referenced a quip that an Ethiopian colleague of theirs had made: "There may just be more Ethiopian ophthalmologists in Chicago than in Ethiopia."

It should also be noted that this movement toward LLM integration in clinical systems is happening in rich countries as well. Keynote speaker Mark Dredze spoke very openly about how, just a few years ago (the days of GPT-2), he firmly believed that people would not be able to build a medical chatbot to include in a real clinical setting. He pointed to how an ability to answer medical licensing questions accurately does not make one a good doctor, as well as the facts that clinical EHR data are often ambiguous and that the truly hard cases for doctors are those that don't match textbook presentation. However, he humorously admitted that he was dead wrong. LLMs are currently far more capable with medical data than he thought at the time they could be. Today, major tech and medical companies are partnering to realize in-clinic chatbot assistants. Now, the questions to be addressed are not whether they can be useful, but how exactly they should be used and how exactly their performance should be evaluated. Dr. Dredze personally advocated for LLMs being used to identify and recommend which specialist(s) a patient should see in their particular case, due to the incredible breadth of knowledge that LLMs have. This knowledge is useful not just for rich patients, and given how inexpensive a given query of, for example, GPT-4 is, it can and should be used in as many safe contexts as is feasible.

Like the paper I presented, all of the work I got to see focused primarily on the question of how to make LLM integration safe and feasible. In most cases, the work was concentrated on a single case study: a specific clinic in a defined region of sub-Saharan Africa, sometimes focused on a particular specialty, such as maternal care. A key objective for many of the studies was to identify and describe integration strategies that the human staff at the clinic liked and trusted. In line with this goal, many of the speakers presented rankings of model-prompt pairs by both algorithmic means (such as BLEU) and human feedback surveys. Perhaps one of the most interesting takeaways from the workshop is that (according to my personal notes), of the four talks that presented a ranking of models and prompts based on scores from both algorithmic evaluations and human feedback, each reported a different "best" model for the algorithmic ranking and the human feedback. The models that performed best on benchmark evaluations did not perform best in the feedback surveys. Although this is a small sample, it suggests that there still does not exist an algorithmic metric that adequately approximates how likely a human is to have a positive perception of the language model's output. At the same time, it is also unclear what exactly makes a "good" response. Is it politeness? Is it how clearly the information is organized and presented? These questions are particularly relevant to our ongoing project with eHealth Africa, as we continue to refine our prompt and understand what GPT does well and poorly.

Based on conversations I had with some of the organizers and attendees, there was a strong belief among those at the workshop that there would be another iteration of it next year, and that it would be bigger, as LLMs continue to be more widely adopted throughout the healthcare industry. Based on what I saw at this year's event, there are plenty of interesting unanswered questions related to this subject, so another workshop next year (and for several more years after that) would certainly be valuable.