Visualizing Confabulation Detection in Long LLM Responses by Watching Transition Vectors Change

Since we've been able to reproduce the results of "Do LLMs Know About Hallucination?", we have begun to ask whether the authors' technique can be applied to more realistic content. The datasets in that paper are composed almost entirely of questions whose answers are single-sentence or single-word factoids. In reality, however, LLMs can be quite wordy. The question now is: can we find the needles (confabulations) in this haystack (a paragraph)?

To test this, I took inspiration from their lists of the top 10 tokens associated with the directions of correctness and hallucination (Table 1) and attempted to apply the idea to our new task. Specifically, I tracked how close the transition vectors were to the word "True" (based on the model's token classification layer) over the course of the generation of the answer.
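A minimal sketch of this measurement, assuming a HuggingFace causal LM. Here a "transition vector" is taken to be the difference between consecutive final-layer hidden states, which is an assumption about the underlying definition; the score is simply the logit that the LM head assigns to the token "True" at each answer position. The model name is a stand-in, not the model actually used.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in; the actual model used is not specified here
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

true_id = tok.encode(" True")[0]  # token id for "True" (leading-space variant)

def true_scores(prompt: str, answer: str) -> list[float]:
    """Score each answer token's transition vector against the 'True' direction."""
    enc = tok(prompt + answer, return_tensors="pt")
    with torch.no_grad():
        out = model(**enc, output_hidden_states=True)
    hidden = out.hidden_states[-1][0]        # (seq_len, d_model), final layer
    transitions = hidden[1:] - hidden[:-1]   # consecutive-state differences
    logits = model.lm_head(transitions)      # project through the unembedding
    n_answer = len(tok.encode(answer))       # approximate boundary alignment
    return logits[-n_answer:, true_id].tolist()
```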

The dataset used is a subset of 20 SOAP notes from our EHA dataset, along with each note's respective AI-generated response. So our "question" for each datapoint in this set is, "I want you to act as an attending physician and give feedback on the SOAP note below," followed by the note. The "answers" are each note's AI-generated feedback. However, for each note, I manually changed the feedback in the medication section to say, "I think you prescribe Benadryl." Benadryl was not originally prescribed in any of the unedited feedback, so in each case Benadryl is an inappropriate medication to prescribe.
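For illustration, a hypothetical sketch of how each prompt/answer pair is assembled. The field names, the sentence-level structure, and the injection mechanism are assumptions; in the actual experiment the Benadryl sentence was inserted into the medication section by hand.

```python
INSTRUCTION = ("I want you to act as an attending physician and give "
               "feedback on the SOAP note below.\n\n")
INJECTED = "I think you prescribe Benadryl."

def build_datapoint(soap_note: str, feedback_sentences: list[str],
                    medication_idx: int) -> dict:
    """Replace the medication-section sentence with the injected confabulation."""
    perturbed = list(feedback_sentences)
    perturbed[medication_idx] = INJECTED
    return {
        "question": INSTRUCTION + soap_note,
        "answer": " ".join(perturbed),
    }
```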

Above is a graph showing the "swing" in the "True" classification value once the word "Benadryl" has been generated (in red) versus the word right before it (in blue). So each blue dot is a word that is not in any way actually wrong, followed by a red dot representing a medication that is clearly inappropriate for that patient. Each blue-red pair of tokens is from the same response. This visualization makes a compelling case for a clear "swing" away from "True" in the model's embedding space when a wrong word has been generated.
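A rough reconstruction of that first chart, assuming `scores` is a list of (before-Benadryl, Benadryl) "True"-score pairs, one per response, produced by something like `true_scores()` above; the layout is an approximation of the figure, not its actual plotting code.

```python
import matplotlib.pyplot as plt

def plot_swings(scores: list[tuple[float, float]]) -> None:
    """Plot each response's pre-Benadryl score (blue) next to its Benadryl score (red)."""
    for i, (before, benadryl) in enumerate(scores):
        plt.plot([i, i + 0.5], [before, benadryl], color="gray", alpha=0.4)
        plt.scatter(i, before, color="blue",
                    label="token before" if i == 0 else None)
        plt.scatter(i + 0.5, benadryl, color="red",
                    label="Benadryl" if i == 0 else None)
    plt.ylabel('Similarity to "True"')
    plt.xlabel("Response index")
    plt.legend()
    plt.show()
```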

However, when looking at a longer slice of each response (plotted below as a sketch), it becomes clear that the perceived downward swing in the first chart is not actually real. In particular, looking at the dark blue line representing the average, the expected drop-off at the end of the graph does not occur. This means that sheer similarity to "True" is not enough to detect a single-word confabulation. However, this is just one possible approach, and going forward I will be investigating more.
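A sketch of how the longer-window view can be drawn, assuming each response contributes a window of "True" scores aligned so that position 0 is the "Benadryl" token; the dark line is the per-position mean. This is an assumed reconstruction of the chart, not the original plotting code.

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_aligned_windows(windows: list[list[float]]) -> None:
    """Overlay per-response score windows aligned on the Benadryl token, plus their mean."""
    arr = np.array(windows)                            # (n_responses, window_len)
    x = np.arange(arr.shape[1]) - (arr.shape[1] - 1)   # ..., -2, -1, 0
    for row in arr:
        plt.plot(x, row, color="lightblue", alpha=0.5)
    plt.plot(x, arr.mean(axis=0), color="darkblue", linewidth=2, label="mean")
    plt.axvline(0, color="red", linestyle="--", label="Benadryl token")
    plt.ylabel('Similarity to "True"')
    plt.xlabel("Token position relative to Benadryl")
    plt.legend()
    plt.show()
```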
