There is a big difference between what the GradCAM attention visualization looks like for the last hidden non-pooling layer of SigLIP 2's base version and its large version (more specifically, the Hugging Face checkpoints "google/siglip2-base-patch16-256" and "google/siglip2-large-patch16-256"). At first, I was working only with the base version, and the performance was bad enough to make me think there was something wrong with my code. It turns out, though, that the same code that produces dubious results with the base version produces great ones with the large version! Let's compare.
Below is a small set of examples of SigLIP 2 performing zero-shot classification on pictures of dogs, with the five classes being "This is a photo of a [X]," where [X] is "cat," "dog," "car," "person," or "shoe." For each example I'll show the attention map for the base version on the left, the attention map for the large version on the right, and the value of the "dog" class for each version in the caption below.
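As a point of reference, the zero-shot setup itself is only a few lines with the transformers library. Here is a minimal sketch, assuming the standard SigLIP scoring convention (independent sigmoid scores per class rather than a softmax); the image path is just a placeholder:

```python
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModel

# Swap in "google/siglip2-large-patch16-256" to compare against the large version.
model_id = "google/siglip2-base-patch16-256"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)

labels = ["cat", "dog", "car", "person", "shoe"]
texts = [f"This is a photo of a {label}." for label in labels]
image = Image.open("dog_photo.jpg")  # placeholder path for one of the example images

# SigLIP was trained with padded text inputs, so pad the prompts to max length.
inputs = processor(text=texts, images=image, padding="max_length", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# SigLIP uses a sigmoid (not softmax) objective, so each class gets an independent score.
scores = torch.sigmoid(outputs.logits_per_image)[0]
for label, score in zip(labels, scores):
    print(f"{label}: {score:.3f}")
```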
In each of the four cases, the value assigned by the large version to the "dog" class is higher, indicating that the model has a higher confidence in that class. In fact, for the first three examples, "dog" is not even the highest-rated class by the base model, meaning that in the first three cases we are seeing a misclassification by the base model, and a correct classification by the large model. In the fourth case, both the base and the large model got it right.
What the base model's attention maps seem to have in common in the first three examples is that the attention is very "unfocused": it is spread all over the image rather than actually focused on the dog. Going forward, I'll be searching for literature on something along the lines of "attention concentration," that is, some sort of metric for how "focused" an activation map like this is on the pertinent features of the image. In other words, I wonder if I can use something like Segment Anything to isolate the dog in each image, and then see how much of the attention, as a percentage of the total, falls on the dog and how much falls outside of it.
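To make that idea concrete, here is a rough sketch of the kind of metric I have in mind. Nothing here comes from an existing paper; the function name and the assumption that the object mask comes from something like Segment Anything are my own:

```python
import numpy as np

def attention_concentration(attn_map: np.ndarray, object_mask: np.ndarray) -> float:
    """Fraction of total attention that falls inside the object mask.

    attn_map: non-negative 2D array (e.g., a GradCAM heatmap upsampled to image size).
    object_mask: boolean 2D array of the same shape, True on the object (e.g., the dog).
    """
    attn = np.clip(attn_map, 0, None)   # ignore any negative values in the map
    total = attn.sum()
    if total == 0:
        return 0.0
    return float(attn[object_mask].sum() / total)
```

A value near 1 would mean almost all of the attention lands on the dog; the hypothesis above is that the large model would score consistently higher than the base model on this kind of measure.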
Since we've been able to reproduce the results of "Do LLMs Know About Hallucination?", we have begun to ask whether the authors' technique can be applied to more realistic content. The datasets in that paper are composed almost entirely of questions whose answers are single-sentence or single-word factoids. In reality, however, LLMs can be quite wordy. The question now is: can we find the needles (confabulations) in this haystack (a paragraph)?
To test this, I took inspiration from their lists of the top 10 tokens associated with the directions of correctness and hallucination (Table 1) and attempted to apply the idea to our new task. Specifically, I tracked how close the transition vectors were to the word "True" (based on the model's token classification layer) over the course of the generation of the answer.
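For concreteness, here is roughly how that signal can be computed. The interpretation of "transition vectors" as differences between consecutive hidden states, the choice of model, and the use of the last hidden layer are all assumptions in this sketch, not a faithful reproduction of the paper's exact procedure:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"  # illustrative; any causal LM with an unembedding layer works
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, output_hidden_states=True)
model.eval()

# The "True" direction, read off the model's token classification (unembedding) layer.
true_id = tokenizer.encode("True", add_special_tokens=False)[0]
true_direction = model.lm_head.weight[true_id]

def true_similarity_per_token(prompt: str, answer: str) -> torch.Tensor:
    """Cosine similarity between each answer token's transition vector and the 'True' direction."""
    ids = tokenizer(prompt + answer, return_tensors="pt").input_ids
    with torch.no_grad():
        hidden = model(ids).hidden_states[-1][0]   # (seq_len, hidden_dim), last layer
    # Token count of the answer alone (tokenization at the boundary is approximate).
    answer_len = len(tokenizer.encode(answer, add_special_tokens=False))
    states = hidden[-answer_len:]                  # hidden states for the answer tokens
    transitions = states[1:] - states[:-1]         # per-token transition vectors
    return torch.nn.functional.cosine_similarity(
        transitions, true_direction.unsqueeze(0), dim=-1
    )
```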
The dataset used is a subset of 20 SOAP notes from our EHA dataset, along with each note's respective AI-generated response. So, our "question" for each datapoint in this set is, "I want you to act as an attending physician and give feedback on the SOAP note below," followed by the note. The "answers" are each note's AI-generated feedback. However, for each note, I've manually changed the feedback in the medication section to say, "I think you prescribe Benadryl." Benadryl was not originally prescribed in any of the unedited feedback, which means that in each case, Benadryl is an inappropriate medication to prescribe.
Above is a graph showing the "swings" in the classification value for "True" at the token where the word "Benadryl" is generated (in red) and at the word right before it (in blue). So, each blue dot is a word that is not in any way actually wrong, followed by a red dot representing a medication that is clearly inappropriate for that patient. Each blue-red pair of tokens is from the same response. This visualization makes a compelling case for there being a clear "swing" away from "True" in the model's embedding space when a wrong word has been generated.
However, when looking at a longer slice of each response, it becomes clear that the perceived downward swing in the first chart is not actually real. Looking especially at the dark blue line representing the average, the expected "drop-off" at the end of the graph does not occur. This means that sheer similarity to "True" is not enough for us to detect a single-word confabulation. However, this is just one possible approach, and going forward I will be investigating others.
But wait! There's more! We may be able to learn something about the way LLMs work not only from which sentence receives the most attention, but also from the rest of the ranking, from 2nd place down to last place.
I'll back up and say a little bit more about what exactly I mean when I ask, "Which sentence receives the most attention?" In the paper we presented at ICHI, we remarked on how in some cases it seemed like GPT was "paying lip service" to certain aspects of our prompt, not actually changing its answers to match concerns we would raise about cost, time, etc.
We'd insert into our prompt something to the effect of, "These tests can be very expensive, so do not recommend ones that are not relevant to the patient's acute condition." Then GPT would respond with the same test recommendations it did for our "base" prompt, but it would throw in a sentence about being careful not to recommend too many tests. This would suggest that, in the sentence with the actual recommendations, it was paying a lot of attention to the list of symptoms, and in the sentence after it (where it warns against over-recommending tests), it was paying a lot of attention to the warning in our prompt.
This raises the question of whether or not we can actually observe such a pattern at the sentence level. What would an attention map, like the one below, look like grouped into sentences?
In order to investigate this question, we unfortunately have to look at a model that isn't actually in the GPT-4 family, since access to those models' attention values is not available. Additionally, I wanted to build a proof-of-concept that I could run on my own computer, so I picked the 7 billion parameter version of LLaMA 2 available on HuggingFace.
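Here is a sketch of how a token-level attention matrix like the one below can be pulled out of the model. Averaging over all layers and heads is my own choice for visualization, and the chat variant of the checkpoint is an assumption:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-chat-hf"  # assumed chat checkpoint of the 7B model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, output_attentions=True)
model.eval()

def token_attention_matrix(text: str) -> torch.Tensor:
    """Average attention over all layers and heads for a single forward pass."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids)
    # out.attentions is a tuple with one (1, n_heads, seq_len, seq_len) tensor per layer.
    stacked = torch.stack(out.attentions)   # (n_layers, 1, n_heads, seq_len, seq_len)
    return stacked.mean(dim=(0, 2))[0]      # (seq_len, seq_len); row i = where token i attends
```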
An attention matrix for the input-output pair (defined below) as processed by LLaMA 2. The x and y axes are the tokens of the input-output pair, starting with the first token at the origin (the (0, 0) point on most other graphs) and ending with the last token. Here, the color scale is 0-1; this will change in other figures.
Above we can see the attention matrix for the following input and output to LLaMA 2:

Input: Tell me a fun fact about the Holy Roman Empire. Then, tell me a fun fact about nuclear fusion. Then, tell me a fun fact about the TV show Friends. Please tell me one fact for each.

Output: Fun fact about the Holy Roman Empire: The Holy Roman Empire was the first multinational state in European history. It was a political entity that existed from 962 to 1806 and included territories in modern-day Germany, Austria, Switzerland, Belgium, the Netherlands, and parts of France. Fun fact about nuclear fusion: Nuclear fusion is the process by which atomic nuclei combine to form a heavier nucleus, releasing energy in the process. It is the same process that powers the sun and other stars. Scientists have been working on developing a practical fusion reactor for decades, but it remains an elusive goal. Fun fact about the TV show Friends: The iconic Central Perk coffee shop from the show was actually filmed in a real coffee shop in Los Angeles called "The Coffee Bean & Tea Leaf." The show's creators chose this location because they wanted the coffee shop to have a cozy, inviting atmosphere that would feel like a second home to the characters.
This weird prompt was chosen as a "toy" example so that we would have a response conducive to seeing a shifting focus, from fact #1 to fact #3. However, at least here at the token level, the main thing we see instead is that a lot of attention is being paid to the first token in the prompt: the start token, "<s>." In fact, this level of attention paid to "<s>" is so great that, when we combine the tokens into sentences and look at how the attention is distributed over these sentences, that is the primary feature we see.
A sentence-level attention matrix. Here, the color scale is 0-0.09, a significant decrease from the previous figure.
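For reference, the grouping into sentences is a simple aggregation over token spans. This is my reconstruction rather than a unique recipe; in particular, summing over the attended-to (key) tokens and averaging over the attending (query) tokens is one of several reasonable choices:

```python
import torch

def sentence_attention(attn: torch.Tensor, sentence_spans, keep_mask=None) -> torch.Tensor:
    """Collapse a (seq_len, seq_len) token attention matrix into sentence-to-sentence attention.

    sentence_spans: list of (start, end) token index ranges, one per sentence.
    keep_mask: optional boolean tensor over tokens; False entries (e.g., special tokens
               like "<s>") are excluded from the attended-to side, as in the "cleaner"
               figure further down.
    """
    if keep_mask is not None:
        attn = attn * keep_mask.to(attn.dtype).unsqueeze(0)  # zero out attention paid to dropped tokens
    n = len(sentence_spans)
    out = torch.zeros(n, n)
    for i, (qs, qe) in enumerate(sentence_spans):
        for j, (ks, ke) in enumerate(sentence_spans):
            # Attention from sentence i to sentence j: sum over key tokens, mean over query tokens.
            out[i, j] = attn[qs:qe, ks:ke].sum(dim=1).mean()
    return out
```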
Interestingly, there is a spike in attention on the sentence, "Fun fact ... history." However, this is the first sentence of the response, and even though special tokens are not shown on the axes of this graph, it also starts with the special token "<s>." Since that really skews the levels of attention, let's look at these sentences without the special tokens.
A "cleaner" version of the sentence-level attention matrix. Here, the color scale is 0-0.025. So, any changes represented here are very small.
A greater level of attention is still being given to the first sentence of the output, even when not counting the special tokens. Outside of this observation, it's hard to tell to what degree the model is placing "extra" attention on what humans would perceive to be the "most relevant" sentence of the prompt. I had hoped for much more dramatic results that might suggest the model pays a whole bunch of attention to the most relevant sentence and almost none to the others. Perhaps a larger experiment would show a statistically significant difference that would be more definitive. But I don't think I'm going to go for it, mainly because I anticipate that, if the experiment did show that LLMs pay slightly more attention to more relevant sentences, I'd be left with a lingering, "So what?"
Me presenting at the ICHI ALL4Health workshop. Picture credit: Robert Pless
Earlier this week, I attended the ALL4Health workshop at ICHI, in which I presented some early results from our ongoing pilot study with EHA Clinics. Here, I want to share some thoughts I had while listening to the day's talks.
The day reaffirmed for me how impactful the integration of LLMs in LMICs can be. An obvious potential critique of the practice is: if people are really interested in making quality care more accessible, why are they not funding efforts to train people who live there to be doctors, as opposed to implementing some weird, convoluted AI system? There are several reasons that the workshop made apparent to me. Primarily, there are already many active healthcare institutions in these regions, as the existence of EHA Clinics and other organizations like it (many of which I learned about at the workshop) proves; upending these institutions and replacing them with something completely dependent on outside support and intervention is not ideal from a developmental standpoint. Additionally, an approach purely focused on skills development may ultimately undermine its own goals, since a skilled physician may end up leaving their home country for an opportunity to work in a wealthier one. In perhaps one of the most striking remarks of the day, one of the speakers referenced a quip that an Ethiopian colleague of theirs had made: "There may just be more Ethiopian ophthalmologists in Chicago than in Ethiopia."
It should also be noted that this movement towards LLM integration in clinical systems is happening in rich countries as well. Keynote speaker Mark Dredze spoke very openly about how, just a few years ago (the days of GPT-2), he firmly believed that people would not be able to build a medical chatbot fit for a real clinical setting. He pointed to how an ability to answer medical licensing questions accurately does not make one a good doctor, as well as to the facts that clinical EHR data are often ambiguous and that the truly hard cases for doctors are those that don't match textbook presentation. However, he humorously admitted that he was dead wrong. LLMs are currently far more capable with medical data than he thought at the time they could be. Today, major tech and medical companies are partnering to realize in-clinic chatbot assistants. Now, the questions to be addressed are not whether they can be useful, but how exactly they should be used and how exactly their performance should be evaluated. Dr. Dredze personally advocated for LLMs being used to identify and recommend which specialist(s) a patient should see in their particular case, due to the incredible breadth of knowledge that LLMs have. This knowledge is useful not just for rich patients, and given how inexpensive a single query to, for example, GPT-4 is, it can and should be used in as many safe contexts as is feasible.
Like the paper I presented, all of the work I got to see focused on the question of how to make LLM integration safe and feasible. In most cases, the work was concentrated on a single case study---a specific clinic in a defined region of sub-Saharan Africa, sometimes focused on a particular specialty, such as maternal care. A key objective for many of the studies was to identify and describe integration strategies that the human staff at the clinic liked and trusted. In line with this goal, many of the speakers presented rankings of model-prompt pairs by both algorithmic means (such as BLEU) and human feedback surveys. Perhaps one of the most interesting takeaways from the workshop is that (according to my personal notes), of the four talks that presented rankings of models and prompts based on scores from both algorithmic evaluations and human feedback, each reported a different "best" model for the algorithmic ranking than for the human feedback. The models that performed best on benchmark evaluations did not perform best in the feedback surveys. Although this is a small sample, it suggests that there still does not exist an algorithmic metric that adequately approximates how likely a human is to have a positive perception of the language model's output. At the same time, it is also unclear what exactly makes a "good" response. Is it politeness? Is it how clearly the information is organized and laid out? These questions are particularly relevant to our ongoing project with eHealth Africa, as we continue to refine our prompt and understand what GPT does well and poorly.
Based on a conversation I was able to have with some of the organizers and attendees, there was a strong belief among those at the workshop that there would be another iteration of it next year, and that it would be bigger, as LLMs continue to be more widely adopted throughout the healthcare industry. Based on what I saw at this year's event, I would say that there are plenty of interesting questions related to this subject that are unanswered, so another workshop next year (and for several more years after that) would certainly be valuable.