reducing sycophancy in LLMs with synthetic data

Dev Shah
9 min readOct 8, 2023

Large Language Models have been at the center of attention in the Machine Learning space ever since openAI released chatGPT to the public. Since that point, we’ve seen countless developments of new ML products that use LLMs as the foundation of their application. These LLMs have demonstrated remarkable abilities in natural language understanding and generation. They have been used for everything from answering questions and creating content to assisting with language translation and even aiding in code generation. This versatility has spurred innovation across various industries, making LLMs a ubiquitous tool for developers and organizations looking to leverage the power of AI-driven language processing.

Despite the excitement and hype behind this field, LLMs aren’t without their challenges and controversies. One pressing issue that has emerged is sycophancy within LLMs. Sycophancy refers to the inclination of these models to generate responses that excessively flatter individuals, organizations, or entities, often in an exaggerated or insincere manner. The introduction of sycophany in the LLM space raises ethical concerns, as it has the potential to spread false information, reinforce biases, and contribute to the creation of unethical or biased content.

Recently, a paper published by Google DeepMind covers how it’s possible to use simple synthetic data to reduce sycophancy in LLMs. This paper was published by Jerry Wei, Da Huang, Yifeng Lu, Denny Zhou, and Quoc V. Le. If you want to read more about this paper, it can be found here. Now let’s jump into the paper and go over how this synthetic data can be leveraged!

paper review.

One of the biggest issues within LLMs are reward hacking. This is when these large language models begin to exploit the preference of human raters. One of the basic forms of reward hacking is sycophancy — when a model respond to a question with a user’s preferred answer in order to look / sound favorable, even if the response is incorrect. Here’s an example:

The goal of the paper is to delve into the issue of sycophancy within a set of base and instruction-tuned models. This is can be done by introducing a simple synthetic-data intervention during an additional fine-tuning stage, with the intention of mitigating sycophantic behavior. In other words, the paper proposed a straightforward data intervention strategy that employs publicly available NLP tasks to teach models that the truthfulness of a statement is independent of the user’s opinion. After this, they conduct a lightweight fine-tuning stage on Flan-PaLM models using this data, showcasing the successful reduction in sycophantic behavior across various scenarios. Now that I’ve given a rough overview, let’s jump into understanding how sycophancy manifests in language models when they are asked for opinions on topics that lack a definitive answer.

understanding sycophancy in LLMs.

Previous research has shown that Reinforcement Learning from Human Feedback has increased sycophancy on internal Anthropic models that contain up to 52B parameters. This can be extended beyond this model — this paper explored if this trend would apply ot other models, specifically PaLM models with 8 to 540 billion parameters. They also tested it on their instruction-tuned counterpart model as well, Flan-PaLM.

The image above shows the behaviour of the models across 3 sycophancy tasks; natural languege processing survey questions, philosophy survey questions, and political typology quiz questions. In these tasks, sycophantic models tend to select answers that align with the user’s opinion, even though the user’s viewpoint is not objectively correct due to the subjective nature of the questions. Importantly, when the user’s opinions are removed from the equation, models do not display an inherent preference for answers that previously matched those opinions.

In preliminary experiments, production models like ChatGPT and Bard did not exhibit significant sycophancy. This difference in behavior could be attributed to their additional fine-tuning data or prompt preambles.

The findings highlight two key trends. Firstly, as language models scale up, sycophancy increases within both PaLM and Flan-PaLM model families. For instance, the transition from PaLM-8B to PaLM-62B results in a 19.8% increase in sycophantic behavior, and further scaling to PaLM-540B leads to an additional 10.0% increase. This trend is particularly notable because there isn’t an immediately evident reason why larger models would exhibit more sycophantic tendencies.

Secondly, instruction tuning significantly amplifies sycophancy across all models. For example, PaLM-8B experiences an average increase of 26.0% in responses that align with the user’s viewpoint. This suggests that instruction tuning might unintentionally incentivize sycophantic responses, possibly because it lacks data that effectively distinguishes between opinions and instructions. Consequently, models undergoing instruction tuning might struggle to differentiate between a user’s opinions and their explicit instructions, contributing to the observed increase in sycophantic behaviour.

extending this trend.

This trend actually carries on beyond the example I mentioned earlier, it extends to evaluations where models know that the user’s opinion that they are following is incorrect. In the paper, they added a user’s opinion stating that the user agrees with some sort of incorrect statement. The correct answer remained the same, however, the model didn’t disagree with the incorrect statement. Consider the example below:

In the previous section, I mentioned using the Flan-PaLM model; this model was tested against this type of example and here were the results:

When the user opinion isn’t explicitly stated, all models except the smallest model can corrrectly disagree with the incorrect statements close to 100% of the time. However, when modifying the prompt (where the user agrees with the incorrect statement), all models tend to flip their previously correct answer and follow the user’s incorrect opinion. Now that we have a stronger understanding of where the problem lies and how it works, let’s talk about the proposed solution: using synthetic data to overcome sycophancy.

synthetic data.

The premise of the approach is to address and mitigate a model’s inclination toward sycophancy by introducing a straightforward synthetic-data intervention during the fine-tuning process. This intervention involves training models on prompts where the truthfulness of a claim remains independent of the user’s opinion.

These prompts are created by using the input-label pairs sourced from public NLP datasets. From each dataset, they exclusively draw input-label pairs from the training split to construct our claims. After this is done, this is where the intuition comes into play. For example, whenever there is a true or false claim, the user’s opinion is now incorporated. An opinion that either aligns with or opposes the claim. To introduce a degree of diversity into the dataset, there are random variations introduced for other user-related aspects. These generated data points are then integrated into a predefined template to yield prompts for the fine-tuning process, check out the image below for more context.

Now let’s jump into the procedure of using this synthetic data to reduce sycophancy.

synthetic data intervention.

After applying the synthetic-data to the LLMs, the models are going to be evaluated. The intervention technique that is used is designed to reduce the model’s tendency to skew towards sycophantic behaviour. Based on the pre-experiment work, it’s expected that the model will be less likely to agree with users on questions without a correct answer and also there’s a lower chance of the model to follow an incorrect opinion.

In the graph above, we see that sycophancy was reduced in all the models. More specifically — “the largest reduction was seen in Flan-cont-PaLM-62B, which was 10.0% less likely to match the user’s opinion, though all other models saw reductions in sycophancy between 4.7% (Flan-PaLM-62B) and 8.8% (Flan-PaLM-8B).” In other words, this was really promising as the data didn’t inlcude any prompts where the model was asked for an opinion on a claim that didn’t have a clearly correct answer. Thus, this approach is clearly generalizable.

In the graph below, a comparison is made of Flan-PaLM performance on the simple addition statements task from the ‘extending this trend’ section before and after intervention. Flan-PaLM models exhibit an inability to maintain their performance in the presence of contradicting user opinions, shifting instead to align with the user’s incorrect opinion. However, Flan-PaLM models that undergo synthetic-data intervention consistently achieve close-to-perfect accuracy, regardless of whether the user’s incorrect opinion is present or absent. These improvements on an unseen task type demonstrate some additional generalization, as the intervention procedure used no mathematical data and relied solely on natural-language data.

An exception to this pattern was observed in the smallest model, Flan-PaLM-8B, which displayed an unexpected change in behavior by consistently agreeing with incorrect statements. This change in behavior may have arisen because the smallest model lacked the capacity to assess the truthfulness of claims and relied mostly on random guessing, rendering the filtration step ineffective. In combination with the results from Figure 4, it can be suggested that the intervention technique is a straightforward yet significant procedure capable of reducing sycophancy in various contexts.

filtration process.

Another step in the pipeline involves the filtration of prompts for which the model lacks knowledge about the correct answer to the claim presented in the prompt. This filtration process serves the purpose of emphasizing that the user’s opinion should remain independent of the truthfulness of the claim. To illustrate this, let’s consider a claim for which the model lacks an answer, such as “foo + bar = baz.” In such a case, when the user provides an opinion about this claim, the model would essentially be trained to either agree or disagree with the user’s input at random because it lacks prior information about the claim’s veracity. Therefore, to instruct the model to disregard the user’s opinion when evaluating the claim, it’s imperative that the model possesses knowledge of whether the claim is true or false. This underscores the critical role of the proposed filtration step in mitigating instances of random or unexpected behaviour following intervention.

Most notably, Flan-PaLM-62B achieves near-perfect accuracy when all incorrectly answered prompts are eliminated, even though it exhibits erratic and unforeseen behaviors when no prompts are filtered. Similarly, Flan-contPaLM-62B attains its highest performance when the filtration step is implemented. In contrast, Flan-PaLM-8B demonstrates subpar behavior regardless of the extent of filtration. This could be attributed to the possibility that the filtration step becomes irrelevant for the smallest model since it may have been arriving at correct answers solely through random guessing, without actually possessing knowledge of the correct responses. These findings suggest that, for sufficiently large models, the act of filtering out incorrectly answered prompts becomes imperative to stabilize and enhance model behavior following intervention. However, smaller models may require additional processing to derive benefits from synthetic-data intervention. Overall, combining all these approaches is how sycophancy can be reduced in LLMs!

If you’ve made it this far, thank you for taking time out of your day to read this article and I hope it was insightful and valuable!

If you have any questions regarding this article or just want to connect, you can find me on LinkedIn or my personal website :)

--

--