Source: Synthesis AI Blog

Fine-Tuning LLMs: RLHF, LoRA, and Instruction Tuning

We continue our series on generative AI. We have discussed Transformers, large language models, and some specific aspects of Transformers, but are modern LLMs still running on the exact same Transformer decoders as the original GPT? Yes and no; while the basics remain the same, there has been a lot of progress in recent years. Today, we briefly review some of the most important ideas in fine-tuning LLMs: RLHF, LoRA, instruction tuning, and recursive self-improvement. These ideas are key in turning a token prediction machine into a useful tool for practical applications.

From GPT to GPT-4: What Has Been Changing?

For over a year, I have been writing about generative AI on this blog. Recently, we have discussed the basic architecture of this latest generative AI revolution: the Transformer. We have also considered modern LLMs and even reasons to worry about their future development, and we have discussed in detail one specific avenue of progress: how to extend the context window size in Transformers, alleviating the quadratic complexity of self-attention.

But has this fight for context windows been the entire difference between the original Transformer and the latest GPT-4, Gemini 1.5, and the rest? Is there anything else except for "stacking more layers"? Sure there is, and today we discuss it in more detail.

Before proceeding further, I have to warn you that the new ideas, and especially the engineering implementation details, of the very latest large language models are not being released to the public. There is no definitive paper about GPT-4's internal structure (let alone plans for GPT-5) written by OpenAI researchers. Still, there are plenty of ideas floating around, and plenty of information already available from previous attempts by leading labs, from publicly released models such as the Llama family, and from independent research efforts. So while I'm not claiming to show you the full picture today, I still hope to give a wide enough survey. Our plan is as follows:

- we begin with the most important advance that made GPT-3 into ChatGPT and kickstarted the LLM revolution: reinforcement learning with human feedback (RLHF);
- then we discuss fine-tuning pretrained LLMs on small datasets by learning adapters for the mostly frozen weights of the base models; the most popular and efficient tool here has been low-rank adapters (LoRA);
- next, we consider instruction tuning, where LLMs are themselves fine-tuned on datasets providing realistic prompt-response samples rather than token prediction in arbitrary text; the main question here is where to get the data, and we discuss both manually labeled datasets and our main focus, synthetic data;
- finally, synthetic data for LLM fine-tuning and reinforcement learning come together in our last topic: attempts at recursive self-improvement, where the LLM may become smarter by bootstrapping from its own outputs.

All of these techniques, and more, are key to efficiently using LLMs for practical problems, especially for specific applications such as mathematical reasoning or programming; we will see many such examples below.

Giving Humans What They Want: RLHF

You have certainly heard of reinforcement learning with human feedback (RLHF). This is the secret sauce that turned GPT-3, an amazing but hard-to-use token prediction machine, into ChatGPT, an LLM that keeps turning the world upside down.
But how does it work, exactly?

We don't often talk about reinforcement learning (RL) on this blog; probably the only notable exception was my last post on world models, where RL was featured very prominently. In general, it is a separate way of doing machine learning, in addition to supervised and unsupervised learning:

- In supervised learning, you have a labeled dataset and want to learn a conditional distribution of labels given the data points.
- In unsupervised learning, there are no labels; you just mine the data for structure, learning the joint distribution of all variables. For example, token prediction is pure classification, a supervised learning problem of learning p(y|x) for a text prompt x and next token y, but we can also say that as a result, the language model has implicitly learned a distribution over text snippets p(x), because it can generate whole texts in an autoregressive fashion.
- In reinforcement learning, there is no prior dataset: a learning agent is just "living" in an environment, getting rewards based on actions that it takes and trying to maximize these rewards.

In the last post, we discussed the distinction between several different approaches to RL, such as policy gradient and actor-critic algorithms. But be it with a world model or without, RL and training an LLM sound very different, right?

RLHF started with the work of OpenAI researchers Christiano et al. (2017). Paul Christiano is one of the leading figures in the field of AI alignment, and this work was also motivated by a problem that sounds more like alignment: how do we tell, for instance, a robot what exactly we want it to do? Unless we are in a self-contained formal system such as chess, any reward function that we could formulate in the real world might be superficially optimized in ways that are hard to predict but that do not give us what we want. It is well known, for example, that robots learning complex behaviors in simulated environments often learn more about the bugs and computational limits of the simulator than about the desired behavior in the real world; for more details see, e.g., Lehman et al., 2018 or the list of specification gaming examples by Krakovna et al.

Thus, Christiano et al. suggested that since we most probably cannot define what we want formally, we can instead ask a human: you know it when you see it. Human feedback would define how well the system's current behavior matches the actual hard-to-define goal; that feedback might be provided in the form of comparing two responses or two outcomes and preferring one of them. This approach, however, is impractical: we cannot ask humans to label as much data as is actually necessary to train a reinforcement learning model. Therefore, the idea of Christiano et al. is to train a separate model that encodes user preferences and predicts the reward used in actual RL training.

The general scheme of this training is as follows. The human providing feedback cannot assign a numerical reward value, so instead they compare pairs of "actions" (in the case of Christiano et al., actions were short sequences of Atari game playing or a robot walking) and give pairwise preferences. As a result, the dataset looks like a set of pairs $\mathcal{D} = \{(\sigma^1, \sigma^2, \mu)_n\}_{n=1}^{N}$, where $\sigma^i = ((o^i_0, a^i_0), (o^i_1, a^i_1), \ldots, (o^i_{k_i}, a^i_{k_i}))$ are sequences of observation-action pairs that describe a trajectory in the reinforcement learning environment, and $\mu$ is a probability distribution specifying whether the user preferred $\sigma^1$, $\sigma^2$, or had an equal preference (uniform $\mu$).
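Concretely, a single record of such a preference dataset is just two trajectory segments plus the rater's verdict. Here is a minimal Python sketch of what one record might look like; the class and field names are illustrative rather than taken from the paper, and observations are left as plain lists of floats for simplicity.

```python
from dataclasses import dataclass
from typing import List, Tuple

# A trajectory segment: a sequence of (observation, action) pairs.
# In Christiano et al. (2017) these were short clips of Atari play
# or simulated robot motion; here observations are plain float lists.
Segment = List[Tuple[List[float], int]]

@dataclass
class PreferencePair:
    """One element of the dataset D = {(sigma^1, sigma^2, mu)_n}."""
    sigma1: Segment          # first segment shown to the human rater
    sigma2: Segment          # second segment shown to the human rater
    mu: Tuple[float, float]  # preference distribution over the two segments:
                             # (1.0, 0.0) if sigma1 was preferred,
                             # (0.5, 0.5) if the rater was indifferent

# Example record: the rater preferred the first segment.
record = PreferencePair(
    sigma1=[([0.1, 0.4], 2), ([0.2, 0.3], 0)],
    sigma2=[([0.0, 0.9], 1), ([0.5, 0.5], 1)],
    mu=(1.0, 0.0),
)
```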
To convert pairwise preferences into a reward function, this approach uses the assumptions of Bradley-Terry models for learning a rating function from pairwise preferences (Bradley, Terry, 1952). The problem setting for a Bradley-Terry model is a set of pairwise comparisons such as the results of, e.g., chess games between players, and the basic assumption is that the probability of player $i$ winning over player $j$ can be modeled as

$$p(i \succ j) = \frac{e^{\gamma_i}}{e^{\gamma_i} + e^{\gamma_j}}$$

for some rating values $\gamma_i, \gamma_j \in \mathbb{R}$. Then Bradley-Terry models provide algorithms to maximize the total likelihood of a dataset with such pairwise comparisons, usually based on minorization-maximization algorithms, a generalization of the basic idea of the EM algorithm (Hunter, 2004; see also a discussion of EM below).

In the case of RL from human preferences, we need a further assumption, since $\gamma_i$ has to be a function of $\sigma^i$; Christiano et al. (2017) modeled it as a product of exponential rewards over the sequence:

$$\hat{P}(\sigma^1 \succ \sigma^2) = \frac{\exp \sum_t \hat{r}(o^1_t, a^1_t)}{\exp \sum_t \hat{r}(o^1_t, a^1_t) + \exp \sum_t \hat{r}(o^2_t, a^2_t)},$$

and then the loss function for the neural network $\hat{r}$ can be defined as

$$\mathrm{loss}(\hat{r}) = -\sum_{(\sigma^1, \sigma^2, \mu) \in \mathcal{D}} \left[ \mu(1) \log \hat{P}(\sigma^1 \succ \sigma^2) + \mu(2) \log \hat{P}(\sigma^2 \succ \sigma^1) \right]$$

(a code sketch of this loss appears below).

It might seem that this idea just shifts the impractical part of providing human feedback during RL training to an equally impractical task of providing enough human feedback to train a reward prediction model. However, it turned out that with this approach, it only takes a few hundred queries to a human rater to learn walking or hopping in the MuJoCo simulated environment (Todorov et al., 2012), and if you are willing to go over 1000 queries, you might even get better results than pure reinforcement learning! The latter effect is probably due to reward shaping (Wiewiora, 2010): when we humans rate behaviors, we impose an ordering where sequences closer to the goal are rated higher, and the resulting rewards provide more information to the agent than just a binary label of whether the task has been done successfully.

By the way, this work also contains a very interesting example of reinforcement learning gone rogue. In one experiment, a robotic hand was supposed to grasp a ball, and human evaluators watching videos of the hand were asked to check whether the grasping had been successful. But since the scene had only one virtual camera, and with such a uniform background depth estimation was hard for humans, the robot learned to position the hand between the ball and the camera so as to appear as if it were grasping the ball rather than actually doing it! This is an excellent example of what is known as specification gaming, when machine learning models converge on behaviors that had not been intended by the developers but that indeed optimize the objective function they specified; we have talked about possible problems resulting from such effects on the blog before.

The ideas of Christiano et al. have been continued in many works. In particular, there have been extensions to k-wise comparisons, with a specially developed maximum likelihood estimator (Zhu et al., 2023), and to vague feedback, where a human evaluator can only reliably distinguish two samples if
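To make the reward-model training loss above concrete, here is a minimal PyTorch sketch under some simplifying assumptions: observations and actions are plain vectors, preference pairs are processed one at a time, and the per-step reward network is a small MLP chosen purely for illustration; none of these choices come from the original paper.

```python
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Per-step reward r_hat(o, a); a small MLP is enough for the sketch."""
    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1)
        )

    def forward(self, obs: torch.Tensor, act: torch.Tensor) -> torch.Tensor:
        # obs: (T, obs_dim), act: (T, act_dim) -> scalar sum of rewards over the segment
        return self.net(torch.cat([obs, act], dim=-1)).sum()

def preference_loss(reward_model, seg1, seg2, mu):
    """Bradley-Terry cross-entropy loss on one preference pair.

    seg1, seg2: (obs, act) tensors for the two segments; mu: (mu(1), mu(2))."""
    r1 = reward_model(*seg1)   # sum of predicted rewards over sigma^1
    r2 = reward_model(*seg2)   # sum of predicted rewards over sigma^2
    # log P_hat(sigma^1 > sigma^2) and log P_hat(sigma^2 > sigma^1):
    # softmax over the two summed rewards is exactly exp(r1) / (exp(r1) + exp(r2))
    log_p = torch.log_softmax(torch.stack([r1, r2]), dim=0)
    mu = torch.tensor(mu)
    return -(mu * log_p).sum()

# Toy usage: random segments of length 5 in a 4-dim observation / 2-dim action space.
rm = RewardModel(obs_dim=4, act_dim=2)
seg1 = (torch.randn(5, 4), torch.randn(5, 2))
seg2 = (torch.randn(5, 4), torch.randn(5, 2))
loss = preference_loss(rm, seg1, seg2, mu=(1.0, 0.0))
loss.backward()
```

The key point is that the softmax over the two summed rewards computes exactly the Bradley-Terry probabilities, so the loss reduces to a two-class cross-entropy with soft labels given by the human preference distribution μ.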
