IBM’s unique training program is designed to help the LLM assimilate new knowledge and skills quickly and efficiently during alignment. Typically, new knowledge is added during pre-training, the most time-consuming and computationally intensive part of AI development.
The model is first fed simple instructions, followed by longer, narrative-like instructions corresponding to the knowledge and foundational skills needed for the target task.
In the second phase, the model is trained on the kinds of task-specific skills needed to write a corporate earnings email, things like summarizing information and putting key details in context. “It turns out that the order matters,” said Srivastava. “We confirmed empirically that the model struggles to assimilate new knowledge if you try to teach it complex skills first.”
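The ordering described above can be sketched as a simple curriculum builder. This is a minimal illustration, not IBM's implementation: the `Instruction` record, the `kind` labels, and the use of text length as a stand-in for "simple" versus "longer, narrative-like" are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Instruction:
    text: str
    kind: str  # hypothetical label: "knowledge" or "skill"


def build_curriculum(instructions):
    """Return [phase_1, phase_2]: knowledge first, task-specific skills second."""
    # Phase 1: knowledge and foundational material, shortest first.
    # Length is only an assumed proxy for "simple before narrative-like".
    phase_1 = sorted(
        (i for i in instructions if i.kind == "knowledge"),
        key=lambda i: len(i.text),
    )
    # Phase 2: task-specific skills, kept in their original order.
    phase_2 = [i for i in instructions if i.kind == "skill"]
    return [phase_1, phase_2]
```

A training loop would then consume `phase_1` in full before touching `phase_2`, which mirrors the finding that teaching complex skills first makes new knowledge harder to absorb.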
The team also found they got better results when they trained the model at a low learning rate, with an extended warm-up, and incorporated the data in large batches. They also used replay buffers, where a small subset of data from early training is reinjected at the end of the process, to prevent the model from overwriting what it learned before.
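Two of these ideas, the extended warm-up and the replay buffer, can be shown in a few lines. The sketch below is hedged: the article gives no hyperparameters, so `peak_lr`, `warmup_steps`, and `replay_fraction` are placeholder values, the linear warm-up shape is an assumption, and both function names are hypothetical.

```python
import random


def warmup_lr(step, peak_lr=2e-5, warmup_steps=1000):
    """Linear warm-up to a low peak learning rate, then hold it constant.

    All numbers here are illustrative placeholders, not reported values.
    """
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    return peak_lr


def with_replay(early_data, late_data, replay_fraction=0.05, seed=0):
    """Append a small random sample of early data to the end of late data."""
    rng = random.Random(seed)
    # Take at least one example when early data exists, never more than available.
    k = min(len(early_data), max(1, int(len(early_data) * replay_fraction)))
    replay = rng.sample(early_data, k)
    # Replayed examples are reinjected at the end of the process.
    return list(late_data) + replay
```

Placing the replayed examples last means the final updates the model sees include material from early training, which is the intuition behind using replay to limit forgetting.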
IBM Research generated a synthetic dataset of 1.2 million instructions with the LAB method and trained two open-source LLMs on the data: Labradorite 13B (built on Meta’s Llama-2-13B model) and Merlinite 7B (built on the Mistral 7B model). They found that their aligned models were competitive with state-of-the-art chatbots on a range of benchmarks, including ones for coherent and engaging conversation and common sense reasoning.
IBM’s Labradorite and Merlinite models outperformed not only chatbots aligned on human-generated data but also models aligned on significantly more synthetic data, including Microsoft’s Orca-2 chatbot, which was trained on 15 million instructions generated by the behemoth GPT-4 model.
IBM also used LAB to significantly improve its own enterprise-focused Granite models on IBM watsonx.
LAB has two distinguishing traits that help explain these results. First, the teacher model generates synthetic examples from each leaf node of the taxonomy, producing much broader coverage of target tasks. Other methods use random sampling, which limits the breadth of the data generated.
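The difference between per-leaf generation and random sampling is easiest to see in code. The following is a rough sketch under stated assumptions: the taxonomy is modeled as a nested dictionary whose leaves hold seed examples, and `stub_teacher` is a hypothetical stand-in for a call to a real teacher model.

```python
def iter_leaves(taxonomy, path=()):
    """Yield (path, seeds) for every leaf in a nested-dict taxonomy."""
    for name, child in taxonomy.items():
        if isinstance(child, dict):
            # Internal node: keep descending.
            yield from iter_leaves(child, path + (name,))
        else:
            # Leaf node: `child` holds that leaf's seed examples (assumed layout).
            yield path + (name,), child


def stub_teacher(path, seeds, n):
    """Hypothetical placeholder for a teacher model call."""
    topic = "/".join(path)
    return [f"[{topic}] synthetic instruction {i + 1}" for i in range(n)]


def generate_per_leaf(taxonomy, teacher=stub_teacher, n_per_leaf=2):
    """Ask the teacher for examples at every leaf, so no branch is skipped."""
    data = []
    for path, seeds in iter_leaves(taxonomy):
        data.extend(teacher(path, seeds, n_per_leaf))
    return data
```

Because the loop visits every leaf exactly once, each branch of the taxonomy contributes examples, whereas drawing prompts at random gives no such guarantee for rarely sampled topics.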
LAB also allows you to add new knowledge and skills to the base LLM without having to incorporate this information into the teacher model as well. “This means you don’t need some all-powerful teacher model that distills its capabilities into the base model,” said David Cox, vice president for AI models at IBM Research.
It also allows LLM developers to generate their own instructions without having to worry about the legality of using proprietary LLMs like GPT-4 to generate synthetic data.
IBM’s LAB method grew out of the team’s insight that great alignment data can bring advanced capabilities to smaller, more cost-effective models that can be tailored for enterprise needs. Pre-training is important, but giving the model highly curated task-specific instructions is just as important.
“The brilliant part of it is that it’s far easier to improve your chatbot during alignment than it is during initial training,” said Cox. “This method levels the playing field, allowing smaller open-source models to compete with models pre-trained on thousands of GPUs and aligned with human-generated instructions.”