With recent breakthroughs in artificial intelligence, Large Language Models (LLMs) have become increasingly prevalent. Over the past few years, researchers have made rapid progress on complex language tasks by training these models on vast amounts of data so that they can capture intricate language patterns and generate coherent responses. One area that has drawn particular interest from researchers and developers is applying LLMs to long-form content, where a model must take a broader context into account. Such tasks range from relatively simple ones like text summarization and code generation to more complex problems like protein structure prediction and information retrieval. Long sequences carry information in diverse forms, such as paragraphs, tables, and images, so LLMs must be trained to process and understand these elements. By modeling long-distance structural dependencies, LLMs can also identify connections between different parts of a text and extract the most relevant information. Exposure to this broader context allows LLMs to give more accurate and contextually relevant answers to user queries.
Yet, despite the many potential use cases, most available open-source LLMs, from Meta’s LLaMA to MosaicML’s MPT, have been trained on sequences of at most 2K tokens. This limit makes modeling longer sequences a significant challenge. In addition, previous research on model scaling has shown that, for a fixed computational budget, smaller models trained on more tokens can outperform larger models. Motivated by this problem and these findings, Salesforce Research introduced XGen-7B, a series of 7B-parameter LLMs trained on sequences of up to 8K tokens for up to 1.5 trillion tokens. The series includes XGen-7B-4K-Base (4K sequence length), XGen-7B-8K-Base (8K sequence length), and XGen-7B-8K-Inst, which has been fine-tuned on public-domain instructional data and is released for research purposes only. On standard NLP benchmarks, XGen achieves comparable or better results than other state-of-the-art LLMs of similar size, such as MPT, Falcon, and LLaMA.
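The fixed-budget trade-off mentioned above can be made concrete with a little arithmetic. The sketch below assumes the common rule of thumb that training cost is roughly 6 × parameters × tokens in FLOPs; that approximation, and the 13B and 30B comparison sizes, are illustrative assumptions and not figures from the XGen release. Only the 7B parameter count and the 1.5 trillion token count come from the article.

```python
# Rule-of-thumb training cost: FLOPs ~ 6 * parameters * tokens.
# This approximation is an assumption of the sketch, not a number
# reported by Salesforce Research.

def training_flops(n_params: float, n_tokens: float) -> float:
    """Approximate training compute for a dense transformer."""
    return 6.0 * n_params * n_tokens


def tokens_for_budget(flops_budget: float, n_params: float) -> float:
    """Tokens a model of the given size can see under a fixed budget."""
    return flops_budget / (6.0 * n_params)


# Budget implied by a 7B-parameter model trained on 1.5T tokens
# (both figures are taken from the article).
budget = training_flops(7e9, 1.5e12)

# Hypothetical larger models given the same budget see fewer tokens.
for n_params in (7e9, 13e9, 30e9):
    tokens = tokens_for_budget(budget, n_params)
    print(f"{n_params / 1e9:>4.0f}B params -> {tokens / 1e12:.2f}T tokens")
```

Under this approximation, doubling the parameter count halves the number of tokens the same budget can pay for, which is the trade-off the scaling results speak to.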
The XGen-7B models were trained using JaxFormer, Salesforce’s in-house library, which enables efficient training of LLMs with data and model parallelism optimized for TPU-v4 hardware. The training recipe followed that of LLaMA, with two additional investigations. The first looked at “loss spikes,” where the loss suddenly and temporarily increases during training without a clear underlying cause. Although the root cause of these spikes remains unknown, the researchers identified factors such as “sequential over parallel circuits,” “swish-GLU over GeLU,” and “RMS-Norm over Layer-norm” as potential contributors to training instability. The second concerned sequence length. Because the quadratic complexity of self-attention makes training on longer sequences significantly more expensive, a staged training approach was adopted. Training began with 800B tokens at a sequence length of 2K tokens, followed by 400B tokens at 4K, and finally 300B tokens at 8K.
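To see why the staged schedule saves compute, the following minimal sketch counts only the query-key pairs scored by full self-attention, which is the term that grows quadratically with sequence length. It assumes that 2K, 4K, and 8K mean 2048, 4096, and 8192 tokens, and it ignores causal masking and every non-attention cost, so the ratio it prints describes the attention term only and is not a measurement of XGen's actual training cost.

```python
# Staged schedule as reported in the article: (sequence length, tokens).
# Reading "2K/4K/8K" as 2048/4096/8192 is an assumption of this sketch.
STAGES = [(2048, 800e9), (4096, 400e9), (8192, 300e9)]


def attention_pair_count(seq_len: int, n_tokens: float) -> float:
    """Query-key pairs scored when n_tokens are packed into sequences
    of length seq_len. Each sequence costs seq_len ** 2 pairs, so the
    total grows linearly with seq_len for a fixed token count."""
    n_sequences = n_tokens / seq_len
    return n_sequences * seq_len ** 2


total_tokens = sum(tokens for _, tokens in STAGES)
staged_cost = sum(attention_pair_count(s, t) for s, t in STAGES)

# Hypothetical alternative: every token trained at the longest length.
all_8k_cost = attention_pair_count(8192, total_tokens)

print(f"total tokens: {total_tokens / 1e12:.1f}T")
print(f"staged / all-8K attention cost: {staged_cost / all_8k_cost:.2f}")
```

The three stages add up to the 1.5 trillion tokens quoted for the series, and under these assumptions the staged schedule scores fewer than half the attention pairs that training everything at 8K would.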
To assess how well the XGen-7B 8K model comprehends longer contexts, the researchers evaluated it on three primary tasks: long-form dialogue generation, text summarization, and question-answering. Because of the difficulty of these tasks, they used the instruction-tuned model for the evaluations. For long-form dialogue generation, the researchers used three tasks: AMI meeting summarization, and ForeverDreaming and TVMegaSite screenplay summarization. Across all metrics, the XGen-7B-8K-Inst model achieved the highest scores among the instruction-tuned models it was compared with.
For long-form question-answering, the researchers used ChatGPT to generate questions from Wikipedia documents covering diverse topics like Physics, Engineering, History, and Entertainment, along with their corresponding summaries. The LLM-generated answers, which were 256 tokens long, were evaluated using GPT-4 on their structure, organization, and relevance to the question and source document. In this setting, the XGen-7B-8K-Inst model outperformed the baseline models, which are limited to 2K tokens. For text summarization, the researchers evaluated the XGen-7B model on two datasets from different domains, namely meeting conversations and government reports. Here, too, the XGen-7B model significantly outperformed the other baseline models.
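One way to picture why a 2K-token limit hurts in the question-answering setup above: in a decoder-only model the prompt and the generated answer share one context window, so room reserved for a 256-token answer reduces what is left for the source document. The sketch below is illustrative only. The helper names, the head-truncation policy, and the 6,000-token stand-in document are all hypothetical, and 2K and 8K are read as 2048 and 8192.

```python
def prompt_budget(context_len: int, max_new_tokens: int) -> int:
    """Tokens left for the prompt once room for the answer is reserved."""
    if max_new_tokens >= context_len:
        raise ValueError("answer length must be smaller than the context window")
    return context_len - max_new_tokens


def truncate_prompt(token_ids: list[int], context_len: int,
                    max_new_tokens: int) -> list[int]:
    """Keep the leading tokens that fit (a simple head-truncation policy,
    chosen here for illustration)."""
    return token_ids[: prompt_budget(context_len, max_new_tokens)]


# Stand-in for a tokenized source document (hypothetical length).
document = list(range(6000))

for name, context_len in (("2K baseline", 2048), ("8K model", 8192)):
    kept = truncate_prompt(document, context_len, max_new_tokens=256)
    print(f"{name}: keeps {len(kept)} of {len(document)} prompt tokens")
```

With these numbers the 2K model sees well under a third of the document, while the 8K model sees all of it.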
Overall, the evaluations showed that the XGen-7B model handles longer contexts well across long-form dialogue generation, question-answering, and text summarization, outperforming the other instruction-tuned and baseline models it was compared with. The researchers nonetheless acknowledge a limitation: like many other AI models, XGen is not free of biases and can generate toxic responses. Salesforce Research has also open-sourced its code so that the community can explore the work.
Khushboo Gupta is a consulting intern at MarktechPost. She is currently pursuing her B.Tech at the Indian Institute of Technology (IIT), Goa. She is passionate about Machine Learning, Natural Language Processing, and Web Development, and she enjoys deepening her technical knowledge by taking part in challenges.