In this protocol, the response quality of foundation large language models is improved by augmenting them with peer-reviewed, domain-specific scientific articles through a vector embedding mechanism. Additionally, code is provided to aid in performance comparison across large language models.
Large language models (LLMs) have emerged as a popular resource for generating information relevant to a user query. Such models are created through a resource-intensive training process utilizing an extensive, static corpus of textual data. This static nature results in limitations for adoption in domains with rapidly changing knowledge, proprietary information, and sensitive data. In this work, methods are outlined for augmenting general-purpose LLMs, known as foundation models, with domain-specific information using an embeddings-based approach for incorporating up-to-date, peer-reviewed scientific manuscripts. This is achieved through open-source tools such as Llama-Index and publicly available models such as Llama-2 to maximize transparency, user privacy and control, and replicability. While scientific manuscripts are used as an example use case, this approach can be extended to any text data source. Additionally, methods for evaluating model performance following this enhancement are discussed. These methods enable the rapid development of LLM systems for highly specialized domains regardless of the comprehensiveness of information in the training corpus.
Large language models (LLMs) such as OpenAI's ChatGPT or Meta AI's Llama have rapidly become a popular resource for generating text relevant to a user prompt. Originally functioning to predict the next lexical items in a sequence, these models have evolved to understand context, encode clinical information, and demonstrate high performance on a variety of tasks1,2,3,4. Though language models predate such capabilities and their current level of popularity by decades5, recent advances in deep learning and computing capabilities have made pretrained, high-quality commercial LLMs broadly available to users via web-based technologies and application programming interfaces (APIs)6. However, there are several notable limitations to consuming LLMs in this format.
Challenge 1: Static training corpus
LLMs are trained on an enormous (e.g., two trillion tokens in the case of Llama 27) but static body of text data. This poses a challenge to producing accurate responses pertaining to fields undergoing rapid development or changing literature. In this static approach, LLMs would require frequent retraining to keep up with the latest data, which is neither practical nor scalable. Moreover, prompts that require responses based on information not present in the training data may prevent useful text generation or lead to hallucinations8. Instances of hallucinations or fact fabrication raise significant concerns about the reliability of LLMs, particularly in settings where the accuracy of information is critical9.
Challenge 2: Lack of domain specificity
Pretrained models are often created for general use, while users may require a model specifically optimized for performance in a particular domain. Additionally, the computational resources and data required for training a model de novo or performing significant fine-tuning are prohibitive to many users.
Challenge 3: Lack of privacy
Users pursuing applications involving sensitive data or proprietary information may be unwilling or unable to use certain LLM services as they lack information on how data may be stored or utilized.
Challenge 4: Lack of guaranteed stability
Services with proprietary LLMs may change available models or alter behavior at any time, making stability a concern for the implementation of applications relying on these services.
Retrieval-augmented generation (RAG) is a technique developed to improve LLM performance, particularly on queries related to information outside the model's training corpus10,11. These systems augment LLMs by incorporating contextual information to be considered when generating a response to a user query. Various recent works have described applications of RAG systems and their potential advantages12,13,14.
The goal of the method outlined in this work is to demonstrate the construction of such a system and provide a framework for researchers to rapidly experiment on domain-specific, augmented LLMs. This method is applicable for users seeking to augment an LLM with an external text-based data source. Specifically, an overarching aim of this protocol is to provide step-by-step code that is extensible to a variety of practical LLM and RAG experiments without the need for significant technical expertise in the language-modeling domain, though a working knowledge of Python is required to apply this approach without modification. To maximize user control, transparency, portability, and affordability of solutions, open-source, publicly available tools are utilized. The proposed system addresses the previously stated issues in the following ways:
Solutions 1 and 2: Static training corpus and lack of domain specificity
The provided methodology leverages a RAG approach, utilizing embeddings to supply domain-specific information not included in the original training data. At a high level, embedding models transform text or other data into a vector, or single-dimension array of numbers. This technique is beneficial as it converts the semantic information contained in text to a dense, numeric form. By projecting a user query into the same embedding space, various algorithms can be used to calculate the distance15, and therefore the approximate semantic similarity, between the user query and sections of text documents. Thus, creating a database of such vectors from documents broken into discrete sections can facilitate searching over a significant number of documents for the text most relevant to a user query (Figure 1). This approach is extensible to any text document. While other approaches, such as online search capabilities, are beginning to be implemented to augment LLMs, this approach allows users to choose sources considered of sufficiently high quality for their use case.
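To make the retrieval step concrete, the following minimal Python sketch embeds a short query and a few candidate text chunks, then ranks the chunks by cosine similarity. It assumes the sentence-transformers library and the bge-small-en embedding model described below; the chunk strings are placeholder content, and the sketch illustrates the general technique rather than reproducing the repository's exact code.

```python
# Minimal sketch of embedding-based retrieval (assumes the
# sentence-transformers library; placeholder text, not repository code).
from sentence_transformers import SentenceTransformer
import numpy as np

# Load a compact embedding model (the same model used in this work).
model = SentenceTransformer("BAAI/bge-small-en")

chunks = [
    "Placeholder text from guideline section one.",
    "Placeholder text from guideline section two.",
    "Placeholder text from an unrelated document.",
]
query = "What does guideline section one recommend?"

# Project the query and document chunks into the same embedding space.
chunk_vecs = model.encode(chunks, normalize_embeddings=True)
query_vec = model.encode(query, normalize_embeddings=True)

# With normalized vectors, the dot product equals cosine similarity,
# so higher scores indicate closer approximate semantic similarity.
scores = chunk_vecs @ query_vec
for idx in np.argsort(scores)[::-1]:
    print(f"{scores[idx]:.3f}  {chunks[idx]}")
```

In a full RAG system, the top-ranked chunks are then supplied to the LLM as context alongside the user query before a response is generated.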
Solution 3: Lack of privacy
In this implementation, a secure cloud environment was used for hosting, with no user prompts, generated responses, or other data leaving this ecosystem. However, all code is written in a platform-agnostic manner so that another cloud provider or local hardware may be substituted.
Solution 4: Lack of guaranteed stability
This approach utilizes open-source libraries and focuses on augmenting LLMs with publicly available weights, allowing a higher degree of transparency, stability, and versioning, if required.
A full schematic of the proposed system is shown in Figure 2, and detailed instructions on replicating this or a similar system are outlined in the protocol section. An additional consideration when altering model behavior through fine-tuning or augmentation is the evaluation of performance. In language-generating models, this presents a unique challenge as many traditional machine learning metrics are not applicable. Though a variety of techniques exist16, in this study, expert-written multiple-choice questions (MCQs) were used to assess accuracy and compare performance pre- and post-augmentation as well as against popular alternative LLMs.
In the use case demonstrated in this paper, the vector store was generated using published guidelines from the Chicago Consensus Working Group17. This expert group was established to develop guidelines for the management of peritoneal cancers. The subject area was chosen as it is within the investigators' area of clinical expertise. The set of papers was accessed from online journal repositories including Cancer and the Annals of Surgical Oncology. A compact (33.4M parameters) embedding model created by the Beijing Academy of Artificial Intelligence (BAAI, https://www.baai.ac.cn/english.html), bge-small-en, was used to generate embeddings from source documents. The resulting database was then used to augment Llama 2 and OpenAI foundation models7. For the reader's convenience, the code is made available through GitHub (https://github.com/AnaiLab/AugmentedLLM). To ensure replicability, it is recommended to use the same versions of libraries used in the provided requirements list as well as the same version of Python. Additional details on installation or documentation regarding tools used in the following methods can be located at the official websites of the providers for Python (https://www.python.org), git (https://git-scm.com), Llama-Index (https://llamaindex.ai), and Chroma (https://trychroma.com).
1. Prerequisites: Review code and install required libraries
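Because replicability depends on matching the pinned interpreter and library versions (see the table of materials below), a short check such as the following can confirm the environment before proceeding. The package names shown are an illustrative subset; the requirements file in the repository is the authoritative list.

```python
# Sketch: confirm the pinned Python version and key libraries are present.
# Package names are illustrative; consult the repository's requirements file.
import sys
from importlib.metadata import version, PackageNotFoundError

assert sys.version_info[:2] == (3, 10), (
    f"Expected Python 3.10.x, found {sys.version.split()[0]}"
)

for pkg in ("llama-index", "chromadb"):
    try:
        print(f"{pkg}=={version(pkg)}")
    except PackageNotFoundError:
        print(f"{pkg} missing; run: pip3 install -r requirements.txt")
```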
2. Creation of a vector database with Llama-Index
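A minimal sketch of this step is shown below: a folder of source manuscripts is read, embedded with the bge-small-en model, and persisted to a local Chroma collection. The import paths follow the llama-index 0.10+ layout (older releases used a flat llama_index namespace), and the folder and collection names are hypothetical; see the repository for the exact, version-pinned code.

```python
# Sketch: build a persistent, Chroma-backed vector index from documents.
# Import paths assume llama-index >= 0.10; adjust to the pinned version.
import chromadb
from llama_index.core import (
    Settings,
    SimpleDirectoryReader,
    StorageContext,
    VectorStoreIndex,
)
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.vector_stores.chroma import ChromaVectorStore

# Embed document chunks with the compact BAAI model described above.
Settings.embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en")

# Load the source manuscripts from a local folder (hypothetical path).
documents = SimpleDirectoryReader("./papers").load_data()

# Persist the embeddings in a local Chroma collection so the index can be
# reloaded later without re-embedding the corpus.
client = chromadb.PersistentClient(path="./chroma_db")
collection = client.get_or_create_collection("guidelines")
vector_store = ChromaVectorStore(chroma_collection=collection)
storage_context = StorageContext.from_defaults(vector_store=vector_store)

index = VectorStoreIndex.from_documents(
    documents, storage_context=storage_context
)
```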
3. Augmentation of a Llama model with vector database generated in section 2
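Continuing from the index built in section 2, the sketch below attaches a locally hosted Llama 2 model and answers a query using retrieved context. The LlamaCPP wrapper and the GGUF weights path are assumptions for illustration; any llama-index-compatible LLM, including an OpenAI model, can be substituted.

```python
# Sketch: retrieval-augmented querying. The LlamaCPP wrapper and weights
# path are illustrative; substitute any llama-index-compatible LLM.
from llama_index.core import Settings
from llama_index.llms.llama_cpp import LlamaCPP

Settings.llm = LlamaCPP(
    model_path="./models/llama-2-7b-chat.gguf",  # hypothetical local weights
    temperature=0.1,
)

# "index" is the vector index from section 2. The query engine embeds the
# question, retrieves the most similar document chunks, and supplies them
# to the LLM as context for generation.
query_engine = index.as_query_engine(similarity_top_k=3)
response = query_engine.query(
    "What management approach do the guidelines recommend?"
)
print(response)
```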
4. Programmatic comparison of alternative LLMs
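One simple way to compare systems programmatically is to pose the same multiple-choice questions to each model and score the answers, as in the sketch below. The question data and model callables are placeholders, not the study's materials; in practice, each callable wraps a base or embedding-augmented model, such as the query engine from section 3.

```python
# Sketch: score several LLM systems on shared multiple-choice questions.
# Questions and model callables are placeholders, not the study's data.
from typing import Callable

questions = [
    {"stem": "Placeholder question stem?",
     "choices": {"A": "option one", "B": "option two",
                 "C": "option three", "D": "option four"},
     "answer": "B"},
    # ... remaining MCQs
]

def format_prompt(q: dict) -> str:
    choices = "\n".join(f"{k}. {v}" for k, v in q["choices"].items())
    return (f"{q['stem']}\n{choices}\n"
            "Respond with the single letter of the best answer.")

def accuracy(ask: Callable[[str], str]) -> float:
    correct = sum(
        ask(format_prompt(q)).strip().upper().startswith(q["answer"])
        for q in questions
    )
    return correct / len(questions)

# Each entry wraps one system behind a common prompt-in, text-out interface.
models = {
    "llama-2-7b-ccwg-embed": lambda p: str(query_engine.query(p)),
    # "gpt-3.5-ccwg-embed": ..., "gpt-4-ccwg-embed": ...,
}
for name, ask in models.items():
    print(f"{name}: {accuracy(ask):.0%}")
```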
A set of 22 publications from the Chicago Consensus Working Group management guidelines was used to augment the base Llama-2-7b model17. The documents were converted into a vector index using the tool Llama-Index to generate Llama-2-7b-CCWG-Embed. Popular OpenAI models such as GPT-3.5 and GPT-4 were also augmented in a similar fashion to produce GPT-XX-CCWG-Embed models. A total of 20 multiple-choice questions (MCQs) were developed to assess knowledge related to the management of a variety of perito...
The methods provided here aim to facilitate the research of domain-specific applications of LLMs without the need for de novo training or extensive fine-tuning. As LLMs are becoming an area of significant research interest, approaches for augmenting knowledge bases and improving the accuracy of responses will become increasingly important18,19,20,21. As demonstrated in the provided res...
The authors have no conflicts of interest to declare.
This work was facilitated by several open-source libraries, most notably llama-index (https://www.llamaindex.ai/), ChromaDB (https://www.trychroma.com/), and LMQL (https://lmql.ai/).
Name | Company | Catalog Number | Comments
pip3 version 22.0.2 | | |
Python version 3.10.12 | | |