
Summary

In this protocol, the response quality of a foundation large language model is improved by augmenting it with peer-reviewed, domain-specific scientific articles through a vector embedding mechanism. Additionally, code is provided to aid in comparing performance across large language models.

Abstract

Large language models (LLMs) have emerged as a popular resource for generating information relevant to a user query. Such models are created through a resource-intensive training process utilizing an extensive, static corpus of textual data. This static nature results in limitations for adoption in domains with rapidly changing knowledge, proprietary information, and sensitive data. In this work, methods are outlined for augmenting general-purpose LLMs, known as foundation models, with domain-specific information using an embeddings-based approach for incorporating up-to-date, peer-reviewed scientific manuscripts. This is achieved through open-source tools such as Llama-Index and publicly available models such as Llama-2 to maximize transparency, user privacy and control, and replicability. While scientific manuscripts are used as an example use case, this approach can be extended to any text data source. Additionally, methods for evaluating model performance following this enhancement are discussed. These methods enable the rapid development of LLM systems for highly specialized domains regardless of the comprehensiveness of information in the training corpus.

Introduction

Large language models (LLMs) such as OpenAI's ChatGPT or Meta AI's Llama have rapidly become a popular resource for generating text relevant to a user prompt. Originally functioning to predict the next lexical items in a sequence, these models have evolved to understand context, encode clinical information, and demonstrate high performance on a variety of tasks [1,2,3,4]. Though language models predate such capabilities and their current level of popularity by decades [5], recent advances in deep learning and computing capabilities have made pretrained, high-quality commercial LLMs broadly available to users via web-based technologies and application programming interfaces (APIs) [6]. However, there are several notable limitations to consuming LLMs in this format.

Challenge 1: Static training corpus
LLMs are trained on an enormous (e.g., two trillion tokens in the case of Llama 2 [7]) but static body of text data. This poses a challenge to producing accurate responses in fields undergoing rapid development or with a rapidly changing literature. In this static approach, LLMs would require frequent retraining to keep up with the latest data, which is neither practical nor scalable. Moreover, prompts that require responses based on information not present in the training data may prevent useful text generation or lead to hallucinations [8]. Instances of hallucinations or fact fabrication raise significant concerns about the reliability of LLMs, particularly in settings where the accuracy of information is critical [9].

Challenge 2: Lack of domain specificity
Pretrained models are often created for general use, while users may require a model specifically optimized for performance in a particular domain. Additionally, the computational resources and data required for training a model de novo or performing significant fine-tuning are prohibitive to many users.

Challenge 3: Lack of privacy
Users pursuing applications involving sensitive data or proprietary information may be unwilling or unable to use certain LLM services as they lack information on how data may be stored or utilized.

Challenge 4: Lack of guaranteed stability
Services with proprietary LLMs may change available models or alter behavior at any time, making stability a concern for the implementation of applications relying on these services.

Retrieval-augmented generation (RAG) is a technique developed to improve LLM performance, particularly on queries related to information outside the model's training corpus [10,11]. These systems augment LLMs by incorporating contextual information to be considered when generating a response to a user query. Various recent works have described applications of RAG systems and their potential advantages [12,13,14].
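To make the retrieval-augmented pattern concrete, the short sketch below shows, in simplified form, how retrieved passages can be combined with a user query before the prompt is sent to an LLM. The function name and prompt template are illustrative assumptions and are not taken from the repository accompanying this protocol.

# Minimal sketch of the RAG pattern: retrieved passages are placed in the
# prompt so the LLM can ground its answer in them. Names and template are
# illustrative only.
def build_augmented_prompt(query: str, retrieved_chunks: list[str]) -> str:
    context = "\n\n".join(retrieved_chunks)
    return (
        "Use the following excerpts from peer-reviewed manuscripts to answer the question.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\n"
        "Answer:"
    )

# Example usage with placeholder excerpts returned by a vector-store search:
chunks = ["Excerpt one of a guideline...", "Excerpt two of a guideline..."]
print(build_augmented_prompt("What is the recommended first-line management?", chunks))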

The goal of the method outlined in this work is to demonstrate the construction of such a system and provide a framework for researchers to rapidly experiment with domain-specific, augmented LLMs. This method is applicable to users seeking to augment an LLM with an external text-based data source. Specifically, an overarching aim of this protocol is to provide step-by-step code that is extensible to a variety of practical LLM and RAG experiments without the need for significant technical expertise in the language-modeling domain, though a working knowledge of Python is required to apply this approach without modification. To maximize user control, transparency, portability, and affordability of solutions, open-source, publicly available tools are utilized. The proposed system addresses the previously stated issues in the following ways:

Solutions 1 and 2: Static training corpus and lack of domain specificity
The provided methodology leverages a RAG approach, utilizing embeddings to supply domain-specific information not included in the original training data. At a high level, embedding models transform text or other data into a vector representation, that is, a one-dimensional array of numbers. This technique is beneficial as it converts the semantic information contained in text to a dense, numeric form. By projecting a user query into the same embedding space, various algorithms can be used to calculate the distance [15], and therefore the approximate semantic similarity, between the user query and sections of text documents. Thus, creating a database of such vectors from documents broken into discrete sections can facilitate searching over a significant number of documents for the text most relevant to a user query (Figure 1). This approach is extensible to any text document. While other approaches, such as online search capabilities, are beginning to be implemented to augment LLMs, this approach allows users to choose sources considered of sufficiently high quality for their use case.
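As a standalone illustration of the embedding mechanism described above, the following sketch embeds a query and a few document sections with the bge-small-en model (introduced later in the protocol) through the sentence-transformers library and ranks the sections by cosine similarity. This demonstrates the underlying principle rather than the Llama-Index pipeline used in the protocol; the example section texts are placeholders.

# Semantic search with embeddings: text is mapped to vectors, and cosine
# similarity between the query vector and each section vector ranks sections
# by relevance. Requires the sentence-transformers package.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-small-en")  # compact, 384-dimensional embeddings

sections = [
    "Cytoreductive surgery is considered for selected patients...",
    "The committee reviewed evidence on systemic chemotherapy...",
    "Surveillance imaging intervals are summarized below...",
]
query = "Which patients are candidates for cytoreductive surgery?"

section_vecs = model.encode(sections, normalize_embeddings=True)
query_vec = model.encode([query], normalize_embeddings=True)[0]

# With normalized vectors, the dot product equals cosine similarity.
scores = section_vecs @ query_vec
for idx in np.argsort(scores)[::-1]:
    print(f"{scores[idx]:.3f}  {sections[idx]}")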

Solution 3: Lack of privacy
In this implementation, a secure cloud environment was used for hosting, with no user prompts, generated responses, or other data leaving this ecosystem. However, all code is written in a platform-agnostic manner so that another cloud provider or local hardware may be substituted.

Solution 4: Lack of guaranteed stability
This approach utilizes open-source libraries and focuses on augmenting LLMs with publicly available weights, allowing a higher degree of transparency, stability, and versioning if required.

A full schematic of our proposed system is shown in Figure 2, and detailed instructions on replicating this or a similar system are outlined in the protocol section. An additional consideration when altering model behavior through fine-tuning or augmentation is the evaluation of performance. In language-generating models, this presents a unique challenge, as many traditional machine learning metrics are not applicable. Though a variety of techniques exist [16], in this study, expert-written multiple-choice questions (MCQs) were used to assess accuracy and compare performance pre- and post-augmentation as well as against popular alternative LLMs.

Protocol

In the use case demonstrated in this paper, the vector store was generated using published guidelines from the Chicago Consensus Working Group [17]. This expert group was established to develop guidelines for the management of peritoneal cancers. The subject area was chosen as it is within the investigators' area of clinical expertise. The set of papers was accessed from online journal repositories including Cancer and the Annals of Surgical Oncology. A compact (33.4M parameters) embedding model created by the Beijing Academy of Artificial Intelligence (BAAI, https://www.baai.ac.cn/english.html), bge-small-en, was used to generate embeddings from the source documents. The resulting database was then used to augment Llama 2 and OpenAI foundation models [7]. For the reader's convenience, the code is made available through GitHub (https://github.com/AnaiLab/AugmentedLLM). To ensure replicability, it is recommended to use the same versions of the libraries listed in the provided requirements file as well as the same version of Python. Additional details on installation and documentation for the tools used in the following methods can be found at the official websites of the providers for Python (https://www.python.org), git (https://git-scm.com), Llama-Index (https://llamaindex.ai), and Chroma (https://trychroma.com).

1. Prerequisites: Review code and install required libraries

  1. Verify that git, python, and pip are installed.
    1. In a terminal, run the following commands to verify the installation:
      git --version
      python3 --version
      pip3 --version
  2. Check the code and install requirements.
    1. In a terminal, run the following commands:
      git clone https://github.com/AnaiLab/AugmentedLLM.git
      cd ./AugmentedLLM/

      pip3 install -r requirements.txt

2. Creation of a vector database with Llama-Index

  1. Convert RTF file formats (only required if files are in an RTF format).
    1. Edit the file titled config.py, replacing the file paths in the following variables with the location of the RTF articles to be converted and the location to which the plain-text files are to be written. Save the file.
      rtf_file_dir = './articles_rtf/'
      converted_file_dir = './articles_converted/'
    2. In a terminal, in the same directory, execute the code to generate plain text versions of RTF files by running the following command:
      python3 ./convert_rtf.py
  2. Create and save the vector database.
    1. Edit the config.py file, replacing the value of the following variable with the file path of the folder containing documents with which the LLM is to be augmented; save the file.
      article_dir = './articles/'
    2. In a terminal, in the same directory, execute the code with the following command to create and persist the database. Verify that the database is now saved in the vector_db folder (a sketch of the indexing logic this script may perform is shown after this section).
      python3 ./build_index.py
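
For readers who want to see what the indexing step performs internally, a minimal sketch is shown below. It assumes the llama-index 0.10+ package layout (with the llama-index-embeddings-huggingface and llama-index-vector-stores-chroma integrations installed) and the default paths from config.py; the repository's build_index.py may use different versions or structure.

# Sketch of building and persisting a vector index over a folder of articles.
# Paths mirror the config.py defaults; module layout assumes llama-index >= 0.10.
import chromadb
from llama_index.core import Settings, SimpleDirectoryReader, StorageContext, VectorStoreIndex
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.vector_stores.chroma import ChromaVectorStore

# Use the compact BAAI embedding model referenced in the protocol.
Settings.embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en")

# Load plain-text articles; llama-index chunks them into nodes automatically.
documents = SimpleDirectoryReader("./articles/").load_data()

# Persist embeddings to a local Chroma database in the vector_db folder.
chroma_client = chromadb.PersistentClient(path="./vector_db")
collection = chroma_client.get_or_create_collection("articles")
vector_store = ChromaVectorStore(chroma_collection=collection)
storage_context = StorageContext.from_defaults(vector_store=vector_store)

index = VectorStoreIndex.from_documents(documents, storage_context=storage_context)
print(f"Indexed {len(documents)} documents into ./vector_db")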

3. Augmentation of a Llama model with vector database generated in section 2

  1. Instantiate custom LLM Locally (optional)
    NOTE: Performing this step is only required if you would like to use a model other than the default Llama-2-7B. If you wish to use the default model, proceed to step 3.2.
    1. (Using a custom LLM) Specify an LLM to augment by editing run_augmented_llm.py and passing a llama-index LLM object as the llm parameter in the constructor in the following line of code rather than None (an example of constructing such an object is sketched after this section).
      augmentedLLM = AugmentedLLM(vector_store, llm=None)
  2. Query augmented LLM
    1. Run the following command in the terminal:
      python3 ./run_augmented_llm.py
    2. Run user queries to get a response augmented by data in the set of manuscripts (Figure 3 and Figure 4). Press CTRL + C to exit when finished.
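
The snippet below sketches one way to construct a llama-index LLM object for the llm parameter described in step 3.1.1, here using a locally hosted Llama-2 chat model through the HuggingFaceLLM integration. The AugmentedLLM class is taken from the repository code shown above; the model identifiers and generation settings are illustrative assumptions.

# Sketch: passing a custom, locally hosted model to the repository's
# AugmentedLLM wrapper instead of the default. Assumes llama-index >= 0.10 with
# the llama-index-llms-huggingface integration and access to the gated
# meta-llama weights on Hugging Face.
from llama_index.llms.huggingface import HuggingFaceLLM

llm = HuggingFaceLLM(
    model_name="meta-llama/Llama-2-7b-chat-hf",      # illustrative model choice
    tokenizer_name="meta-llama/Llama-2-7b-chat-hf",
    context_window=4096,
    max_new_tokens=512,
)

# In run_augmented_llm.py, replace None with the object constructed above:
# augmentedLLM = AugmentedLLM(vector_store, llm=llm)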

4. Programmatic comparison of alternative LLMs

  1. Create MCQs.
    1. Edit the file questions.py, taking note of the format of the examples. Add questions following a similar format. Save the file.
  2. Connect to GPT-3.5, GPT-4, OpenChat, or other comparator models via API.
    1. Edit the config.py file, adding the API key for OpenAI or Huggingface if the objective is to benchmark against models from either provider. Save the file.
      huggingface_key = ''
      openai_key = ''
    2. Edit the compare_llms.py file, and choose the set of comparator models by uncommenting (deleting the '# ' characters at the beginning of the corresponding lines) the models to compare against.
      NOTE: Some comparators require an API key as set in step 4.2.1. Additionally, edit the output_dir parameter to change where LLM output is stored if desired; otherwise, a default will be used.
      output_dir = './llm_responses/'
    3. In a terminal, execute the code with the following command. After execution, view the model responses in the folder specified in step 4.2.2 for grading or other review.
      python3 ./compare_llms.py
  3. (Optional) Experiment with automated grading of MCQ responses. The example code uses the LMQL library to constrain LLM output into an expected format (a simplified grading sketch is shown after this section).
    1. Open the file automated_comparison.py and, similarly to step 4.2.2, uncomment the models to be included, edit the output_dir variable, or otherwise customize. Note that the model output will still be saved in a similar fashion. Save the file.
    2. Run the code from step 4.3.1 by running the following command in a terminal:
      python3 ./automated_comparison.py
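
Once responses have been collected, they can be scored against an answer key to compare accuracy across models. The sketch below shows a simple letter-matching grader over files in output_dir; the file layout (one text file per model, one answer letter per line) and the answer key are assumptions for illustration, and the repository's automated_comparison.py instead constrains output with LMQL as described in step 4.3.

# Sketch of grading saved MCQ responses: each model's answers are compared to a
# hypothetical answer key and per-model accuracy is reported.
from pathlib import Path

answer_key = ["A", "C", "B", "D"]  # hypothetical correct letters, one per question
output_dir = Path("./llm_responses/")

for response_file in sorted(output_dir.glob("*.txt")):
    lines = [line.strip().upper() for line in response_file.read_text().splitlines() if line.strip()]
    answers = [line[0] for line in lines]
    n_correct = sum(given == correct for given, correct in zip(answers, answer_key))
    accuracy = n_correct / len(answer_key)
    print(f"{response_file.stem}: {accuracy:.0%} ({n_correct}/{len(answer_key)})")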

Results

A set of 22 publications from the Chicago Consensus Working Group management guidelines was used to augment the base Llama-2-7b model [17]. The documents were converted into a vector index using the tool Llama-Index to generate Llama-2-7b-CCWG-Embed. Popular OpenAI models such as GPT-3.5 and GPT-4 were also augmented in a similar fashion to produce GPT-XX-CCWG-Embed models. A total of 20 multiple-choice questions (MCQs) were developed to assess knowledge related to the management of a variety of perito...

Discussion

The methods provided here aim to facilitate the research of domain-specific applications of LLMs without the need for de novo training or extensive fine-tuning. As LLMs are becoming an area of significant research interest, approaches for augmenting knowledge bases and improving the accuracy of responses will become increasingly important [18,19,20,21]. As demonstrated in the provided res...

Disclosures

The authors have no conflicts of interest to declare.

Acknowledgements

This work was facilitated by several open-source libraries, most notably llama-index (https://www.llamaindex.ai/), ChromaDB (https://www.trychroma.com/), and LMQL (https://lmql.ai/).

Materials

Name                      Company    Catalog Number    Comments
pip3 version 22.0.2
Python version 3.10.12

References

  1. Singhal, K., et al. Large language models encode clinical knowledge. Nature. 620 (7972), 172-180 (2023).
  2. Gilson, A., et al. How does ChatGPT perform on the United States medical licensing examination? The implications of large language models for medical education and knowledge assessment. JMIR Med Educ. 9 (1), e45312 (2023).
  3. Guerra, G. A., et al. GPT-4 artificial intelligence model outperforms ChatGPT, medical students, and neurosurgery residents on neurosurgery written board-like questions. World Neurosurg. 179, e160-e165 (2023).
  4. Terwiesch, C. Would Chat GPT3 get a Wharton MBA? A prediction based on its performance in the Operations Management course. Available from: https://mackinstitute.wharton.upenn.edu/wp-content/uploads/2023/01/Christian-Terwiesch-Chat-GTP.pdf (2023).
  5. Weizenbaum, J. ELIZA-a computer program for the study of natural language communication between man and machine. Communications of the ACM. 9 (1), 36-45 (1966).
  6. Wu, T., et al. A brief overview of ChatGPT: The history, status quo and potential future development. IEEE/CAA Journal of Automatica Sinica. 10 (5), 1122-1136 (2023).
  7. Huang, L., et al. A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions. arXiv [cs.CL] (2023).
  8. Thirunavukarasu, A. J., et al. Large language models in medicine. Nature Med. 29 (8), 1930-1940 (2023).
  9. Ram, O., et al. In-Context retrieval-augmented language models. Trans Assoc Comput Linguist. 11, 1316-1331 (2023).
  10. Cormode, G. Sequence distance embeddings. Available from: https://wrap.warwick.ac.uk/61310/7/WRAP_THESIS_Cormode_2003.pdf (2003).
  11. Chicago Consensus Working Group. The Chicago Consensus Guidelines for peritoneal surface malignancies: Introduction. Cancer. 126 (11), 2510-2512 (2020).
  12. Dodge, J., et al. Measuring the carbon intensity of AI in cloud instances. FAccT 2022 (2022).
  13. Khene, Z.-E., Bigot, P., Mathieu, R., Rouprêt, M., Bensalah, K. Development of a personalized chat model based on the European association of urology oncology guidelines: Harnessing the power of generative artificial intelligence in clinical practice. Eur Urol Oncol. 7 (1), 160-162 (2024).
  14. Kresevic, S., et al. Optimization of hepatological clinical guidelines interpretation by large language models: a retrieval augmented generation-based framework. NPJ Digit Med. 7 (1), 102 (2024).
  15. Ge, J., et al. Development of a liver disease-specific large language model chat interface using retrieval augmented generation. medRxiv (2023).
  16. Panagoulias, D. P., et al. Rule-augmented artificial intelligence-empowered systems for medical diagnosis using large language models. , 70-77 (2023).
  17. Panagoulias, D. P., et al. Augmenting large language models with rules for enhanced domain-specific interactions: The case of medical diagnosis. Electronics. 13 (2), 320 (2024).
  18. Bommasani, R., Liang, P., Lee, T. Holistic evaluation of language models. Ann N Y Acad Sci. 1525 (1), 140-146 (2023).
  19. Papineni, K., Roukos, S., Ward, T., Zhu, W.-J. BLEU: A method for automatic evaluation of machine translation. Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL '02) (2002).
  20. Johnson, D., et al. Assessing the accuracy and reliability of AI-generated medical responses: An evaluation of the chat-GPT model. Res Sq. , (2023).
  21. Lin, C.-Y. ROUGE: A package for automatic evaluation of summaries. Text Summarization Branches Out, 74-81 (2004).
  22. Chen, S., et al. Evaluating the ChatGPT family of models for biomedical reasoning and classification. J Am Med Inform Assoc. 31 (4), 940-948 (2024).
