Deploying an Open Source 100k-Context Window LLM
Quick Note on LLMs
In a field as fast-moving as Large Language Models (LLMs), almost anything shared publicly can feel outdated within weeks. Still, earlier in my data science and machine learning career I relied on blogs and other resources for guidance, so I hope sharing my notes here will be helpful to others. Additionally, I’ve found that posting information online can be a great way to get feedback; if I’ve made an error or there’s a better approach, others can help clarify. These are my personal notes — I hope you find them useful, and I’d love to hear your thoughts!
Deploying a Large Context Window Dedicated Endpoint
Recently I have been working on a project requiring the application of Large Language Models (LLMs) to very large text documents — sometimes the equivalent of over 200 pages in length. For context, I use a working assumption that 75 words correspond to approximately 100 tokens. This leads to a requirement for a context window (i.e., the number of tokens that can be provided in a single input) of around 100k tokens [1].
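As a quick sanity check on those numbers, the back-of-envelope arithmetic looks something like this (the words-per-page figure is my own rough assumption, not a measured value):
# Back-of-envelope token estimate for a ~200 page document.
pages = 200
words_per_page = 375          # rough assumption for a dense report page
tokens_per_word = 100 / 75    # the working ratio of ~100 tokens per 75 words

estimated_tokens = pages * words_per_page * tokens_per_word
print(f"Estimated tokens: {estimated_tokens:,.0f}")   # roughly 100,000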
Until last year, this was essentially infeasible, but several providers now offer support for context windows extending into the millions of tokens.
An added requirement was to control (as much as possible) the LLM environment. I will write more in the future on the use of LLMs in research environments, but I am concerned that we will make the replicability crisis more severe as researchers jump into tools such as ChatGPT, whose performance can change over time (see here). Controlling the environment is feasible through a private deployment of an open source model on a service such as Hugging Face, which allows private endpoints to be deployed.
Getting a 100k context window deployed and working took more effort than I initially expected. Particularly when interacting with several of Hugging Face’s APIs, I got the impression that they are moving so quickly that the documentation may be struggling to keep up. Many “tricks” have been developed, such as quantization [2] and vLLM [3], to allow these large, performant models to be deployed more efficiently, but getting them to work in practice was tough (and in the end I am still working on getting a quantized model and vLLM working together on Hugging Face). The requirements below set out how I got a model deployed and working. This page suggests that in principle less VRAM should be required, but I could not get it to deploy successfully on anything smaller - any advice would be welcome!
Deployment Considerations
Given the above and with some testing I have the following requirements:
- At least 80 GB of VRAM required (see the rough sketch after this list for why)
- Deployment pinned to a specific image and model revision for future reproducibility
- Deployment of a quantized open source LLM to allow inference on this hardware and at this context length
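As a rough sketch of why so much VRAM is needed, the arithmetic below uses the published Llama 3.1 8B architecture values (32 layers, 8 key-value heads, head dimension 128); the point is that even with 4-bit weights, the key-value cache and prefill buffers for a 100k-token input add up quickly. Treat it as an illustration rather than a sizing guide.
# Rough VRAM sketch for a 100k-token context on a 4-bit Llama 3.1 8B.
params = 8e9
weight_bytes = params * 0.5                    # ~4-bit weights -> ~4 GB

layers, kv_heads, head_dim = 32, 8, 128        # Llama 3.1 8B config values
kv_bytes_per_token = 2 * layers * kv_heads * head_dim * 2   # K and V in fp16
kv_cache_bytes = kv_bytes_per_token * 100_000               # ~13 GB

print(f"weights:  {weight_bytes / 1e9:.1f} GB")
print(f"KV cache: {kv_cache_bytes / 1e9:.1f} GB")
# Prefill activations and TGI's working buffers come on top of this,
# which is why in practice I only got it running on an 80 GB card.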
Setup
HEALTH WARNING
Running the code below with an API key that has sufficient permissions will deploy a GPU-based machine that will cost you money (this particular example, as of the time of writing, costs around $4 an hour to run) - proceed at your own risk.
HEALTH WARNING OVER
I am working as part of an org with other researchers, so I needed to give the specific org name to generate the URL; if you are deploying under your own account, you can use your username as the namespace instead. You will, however, need to generate an API key with sufficient permissions to interact with and deploy endpoints.
import requests
import os
org = "org_name"
url = f"https://api.endpoints.huggingface.cloud/v2/endpoint/{org}"
# A valid Hugging Face API key with sufficient permissions is required
hugging_api = os.getenv("HUGGING_API")
Once set up, I found that interacting through the Hugging Face web interface, or even the Python SDK, was less straightforward than calling the API directly, which gives access to many options in one place. I found this page invaluable for working out the different options.
We first provide the headers which need to include your Bearer token:
headers = {
    "accept": "application/json",
    "authorization": f"Bearer {hugging_api}",
    "content-type": "application/json"
}
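As an optional sanity check before creating anything, the same URL can (as far as I can tell from the API reference) be queried with a GET request to list any existing endpoints in the namespace; a 200 response confirms the API key and org name are set up correctly:
# Optional: list existing endpoints to confirm the key and namespace work.
check = requests.get(url, headers=headers)
print(check.status_code)
print(check.json())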
Finally, the most important part was working out how to correctly specify the configuration of the endpoint we want to deploy:
data = {
    "name": "name-your-endpoint",
    "type": "protected",
    "provider": {
        "vendor": "aws",
        "region": "us-east-1"
    },
    "compute": {
        "id": "aws-us-east-1-nvidia-a100-x1",
        "accelerator": "gpu",
        "instanceType": "nvidia-a100",
        "instanceSize": "x1",
        "scaling": {
            "minReplica": 0,
            "maxReplica": 1,
            "scaleToZeroTimeout": 15,
            "metric": "hardwareUsage"
        }
    },
    "model": {
        "repository": "hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4",
        "revision": "db1f81ad4b8c7e39777509fac66c652eb0a52f91",
        "task": "text-generation",
        "framework": "pytorch",
        "image": {
            "tgi": {
                "healthRoute": "/health",
                "port": 80,
                "url": "ghcr.io/huggingface/text-generation-inference:2.3.1",
                "maxBatchPrefillTokens": 100000,
                "maxInputLength": 100000,
                "maxTotalTokens": 128000,
                "disableCustomKernels": False,
                "quantize": "awq"
            }
        },
        "secrets": {},
        "env": {}
    }
}
Useful options to note:
- Provider: Deployed on AWS due to my familiarity, though Hugging Face supports other cloud providers.
- Compute: Specified a machine with a single A100 GPU and set scaling options to quickly scale down when not in use. Since my work is research-focused, response time is less important than compute cost.
- Model: Used a quantized [4] version of Meta’s Llama 3.1 model to handle the large context size efficiently. Pinning a revision should hopefully allow better model replication in the future. I utilized Hugging Face’s “Text Generation Inference” (tgi) image, providing options for batch size and token length to support the large context.
Finally, we can send a request to the API, and the dedicated endpoint will begin deployment:
response = requests.post(url, headers=headers, json=data)
print(response.status_code)
print(response.json())
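Deployment takes a few minutes. My understanding from the same API reference is that the endpoint can be polled (and later paused or deleted to stop it costing money) at a URL formed from its name; the exact routes and response fields below are my reading of the docs, so double-check them before relying on this sketch:
import time

endpoint_name = "name-your-endpoint"   # must match the "name" field above

# Poll until the endpoint reports that it is running
while True:
    status = requests.get(f"{url}/{endpoint_name}", headers=headers).json()
    state = status.get("status", {}).get("state")
    print(state)
    if state == "running":
        break
    time.sleep(30)

# When finished, pause (resumable) or delete the endpoint to stop charges:
# requests.post(f"{url}/{endpoint_name}/pause", headers=headers)
# requests.delete(f"{url}/{endpoint_name}", headers=headers)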
Quick Test
Finally, as a not-so-elegant test of whether the endpoint will accept 100k tokens in one go, we can send the following request:
### Get your Hugging Face Endpoint after it has been deployed:
API_URL = "https://your_endpoint_here.endpoints.huggingface.cloud"
def query(payload):
    response = requests.post(API_URL, headers=headers, json=payload)
    return response.json()
input_text = " ".join(["test"] * 99999)
output = query({
    "inputs": f"{input_text}",
    "parameters": {
        "max_new_tokens": 1000
    }
})
print(output)
This sends the word “test” nearly 100k times in a single request. The response will likely just be “test” repeated many more times, since it isn’t a sensible prompt, but it demonstrates that the endpoint can handle a full ~100k-token input.
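For real documents it is worth counting tokens properly before sending anything, rather than relying on the words-to-tokens rule of thumb. Below is a minimal sketch using the transformers tokenizer for the same repository (I am assuming its tokenizer can be downloaded with your Hugging Face login; the file path is just a placeholder):
from transformers import AutoTokenizer

# Count tokens with the model's own tokenizer before sending a real document
tokenizer = AutoTokenizer.from_pretrained(
    "hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4"
)

with open("my_document.txt") as f:     # placeholder input file
    document = f.read()

n_tokens = len(tokenizer.encode(document))
print(f"{n_tokens:,} tokens")          # needs to be under maxInputLength (100,000)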
Hope you find this useful - let me know if you use it or if there is a better/cheaper way to achieve the same thing!
[1] There are other methods developed to deal with long token lengths (Retrieval-Augmented Generation, or RAG, being the principal one that many will be familiar with); however, here I was curious whether we could get the entire context into the prompt for document analysis. ↩︎
[2] Essentially this reduces resource requirements by reducing the precision of the parameters from 16- or 32-bit floating point numbers to something smaller - in this case we will use a 4-bit model. ↩︎
[3] vLLM is an open source inference and serving engine for large language models. It optimizes memory management (notably paged handling of the key-value cache) and uses continuous batching, which lowers VRAM requirements and latency and makes it more feasible to serve long-context models on less expensive hardware without sacrificing much performance. ↩︎
[4] Note that an AWQ (one of the ways of quantizing) version of the model from hugging-quants is used here, and this needs to match the quantize option provided in the image configuration. ↩︎