The road less travelled: running LLMs on your own
Introduction: the privacy paradox
The desire to run Large Language Models (LLMs) comparable to GPT-4 locally or in the company cloud is increasing. The reasons behind this trend range from data protection to regulatory compliance and performance customization. Data privacy is of paramount importance, especially given the consequences of breaching the GDPR.
But, in the quest for self-reliance, we meet a multitude of challenges that make the process less than straightforward. Running an LLM, whether locally or in the cloud, is not a simple plug-and-play operation. It necessitates not only in-depth technical knowledge but also powerful computing resources. Such endeavours might bring us back into the orbit of major tech players – the same entities we aim to gain independence from.
Problems with models: the compatibility conundrum
Foundational models come in different flavours. Architectures like LLaMA and Falcon are distinct, and each can be distributed in different formats such as GGML, GGMLv3, or GPTQ. Some models are better at specific tasks than others. Unfortunately, these architectures and formats are incompatible with each other, which means one cannot switch between them on a whim. Businesses may have to dedicate resources to supporting several if they want to cover a broad spectrum of tasks.
Furthermore, each model may have its own restrictions. For example, Falcon's training data is mostly English, with some fraction of German, Spanish, and French. If you need to use it with data from an Asian country, it is unlikely to provide adequate results.
Each of these models also has a different performance profile. Although Falcon has shown promising results, it is not yet on par with GPT-4 in the benchmarks. The same can be said about LLaMA, the model by Meta. This discrepancy is vital to note: running a 40B parameter model may suffice for specific use cases, but fall short of the state of the art in others.
There is also the issue of compliance of the models, as discussed in our previous article. Not all the models allow commercial use. This may restrict which models you can select even further.
CPU models: the struggle with speed
Running LLMs on personal devices or servers brings the promise of localized processing and control. Yet, due to the immense computing power required for these models, this is not an easy task.
Unfortunately, current LLMs push modern CPUs to their limits. For example, Apple's latest M2 processors hit 100°C when running a 13B model, and that is before even considering a 40B one. That level of heat is not sustainable for long-term use. Even the AMD Ryzen series maxes out with small models.
That would be acceptable if we could get valid responses from the models. But, to start with, the speed of token generation is too slow to be practical. The CPUs mentioned above achieve only a few tokens per second, and throughput degrades further for longer outputs. A short answer may take more than 30 seconds to appear.
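The latency arithmetic is straightforward. As a sketch, assuming an illustrative rate of around 4 tokens per second (the exact figure varies by CPU, model, and quantization):

```python
def response_time_seconds(tokens: int, tokens_per_second: float) -> float:
    """Estimate how long a reply takes at a given generation rate."""
    return tokens / tokens_per_second

# A modest 150-token answer at an assumed 4 tokens/s on a CPU:
cpu_time = response_time_seconds(150, 4.0)   # 37.5 seconds
# The same answer at an assumed 40 tokens/s on a capable GPU:
gpu_time = response_time_seconds(150, 40.0)  # 3.75 seconds
```

Even a tenfold throughput improvement is the difference between an unusable half-minute wait and an acceptable few seconds.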
There is ongoing work on quantization in projects like ggllm. The downside is that quantization reduces the effectiveness of the models, which may already be small to start with. A 2-bit 7B parameter model may not melt your laptop, but it may not give you a decent answer either.
The direct consequence of these performance limitations is a severe compromise in usability. For real-time tasks such as chatbots, a rapid response is key. But, with a few tokens per second output, these high-end CPUs cannot meet the user's expectations for swift interactions. And the generated heat implies a shorter device lifespan and increased operational costs.
GPU models: the hardware hurdle
Given that CPUs reach their performance limits when running LLMs, we need to look at GPUs. GPUs have become the preferred choice for running machine learning models for a reason. But be aware that not all GPUs can effectively handle the workload that comes with LLMs.
Let's look at an example of why. Deploying a 13-billion parameter model on an AWS EC2 G5 instance gives us access to an Nvidia A10G card. This is by no means a bad GPU, but running a 13B parameter LLaMA-compatible model, like WizardLM v1.1 13B q4, shows it is not enough. A request with a 2,500-token prompt still takes around 30 seconds to process and answer. That's a query that GPT-3.5 or GPT-4 start answering almost immediately.
So, once again, we hit the performance issues already discussed. The models are constrained by the RAM on the graphics card. Nvidia released the A10G in April 2021, but with its 24 GB of RAM it can only run smaller models. And even with smaller quantized models, it can't fit enough layers into the GPU to speed up processing.
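The layer-offloading constraint can be sketched with simple arithmetic. Assuming, for illustration, that a model's layers are roughly equal in size (real layers vary, and the KV cache and scratch buffers also claim VRAM):

```python
def layers_on_gpu(model_gb: float, n_layers: int, vram_gb: float) -> int:
    """Estimate how many equally-sized layers fit in a VRAM budget
    (the offload strategy used by llama.cpp-style runtimes)."""
    per_layer_gb = model_gb / n_layers
    return min(n_layers, int(vram_gb // per_layer_gb))

# Illustrative numbers: a 40B model at 4-bit is roughly 20 GB of weights
# spread over ~60 layers; on a 24 GB A10G we might have ~18 GB usable
# after the KV cache and buffers take their share:
print(layers_on_gpu(20.0, 60, 18.0))
```

When not all layers fit, the remainder runs on the CPU, and the whole pipeline slows down to something close to CPU speed.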
This means that we need modern GPUs, which brings a double-whammy of issues. First, the processor shortage, along with the crypto boom, has caused problems with stock availability. This is getting better, but there's still some scarcity. Second, there is the cost of a modern card. An RTX 4090 can cost over $2,000. An older RTX 3090 still sells for over $1,500. Specialised hardware like the Nvidia V100 goes for around $5,000, and newer hardware like the H100 costs more than $30,000. That is, if you can find one for sale. Of course, we also need to account for extra costs, like the server where we install the GPU.
Unfortunately, VPS services like Linode or Hetzner can't help us here, as they don't provide GPU-enabled servers. And, last we checked, you can't buy physical servers with a GPU from the main hosting providers. You will need a custom order, which will take a long time due to the availability issues we discussed.
The result is a high total cost of ownership (TCO), along with supply chain issues in provisioning the tools we need.
Costs in the Cloud: the economic impediment
Given all the problems described, the solution seems to go back to the cloud. Granted, this brings back challenges about data protection, regulatory compliance and performance customization. But given most organisations are likely to be using the cloud, these may be mitigated.
But the costs of this option are considerable. Consider the example of running an LLM on an AWS ml.p3.2xlarge machine, which includes a single V100. This powerful instance can provide the necessary horsepower to handle LLMs efficiently. It also costs $3.825 per hour. If you spend the retail price of the card on this instance, you will get 55 days of use. Even considering the full TCO of the card, you will at most run this instance for a single quarter. With what you pay in one year of use, you could service 4 cards.
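The arithmetic behind that comparison is simple, taking the on-demand price above and the roughly $5,000 retail price for a V100 mentioned earlier:

```python
HOURLY_RATE = 3.825   # AWS ml.p3.2xlarge, on-demand (USD/hour)
CARD_PRICE = 5_000    # approximate V100 retail price (USD)

hours_per_card = CARD_PRICE / HOURLY_RATE
days_per_card = hours_per_card / 24
print(f"{days_per_card:.1f} days of rental cost as much as the card")  # ~54.5 days

yearly_cost = HOURLY_RATE * 24 * 365
print(f"${yearly_cost:,.0f} for a year of continuous use")
```

A year of continuous on-demand use comes to over $33,000, several times the price of the card itself, which is why the comparison with buying hardware keeps coming up.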
The other major providers don't fare much better. A similar configuration on Google Cloud costs $2.48 per hour for the GPU alone (CPU and other components not included in the price). In Azure, the price of the instance (CPU and GPU) is around $3 per hour. In all these cases, including AWS, you can negotiate discounts with long-term commitments. But even with the 50% discounts we have seen, you would still pay the TCO of two V100s while getting access to one.
There are other players in the space that make costs more bearable. For example, Paperspace, recently acquired by DigitalOcean, offers a V100 for $2.30 per hour. This can be a better option for those who don't want the long-term commitments. But it still works out to a very high price.
The TCO of GPU-enabled servers in the cloud is high, more so when compared to the costs of a service like the GPT-3.5 API. And this is without considering the in-house expertise required to run our own models. It is possible, but a realistic choice only for the biggest companies.
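To make that comparison concrete, here is a rough per-token sketch. Both figures below are assumptions for illustration: the API rate reflects the roughly $0.002 per 1K tokens that GPT-3.5 cost at the time of writing, and the self-hosted throughput is a hypothetical sustained rate.

```python
API_PRICE_PER_1K = 0.002     # USD per 1K tokens (assumed GPT-3.5 rate)
GPU_HOURLY = 3.825           # AWS ml.p3.2xlarge on-demand (USD/hour)
GPU_TOKENS_PER_SECOND = 30   # assumed sustained self-hosted throughput

gpu_tokens_per_hour = GPU_TOKENS_PER_SECOND * 3600
gpu_cost_per_1k = GPU_HOURLY / gpu_tokens_per_hour * 1000
print(f"API:         ${API_PRICE_PER_1K:.4f} per 1K tokens")
print(f"Self-hosted: ${gpu_cost_per_1k:.4f} per 1K tokens")
```

Under these assumptions, self-hosting costs more than ten times as much per token, and that gap only closes if the GPU is kept busy around the clock.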
Other options: API and trust issues with Azure
Seeing that running the services ourselves is complicated, we can try to approach the issue from a new angle. Is there any existing API service that is compliant with our privacy expectations?
Azure has been promoting their Azure OpenAI service, offering access to GPT-3.5 and GPT-4 with a strong privacy-oriented setup. This is a very promising step in the right direction. We could rely on state-of-the-art services without compromising our data.
Of course, you may be aware of other concerns relevant to our discussion. As the following blog post lists, Azure has been the victim of many breaches. As we write this post, a new breach has affected over two dozen organisations. Azure's response to previous breaches has been less than optimal. The platform's current security profile would make us uneasy about using its services with sensitive data.
Going back to the source, OpenAI, is not likely to be an option either. There are their ongoing problems with European regulators over the provenance of their training data, and there are reasons to mistrust their commitments to privacy. Alternative providers like Anthropic don't fare better: their terms of service seem to allow them to use prompts and outputs to improve their services.
This creates a complicated situation. We can't seem to trust the model providers. And the cloud providers, trying to meet our needs, seem to open other challenges of their own. With proper risk management, we may be able to handle those risks, but this is not optional: a strong operational framework is mandatory to reduce the damage from any breach.
Conclusion: the reality of AI as SaaS
In the foreseeable future, we might be more reliant on APIs for advanced chat models than we'd like. Running these models on our own is possible, but it is not without its challenges and costs. From compatibility issues between models to the costs of running powerful machines, the road to self-reliance is steep.
Thus, for many, the reality of AI might manifest more as a Software-as-a-Service (SaaS) rather than an internal service. While this may not be the ideal scenario for all, it presents a viable alternative that balances performance, cost, and usability. We can still hope for advancements that make running LLMs independently more feasible. Until then, our journey towards private, self-hosted AI remains a challenging one.