
Host Llama3 with Python Flask on AWS EC2

Mohan Sai Teki · Towards AWS · May 5, 2024

In this blog, I will share the steps you need to take to host Llama3-8B with Python Flask on an AWS EC2 machine.

Llama3 is the latest state-of-the-art LLM from Meta, setting a new benchmark in model performance.

Llama3 benchmark performance (image taken from the official website)

Without wasting time, let's see how you can host this state-of-the-art LLM.

First, get Llama3 access…

Llama3 is open source, but you need to share your details with Meta to use it. Go to the Hugging Face website and provide your details after signing up. Once you complete this, you will get access to the model files within about an hour. You can also download the LLM from the Meta website, but that download link is valid for only 24 hours; you need to fill in your details again after 24 hours if you want to download it again.
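Once access is granted, you will need to authenticate on the EC2 machine before the weights can be downloaded. One option (not part of the original code, which relies on the HF_TOKEN environment variable covered later) is the Hugging Face CLI:

pip install -U "huggingface_hub[cli]"
huggingface-cli login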

Launching EC2 machine…

You can follow your preferred approach for creating the EC2 machine, e.g. from the console, the AWS CLI, etc. (a CLI sketch follows the list below). For this demo, I used the below EC2 configuration:

  • AMI: Deep Learning OSS Nvidia Driver AMI GPU PyTorch 2.2.0 (Amazon Linux 2) 20240420 (you can also use Ubuntu or plain Amazon Linux 2 and install the Nvidia driver and PyTorch yourself)
  • Instance Type: g5.xlarge
  • Disk Space: A minimum of 50GB is required
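
As mentioned above, if you go the AWS CLI route, a minimal sketch looks like this (the AMI ID, key name, and security group are placeholders you must replace with your own; the 100GB volume comfortably covers the 50GB minimum):

aws ec2 run-instances \
  --image-id ami-xxxxxxxx \
  --instance-type g5.xlarge \
  --key-name my-key \
  --security-group-ids sg-xxxxxxxx \
  --block-device-mappings '[{"DeviceName":"/dev/xvda","Ebs":{"VolumeSize":100}}]'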

Once you launch the machine, log in to it and check the Nvidia driver status using the below command:

nvidia-smi

nvidia-smi command output

Installing required Python packages…

Below is the command to install the Python packages used for this demo (transformers, torch, flask, accelerate, and bitsandbytes):

pip install transformers torch flask accelerate bitsandbytes

Let’s see the code…

You can directly access the code here.

Python code to run Llama3 with Flask

Code Explanation…

Adding HF_TOKEN as an env variable

Since Llama3 is a gated model, we need to use HF_TOKEN to download it.
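
A minimal sketch of that step (the token value is obviously a placeholder; you can also export HF_TOKEN in the shell instead of hard-coding it):

import os

# Token for downloading the gated Llama3 weights from Hugging Face
os.environ["HF_TOKEN"] = "hf_xxxxxxxxxxxx"  # replace with your own token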

Loading the LLM into memory

In the above code, the AutoModelForCausalLM class is used to load the LLM into memory. If the model is not already available on the machine, it will be downloaded directly from Hugging Face.

load_in_4bit=True is an important parameter because it is used to quantize the LLM.
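
Roughly, the loading step looks like the following; the exact model_id and the device_map="auto" argument are my assumptions and may differ slightly from the original screenshot:

from transformers import AutoModelForCausalLM

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"  # assuming the instruct variant

# load_in_4bit=True quantizes the weights (via bitsandbytes) so the 8B model
# fits comfortably in the GPU memory of a g5.xlarge;
# device_map="auto" places the layers on the GPU.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    load_in_4bit=True,
)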

What is Quantization?

Every LLM has parameters. In Llama3-8B, 8B refers to 8 billion parameters.

Whenever you load the LLM into memory, each parameter occupies a certain amount of space. In our case, Llama3 uses 16 bits (2 bytes) of memory for each parameter. So the entire Llama3-8B takes ~16GB (8B * 2 bytes) of memory just to load the weights.

For Llama3-70B it will take ~140GB (70B * 2 bytes) of memory. If you are rich, you can launch a machine with more than ~140GB of GPU memory, but not everyone can afford that.

To solve this, people usually decrease the number of bytes used for each parameter, i.e. reducing 2 bytes to 1 byte or even 4 bits. This is called quantization.

Once you have done that, your LLM uses less memory and responds faster, but the drawback is a slight drop in output quality. For normal use cases, it is fine.
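
As a quick back-of-the-envelope check (weights only, ignoring activations and other overhead), here is the arithmetic in Python:

# Approximate memory needed just to hold the weights of Llama3-8B
params = 8e9                        # 8 billion parameters
print(params * 2 / 1e9, "GB")       # fp16: 2 bytes per parameter   -> ~16 GB
print(params * 1 / 1e9, "GB")       # 8-bit: 1 byte per parameter   -> ~8 GB
print(params * 0.5 / 1e9, "GB")     # 4-bit: 0.5 byte per parameter -> ~4 GB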

So in our case, I loaded the LLM in 4 bits just by adding the load_in_4bit=True parameter. Thanks to the transformers package (together with bitsandbytes) for hiding all the complexity of quantization.

You can also use load_in_8bit=True with a g5.xlarge machine.

Initializing the tokenizer

In the above code, the AutoTokenizer class is used for tokenization.
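
A sketch of that line, reusing the same model_id as before:

from transformers import AutoTokenizer

# The tokenizer converts input text into the token ids Llama3 expects
tokenizer = AutoTokenizer.from_pretrained(model_id)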

Loading the model and tokenizer into the pipeline

The pipeline class from the transformers package hides all the complex code involved in different LLM use cases like sentiment analysis, named entity recognition, feature extraction, etc., and provides a simple interface to use them.

In this example, we use the "text-generation" task, which is the bucket Llama3 falls under, and along with it we pass the initialized model and tokenizer as inputs.
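
Roughly, that step looks like this, reusing the model and tokenizer objects created above:

from transformers import pipeline

# Reuse the quantized model and tokenizer loaded above instead of letting
# the pipeline download its own copy
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)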

Code for the Flask app

In the above code, we expose an endpoint at http://<HOST_NAME>:5000/generate

When a user wants to ask a question, they have to pass it in the query parameter like below:

http://<HOST_NAME>:5000/generate?query=hi

I chose this approach because it's simple and requires less code from the Flask point of view.

We pass the user query as a user message in the prompt and hand that to the pipeline (the pipe variable).

Once Llama3 generates the response, we parse it and return it as a JSON response.
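
Putting the Flask part together, a minimal sketch looks like the following; it assumes the tokenizer and pipe objects from the snippets above, and max_new_tokens, the chat-template call, and the "response" field name are my assumptions rather than the exact code in the screenshot:

from flask import Flask, request, jsonify

app = Flask(__name__)

@app.route("/generate")
def generate():
    # The user's question arrives as a query parameter: /generate?query=hi
    query = request.args.get("query", "")

    # Wrap the question as a user message and build the Llama3 chat prompt
    messages = [{"role": "user", "content": query}]
    prompt = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )

    # Generate with the pipeline and return only the newly generated text
    outputs = pipe(prompt, max_new_tokens=256, return_full_text=False)
    return jsonify({"response": outputs[0]["generated_text"]})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)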

Run the Flask app; it will take some time to start, as it has to load the LLM.
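
Assuming the code is saved as app.py (a file name I am assuming), you can start it with:

python app.py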

Results…

Llama3 response from the Flask app

With this, we are hosting Llama3 with Flask.

By the way, in the above code you can replace the model_id with any text-generation model available on Hugging Face. If you face any issues, you may need to tweak the input parameters a little, but the way we load the model and tokenizer stays the same.

I hope you didn’t waste your time reading this blog. See you in the next one!
