Dive into Generative AI with Amazon Bedrock and AWS Lambda Function URL Response Streaming

Ike Gabriel Yuson
Published in Towards AWS · 11 min read · Apr 16, 2024


Generative AI has taken the industry by storm, and many companies and developers are itching to integrate this new technology into their existing applications. However, as simple as it may seem at face value, people quickly realize that it’s not as simple as calling an API, especially if you want to stream responses the way current LLM applications like ChatGPT do. You might hear the following questions and pain points from developers who are getting their feet wet developing Generative AI solutions:

How do I stream responses like the behavior in ChatGPT?
Is there a straightforward yet serverless way to do this?
How do I connect this with my frontend application?

In this blog post, all of these pain points will hopefully be resolved with the combination of two emerging technologies within AWS’s suite of services: Amazon Bedrock and AWS Lambda Function URLs. As a bonus, there is also a tutorial on how to implement this behavior in the frontend with React.

Why should you use AWS Lambda Function URLs?

The most common reason to use AWS Lambda Function URLs (fURLs) for the project we are going to build is to bypass the current limitations of API Gateway. API Gateway has a maximum integration timeout of 29 seconds, while Lambda functions can run for up to 15 minutes. If your current infrastructure hosts APIs with API Gateway and AWS Lambda, you might notice that all your requests time out at around 30 seconds even if you configured your Lambda function’s timeout to be greater than this. That is API Gateway’s limitation, not Lambda’s. With fURLs, however, you can bypass this limit and use the full 15-minute window, which is more than enough for our integration with Amazon Bedrock.

Time it takes to stream a 3000-token response from Amazon Bedrock.

To give you some context, this is a sample streaming response of 3000 tokens using Anthropic’s Claude V2 foundation model from Amazon Bedrock. As you can see, it clearly exceeds API Gateway’s timeout. If this were hosted in a Lambda function behind API Gateway, the request would time out and the remaining tokens would be cut off.

Another limitation is that Lambda has a size limit of 6MB for both request and response payloads. One might argue that 6MB is quite big for merely transferring plain text. However, what if you need to transfer files like images and PDFs? Remember, Generative AI is not limited to text. The most popular way to bypass this limit is to use S3 pre-signed URLs. That approach, however, degrades the user experience on the client side of your application:

client requests a pre-signed URL → receives the pre-signed URL → uses the pre-signed URL → uploads or downloads the object to/from S3

Having to request a pre-signed URL first and only then upload or download the object to or from an S3 bucket adds latency to your application and, at the same time, a fair amount of operational overhead. With the fURL response streaming feature, you can bypass the 6MB limit without hurting your user experience. Note that streamed responses have a 20MB soft limit in AWS Lambda, which can be increased by raising a support ticket.
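
To make the extra round trip concrete, here is a minimal sketch of the pre-signed URL workaround using the AWS SDK for JavaScript v3. The bucket and key names are placeholders for illustration only:

// presign.mjs (sketch of the pre-signed URL workaround this post avoids)

import { S3Client, PutObjectCommand } from "@aws-sdk/client-s3";
import { getSignedUrl } from "@aws-sdk/s3-request-presigner";

const s3 = new S3Client({ region: "us-east-1" });

// Step 1: the backend generates a short-lived upload URL for a placeholder bucket/key.
export const getUploadUrl = () =>
  getSignedUrl(
    s3,
    new PutObjectCommand({ Bucket: "my-bucket", Key: "uploads/report.pdf" }),
    { expiresIn: 300 } // URL is valid for 5 minutes
  );

// Step 2: only after receiving this URL can the client PUT the file to S3,
// which is the extra round trip (and latency) described above.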

What about using WebSockets?

Yes, you can still use API Gateway’s WebSocket support to emulate our desired behavior. However, it requires a significant amount of operational overhead. If your application already uses REST APIs, dedicating WebSockets to a single Generative AI feature is quite a hassle, since you would be maintaining an entirely new type of protocol.

Hands-on example

Overall architecture of AWS Lambda Function URL streaming response and Amazon Bedrock.

Let’s implement the architecture above using the Serverless Framework. You may find the repository over here to follow along:

Prerequisites

  1. Configure your AWS profile: export AWS_PROFILE=<your-profile-name>
  2. Install dependencies: npm install
  3. Request access to Anthropic’s Claude V2 foundation model in the AWS Management Console.
Request access for foundation models in Amazon Bedrock.

To request access, go to Amazon Bedrock in the AWS Management Console, click on Model access, and request access to the Claude foundation model under the Anthropic section. Note that you might be prompted to submit use case details before you are granted access to some foundation models.

Configuring the Lambda function

In our serverless.yml, this is how you configure a Lambda function to use its response streaming feature and to make sure it has the necessary permissions and configuration to integrate with Amazon Bedrock:

# serverless.yml

...
functions:
  bedrock-streaming-response:
    handler: index.handler
    url:
      invokeMode: RESPONSE_STREAM
    iamRoleStatements:
      - Effect: "Allow"
        Action:
          - "bedrock:*"
        Resource: "*"
    timeout: 120
...

In the example above, we set url.invokeMode to RESPONSE_STREAM. The default is BUFFERED, where Lambda invokes your function using the Invoke API operation and the invocation results only become available once the payload is complete; we don’t want this behavior. With RESPONSE_STREAM, on the other hand, Lambda invokes your function using the InvokeWithResponseStream API operation, and your function streams payload results as soon as they become available. This is the behavior we are trying to build.

Additionally, we edit our Lambda function’s execution role to be able to access Amazon Bedrock. This is crucial because failing to set this up will return an AccessDenied error as soon as you call Amazon Bedrock APIs within your Lambda function.
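
The example above grants bedrock:* on all resources for brevity. If you want to tighten it, a least-privilege statement could look roughly like the sketch below; the model ARN assumes the us-east-1 region and the Claude V2 model used in this post:

# serverless.yml (alternative: a tighter permission than bedrock:*)

iamRoleStatements:
  - Effect: "Allow"
    Action:
      - "bedrock:InvokeModelWithResponseStream"
    Resource: "arn:aws:bedrock:us-east-1::foundation-model/anthropic.claude-v2"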

Finally, the default timeout of newly created Lambda functions is 3 seconds, which is arguably not enough to stream responses from Amazon Bedrock. That is why we set the timeout to 120 seconds, comfortably above API Gateway’s limit, to show that we have bypassed it.

Next is to configure our Lambda function code. You may refer to the code block below:

// index.mjs

import {
  BedrockRuntimeClient,
  InvokeModelWithResponseStreamCommand,
} from "@aws-sdk/client-bedrock-runtime";

export const handler = awslambda.streamifyResponse(
  async (requestStream, responseStream, context) => {
    const client = new BedrockRuntimeClient({
      region: "us-east-1",
    });

    // Claude expects alternating \n\nHuman: and \n\nAssistant: conversational turns.
    const prompt = "Create me an article about amazon bedrock";
    const claudePrompt = `\n\nHuman: ${prompt}\n\nAssistant:`;

    // Inference parameters for Anthropic's Claude V2 (explained below).
    const body = {
      prompt: claudePrompt,
      max_tokens_to_sample: 2048,
      temperature: 0.5,
      top_k: 250,
      top_p: 0.5,
      stop_sequences: [],
    };

    const params = {
      modelId: "anthropic.claude-v2",
      contentType: "application/json",
      accept: "*/*",
      body: JSON.stringify(body),
    };

    console.log(params);

    const command = new InvokeModelWithResponseStreamCommand(params);

    const response = await client.send(command);
    const chunks = [];

    // Forward each completion chunk to the client as soon as Bedrock emits it.
    for await (const chunk of response.body) {
      const parsed = JSON.parse(
        Buffer.from(chunk.chunk.bytes, "base64").toString("utf-8")
      );
      chunks.push(parsed.completion);

      responseStream.write(parsed.completion);
    }

    console.log(chunks.join(""));
    responseStream.end();
  }
);

Here you can see how we use the InvokeModelWithResponseStreamCommand API operation of Amazon Bedrock from the AWS SDK for JavaScript v3. In the parameters of this API operation, the modelId indicates that we use Anthropic’s Claude V2 foundation model via its Amazon Bedrock model identifier, anthropic.claude-v2. We also supply the body, which needs to be a JSON string of the inference parameters for Anthropic’s Claude V2. These are the parameters:

  1. prompt — the prompt you send to the model. You can build this programmatically by extracting values from the Lambda function’s event object (see the sketch after this list). In this hands-on, however, for simplicity’s sake, we hard-coded the prompt to be the following: \n\nHuman: Create me an article about amazon bedrock \n\nAssistant:. For proper response generation, as indicated in Anthropic’s documentation, you need to format your prompt using alternating \n\nHuman: and \n\nAssistant: conversational turns.
  2. max_tokens_to_sample — the maximum number of tokens the model will generate in its response. The higher the value, the longer the response from Amazon Bedrock can be. However, it is important to know that longer responses incur additional cost.
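
As a rough sketch of the programmatic approach mentioned in item 1 (not what the repository does), you could read the prompt from the function URL event that Lambda passes as the streamified handler’s first argument. The { "prompt": ... } request shape below is an assumption for illustration:

// index.mjs (hypothetical variant: reading the prompt from the request body)

const getPrompt = (event) => {
  // The function URL event carries the HTTP body as a string,
  // base64-encoded whenever isBase64Encoded is true.
  const raw = event.isBase64Encoded
    ? Buffer.from(event.body, "base64").toString("utf-8")
    : event.body;
  // Assumes the client POSTs JSON like { "prompt": "..." }; fall back to the hard-coded prompt otherwise.
  return JSON.parse(raw ?? "{}").prompt ?? "Create me an article about amazon bedrock";
};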

The temperature, top_k, and top_p inference parameters influence the model’s response by controlling its randomness and diversity. They are optional, but it is important to understand them. Consider the following example:

“Don’t cry over spilled ”

How this works is that the model determines which words are most likely to be the next token and collects them with their corresponding probabilities, for example:

  • milk — 70%
  • juice — 20%
  • water — 10%

If the specified temperature value is high, the probability of choosing a less probable word, in this case “water”, increases. In other words, a higher temperature produces a more diverse response. In our Lambda code, where the temperature is set to 0.5, the response’s diversity sits in the middle: the output the model generates is neither too diverse nor too constrained.

Additionally, the top_k parameter is the number of candidate tokens the model considers for the next position. In the example above, if we set top_k to 1, only the word “milk” will be considered; if the value is 2, the words “milk” and “juice” will be considered. In our Lambda code, the model considers the 250 most probable candidates for each next token.

Lastly, the top_p parameter is a cumulative probability threshold. In the example above, if top_p is set to 0.7, only the word “milk” will be considered, since the model samples from the smallest set of tokens whose probabilities add up to 70%.

Another inference parameter is stop_sequences: an array of strings that, when encountered in the output, cause the model to stop generating. As of this writing, Anthropic Claude models stop on the string "\n\nHuman:" by default.
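
To tie the explanation back to the code, here is the same request body from the handler above with each knob annotated; the values are simply the ones used in this post, not recommendations:

// index.mjs (the same request body, annotated)

const body = {
  prompt: claudePrompt,        // alternating \n\nHuman: / \n\nAssistant: conversational turns
  max_tokens_to_sample: 2048,  // hard cap on generated tokens; longer responses cost more
  temperature: 0.5,            // closer to 0 = more deterministic, closer to 1 = more diverse
  top_k: 250,                  // only the 250 most probable candidate tokens are considered
  top_p: 0.5,                  // sample from the smallest token set whose probabilities sum to 0.5
  stop_sequences: [],          // generation stops as soon as any of these strings appears
};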

You can learn more about these inference parameters in the official AWS documentation over here.

Why are you not using Python?

I would use Python if I could. However, fURL response streaming is, as of now, only available out of the box in the Node.js runtime. To stream responses in other languages, you can build your own custom runtime or use the Lambda Web Adapter. For the sake of simplicity in this hands-on, we’ll stick with Node.js. Hopefully there will be out-of-the-box support for other languages soon.

Invoking your Lambda Function URL in your terminal

First, you need to deploy your serverless application with the following command in your terminal: npm run sls -- deploy. After this, you will see your fURL in the output in your terminal. It has the following format: https://<url-id>.lambda-url.<region>.on.aws/.

AWS Lambda Function URL output after Serverless Deploy

You can call your Lambda function using the curl command. Here’s a sample:

curl -N https://<your-url-id>.lambda-url.<your-region>.on.aws/

We use the -N flag to tell curl not to buffer the output. Without this flag, the response will still stream, but curl will flush it line by line, which does not really emulate the token-by-token streaming behavior of popular LLM applications.

Streaming Amazon Bedrock’s response via fURL in the terminal

How do you connect this with your frontend application?

This has been one of the most common questions when developing Generative AI applications. You may find the repository over here to follow along:

Prerequisite

  • Install dependencies: npm install

Many experienced frontend developers will already have processed a streamed response before, and there are plenty of ways to do this, but the React code below is a simple way of handling this behavior.

// App.jsx

...
const [data, setData] = useState("");

const handleClick = async () => {
  try {
    const response = await fetch("<your-function-url>");

    const reader = response.body.getReader();
    const decoder = new TextDecoder();

    reader.read().then(function processText({ done, value }) {
      if (done) {
        console.log("Stream completed");
        return;
      }

      const textChunk = decoder.decode(value, { stream: true });
      setData((prevData) => prevData + textChunk);
      return reader.read().then(processText);
    });
  } catch (error) {
    console.error("Failed to fetch data:", error);
  }
};

return (
  <>
    ...
    <div className="card">
      <button onClick={handleClick}>Call Lambda Function!</button>
      <p>{data}</p>
    </div>
  </>
);
...

Calling the fURL

const response = await fetch("<your-function-url>");

This line sends an HTTP request to the URL specified (replace <your-function-url> with the actual URL). It uses the await keyword to pause execution until the promise returned by fetch() is resolved, providing the response object.

Setting up stream reading

const reader = response.body.getReader();
const decoder = new TextDecoder();

Here, response.body is a readable stream representing the body of the response. getReader() returns a readable stream reader that is used to read chunks from the stream. TextDecoder is used for converting the raw data (by default assumed to be utf-8 encoded) in the stream into a string.

Processing the stream

reader.read().then(function processText({ done, value }) {
  if (done) {
    console.log("Stream completed");
    return;
  }

  const textChunk = decoder.decode(value, { stream: true });
  setData((prevData) => prevData + textChunk);
  return reader.read().then(processText);
});
  • reader.read(): Initiates the reading of the first chunk of the stream.
  • processText: This function is a recursive callback used to handle each chunk of data.
  • done: A boolean that indicates if the readable stream has been fully read.
  • value: A Uint8Array containing the chunk of data read from the stream.
  • If done is true, the stream has ended, and it logs "Stream completed".
  • If done is false, it decodes the chunk (value) into a text string and updates the data state by appending the new text chunk to any previously received text.
  • It then recursively calls reader.read().then(processText) to process the next chunk of data.
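
If the recursive .then() chaining feels awkward, an equivalent version using an async while loop reads a bit more linearly; this is just an alternative sketch of the same logic inside handleClick:

// App.jsx (alternative: async/await loop instead of recursion)

const reader = response.body.getReader();
const decoder = new TextDecoder();

while (true) {
  const { done, value } = await reader.read();
  if (done) {
    console.log("Stream completed");
    break;
  }
  // Decode the Uint8Array chunk and append it to the rendered text.
  const textChunk = decoder.decode(value, { stream: true });
  setData((prevData) => prevData + textChunk);
}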

Invoking your Lambda Function URL in your React frontend

First, run your React application with the command npm run dev. By default, your local frontend will be hosted at http://localhost:5173/ unless you changed the port. Click on the “Call Lambda Function!” button and see the magic!

Displaying Amazon Bedrock’s response in a React frontend.

You might notice that the response is not nicely formatted. This is where prompt engineering comes in handy: you can steer the model toward a given output format depending on how you formulate your prompts.
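
For example, simply asking for a format in the prompt already goes a long way; the wording below is only illustrative:

// index.mjs (hypothetical prompt that asks for Markdown output)

const prompt =
  "Create me an article about Amazon Bedrock. " +
  "Format it as Markdown with a title, section headings, and short paragraphs.";
const claudePrompt = `\n\nHuman: ${prompt}\n\nAssistant:`;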

Caveats of using fURLs

Choosing fURLs over API Gateway also has its cons. Although fURLs bypass some of API Gateway’s limitations, there are still a few things to consider:

  1. No integration with Amazon Cognito. Unlike API Gateway, fURLs cannot be secured with Amazon Cognito, so you lose its built-in authentication and authorization capabilities.
  2. Out of the box, fURLs only support IAM authorization (AuthType AWS_IAM) or no auth at all (see the sketch after this list).
  3. They push you toward building a Lambdalith and offer no per-endpoint metrics.
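
Regarding the second caveat: if IAM authorization is enough for your use case, enabling it is a small change in serverless.yml. The sketch below assumes the Serverless Framework’s aws_iam authorizer option for function URLs; callers then have to sign their requests with SigV4:

# serverless.yml (function URL restricted to IAM-authenticated callers)

functions:
  bedrock-streaming-response:
    handler: index.handler
    url:
      invokeMode: RESPONSE_STREAM
      authorizer: aws_iam # only SigV4-signed requests are accepted
    timeout: 120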

Despite these caveats of choosing fURLs over API Gateway, writing this article made me think that fURLs may well be made for Lambda’s integration with Amazon Bedrock. They offer a simple, functional, and serverless solution rather than requiring the complex infrastructure of traditional architectures.

Integrating Generative AI into applications through Amazon Bedrock and AWS Lambda Function URLs is a robust, serverless way to stream responses efficiently. This approach not only circumvents the limitations of API Gateway but also improves the user experience by minimizing latency and operational overhead. Developers can leverage these technologies to handle large payloads and streaming responses seamlessly, paving the way for more dynamic, responsive, and user-friendly applications. I hope this blog post empowers developers to harness the power of Generative AI in their projects and makes deployment simpler, smoother, and more effective.


Hi, I am Iggy, a DevOps Engineer based in the Philippines and the current User Group Leader of AWS User Group Davao. https://www.linkedin.com/in/iggyyuson/