
EURO2024 - Lambda and RDS Connections

We are already well aware that Lambda does not play well with connection-based services, especially long-lived ones like RDS. Hence we use RDS Proxy to help Lambda manage its connections. However, doing so does not magically solve the problem; it only helps mitigate it.

Topology

+--------+         +--------+          +-----------+         +-----+
| API Gw | ------> | Lambda | -------> | RDS Proxy | ------> | RDS |
+--------+         +--------+          +-----------+         +-----+
              REQ              Client                  RDS
                               conn                    conn

With this topology, connecting Lambda to RDS becomes much cheaper, because RDS Proxy maintains long-lived connections to RDS while Lambda only opens and drops a short-lived client connection to the proxy. Hence, when we diagnose this kind of problem, we look at:

  1. RDS Proxy client connections.
  2. RDS connections.
  3. Lambda instances (the number of alive Lambda instances).
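
All three signals are visible in CloudWatch. A minimal sketch of pulling them programmatically is shown below; the metric names (ClientConnections, DatabaseConnections, ConcurrentExecutions) are the standard RDS Proxy / Lambda ones, while the proxy name, function name, and the one-hour window are placeholders for illustration.

import { CloudWatchClient, GetMetricDataCommand } from '@aws-sdk/client-cloudwatch'

const cw = new CloudWatchClient({})

// Placeholder identifiers -- replace with your own proxy / function names.
const PROXY_NAME = 'my-rds-proxy'
const FUNCTION_NAME = 'my-api-handler'

export const fetchDiagnosticMetrics = async () => {
  const EndTime = new Date()
  const StartTime = new Date(EndTime.getTime() - 60 * 60 * 1000) // last hour

  const { MetricDataResults } = await cw.send(
    new GetMetricDataCommand({
      StartTime,
      EndTime,
      MetricDataQueries: [
        {
          // 1. Client connections held against RDS Proxy.
          Id: 'proxyClientConns',
          MetricStat: {
            Metric: {
              Namespace: 'AWS/RDS',
              MetricName: 'ClientConnections',
              Dimensions: [{ Name: 'ProxyName', Value: PROXY_NAME }],
            },
            Period: 60,
            Stat: 'Maximum',
          },
        },
        {
          // 2. Connections RDS Proxy holds against the database itself.
          Id: 'proxyDbConns',
          MetricStat: {
            Metric: {
              Namespace: 'AWS/RDS',
              MetricName: 'DatabaseConnections',
              Dimensions: [{ Name: 'ProxyName', Value: PROXY_NAME }],
            },
            Period: 60,
            Stat: 'Maximum',
          },
        },
        {
          // 3. Alive Lambda instances, approximated by concurrent executions.
          Id: 'lambdaConcurrency',
          MetricStat: {
            Metric: {
              Namespace: 'AWS/Lambda',
              MetricName: 'ConcurrentExecutions',
              Dimensions: [{ Name: 'FunctionName', Value: FUNCTION_NAME }],
            },
            Period: 60,
            Stat: 'Maximum',
          },
        },
      ],
    }),
  )
  return MetricDataResults
}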

The Normal Timeline of the Connections on Lambda

Lambda LifeCycle        Created                                       Terminated
                        |                                                      |
Lambda Event          = |   Exec Ev.A           Exec Ev.B          Exec Ev.C   |
                        |      |                    |                  |       |
Lambda Exec Timeout   = +-+========|-----------+===========|------+=========|--X
                          | (create)           | (reuse)          | (reuse)    |
Lambda+RDS Connection =   +----------------------------------------------------X (shared)
                          |                    |                  |
Event A Timeout Thres.=   |===========X (30s)  |                  |
Event B Timeout Thres.=                        |===========X (30s)|
Event C Timeout Thres.=                                           |===========X (30s)

The Problem

During the production phase, our service experienced a cascading KnexTimeoutError problem: RDS Proxy still had client connections available, RDS still had connections available, and CPU and RAM were more than sufficient, yet our ORM failed to connect to RDS Proxy and kept throwing Knex timeout errors.

Of course, the first thing we tried was tuning the RDS pool's connection settings. But we had already done that during our load tests, and as expected, it did not really help.

After hours of investigation, and a lot of attempts to reproduce the error, we found that the Knex timeout happens when the ORM has already established a connection and the Lambda task then times out. The next invocation on that Lambda instance gets the KnexTimeoutError immediately.

Connection gets tainted

Lambda LifeCycle        Created                                       Terminated
                        |                                                      |
Lambda Event          = |   Exec Ev.A           Exec Ev.B          Exec Ev.C   |
                        |      |                    |                  |       |
Lambda Exec Timeout   = +-+========|-----------+===========X==|---+==|---------X
                          | (create)           | (reuse)   |      | (reuse, got KnexTimeout)
Lambda+RDS Connection =   +--------------------------------X~~~~~~~~~~~~~~~~~~~X (shared)
                          |                    |           |      |
Event A Timeout Thres.=   |===========X (30s)  |           |      |
Event B Timeout Thres.=                        |===========X (30s)|
Event C Timeout Thres.=                                           |===========X (30s)

The Solution

We now know how to reproduce the problem, though not yet why it happens. However, we can prevent the first condition from occurring: we implement our own execution deadline and make it shorter than Lambda's deadline, so the Lambda invocation itself never times out.

The code to enforce this deadline constraint can be as simple as the snippet below:

// Race the operation against a timer; whichever settles first wins the race.
export const deadline = <T>(timeoutInMs: number, operation: () => Promise<T>): Promise<T> => {
  return Promise.race([
    operation(),
    new Promise<T>((_resolve, reject) => {
      setTimeout(() => {
        reject(new Error(`deadline ${timeoutInMs}ms exceeded`))
      }, timeoutInMs)
    }),
  ])
}

Please be aware that the operation() callback will not stop immediately when the deadline fires, as this wrapper has no way to gracefully cancel it. The operation simply keeps executing and eventually settles its own promise, whose result is then ignored. Hence the wrapper should only be put around safe operations, e.g. calling another (read-only) API that may take a long time.
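
For example, wiring it into a handler might look like the sketch below. The handler shape, the fetchReport call, and the 2-second safety buffer are purely illustrative assumptions; the point is that the budget is derived from context.getRemainingTimeInMillis() so our deadline always fires before Lambda's own.

import type { APIGatewayProxyEvent, APIGatewayProxyResult, Context } from 'aws-lambda'

// Hypothetical slow, read-only operation (e.g. a heavy SELECT or a downstream API call).
declare const fetchReport: (id: string) => Promise<unknown>

export const handler = async (
  event: APIGatewayProxyEvent,
  context: Context,
): Promise<APIGatewayProxyResult> => {
  // Keep a 2s buffer so `deadline` (the helper defined above) always trips before Lambda's timeout.
  const budgetInMs = context.getRemainingTimeInMillis() - 2_000

  const report = await deadline(budgetInMs, () => fetchReport(event.pathParameters?.id ?? ''))
  return { statusCode: 200, body: JSON.stringify(report) }
}

If the deadline trips, the invocation fails fast with a controlled error instead of letting Lambda kill the execution environment mid-query.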

Yet during solution exploration we stumbled upon this example of Lambda graceful shutdown. It intrigued us: why would a graceful shutdown need an extension? It turns out the extension costs you a bit of money, since it uses CloudWatch metrics to populate its data.

What is this extension?

This extension equips Lambda with CloudWatch (Lambda Insights) stats, and as a side effect of registering it, the Lambda runtime process also starts receiving a SIGTERM signal before the execution environment shuts down. What does that mean, you may ask?

Given this code:

process.on('SIGTERM', () => {
  console.log('SIGTERM received')
})

If you attach this code in your Lambda handler module, the callback will normally never be called. But with Lambda Insights attached, it will. Basically, the extension gives the Lambda instance a way to receive this OS signal event.
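
For reference, attaching the extension is a deployment-side change: you add the Lambda Insights layer to the function. A sketch with CDK (the stack shape and the chosen Insights version are assumptions for illustration) could look like this:

import { Duration, Stack, type StackProps } from 'aws-cdk-lib'
import * as lambda from 'aws-cdk-lib/aws-lambda'
import type { Construct } from 'constructs'

export class ApiStack extends Stack {
  constructor(scope: Construct, id: string, props?: StackProps) {
    super(scope, id, props)

    new lambda.Function(this, 'ApiHandler', {
      runtime: lambda.Runtime.NODEJS_18_X,
      handler: 'index.handler',
      code: lambda.Code.fromAsset('dist'),
      timeout: Duration.seconds(30),
      // Adding Lambda Insights attaches the extension layer; once an external
      // extension is registered, the runtime process starts receiving SIGTERM.
      insightsVersion: lambda.LambdaInsightsVersion.VERSION_1_0_119_0,
    })
  }
}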

After we attached this hook, it magically solved the problem. Why? We speculate that it is because the SIGTERM is now received: our ORM (MikroORM) relies on the pg-pool library to handle the Postgres connection pool, and we suspect pg-pool reacts to this SIGTERM as well, letting it gracefully destroy the pending (tainted) database connections.

After attaching the Lambda Insights extension

Lambda Insight Extension                                SIGTERM
Lambda LifeCycle        Created                            |          Terminated
                        |                                  |                   |
Lambda Event          = |   Exec Ev.A           Exec Ev.B  |       Exec Ev.C   |
                        |      |                    |      |           |       |
Lambda Exec Timeout   = +-+========|-----------+===========Xo-----+=========|--X
                          | (create)           | (reuse)   ||     | (reuse)    |
Lambda+RDS Connection =   +--------------------------------Xo------------------X (shared)
                          |                    |           |      |
Event A Timeout Thres.=   |===========X (30s)  |           |      |
Event B Timeout Thres.=                        |===========X (30s)|
Event C Timeout Thres.=                                           |===========X (30s)

And since we now have this glorious SIGTERM event, we can also leverage it by attaching the code below to our handler module, purely for tracing's sake.

import type { MikroORM } from '@mikro-orm/core'

// Bind stdout.write so we can log synchronously without console.log (see the note below).
const print = process.stdout.write.bind(process.stdout)

// ORM clearing
const tearDown = async (dbOrm: Promise<MikroORM> | undefined): Promise<void> => {
  if (!dbOrm) {
    print('[tearDown] ORM object is NOT available. No need to tear down.\n')
    return
  }
  print('[tearDown] ORM object is available. Tearing down.\n')
  const orm = await dbOrm
  print('[tearDown] Validating connection.\n')

  const isConnected = await orm.isConnected()
  if (!isConnected) {
    print('[tearDown] Connection is no longer available. No need to tear down.\n')
    return
  }
  print('[tearDown] Connection is still valid. Force closing it.\n')
  await orm.close(true)
  print('[tearDown] Connection closed.\n')
}

// SIGTERM handler: https://docs.aws.amazon.com/lambda/latest/operatorguide/static-initialization.html
// Listening for OS signals that can be handled, reference: https://docs.aws.amazon.com/lambda/latest/dg/runtimes-extensions-api.html
// Termination signals: https://www.gnu.org/software/libc/manual/html_node/Termination-Signals.html
export const handleSigTermEvent = (orm: Promise<MikroORM>) =>
  process.on('SIGTERM', async () => {
    print('[runtime] cleaning up\n')
    // perform the actual clean-up work here.
    await tearDown(orm)
    print('[runtime] bye.\n')
    process.exit(0)
  })

Note that we do not really use console.log here: console.log is asynchronous by nature, and within the SIGTERM hook the remaining execution window is roughly 200ms or less, which is not guaranteed to be enough for those queued writes to be scheduled and flushed.
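
For completeness, the wiring might look like the sketch below (a hypothetical handler module; the ORM configuration and handler body are placeholders). The important part is that both the ORM promise and the SIGTERM hook are set up once, at cold start, outside the handler:

import { MikroORM } from '@mikro-orm/core'
import type { APIGatewayProxyEvent, APIGatewayProxyResult } from 'aws-lambda'

// Created once per Lambda instance (cold start) and reused across invocations.
// In this sketch the connection options come from the usual MikroORM config file.
const ormPromise = MikroORM.init()

// Register the SIGTERM hook (handleSigTermEvent from above) once, also at cold start.
handleSigTermEvent(ormPromise)

export const handler = async (_event: APIGatewayProxyEvent): Promise<APIGatewayProxyResult> => {
  const orm = await ormPromise
  // ... do the actual work with orm.em here ...
  return { statusCode: 200, body: 'ok' }
}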

Finally, as a result of implementing both the extension hook and the deadline constraint, we have never seen this error again :)