EURO2024 - Unsynced Serial Problem
The Problem
Main reason Lambda's Idempotent problem -- Lambda would guarantee to deliver the same message at least once, NOT exactly once. This means.
Arrival of
SQS Events = --A-B--CD---E--------------------------------------->
Lambda 0 = ---A==---C==-E==---------------| 15m end
Lambda 1 = -----B==--D==E==----------------------------------| 15m end
- the idle time
= the execution time
We are well aware of this problem - and hence implement a transaction to cover the problem. But it poses a Database connection problem prior to this problem. Hence we relaxed our rule so that the connection can flow easily (previously we have an API call in the middle of transaction block; and Lambda Timeout midflight hence the it causes the Database Connection issue -- we later learned that it is not actually the cause; As Lambda actually just lacked of SIGTERM
hence upon Lambda Timeout it failed to handle the connection dismissal.).
After the relaxation of the problem Lambda queue handler that normally may received the same message twice. Hence they race to produce the same result. (Update as paid twice. Hence it claim the serial twice).
The only bad result come out of this is that it update the same transaction twice; However we use optimistic approach to track the Serial on Redis; Hence it doesn't inherit the safety of single Transaction; (this trade off was made to optimize database query, and prevent multiple transaction claiming same (Range of Serial)).
The Solution
To fix this; we employ another Redis atomic key blocking strategy; By using Redis to gurantee idemptotent of the service per TransactionId.
The way we look at this is that to make the service idempotent; We only allow 1 lambda to consume transaction at a time. Using Redis to INCR:<transactionId>
then check INCR
result if it is 1, you are good to go -- set EXPIRE
for 10~15 seconds and perform your own task. Otherwise Discard this request.
export class IdempotentManager {
public constructor(private redis: () => RedisClient) {}
public async requestForAccess(resourceKey: string, blockForSeconds: number): Promise<boolean> {
const rd = this.redis()
const val = await rd.incr(resourceKey)
if (val > 1) {
// disallow access
return false
}
// allow access but expire it within X seconds.
await rd.expire(resourceKey, Math.floor(blockForSeconds), 'NX')
return true
}
}
Side effect of this Idempotent problem
EVENT A: Transaction A (normal)
SET NEW VALUE = -----2-------------------------------------------------------------------->
A
EVENT B: Transaction B (2 events, raced)
SET NEW VALUE = ----------3--------------------------------------------------------------->
SET NEW VALUE = ----------4--------------------------------------------------------------->
B
ON REDIS = --1--2----4--------------------------------------------------------------->
| | |
V V V
ON RDS = --1--2----4----------------------------------- (Because RDS trust REDIS) ->
A B
This create the gap in RDS note that 3 is missing. How do we identify this gap in single SQL Query? To do this we can leverage on LAG()
a function that allow you to select data
based on previous row.
SELECT
transaction_id,
serial_number_from,
last_serial,
last_serial - serial_number_from AS diff
FROM (
SELECT
serial_number_from,
LAG(serial_number_to) OVER (ORDER BY serial_number_to) AS last_serial,
transaction_id,
T.purchase_status
FROM
transaction_postcards P
INNER JOIN transactions T ON T.id = P.transaction_id
WHERE
P.variant_campaign_campaign_id = 'EURO2024' ORDER BY serial_number_to
) A
WHERE
last_serial + 1 <> serial_number_from;