Imagine that you are writing an e-commerce application and you have a customer with a full shopping cart who is ready to check out and give you their money. If they can’t complete the transaction fast, you may lose the sale. Things get much worse when you have 1,000 or 10,000 customers all trying to buy at the same time! Unless you have applied some advanced performance optimization techniques, your application may fail at the worst time possible.
This article expands on the most advanced technique in my earlier blog post, “New Relic and JMeter Performance Remediation – Part 3”: introducing transactional asynchronous messaging.
Processing a Shopping Cart Order
Assume you already have a durable order number stored in a SQL database, and that your user has a cart full of goods or services. In order to fully complete an order from the checkout screen, your system might have to do many things:
- Verify that the credit card passes the Luhn algorithm check
- Verify that the name and address are something you could actually ship to
- Authorize the charge on their credit card
- Mark the order as “purchased” in the database via a SQL transaction
- Alter the inventory in the system for items in the cart as sold instead of pending order via a SQL transaction
- Send an email to the purchaser with the order details
- Create a shipment notification and send a pick list to the warehouse via an internal SOAP API
- Update 3rd-party sales analytics systems with information about what has sold via an external REST API
- Return an Order Complete message to the user and display it in the application
In many systems, all these operations happen in order, synchronously, before the customer sees the results from hitting the Submit Order button. However, completing all these items could take 10 seconds or more, especially when interactions with external systems such as credit card authorization systems or 3rd-party APIs have unpredictable latencies. Some of these items are processing-intensive, such as updating inventory systems, retrieving all the customer and order details, and composing a good-looking email. Remember that latency is the enemy of throughput: by Little’s Law, a server that can hold only a fixed number of requests in flight completes fewer of them per second as each request takes longer.
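To put rough numbers on the Little’s Law point: the law says the average number of requests in flight (L) equals throughput (λ) times average latency (W). The sketch below rearranges that to estimate throughput for a fixed concurrency limit. The limit of 200 in-flight requests and the 0.2-second fast path are illustrative assumptions, not measurements from any real system.

```python
def max_throughput(concurrent_slots: int, latency_seconds: float) -> float:
    """Little's Law, L = lambda * W, rearranged to lambda = L / W."""
    if latency_seconds <= 0:
        raise ValueError("latency must be positive")
    return concurrent_slots / latency_seconds


# Assumption: the web tier can hold 200 checkout requests in flight at once.
SLOTS = 200

sync_rate = max_throughput(SLOTS, 10.0)   # 10-second synchronous checkout
async_rate = max_throughput(SLOTS, 0.2)   # hypothetical 0.2-second checkout

print(f"synchronous:  {sync_rate:.0f} orders per second")
print(f"asynchronous: {async_rate:.0f} orders per second")
```

With the same hardware budget, cutting the time each request holds a slot is what raises the ceiling on orders per second.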
Sync Systems Fall Down and Go Boom
When traffic is light, it might be OK to make users wait 10 seconds for an order to complete. However, that delay will kill your application and melt your servers when you release a popular, limited-quantity item, announce to your customers that it goes on sale at noon on Monday, and then face a traffic spike of 1,000+ users trying to order at the same time.
Async to the Rescue
If you introduce transactional asynchronous messaging and the Publish / Subscribe pattern, you can pare down the amount of work you have to do at checkout time to a much more minimal set:
- Verify that the credit card passes the Luhn algorithm check
- Publish a message containing the order ID and the information in the shopping cart checkout forms (name, address, credit card information) to a queue
- Alter the inventory in the system for items in the cart as Sold instead of Order Pending (do it in a fast in-memory database such as Redis instead of in a SQL database)
- Return an Order Submitted message to the user and display it in their web browser. Tell them to expect an email shortly regarding the status of their order.
- If the user tries to edit an order while it is in this state, you can tell them that the order is still pending and can’t be changed until it is processed.
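The pared-down checkout steps above could be sketched roughly as follows. This is a minimal illustration, not a production design: `order_queue` and `inventory` are hypothetical in-process stand-ins for a durable message broker and an in-memory store such as Redis, and the function names, SKU, and message layout are all assumptions made for the example.

```python
import json
import queue
import threading

# Hypothetical stand-ins: a real system would use a durable broker and an
# in-memory store such as Redis instead of these process-local objects.
order_queue = queue.Queue()
inventory = {"sku-123": {"pending": 2, "sold": 0}}
inventory_lock = threading.Lock()


def luhn_valid(card_number: str) -> bool:
    """Return True if the card number passes the Luhn checksum."""
    cleaned = card_number.replace(" ", "").replace("-", "")
    if not (cleaned.isascii() and cleaned.isdigit()) or len(cleaned) < 12:
        return False
    total = 0
    # Walk right to left, doubling every second digit.
    for index, char in enumerate(reversed(cleaned)):
        digit = int(char)
        if index % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


def submit_order(order_id: str, form: dict, cart: dict) -> dict:
    """Do only the minimal synchronous work, then defer the rest."""
    if not luhn_valid(form["card_number"]):
        return {"order_id": order_id, "status": "Rejected: invalid card number"}

    # Publish everything the queue workers will need later. Card data in a
    # real message would need encryption in transit and at rest.
    message = {"order_id": order_id, "form": form, "cart": cart}
    order_queue.put(json.dumps(message))

    # Move stock from pending to sold in the fast in-memory store.
    with inventory_lock:
        for sku, quantity in cart.items():
            inventory[sku]["pending"] -= quantity
            inventory[sku]["sold"] += quantity

    return {"order_id": order_id, "status": "Order Submitted"}


form = {"name": "Pat Example", "address": "1 Main St",
        "card_number": "4111 1111 1111 1111"}  # well-known test number
result = submit_order("order-1001", form, {"sku-123": 2})
print(result["status"])
```

Nothing in this path talks to a payment provider, a SQL database, or a 3rd-party API, which is what keeps the response time short and predictable.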
Maximize Deferred Processing with Queues
Once you have the order stored in a durable queue, you can have queue workers read the order messages and process them at a rate that will not overwhelm your database servers. All these operations can be done through message queue subscribers:
- Verify that the name and address are something you could actually ship to
- Authorize the charge on the credit card
- Mark the order as “purchased” in the database via a SQL transaction
- Alter the inventory in the system for items in the cart as Sold instead of Order Pending via a SQL transaction, and update the in-memory database so that the two stay in sync. Remember that the in-memory database is just a cache for the true state of the system.
- Create a shipment notification and send a pick list to the warehouse via an internal SOAP API
- Update 3rd-party sales analytics systems with information about what has sold via an external REST API
- If anything goes wrong, cancel the order, and reverse the inventory adjustments
- Send an email to the purchaser with the order details and either a confirmation or a cancellation notice
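A queue subscriber that follows this list might look something like the sketch below. It shows only the control flow: run the steps in order, and if any step fails, cancel the order, reverse the inventory adjustment, and still send the customer an email. Every function here is a hypothetical stub that appends to a log; real steps would call the payment provider, the SQL database, and so on.

```python
from typing import Callable, Dict, List

Order = Dict[str, object]
log: List[str] = []  # records what happened, for illustration only


class StepFailed(Exception):
    """Raised by a step when the order cannot be completed."""


def verify_address(order: Order) -> None:
    if not order.get("address"):
        raise StepFailed("address is not shippable")
    log.append("address verified")


def authorize_charge(order: Order) -> None:
    if order.get("card_declined"):
        raise StepFailed("card declined")
    log.append("charge authorized")


def mark_purchased(order: Order) -> None:
    log.append("order marked purchased")


def reverse_inventory(order: Order) -> None:
    log.append("inventory reversed")


def send_email(order: Order, outcome: str) -> None:
    log.append(f"email sent: {outcome}")


STEPS: List[Callable[[Order], None]] = [
    verify_address,
    authorize_charge,
    mark_purchased,
]


def process_order(order: Order) -> str:
    """Run each step in order; on failure, cancel and undo the reservation."""
    try:
        for step in STEPS:
            step(order)
        outcome = "confirmation"
    except StepFailed:
        reverse_inventory(order)
        outcome = "cancellation"
    send_email(order, outcome)  # the customer hears back either way
    return outcome


print(process_order({"order_id": "order-1001", "address": "1 Main St"}))
print(process_order({"order_id": "order-1002", "address": "1 Main St",
                     "card_declined": True}))
```

The remaining steps from the list (SQL inventory update, warehouse pick list, analytics) would slot into `STEPS` the same way.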
What About Parallelism?
Arguably, only items 1, 2, and 3 of these follow-up operations have to be done in strict sequence. You could use a message fan-out pattern to handle post-order-submission chores for items 4, 5, and 6, and save items 7 and 8 (potential cancellation and email notification) for when items 4 and 5 have completed. Your end user doesn’t care whether some 3rd-party sales analytics system gets updated, so don’t wait for item 6 to complete before sending the user an update.
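The ordering described here can be sketched with a thread pool standing in for the broker’s fan-out. In a real deployment the fan-out would be done by the messaging system itself (separate subscribers on separate queues); the pool, the `step` helper, and the task names below are assumptions used only to show which work waits on which.

```python
from concurrent.futures import ThreadPoolExecutor, wait


def step(name: str):
    """Build a hypothetical stand-in for one piece of order processing."""
    def run(order_id: str) -> str:
        return f"{name}:{order_id}"
    return run


verify_address = step("verify_address")      # item 1
authorize_charge = step("authorize_charge")  # item 2
mark_purchased = step("mark_purchased")      # item 3
update_inventory = step("update_inventory")  # item 4
notify_warehouse = step("notify_warehouse")  # item 5
update_analytics = step("update_analytics")  # item 6
send_email = step("send_email")              # item 8


def process(order_id: str, pool: ThreadPoolExecutor) -> list:
    done = []
    # Items 1 to 3 run strictly in sequence.
    for task in (verify_address, authorize_charge, mark_purchased):
        done.append(task(order_id))

    # Items 4 to 6 fan out and run in parallel.
    inventory = pool.submit(update_inventory, order_id)
    warehouse = pool.submit(notify_warehouse, order_id)
    pool.submit(update_analytics, order_id)  # nobody waits on analytics

    # Items 7 and 8 wait only for items 4 and 5. A real worker would catch
    # a failure from result() here and cancel the order (item 7).
    wait([inventory, warehouse])
    done.append(inventory.result())
    done.append(warehouse.result())
    done.append(send_email(order_id))
    return done


with ThreadPoolExecutor(max_workers=4) as pool:
    for line in process("order-1001", pool):
        print(line)
```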
Sometimes you might have to choose carefully what has to be done synchronously. If you want to avoid PCI compliance issues, you may want to use a 3rd party payment provider such as Stripe, Square or PayPal and hand off those credit card numbers without storing them. However, you might have to wait for that to complete before you show the user an Order Submitted message. Alternatively, you might need to ensure your message is encrypted both in transit and at rest.
Tradeoffs abound here.
Platforms, Tools and Libraries
Many messaging systems can support these patterns, including IBM MQ, RabbitMQ, Microsoft MSMQ, AWS SNS, AWS SQS, AWS EventBridge, Google Cloud Pub/Sub, Azure Service Bus, and ZeroMQ. The Apache project has a kitchen sink full of messaging systems: ActiveMQ (see Amazon MQ), Apollo, Kafka, Pulsar, Qpid, RocketMQ, and Synapse. Common libraries used to work with some of these systems include JMS, Spring Messaging, MassTransit, Celery, and Qpid Proton.
Fully featured asynchronous messaging systems have these capabilities:
- Audit message delivery
- Support local development
- Retry faulting messages
- Process messages in parallel
- Process a dead letter queue of completely failed deliveries
- Encrypt messages in transit and at rest
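To illustrate two of these capabilities, retrying faulting messages and parking complete failures on a dead letter queue, here is a toy in-process sketch. It is not how any particular broker implements them; `MAX_ATTEMPTS`, `drain`, and `flaky_handler` are hypothetical names, and real systems usually add delays between retries.

```python
from collections import deque
from typing import Callable, Deque, Dict, List, Tuple

MAX_ATTEMPTS = 3  # assumption: real brokers make this configurable


def drain(messages: List[str],
          handler: Callable[[str], None]) -> Tuple[List[str], List[str]]:
    """Deliver each message, retrying failures up to MAX_ATTEMPTS times."""
    pending: Deque[Tuple[str, int]] = deque((m, 1) for m in messages)
    delivered: List[str] = []
    dead_letters: List[str] = []
    while pending:
        message, attempt = pending.popleft()
        try:
            handler(message)
            delivered.append(message)
        except Exception:
            if attempt < MAX_ATTEMPTS:
                pending.append((message, attempt + 1))  # requeue for a retry
            else:
                dead_letters.append(message)  # park it for manual review
    return delivered, dead_letters


calls: Dict[str, int] = {}


def flaky_handler(message: str) -> None:
    """Hypothetical handler: 'flaky' fails once, 'poison' always fails."""
    calls[message] = calls.get(message, 0) + 1
    if message == "poison" or (message == "flaky" and calls[message] == 1):
        raise RuntimeError(f"could not process {message}")


delivered, dead = drain(["ok", "flaky", "poison"], flaky_handler)
print("delivered:", delivered)
print("dead letters:", dead)
```

The transient failure succeeds on its second attempt, while the message that can never be processed ends up on the dead letter list instead of blocking the queue.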
One async system I designed in 2015 used MassTransit and RabbitMQ, for both the simplicity of writing C# clients and servers in MassTransit, and the proven reliability of RabbitMQ. If I had to pick a replacement transport in 2019 for that system, I might select SQS or Azure Service Bus instead, since MassTransit now supports them, and both of these SaaS queue systems require nothing more than an Internet connection and a credit card on file as supporting infrastructure.
Case Study: Huge Lift from Asynchronous Messaging
In one hospitality e-commerce system I helped design, the initial performance target was 1,000 simultaneous users. The system used a multi-tenant, normalized SQL database for both OLTP and OLAP purposes, which is convenient but potentially risky. The original intent was that the system ought to be load tested thoroughly to verify performance. Alas, pressure to deliver features outweighed that concern, and the load testing performed was with only 50 concurrent users instead of 1,000, in part because at the time load testing was both difficult and expensive. A 1,000-user load test with the first-generation load test systems chosen would have cost about $750 to run once!
Hard experience had taught the organization that major failures could lead to loss of clients and a hit to their reputation – the system faced an early production failure when 400 people vying to reserve hotel rooms as soon as they were made available crashed the software.
This Microsoft stack shop used C# and SQL Server to implement the system. When remediation efforts began, a 100 concurrent user load test would bring the system to its knees.
After 9 months of work, applying fixes to SQL queries, adding both session and data caching with Redis (a fast in-memory data structure server), and horizontally scaling the processing workload, we could still only get about halfway to the target. The first year after we began remediation, on the biggest day the system again failed under load when the number of users exceeded 1,000. We had to double our expected load targets to 2,000 concurrent users!
Moving inventory bookkeeping to Redis increased performance to about 1,300 simultaneous users. However, it took adding asynchronous messaging in a manner similar to the scenario explained above to get the system to perform at a higher level. We used MassTransit and RabbitMQ to handle the pub-sub transactions and used AWS Key Management Service and envelope encryption to encrypt the data at rest within the RabbitMQ system as part of our PCI compliance efforts.
In the lead up to the second year’s peak day, we were able to exhaustively load test the system in tandem with the hospitality firm’s corporate client. We ran more than 1,000 load tests over the course of the next year, both large and small, to incrementally verify that the changes we were making worked. Once it was fully debugged, the system could handle more than 3,500 concurrent users in a load test. Then on peak day with real users reserving hotel rooms with real credit cards the system performed flawlessly, with just over 2,000 concurrent users observed using the system.
2,000 concurrent users might not sound like much, but the typical cart might contain a week’s worth of hotel rooms in a major city, with an average cart size that could be in the $1,000 range. Consumers get very upset if a thousand-dollar-plus transaction fails to complete! The system managed to handle transactions worth several million dollars, with all users submitting orders and getting transaction identification codes within 20 minutes. The orders were all completely fulfilled, including sending email to users confirming their orders, over the next few hours.
Conclusion
Although using asynchronous messaging can complicate the design of your systems, it can give a dramatic lift to performance once you have exhausted other avenues of performance improvement. Many different tools, libraries, and services support this paradigm. Even if the road is long to verify that the systems behave correctly after introducing asynchronous messaging, the results can be well worth the journey.
Richard Bullington-McGuire