Our processing capacity is split across many nodes for both redundancy and scalability. Prior to this event, one of our processing nodes was running at reduced capacity due to maintenance, and an internal error then caused the redundant node to fail. As a result, a small subset of customers received transaction declines. When our technicians began responding to the issue, they had difficulty identifying the problematic node. This cost us valuable time in bringing the node back online and also led to a working node being reset, which in turn affected additional customers. As additional information was received, the correct node was identified and immediately brought back online, completely resolving the issue.
As a result of this incident, we will be evaluating and making the following improvements:
• A revised response runbook that helps us better identify problematic nodes and prevents healthy nodes from being affected
• Better and more timely communication via status.cardknox.com
• Improved error messaging and monitoring to alert us sooner to these types of errors
• Redundant capacity that is maintained even during maintenance
• Automated failover of customers to healthy nodes
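For readers interested in what two of these items could look like in practice, here is a minimal sketch of health-based node selection. It is illustrative only and does not describe our production systems: the `Node` class, the node names, and the `nodes_to_reset` and `pick_node` functions are hypothetical, and the `healthy` flag is assumed to come from some monitoring check.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str      # hypothetical node identifier
    healthy: bool  # assumed result of the most recent health check


def nodes_to_reset(nodes):
    """Return only the nodes whose health check failed, leaving healthy nodes alone."""
    return [n.name for n in nodes if not n.healthy]


def pick_node(nodes, preferred):
    """Use the preferred node if it is healthy; otherwise fail over to a healthy one."""
    healthy = [n.name for n in nodes if n.healthy]
    if not healthy:
        raise RuntimeError("no healthy node available")
    return preferred if preferred in healthy else healthy[0]


# Hypothetical state: node-b has failed, the others are serving traffic.
nodes = [
    Node("node-a", healthy=True),
    Node("node-b", healthy=False),
    Node("node-c", healthy=True),
]

print(nodes_to_reset(nodes))       # candidates for a reset
print(pick_node(nodes, "node-b"))  # customer normally served by node-b
```

In this sketch, a customer whose usual node is unhealthy is routed to the first healthy node, and only nodes that fail their health check are ever listed as reset candidates.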
We know that any interruption to normal processing is extremely disruptive to our merchants, and we sincerely apologize for this incident. We strive to maintain excellence at all levels of our service. Incidents like this one show that we can do better, and we will.