NetDocuments initially received a Customer support request stating that ndMail had stopped working within their environment. While investigating that request, we began to receive additional tickets from Customers describing a similar scenario. At that point, the issue was escalated to the Cloud Operations team, which determined that multiple services, including Tasks, SmartView, and SetBuilder, were also affected.
During the investigation, we found that the only recent change was a WAN network configuration change on the core switches in Frankfurt. As part of our mitigation efforts, we reverted that change to the previous configuration, and the services resumed normal operation.
Our postmortem investigation revealed that the initial change to the Frankfurt switches was affected by a missing static summary route. The change had been implemented to correct a misconfiguration of summary routes so that log data would be routed correctly. The missing static summary route caused a subset of traffic to be dropped, and the loss of this traffic disrupted access between certain applications and the platform.
Mitigation Strategy
We reviewed the core switches in all data centers to ensure that the proper static summary routes were in place. No additional missing routes were found. We have expanded our post-change regression testing to include additional application-specific tests. We have also reviewed how we monitor the health of the impacted applications to improve our ability to detect this kind of issue.
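
For illustration only, a review of static summary routes could be automated along the lines of the sketch below. It is not the tooling used in this review. The function name `uncovered_subnets` and all prefixes are hypothetical, and it assumes the configured summary routes and the subnets that must stay reachable are available as lists of CIDR strings.

```python
import ipaddress


def uncovered_subnets(summary_routes, required_subnets):
    """Return the required subnets that no static summary route covers.

    Both arguments are lists of CIDR strings (hypothetical inputs, for
    example exported from a switch configuration and an inventory).
    """
    summaries = [ipaddress.ip_network(route) for route in summary_routes]
    missing = []
    for cidr in required_subnets:
        subnet = ipaddress.ip_network(cidr)
        # subnet_of() raises TypeError across IP versions, so compare
        # only networks of the same version.
        covered = any(
            subnet.version == summary.version and subnet.subnet_of(summary)
            for summary in summaries
        )
        if not covered:
            missing.append(cidr)
    return missing


# Hypothetical example: one summary route, two subnets that must be reachable.
print(uncovered_subnets(["10.20.0.0/16"], ["10.20.4.0/24", "10.30.1.0/24"]))
```

A non-empty result would indicate subnets whose traffic could be dropped for lack of a covering summary route, which is the condition described in this report.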