Apollo - Manager Portal

Incident Report for Viirtue

Postmortem

Update

Date: July 18, 2025, 1:09 PM EDT

A complete Root Cause Analysis (RCA) has been prepared and will be provided to all partners and customers with active support tickets related to this incident. If you do not currently have an open ticket but would like to request a copy of the RCA, please contact your account manager or submit a support ticket.

Incident Report: NJ Data Center Network Event & Related Service Impact

Date of Incident: July 14, 2025

Prepared by: Alberto Fernandez (Director of Platform Engineering)

Start Time: 10:53 AM EDT

Partial Restoration Time: 11:10 AM EDT

Full Resolution Time: 2:09 PM EDT

Services Impacted: Apollo services across NJ, FL, and Las Vegas data centers

Customer Impact: Potential voice degradation, call setup delays, or dropped sessions for some users during routing transitions.

Executive Summary

At 10:53 AM EDT, our monitoring systems detected a sharp and unusual increase in SBUS queue activity within our NJ data center. This triggered an immediate alert to our Network Operations and Engineering teams, who launched a rapid response investigation.

As a precautionary and proactive measure, we rerouted all real-time traffic away from NJ to our geographically redundant data centers in Las Vegas and Florida. This action ensured continuity of service while we investigated the root cause of the anomaly in NJ.
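To make the failover pattern concrete, here is a minimal Python sketch of priority-based routing across redundant cores. The core names, priorities, and health check are illustrative assumptions only, not Apollo's actual implementation.

    # Illustrative sketch only: priority-based failover across redundant cores.
    # Core names, priorities, and the health check are hypothetical.

    CORES = [
        {"name": "core2-nj", "priority": 1},  # primary (NJ)
        {"name": "core-lv", "priority": 2},   # geographic backup (Las Vegas)
        {"name": "core-fl", "priority": 2},   # geographic backup (Florida)
    ]

    def is_healthy(core: dict) -> bool:
        """Placeholder probe; a real system might check SIP OPTIONS responses,
        queue depth, or an HTTP health endpoint."""
        return core["name"] != "core2-nj"  # simulate NJ being drained

    def select_core(cores: list[dict]) -> dict:
        """Route new sessions to the healthy core with the best (lowest) priority."""
        healthy = [c for c in cores if is_healthy(c)]
        if not healthy:
            raise RuntimeError("no healthy cores available")
        return min(healthy, key=lambda c: c["priority"])

    print(select_core(CORES)["name"])  # -> "core-lv"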

During the transition period, we identified two additional, unrelated issues affecting the Las Vegas and Florida clusters. Each was diagnosed and mitigated independently. All systems are now fully operational and stable across all locations. Further details on these issues will be provided in the final RCA.

Potential Root Cause

The incident was triggered by a transient physical network cut in our NJ data center’s upstream routing path. While the disruption lasted approximately 5 minutes, it had downstream effects on signaling queues (SBUS) and session routing.

The system behaved as designed by rerouting traffic through redundant sites, but the rapid load redistribution temporarily stressed configuration limits, briefly impacting service quality.

Initial Remediation

  • NJ Network Cut Mitigation: Root cause confirmed with upstream provider. Cable rerouted and physical safeguards enhanced.
  • SBUS Monitoring: Sensitivity tuned for improved queue behavior detection (see the illustrative sketch below).
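As a rough illustration of the tuned queue-behavior detection mentioned above, the sketch below compares current queue depth against a rolling baseline rather than a fixed ceiling. The window size and sigma threshold are assumed values for illustration, not Viirtue's production settings.

    # Illustrative sketch: queue-depth anomaly detection against a rolling
    # baseline. Window size and sigma threshold are assumed values.
    from collections import deque
    from statistics import mean, stdev

    class QueueDepthAlert:
        def __init__(self, window: int = 60, sigma: float = 3.0):
            self.samples: deque[float] = deque(maxlen=window)
            self.sigma = sigma  # std-devs above baseline that trigger an alert

        def observe(self, depth: float) -> bool:
            """Record a sample; return True if it should raise an alert."""
            alert = False
            if len(self.samples) >= 10:  # need a minimal baseline first
                baseline, spread = mean(self.samples), stdev(self.samples)
                alert = depth > baseline + self.sigma * max(spread, 1.0)
            self.samples.append(depth)
            return alert

    monitor = QueueDepthAlert()
    for depth in [5, 6, 5, 7, 6, 5, 6, 7, 5, 6, 48]:
        if monitor.observe(depth):
            print(f"alert: queue depth {depth} far above baseline")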

Customer Communication

We recognize that service interruptions, even brief ones, can be disruptive to your operations. We remain committed to transparency, reliability, and continuous improvement.

Next Steps

  • Perform full internal post-incident review.
  • Continue monitoring all three data centers at increased granularity for the next 72 hours.
  • Deliver a final Root Cause Analysis (RCA) document within 7 business days.

If you have any further questions or would like to discuss the incident in more detail, please contact our support team.

Thank you for your continued trust.

Posted Jul 15, 2025 - 12:09 EDT

Resolved

All services have been fully restored and are now operational. An incident report will be shared later today, with a full root cause analysis to follow.
Posted Jul 14, 2025 - 14:09 EDT

Update

We’ve pinpointed the root cause and are currently validating the fix.
Posted Jul 14, 2025 - 13:50 EDT

Update

We’re continuing to work through service stabilization efforts across both regions. Teams remain engaged and progress is ongoing.
Posted Jul 14, 2025 - 13:26 EDT

Monitoring

The 503 has been removed from core2-nj, and voice traffic has been redirected back to this core.
Posted Jul 14, 2025 - 12:51 EDT

Update

We are aware of an audio issue affecting a subset of calls on the Las Vegas core and are investigating.
Posted Jul 14, 2025 - 12:35 EDT

Update

All voice and web portal traffic directed to the core2-nj core has been redirected to our other cores. Service should improve while we work on finding the root cause.
Posted Jul 14, 2025 - 11:22 EDT

Update

Due to ongoing connectivity issues at our NJ data center, we are proactively returning HTTP 503 responses to divert traffic away from the affected core. Services remain operational via our redundant infrastructure.
Posted Jul 14, 2025 - 11:12 EDT
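For illustration of the mechanism described above: returning 503s at the application edge lets upstream load balancers fail over automatically. A minimal WSGI sketch of this pattern follows; the DRAINING flag, port, and messages are hypothetical, not Viirtue's actual configuration.

    # Illustrative sketch: shed traffic with HTTP 503 so upstream load
    # balancers fail over to redundant cores. The DRAINING flag and port
    # are hypothetical.
    DRAINING = True  # in practice toggled by operations tooling

    def app(environ, start_response):
        if DRAINING:
            start_response("503 Service Unavailable",
                           [("Content-Type", "text/plain"),
                            ("Retry-After", "120")])
            return [b"Temporarily diverted to redundant infrastructure.\n"]
        start_response("200 OK", [("Content-Type", "text/plain")])
        return [b"OK\n"]

    if __name__ == "__main__":
        from wsgiref.simple_server import make_server
        make_server("127.0.0.1", 8080, app).serve_forever()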

Update

We are continuing to investigate this issue.
Posted Jul 14, 2025 - 11:06 EDT

Investigating

We are currently investigating an issue causing the manager portal to load slowly. Updates to follow.
Posted Jul 14, 2025 - 11:06 EDT
This incident affected: Apollo (Manager Portal, NJ2 Core Server, LV Core Server).