NOC Engineer / NOC Lead

<h2>NOC Engineer / NOC Lead</h2><p style="min-height:1.5em">Infrastructure operations · shared across customers</p><p style="min-height:1.5em"><strong>Reports to: </strong>Manager, NOC (or Director, Service Operations)</p><p style="min-height:1.5em"><strong>Location: </strong>Remote (US) with assigned shift; rotating coverage</p><p style="min-height:1.5em"><strong>Department: </strong>Infrastructure & DC Operations / Network Engineering</p><p style="min-height:1.5em"></p><h3>Position summary</h3><p style="min-height:1.5em">The NOC Engineer operates STN's 24/7 monitoring and first-response capability for GPU One (GPUaaS) infrastructure. The role triages alerts, executes documented runbooks, and coordinates with on-call specialists during incidents to protect customer SLAs.</p><h3>Key responsibilities</h3><ul style="min-height:1.5em"><li><p style="min-height:1.5em">Monitor infrastructure alerts, customer SLA dashboards, and system health on a 24/7 basis</p></li><li><p style="min-height:1.5em">Triage incidents and engage on-call SREs, Network, Hardware, or Field Engineering as needed</p></li><li><p style="min-height:1.5em">Execute documented runbooks for common platform, network, and hardware issues</p></li><li><p style="min-height:1.5em">Manage the incident lifecycle including initial customer notification and status updates</p></li><li><p style="min-height:1.5em">Coordinate planned maintenance windows and change windows with internal teams and customers</p></li><li><p style="min-height:1.5em">Update status pages and customer-facing communications during incidents</p></li><li><p style="min-height:1.5em">Maintain shift handoff documentation and active-incident logs</p></li><li><p style="min-height:1.5em">Support ticket queue handling including Tier 1 ticket resolution</p></li><li><p style="min-height:1.5em">Contribute to continuous improvement of monitoring coverage, alert quality, and runbooks</p></li><li><p style="min-height:1.5em">Work rotating shifts including 
nights, weekends, and holidays</p></li></ul><h3>Required qualifications</h3><ul style="min-height:1.5em"><li><p style="min-height:1.5em">3+ years in a NOC, SOC, or IT operations function</p></li><li><p style="min-height:1.5em">Hands-on experience with monitoring tools (Datadog, Prometheus, Grafana, PagerDuty, or equivalent)</p></li><li><p style="min-height:1.5em">Strong Linux skills and basic networking fundamentals</p></li><li><p style="min-height:1.5em">Excellent written and verbal communication, particularly under pressure</p></li><li><p style="min-height:1.5em">Willingness and ability to work rotating shifts including overnight coverage</p></li></ul><h3>Preferred qualifications</h3><ul style="min-height:1.5em"><li><p style="min-height:1.5em">GPU, HPC, or large-scale cloud infrastructure background</p></li><li><p style="min-height:1.5em">ITIL Foundation certification</p></li><li><p style="min-height:1.5em">Demonstrated on-call and major-incident response experience</p></li><li><p style="min-height:1.5em">Scripting skills (Python, Bash) for runbook automation</p></li></ul>
