LLMHire
© 2026 LLMHire. All rights reserved.


Member of Technical Staff, Cluster Management

Fireworks AI
San Mateo, CA · Onsite · 4 days ago
full-time · senior

About the Role

About Us:

At Fireworks, we’re building the future of generative AI infrastructure. Our platform delivers the highest-quality models with the fastest and most scalable inference in the industry. We’ve been independently benchmarked as the leader in LLM inference speed and are driving cutting-edge innovation through projects like our own function calling and multimodal models. Fireworks is a Series C company valued at $4 billion and backed by top investors including Benchmark, Sequoia, Lightspeed, Index, and Evantic. We’re an ambitious, collaborative team of builders, founded by veterans of Meta PyTorch and Google Vertex AI.

The Role:

As a Member of Technical Staff, Cluster Management at Fireworks AI, you will play a critical role in making our world-scale virtual AI cloud reliable, performant, and efficient. You will apply your expertise in large-scale distributed systems, cloud infrastructure, and operational excellence. You will partner closely with world-class software engineers and AI experts to scale cutting-edge AI platforms to meet fast-growing demand and ever-evolving application paradigms. This role is for someone passionate about operating highly robust, observable, and automated systems and enabling customer success.

Key Responsibilities:

  • Ensuring System Reliability: Ensure systems are designed and implemented with high availability, scalability, and performance. Focus on fault tolerance, disaster recovery, identifying and removing scaling bottlenecks, and performance optimization across our multi-cloud infrastructure.
  • Incident Management & Response: Lead efforts in incident detection, response, and resolution for critical production issues. Drive post-mortems to identify root causes and implement preventative measures to improve system reliability.
  • Observability & Monitoring: Develop, implement, and maintain comprehensive monitoring, alerting, logging, and tracing solutions to provide deep insights into system health and performance.
  • Automation & Toil Reduction: Identify and automate repetitive operational tasks to reduce toil and improve operational efficiency. Develop tools and scripts to streamline deployments, scaling, and system management.
  • Capacity Planning & Performance Tuning: Work proactively on capacity planning to ensure our infrastructure can gracefully handle growth and peak loads. Optimize system performance and resource utilization.
  • Reliability Best Practices: Collaborate with software engineers to embed reliability principles (e.g., SLOs, SLIs, error budgets) into the development lifecycle, promoting a culture of operational excellence.
  • On-call Rotation: Participate in a periodic on-call rotation to support our production environment and respond to critical alerts.

Minimum qualifications:

  • Bachelor's degree in Computer Science, related technical field, or equivalent practical experience.
  • 5+ years of experience in Site Reliability Engineering, DevOps, or a similar role focused on large-scale production systems.
  • Deep expertise in SRE principles and practices, including SLOs, SLIs, operational automation, incident management, and post-mortems.
  • Extensive hands-on experience with public cloud platforms (AWS, GCP, Azure), including compute, networking, storage, and database services.
  • Strong experience with containerization technologies (Docker) and orchestration platforms (Kubernetes).
  • Proficiency in designing and implementing robust monitoring, logging, and alerting systems using tools like Prometheus, Grafana, ELK stack, and distributed tracing.
  • Solid programming/scripting skills in at least one language (e.g., Python, Go) for automation and tool development.
  • In-depth knowledge of Linux operating systems, networking fundamentals, and system debugging.
  • Proven ability to troubleshoot complex issues across the entire stack.
  • Excellent communication, collaboration, and problem-solving skills.
  • Willingness to participate in on-call rotations.

Preferred qualifications:

  • Experience managing data-center-grade GPU clusters, including monitoring, troubleshooting, and fixing GPUs and peripherals such as HBM and RDMA-enabled networking.
  • Experience with machine learning infrastructure, model serving, or distributed AI frameworks.
  • Hands-on experience in security and data protection.

Why Fireworks AI?

  • Solve Hard Problems: Tackle challenges at the forefront of AI infrastructure, from low-latency inference to scalable model serving.
  • Build What’s Next: Work with bleeding-edge technology that impacts how businesses and developers harness AI globally.
  • Ownership & Impact: Join a fast-growing, passionate team where your work directly shapes the future of AI: no bureaucracy, just results.
  • Learn from the Best: Collaborate with world-class engineers and AI researchers who thrive on curiosity and innovation.

Fireworks AI is an equal-opportunity employer. We celebrate diversity and are committed to creating an inclusive environment for all innovators.

Required Skills

Python, PyTorch, Kubernetes, Docker, AWS, GCP, Azure, Scala

About Fireworks AI

Fast and affordable AI inference platform for production workloads.
