<div class="content-intro"><h3><strong><span style="font-family: arial, helvetica, sans-serif;">About xAI</span></strong></h3> <p><span style="font-family: arial, helvetica, sans-serif;">xAI’s mission is to create AI systems that can accurately understand the universe and aid humanity in its pursuit of knowledge. </span><span style="font-family: arial, helvetica, sans-serif;">Our team is small, highly motivated, and focused on engineering excellence. This organization is for individuals who appreciate challenging themselves and thrive on curiosity. </span><span style="font-family: arial, helvetica, sans-serif;">We operate with a flat organizational structure. All employees are expected to be hands-on and to contribute directly to the company’s mission. Leadership is given to those who show initiative and consistently deliver excellence. Work ethic and strong prioritization skills are important. </span><span style="font-family: arial, helvetica, sans-serif;">All employees are expected to have strong communication skills. They should be able to concisely and accurately share knowledge with their teammates.</span></p></div><h3>About the Role</h3> <p>The ML and Data Infrastructure team is responsible for building the foundational infrastructure that powers frontier AI models and truth-seeking agents—from petabyte-scale data acquisition and multimodal crawling, to web-scale search/retrieval systems, reliable high-throughput inference serving, low-level GPU/kernel optimizations, compiler/runtime innovations, and high-speed interconnect fabrics for massive clusters. 
In this role, you will collaborate across pre-training, multimodal, reasoning, and product teams in a fast-paced, meritocratic environment where you will tackle ambiguous, high-stakes problems with first-principles thinking and rigorous execution.</p> <h3>Responsibilities</h3> <ul> <li>Design, build, and operate petabyte-to-exabyte scale distributed systems for data acquisition, web crawling, preprocessing, filtering/classification, and multimodal pipelines (CPU/GPU workloads).</li> <li>Architect high-performance search/retrieval engines (vector/hybrid/semantic) at trillion-document scale, integrating with LLMs/agents for truth-seeking, low-hallucination reasoning, and real-time knowledge access.</li> <li>Develop reliable inference serving infrastructure: load balancing, autoscaling, KV caching, batching, fault-tolerance, monitoring (Prometheus/Grafana), CI/CD (Buildkite/ArgoCD), and benchmarking for 100% uptime and optimal tail latency.</li> <li>Optimize low-level performance: CUDA kernels (GEMM, attention), Triton/CUTLASS extensions, quantization/distillation/speculative decoding, GPU memory hierarchy, and model-hardware co-design for next-gen architectures.</li> <li>Innovate on compilers/runtimes (JAX/XLA/MLIR, custom features for Hopper/Blackwell), distributed profiling/debugging tools, and interconnect fabrics (copper/optical, 1.6T+, SerDes/photonics, topology simulation, vendor roadmaps).</li> <li>Manage complex workloads across clouds/clusters: orchestration (Kubernetes), data bookkeeping/verifiability, high-speed interconnect validation, failure analysis, and telemetry/automation for production reliability.</li> </ul> <h3>Required Qualifications</h3> <ul> <li>Strong systems engineering skills with proven impact on large-scale distributed infrastructure (data processing, search, inference, or cluster networking).</li> <li>Proficiency in Python and at least one compiled language (Rust, C++, Go, Java); experience building bespoke libraries, optimizing
performance, and debugging complex systems.</li> <li>Hands-on experience with at least one key area: petabyte-scale data pipelines/crawling (Spark/Ray/Kubernetes), web-scale search/retrieval (vector DBs, ranking, RAG), inference optimization (SGLang, kernels, batching), compiler features (JAX/XLA), or high-speed interconnects (optical/copper, SerDes, signal integrity).</li> <li>Deep understanding of distributed systems challenges: high-throughput ops/sec, latency/throughput tradeoffs, fault-tolerance, monitoring, and scaling production systems to billions of users or 100k+ GPUs.</li> <li>Passion for AI infrastructure: keeping up with SOTA techniques, first-principles problem-solving, meticulous organization/bookkeeping, and delivering rigorous, high-quality results.</li> </ul> <h3>Preferred Qualifications</h3> <ul> <li>Experience with multimodal data (images/video/audio), epistemics/truth-seeking in retrieval, or agentic systems (long-horizon reasoning, feedback loops).</li> <li>Low-level optimizations: CUDA kernel development (Tensor Cores, attention), GPU profiling (Nsight), low-precision numerics, or interconnect pathfinding (LPO/LRO/CPO, photonics).</li> <li>Production expertise in inference reliability (0% error target), CI/CD for ML, or cluster networking (topology, vendor collaboration, failure root-cause analysis).</li> <li>A track record of owning end-to-end projects in hyperscale environments, with strong debugging, vendor management, or open-source contributions (e.g., SGLang).</li> </ul> <h3>Annual Salary Range</h3> <p>$180,000 - $440,000 USD</p> <h3>Benefits</h3> <p>Base salary is just one part of our total rewards package at xAI, which also includes equity, comprehensive medical, vision, and dental coverage, access to a 401(k) retirement plan, short- and long-term disability insurance, life insurance, and various other discounts and perks.</p><div class="content-conclusion"><p><em>xAI is an equal opportunity employer. 
For details on data processing, view our </em><em><a href="https://x.ai/legal/recruitment-privacy-notice" target="_blank">Recruitment Privacy Notice</a>.</em></p></div>