SUDHAKAR CHUNDU

Professional Summary

Distinguished Cloud AI Architect and Platform Engineering Leader with 18+ years of hands-on experience building and operating large-scale GPU compute platforms, AI/ML infrastructure, and distributed systems at planetary scale. Expert in Slurm cluster management, GPU scheduling (NVIDIA A100/V100/T4/DGX), Kubernetes orchestration, and multi-cloud architecture (AWS, Azure, OCI). Lead global teams of 13+ engineers across Infrastructure, Security, Networking, and DevOps, managing $15M+ budgets while delivering $8M+ in documented cost savings. Proven track record designing scalable HPC and AI/ML platforms serving 10M+ users with 99.97% uptime across 15+ regions. Active conference speaker at KubeCon, SREcon, and MLOps World.

Key Achievements

$8M+ Annual Savings | 73% Cost Reduction
10M+ Users Served | 15+ Regions
99.97% GPU Cluster Uptime

Core Technical Skills

GPU/HPC Computing

Slurm, Slinky, NVIDIA BCM/DGX, vcluster, CUDA, TensorRT, A100/V100/T4 scheduling, Triton Inference Server, vLLM

AI/ML Infrastructure

PyTorch, Ray, Flyte, MLflow, Kubeflow, SageMaker, Azure ML, TorchServe, KServe, distributed training

Cloud Platforms

AWS (EKS, EC2, SageMaker, EMR), Azure (AKS, APIM, ML), OCI

Big Data & Streaming

Databricks, Apache Spark, Flink, Kafka, Hadoop/HDFS, Hive, Trino, Airflow, Delta Lake, Kinesis

Container Orchestration

Kubernetes (EKS/AKS), Docker, Helm, Istio Service Mesh, KEDA, ArgoCD, FluxCD

Infrastructure as Code

Terraform (50+ modules), Ansible, ARM/Bicep, CloudFormation, Packer, GitOps, policy-as-code (OPA, Checkov)

Security & Compliance

SOC2, HIPAA, FedRAMP, ISO 27001, HITRUST, DevSecOps (Snyk, SonarQube), Vault, ZTNA

SRE & Observability

SLI/SLO/SLA, error budgets, chaos engineering, Prometheus, Grafana, Datadog, OpenTelemetry, AIOps

Professional Experience

Distinguished Cloud AI Architect / Director of Platform Engineering

Trackonomy Systems | San Jose, CA
Oct 2023 – Present
  • GPU/HPC Infrastructure: Designed a Slurm-based GPU compute platform (65 GPUs across 8 nodes) using Slinky (Slurm-on-Kubernetes) and NVIDIA BCM; achieved 99.97% uptime serving 12+ enterprise clients for AI inference and training.
  • Serverless AI: Architected serverless GPU infrastructure using AWS Lambda containers, reducing inference costs by 65% while serving 5M+ daily predictions.
  • Cloud FinOps: Reduced cloud costs 73% ($10M→$2.7M/year) through GPU utilization optimization, fair-share scheduling, reserved instance planning, and vendor consolidation.
  • Platform Engineering: Built CI/CD pipelines for 100+ applications using GitHub Actions, Docker, Terraform, and GitOps (ArgoCD, FluxCD); reduced deployment time 95% (days→15 minutes).
  • Security & Compliance: Implemented DevSecOps with Trivy, Falco, OPA, Gitleaks, and HashiCorp Vault; achieved SOC2, HIPAA, FedRAMP, and HITRUST compliance for the GenAI/LLM platform.
  • Leadership: Built the team from 0 to 6 engineers | Managed $15M+ budgets | Led incident response and on-call rotations.

Senior SRE / Cloud Architect — ML Infrastructure

Wipro Technologies — OSDU Data Platform | Seattle, WA
Feb 2020 – Oct 2023
  • Platform: Architected OSDU R3 data platform processing exabytes of seismic data on GPU-accelerated Kubernetes clusters (EKS/Fargate) with Docker, Spark, Kafka, and Hadoop/HDFS for ExxonMobil, Chevron, BP, Shell.
  • Infrastructure: Delivered GPU cluster management at scale with Kubernetes, Istio, and KEDA, ensuring 99.95% availability for ML inference workloads across multi-region deployments.
  • IaC & GitOps: Created 50+ Terraform modules (VPC, EKS, RDS, DynamoDB, S3, SageMaker, IAM); implemented GitOps (ArgoCD/FluxCD) with Jenkins pipelines; reduced deployment time 80%.
  • Security & Quality: Integrated Trivy, Kube-bench, and SonarQube into CI/CD; implemented OPA policies and Ansible automation for compliance enforcement.
  • Observability: Built observability stack with Prometheus/Grafana/ELK and GPU metrics; created automation in Go/Python reducing manual operations 85%; led 24x7 on-call rotations.

Multi-Cloud Architect / Senior Infrastructure Engineer

Tata Consultancy Services | Multiple Locations
May 2007 – Feb 2020

Progressive 13-year career across Fortune 500 clients in healthcare, government, telecom, and financial services.

Cloud Architect — Harvard Pilgrim Health Care

Jun 2018 – Feb 2020
  • Led cloud modernization of HIPAA/HITRUST-regulated AI/ML applications, migrating them to AWS; implemented GPU-enabled EKS clusters with NFS-backed persistent storage.
  • Implemented Jenkins/Ansible CI/CD pipelines and DevSecOps tooling (HashiCorp Vault, SonarQube, OWASP); managed a $6M infrastructure budget.

Cloud Senior Engineer — CNA Insurance

May 2015 – Jun 2018
  • Managed Kubernetes/Helm deployments on AWS; built reproducible K8s applications; automated provisioning with Ansible playbooks.
  • Led cloud migrations to AWS, Azure, OpenStack; pioneered Docker/Kubernetes adoption (2013-2014); implemented CI/CD with Jenkins/Ansible.

Solutions Architect — PwC

Jan 2011 – Apr 2015
  • Led infrastructure deployments for 20+ facility buildouts including branch offices, call centers, and data centers; managed $15M+ annual budgets.
  • Integrated Chef/Jenkins deployment pipelines; migrated VMware VMs to AWS (EC2, S3, ELB).

Middleware Engineer — Verizon, Owens Corning

May 2007 – Jan 2011
  • Administered Linux/Unix systems running WebSphere/WebLogic middleware; handled JVM optimization, clustering, and HA configurations.
  • Managed large-scale production systems on bare-metal serving millions of users; led capacity planning, performance testing, and 24x7 L3 operations.

Education & Certifications

Bachelor of Engineering in Mechanical Engineering

Acharya Nagarjuna University

AWS Solutions Architect Professional | Azure Solutions Architect Expert | AWS ML Specialty | Azure AI Engineer | NVIDIA DLI Infrastructure | CKA (Kubernetes) | HashiCorp Terraform | Databricks Certified | TOGAF 9 | ITIL v4