Resume
Distinguished Cloud AI Architect & Platform Engineering Leader | GPU/HPC Infrastructure | AI/ML Platforms
SUDHAKAR CHUNDU
Professional Summary
Distinguished Cloud AI Architect and Platform Engineering Leader with 18+ years of hands-on experience building and operating large-scale GPU compute platforms, AI/ML infrastructure, and distributed systems at planetary scale. Expert in Slurm cluster management, GPU scheduling (NVIDIA A100/V100/T4/DGX), Kubernetes orchestration, and multi-cloud architecture (AWS, Azure, OCI). Lead global teams of 13+ engineers across Infrastructure, Security, Networking, and DevOps, managing $15M+ budgets while delivering $8M+ in documented cost savings. Proven track record designing scalable HPC and AI/ML platforms serving 10M+ users with 99.97% uptime across 15+ regions. Active conference speaker at KubeCon, SREcon, and MLOps World.
Key Achievements
Core Technical Skills
GPU/HPC Computing
Slurm, Slinky, NVIDIA BCM/DGX, vcluster, CUDA, TensorRT, A100/V100/T4 scheduling, Triton Inference Server, vLLM
AI/ML Infrastructure
PyTorch, Ray, Flyte, MLflow, Kubeflow, SageMaker, Azure ML, TorchServe, KServe, distributed training
Cloud Platforms
AWS (EKS, EC2, SageMaker, EMR), Azure (AKS, APIM, ML), OCI
Big Data & Streaming
Databricks, Apache Spark, Flink, Kafka, Hadoop/HDFS, Hive, Trino, Airflow, Delta Lake, Kinesis
Container Orchestration
Kubernetes (EKS/AKS), Docker, Helm, Istio Service Mesh, KEDA, ArgoCD, FluxCD
Infrastructure as Code
Terraform (50+ modules), Ansible, ARM/Bicep, CloudFormation, Packer, GitOps, policy-as-code (OPA, Checkov)
Security & Compliance
SOC2, HIPAA, FedRAMP, ISO 27001, HITRUST, DevSecOps (Snyk, SonarQube), Vault, ZTNA
SRE & Observability
SLI/SLO/SLA, error budgets, chaos engineering, Prometheus, Grafana, Datadog, OpenTelemetry, AIOps
Professional Experience
Distinguished Cloud AI Architect / Director of Platform Engineering
Trackonomy Systems | San Jose, CA
- GPU/HPC Infrastructure: Designed Slurm-based GPU compute platform (65 GPUs across 8 nodes) using Slurm, Slinky (Slurm-on-Kubernetes), and NVIDIA BCM; achieved 99.97% uptime serving 12+ enterprise clients for AI inference and training.
- Serverless AI: Architected serverless GPU infrastructure using AWS Lambda containers, reducing inference costs by 65% while serving 5M+ daily predictions.
- Cloud FinOps: Reduced cloud costs 73% ($10M→$2.7M/year) through GPU utilization optimization, fair-share scheduling, reserved instance planning, and vendor consolidation.
- Platform Engineering: Built CI/CD pipelines for 100+ applications using GitHub Actions, Docker, Terraform, and GitOps (ArgoCD, FluxCD); reduced deployment time 95% (days→15 minutes).
- Security & Compliance: Implemented DevSecOps with Trivy, Falco, OPA, Gitleaks, and HashiCorp Vault; achieved SOC2, HIPAA, FedRAMP, and HITRUST compliance for a GenAI/LLM platform.
- Leadership: Built the team from 0 to 6 engineers; managed $15M+ budgets; led incident response and on-call rotations.
Senior SRE / Cloud Architect — ML Infrastructure
Wipro Technologies, OSDU Data Platform | Seattle, WA
- Platform: Architected OSDU R3 data platform processing exabytes of seismic data on GPU-accelerated Kubernetes clusters (EKS/Fargate) with Docker, Spark, Kafka, and Hadoop/HDFS for ExxonMobil, Chevron, BP, and Shell.
- Infrastructure: Delivered GPU cluster management at scale with Kubernetes, Istio, and KEDA, ensuring 99.95% availability for ML inference workloads across multi-region deployments.
- IaC & GitOps: Created 50+ Terraform modules (VPC, EKS, RDS, DynamoDB, S3, SageMaker, IAM); implemented GitOps (ArgoCD/FluxCD) with Jenkins pipelines; reduced deployment time 80%.
- Security & Quality: Integrated Trivy, Kube-bench, and SonarQube into CI/CD; implemented OPA policies and Ansible automation for compliance enforcement.
- Observability: Built observability stack with Prometheus/Grafana/ELK and GPU metrics; created automation in Go/Python reducing manual operations 85%; led 24x7 on-call rotations.
Multi-Cloud Architect / Senior Infrastructure Engineer
Tata Consultancy Services | Multiple Locations
Progressive 13-year career across Fortune 500 clients in healthcare, government, telecom, and financial services.
Cloud Architect — Harvard Pilgrim Health Care
Jun 2018 – Feb 2020
- Led cloud modernization for HIPAA/HITRUST-regulated AI/ML applications to AWS; implemented GPU-enabled EKS clusters with NFS-backed persistent storage.
- Implemented Jenkins/Ansible CI/CD pipelines and DevSecOps tooling (HashiCorp Vault, SonarQube, OWASP); managed a $6M infrastructure budget.
Cloud Senior Engineer — CNA Insurance
May 2015 – Jun 2018
- Managed Kubernetes/Helm deployments on AWS; built reproducible K8s applications; automated provisioning with Ansible playbooks.
- Led cloud migrations to AWS, Azure, OpenStack; pioneered Docker/Kubernetes adoption (2013-2014); implemented CI/CD with Jenkins/Ansible.
Solutions Architect — PwC
Jan 2011 – Apr 2015
- Led infrastructure deployments for 20+ facility buildouts including branch offices, call centers, and data centers; managed $15M+ annual budgets.
- Integrated Chef/Jenkins deployment pipelines; migrated VMware VMs to AWS (EC2, S3, ELB).
Middleware Engineer — Verizon, Owens Corning
May 2007 – Jan 2011
- Deep Linux/Unix administration with WebSphere/WebLogic middleware; JVM optimization, clustering, and HA configurations.
- Managed large-scale production systems on bare-metal serving millions of users; led capacity planning, performance testing, and 24x7 L3 operations.
Education & Certifications
Bachelor of Engineering in Mechanical Engineering
Acharya Nagarjuna University