Distinguished Cloud AI Architect & Platform Engineering Leader
Building large-scale GPU compute platforms, AI/ML infrastructure, and distributed systems at planetary scale. Expert in Slurm cluster management, GPU scheduling (NVIDIA A100/V100/DGX), Kubernetes orchestration, and multi-cloud architecture (AWS, Azure, GCP, OCI).
Sudhakar Chundu
San Jose, California
What Colleagues Say
Trusted by leaders across engineering, product, and executive teams at Trackonomy
I bring 18+ years of hands-on experience building and operating large-scale GPU compute platforms, AI/ML infrastructure, and distributed systems at planetary scale. I lead global teams of 13+ engineers across Infrastructure, Security, Networking, and DevOps.
I am currently Distinguished Cloud AI Architect at Trackonomy, managing $15M+ budgets while delivering $8M+ in documented cost savings. I built the infrastructure team from 0 → 6 engineers, serving the Pharma, Airlines, Government, Manufacturing, Healthcare, and IoT sectors across 8 countries.
My expertise spans Slurm-based GPU compute platforms (65 GPUs, 99.97% uptime), serverless GPU infrastructure, Databricks/Spark/Kafka real-time pipelines, and comprehensive DevSecOps with SOC2, HIPAA, FedRAMP, and HITRUST compliance.
Professional Experience
Building and operating infrastructure at scale for global enterprises
Distinguished Cloud AI Architect / Director of Platform Engineering
Trackonomy Systems
- Designed Slurm-based GPU compute platform (65 GPUs, 8 nodes) with Slinky and NVIDIA BCM; achieved 99.97% uptime serving 12+ enterprise clients for AI inference and training
- Reduced cloud costs 73% ($10M→$2.7M/year) through GPU utilization optimization, fair-share scheduling, and vendor consolidation; led FinOps practice with showback/chargeback models
- Built team 0 → 6 engineers | Managed $15M+ budgets | SOC2, HIPAA, FedRAMP, HITRUST compliance | Secured GenAI/LLM platform against prompt injection and data exfiltration
Senior SRE / Cloud Architect — ML Infrastructure
Wipro Technologies (OSDU Data Platform)
- Architected OSDU R3 data platform processing exabytes of seismic data on GPU-accelerated Kubernetes (EKS/Fargate) with Spark, Kafka, Hadoop/HDFS for ExxonMobil, Chevron, BP, Shell
- Created 50+ Terraform modules; implemented GitOps (ArgoCD/FluxCD), KEDA autoscaling; reduced deployment time 80%
- Built observability stack with Prometheus/Grafana/ELK and GPU metrics; implemented HA architecture achieving 55% downtime reduction via capacity planning
Multi-Cloud Architect / Senior Infrastructure Engineer
Tata Consultancy Services
Progressive 13-year career across Fortune 500 clients in healthcare, government, telecom, and financial services.
- Led cloud modernization for HIPAA/HITRUST-regulated AI/ML applications to AWS with GPU-enabled EKS clusters
- Implemented Jenkins/Ansible CI/CD pipelines | DevSecOps with HashiCorp Vault, SonarQube | Managed $6M budget
- Managed Kubernetes/Helm deployments on AWS; pioneered Docker/Kubernetes adoption (2013-2014)
- Led cloud migrations to AWS, Azure, OpenStack | CI/CD with Jenkins/Ansible
- Led infrastructure deployments for 20+ facility buildouts including branch offices, call centers, and data centers
- Integrated Chef/Jenkins deployment pipelines | Migrated VMware VMs to AWS | Managed $15M+ budgets
- 5 years of deep Linux/Unix administration with WebSphere/WebLogic middleware; kernel tuning, JVM optimization
- Managed large-scale production systems on bare-metal serving millions of users | 24x7 L3 operations
Tools & Technologies
Technologies I work with daily to build scalable, reliable cloud infrastructure and AI/ML platforms.
Areas of Expertise
Building and scaling enterprise infrastructure across bare-metal and multi-cloud platforms
Linux & Systems
Deep expertise in Linux systems administration, performance tuning, and kernel debugging at scale.
Containers & Kubernetes
Managing large-scale Kubernetes clusters with advanced orchestration and GitOps workflows.
Multi-Cloud Architecture
Designing infrastructure across AWS, Azure, GCP, and OCI with focus on cost optimization.
Observability & SRE
Implementing comprehensive monitoring, alerting, and SLO-driven reliability engineering.
Infrastructure as Code
Automating infrastructure provisioning with modern IaC tools and robust CI/CD pipelines.
Networking & Load Balancing
Expert in TCP/IP, DNS, service mesh, and multi-region load balancing architectures.
Let's Connect
Currently exploring Cloud AI Architect and platform engineering leadership roles. Open to both full-time positions and consulting engagements.