Projects | Sudhakar Chundu

AI/ML & Healthcare

Enterprise AI/ML Platform

Built enterprise AI/ML platforms for healthcare, supply chain, airline, and government clients on 65-GPU multi-tenant infrastructure for LLM/GenAI workloads, delivering $8M+ in cost savings.

Trackonomy

Oct 2023 – Present

San Jose, CA (Remote)

Project Overview

As Distinguished Cloud AI Architect at Trackonomy, I architected and built enterprise-grade AI/ML platforms serving healthcare, supply chain, airline, and government clients. The platform enables organizations to deploy, train, and serve large language models (LLMs) and generative AI workloads at scale with enterprise security and compliance.

I built the infrastructure team from 0 to 6 engineers and managed $6M+ in Azure/AWS budgets while achieving a 73% cloud cost reduction, 75% faster deployments, and 99.97% uptime, with zero critical security findings across SOC2, HIPAA, and FedRAMP compliance frameworks.
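To put the 99.97% uptime figure in concrete terms, the short sketch below converts an uptime percentage into a downtime budget. It is plain arithmetic and not taken from the platform's tooling; the function name `allowed_downtime_minutes` and the 365-day and 30-day windows are assumptions chosen for illustration.

```python
def allowed_downtime_minutes(uptime_percent: float, period_days: float = 365.0) -> float:
    """Return the downtime budget, in minutes, implied by an uptime percentage."""
    if not 0.0 <= uptime_percent <= 100.0:
        raise ValueError("uptime_percent must be between 0 and 100")
    total_minutes = period_days * 24 * 60
    # The unavailable fraction is (100 - uptime) percent of the period.
    return total_minutes * (100.0 - uptime_percent) / 100.0


# Hypothetical measurement windows: one year and one 30-day month.
yearly = allowed_downtime_minutes(99.97)
monthly = allowed_downtime_minutes(99.97, period_days=30)
print(f"{yearly:.1f} min/year, {monthly:.1f} min/30 days")
```

Under these assumptions, 99.97% uptime leaves a budget of roughly 158 minutes of downtime per year, or about 13 minutes in a 30-day month.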

Azure Platform Architecture

Multi-tenant Azure Platform - AKS, GPU Infrastructure, and Enterprise Security

Key Achievements

$8M+
Cost Savings Delivered

73%
Cloud Cost Reduction

65 GPU
Multi-Tenant Platform

99.97%
Platform Uptime

75%
Faster Deployments

0→6
Team Built

Industry Verticals

Healthcare, Supply Chain, Airline, Government

Key Responsibilities

GPU Infrastructure: Architected 65-GPU multi-tenant platform using Slurm, Kubernetes, and NVIDIA Triton for LLM/GenAI workloads with optimized resource scheduling and cost allocation.
ML Pipelines: Built end-to-end ML pipelines with Apache Airflow, Apache Flink, MLflow, and Kubeflow for model training, versioning, and deployment automation.
Observability: Implemented comprehensive observability across 50+ microservices using Prometheus, Grafana, Datadog, and custom dashboards for GPU utilization monitoring.
Multi-Cloud Strategy: Designed and deployed infrastructure across Azure, AWS, and OCI with Terraform and GitOps for consistent, repeatable provisioning.
Application Migration: Led migration of 50+ applications to Kubernetes with zero-downtime deployment strategies and automated rollback capabilities.
Security & Compliance: Achieved SOC2, HIPAA, and FedRAMP compliance via Vanta and Snyk integration with zero critical security findings.
Cost Optimization: Implemented FinOps practices delivering 73% cloud cost reduction and $8M+ savings through right-sizing, reserved instances, and spot instance strategies.
Team Leadership: Built infrastructure team from 0 to 6 engineers, establishing best practices, documentation, and on-call rotations.
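The cost allocation mentioned under GPU Infrastructure can be illustrated with a minimal sketch that splits a shared GPU bill across tenants in proportion to the GPU-hours each consumed. This is one common chargeback approach, not necessarily the method used on the platform; the `GpuUsage` record, the `allocate_cost` function, the tenant labels, and all numbers are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class GpuUsage:
    tenant: str   # hypothetical tenant label
    gpus: int     # number of GPUs held by the job
    hours: float  # wall-clock hours the job ran


def allocate_cost(records, total_cost):
    """Split total_cost across tenants in proportion to GPU-hours used."""
    gpu_hours = defaultdict(float)
    for record in records:
        gpu_hours[record.tenant] += record.gpus * record.hours
    total = sum(gpu_hours.values())
    if total == 0:
        # No usage recorded: nothing to charge back.
        return {tenant: 0.0 for tenant in gpu_hours}
    return {tenant: total_cost * used / total for tenant, used in gpu_hours.items()}


# Illustrative usage records (made-up values).
usage = [
    GpuUsage("healthcare", gpus=8, hours=10.0),  # 80 GPU-hours
    GpuUsage("airline", gpus=4, hours=5.0),      # 20 GPU-hours
    GpuUsage("healthcare", gpus=2, hours=50.0),  # 100 GPU-hours
]
shares = allocate_cost(usage, total_cost=1000.0)
print(shares)
```

Proportional GPU-hour allocation is simple to audit. A real chargeback model would likely also weight by GPU type and account for idle or reserved capacity.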

Technology Stack

Azure

AWS

OCI

Kubernetes

Terraform

ArgoCD

Helm

Docker

GitHub Actions

Prometheus

Grafana

Datadog

NVIDIA GPU

Slurm

Python

MLflow

Kubeflow

Snyk

Vault

Energy & Oil/Gas Sector

OSDU Data Platform

Architected and deployed the Open Subsurface Data Universe (OSDU) R3 platform on AWS and Azure for major energy enterprises, enabling petabyte-scale seismic and subsurface data management.

Wipro

Feb 2020 – Jun 2023

Remote (Global Clients)

Project Overview

The Open Subsurface Data Universe (OSDU) is an industry-standard data platform that enables oil and gas companies to manage, integrate, and leverage subsurface and wells data at scale. I led the cloud architecture and infrastructure implementation for OSDU R3 across multiple major energy clients, building GPU-accelerated data ingestion and ML training pipelines for petabyte-scale seismic data processing.

This project involved designing multi-tenant landing zones, implementing DevSecOps practices, and ensuring compliance with enterprise security standards across AWS and Azure environments.

Architecture Overview

Multi-tenant deployment with AKS, Event Hubs, and Cosmos DB

Key Achievements

5+
Global Energy Enterprises

PB-Scale
Seismic Data Processing

55%
Downtime Reduction (DR)

Multi-Cloud
AWS & Azure Deployment

Enterprise Clients

ExxonMobil, Chevron, BP, Shell, TotalEnergies

Key Responsibilities

Platform Architecture: Designed and implemented OSDU R3 data platform on AWS and Azure, enabling seamless integration of subsurface, wells, and seismic data for exploration and production workflows.
GPU Infrastructure: Architected GPU-accelerated EKS/AKS clusters for large-scale seismic data processing and ML model training, optimizing compute costs while meeting performance SLAs.
Multi-Tenant Landing Zones: Provisioned secure AWS and Azure Landing Zones for 5+ global energy enterprises with isolated environments, network segmentation, and compliance controls.
CI/CD Automation: Built end-to-end deployment pipelines using GitHub Actions, Azure DevOps, Terraform, and Helm for consistent and repeatable infrastructure provisioning.
DevSecOps Implementation: Integrated security scanning with Snyk, SonarQube, and JFrog Xray into CI/CD pipelines. Managed secrets using AWS Secrets Manager and Azure Key Vault.
Disaster Recovery: Designed and implemented DR strategies reducing potential downtime by 55% with multi-region failover capabilities and automated recovery procedures.
Data Pipeline Optimization: Built high-throughput data ingestion pipelines handling petabyte-scale seismic data with Apache Kafka, Azure Event Hubs, and custom ETL workflows.
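The multi-region failover described under Disaster Recovery can be sketched as a priority-ordered choice among healthy regions. This is a simplified illustration under stated assumptions, not the DR design delivered to clients: the region names, the health map, and the `pick_active_region` function are all hypothetical, and real failover would also involve health probes, DNS or traffic-manager updates, and data replication checks.

```python
from typing import Mapping, Optional, Sequence


def pick_active_region(priority: Sequence[str], healthy: Mapping[str, bool]) -> Optional[str]:
    """Return the highest-priority region reported healthy, or None if none is."""
    for region in priority:
        # Regions missing from the health map are treated as unhealthy.
        if healthy.get(region, False):
            return region
    return None


# Hypothetical regions, listed from most to least preferred.
priority = ["primary-region", "secondary-region", "tertiary-region"]
health = {"primary-region": False, "secondary-region": True, "tertiary-region": True}

active = pick_active_region(priority, health)
print(active)
```

Treating unknown regions as unhealthy is a deliberately conservative choice: traffic is never routed to a region whose status has not been confirmed.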

Technology Stack

AWS

Azure

EKS/AKS

Terraform

Helm

GitHub Actions

Azure DevOps

Docker

Python

Snyk

Vault

Prometheus

Grafana

PostgreSQL

Apache Kafka

NVIDIA GPU