Enterprise AI/ML Platform
Built enterprise AI/ML platforms for healthcare, supply chain, airline, and government clients on a 65-GPU multi-tenant infrastructure for LLM/GenAI workloads, delivering $8M+ in cost savings.
Project Overview
As Distinguished Cloud AI Architect at Trackonomy, I architected and built enterprise-grade AI/ML platforms serving healthcare, supply chain, airline, and government clients. The platform enables organizations to deploy, train, and serve large language models (LLMs) and generative AI workloads at scale with enterprise security and compliance.
I built the infrastructure team from 0 to 6 engineers and managed $6M+ in Azure/AWS budgets, achieving a 73% cloud cost reduction and 75% faster deployments while maintaining 99.97% uptime with zero critical security findings across the SOC2, HIPAA, and FedRAMP compliance frameworks.
Azure Platform Architecture
Multi-tenant Azure Platform - AKS, GPU Infrastructure, and Enterprise Security
Key Achievements
Industry Verticals
Key Responsibilities
- GPU Infrastructure: Architected a 65-GPU multi-tenant platform for LLM/GenAI workloads using Slurm, Kubernetes, and NVIDIA Triton, with optimized resource scheduling and cost allocation.
- ML Pipelines: Built end-to-end ML pipelines with Apache Airflow, Apache Flink, MLflow, and Kubeflow for model training, versioning, and deployment automation.
- Observability: Implemented comprehensive observability across 50+ microservices using Prometheus, Grafana, Datadog, and custom dashboards for GPU utilization monitoring.
- Multi-Cloud Strategy: Designed and deployed infrastructure across Azure, AWS, and OCI with Terraform and GitOps for consistent, repeatable provisioning.
- Application Migration: Led migration of 50+ applications to Kubernetes with zero-downtime deployment strategies and automated rollback capabilities.
- Security & Compliance: Achieved SOC2, HIPAA, and FedRAMP compliance with zero critical security findings, using Vanta and Snyk integrations.
- Cost Optimization: Implemented FinOps practices delivering a 73% cloud cost reduction and $8M+ in savings through right-sizing, reserved instances, and spot instance strategies.
- Team Leadership: Built infrastructure team from 0 to 6 engineers, establishing best practices, documentation, and on-call rotations.