GPUs power today’s most advanced AI workloads—from forecasting and recommendations to multimodal foundation models. However, teams struggle with procuring and managing GPU infrastructure, configuring distributed training environments, and debugging data loading bottlenecks. Deep learning researchers prefer to focus on the modeling, not troubleshooting infrastructure.
We’re excited to announce the Public Preview of AI Runtime (AIR), a new training stack that enables _on-demand_ distributed GPU training on A10s and H100s. AI Runtime contains all the technology used for large-scale training of LLMs such as MPT and DBRX. Even in Beta, several hundred customers, including Rivian, FactSet, and YipitData, have used AIR to train and ship deep learning models into production. Use cases run the gamut from computer vision models to recommendation systems to fine-tuned LLMs for agentic tasks. Our own Databricks AI Research team used AIR for reinforcement learning of models such as those in our recent KARL paper.
With AI Runtime, Databricks users now have:
* Serverless, on-demand NVIDIA GPUs: Configure your notebook in 2-3 clicks and get fast attach to Serverless A10 and H100 GPUs to start training – no cluster needed. Pay only for the GPUs you use, without paying for idle time.
* Robust orchestration tools: Use the full power of Databricks’ orchestration suite, with Lakeflow Jobs and DABs support for long-running GPU workloads.
* Optimized distributed training: AIR bundles distributed GPU performance enhancements, like RDMA and high-performance data loading.
* Centralized governance and observability: Run, observe, and govern GPU workloads exactly where your data resides, with built-in experiment management via MLflow, access management with Unity Catalog, and agent-assisted debugging.
On-demand NVIDIA H100 and A10 GPUs in notebooks
For interactive development and debugging, connect to on-demand A10s and H100s in Databricks Notebooks with just a few clicks. From there, leverage all the developer ergonomics that Databricks is known for, from environment management for common Python packages to agent-powered authoring and debugging with Genie Code. Easily mount data from the Lakehouse to train deep learning models, or even invoke a fleet of remote CPUs for Spark data processing workloads from your GPU-powered notebook to prepare your data.
Use Genie Code to help resolve performance bottlenecks, experiment with new architectures, or track down tricky issues such as model convergence problems or cryptic framework errors.
Lakeflow for production-ready workloads
AI Runtime is a production-grade platform for accelerated computing. Develop your deep learning code in interactive notebooks, and then use the full power of Lakeflow to submit and orchestrate jobs on GPU compute. Both notebooks and custom code repositories can be executed by Lakeflow for long-running or scheduled jobs. For production needs such as CI/CD (continuous integration and continuous deployment), AI Runtime is fully compatible with our Declarative Automation Bundles (DABs).
With our Lakeflow integration, customers can keep model training and fine-tuning tightly synchronized with upstream data pipelines and downstream production systems.
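To make this concrete, a bundle for a scheduled GPU training job might look like the sketch below. This is an illustrative configuration only: the bundle name, job name, cron schedule, and notebook path are all assumptions, not values from this announcement.

```yaml
# databricks.yml — hypothetical bundle defining a nightly training job.
# All names and paths here are placeholders for illustration.
bundle:
  name: gpu-training-example

resources:
  jobs:
    train_model:
      name: nightly-model-training
      schedule:
        # Run every night at 02:00 UTC
        quartz_cron_expression: "0 0 2 * * ?"
        timezone_id: UTC
      tasks:
        - task_key: train
          notebook_task:
            notebook_path: ./notebooks/train.py
```

Keeping the job definition in version control alongside the training code is what lets CI/CD promote the same workload across dev, staging, and production environments.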
> “Databricks' AI Runtime greatly streamlined the process of training a custom Text To Formula (TTF) model. With no infrastructure setup or delays, it was easy to choose the right compute based on prompt size and output token generation. This allowed us to move quickly, maintain our Lakehouse workflows, and deliver a high-quality model with full governance, reducing time to setup, train and deploy our model from days to hours.”— Nikhil Sunderraj, Principal Machine Learning Engineer, FactSet Research Systems, Inc.
Runtime optimized for distributed deep learning
Distributed training workloads can be painful to prepare, debug, and observe. From troubleshooting RDMA setups to tracking telemetry across multiple GPUs to getting software configuration right, users can easily miss critical details that dramatically slow model training.
Instead, AI Runtime is optimized for the entire deep learning lifecycle—and is designed to save you time. Key dependencies like PyTorch and CUDA come pre-installed, along with optimized support for distributed training frameworks such as Ray, Hugging Face Transformers, Composer, and other libraries, so you can start training immediately without managing environments. Customers are also welcome to bring their own libraries, from Unsloth to TorchRec to custom training loops.
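The core mechanic these frameworks share is data-parallel training: each worker computes gradients on its own shard of data, and an all-reduce averages those gradients so every replica applies the identical update. The pure-Python sketch below illustrates that averaging step on a toy one-parameter model; it uses no Databricks or PyTorch APIs, and all function names and data are illustrative.

```python
# Pure-Python sketch of the gradient all-reduce at the heart of
# data-parallel training (the step frameworks like PyTorch DDP and
# Composer perform over NCCL each iteration).
# Toy model: a single weight w, per-example loss (w*x - y)^2.

def grad(w, x, y):
    # d/dw of (w*x - y)^2
    return 2 * (w * x - y) * x

def ddp_step(w, shards, lr=0.01):
    # Each "worker" computes the mean gradient over its own data shard...
    local_grads = [
        sum(grad(w, x, y) for x, y in shard) / len(shard)
        for shard in shards
    ]
    # ...then an all-reduce averages gradients across workers so every
    # replica takes the same step.
    g = sum(local_grads) / len(local_grads)
    return w - lr * g

# Two workers, two examples each, target relation y = 3x.
shards = [[(1.0, 3.0), (2.0, 6.0)], [(3.0, 9.0), (4.0, 12.0)]]
w = 0.0
for _ in range(200):
    w = ddp_step(w, shards)
print(round(w, 3))  # converges toward 3.0
```

In a real cluster the averaging happens over NCCL/RDMA between GPUs rather than a Python list, but the logic is the same, which is why correct gradient synchronization (not just raw GPU speed) determines whether distributed training converges like single-device training.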
Integrated SDKs and observability tools simplify the management of distributed training workloads. MLflow enables deep observability of GPU workloads, with automatic tracking of GPU utilization and training experiments. Whether you're fine-tuning foundation models or training forecasting and personalization models, the runtime is optimized to accelerate training workflows with minimal setup.
Today’s Public Preview of AI Runtime supports distributed training across up to 8 H100s on a single node, with multi-node support currently in Private Preview.
> "Databricks' AI Runtime enables us to efficiently run LLM workloads (fine tuning and inference) without infrastructure overhead, directly in our lakehouse. This seamless integration simplifies our pipelines and provides efficient use of GPUs, enabling us to deliver high quality AI insights to our customers and focus on innovation, not on infrastructure."— Lucas Froguel, Senior AI Platform Engineer, YipitData