#069 – AI on CPUs with Earl Ruby

In episode 69, Earl Ruby discusses his career highlights and his current role at Broadcom. He explains the Private AI Foundation with Intel and how it enables customers to run AI and ML workloads, then digs into how to choose between CPUs and GPUs for ML workloads, common misconceptions about CUDA, and the future of software tools like oneAPI. Earl describes AMX, Intel's matrix-math instruction set, and its support in vSphere for running ML workloads on CPUs. He explains quantization and how it is used to run models on AMX, the challenges of sizing virtual machines for large language models, and the power consumption differences between GPUs and CPUs. The conversation also covers heterogeneous clusters and workload placement, as well as the future of AMX and Intel GPUs. Finally, Earl mentions his blog articles, where he shares his insights and experiences.

Takeaways

  • The Private AI Foundation with Intel enables customers to run AI and ML workloads using Intel’s AMX instruction set and GPUs.
  • When choosing between CPUs and GPUs for ML workloads, consider factors such as use case, model complexity, and performance requirements.
  • CUDA is not the only option for writing optimized AI workloads, as Intel’s oneAPI provides an open API for working with their hardware.
  • AMX is a set of instructions backed by hardware in Intel CPUs for matrix multiplication and other matrix operations, and it is supported in vSphere for running ML workloads on CPUs.
  • Quantization converts high-bit-width numbers into lower-bit-width equivalents, shrinking a model’s memory footprint and accelerating processing on AMX.
  • Sizing virtual machines for large language models can be challenging, and it is important to consider the memory footprint and CPU cores required.
  • GPUs consume more power than CPUs, and an underutilized GPU still draws much of that power; when a GPU would sit partly idle, CPUs can be power competitive.
  • Heterogeneous clusters can be used to ensure specific workloads land on AMX-enabled CPUs, while Kubernetes provides automatic workload placement based on hardware capabilities.
  • The future of AMX and Intel GPUs involves extensibility and integration with other GPU technologies. oneAPI allows software to move to new hardware without rewrites.
  • AVX-512 can be used to accelerate ML workloads on older machines without AMX, but the performance boost is not as significant as with AMX or GPUs.
  • Earl Ruby shares his insights and experiences through his blog articles, where he provides solutions to unique challenges and saves others from similar frustrations.
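To make the quantization takeaway concrete, here is a minimal sketch of symmetric int8 quantization, one common form of the high-bit-to-low-bit conversion Earl describes. The function names and the sample weights are illustrative, not from the episode.

```python
import numpy as np

def quantize_int8(weights: np.ndarray):
    """Map float32 weights to int8 plus a single scale factor."""
    scale = np.abs(weights).max() / 127.0  # largest magnitude maps to ±127
    q = np.round(weights / scale).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float32 values from the int8 representation."""
    return q.astype(np.float32) * scale

w = np.array([0.5, -1.27, 0.02, 1.0], dtype=np.float32)
q, s = quantize_int8(w)
w_hat = dequantize(q, s)
# int8 storage is 4x smaller than float32; values are recovered to within
# one quantization step (the scale factor)
```

The storage drops from 4 bytes to 1 byte per weight, which is why quantized models fit in far less memory and move through matrix units like AMX faster.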
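The VM-sizing takeaway comes down to simple arithmetic: the weights alone need parameter count times bytes per parameter. A back-of-the-envelope sketch, assuming a hypothetical 7-billion-parameter model (a size chosen for illustration, not one mentioned in the episode):

```python
def model_memory_gib(params: float, bytes_per_param: int) -> float:
    """Memory needed just for model weights, in GiB."""
    return params * bytes_per_param / 2**30

params = 7e9                                 # hypothetical 7B-parameter model
fp32_gib = model_memory_gib(params, 4)       # full precision: ~26 GiB
int8_gib = model_memory_gib(params, 1)       # int8 quantized: ~6.5 GiB
```

Actual VM sizing must also budget for activations, KV cache, and the guest OS, so the real figure is higher; but the weights term alone shows why quantization can be the difference between a model fitting in a VM's memory or not.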

Some links to topics discussed:
