Innovating Cloud-Native with AI Edition #2

Welcome to the second edition of the Cloud-Native Innovators newsletter, Innovating Cloud-Native with AI.

Mélony Qin Published on July 22, 2024

Welcome to the second edition of the Cloud-Native Innovators newsletter, April Triumph. In our first issue, we introduced the concept of Cloud-Native Innovation. That edition outlined the essentials for thriving in the rapidly evolving cloud-native world. In this issue, we delve deeper into the latest trends and developments in the field, as we continue to refine our approach as Cloud-Native Innovators.

Cloud-Native and AI Synergy

Generative AI is one of the most exciting and rapidly evolving areas of the tech industry today. OpenAI is at the forefront of this innovation, with a total of $11.3B in funding across 7 rounds, according to data from Crunchbase. It has made significant strides in advancing the field of AI, particularly through its flagship project, ChatGPT.

In 2018, about two years after it started using Kubernetes for deep learning model training, OpenAI pushed a cluster to scale to over 2,500 nodes on Azure, running on D15v2 and NC24 VMs. Three years later, in January 2021, it reported pushing that figure again to over 7,500 nodes. Scaling a single Kubernetes cluster to this size is generally not advisable.

According to the official Kubernetes documentation, even the latest Kubernetes 1.27 supports only up to 5,000 nodes per cluster (this limit hasn’t changed and has been part of the release process for quite some time). As a side note, the official recommendation was only 1,000 nodes back in 2016, and it was raised to 5,000 with Kubernetes 1.6 in 2017 (check out this AWS office hours YouTube video to learn the whole story). Notably, large-scale workloads such as AI have kept pushing Kubernetes past these limits. Alibaba, for example, needed a 10,000-node Kubernetes cluster for a major shopping festival in 2019.

Running AI is incredibly demanding, especially for OpenAI. It runs large machine-learning jobs that span many Kubernetes nodes and need the full power of their GPUs. To achieve this, it relies on GPUDirect for direct communication between GPUs and the NIC, or on NVLink for communication across GPUs.

On March 21st, 2023, NVIDIA announced the availability of the NVIDIA H100 Tensor Core GPU, introduced as the world’s most powerful (and pricey) option for generative AI training and machine learning inference. Hardware like this is crucial for advancements like GPT-4. By the way, you can get access to GPT-4 with a ChatGPT Plus membership, although at the time of writing OpenAI limits GPT-4 to 25 messages every 3 hours. This official white paper gives an overview of the NVIDIA H100 Tensor Core GPU architecture.

This aligns with Microsoft Azure’s introduction of the ND H100 v5 VM on March 13th, by far the most powerful and massively scalable AI virtual machine series in Azure. Amazon Web Services announced that EC2 UltraClusters of Amazon EC2 P5 instances are coming soon, and Oracle Cloud Infrastructure (OCI) announced new OCI Compute bare-metal GPU instances featuring H100 GPUs in limited availability. I am intrigued to see what these announcements will bring to the table, particularly in terms of advancing the synergy between cloud-native and AI.

Computational power is not the only factor, though. There are also challenges with networking, service reliability (particularly under heavy incoming request load on API servers), and observability (using tools like Prometheus and Grafana). And of course, security plays a critical role in every scenario.

Over the past few years, cloud-native ecosystems like Kubernetes and serverless have significantly changed how software is designed and deployed. As the importance of AI capabilities continues to grow, developers are increasingly infusing cloud-native apps with AI, both to enable new use cases and to improve resource utilization and delivery efficiency. Businesses can build and deploy software that leverages powerful AI capabilities while capitalizing on cloud-native design principles like scalability, resiliency, modularity, and agility. With the right approach, we can harness the potential of cloud-native and AI technologies to drive innovation and business success. I can’t wait to see what this will bring!

Thinking in Systems

The real world is complex, with many unpredictable changes and interconnected factors. Systems thinking can help us address complex problems and make better decisions over time by taking into account the full complexity of the systems we are working with.

At times, we become so consumed with our individual experiences that we get stuck in a bubble. It can be challenging to step outside and take a comprehensive view, but doing so is crucial to understanding how all the different factors fit together and influence one another. This is why I like the book ‘Thinking in Systems’ by Donella Meadows, which explores the concept of systems thinking and its applications in various fields.

Systems thinking is a discipline for seeing wholes. It is a framework for seeing interrelationships rather than things, for seeing ‘patterns of change’ rather than ‘static snapshots’.

A personal example is improving our well-being during challenging times. We may discover that a high-stress job is affecting our ability to sleep and exercise regularly, which in turn makes it harder to stick to a healthy diet. By applying systems thinking to personal health, we can identify the interconnected factors at play, such as stress, diet, and sleep. We can then take a holistic approach: avoiding screen time before bedtime, doing more yoga, following a sleep-aid program such as mindfulness meditation, and making time for our exercise routines. Those small, sustainable changes can help us improve our overall well-being.

You may also expect an example of applying systems thinking to cloud-native innovation in this post. The most convincing one is cost optimization. Imagine a company looking to optimize its cloud infrastructure for cost efficiency. It may discover that some applications require far more CPU and memory than others, so provisioning every workload the same way leads to unnecessary costs across its cloud infrastructure.

To address this, the company could adopt containerization with Docker and Kubernetes for better resource utilization. Some applications may be better off with auto-scaling, which automatically adjusts resource allocation based on application demand. For those, the company could also leverage serverless computing, such as Azure Functions or AWS Lambda.
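To make the auto-scaling idea a little more concrete, here is a minimal sketch of the replica calculation described in the Kubernetes Horizontal Pod Autoscaler documentation: desired replicas = ceil(current replicas × current metric / target metric). The function name, the replica bounds, and the example numbers are my own illustrative assumptions, not a real API, and the 10% tolerance mirrors what I understand to be the autoscaler's default.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=1, max_replicas=10, tolerance=0.1):
    """Hypothetical helper that mimics the documented HPA scaling formula.

    current_metric and target_metric share a unit, for example average CPU
    utilization in percent. The bounds and tolerance are assumptions chosen
    for this illustration.
    """
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")

    ratio = current_metric / target_metric

    # Skip scaling when the metric is already close enough to the target.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas

    # Round up so capacity is never below what the ratio asks for.
    desired = math.ceil(current_replicas * ratio)

    # Keep the result inside the configured bounds.
    return max(min_replicas, min(max_replicas, desired))
```

For example, desired_replicas(4, 90, 60) returns 6: four replicas running at 90% CPU against a 60% target scale out to six. When demand drops to 20%, the same call with current_metric=20 scales back in to 2, which is where the cost savings come from.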

By applying systems thinking, companies can take a holistic approach, identify the interconnected factors, and achieve better performance.

Related Resources

Kubernetes radically transforms how we build and deploy applications in the cloud. The updated edition of the ‘Kubernetes: Up and Running’ ebook shows developers and ops personnel how to use Kubernetes and container technology to achieve new levels of velocity, agility, reliability, and efficiency. You can download it for free here, or purchase it from Amazon here if you would like a physical copy.

Kubernetes: Up and Running

During a challenging period in my life, I authored the Certified Kubernetes Administrator (CKA) Exam Guide to help individuals learn Kubernetes and obtain certification. The book aims to open doors for readers to new career paths as Kubernetes administrators and to help them add value to their organizations. I am so happy to have received forewords for the book from industry experts Brendan Burns, Alessandro Vozza, and Mark Whitby, who have been a great source of inspiration to me.

To give back to the community, I will be donating 100% of the royalties from my books to support the critical work of Doctors Without Borders (Médecins Sans Frontières). From now until May 13th, 2023, the book is available at the best price on Amazon with a 20% discount code provided by Packt Publishing. You can purchase it in either paperback or Kindle format using the direct link here.

Certified Kubernetes Administrator (CKA) Exam Guide

Another book I recommend is Thinking in Systems by Donella Meadows. It is about understanding how the different components of a complex system interact with one another, and how changes in one component can have ripple effects throughout the entire system. You can purchase this book here or listen to the Audible audiobook here.

What’s Next

In our next issue, we’ll explore the Kubernetes distributions on the market and the ecosystems surrounding Kubernetes. I will also share my personal recap of KubeCon + CloudNativeCon Europe 2023. Stay tuned!

I’m currently working on a series of short-form videos on generative AI and on frequently asked questions about core Cloud-Native, Kubernetes, and Serverless technologies, along with a few long-form videos aimed at helping viewers get CKA certified. I will post these videos on my YouTube channel, CloudMelon Vis, soon. Subscribe so you don’t miss any updates!

Join my private distribution list to access exclusive rewards, such as early access to content and the ability to request topics for future articles. You can also receive discount codes and coupons for my latest books or other products by simply clicking on this link and then confirming your subscription.

Follow our journey

You’ll become a part of our community of cloud-native innovators, and the latest updates and insights will land directly in your inbox.

Thanks, community, for your continued support, and I will see you in the May Booster edition!

Best wishes to all,

M.
Originally published on LinkedIn on April 16th, 2023. Follow us there if you’re interested.