NVIDIA and Google Collaborate to Bring Gemma 4 AI Models to Local Devices

2026-05-01 06:07:05

Introduction

Open-source AI models are driving a revolution in on-device intelligence, pushing the boundaries of innovation from the cloud to everyday hardware. As these models become more powerful, their true value lies in their ability to process local, real-time context—transforming insights into immediate action. Google's latest additions to the Gemma 4 family are purpose-built for this transition: a suite of compact, fast, and omni-capable models designed to run efficiently on a wide range of devices, from edge modules to high-performance workstations.

Source: blogs.nvidia.com

In a strategic partnership, Google and NVIDIA have optimized Gemma 4 for NVIDIA GPUs, ensuring seamless performance across diverse systems. This includes data center deployments, NVIDIA RTX-powered PCs and workstations, the personal DGX Spark AI supercomputer, and even Jetson Orin Nano edge AI modules. The collaboration marks a significant step in making advanced AI accessible locally, without relying on cloud connectivity.

Gemma 4: Compact Models with Big Capabilities

The new Gemma 4 family spans multiple configurations—E2B, E4B, 26B, and 31B variants—each tailored for specific use cases. All versions are designed for efficient deployment from edge devices to high-performance GPUs. Notably, performance benchmarks were measured using Q4_K_M quantizations with batch size 1, input sequence length 4096, and output sequence length 128 on NVIDIA GeForce RTX 5090 and Mac M3 Ultra desktops, leveraging llama.cpp b7789 and the llama-bench tool.
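The benchmark setup described above can be reproduced approximately with llama.cpp's bundled `llama-bench` tool. The following is a minimal sketch, not the exact command used for the published numbers; the GGUF filename is an assumption, and flags may differ slightly across llama.cpp builds:

```shell
# Sketch: benchmark a Q4_K_M-quantized Gemma model with llama-bench.
# The model filename below is hypothetical -- substitute your local GGUF file.
# -p 4096 : input (prompt) sequence length
# -n 128  : output (generation) sequence length
# -ngl 99 : offload all layers to the GPU
./llama-bench -m gemma-4-e4b-Q4_K_M.gguf -p 4096 -n 128 -ngl 99
```

`llama-bench` reports tokens-per-second for prompt processing and generation separately, which is useful when comparing edge modules against desktop GPUs.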

Core Capabilities

This generation of compact models supports a broad range of tasks, making them versatile for developers and enterprises alike.

Optimized for NVIDIA Hardware

The E2B and E4B models are engineered for ultra-efficient, low-latency inference at the edge. They run completely offline with near-zero latency on a variety of devices, including Jetson Orin Nano modules. In contrast, the 26B and 31B models are designed for high-performance reasoning and developer-centric workflows, making them ideal for agentic AI applications. Optimized to deliver state-of-the-art, accessible reasoning, these larger models run efficiently on NVIDIA RTX GPUs and the DGX Spark, powering development environments, coding assistants, and agent-driven workflows.
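Fully offline inference of the kind described above can be tried locally with llama.cpp's `llama-cli`. A minimal sketch, assuming a quantized GGUF file has already been downloaded (the filename is hypothetical):

```shell
# Run a quantized Gemma model fully offline on a local GPU.
# The model filename is an assumption; -ngl 99 offloads all layers to the GPU,
# and -n 256 caps the response length.
./llama-cli -m gemma-4-e2b-Q4_K_M.gguf -ngl 99 -n 256 \
  -p "Summarize this sensor log in two sentences."
```

Because inference runs entirely on the local device, no prompt data leaves the machine, which matches the privacy argument made for edge deployments.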

Local Agentic AI and Real-World Applications

As local agentic AI gains momentum, applications like OpenClaw are enabling always-on AI assistants on RTX PCs, workstations, and the DGX Spark. The latest Gemma 4 models are compatible with OpenClaw, allowing users to build capable local agents that draw context from personal files, applications, and workflows to automate tasks. This marks a shift toward truly decentralized AI, where sensitive data never leaves the device and responses are instantaneous.

For those eager to explore, Google and NVIDIA provide resources to get started: learn how to run OpenClaw for free on RTX GPUs and DGX Spark, or use the DGX Spark OpenClaw playbook. The collaboration between NVIDIA and Google ensures that Gemma 4 models are not just powerful, but also practical for real-world deployment—from edge computing to personal AI supercomputers. Check out the Google DeepMind announcement blog for further technical details.
