ai, open source, llm

Google Launches Gemma 4: Agent-Ready Models from Mobile to Workstation

Cui · Apr 02, 2026 · 4 mins read

Google just released Gemma 4, calling it “the most intelligent open model series to date.” These models target complex reasoning and agent workflows, shipping under the Apache 2.0 license.

Four Model Sizes

Gemma 4 comes in four variants:

  • E2B (Effective 2B) and E4B (Effective 4B): Optimized for mobile and IoT, activating ~2B and ~4B parameters during inference to cut memory and power use.
  • 26B MoE: A mixture-of-experts model that activates roughly 3.8B parameters per token while keeping the full knowledge base loaded.
  • 31B Dense: Provides advanced reasoning for IDEs, coding assistants, and agent workflows, optimized for consumer GPUs.

According to DeepMind researchers Clement Farabet and Olivier Lacombe, the team extracted more “intelligence per parameter,” letting these models punch above their weight. The 31B Dense version currently ranks third among open models on industry benchmarks.

On-Device Agent Capabilities

The E2B and E4B models already work with Google Pixel, Qualcomm, and MediaTek. They run offline on smartphones, Raspberry Pi, and NVIDIA Jetson Nano with very low latency.

The 26B and 31B models bring higher-level reasoning to consumer workstations, turning them into local-first AI servers for students, researchers, and developers.

Technical Improvements

Built on the same architecture as Gemini 3, Gemma 4 adds:

  • Stronger reasoning: All models optimize for complex tasks and offer configurable “thinking” modes.
  • Expanded multimodal support: Text and image input across all models (variable aspect ratios, different resolutions). E2B and E4B add native video and audio input.
  • Larger context windows: 128K for on-device models, up to 256K for the 26B/31B versions.
  • Better coding and agent features: Improved code benchmark scores, built-in function calling to drive autonomous agents.
  • Native system prompts: Built-in system role support makes conversation structure clearer and model behavior easier to control.
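
Native system-role support typically surfaces to developers as a `system` entry in the conversation passed to the chat template. The sketch below uses the generic messages-list convention common to Hugging Face-style chat APIs; the plain-text rendering is purely illustrative, not Gemma 4's actual prompt format.

```python
# Conversation with a native "system" role, in the messages-list convention
# used by most chat-template APIs (assumed here, not Gemma 4's official spec).
messages = [
    {"role": "system", "content": "You are a terse coding assistant."},
    {"role": "user", "content": "Write a one-line Python hello world."},
]

# A real chat template (e.g. tokenizer.apply_chat_template) would render
# this into the model's prompt tokens; here is a generic text rendering.
prompt = "\n".join(f"<{m['role']}>\n{m['content']}" for m in messages)
print(prompt)
```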

Farabet and Lacombe explain that previous Gemma models needed extra design work to interact with external tools. Gemma 4 natively supports function calling, structured JSON output, system instructions, and over 140 languages—making it ready for autonomous agents, third-party tool integration, and multi-step task planning.
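In practice, function calling means the model emits a structured JSON description of a tool invocation, which the host application parses and executes. A minimal dispatch sketch, with a hypothetical tool and a hand-written model response standing in for real Gemma 4 output:

```python
import json

# Hypothetical local tool; the JSON call format below is an assumption,
# not the official Gemma 4 function-calling schema.
def get_weather(city: str) -> str:
    return f"Sunny in {city}"

TOOLS = {"get_weather": get_weather}

# Stand-in for a structured-JSON function call emitted by the model.
model_output = '{"name": "get_weather", "arguments": {"city": "Berlin"}}'

call = json.loads(model_output)
result = TOOLS[call["name"]](**call["arguments"])
print(result)  # Sunny in Berlin
```

In a real agent loop, the tool result would be fed back to the model as a tool message so it can compose the final answer or plan the next step.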

Benchmarks

According to Arena AI’s text leaderboard (as of Feb 1, 2026), the 31B model ranks 3rd globally among open models, with the 26B MoE at 6th.

Google claims Gemma 4 outperforms some models with 20× more parameters on certain benchmarks.

However, some users report that Qwen3.5-27B edges out Gemma 4 31B in their own tests.

Open Source + Local Deployment

Gemma 4 continues under the Apache 2.0 license, allowing commercial use, modification, and deployment. Google says this gives developers full control over data, infrastructure, and models, supporting secure deployment in local or cloud environments without the usage restrictions attached to many proprietary models.

Google also detailed the GPU/TPU memory requirements for running inference with each Gemma 4 variant.

The E2B and E4B models use PLE (Per-Layer Embedding) to boost parameter efficiency during on-device deployment. While PLE doesn’t add model layers, it assigns independent small embeddings to each token in each decoder layer. This means the static weights loaded into memory often exceed what the “effective parameter count” would suggest.

The 26B version uses a mixture-of-experts (MoE) architecture. Although it activates only ~4B parameters per token, all 26B parameters must be loaded into memory for routing and inference speed. So its actual memory footprint is closer to a dense 26B model, not a 4B model.

Official memory estimates typically cover only static model weights, excluding overhead from frameworks, context windows, and KV cache. Fine-tuning will push memory requirements even higher than inference, depending on the framework, batch size, and whether you use full fine-tuning or parameter-efficient methods like LoRA.
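
As a back-of-envelope check on those caveats, static weight memory is just parameter count times bytes per parameter; KV cache, activations, and framework overhead come on top. The helper below is a rough sketch using the article's parameter counts; the precision choices (bf16, int4) are assumptions:

```python
def weight_memory_gb(total_params_b: float, bytes_per_param: float) -> float:
    """Static-weight footprint in GiB: parameters x bytes per parameter.

    Deliberately excludes KV cache, activations, and framework overhead,
    matching the caveat that official estimates cover static weights only.
    """
    return total_params_b * 1e9 * bytes_per_param / 1024**3

# The MoE variant keeps ALL 26B weights resident even though only ~4B are
# active per token, so its footprint is estimated from the total count.
print(f"26B MoE @ bf16:   {weight_memory_gb(26, 2):.1f} GiB")
print(f"31B dense @ int4: {weight_memory_gb(31, 0.5):.1f} GiB")
```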

Industry Perspective

This release highlights Google’s ambition to lead the “local AI” industry. Constellation Research analyst Holger Mueller notes that even the larger Gemma 4 models are small enough to run on a single GPU, making them ideal for edge scenarios and applications that demand low latency and digital sovereignty.

He adds: “Google is expanding its lead in AI—not just through Gemini, but also with open models like the Gemma 4 family. These models are crucial for building an AI developer ecosystem and will help the company enter different device form factors and vertical applications. Google set a high bar with Gemma 3, so this release carries a lot of expectations.”

Availability

Developers can access these models directly through Google Cloud, or get models and weights on Hugging Face, Kaggle, and Ollama. Android developers can try agent workflow prototypes in the AICore Developer Preview.

Google provides multiple inference and fine-tuning paths, including: Hugging Face, LiteRT-LM, vLLM, llama.cpp, MLX, Ollama, NVIDIA NIM and NeMo, LM Studio, Unsloth, SGLang, Cactus, Docker, MaxText, Tunix, and Keras. Cloud deployment options include Vertex AI, Cloud Run, GKE, Sovereign Cloud, and TPU acceleration.

Gemma 4 supports NVIDIA (from Jetson Nano to Blackwell GPUs), AMD GPUs (via the open-source ROCm™ stack), and Google Cloud TPU out of the box.

Google says the new models use the same infrastructure security protocols as its proprietary models, meeting the high security and reliability standards of enterprises and sovereign institutions.
