Google DeepMind’s Gemini is expanding beyond text into multimodal interaction, with demonstrations such as a context-aware mouse pointer and the new Gemini Omni video capabilities that identify objects and follow complex visual tasks. Users and creators celebrate hidden productivity features that promise to turn hours of work into seconds, yet real-world use appears limited to a fraction of the model’s potential. At the same time, incidents of harmful or biased outputs, such as a viral claim that Gemini made inflammatory statements about Islam, underscore persistent safety, moderation, and trust challenges. The trend shows rapid technical progress paired with growing scrutiny over responsible deployment.
Gemini's multimodal advances change how interfaces and workflows are designed, creating new opportunities for automation and UX innovation while raising product safety and trust requirements for engineers and managers.
Dossier last updated: 2026-05-19 00:14:14
ByteDance researchers introduced Lance, a 3-billion-parameter unified multimodal model that performs image and video understanding, generation, and editing within one framework. Built with a staged multi-task training recipe and trained from scratch on a 128-A100-GPU budget (excluding pretrained ViT and VAE encoders), Lance claims efficient performance across text-to-image, text-to-video, editing, and visual question-answering tasks. The repo includes demos showing video generation, editing, multi-turn consistency, and visual QA examples, highlighting practical capabilities like object manipulation through screens and chart interpretation. The project emphasizes efficiency at modest scale and invites community contributions via issues and pull requests. This matters for multimodal AI practicality and cost-effective deployment of unified vision-language models.
Carl Franzen / VentureBeat : Google launches the Gemini Omni multimodal model, saying it can “create anything from any input”, starting with video generation, for Google AI subscribers — Although it was already discovered by intrepid AI power users weeks ahead of the official unveiling today at Google's annual …
Google’s Gemini Omni turns images, audio, and text into video — and that’s just the start
Google DeepMind : Google DeepMind details a Gemini-powered mouse pointer that understands what it is pointing at, allowing users to perform tasks without using text-heavy prompts — We are developing more seamless, intuitive ways to collaborate with AI — The mouse pointer has been a constant companion …
A widely shared Reddit screenshot claims Google’s Gemini AI stated that Islam promotes hatred. The post sparked controversy online, prompting debate about model bias, moderation, and factual accuracy. Key players include Google (developer of Gemini) and users on Reddit and other social platforms who amplified the screenshot. If the claim is accurate, the incident matters because it highlights the risk of large language models producing harmful or inflammatory statements about religions, which can fuel misinformation and bias and complicate platform moderation. It underscores the need for robust safety training, prompt engineering, and incident response from AI developers to prevent and correct harmful outputs and to maintain public trust.
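@wanerfu: Gemini is powerful, but few people really use it! Most people only use it for simple prompts… yet the tools Google has added can turn hours of work into seconds. You are using less than 5% of Gemini's real capabilities. Here are 10 hidden features 👇 https://t.co/Bf5MD2SYU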
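Google's all-new Gemini Omni revealed for the first time: the video version of "Banana" has arrived, and a professor's blackboard formula derivation comes out entirely correct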