Multimodal input as a first-class feature
Gemini Omni Flash treats text, images, audio, and video as equal input modalities. Earlier video models were primarily text-driven, with image input bolted on; Omni Flash routes all four modalities through the same architecture, so reference media actively steers the result rather than serving as a loose hint.

