As the race for Artificial Intelligence Chatbots is taking revolutionary turns, Microsoft has recently launched its own AI model, KOSMOS-1, which is said to be one step ahead of ChatGPT’s text and message prompts, and can also respond to visual cues or images.
It describes itself as a multimodal large language model (MLLM) that can react to not only spoken but also visual inputs. This allows it to do a variety of tasks, such as image captioning and visual question answering, among others.
“A big convergence of language, multimodal perception, action, and world modelling is a key step toward artificial general intelligence. In this work, we introduce Kosmos-1, a Multimodal Large Language Model (MLLM) that can perceive general modalities, learn in context and follow instructions,” said Microsoft’s AI researchers in a paper.
The study contends that in order to advance beyond ChatGPT-like skills to artificial general intelligence (AGI), multimodal perception, knowledge acquisition, and “grounding” in the actual world are required.
“More importantly, unlocking multimodal input greatly widens the applications of language models to more high-value areas, such as multimodal machine learning, document intelligence, and robotics,” the paper says.
According to experimental results, Kosmos-1 performs impressively in terms of language recognition, production, and even when fed directly with document images.
It also showed satisfactory results in perception-language tasks, including multimodal dialogue, image captioning, visual question answering, and vision tasks, such as image recognition with descriptions (specifying classification via text instructions) which put across the goal to align perception with LLMs and perceive general modalities quite favourably.
The researchers also tested how Kosmos-1 performed in the zero-shot (follow instructions] Raven IQ test. Its accuracy showed the potential for MLLMs to “perceive abstract conceptual patterns in a nonverbal context” via the inculcation of perception in LLMs.
Microsoft plans to use Transformer-based language models to make Bing a better rival to Google search and also add to its Office Suite as well as Edge Browser.