Created by xAI, Elon Musk's company, to compete with ChatGPT, Grok is a chatbot that has always stood out for its sarcastic and politically incorrect sense of humor. Available to subscribers of the Premium+ plan on X (formerly Twitter), Grok is also updated in real time with data from the platform, offering context on trending topics and popular posts, along with additional features such as image generation, browsing via Bing and advanced data analysis.
Now xAI has announced Grok-1.5 Vision Preview, a new version of its AI that extends its capabilities to images, spreadsheets and documents, allowing not only text processing but also the interpretation and extraction of information from images.
What's new in this version
Combining text processing with the ability to analyze a wide variety of visual information, such as documents, diagrams, graphs, screenshots and photographs, Grok-1.5V promises to impress. The new version will soon be available to early testers and existing Grok users. In preliminary tests, Grok-1.5V has already proven highly competitive with other multimodal models in several domains.
Most impressive, however, is Grok-1.5V's ability to understand the physical world, including interpreting screenshots and photographs. This opens up new possibilities for interaction between humans and machines, as well as applications in areas such as computer vision and virtual assistance.
xAI demonstrated the new version's image interpretation with an example in which Grok writes code from a diagram. As we see below, the diagram describes a guessing game as a flowchart of logic and user interactions. When asked to translate the diagram into Python code, Grok-1.5V responded accurately, providing code that represents the game logic described in the flowchart.
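To give a sense of the kind of program involved, here is a minimal sketch of a guessing game in Python. It is not the code Grok produced, and the article does not reproduce the flowchart, so the rules are assumptions: a secret number between 1 and 10, and a "too high" or "too low" hint after each wrong guess. The function name `play_guessing_game` and its parameters are hypothetical.

```python
import random


def play_guessing_game(guesses, target=None, low=1, high=10):
    """Play one game and return the messages shown to the player.

    `guesses` is any iterable of integers (for example, values read
    from input()). The 1-10 range and the hint wording are assumptions,
    not details taken from the flowchart in the demo.
    """
    if target is None:
        # Pick the secret number when the caller does not supply one.
        target = random.randint(low, high)

    messages = []
    for guess in guesses:
        if guess == target:
            messages.append("You won!")
            return messages
        # Wrong guess: tell the player which direction to go.
        messages.append("Too high" if guess > target else "Too low")

    # The iterable ran out before the player found the number.
    messages.append("Out of guesses")
    return messages


print(play_guessing_game([5, 8, 7], target=7))
```

Passing the guesses in as an iterable, rather than calling `input()` inside the loop, keeps the logic easy to check without a keyboard.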
In the next example, Grok-1.5V calculated calories from nutritional information in an image. The image showed a close-up of the nutrition label on a food package, listing details such as serving size and calories per serving. When asked how many calories 5 slices of the product would contain, Grok answered correctly, explaining that if a serving is 3 slices and contains 60 calories, then 5 slices contain approximately 100 calories.
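The arithmetic behind that answer is a simple proportion: 60 calories per 3 slices is 20 calories per slice, so 5 slices come to 100. A short Python sketch of the same calculation follows; the function name `calories_for` is hypothetical, and the default values are the ones quoted from the label in the example.

```python
def calories_for(slices, slices_per_serving=3, calories_per_serving=60):
    """Scale the per-serving calories to an arbitrary number of slices."""
    return slices * calories_per_serving / slices_per_serving


print(calories_for(5))  # 100.0
```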
In another demonstration (photo below), Grok created a bedtime story from a child's drawing. The drawing showed a boy next to a boat. When asked to tell a story based on the drawing, Grok responded with an engaging narrative about a brave boy named Timmy. Turning a simple drawing into a captivating story shows Grok-1.5V's ability to interpret images and build narratives from them.
The same ability appears in the next example, in which Grok explains a meme that satirizes the differences between startups and large companies. The image has two panels: on the left, titled “Startups”, a group of construction workers are actively digging a hole; on the right, titled “Big Business”, a group of people watch a single man dig. Grok's explanation highlights the contrast between the intense collaboration and efficiency of startups and the possible bureaucracy and lack of agility of large companies.
In the following image, Grok-1.5V converted a table to CSV format, combining its natural language processing with its interpretation of visual information. Analyzing the table of Morocco's medal winners at the 2016 Summer Paralympic Games, Grok identified the relevant columns, such as “medal”, “name”, “sport”, “event” and “date”. It then organized this information into comma-separated lines, following the CSV format. This shows Grok's ability to extract and reorganize data precisely, which is useful for converting tabular information into formats that are easier to manipulate.
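For readers unfamiliar with the format, the sketch below shows what "comma-separated lines" look like when produced with Python's standard `csv` module. Only the column names come from the example; the row values are placeholders rather than real entries from the table, and the helper name `rows_to_csv` is hypothetical.

```python
import csv
import io

# Column names quoted in the article's example.
COLUMNS = ["medal", "name", "sport", "event", "date"]


def rows_to_csv(rows):
    """Serialize a list of dicts to CSV text, header line first."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Placeholder row for illustration only (not data from the actual table).
example_rows = [
    {
        "medal": "Gold",
        "name": "Athlete A",
        "sport": "Sport A",
        "event": "Event A, final",
        "date": "<date>",
    }
]

print(rows_to_csv(example_rows))
```

Note that the `csv` module wraps the event field in quotes because it contains a comma, which is the kind of detail that matters when a table is converted by hand or by a model.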
xAI is already planning significant improvements to its multimodal capabilities in the coming months. Working across modalities such as images, audio and video, the goal is to keep advancing towards a beneficial artificial general intelligence (AGI), capable of understanding and interacting with the universe in an increasingly sophisticated way.
Understanding the real world
Grok-1.5V is also set to gain a “spatial understanding of the real world”, allowing a better interpretation of the physical world shown in the images its users upload. This improvement is crucial to developing AI assistants that are more useful in the real world. To that end, a new benchmark is being introduced, RealWorldQA, designed specifically to evaluate the spatial understanding of multimodal models such as Grok-1.5V.
While many of the examples in the benchmark may appear simple to humans, they pose a significant challenge to current AI models, highlighting the need for advances in this area so that AIs can understand and interact with the physical world in a more comprehensive and effective way.
In the image above, for example, the AI was able to analyze and answer the question “Which object is bigger: the pizza cutter or the scissors?”. Comparing sizes in this way requires a spatial understanding of the physical world. The AI identified the objects in the image, recognized their relative shapes and sizes, and determined that the pizza cutter is bigger than the scissors. This shows how an AI can be trained to understand and answer questions about physical objects in images, which is critical to its development as a useful assistant in the real world.
In this other example (image above), Grok-1.5V determined the cardinal direction the dinosaur is facing. The image provides no clear visual references, such as a compass or landmarks around the dinosaur, but Grok answered the question correctly, indicating that the dinosaur is facing East.
Comparison with other AIs
Grok-1.5 Vision Preview performed exceptionally well against other artificial intelligences on a new benchmark called RealWorldQA, which assesses spatial understanding of the real world. The evaluation was run in a zero-shot setting, without chain-of-thought prompting.
Across the different data sets, Grok-1.5V outperformed its peers in several key areas. In the Multi-discipline benchmark (MMMU), which spans a variety of disciplines, Grok-1.5V scored 53.6%, slightly ahead of Claude 3 Sonnet but behind GPT-4V.
In Mathvista, which focuses on mathematical questions, Grok-1.5V scored 52.8%, outperforming its competitors. In AI2D, which assesses understanding of diagrams, Grok-1.5V achieved an impressive 88.3%, significantly outperforming AIs like GPT-4V and Gemini Pro 1.5.
In DocVQA, which involves understanding documents, Grok-1.5V scored 85.6%, falling behind GPT-4V, Claude 3 Sonnet and Claude 3 Opus. In the RealWorldQA benchmark, which assesses understanding of the real world, Grok-1.5V scored 68.7%, ahead of the other AIs evaluated.
These results highlight Grok-1.5 Vision Preview's ability to handle a variety of complex, contextually relevant tasks, which makes it a promising choice for a wide range of real-world AI applications. However, it is important to note that although Grok-1.5V performed impressively against other artificial intelligences in the RealWorldQA benchmark, the results of these benchmarks are not necessarily 100% reliable.
They are indicative of the relative performance of different AIs in different data sets and scenarios, but should not be considered a definitive measure of an AI's overall capability. The accurate interpretation of results depends on a number of factors, including the nature of the data sets, the evaluation methodology, and the complexity of the tasks at hand.
Sources: Grok, Interesting Engineering and Mashable
Reviewed by Glaucon Vital on 15/4/24.