We hypothesize that the inability of today's MLLMs to effectively perceive basic geometric annotations and relationships stems from two factors:
- The lack of high-fidelity geometric visual perception training data.
- Limitations in their model architectures and training strategies.
Overcoming the Lack of High-Fidelity Geometric Visual Perception Training Data
To provide sufficient high-fidelity training data, we develop a dataset generation engine that programmatically produces geometric shapes. Our geometry shape generation engine is built on AlphaGeometry. Given an input formal-language description of a geometric shape, the engine first checks the validity of the shape. It then creates numerical positions for all points, following the constraints given by the input. After all points are created, it connects the lines as specified in the input. To avoid inductive bias during training (e.g., point A always appearing at the top of a triangle), letters are first picked from a letter pool (e.g., all 26 capital letters) and then randomly assigned to each point.
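To make the labeling step concrete, below is a minimal Python sketch of drawing letters from a pool and randomly assigning them to points; the function and variable names are hypothetical, not the engine's actual API.

```python
import random
import string

def assign_labels(points, letter_pool=string.ascii_uppercase):
    """Randomly assign distinct letters from the pool to each point.

    `points` is a list of (x, y) coordinates produced by the engine.
    Shuffling labels prevents positional biases, such as point A
    always appearing at the top of a triangle.
    """
    if len(points) > len(letter_pool):
        raise ValueError("Not enough letters for all points.")
    labels = random.sample(letter_pool, len(points))
    return dict(zip(labels, points))

# Example: three triangle vertices receive random, distinct labels.
triangle = [(0.0, 1.0), (-1.0, 0.0), (1.0, 0.0)]
print(assign_labels(triangle))  # e.g., {'Q': (0.0, 1.0), 'B': (-1.0, 0.0), ...}
```

Sampling without replacement guarantees distinct labels, so no geometric role is tied to a particular letter anywhere in the dataset.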
Next, we empirically explore model architectures and training strategies to improve the performance of MLLMs on Geoperception.
Lesson 1: With the same training dataset, scaling LLM size does not lead to better performance.
We first vary the size of the LLM, Qwen-2.5~\citep{qwen2.5}, across 0.5B, 1.5B, and 3B parameters while keeping all other components consistent. The results are shown below. We do not observe an obvious trend that larger LLMs learn such low-level visual perception tasks faster or better. Moving forward, we use Qwen-2.5-1.5B to continue our exploration.
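For reference, the three backbones differ only in scale. The sketch below inspects their configurations via the public Hugging Face model IDs; that these IDs match the exact checkpoints we used is an assumption.

```python
from transformers import AutoConfig

# Public Hugging Face IDs for the Qwen-2.5 backbones (assumed to match
# the checkpoints used in the experiments).
for name in ["Qwen/Qwen2.5-0.5B", "Qwen/Qwen2.5-1.5B", "Qwen/Qwen2.5-3B"]:
    cfg = AutoConfig.from_pretrained(name)
    print(name, cfg.hidden_size, cfg.num_hidden_layers)
```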
Lesson 2: CNN architectures perform better than ViTs.
We then study the choice of visual encoder architecture, considering two families of architectures, Vision Transformer (ViT) and ConvNeXT, as well as two visual representation learning objectives, language-supervised learning and self-supervised learning. ConvNeXT-Large shows superior learning performance compared with vision transformers that are 3-5 times larger.
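As an illustration of plugging a CNN encoder into the pipeline, the sketch below loads a ConvNeXT-Large backbone via timm and flattens its spatial feature map into visual tokens; the checkpoint name is timm's generic ImageNet-pretrained model, an assumption rather than the exact (e.g., language-supervised) checkpoint from our experiments.

```python
import timm
import torch

# ConvNeXt-Large as a visual backbone; `num_classes=0` removes the head.
encoder = timm.create_model("convnext_large", pretrained=True, num_classes=0)
encoder.eval()

with torch.no_grad():
    image = torch.randn(1, 3, 224, 224)       # dummy RGB input
    fmap = encoder.forward_features(image)    # (1, 1536, 7, 7) spatial map
    tokens = fmap.flatten(2).transpose(1, 2)  # (1, 49, 1536) visual tokens
print(tokens.shape)
```

Unlike a ViT, the CNN produces a spatial grid rather than patch tokens, but flattening the grid yields the same token interface for the LLM.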
Lesson 3: Tuning the vision encoder does not provide significant help.
We next study the effect of tuning versus freezing the visual encoder. Below, we show the test accuracy curves for tuned and frozen visual encoders. We find that, compared with using a frozen encoder, tuning the visual encoder does not help the model learn low-level geometric relationships faster or better. In what follows, we freeze the encoder for simplicity.
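Freezing amounts to excluding the encoder's parameters from the optimizer. A minimal PyTorch sketch, with toy modules standing in for the actual MLLM components:

```python
import torch
import torch.nn as nn

# Toy stand-ins for the MLLM components (assumed structure).
encoder = nn.Conv2d(3, 8, 3)   # stands in for the vision encoder
projector = nn.Linear(8, 16)   # stands in for the vision-language projector

# Freeze the visual encoder: disable gradients so its weights stay fixed.
for param in encoder.parameters():
    param.requires_grad = False
encoder.eval()  # also fixes normalization statistics, if any

# Only parameters that still require gradients (projector, LLM) are optimized.
trainable = [p for p in list(encoder.parameters()) + list(projector.parameters())
             if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=1e-5)
print(sum(p.numel() for p in trainable), "trainable parameters")
```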
Lesson 4: Curriculum learning unleashes the model's full potential.
We train the model sequentially, from simple to more complex shapes, and compare test accuracy on the hard-level tasks only. During training, we monitor the model's performance and dynamically adjust the distribution of training data (i.e., the curriculum stage) based on that performance.
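One way to realize this adaptive schedule is a sampler that advances the curriculum stage once accuracy on the current stage crosses a threshold. The sketch below is illustrative: the stage names, promotion threshold, and replay ratio are hypothetical, not our exact schedule.

```python
import random

STAGES = ["easy", "medium", "hard"]  # hypothetical difficulty stages
PROMOTE_AT = 0.90                    # assumed promotion threshold

class CurriculumSampler:
    def __init__(self, datasets):
        self.datasets = datasets     # maps stage name -> list of examples
        self.stage = 0               # start at the easiest stage

    def update(self, current_accuracy):
        """Advance to the next stage when performance is high enough."""
        if current_accuracy >= PROMOTE_AT and self.stage < len(STAGES) - 1:
            self.stage += 1

    def sample(self):
        """Draw mostly from the current stage, with some replay of earlier ones."""
        if self.stage > 0 and random.random() < 0.2:  # 20% replay (assumed)
            stage = random.choice(STAGES[:self.stage])
        else:
            stage = STAGES[self.stage]
        return random.choice(self.datasets[stage])
```

Replaying a fraction of earlier stages guards against forgetting the simpler relationships while the model focuses on harder shapes.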