Introducing LUMA

A VLM that sees and understands with unprecedented efficiency.

Today we're launching LUMA, our breakthrough vision-language model that achieves state-of-the-art performance across visual understanding benchmarks while maintaining exceptional efficiency at just 7 billion parameters.

LUMA represents a significant advance in how AI systems process and understand visual information. Despite its compact size, LUMA outperforms much larger models, including GPT-4o and Claude 3.5 Sonnet, on key vision-language benchmarks.

Often, performance gains of this kind come from distilling the knowledge of a larger teacher model into a smaller student model.

LUMA is different. In pretraining, we train a native dynamic-resolution Vision Transformer with window attention, allocate visual tokens dynamically based on input size, and modify MRoPE to align with dynamic absolute time. In post-training, we apply an adaptive reinforcement learning (RL) system built specifically to improve visual reasoning.
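
To make the dynamic-resolution idea concrete, here is a minimal Python sketch of resolution-dependent token allocation: the input keeps its aspect ratio, is rounded to the nearest patch grid, and larger images simply produce more visual tokens. The patch size, merge factor, and token cap below are assumed values for illustration, not LUMA's published configuration.

```python
import math

# Illustrative constants; LUMA's actual values are not published here.
PATCH_SIZE = 14            # ViT patch edge, in pixels (assumed)
MERGE_FACTOR = 2           # 2x2 patch merging before the language model (assumed)
MAX_VISUAL_TOKENS = 16384  # hypothetical cap on visual tokens per image

def visual_token_count(height: int, width: int) -> tuple[int, int, int]:
    """Return (resized_h, resized_w, n_tokens) for a dynamic-resolution ViT.

    Instead of resizing every image to a fixed square, the image keeps its
    aspect ratio and is rounded to the nearest patch grid, so larger or
    wider inputs produce proportionally more tokens.
    """
    grid_unit = PATCH_SIZE * MERGE_FACTOR
    h = max(grid_unit, round(height / grid_unit) * grid_unit)
    w = max(grid_unit, round(width / grid_unit) * grid_unit)

    n_tokens = (h // grid_unit) * (w // grid_unit)
    if n_tokens > MAX_VISUAL_TOKENS:
        # Downscale proportionally so the token budget is respected.
        scale = math.sqrt(MAX_VISUAL_TOKENS / n_tokens)
        h = max(grid_unit, int(h * scale) // grid_unit * grid_unit)
        w = max(grid_unit, int(w * scale) // grid_unit * grid_unit)
        n_tokens = (h // grid_unit) * (w // grid_unit)
    return h, w, n_tokens

# A 1080p frame and a small thumbnail get very different token budgets:
print(visual_token_count(1080, 1920))  # (1092, 1932, 2691)
print(visual_token_count(224, 224))    # (224, 224, 64)
```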

Together, these choices let LUMA attend to cross-modal tokens more efficiently. With a training budget of 4.7B tokens, LUMA's 7 billion parameters learn the complex structure of visual data effectively.
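
The efficiency gain from window attention is easy to see with rough arithmetic. The sketch below compares the pairwise-score count of full attention against a fixed local window for the 1080p token budget from the sketch above; the 64-token window size is an assumed value, not LUMA's.

```python
def attention_cost(n_tokens: int, window: int | None = None) -> int:
    """Approximate pairwise score count for one attention layer.

    Full attention scores every token against every other token (n^2).
    Window attention only scores tokens within fixed-size local windows,
    so the cost grows roughly linearly with the number of tokens.
    """
    if window is None:                   # full (global) attention
        return n_tokens * n_tokens
    n_windows = -(-n_tokens // window)   # ceiling division
    return n_windows * window * window   # each window attends within itself

n = 2691  # visual tokens for a 1080p frame in the earlier sketch
print(attention_cost(n))             # 7,241,481 pairwise scores (full)
print(attention_cost(n, window=64))  # 176,128 pairwise scores (windowed)
```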

LUMA is exceptionally good at analyzing video: it can reason over videos up to an hour in length, and can reason about live video from up to three channels at 15 FPS.
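
The absolute-time alignment behind this can be illustrated with a small sketch: temporal position ids are derived from frame timestamps rather than frame indices, so the same clip sampled at different frame rates lands on the same time scale. The 0.5 s granularity and the helper name are hypothetical, used only to show the idea.

```python
def temporal_position_ids(timestamps_s: list[float], time_step_s: float = 0.5) -> list[int]:
    """Map frame timestamps (in seconds) to temporal position ids.

    Positions are tied to absolute time rather than to frame index, so a
    clip sampled at 2 FPS and the same clip sampled at 15 FPS share one
    temporal scale. The 0.5 s granularity is an assumed value.
    """
    return [round(t / time_step_s) for t in timestamps_s]

# The same one-second span, sampled at two different frame rates:
low_fps  = [0.0, 0.5, 1.0]              # 2 FPS (inclusive of the endpoint)
high_fps = [i / 15 for i in range(16)]  # 15 FPS
print(temporal_position_ids(low_fps))   # [0, 1, 2]
print(temporal_position_ids(high_fps))  # [0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2]
```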

As the cornerstone of our perception stack, LUMA bridges the gap between visual understanding and language reasoning, enabling unprecedented accuracy in scene understanding and multimodal reasoning tasks. LUMA's efficient architecture makes it deployable across diverse environments while maintaining the robustness required for real-world applications.

We're especially interested in LUMA's applicability on the edge: we built and fine-tuned LUMA for robot perception, and we're particularly excited to see how its powerful video reasoning abilities translate in deployment.

Action is Downstream of Perception

How much harder would it be to cook dinner if the only thing you had ever seen were pictures of tax forms? Could you walk if your visual processing had to be hosted remotely, then streamed back to you with varying rates of delay?

To take action in the world, you must first perceive and understand it. Building robust and efficient perception models is a prerequisite for developing systems that can interact with the physical world with the same confidence and competence as humans.

LUMA's performance shows that efficient, lightweight models can achieve superior understanding compared to their larger counterparts across diverse modalities, enabling a new generation of autonomous systems that can understand and react to the complex dynamics of the real world.

Next Steps: Advancing Multimodal Intelligence

We're exploring how LUMA's architectural innovations can extend to other domains and inspire new approaches to how AI systems process and integrate information across modalities. We're also interested in making inference with LUMA even more efficient, and in whether we can maintain current inference speeds while adding chain-of-thought (CoT) reasoning.

/Key Capabilities

Superior Visual Reasoning

LUMA's innovative architectural mechanisms enable unprecedented understanding of visual scenes, from complex document layouts to intricate spatial relationships, achieving SOTA performance on visual question answering and scene understanding tasks.

Efficient Architecture

At just 7 billion parameters, LUMA delivers performance that surpasses models 40x its size. Our architectural innovations maximize capability per parameter, enabling deployment in resource-constrained environments without sacrificing accuracy.

Video Understanding

LUMA achieves state-of-the-art performance on video reasoning, and adapts dynamically to changes in frame rate, resolution, lighting, and scene composition without a drop in performance.

Real-World Robustness

Trained on diverse, real-world data, LUMA maintains high performance across varied lighting conditions, image quality, and visual contexts, making it reliable for production deployments and practical applications.

/Performance Benchmarks

LUMA achieves state-of-the-art performance across key vision-language benchmarks, consistently outperforming larger models including GPT-4o and Claude 3.5 Sonnet.

Model | DocVQA | InfoVQA | MathVista | OCRBench v2 | MMMU | VideoMME | LVBench | MMBench-Video
LUMA | 96.7 | 83.3 | 69.6 | 58.1 | 61.2 | 72.6 | 46.3 | 1.81
GPT-4o | 91.1 | 80.7 | 63.8 | 46.5 | 70.3 | 71.9 | 30.8 | 1.68
Claude 3.5 Sonnet | 95.2 | 74.3 | 70.5 | 45.2 | 70.4 | 62.9 | - | 1.38
Highlights:
DocVQA: 96.7% (SOTA), +5.6 vs GPT-4o
InfoVQA: 83.3% (SOTA), +9.0 vs Claude 3.5 Sonnet
OCRBench v2: 58.1% (SOTA), +12.9 vs Claude 3.5 Sonnet
MMMU: 87% of GPT-4o's score at ~5% of the size

* All benchmarks evaluated using standard protocols. LUMA achieves SOTA performance on 7 out of 8 benchmarks and demonstrates remarkable parameter efficiency on MMMU, achieving 87% of GPT-4o's performance with ~5% of the parameters and no Chain-of-Thought reasoning.

/Technical Specifications

Architecture: Vision-Language Model (VLM)
Parameters: 7 billion
Modalities: Vision, Text
SOTA Scores On: DocVQA, InfoVQA, MathVista, OCRBench v2, VideoMME, LVBench, MMBench-Video
Release Date: June 2025

Research Publication

For detailed technical insights into LUMA's architecture and benchmarking results, please refer to our post:

Efficient Perception on the Edge (Coming Soon)

Experience LUMA

Interested in integrating LUMA's advanced vision-language capabilities into your applications? Request access to our API or schedule a demonstration to see how LUMA can transform your visual understanding workflows.

Request LUMA Access