8×: GiVA reduces the rank required to match competing vector-based methods, which shrinks the rank-dependent part of training compute by the same factor.

VeRA training overhead: 2.5×. VeRA requires about 2.5× the training time of LoRA on Qwen 2 (0.5B); GiVA brings this down to about 1.1×.

m + r per weight matrix: GiVA trains only two scaling vectors per weight matrix, one of length m and one of length r.

4 domains: NLU, NLG (commonsense, math, code, instruction tuning), and image classification, tested across 5+ model families.

7B parameters: OLMo 2 (7B) and Mistral (7B) were the largest models evaluated, with gains expected to grow at larger scales.

GiVA: 8× More Efficient AI Fine-Tuning

The 8× Efficiency Trick That Could Make AI Fine-Tuning Far Cheaper

Fine-tuning a language model with the most parameter-frugal methods available can take roughly 2.5 times longer than it needs to. That might sound like a minor inconvenience: a few extra hours on a server farm somewhere. But multiply that overhead across thousands of researchers, dozens of tasks, and a dozen model generations, and it adds up to an enormous amount of wasted compute, energy, and money. A paper from researchers at the University of Illinois Urbana-Champaign, Amazon, and Stanford now offers a surprisingly elegant fix: instead of guessing which directions in parameter space matter most, just ask the model.

The result is GiVA (Gradient-Informed Bases for Vector-Based Adaptation), a method that uses the very first gradient computed during fine-tuning to set up the adaptation process far more efficiently than prior approaches. In experiments spanning natural language understanding, mathematical reasoning, code generation, and image classification, GiVA reduces the required rank (the inner dimension of the adapter's matrices, which largely determines their compute cost) by a factor of eight compared to its closest competitors, while matching or exceeding their accuracy (Gangwar et al., 2026).

The Science

To understand why GiVA matters, it helps to understand the problem it's solving — a problem that has quietly become one of the central engineering challenges of the AI era.

Pre-trained language models like GPT-4 or OLMo are trained on vast quantities of text and develop broad, generalized knowledge. But to make them useful for specific tasks — answering medical questions, generating legal summaries, writing Python code — they typically need to be fine-tuned on a smaller, task-specific dataset. Full fine-tuning updates every parameter in the model. For a model with 7 billion parameters, that means storing and computing gradients for 7 billion numbers simultaneously. It's brutally expensive.

Parameter-efficient fine-tuning, or PEFT, is the field's answer to this problem. Rather than updating all parameters, PEFT methods introduce a small set of trainable additions — leaving the original model frozen — and achieve surprisingly good results with a fraction of the compute. The most widely adopted PEFT method is LoRA (Low-Rank Adaptation), introduced by Hu et al. in 2022. LoRA's insight is that the changes a model needs to make during fine-tuning tend to live in a low-dimensional subspace. So instead of updating a full matrix W, LoRA approximates the update as the product of two much smaller matrices, dramatically cutting the number of parameters that need to be learned.
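To make the saving concrete, here is a minimal arithmetic sketch. The 4096 × 4096 matrix size and the rank of 16 are illustrative assumptions rather than figures from the paper, and the function names are hypothetical.

```python
def full_update_params(m: int, n: int) -> int:
    """Trainable parameters when every entry of an m x n weight matrix is updated."""
    return m * n


def lora_update_params(m: int, n: int, r: int) -> int:
    """Trainable parameters for a LoRA update B @ A, with B of shape (m, r) and A of shape (r, n)."""
    return r * (m + n)


# Illustrative sizes (assumed, not taken from the paper).
m, n, r = 4096, 4096, 16
full = full_update_params(m, n)
lora = lora_update_params(m, n, r)
print(f"full fine-tuning: {full:,} parameters")
print(f"LoRA at rank {r}: {lora:,} parameters ({full // lora}x fewer)")
```

Because the LoRA count grows with r × (m + n) rather than m × n, the saving widens as the matrices get larger.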

Vector-based adaptation methods take this idea even further. Methods like VeRA (Kopiczko et al., 2024) and OSoRA (Han et al., 2025) go one step beyond LoRA by freezing both low-rank matrices and training only small diagonal scaling vectors: essentially just lists of numbers that stretch or shrink the frozen matrices' influence. This makes them extraordinarily parameter-efficient: instead of training two full low-rank matrices, you're training just two vectors. The catch is that frozen matrices chosen without reference to the downstream task aren't very informative about it, so these methods require a much higher rank; in effect, they need wider matrices to compensate for the lack of task-relevant structure. VeRA often needs a rank of 1,024 to match LoRA's rank of 16. That 64× gap in rank translates directly into longer training times.
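The following NumPy sketch shows what "training only scaling vectors" looks like, assuming the update has the VeRA-style form diag(scale_out) · B · diag(scale_rank) · A. The variable names and the tiny sizes are hypothetical and chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, r = 6, 5, 3  # tiny illustrative sizes

# Frozen matrices: never updated during training.
B = rng.standard_normal((m, r))
A = rng.standard_normal((r, n))

# The only trainable parameters: one vector of length m, one of length r.
scale_out = rng.standard_normal(m)
scale_rank = rng.standard_normal(r)

# Weight update written with explicit diagonal matrices ...
delta_w = np.diag(scale_out) @ B @ np.diag(scale_rank) @ A

# ... and the same update computed with elementwise row scaling instead.
delta_w_fast = (scale_out[:, None] * B) @ (scale_rank[:, None] * A)

trainable = scale_out.size + scale_rank.size
print(delta_w.shape, trainable, np.allclose(delta_w, delta_w_fast))
```

The update still has the full m × n shape, but only m + r numbers are learned. Everything else is fixed before training starts, which is why the choice of the frozen matrices matters so much.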

GiVA's core insight is that this trade-off is not inevitable. The reason prior vector-based methods need such high ranks is that their frozen matrices are essentially uninformed — random, or derived only from the pre-trained weights, with no knowledge of the downstream task. What if the frozen matrices were set up using actual information about what the task demands?

The answer turns out to involve the gradient: the mathematical signal that tells a neural network which direction to adjust its parameters in order to reduce its errors. Before GiVA begins fine-tuning, it computes the gradient that one step of full fine-tuning would use, on a single batch of training data. It doesn't apply that gradient update; it just looks at it. Specifically, it performs a Singular Value Decomposition (SVD), a standard linear algebra technique that decomposes a matrix into its most important directions, on the gradient matrix. The top r of the resulting directions, called right singular vectors, become the rows of GiVA's frozen A matrix. The frozen B matrix can be initialized as any orthonormal matrix (one whose columns are mutually perpendicular and unit-length). From that point on, only the tiny scaling vectors Λ and Γ are trained.
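The initialization described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code: the function name is hypothetical, a random matrix stands in for a real first-batch gradient, and the starting values of the scaling vectors are left out because the article does not state them.

```python
import numpy as np


def giva_style_init(grad: np.ndarray, r: int, rng: np.random.Generator):
    """Build frozen A and B from a first-step gradient of shape (m, n).

    A: top-r right singular vectors of the gradient, stored as rows, shape (r, n).
    B: random matrix with orthonormal columns, shape (m, r) (the Rand-B variant).
    """
    m, n = grad.shape
    # Thin SVD: grad = U @ diag(s) @ Vt, singular values in descending order.
    _, _, vt = np.linalg.svd(grad, full_matrices=False)
    A = vt[:r]
    # The Q factor of a QR decomposition has orthonormal columns.
    B, _ = np.linalg.qr(rng.standard_normal((m, r)))
    return A, B


rng = np.random.default_rng(0)
# Stand-in for the gradient of the loss w.r.t. one weight matrix on one batch.
grad = rng.standard_normal((64, 48))
A, B = giva_style_init(grad, r=8, rng=rng)
print(A.shape, B.shape)
```

In a real pipeline the gradient would come from one forward-and-backward pass through the model, and the decomposition would be repeated for each adapted weight matrix.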

The theoretical justification is clean. Gangwar et al. prove that this initialization minimizes the difference between GiVA's first training step and what a full fine-tuning update would have done — meaning GiVA starts out already pointed in the right direction. The math shows that setting A equal to the top-r right singular vectors of the gradient, and constraining B to have orthonormal columns, is the optimal choice for approximating the full fine-tuning update (Gangwar et al., 2026).
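The paper's proof is not reproduced here, but a closely related classical fact (the Eckart-Young theorem) can be checked numerically: among all r-dimensional bases, the top-r right singular vectors lose the least when a matrix is projected onto them. The sketch below uses a random matrix as a stand-in for a gradient, and the helper name is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, r = 40, 30, 5
grad = rng.standard_normal((m, n))  # stand-in for a first-step gradient

_, s, vt = np.linalg.svd(grad, full_matrices=False)


def projection_error(G: np.ndarray, A: np.ndarray) -> float:
    """Frobenius norm of what is lost when G is projected onto the rows of A."""
    return float(np.linalg.norm(G - G @ A.T @ A))


A_top = vt[:r]  # top-r right singular vectors
q, _ = np.linalg.qr(rng.standard_normal((n, r)))
A_rand = q.T  # random orthonormal rows, for comparison

err_top = projection_error(grad, A_top)
err_rand = projection_error(grad, A_rand)
# Eckart-Young: the best rank-r error equals the norm of the discarded singular values.
err_bound = float(np.sqrt(np.sum(s[r:] ** 2)))
print(f"top-r basis: {err_top:.3f}  random basis: {err_rand:.3f}  optimum: {err_bound:.3f}")
```

The top-r basis reaches the theoretical optimum, while the random basis leaves a larger residual. This is the intuition behind building A from the gradient rather than from random numbers.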

The team evaluated three variants of this scheme: TopSV-B (B initialized from the top singular vectors of the gradient), SecSV-B (B initialized from the next set of singular vectors after those), and Rand-B (B initialized as a random orthonormal matrix). All three perform comparably, with Rand-B often edging out the others. The fact that even a random B works well when A is well-chosen suggests that most of the gain comes from the informed initialization of A.
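A rough sketch of the three choices for B follows. It rests on an assumption the article does not spell out: that "singular vectors" here means left singular vectors of the gradient (which have the right length for B's columns), with TopSV-B taking the first r and SecSV-B the next r. The function name and variant labels are hypothetical.

```python
import numpy as np


def init_b(grad: np.ndarray, r: int, variant: str, rng: np.random.Generator) -> np.ndarray:
    """Return an (m, r) matrix with orthonormal columns for the chosen variant."""
    m, _ = grad.shape
    if variant == "rand":
        q, _ = np.linalg.qr(rng.standard_normal((m, r)))
        return q
    # Columns of u are left singular vectors, ordered by decreasing singular value.
    u, _, _ = np.linalg.svd(grad, full_matrices=False)
    if variant == "top":
        return u[:, :r]  # assumed: top-r left singular vectors
    if variant == "sec":
        if 2 * r > u.shape[1]:
            raise ValueError("not enough singular vectors for the second block")
        return u[:, r:2 * r]  # assumed: the next r left singular vectors
    raise ValueError(f"unknown variant: {variant}")


rng = np.random.default_rng(0)
grad = rng.standard_normal((32, 24))  # stand-in gradient
variants = {name: init_b(grad, 4, name, rng) for name in ("top", "sec", "rand")}
for name, B in variants.items():
    print(name, B.shape, np.allclose(B.T @ B, np.eye(4)))
```

All three produce a B with orthonormal columns, which is the only property the initialization requires of it.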

What They Found

The experiments cover an impressive range of tasks and model sizes, from the 87-million-parameter DINOv2 vision model to the 7-billion-parameter OLMo 2 language model.

On the GLUE benchmark, a standard suite of natural language understanding tasks including sentiment analysis, textual similarity, and grammatical acceptability, GiVA with RoBERTa-large outperformed VeRA while using a rank of just 8, compared with VeRA's rank of 1,024 (Gangwar et al., 2026). Across GLUE tasks, all three GiVA initialization strategies produced comparable results, suggesting the method is robust to the choice of B.

GiVA vs. Competing Methods: Average GLUE Score (RoBERTa-Large)

Average GLUE validation performance across tasks for RoBERTa-Large. GiVA (Rand-B) at rank 8 achieves the highest average score among vector-based methods.

Method	Average GLUE score
LoRA (r=8)	87
VeRA (r=1024)	86.5
RandLoRA (r=8)	87
GiVA TopSV-B (r=8)	87.2
GiVA SecSV-B (r=8)	87.2
GiVA Rand-B (r=8)	87.4

On commonsense reasoning tasks — where models are fine-tuned on 15,000 examples and then tested on seven benchmarks including BoolQ, HellaSwag, and WinoGrande — GiVA surpassed both VeRA and OSoRA across models ranging from Qwen 2 (0.5B) to OLMo 2 (7B), while using a rank of 64 compared to VeRA's rank of 1,024 (Gangwar et al., 2026). That's a 16× reduction in rank on that particular comparison.

Commonsense Reasoning Accuracy: OLMo 2 (7B)

Average accuracy across seven commonsense reasoning benchmarks for OLMo 2 (7B), comparing adaptation methods at their respective ranks.

Method	Average accuracy (%)
LoRA (r=64)	78.4
VeRA (r=1024)	76.9
OSoRA (r=64)	77.8
RandLoRA (r=64)	78.3
GiVA Rand-B (r=64)	78.5

The training time results are where the implications become most concrete. Fine-tuning Qwen 2 (0.5B) on the commonsense reasoning task, VeRA requires approximately 2.5× the wall-clock time of LoRA, because its rank-1024 matrices are simply bigger to work with. GiVA, operating at rank 64, closes that gap substantially — approaching LoRA's training speed while preserving the extreme parameter efficiency that makes vector-based methods attractive in the first place (Gangwar et al., 2026). As model sizes increase toward 7 billion parameters, GiVA's advantage over competing vector-based methods becomes even more pronounced.

Training Time Relative to LoRA (Commonsense Reasoning, Qwen 2 0.5B)

Wall-clock training time for fine-tuning Qwen 2 (0.5B) on 15K commonsense reasoning examples, normalized to LoRA = 1.0×.

Method	Training time (relative to LoRA)
LoRA	1
VeRA	2.5
OSoRA	2.1
RandLoRA	2.3
GiVA (Rand-B)	1.1

On mathematical reasoning (fine-tuning OLMo 2 7B on MetaMathQA and testing on GSM8k) and code generation (fine-tuning on Code-Feedback and testing on HumanEval), GiVA again matched or exceeded VeRA and OSoRA. On the instruction-following benchmark MT-Bench, where a fine-tuned Mistral 7B's responses are scored by GPT-4, GiVA using its simplest Rand-B initialization matched both LoRA and VeRA overall, and actually outperformed both on first-turn questions (Gangwar et al., 2026).

Image classification results on four datasets (CIFAR100, Food101, Flowers102, and RESISC45) using both DINOv2 and CLIP vision models told the same story: GiVA at rank 32 is competitive with all methods across the board.

Why This Changes Things

The significance of GiVA isn't just about a benchmark number going up by a point or two. It's about what becomes possible when fine-tuning gets cheaper and more parameter-efficient simultaneously.

Consider the context. The number of trainable parameters in GiVA is just m + r per weight matrix, where m is the matrix's row count and r is the rank. At rank 8 or 32, this is extraordinarily small. A fine-tuned model checkpoint, in the GiVA framework, is essentially just two tiny vectors per adapted weight matrix, on top of the frozen A and B matrices that are stored once per task. That makes it easy to store thousands of fine-tuned task-specific adapters, switch between them at runtime, or distribute them over a network.
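A back-of-the-envelope calculation shows how small the trained part of a checkpoint is. The model shape below (32 layers, four adapted 4096 × 4096 matrices per layer, 16-bit values) is an illustrative assumption, and the function names are hypothetical. The count covers trainable values only and leaves out the frozen A and B matrices.

```python
def vector_adapter_params(m: int, r: int) -> int:
    """Trainable values per weight matrix: one vector of length m plus one of length r."""
    return m + r


def lora_adapter_params(m: int, n: int, r: int) -> int:
    """Trainable values per weight matrix for LoRA: B (m x r) plus A (r x n)."""
    return r * (m + n)


# Assumed, illustrative model shape: 32 layers, 4 adapted 4096 x 4096 matrices per layer.
m = n = 4096
adapted_matrices = 32 * 4
bytes_per_value = 2  # 16-bit floats

giva_bytes = adapted_matrices * vector_adapter_params(m, r=32) * bytes_per_value
lora_bytes = adapted_matrices * lora_adapter_params(m, n, r=16) * bytes_per_value
print(f"vector-based, r=32: {giva_bytes / 1e6:.2f} MB of trained values")
print(f"LoRA, r=16:         {lora_bytes / 1e6:.2f} MB of trained values")
```

Under these assumptions the trained vectors take about a megabyte, roughly thirty times less than the LoRA adapter.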

That last point matters enormously for federated learning — the approach where models are trained collaboratively across many devices (hospitals, phones, edge servers) without sharing raw data. In federated learning, every update that needs to be communicated over a network is a cost, and often a privacy risk. Vector-based adaptation methods, by reducing the payload to just scaling vectors, are already attractive for this use case. GiVA makes them faster too, which could be the difference between a federated fine-tuning pipeline that's practical and one that isn't.

The mixture of experts architecture — a design where a single model routes different inputs to different specialized sub-networks — is another natural home for GiVA. Each "expert" could be a separately fine-tuned adapter using GiVA's tiny parameter footprint, enabling a huge number of specialized capabilities to coexist in a single deployable system without ballooning storage requirements.

There's also a broader economic and environmental argument. AI compute is expensive. Training runs for large models have been estimated to cost millions of dollars and emit significant quantities of CO₂. Fine-tuning, while far cheaper than pre-training, happens constantly: every time a company deploys a specialized model, every time a researcher runs an ablation. An 8× reduction in rank means roughly 8× fewer operations in the adapter's matrix multiplications, whose cost grows linearly with rank. The frozen base model's cost is unchanged, so the end-to-end speedup is smaller than 8×, but at ranks like 1,024 the adapter accounts for a large share of training time. Across the scale at which fine-tuning now happens globally, that's a meaningful efficiency gain.
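The relationship between rank and compute can be sketched with a simple operation count. This is an approximation under stated assumptions: a 4096 × 4096 layer (illustrative, not from the paper), two floating-point operations per multiply-add, and the cheap elementwise scalings ignored. The function names are hypothetical.

```python
def adapter_flops_per_token(m: int, n: int, r: int) -> int:
    """Approximate FLOPs for x -> A x -> B (A x), ignoring the elementwise scalings."""
    return 2 * r * n + 2 * m * r


def base_flops_per_token(m: int, n: int) -> int:
    """Approximate FLOPs for multiplying by the frozen m x n weight matrix itself."""
    return 2 * m * n


m = n = 4096  # assumed, illustrative layer size
base = base_flops_per_token(m, n)
for r in (1024, 128):
    extra = adapter_flops_per_token(m, n, r)
    print(f"rank {r:>4}: adapter adds {extra / base:.1%} on top of the frozen layer")
```

Cutting the rank by 8× cuts the adapter's own cost by 8×, while the frozen layer's cost stays the same. That is why a lower rank brings training time closer to LoRA's without making the whole step 8× faster.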

It's worth noting what GiVA doesn't claim. It is not always the best-performing method. On code generation tasks in particular, LoRA outperforms all vector-based methods, including GiVA, by a considerable margin (Gangwar et al., 2026). The paper is honest about this: when raw performance is the only goal and parameter count is not a constraint, LoRA (or full fine-tuning) may still be the right tool. GiVA shines specifically in the regime where you care about both performance and efficiency — which describes the vast majority of real-world deployment scenarios.

The overhead GiVA does introduce is worth acknowledging. Computing the first-step gradient requires one forward-and-backward pass before training begins. Storing the frozen A and B matrices (once per task) requires more disk space than storing just vectors. And performing SVD on the gradient adds a small upfront computation. The authors address this by using a single batch for the gradient computation and a low-rank SVD algorithm, keeping the overhead manageable. Subsequent checkpoints are as lightweight as those of any other vector-based method (Gangwar et al., 2026).
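The article does not say which low-rank SVD routine the authors use. One common family is randomized SVD, sketched below in NumPy as an illustration of how the top directions can be found without decomposing the whole matrix. The function name, the default oversampling and iteration counts, and the stand-in gradient are all assumptions.

```python
import numpy as np


def randomized_top_right_vectors(G, r, oversample=8, n_iter=2, seed=0):
    """Approximate the top-r singular values and right singular vectors of G."""
    rng = np.random.default_rng(seed)
    m, n = G.shape
    k = min(r + oversample, m, n)
    # Sketch the column space of G with a random test matrix.
    Q, _ = np.linalg.qr(G @ rng.standard_normal((n, k)))
    # Power iterations sharpen the sketch when singular values decay slowly.
    for _ in range(n_iter):
        Q, _ = np.linalg.qr(G.T @ Q)
        Q, _ = np.linalg.qr(G @ Q)
    # Exact SVD of the small (k x n) projected matrix.
    _, s, vt = np.linalg.svd(Q.T @ G, full_matrices=False)
    return s[:r], vt[:r]


rng = np.random.default_rng(0)
# Stand-in gradient with exact rank 10, so a small sketch can capture it fully.
G = rng.standard_normal((60, 10)) @ rng.standard_normal((10, 40))
s_approx, A = randomized_top_right_vectors(G, r=5)
s_exact = np.linalg.svd(G, compute_uv=False)[:5]
print(A.shape, np.allclose(s_approx, s_exact))
```

The expensive decomposition runs on a k × n matrix instead of the full m × n gradient, which keeps the upfront cost small when r is much smaller than the layer dimensions.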

What's Next

GiVA opens several threads worth pulling on. The most immediate is scale: the experiments here top out at 7 billion parameters. In the reported results, GiVA's advantage over competing vector-based methods grew as models got larger, which suggests it could become even more pronounced at 70 billion or 400 billion parameters, the scales at which frontier models now operate. That remains untested, and checking it is a natural next step.

The three initialization strategies for B — TopSV-B, SecSV-B, and Rand-B — all perform similarly across most benchmarks, which raises an interesting theoretical question: why? If the choice of B barely matters when A is well-chosen, what does that tell us about the geometry of fine-tuning loss landscapes? The paper doesn't fully resolve this, and understanding it could lead to further simplifications.

There's also the question of whether GiVA's gradient-informed initialization strategy could be applied to other PEFT methods beyond vector-based adaptation. The authors draw an explicit connection to LoRA-GA (Wang et al., 2024) and LoRA-One (Zhang et al., 2025), which use similar gradient-based initialization for LoRA's trainable matrices. GiVA applies the same philosophy but freezes those matrices and trains only the scaling vectors instead. A unified framework that lets practitioners choose their efficiency-performance trade-off along this spectrum would be a valuable contribution.

Finally, the code generation results — where LoRA beats vector-based methods significantly — suggest that the compressed parameterization of GiVA and its cousins may struggle with tasks that require fundamentally richer weight updates. Understanding which tasks benefit most from gradient-informed bases, and which require more expressive adaptation, could help practitioners make better choices.

What Gangwar et al. have demonstrated is that the perceived trade-off between parameter efficiency and training speed in fine-tuning is not a law of nature. It was an artifact of uninformed initialization. By spending a single batch's worth of computation to look at the gradient before training begins, GiVA sets up its frozen matrices to already be pointing in the right direction — and everything downstream becomes cheaper. In a field where the dominant trend is relentlessly growing model sizes and relentlessly growing fine-tuning costs, that's a genuinely useful thing to know.
