Do All Individual Layers Help? An Empirical Study of Task-Interfering Layers in Vision-Language Models

1Harbin Institute of Technology, Shenzhen   2Harbin Institute of Technology   3Southeast University
4Central South University   5National University of Singapore
6The Hong Kong University of Science and Technology, Guangzhou
shuoyang@hit.edu.cn
Key finding overview figure
Selectively bypassing a single layer of a pretrained model can lead to substantial performance improvements on certain tasks.

Abstract

Current Vision-Language Models (VLMs) have demonstrated remarkable capabilities across a wide range of multimodal tasks. Typically, in a pretrained VLM, all layers are engaged by default to make predictions on downstream tasks. Surprisingly, we find that intervening on a single layer, such as by zeroing its parameters, can improve performance on certain tasks, indicating that some layers hinder rather than help downstream tasks. To understand when and why this occurs, we systematically investigate how individual layers influence different tasks via layer intervention (e.g., parameter zeroing). Specifically, we measure the change in performance relative to the base model after intervening on each layer and observe improvements when bypassing specific layers. This improvement generalizes across models and datasets, indicating the presence of Task-Interfering Layers that harm downstream task performance. To further analyze this phenomenon, we introduce the Task-Layer Interaction Vector, which quantifies the effect of intervening on each layer of a VLM for a given task. Crucially, these task-interfering layers exhibit task-specific sensitivity patterns: tasks requiring similar capabilities show consistent response trends under layer interventions, as evidenced by the high similarity of their task-layer interaction vectors. Inspired by these findings, we propose TaLo (Task-Adaptive Layer Knockout), a training-free, test-time adaptation method that dynamically identifies and bypasses the most interfering layer for a given task, serving to validate and operationalize our observations. Without parameter updates, TaLo consistently improves performance across various models and datasets, even boosting Qwen-VL's accuracy on the Maps task in ScienceQA by up to 16.6%, a proof of concept that demonstrates the tangible impact of this phenomenon.
Our work reveals an unexpected form of modularity in pretrained VLMs and provides a plug-and-play, training-free mechanism to unlock hidden capabilities at inference time. The source code will be publicly available.

Key Contributions

  • Task-Interfering Layers: Through systematic layer-wise interventions, we observe that bypassing certain layers can lead to improved task performance. We refer to these as Task-Interfering Layers, denoting pretrained components whose knowledge is inconsistent with the objectives of specific downstream tasks.
  • Task-Layer Interaction Vector: We establish a quantitative framework for analyzing the relationship between tasks and model layers by introducing the Task-Layer Interaction Vector, enabling further examination of how similar tasks exhibit consistent responses to layer interventions.
  • TaLo: We develop a practical, plug-and-play algorithm TaLo that leverages these insights to improve model performance at test time without any parameter updating. Using this method, LLaVA and Qwen-VL achieve peak improvements of up to 10.4% and 16.6%, respectively, across 10 tasks spanning 5 benchmarks.

Interactive Layer Bypass

Click a layer to bypass it; then pick a task to see performance vs. baseline.


Discovering Task-Interfering Layers

Through systematic layer-wise interventions such as parameter zeroing and uniform scaling, we observe that bypassing certain layers can improve task performance, revealing the existence of Task-Interfering Layers whose pretrained knowledge is inconsistent with the objectives of specific downstream tasks. By nullifying the self-attention mechanism of individual layers while preserving residual connections, we find that 54.1% of tasks in LLaVA-Next and 75.6% of tasks in Qwen-VL exhibit performance gains exceeding 5% upon intervention. This provides direct empirical evidence that the primary source of task interference lies in the cross-modal reasoning and task-execution processes handled by the LLM backbone rather than in the visual encoder. Ultimately, these findings suggest that certain layers capture features that, while beneficial on average across diverse pretraining data, introduce noise or misalignment when applied to a particular downstream task, thus hindering its performance.
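The intervention above can be sketched with a toy transformer block. This is an illustrative mock, not the paper's implementation: `attention` and `mlp` are hypothetical stand-in sublayers, and the point is only that zeroing the attention output leaves the residual path intact, so the hidden state still flows to later layers.

```python
# Toy sketch (illustrative, not the paper's code) of nullifying a layer's
# self-attention output while preserving the residual connection.

def attention(x):
    # hypothetical stand-in for the self-attention sublayer
    return [0.5 * v for v in x]

def mlp(x):
    # hypothetical stand-in for the feed-forward sublayer
    return [0.1 * v for v in x]

def block(x, knock_out_attn=False):
    # zeroing the attention output keeps the residual path intact,
    # so the hidden state still reaches the MLP and subsequent layers
    attn_out = [0.0] * len(x) if knock_out_attn else attention(x)
    h = [xi + ai for xi, ai in zip(x, attn_out)]
    return [hi + mi for hi, mi in zip(h, mlp(h))]

hidden = [1.0, -2.0, 3.0]
intervened = block(hidden, knock_out_attn=True)  # attention contributes nothing
```

With the knockout enabled, the block degenerates to residual-plus-MLP, which is exactly why bypassing a layer is a cheap, reversible intervention.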

Characterizing Task-Interfering Layers

To systematically analyze task-specific sensitivities, we introduce the Task-Layer Interaction Vector, defined for a task \( \mathcal{T} \) and a model with \(L\) layers as \( v^{(\mathcal{T})} = (v_{1}^{(\mathcal{T})}, v_{2}^{(\mathcal{T})}, \ldots, v_{L}^{(\mathcal{T})}) \in \mathbb{R}^{L} \). Each dimension \( v_{i}^{(\mathcal{T})} \) represents a layer sensitivity score, formally calculated as \( v_{i}^{(\mathcal{T})} = Acc(\mathcal{M}_{intv}^{(i)}, \mathcal{T}) - Acc(\mathcal{M}_{base}, \mathcal{T}) \), where a positive value indicates the layer interferes with the task and a negative value indicates a positive contribution. To analyze task relationships, a distance metric \( d_{ij} = 1 - \rho_{ij} \) is established based on the Pearson correlation coefficient \( \rho_{ij} = Corr(v^{(i)}, v^{(j)}) \) between task vectors. Cluster analysis across nearly 100 tasks confirms that tasks drawing upon similar underlying cognitive skills form coherent clusters. Notably, this clustering is reliable, achieving an Adjusted Rand Index (\( \mathrm{ARI} \)) of \(0.72 \pm 0.24\) and a Silhouette score of \(0.61\) (dropping to \(0.32\) under task-label permutation), demonstrating that \( v^{(\mathcal{T})} \) effectively encodes task-specific patterns.
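The two definitions above (the interaction vector \( v^{(\mathcal{T})} \) and the distance \( d_{ij} = 1 - \rho_{ij} \)) can be written out directly. The accuracy numbers below are made-up toy values, not results from the paper:

```python
# Sketch (toy numbers, not the paper's results) of the Task-Layer Interaction
# Vector and the Pearson-correlation-based task distance d_ij = 1 - rho_ij.

def interaction_vector(acc_intervened, acc_base):
    # v_i = Acc(M_intv^(i), T) - Acc(M_base, T), one entry per layer
    return [a - acc_base for a in acc_intervened]

def pearson(u, v):
    # plain Pearson correlation coefficient between two task vectors
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    du = [x - mu for x in u]
    dv = [x - mv for x in v]
    num = sum(a * b for a, b in zip(du, dv))
    den = (sum(a * a for a in du) * sum(b * b for b in dv)) ** 0.5
    return num / den

# Hypothetical per-layer accuracies after zeroing each of 4 layers,
# for two tasks that rely on similar capabilities
v_task_a = interaction_vector([0.52, 0.61, 0.48, 0.55], acc_base=0.50)
v_task_b = interaction_vector([0.41, 0.49, 0.37, 0.44], acc_base=0.40)

distance = 1 - pearson(v_task_a, v_task_b)  # small for similar tasks
```

In this toy example both tasks gain most from knocking out the same layer, so their vectors correlate strongly and the distance is near zero, mirroring the clustering behavior reported above.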

Method Overview

TaLo

Building on the discovery of Task-Interfering Layers, we propose TaLo (Task-Adaptive Layer Knockout), a minimalist, training-free, test-time adaptation framework that dynamically identifies and bypasses the most interfering layer for a given task. TaLo operates in two distinct stages: dynamic layer selection for a specific task, and task-interfering layer knockout on the model. Using a probing set \( D_{probe} = \{(x_i, y_i)\}_{i=1}^N \) sampled from a downstream task, the method first establishes a baseline performance score \( \mathcal{B}=Acc(f_{\theta},D_{probe}) \) on the unmodified model. It then systematically measures the accuracy gain \( \Delta_l \) for each layer \( l \) by applying parameter zeroing (\( I(\theta_l)=0 \)) to create a modified model \( f_{\theta}^{(l)} \), where the gain is defined as:

$$\Delta_l = Acc(f_{\theta}^{(l)}, D_{probe}) - Acc(f_{\theta}, D_{probe}) $$

The optimal layer \( l^* \) is identified by selecting the one responsible for the maximal positive performance gain:

$$l^* = \arg\max_{l \in \{1, \dots, L\}} \Delta_l $$

Crucially, this adaptation is task-level and static; once the intervention is determined, it is applied to all subsequent instances of that task during inference, requiring no per-instance decisions or parameter updates. This training-free strategy allows for efficient, plug-and-play model customization that avoids the computational overhead of large-scale fine-tuning or complex prompt modifications.
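The two-stage procedure above reduces to a short selection loop. The sketch below is a schematic rendering under an assumed `evaluate` callback (anything that returns probe-set accuracy with a given layer zeroed, `None` meaning the unmodified baseline); the accuracy table is invented for illustration:

```python
# Schematic sketch of TaLo's selection stage (not the paper's implementation).
# `evaluate(l)` is an assumed callback: probe-set accuracy with layer l
# zeroed, or baseline accuracy when l is None.

def talo_select(evaluate, num_layers):
    baseline = evaluate(None)                       # B = Acc(f_theta, D_probe)
    # Delta_l = Acc(f_theta^(l), D_probe) - Acc(f_theta, D_probe)
    gains = {l: evaluate(l) - baseline for l in range(1, num_layers + 1)}
    l_star = max(gains, key=gains.get)              # l* = argmax_l Delta_l
    if gains[l_star] <= 0:
        return None, 0.0                            # no interfering layer found
    return l_star, gains[l_star]

# Toy accuracy table: zeroing layer 3 helps, other interventions hurt
accs = {None: 0.60, 1: 0.55, 2: 0.58, 3: 0.70, 4: 0.59}
layer, gain = talo_select(lambda l: accs[l], num_layers=4)
```

Once `layer` is chosen on the probe set, the same knockout is applied statically to every subsequent instance of the task, which is what keeps per-instance inference cost unchanged.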

Results

Conclusion

Through extensive empirical analysis, we reveal the existence of specific layers within large-scale pretrained Vision-Language Models that actively suppress performance on certain downstream tasks. We term these Task-Interfering Layers, as strategically bypassing them yields significant performance improvements. Our further investigation uncovers a crucial pattern: tasks that demand similar functional abilities exhibit highly consistent responses to layer interventions. This suggests that the interference phenomenon is systematically organized around the model's functional capabilities, allowing the effects of Task-Interfering Layers to generalize across related tasks. Based on these findings, we introduce TaLo, a training-free adaptation method that identifies and bypasses these interfering layers at inference time. The performance of TaLo across a diverse range of models demonstrates that simple, targeted layer intervention can be an efficient strategy for model adaptation, obviating the need for any parameter updates.

BibTeX

@misc{liu2026individuallayershelpempirical,
      title={Do All Individual Layers Help? An Empirical Study of Task-Interfering Layers in Vision-Language Models}, 
      author={Zhiming Liu and Yujie Wei and Lei Feng and Xiu Su and Xiaobo Xia and Weili Guan and Zeke Xie and Shuo Yang},
      year={2026},
      eprint={2602.01167},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2602.01167}, 
}