Unlocking Efficiency: Introducing ADMN, an Adaptive Depth Multimodal Network for Dynamic Resources and Noisy Inputs
Multimodal deep learning systems are essential for robust sensing, leveraging redundant information across various sensory inputs to mitigate issues like sensor failure, noise corruption, and adverse environmental conditions. However, deploying these powerful systems in dynamic real-world environments—such as battlefield settings or edge devices—presents two major challenges: variable compute resource availability and fluctuating input quality-of-information (QoI).
Current state-of-the-art multimodal networks typically employ static resource provisioning, meaning every input modality is fully processed regardless of its quality. This fixed approach wastes valuable compute on low-QoI modalities, a serious drawback for systems operating under energy or latency constraints. Furthermore, statically provisioned models cannot adjust their resource usage to meet dynamic compute constraints, such as those caused by multi-tenancy or thermal throttling.
The Solution: ADMN
mDOT Center researchers have created ADMN, a layer-wise Adaptive Depth Multimodal Network, designed to jointly tackle the challenges of dynamic compute resources and variable QoI inputs. ADMN achieves high robustness and unprecedented efficiency by adapting the total number of active layers across all modalities to meet compute constraints, and then continually reallocating those layers among input modalities according to their quality.
How ADMN Achieves Dynamic Adaptation
ADMN’s effectiveness relies on two core mechanisms:
1. Dynamic Compute Adaptation: ADMN uses adaptive backbones with adjustable layer configurations. Keeping a single set of model weights, ADMN activates only a subset of backbone layers based on the available compute resources, dynamically adjusting its depth and resource usage to track fluctuating compute availability without training and storing multiple large models.
2. QoI-Aware Layer Allocation: Given a total layer budget set by the available resources, ADMN introduces a lightweight multimodal controller that learns the optimal layer allocation from the relative QoI of each input sample. On a per-sample basis, the controller assesses the information content of the sensor data (identifying, for instance, a severely corrupted depth modality) and directs the computational budget toward higher-QoI modalities, such as a clean image input. A simplified sketch of both mechanisms follows this list.
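To make these mechanisms concrete, here is a minimal PyTorch sketch of both ideas. The class names (`AdaptiveDepthBackbone`, `QoIController`), the dimensions, and the rounding-based allocation are illustrative assumptions rather than ADMN's published implementation; in particular, the paper's controller is trained end-to-end, while this sketch substitutes a simple proportional-rounding heuristic.

```python
import torch
import torch.nn as nn


class AdaptiveDepthBackbone(nn.Module):
    """One shared set of weights whose depth is chosen at inference:
    only the first `num_active` layers run, so skipped layers cost
    no compute. (Hypothetical sketch, not ADMN's exact backbone.)"""

    def __init__(self, dim: int = 256, max_layers: int = 12) -> None:
        super().__init__()
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
            for _ in range(max_layers)
        )

    def forward(self, x: torch.Tensor, num_active: int) -> torch.Tensor:
        # Activate only a subset of layers; the rest are simply skipped.
        for layer in self.layers[:num_active]:
            x = layer(x)
        return x


class QoIController(nn.Module):
    """Lightweight controller: scores each modality's summary features
    and splits the total layer budget in proportion to those scores.
    Integer rounding stands in for a differentiable allocation."""

    def __init__(self, dim: int = 256) -> None:
        super().__init__()
        self.score = nn.Sequential(nn.Linear(dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, summaries: torch.Tensor, budget: int) -> list[int]:
        # summaries: (num_modalities, dim) cheap per-modality descriptors.
        weights = torch.softmax(self.score(summaries).squeeze(-1), dim=0)
        alloc = (weights * budget).floor().long()
        alloc[weights.argmax()] += budget - alloc.sum()  # remainder to best modality
        return alloc.tolist()


# Per-sample usage: a clean image should win most of a 6-layer budget
# over a heavily corrupted depth input.
image_x, depth_x = torch.randn(1, 196, 256), torch.randn(1, 196, 256)
controller = QoIController()
image_backbone, depth_backbone = AdaptiveDepthBackbone(), AdaptiveDepthBackbone()

summaries = torch.stack([image_x.mean(dim=(0, 1)), depth_x.mean(dim=(0, 1))])
k_image, k_depth = controller(summaries, budget=6)
image_feat = image_backbone(image_x, k_image)
depth_feat = depth_backbone(depth_x, k_depth)
```

Because every depth configuration shares one set of weights, switching the budget from 12 active layers to 6 requires no retraining or model swap; the controller simply hands each backbone a smaller layer count.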
To achieve this flexible resource management, the system adapts the LayerDrop technique, originally developed for text transformers, to multimodal visual networks. This training procedure makes the backbones resilient to missing layers at test time.
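Below is a minimal sketch of LayerDrop-style training under those assumptions; the function name and drop probability `p_drop` are hypothetical, but the core idea follows the original technique: skip each layer at random during training so that no single layer becomes indispensable.

```python
import torch
import torch.nn as nn


def forward_with_layerdrop(layers: nn.ModuleList, x: torch.Tensor,
                           p_drop: float = 0.2, training: bool = True) -> torch.Tensor:
    """During training, skip each layer independently with probability
    p_drop; at test time (training=False) any chosen subset of layers
    still produces usable features."""
    for layer in layers:
        if training and torch.rand(()).item() < p_drop:
            continue  # this layer is dropped for the current batch
        x = layer(x)
    return x
```

Training the adaptive backbones with this forward pass is what allows the controller to disable layers at inference without collapsing accuracy.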
Proven Efficiency and Performance
Evaluations demonstrate that ADMN delivers significant efficiency gains while maintaining strong performance. It was tested on diverse tasks, including multimodal localization and multimodal action recognition.
In benchmarks, ADMN matches the accuracy of state-of-the-art networks while providing meaningful reductions in computational cost: it can cut floating-point operations (FLOPs) by up to 75% and latency by up to 60%. For instance, in the Binary noise case for localization, ADMN achieved performance competitive with the Upper Bound (the maximum layer allocation) using only 6 layers, highlighting the value of input-aware resource allocation in compute-limited settings.
Crucially, the ADMN controller itself is extremely lightweight, consuming a negligible amount of resources: about 1% of the model's total operations at the smallest layer allocation in localization tasks. ADMN's ability to initialize a full-size model with pretrained weights and then dynamically trim it by disabling layers also proves advantageous, particularly in tasks where training purely from scratch converges poorly.
ADMN provides a sophisticated and practical approach to deploying multimodal deep learning in dynamic scenarios, ensuring computational resources are spent only where they deliver the most value.
