Empowering Real-Time AI at the Remote Edge


We have deployed a range of AI applications on the Galleons, Armada's mobile, modular, high-performance data centers, and the results show how well suited they are to real-time computing at the remote edge. The applications span generative and multimodal models running on several GPU configurations, and they serve a wide range of industrial use cases. They draw on real-time computer vision (CV), natural language processing (NLP), large language models (LLMs), text-to-speech (TTS), video-to-text (VTT), and more.

Deploying low-latency video and language applications at the edge requires three key components: raw compute power, memory capacity, and sensor connectivity. Compute power is the rate at which a device performs the arithmetic behind a model, which we measure in floating-point operations per second (FLOPS) or in multiply-accumulate operations (MACs). Memory capacity determines whether a model can be hosted at all. Sensor connectivity moves high-bandwidth data from diverse sensors, such as cameras, LiDARs, microphones, and sonar arrays, as well as from devices like drones, uncrewed ground vehicles, and autonomous robots, to the compute nodes on a Galleon. Processing these data streams in real time during inference calls for high-performance hardware, often specialized chips such as GPUs and FPGAs.

Our AI, platform, and hardware engineering teams have identified what it takes to process raw data streams, including video, audio, time-series sensor readings, and range scans, on high-performance edge devices in real time. Their work shows why the data processing capacity of Armada's Galleon suits a wide range of customer use cases involving vision, language, and other generative AI models. Here is a closer look at the specific requirements the Galleons meet:

Figure: Galleon rack

Memory Capacity

Deep learning-based vision and language models must not only process data in real time but also fit within the memory of the edge computing platform. For instance, the 70-billion-parameter LLaMA-2 language model, used in conversational AI assistants and Copilot-style applications, needs about 140GB of GPU memory at 16-bit precision (FP16 or BF16) just to hold its weights for inference: 70 billion parameters at 2 bytes each. We've also optimized these models by pruning and quantizing them to run efficiently on A100s, which have 80GB of memory. Smaller models, with fewer parameters, are deployed on A30s, each equipped with 24GB of memory. For example, models like Mistral, Zephyr, and LLaMA-7B can use the entire 24GB of memory on an A30 GPU.
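The weights-only part of these memory figures can be estimated as parameter count times bytes per parameter. The following is a minimal sketch under stated assumptions, not Armada's sizing method: the function names are hypothetical, and the 20% overhead allowance for activations and caches is an illustrative guess rather than a measured value.

```python
# Bytes per parameter at common precisions (standard sizes, not Armada-specific).
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "bf16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(num_params: float, precision: str) -> float:
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9


def fits_on_gpu(num_params: float, precision: str, gpu_memory_gb: float,
                overhead_fraction: float = 0.2) -> bool:
    """True if weights plus an assumed runtime overhead fit in GPU memory.

    overhead_fraction is a hypothetical allowance for activations and caches.
    """
    needed = weight_memory_gb(num_params, precision) * (1.0 + overhead_fraction)
    return needed <= gpu_memory_gb


if __name__ == "__main__":
    for name, params in [("70B model", 70e9), ("7B model", 7e9)]:
        for precision in ("fp16", "int8", "int4"):
            gb = weight_memory_gb(params, precision)
            print(f"{name} at {precision}: {gb:.1f} GB of weights")
```

Under these assumptions, a 70-billion-parameter model at 16-bit precision does not fit on one 80GB GPU, while a 4-bit quantized copy does, which is why pruning and quantization matter on memory-constrained edge hardware.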

Once the model is loaded, high-speed data streaming into the GPU becomes essential to fully leverage its computational power. Vision models like BLIP (Bootstrapping Language-Image Pre-training) and Video-LLaMA, employed for real-time image and video frame processing in video-to-text applications, require a minimum of 16GB of memory for even the smallest variants. The smallest BLIP-2 model, with 2.7B parameters, requires 16GB of GPU memory, while a regular-sized BLIP-2 model, with 6.7B parameters, requires 24GB at FP32 precision. In the case of Video-LLaMA, even the smallest model, with 7B parameters at FP16 precision, requires 24GB of GPU memory. A standard Video-LLaMA model with 13 billion parameters at FP16 precision consumes 40GB of GPU memory. In short, computer vision applications that use the latest vision transformers (ViTs) need at least 24GB of GPU memory. As these figures show, multimodal models tailored for vision processing demand high-performance computing coupled with enough memory capacity to host the models, and the Galleons provide both.
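The memory figures quoted in this article can be turned into a simple lookup that picks the smallest GPU able to host a given model. This is only an illustrative sketch: the table values are the ones stated in the text, while the dictionary names and the `smallest_fitting_gpu` helper are hypothetical and ignore compute, bandwidth, and multi-GPU sharding.

```python
from typing import Optional

# GPU memory capacities (GB) as quoted in this article.
GPU_MEMORY_GB = {"A30": 24, "A100": 80, "H100": 80, "H200": 141}

# Model memory requirements (GB) as quoted in this article.
MODEL_MEMORY_GB = {
    "BLIP-2 2.7B": 16,
    "BLIP-2 6.7B (FP32)": 24,
    "Video-LLaMA 7B (FP16)": 24,
    "Video-LLaMA 13B (FP16)": 40,
}


def smallest_fitting_gpu(required_gb: float) -> Optional[str]:
    """Return the GPU with the least memory that still holds required_gb.

    Ties on memory are broken alphabetically; returns None if nothing fits.
    """
    candidates = [(mem, name) for name, mem in GPU_MEMORY_GB.items()
                  if mem >= required_gb]
    return min(candidates)[1] if candidates else None


if __name__ == "__main__":
    for model, needed in MODEL_MEMORY_GB.items():
        print(f"{model}: needs {needed} GB -> {smallest_fitting_gpu(needed)}")
```

A model that needs exactly 24GB lands on an A30 with no headroom, so in practice a deployment would likely add a safety margin before choosing the card.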

Computational Power

The computational requirements for running video and language models depend on several factors, including the number of parameters, the number of layers, desired accuracy, bit rate or frame rate, latency, resolution, and the optimization methods used in the algorithm. Here, GFLOPs counts operations per inference, while GFLOPS is a rate in operations per second. For instance, CNN-based backbones like ResNet or DenseNet, commonly used for visual classification and recognition tasks, typically require 1 to 8 GFLOPs per image or video frame. Therefore, when performing classification at 30 frames per second (FPS), these models need a sustained 30 to 240 GFLOPS. As another example, a ResNet model with 55 million parameters and 152 layers for classification at 80% accuracy requires 2.5 GFLOPs, while a 26 million-parameter DenseNet with 190 layers for the same task demands 1.2 GFLOPs. A commonly used object detector such as YOLOv3 requires 70 GFLOPS at 35 frames a second for 80% accuracy.
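The frame-rate arithmetic above is a single multiplication: operations per frame times frames per second. A minimal sketch follows; the function names are hypothetical, and it assumes every frame is processed with no batching or frame skipping.

```python
def sustained_gflops(gflops_per_frame: float, frames_per_second: float) -> float:
    """Sustained compute rate (GFLOPS) needed to process every frame."""
    return gflops_per_frame * frames_per_second


def max_frames_per_second(budget_gflops: float, gflops_per_frame: float) -> float:
    """Highest frame rate a given compute budget (GFLOPS) can sustain."""
    return budget_gflops / gflops_per_frame


if __name__ == "__main__":
    # The 1 to 8 GFLOPs-per-frame range quoted for CNN backbones, at 30 FPS.
    for per_frame in (1.0, 8.0):
        rate = sustained_gflops(per_frame, 30)
        print(f"{per_frame} GFLOPs/frame at 30 FPS -> {rate} GFLOPS")
```

The same helper scales to multiple cameras by multiplying the frame rate by the number of streams, again assuming no sharing of work between streams.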

In contrast, transformer-based models demand significantly more computational power. For instance, a Vision Transformer with 42 million parameters and a feature map size of 2048×7×7 requires 2066 GFLOPs per inference; completing that within a latency of 3.5 milliseconds takes a rate of roughly 590 TFLOPS (2066 GFLOPs divided by 3.5 ms). An NVIDIA A100 GPU provides 312 TFLOPS of dense FP16 or BF16 throughput and up to 624 TFLOPS with structured sparsity. In short, running some of our transformer-based models for real-time vision applications, particularly those for VTT, requires the full capacity of at least one A100 or similar GPU card. The H100s and H200s, the latest generation of GPUs from NVIDIA, support the Transformer Engine for accelerating transformer models, offer enough memory to load larger models (80GB on the H100 and 141GB on the H200), and provide memory bandwidth of 2 to 4.8 TB/s for large data streams. These attributes make them well suited to running larger, more versatile, and more accurate models for various real-time vision applications.
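The latency-driven estimate can be reproduced by dividing the operation count by the latency target and comparing the result against a GPU's peak throughput. The sketch below uses the A100 FP16 peak figures (312 TFLOPS dense, 624 TFLOPS with structured sparsity); the function names are hypothetical, and peak numbers are theoretical ceilings that real workloads rarely reach.

```python
def required_tflops(gflops_per_inference: float, latency_ms: float) -> float:
    """Compute rate (TFLOPS) needed to finish one inference within latency_ms."""
    flops = gflops_per_inference * 1e9
    seconds = latency_ms / 1e3
    return flops / seconds / 1e12


# A100 peak FP16/BF16 tensor throughput in TFLOPS.
A100_PEAK_TFLOPS = {"dense": 312.0, "structured_sparsity": 624.0}


def fraction_of_peak(required: float, peak: float) -> float:
    """Share of a GPU's peak throughput that the workload would consume."""
    return required / peak


if __name__ == "__main__":
    need = required_tflops(2066, 3.5)
    print(f"Required: {need:.1f} TFLOPS")
    for mode, peak in A100_PEAK_TFLOPS.items():
        print(f"A100 {mode}: {fraction_of_peak(need, peak):.2f}x of peak")
```

A fraction above 1.0 means a single card cannot meet the latency target at that precision even in theory, so the workload would need sparsity, a relaxed latency budget, or more than one GPU.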

Sensor Connectivity

A single IP camera streams data at rates that typically range from 0.01 Mbit/sec to 1.2 Mbit/sec, or approximately 5 MB to 500 MB of data per hour, depending on resolution, compression, and frame rate. In contrast, an autonomous car or drone equipped with multiple LiDARs, radars, cameras, ultrasonic sensors, and an IMU/INS can generate up to 50 Gbit/sec. Much of this data must be processed locally and immediately.
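The conversion from a stream's bit rate to hourly data volume is straightforward: multiply by 3600 seconds and divide by 8 bits per byte. A minimal sketch, with a hypothetical function name and decimal units (1 MB = 1e6 bytes):

```python
def megabytes_per_hour(mbit_per_sec: float) -> float:
    """Data volume in MB produced in one hour at a constant bit rate."""
    bits_per_hour = mbit_per_sec * 1e6 * 3600
    return bits_per_hour / 8 / 1e6


if __name__ == "__main__":
    # IP camera range quoted in the text, then a 50 Gbit/sec sensor suite.
    for label, rate in [("camera, low", 0.01), ("camera, high", 1.2),
                        ("vehicle sensors", 50_000.0)]:
        print(f"{label}: {megabytes_per_hour(rate):,.1f} MB/hour")
```

The camera range works out to about 4.5 MB and 540 MB per hour, consistent with the rounded figures above, while a 50 Gbit/sec sensor suite produces on the order of 22.5 TB per hour, which is why that data is processed locally rather than shipped elsewhere.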

Therefore, regardless of the specific vision, audio, or language processing application, it is essential to establish a seamless stream of data from these sensors to local high-performance computing resources, which may include one or more GPUs, multi-core CPUs, FPGAs, and more. When a single application uses multiple GPUs and multi-core CPUs to process data from one or more sensors, high-speed interconnects between those processors become important for efficient communication. The compute units within the Galleon platform are designed to support these demanding processing requirements.

Power and Cooling

Electrical power and cooling requirements are equally important, and they also contribute to the Galleon's suitability for video processing at the edge. For instance, running a convolutional neural network (CNN) like ResNet-152 for real-time object classification at FP32 precision can draw as much as 250W on a single GPU. To put this in perspective, that is equivalent to the power consumption of five laptops or an 82-inch LED TV running continuously. Each Galleon rack is designed to consume a maximum of 18kW, so running several AI workloads on multiple GPU nodes in a single rack around the clock requires adequate cooling, which the Galleon provides.
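To relate the per-GPU draw to the rack limit, a rough budget can be sketched as below. This is an illustration under loud assumptions, not a Galleon specification: the function names are hypothetical, the `other_load_w` allowance for CPUs, networking, and storage is a placeholder, and the heat figure simply assumes all electrical power ends up as heat (1 W is about 3.412 BTU per hour).

```python
WATTS_TO_BTU_PER_HOUR = 3.412  # standard conversion: 1 W is about 3.412 BTU/h


def max_gpus_in_rack(rack_limit_w: float = 18_000, gpu_watts: float = 250,
                     other_load_w: float = 0) -> int:
    """Number of GPUs whose combined draw fits under the rack power limit.

    other_load_w is a hypothetical allowance for non-GPU components.
    """
    available = rack_limit_w - other_load_w
    return max(0, int(available // gpu_watts))


def heat_load_btu_per_hour(power_w: float) -> float:
    """Heat to remove, assuming all electrical power is dissipated as heat."""
    return power_w * WATTS_TO_BTU_PER_HOUR


if __name__ == "__main__":
    print("GPUs at 250 W under 18 kW:", max_gpus_in_rack())
    print("With 6 kW reserved for other load:", max_gpus_in_rack(other_load_w=6_000))
    print(f"Heat at full load: {heat_load_btu_per_hour(18_000):,.0f} BTU/h")
```

Even with these simplifications, the numbers show why a fully loaded rack running around the clock is as much a cooling problem as a compute problem.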

Armada's deployment of AI applications on the Galleon's Commander platform represents a significant advance in real-time edge computing. By addressing compute power, memory capacity, sensor connectivity, and power and cooling, Armada's Galleons offer a comprehensive solution for processing diverse data streams at the edge. As industries continue to adopt AI-driven applications for a wide range of use cases, the Galleon's ability to support complex AI models in remote and resource-constrained environments will drive the next wave of innovation and efficiency in AI solutions across sectors. Armada's technology is paving the way for a future where intelligent edge devices let organizations, whether urban or remote, harness the full potential of AI in real-world scenarios.

Welcome to the remote edge!

 

Summary

Armada's Galleons support diverse AI applications, many of which involve generative AI, multimodal sensor integration, robotics, and computer vision, for industrial operations. By prioritizing computational power, memory capacity, and sensor connectivity, Armada facilitates real-time data processing at the edge, paving the way for widespread AI adoption in various industries, regardless of location or sector.