AIDC Controller
Simplifying Networking Management for AI Infrastructure
- 1. Background
- 2. AIDC Controller System Architecture
- 2.1 Overall Architecture
- 2.2 Controller Components
- 2.3 Communication Mechanisms
- 3. Core Functions and Features
- 3.1 Rapid Service Deployment and Day-0 Provisioning
- 3.1.1 Automatic Device Onboarding and Centralized Management
- 3.1.2 Typical AI Service Scenario Deployment
- 3.2 AI-Oriented Optimization
- 3.3 Centralized Control and Automated Operations
- 3.3.1 Centralized Device Management
- 3.3.2 Network Observability Capabilities
- 3.3.3 Operational Assurance: Real-Time Alerts and Intelligent Inspection
- 3.3.4 Multi-Tenant Organization and Access Control
- 3.4 System Migration and Configuration Reuse
- 4. Appendix: Feature List
1. Background
The rapid advancement of artificial intelligence is reshaping how data centers are designed and deployed. As large language models scale to trillions of parameters, compute clusters are evolving from thousands of GPUs to tens of thousands, and even hundreds of thousands of GPUs. In this process, the network is becoming a key factor that limits AI training and inference efficiency.
Compared with traditional internet services, AI workloads introduce very different traffic patterns. These include bursts of many-to-one (incast) traffic generated by collective communication in large-scale distributed training, long-lived elephant flows, and the low-latency, high-throughput requirements of RoCE/RDMA transport. These characteristics place higher demands on data center networks in bandwidth utilization, latency control, and lossless transport capabilities.
Under this trend, the traditional device-centric and distributed management model used in data centers is showing clear limitations. Manual and semi-automated provisioning methods are difficult to scale for large AI clusters and cannot guarantee configuration consistency. At the same time, the lack of real-time network visibility and closed-loop control makes it difficult to respond quickly to congestion and performance fluctuations, which directly affects the stability of AI workloads.
As a result, building a centralized, automated, and intelligent network control and management system has become a necessary direction for Artificial Intelligence Data Centers (AIDC).
To address these requirements, the industry is increasingly adopting centralized control architectures for unified network management. uCentral is a communication protocol defined by the Telecom Infra Project (TIP) for controller-to-device communication. It is also a core component of the OpenWiFi centralized network management architecture.
Within the OpenWiFi framework, the Cloud Controller acts as the central control component and uses the uCentral protocol for southbound device management. Through standardized protocols, the architecture enables unified device onboarding, automated configuration deployment, and telemetry reporting. This provides the foundation for building a centralized control plane in large-scale networks.
Based on this architecture, the Asterfusion AIDC Controller extends and optimizes the OpenWiFi Cloud Controller framework for AI data center environments. The controller enhances network deployment, operations, and lifecycle management for AIDC scenarios. It provides an intuitive and efficient web-based management interface focused on centralized switch management and network optimization.
The platform delivers unified configuration management, status monitoring, and topology visualization. It also improves deployment efficiency and operational stability through automated and policy-driven control mechanisms. These capabilities help support large-scale and high-concurrency AIDC network environments.
In summary, for AIDC deployments, building a centralized network controller with automation and closed-loop management capabilities has become an inevitable architectural direction.
2. AIDC Controller System Architecture
The AIDC Controller is built on the OpenWiFi Cloud SDK and adopts a microservices architecture. System functions are divided into multiple independent services, each deployed as a container. Services communicate through standard interfaces, which decouples their functions and allows each service to scale independently.
2.1 Overall Architecture
The following figure shows the overall architecture of the AIDC Controller. The system consists of three layers: the northbound interface layer, the controller core layer, and the southbound device access layer.
The northbound interface provides a unified management entry through the Web UI and RESTful APIs, supporting both visualized operations and external system integration. The controller core layer is built on a microservices architecture and decouples functional modules such as device configuration and firmware management.
Inter-service coordination is implemented through a unified east-west communication bus. Synchronous communication uses OpenAPI-based interfaces, while asynchronous messaging is handled through the Kafka message bus using a publish-and-subscribe model. This architecture improves system scalability and processing capability.
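The publish-and-subscribe pattern described above can be illustrated with a minimal in-process sketch. It is not Kafka and does not use any Kafka client API; the `MessageBus` class, the `device.state` topic name, and the two consumers are hypothetical names chosen only to show how one published event reaches several decoupled services.

```python
from collections import defaultdict
from typing import Callable, Dict, List


class MessageBus:
    """Tiny in-process stand-in for a topic-based message bus (not Kafka)."""

    def __init__(self) -> None:
        self._subscribers: Dict[str, List[Callable[[dict], None]]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Callable[[dict], None]) -> None:
        """Register a handler that is called for every event on the topic."""
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, event: dict) -> int:
        """Deliver the event to every subscriber of the topic; return the count."""
        handlers = self._subscribers.get(topic, [])
        for handler in handlers:
            handler(event)
        return len(handlers)


bus = MessageBus()
alarm_log: List[dict] = []       # hypothetical alarm service inbox
analytics_log: List[dict] = []   # hypothetical analytics service inbox

# Both consumers react to the same event without knowing about each other.
bus.subscribe("device.state", alarm_log.append)
bus.subscribe("device.state", analytics_log.append)

delivered = bus.publish("device.state", {"serial": "sw-leaf-01", "connected": True})
```

The publisher only knows the topic name, so a new consumer can be added without changing the service that emits the event, which is the decoupling benefit the architecture relies on.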
At the data layer, the controller uses PostgreSQL for centralized storage and management of configuration and status data. On the southbound side, the controller maintains persistent connections with switches using the uCentral protocol carried over WebSocket. This enables automatic device onboarding, telemetry reporting, and configuration deployment.
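To make the southbound exchange more concrete, the sketch below builds and parses messages in a JSON-RPC 2.0 style, which is how uCentral messages are generally framed. The exact field set is an assumption made for illustration and is not taken from this paper; the helper names `build_connect`, `build_configure`, and `parse_method` are hypothetical, and the WebSocket transport itself is left out.

```python
import json
from typing import Any, Dict, Optional


def build_connect(serial: str, firmware: str, config_uuid: int) -> str:
    """Device -> controller: announce the device once the WebSocket is open."""
    message = {
        "jsonrpc": "2.0",
        "method": "connect",
        # Field names below are assumptions for illustration.
        "params": {"serial": serial, "firmware": firmware, "uuid": config_uuid},
    }
    return json.dumps(message)


def build_configure(serial: str, config_uuid: int,
                    config: Dict[str, Any], request_id: int) -> str:
    """Controller -> device: push a configuration; the reply echoes the id."""
    message = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "configure",
        "params": {"serial": serial, "uuid": config_uuid, "config": config},
    }
    return json.dumps(message)


def parse_method(raw: str) -> Optional[str]:
    """Return the JSON-RPC method name, or None for a malformed frame."""
    try:
        message = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(message, dict) or message.get("jsonrpc") != "2.0":
        return None
    return message.get("method")
```

Because the device opens the connection and the controller reuses it for commands, no inbound port needs to be reachable on the switch, which is one reason a persistent WebSocket suits large fleets.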
The overall architecture provides a complete closed-loop workflow from user operations to control logic and device execution. It offers strong scalability and high-concurrency processing capability, making it suitable for centralized management and automated operations in large-scale data center networks.
2.2 Controller Components
The controller consists of several core service modules, including device management, configuration management, topology management, and status monitoring. The Web UI also runs as an independent microservice and provides a unified visualized management interface.
Table 2-1 describes the functions and roles of each service. Figure 2-2 shows the operational status of the microservices running in the controller.
| Service Name | Function Description |
|---|---|
| owanalytics | Provides collection and analysis of device operational data to support performance monitoring and troubleshooting |
| owfms | Provides centralized firmware and patch management and upgrade capabilities |
| owgw | Provides southbound access capabilities and enables communication between the controller and network devices through the uCentral protocol |
| owmgmt | Provides configuration data import, export, and migration functions |
| owom | Provides alarm and event management capabilities |
| owprov | Provides centralized device configuration and management, including device grouping and batch configuration |
| owsec | Provides user authentication and access control to ensure system security |
| owupgrade | Provides software upgrade and maintenance capabilities for the controller itself |
2.3 Communication Mechanisms
The controller uses several communication methods for both internal service interaction and external connectivity.
- Asynchronous Communication: Kafka-Based Message Bus
The controller uses a Kafka message bus for asynchronous communication between services. This mechanism is mainly used for event distribution and state synchronization. It reduces inter-service coupling and improves system scalability and processing efficiency.
- Synchronous Communication: REST API-Based Service Calls
When real-time data exchange or operation triggering is required between services, the controller uses REST API-based synchronous communication to implement request-and-response interactions.
- Northbound and Southbound Communication
The controller also provides communication interfaces for interaction with external systems.
- Northbound Interface:
Provides standardized REST APIs for upper-layer management systems and the Web UI. This enables centralized network management and operations.
- Southbound Interface:
The owgw gateway service establishes WebSocket connections with network devices. Device configuration deployment and operational status reporting are implemented through the uCentral protocol.
- Container Network Communication
At the deployment layer, microservice containers typically run within the same Docker custom bridge network, such as the OpenWiFi bridge. This allows containers to communicate directly through IP addresses or container aliases.
3. Core Functions and Features
3.1 Rapid Service Deployment and Day-0 Provisioning
In AI data center environments, networks are built at large scale with a high number of devices. Traditional deployment methods that rely on manual per-device configuration can no longer meet the requirements for rapid delivery and configuration consistency. When cluster size reaches hundreds or even thousands of switches, manual provisioning becomes inefficient and introduces configuration inconsistencies and human errors, which can affect deployment timelines and network stability.
The Asterfusion AIDC Controller simplifies the entire deployment workflow through automated provisioning and template-based orchestration. This includes device onboarding, topology planning, and configuration deployment, significantly improving Day-0 deployment efficiency.
The main capabilities include:
- Automatic device onboarding and centralized management through ZTP
- Automatic topology generation based on deployment templates
- Automatic validation between planned topology and physical topology
- Staged configuration deployment and rapid service provisioning
With these capabilities, the controller can support rapid deployment and centralized management of large-scale Leaf-Spine networks. It enables fast provisioning for networks with hundreds of nodes, reduces manual operational overhead, and delivers standardized and repeatable network deployment workflows.
3.1.1 Automatic Device Onboarding and Centralized Management
After the device is physically installed and configured with the controller address, it automatically establishes a WebSocket connection with the controller. Once the connection is established, the device is automatically onboarded into the controller management domain and added to the default resource pool (organization).
Based on deployment requirements, administrators can batch-assign devices to specific sites or service domains for subsequent topology planning and service deployment.
3.1.2 Typical AI Service Scenario Deployment
To support common AI data center network architectures, the controller provides multiple built-in deployment templates, including AI training, AI inference, and traditional data center network scenarios. Each template is pre-integrated with the corresponding network architecture and key features, allowing users to quickly complete network planning and deployment based on predefined scenarios.
The main deployment scenarios include:
- AIDC Backend Network (GPU Training Fabric)
Uses a Layer 3 Spine-Leaf architecture and supports large-scale node connectivity. The fabric integrates RoCE, intelligent path selection, and ARS (Adaptive Routing Switching) to meet the requirements for high bandwidth, low latency, and lossless transport.
- AIDC Frontend Network (Service Access Network)
Uses EVPN MC-LAG technology to provide link redundancy and service isolation, ensuring high reliability for service access.
- AIDC Storage Network
Combines distributed gateway, MC-LAG, and RoCE technologies to improve storage access performance and reliability.
- DC Converged Network
Supports typical deployment models based on EVPN MC-LAG or EVPN Multihoming, making it suitable for traditional data center service environments.
Administrators can select device models and quantities based on the actual network scale. The controller can then automatically generate the corresponding planned topology with a single click. Users can also customize the generated topology based on deployment requirements. Figure 3-3 shows an example of one-click topology generation from a template for an AIDC backend network scenario.
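As a rough illustration of what template-driven topology generation involves, the sketch below expands a spine count and a leaf count into a full-mesh Leaf-Spine link plan. The function name `plan_leaf_spine` and the `spine-N` / `leaf-N` naming scheme are assumptions for this example; the controller's actual templates also cover device models, port assignments, and addressing, which are omitted here.

```python
from typing import List, Tuple

Link = Tuple[str, str]


def plan_leaf_spine(spines: int, leaves: int) -> List[Link]:
    """Return one planned link from every leaf to every spine."""
    if spines < 1 or leaves < 1:
        raise ValueError("a fabric needs at least one spine and one leaf")
    return [
        (f"spine-{spine}", f"leaf-{leaf}")
        for spine in range(1, spines + 1)
        for leaf in range(1, leaves + 1)
    ]


planned_links = plan_leaf_spine(spines=2, leaves=4)
```

With 2 spines and 4 leaves the plan contains 8 links, one per spine and leaf pair, which is the property that gives every leaf an equal-cost path through each spine.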
After devices come online, the controller automatically discovers the physical link relationships between devices and generates the actual network topology. It then performs consistency validation against the planned topology, as shown in Figure 3-4. Once the validation is passed, the system proceeds to the configuration deployment phase.
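The consistency check between planned and discovered topology can be thought of as a set comparison over links. The sketch below is a simplified assumption of how such a check might work: it compares device pairs only, ignores port names, and treats links as undirected. The function `validate_topology` and the sample device names are hypothetical.

```python
from typing import Dict, FrozenSet, Iterable, List, Set, Tuple

Link = Tuple[str, str]


def _normalize(links: Iterable[Link]) -> Set[FrozenSet[str]]:
    """Treat links as undirected so (a, b) and (b, a) compare equal."""
    return {frozenset(link) for link in links}


def _as_sorted(links: Set[FrozenSet[str]]) -> List[Tuple[str, ...]]:
    """Convert back to sorted tuples for stable, readable reports."""
    return sorted(tuple(sorted(link)) for link in links)


def validate_topology(planned: Iterable[Link],
                      discovered: Iterable[Link]) -> Dict[str, List[Tuple[str, ...]]]:
    """Report planned links that were not discovered, and the reverse."""
    planned_set = _normalize(planned)
    discovered_set = _normalize(discovered)
    return {
        "missing": _as_sorted(planned_set - discovered_set),
        "unexpected": _as_sorted(discovered_set - planned_set),
    }


planned = [("spine-1", "leaf-1"), ("spine-1", "leaf-2")]
discovered = [("leaf-1", "spine-1"), ("spine-1", "leaf-3")]  # e.g. a miscabled leaf
report = validate_topology(planned, discovered)
passed = not report["missing"] and not report["unexpected"]
```

In this example the check fails: the planned link to `leaf-2` is missing and a link to `leaf-3` is unexpected, which is the kind of cabling error that should block configuration deployment.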
As shown in Figure 3-5 and Figure 3-6, configuration deployment is carried out in two phases:
- Basic network configuration: Establishes basic connectivity between devices, such as Spine-Leaf underlay connectivity.
- Service configuration: Enables required services based on business requirements. The configuration is then pushed to devices to complete service activation.
Through staged configuration and template-based provisioning, users can quickly complete the full process from network construction to service go-live.
Through automation and template-based capabilities, the AIDC Controller transforms the traditional “per-device configuration” deployment model into an “intent-based one-click orchestration” approach, significantly improving the delivery efficiency of AI data center networks.
3.2 AI-Oriented Optimization
In AI data center networks, the configuration of RoCE-related parameters such as PFC and ECN has a significant impact on network performance. Different service scenarios have varying requirements for latency, throughput, and congestion control. Relying on manual per-parameter configuration increases complexity and makes it difficult to ensure optimal parameter combinations.
The Asterfusion AIDC Controller adopts a “template-based configuration + parameter fine-tuning” approach to enable efficient deployment and fine-grained optimization of RoCE networks:
- Scenario-based RoCE configuration templates
As shown in Figure 3-7, the controller provides built-in RoCE parameter templates for typical AI workloads. These templates predefine key parameter sets, including PFC priorities and ECN marking policies. Users can select an appropriate template based on the service type to quickly complete baseline network configuration and reduce configuration complexity.
- Parameter fine-tuning capability
As shown in Figure 3-8, the controller allows flexible adjustment of key parameters such as PFC and ECN on top of the templates. This includes ECN marking thresholds and queue watermarks, enabling optimization based on specific workload characteristics and further improving network performance.
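The "template plus fine-tuning" approach can be sketched as a base parameter set with selective overrides. Everything in the block below is illustrative: the template name, the parameter keys, and the numeric values are placeholders invented for this example, not recommended settings and not the controller's actual template contents.

```python
from copy import deepcopy
from typing import Any, Dict

# Hypothetical template; keys and values are placeholders, not tuning advice.
ROCE_TEMPLATES: Dict[str, Dict[str, Any]] = {
    "training": {
        "pfc_priorities": [3],
        "ecn": {
            "min_threshold_kb": 100,
            "max_threshold_kb": 400,
            "mark_probability_pct": 10,
        },
    },
}


def apply_overrides(base: Dict[str, Any], overrides: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of base with overrides merged in, recursing into nested dicts."""
    merged = deepcopy(base)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = apply_overrides(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


# Start from the template and adjust a single ECN threshold.
tuned = apply_overrides(ROCE_TEMPLATES["training"], {"ecn": {"min_threshold_kb": 200}})
```

Merging into a copy keeps the template itself unchanged, so the same baseline can be reused across sites while each deployment records only its own deviations.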
3.3 Centralized Control and Automated Operations
The Asterfusion AIDC Controller is designed for large-scale data center networks. It provides unified device management and operational capabilities. Through centralized control and an observability framework, it improves operational efficiency and reduces the complexity of manual operations.
3.3.1 Centralized Device Management
As shown in Figure 3-9, the controller provides unified onboarding and management for all switches in the network, along with batch-oriented and automated device management capabilities, including:
- Batch configuration deployment and modification
- Unified configuration file management (view, import, export)
- Remote command execution and automated script delivery
- Device maintenance operations (reboot, upgrade, etc.)
Through centralized management, the system eliminates inefficiencies and configuration inconsistencies caused by per-device operations, enabling unified, network-wide operational management.
3.3.2 Network Observability Capabilities
As shown in Figures 3-10 to 3-15, the controller collects and analyzes device operational status and network traffic in a unified manner, and presents the overall network state through visualization.
- Device status visualization
Includes metrics such as CPU utilization, memory usage, interface status, and hardware information. A comprehensive health score is used to provide a global view of device conditions.
- Service status monitoring
Provides detailed status information for critical services, enabling rapid fault detection and troubleshooting.
- Traffic and queue monitoring
Supports visibility into interface traffic, queue buffer utilization, and RoCE-related statistical metrics. This helps administrators understand network load and traffic trends over time.
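The paper does not define how the health score is computed. Purely as an illustration of the idea, the sketch below combines CPU headroom, memory headroom, and the share of interfaces that are up into a single 0 to 100 value. The weights (30/30/40), the inputs, and the function name `health_score` are all assumptions.

```python
def health_score(cpu_pct: float, mem_pct: float,
                 ports_up: int, ports_total: int) -> float:
    """Return a 0-100 score; higher means healthier. Weights are illustrative."""
    for name, value in (("cpu_pct", cpu_pct), ("mem_pct", mem_pct)):
        if not 0 <= value <= 100:
            raise ValueError(f"{name} must be between 0 and 100")
    if ports_total <= 0 or not 0 <= ports_up <= ports_total:
        raise ValueError("ports_up must be between 0 and ports_total")

    cpu_health = 100 - cpu_pct                  # unused CPU capacity
    mem_health = 100 - mem_pct                  # unused memory capacity
    port_health = 100 * ports_up / ports_total  # share of interfaces up

    # Assumed weighting: 30% CPU, 30% memory, 40% interface state.
    score = (30 * cpu_health + 30 * mem_health + 40 * port_health) / 100
    return round(score, 1)


score = health_score(cpu_pct=20, mem_pct=50, ports_up=48, ports_total=48)
```

A single number like this is useful for sorting and color-coding devices on a dashboard, but the underlying metrics still need to be inspected to find the cause of a low score.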
3.3.3 Operational Assurance: Real-Time Alerts and Intelligent Inspection
As shown in Figures 3-16 to 3-19, the controller provides alerting and inspection mechanisms based on monitoring data, improving network stability and maintainability.
- Alerting mechanism
Supports customizable alert thresholds and notification policies. It enables real-time monitoring and alerting for device status, resource utilization, and hardware health conditions.
- Inspection capability
Supports on-demand one-click inspection and scheduled periodic inspections. The system automatically checks key device metrics, including CPU, memory, process status, and logs, and generates inspection reports.
Through integrated alerting and inspection mechanisms, the controller helps administrators identify potential issues in a timely manner and take corrective actions.
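Threshold-based alerting of the kind described in this section can be sketched as a list of rules evaluated against each metrics sample. The `AlertRule` structure, the metric names, and the threshold values are hypothetical; notification delivery and alarm suppression are out of scope for this example.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class AlertRule:
    metric: str       # key expected in the metrics sample
    threshold: float  # alert when the sampled value is above this
    severity: str     # free-form label such as "warning" or "critical"


def evaluate(rules: List[AlertRule], sample: Dict[str, float]) -> List[str]:
    """Return one message per rule whose metric exceeds its threshold."""
    alerts = []
    for rule in rules:
        value = sample.get(rule.metric)
        # Metrics missing from the sample are skipped rather than treated as zero.
        if value is not None and value > rule.threshold:
            alerts.append(
                f"{rule.severity}: {rule.metric}={value} exceeds {rule.threshold}"
            )
    return alerts


rules = [
    AlertRule("cpu_pct", 90, "critical"),
    AlertRule("temperature_c", 75, "warning"),
]
alerts = evaluate(rules, {"cpu_pct": 95, "temperature_c": 60})
```

The same rule list can serve both real-time alerting and scheduled inspection: the first evaluates each incoming sample, the second evaluates a snapshot on demand and writes the results to a report.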
3.3.4 Multi-Tenant Organization and Access Control
As shown in Figures 3-20 and 3-21, the controller supports a multi-tenant organizational structure. Devices and resources can be hierarchically managed based on regions, departments, or business units, enabling fine-grained segmentation of devices and access permissions.
Through a hierarchical role-based access control (RBAC) mechanism, different administrators are restricted to operations within their respective permission domains, meeting the operational requirements of large-scale organizational environments.
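A hierarchical permission check combines two questions: does the role permit the action, and does the target device sit inside the administrator's part of the organization tree? The sketch below shows one way to express that. The organization names, the role names, and the action sets are invented for illustration and do not reflect the controller's built-in roles.

```python
from typing import Dict, Optional, Set

# Hypothetical organization tree: child -> parent (None marks the root).
ORG_PARENT: Dict[str, Optional[str]] = {
    "global": None,
    "emea": "global",
    "emea-dc1": "emea",
    "apac": "global",
}

# Hypothetical roles and the actions each one permits.
ROLE_ACTIONS: Dict[str, Set[str]] = {
    "admin": {"view", "configure", "reboot"},
    "operator": {"view", "reboot"},
    "viewer": {"view"},
}


def is_within(scope: str, target: str) -> bool:
    """True if target equals scope or sits below it in the organization tree."""
    node: Optional[str] = target
    while node is not None:
        if node == scope:
            return True
        node = ORG_PARENT.get(node)
    return False


def is_allowed(role: str, scope: str, action: str, device_org: str) -> bool:
    """Allow only actions the role grants, on devices inside the user's scope."""
    return action in ROLE_ACTIONS.get(role, set()) and is_within(scope, device_org)
```

For example, an operator scoped to `emea` may reboot a switch in `emea-dc1`, but the same request against a switch in `apac` is rejected because `apac` is outside that operator's subtree.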
3.4 System Migration and Configuration Reuse
As shown in Figure 3-22, the Asterfusion AIDC Controller supports full export and import of system configurations, enabling fast migration between controllers and reuse of existing configurations.
When deploying a new controller or performing system migration, administrators can export the current controller configuration with a single operation and import it into the target controller. This enables rapid restoration of the network management environment and reduces repetitive configuration work.
This capability is applicable to scenarios such as system scaling, migration deployment, and disaster recovery. It improves deployment efficiency and increases operational flexibility.
4. Appendix: Feature List
The AIDC Controller supports the following feature set:
| Feature | Level 1 | Level 2 |
|---|---|---|
| Navigation | Map | Tree structure view showing all organizations and device information under the current account |
| Navigation | Map | Map-based hierarchical structure |
| Navigation | Map | Organization, site, and canvas operations |
| Navigation | Physical Topology | Topology generation and viewing; device basic information; interface rate and LACP negotiation status display |
| Dashboard | Aggregated Dashboard | Top 10 egress/ingress traffic flows, switch count, alarms |
| Dashboard | Site Dashboard | Switch count, historical egress throughput statistics, top interconnect interface bandwidth utilization |
| Dashboard | Custom Dashboard | Built-in multi-dimensional visualization components, component operations, dashboard reset |
| Configuration Management | Inventory Management | Inventory device information, import/export, device configuration, device classification |
| Configuration Management | Planned Topology | Network planning, validation between planned and physical topology, device health check for planned topology |
| Configuration Management | Alarm Configuration | Alarm severity, description, thresholds, and alarm suppression switch |
| Configuration Management | Alarm Configuration | Synchronization with subordinate organizations/sites |
| Configuration Management | Alarm Notification | Recipient settings, email enablement, alarm type/severity triggering, synchronization with subordinate organizations/sites |
| Configuration Management – Planned Topology & Service Provisioning | AIDC Backend Network Deployment | Basic network configuration: interface configuration, BGP configuration, import/export configuration |
| Configuration Management – Planned Topology & Service Provisioning | AIDC Backend Network Deployment | Service configuration: intelligent routing, RoCE, ARS, CSV batch import, filtering and batch push |
| Configuration Management – Planned Topology & Service Provisioning | AIDC Frontend Network Deployment | Basic network configuration: interface configuration, BGP configuration, MC-LAG configuration, import/export configuration |
| Configuration Management – Planned Topology & Service Provisioning | AIDC Frontend Network Deployment | Service configuration: service provisioning, RoCE, CSV batch import, filtering and batch push |
| Configuration Management – Planned Topology & Service Provisioning | AIDC Storage Network Deployment | Basic network configuration: interface configuration, BGP configuration, MC-LAG configuration, import/export configuration |
| Configuration Management – Planned Topology & Service Provisioning | AIDC Storage Network Deployment | Service configuration: service provisioning, RoCE, CSV batch import, filtering and batch push |
| Configuration Management – Planned Topology & Service Provisioning | Data Center Convergence Network Deployment | Basic network configuration: interface/sub-interface/VLAN/LAG, VRF/Route-Map/IP Prefix List/BGP/OSPF/static routing/SLA/Track, MC-LAG, import/export configuration |
| Configuration Management – Planned Topology & Service Provisioning | Data Center Convergence Network Deployment | Service configuration: service provisioning, RoCE, ACL, CSV batch import, filtering and batch push |
| Device Management | Device Basic Information | Name, type, IP address, MAC address, OS version, serial number, creation time, last update time, organization/site, Loopback IP |
| Device Management | Device Status Information | Controller connection status, uptime, last contact time, local time, load, real-time CPU/memory usage, health status |
| Device Management | Device Peripheral Information | Temperature, fan, and power supply status |
| Device Management | Device Alarm Information | - |
| Device Management | Interface Usage Information | Utilization rate, number of UP/DOWN interfaces |
| Device Management | Device Log Information | - |
| Device Management | Device Health Statistics | - |
| Device Management | Device Coredump Logs | - |
| Device Management | Switch LLDP Information | - |
| Device Management | Switch Detail Information | MAC, neighbors, PBR, RoCE, intelligent routing, ARS, MC-LAG, BGP, EVPN tunnel information |
| Device Management | Device Statistics | Interface rx/tx Bps and packet statistics; overall device rx/tx Bps and packet statistics; CPU usage history; memory usage history; interface information; optics module information; buffer statistics; RoCE metrics |
| Device Management | Device Operation Records | - |
| Device Management | Device Notes | Notes content and creator |
| Device Management | Remote Device Operations | Reboot, factory reset, packet capture, firmware upgrade, commands/scripts, configuration files, patch application, device inspection, controller IP modification |
| Inventory Management | Inventory List | MAC address, name, configuration tags, organization/site, device type, Loopback0 IP, interface type, license MD5, description, status |
| Inventory Management | Inventory Operations | Filtering, create/edit/delete inventory, change organization/site assignment, CSV export |
| Operations Management | Alarm Functions | Active and historical alarms, alarm settings, alarm notifications |
| Operations Management | Inspection Functions | One-click inspection, scheduled inspection, inspection record export/view/delete, threshold configuration |
| Operations Management | Firmware | Firmware management, patch management |
| Operations Management | Scripts | - |
| User Management | User Settings | - |
| User Management | Built-in Users | - |
| User Management | User Operations | Create, modify, delete users |
| User Management | Roles | - |
| User Management | User Actions | - |
| System | Controller Configuration | - |
| System | Controller Upgrade | Version deployment, version management, patch management, upgrade logs |
| System | Configuration Migration | - |
| System | System Logs | - |
| System | Service Management | - |
| System | Release Notes | - |