
AIDC Controller
Simplifying Networking Management for AI Infrastructure

1. Background

The rapid advancement of artificial intelligence is reshaping how data centers are designed and deployed. As large language models scale to trillions of parameters, compute clusters are growing from thousands of GPUs to tens or even hundreds of thousands. At this scale, the network becomes a key factor limiting AI training and inference efficiency.

Compared with traditional internet services, AI workloads introduce very different traffic patterns. These include highly concurrent many-to-one communication (incast) caused by large-scale distributed training, long-duration elephant flows, and low-latency, high-throughput networking requirements based on RoCE/RDMA. These characteristics place higher demands on data center networks in terms of bandwidth utilization, latency control, and lossless transport.

Under this trend, the traditional device-centric and distributed management model used in data centers is showing clear limitations. Manual and semi-automated provisioning methods are difficult to scale for large AI clusters and cannot guarantee configuration consistency. At the same time, the lack of real-time network visibility and closed-loop control makes it difficult to respond quickly to congestion and performance fluctuations, which directly affects the stability of AI workloads.

As a result, building a centralized, automated, and intelligent network control and management system has become a necessary direction for Artificial Intelligence Data Centers (AIDC).

To address these requirements, the industry is increasingly adopting centralized control architectures for unified network management. uCentral is a communication protocol defined by the Telecom Infra Project (TIP) for controller-to-device communication. It is also a core component of the OpenWiFi centralized network management architecture.

Within the OpenWiFi framework, the Cloud Controller acts as the central control component and uses the uCentral protocol for southbound device management. Through standardized protocols, the architecture enables unified device onboarding, automated configuration deployment, and telemetry reporting. This provides the foundation for building a centralized control plane in large-scale networks.

Based on this architecture, the Asterfusion AIDC Controller extends and optimizes the OpenWiFi Cloud Controller framework for AI data center environments. The controller enhances network deployment, operations, and lifecycle management for AIDC scenarios. It provides an intuitive and efficient web-based management interface focused on centralized switch management and network optimization.

The platform delivers unified configuration management, status monitoring, and topology visualization. It also improves deployment efficiency and operational stability through automated and policy-driven control mechanisms. These capabilities help support large-scale and high-concurrency AIDC network environments.

In summary, for AIDC deployments, building a centralized network controller with automation and closed-loop management capabilities has become an inevitable architectural direction.

2. AIDC Controller System Architecture

The AIDC Controller is built on the OpenWiFi Cloud SDK and adopts a microservices architecture. System functions are divided into multiple independent services and deployed in a containerized manner. Each service communicates through standard interfaces, enabling functional decoupling and elastic scalability.

2.1 Overall Architecture

The following figure shows the overall architecture of the AIDC Controller. The system consists of three layers: the northbound interface layer, the controller core layer, and the southbound device access layer.

The northbound interface provides a unified management entry through the Web UI and RESTful APIs, supporting both visualized operations and external system integration. The controller core layer is built on a microservices architecture and decouples functional modules such as device configuration and firmware management.

Inter-service coordination is implemented through a unified east-west communication bus. Synchronous communication uses OpenAPI-based interfaces, while asynchronous messaging is handled through the Kafka message bus using a publish-and-subscribe model. This architecture improves system scalability and processing capability.

At the data layer, the controller uses PostgreSQL for centralized storage and management of configuration and status data. On the southbound side, the controller establishes persistent connections with switches through the uCentral over WebSocket protocol. This enables automatic device onboarding, telemetry reporting, and configuration deployment.
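To make the southbound exchange more concrete, the sketch below builds the kind of JSON-RPC 2.0 messages that the public uCentral specification describes for the WebSocket session. The `connect` and `configure` method names come from that specification. The field set is abbreviated, and the helper names, serial number, and values are hypothetical, so treat this as an illustration rather than the controller's actual implementation.

```python
import json
from itertools import count

# Hypothetical id generator for controller-initiated JSON-RPC requests.
_request_ids = count(1)


def connect_event(serial: str, config_uuid: int, firmware: str) -> str:
    """Device -> controller: announces the device once the WebSocket is up."""
    return json.dumps({
        "jsonrpc": "2.0",
        "method": "connect",
        "params": {
            "serial": serial,
            "uuid": config_uuid,      # id of the configuration currently active
            "firmware": firmware,
            "capabilities": {},       # abbreviated in this sketch
        },
    })


def configure_command(serial: str, config_uuid: int, config: dict) -> str:
    """Controller -> device: pushes a new configuration document."""
    return json.dumps({
        "jsonrpc": "2.0",
        "id": next(_request_ids),     # a request with an id expects a response
        "method": "configure",
        "params": {"serial": serial, "uuid": config_uuid, "when": 0, "config": config},
    })


def is_notification(raw: str) -> bool:
    """In JSON-RPC 2.0, a message without an "id" expects no response."""
    return "id" not in json.loads(raw)
```

In this framing, device events such as `connect` are notifications, while controller commands carry an `id` so that the device's result can be matched to the request.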

The overall architecture provides a complete closed-loop workflow from user operations to control logic and device execution. It offers strong scalability and high-concurrency processing capability, making it suitable for centralized management and automated operations in large-scale data center networks.

Figure 2-1 Overall Architecture of the AIDC Controller

2.2 Controller Components

The controller consists of several core service modules, including device management, configuration management, topology management, and status monitoring. The Web UI also runs as an independent microservice and provides a unified visualized management interface.

Table 2-1 describes the functions and roles of each service. Figure 2-2 shows the operational status of the microservices running in the controller.

Table 2-1 AIDC Controller Service Overview
Service Name | Function Description
owanalytics | Provides collection and analysis of device operational data to support performance monitoring and troubleshooting
owfms | Provides centralized firmware and patch management and upgrade capabilities
owgw | Provides southbound access capabilities and enables communication between the controller and network devices through the uCentral protocol
owmgmt | Provides configuration data import, export, and migration functions
owom | Provides alarm and event management capabilities
owprov | Provides centralized device configuration and management, including device grouping and batch configuration
owsec | Provides user authentication and access control to ensure system security
owupgrade | Provides software upgrade and maintenance capabilities for the controller itself
Figure 2-2 Operational Status of the Controller Microservices

2.3 Communication Mechanisms

The controller uses several communication methods for both internal service interaction and external connectivity.

  • Asynchronous Communication: Kafka-Based Message Bus
    The controller uses a Kafka message bus for asynchronous communication between services. This mechanism is mainly used for event distribution and state synchronization. It reduces inter-service coupling and improves system scalability and processing efficiency.
  • Synchronous Communication: REST API-Based Service Calls
    When real-time data exchange or operation triggering is required between services, the controller uses REST API-based synchronous communication to implement request-and-response interactions.
  • Northbound and Southbound Communication
    The controller also provides communication interfaces for interaction with external systems.
    • Northbound Interface:
      Provides standardized REST APIs for upper-layer management systems and the Web UI. This enables centralized network management and operations.
    • Southbound Interface:
      The owgw gateway service establishes WebSocket connections with network devices. Device configuration deployment and operational status reporting are implemented through the uCentral protocol.
  • Container Network Communication
    At the deployment layer, microservice containers typically run within the same Docker custom bridge network, such as the OpenWiFi bridge. This allows containers to communicate directly through IP addresses or container aliases.
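The decoupling effect of the publish-and-subscribe model can be illustrated with a minimal in-process event bus. This is a stand-in for the idea only: it does not use Kafka, and the `EventBus` class, the `device.state` topic name, and the event fields are hypothetical.

```python
from collections import defaultdict
from typing import Callable

Handler = Callable[[dict], None]


class EventBus:
    """Tiny in-memory publish/subscribe bus (illustrative, not Kafka)."""

    def __init__(self) -> None:
        self._subscribers: dict[str, list[Handler]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Handler) -> None:
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, event: dict) -> int:
        """Deliver the event to every subscriber; return how many received it."""
        handlers = self._subscribers.get(topic, [])
        for handler in handlers:
            handler(event)
        return len(handlers)


# Two independent consumers receive the same event; the publisher knows neither.
bus = EventBus()
alarm_inbox: list[dict] = []
analytics_inbox: list[dict] = []
bus.subscribe("device.state", alarm_inbox.append)
bus.subscribe("device.state", analytics_inbox.append)
delivered = bus.publish("device.state", {"serial": "aabbccddeeff", "cpu_pct": 93})
```

The publisher only names a topic, so a new consumer can be added without changing the service that emits the event. A real message bus adds persistence, partitioning, and delivery across processes, none of which this sketch models.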

3. Core Functions and Features

3.1 Rapid Service Deployment and Day-0 Provisioning

In AI data center environments, networks are built at large scale with a high number of devices. Traditional deployment methods that rely on manual per-device configuration can no longer meet the requirements for rapid delivery and configuration consistency. When cluster size reaches hundreds or even thousands of switches, manual provisioning becomes inefficient and introduces configuration inconsistencies and human errors, which can affect deployment timelines and network stability.

The Asterfusion AIDC Controller simplifies the entire deployment workflow through automated provisioning and template-based orchestration. This includes device onboarding, topology planning, and configuration deployment, significantly improving Day-0 deployment efficiency.

The main capabilities include:

  • Automatic device onboarding and centralized management through ZTP
  • Automatic topology generation based on deployment templates
  • Automatic validation between planned topology and physical topology
  • Staged configuration deployment and rapid service provisioning

With these capabilities, the controller can support rapid deployment and centralized management of large-scale Leaf-Spine networks. It enables fast provisioning for networks with hundreds of nodes, reduces manual operational overhead, and delivers standardized and repeatable network deployment workflows.

3.1.1 Automatic Device Onboarding and Centralized Management

After a device is physically installed and configured with the controller address, it automatically establishes a WebSocket connection with the controller. Once the connection is established, the device is onboarded into the controller management domain and added to the default resource pool (organization).

Based on deployment requirements, administrators can batch-assign devices to specific sites or service domains for subsequent topology planning and service deployment.

Figure 3-1 Automatic Device Onboarding and Inventory Registration
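The onboarding flow described above can be sketched as a small inventory model: a device lands in a default pool on first connect and is later batch-assigned to a site. The `Inventory` class, the `default` pool name, and the serial and site names are hypothetical and do not reflect the controller's data model.

```python
from dataclasses import dataclass, field

DEFAULT_POOL = "default"  # hypothetical name for the default resource pool


@dataclass
class Inventory:
    # Maps device serial -> site (or the default pool).
    site_of: dict[str, str] = field(default_factory=dict)

    def onboard(self, serial: str) -> None:
        """First connect places the device in the default pool.

        A reconnect keeps whatever assignment the device already has.
        """
        self.site_of.setdefault(serial, DEFAULT_POOL)

    def assign(self, serials: list[str], site: str) -> list[str]:
        """Batch-assign onboarded devices; return serials that are unknown."""
        unknown = [s for s in serials if s not in self.site_of]
        for serial in serials:
            if serial in self.site_of:
                self.site_of[serial] = site
        return unknown
```

Returning the unknown serials instead of raising lets a batch operation proceed for the devices that are present while still reporting the ones that have not connected yet.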

3.1.2 Typical AI Service Scenario Deployment

To support common AI data center network architectures, the controller provides multiple built-in deployment templates, including AI training, AI inference, and traditional data center network scenarios. Each template is pre-integrated with the corresponding network architecture and key features, allowing users to quickly complete network planning and deployment based on predefined scenarios.

Figure 3-2 Built-in Scenario Templates

The main deployment scenarios include:

  • AIDC Backend Network (GPU Training Fabric)
    Uses a Layer 3 Spine-Leaf architecture and supports large-scale node connectivity. The fabric integrates RoCE, intelligent path selection, and ARS (Adaptive Routing and Switching) to meet the requirements for high bandwidth, low latency, and lossless transport.
  • AIDC Frontend Network (Service Access Network)
    Uses EVPN MC-LAG technology to provide link redundancy and service isolation, ensuring high reliability for service access.
  • AIDC Storage Network
    Combines distributed gateway, MC-LAG, and RoCE technologies to improve storage access performance and reliability.
  • DC Converged Network
    Supports typical deployment models based on EVPN MC-LAG or EVPN Multihoming, making it suitable for traditional data center service environments.

Administrators can select device models and quantities based on the actual network scale. The controller can then generate the corresponding planned topology with a single click. Users can also customize the generated topology based on deployment requirements. Figure 3-3 shows an example of one-click planned topology generation for an AIDC backend network scenario.

Figure 3-3 Example of Planned Topology Generation for AIDC Backend Network Scenario
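As a rough illustration of what generating a planned topology from device counts involves, the sketch below builds the link set of a two-tier Spine-Leaf fabric in which every leaf connects to every spine. The function name, the device naming scheme, and the one-link-per-pair assumption are hypothetical; the controller's templates also cover device models, ports, and addressing, which are not modeled here.

```python
from itertools import product


def plan_leaf_spine(spines: int, leaves: int) -> set[tuple[str, str]]:
    """Return the planned (spine, leaf) links of a full-mesh two-tier fabric."""
    if spines < 1 or leaves < 1:
        raise ValueError("a fabric needs at least one spine and one leaf")
    spine_names = [f"spine{i}" for i in range(1, spines + 1)]
    leaf_names = [f"leaf{i}" for i in range(1, leaves + 1)]
    # Every leaf has one uplink to every spine: spines * leaves links in total.
    return set(product(spine_names, leaf_names))


planned_links = plan_leaf_spine(spines=2, leaves=3)
```

With 2 spines and 3 leaves the plan contains 6 links, which shows why link counts, and therefore manual cabling plans, grow multiplicatively with fabric size.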

After devices come online, the controller automatically discovers the physical links between devices and generates the actual network topology. It then validates this topology for consistency against the planned topology, as shown in Figure 3-4. Once validation passes, the system proceeds to the configuration deployment phase.

Figure 3-4 Topology Validation Illustration
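At its core, consistency validation is a comparison between two link sets. The following sketch treats links as unordered device pairs and reports links that are planned but not discovered, and links that are discovered but not planned. The function names and report format are hypothetical, and a real check would likely also compare port numbers and link speeds.

```python
def _normalize(links):
    """Treat each link as an unordered pair so (a, b) equals (b, a)."""
    return {frozenset(link) for link in links}


def validate_topology(planned, discovered) -> dict:
    """Compare planned and discovered links and report the differences."""
    planned_set = _normalize(planned)
    discovered_set = _normalize(discovered)
    missing = sorted(sorted(link) for link in planned_set - discovered_set)
    unexpected = sorted(sorted(link) for link in discovered_set - planned_set)
    return {
        "missing": missing,        # planned but not seen on the wire
        "unexpected": unexpected,  # cabled but not in the plan
        "passed": not missing and not unexpected,
    }


report = validate_topology(
    planned=[("spine1", "leaf1"), ("spine1", "leaf2")],
    discovered=[("leaf1", "spine1"), ("spine1", "leaf3")],
)
```

Here the `spine1`-`leaf2` link is reported as missing and `spine1`-`leaf3` as unexpected, which is the typical signature of a miscabled uplink.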

As shown in Figure 3-5 and Figure 3-6, configuration deployment is carried out in two phases:

  • Basic network configuration: Establishes basic connectivity between devices, such as Spine-Leaf underlay connectivity.
  • Service configuration: Enables required services based on business requirements. The configuration is then pushed to devices to complete service activation.

Through staged configuration and template-based provisioning, users can quickly complete the full process from network construction to service go-live.

Figure 3-5 Example of Basic Network Configuration for AIDC Backend Network Scenario
Figure 3-6 Service Configuration Template for AIDC Backend Network Scenario
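The two-phase deployment can be sketched as a gated pipeline. The sketch assumes that service configuration is pushed only after the basic configuration has succeeded on every device; the source does not state this gating rule, and the `deploy_in_stages` function and the `push` callback are hypothetical.

```python
from typing import Callable

# push(device, phase) returns True when the device accepted the configuration.
Push = Callable[[str, str], bool]


def deploy_in_stages(devices: list[str], push: Push) -> dict[str, list[str]]:
    """Push 'basic' everywhere first, then 'service' only if all succeeded."""
    failed_basic = [d for d in devices if not push(d, "basic")]
    if failed_basic:
        # Assumption: without full underlay connectivity, skip service rollout.
        return {"failed": failed_basic, "service_applied": []}
    failed_service = [d for d in devices if not push(d, "service")]
    applied = [d for d in devices if d not in failed_service]
    return {"failed": failed_service, "service_applied": applied}
```

Gating the second phase keeps a partially connected underlay from receiving service configuration that depends on it, at the cost of delaying healthy devices until the failed ones are fixed.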

Through automation and template-based capabilities, the AIDC Controller transforms the traditional “per-device configuration” deployment model into an “intent-based one-click orchestration” approach, significantly improving the delivery efficiency of AI data center networks.

3.2 AI-Oriented Optimization

In AI data center networks, the configuration of RoCE-related parameters such as PFC and ECN has a significant impact on network performance. Different service scenarios have varying requirements for latency, throughput, and congestion control. Relying on manual per-parameter configuration increases complexity and makes it difficult to ensure optimal parameter combinations.

The Asterfusion AIDC Controller adopts a “template-based configuration + parameter fine-tuning” approach to enable efficient deployment and fine-grained optimization of RoCE networks:

  • Scenario-based RoCE configuration templates
    As shown in Figure 3-7, the controller provides built-in RoCE parameter templates for typical AI workloads. These templates predefine key parameter sets, including PFC priorities and ECN marking policies. Users can select an appropriate template based on the service type to quickly complete baseline network configuration and reduce configuration complexity.
Figure 3-7 RoCE Template Creation Example
  • Parameter fine-tuning capability
    As shown in Figure 3-8, the controller allows flexible adjustment of key parameters such as PFC and ECN on top of the templates. This includes ECN marking thresholds and queue watermarks, enabling optimization based on specific workload characteristics and further improving network performance.
Figure 3-8 ECN Parameter Tuning Example
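The "template + fine-tuning" approach amounts to merging a small set of overrides into a predefined parameter set and validating the result. In the sketch below, the parameter names and all numeric values are hypothetical placeholders, not recommended settings and not the controller's actual schema.

```python
from typing import Optional

# Hypothetical baseline template; the values are placeholders only.
TEMPLATES = {
    "training": {
        "pfc_priorities": (3,),   # priorities with PFC enabled
        "ecn_min_kb": 150,        # queue depth where ECN marking starts
        "ecn_max_kb": 1500,       # queue depth where marking reaches its maximum
        "ecn_mark_pct": 10,       # marking probability at the max threshold
    },
}


def render_profile(template: str, overrides: Optional[dict] = None) -> dict:
    """Copy a template, apply overrides, and reject inconsistent thresholds."""
    profile = dict(TEMPLATES[template])  # copy so the template stays untouched
    for key, value in (overrides or {}).items():
        if key not in profile:
            raise KeyError(f"unknown RoCE parameter: {key}")
        profile[key] = value
    if not 0 < profile["ecn_min_kb"] < profile["ecn_max_kb"]:
        raise ValueError("ECN min threshold must be positive and below the max")
    return profile


tuned = render_profile("training", {"ecn_min_kb": 300})
```

Validating after the merge catches combinations that are individually plausible but jointly invalid, such as a minimum marking threshold raised above the maximum.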

3.3 Centralized Control and Automated Operations

The Asterfusion AIDC Controller is designed for large-scale data center networks. It provides unified device management and operational capabilities. Through centralized control and an observability framework, it improves operational efficiency and reduces the complexity of manual operations.

3.3.1 Centralized Device Management

As shown in Figure 3-9, the controller provides unified onboarding and management for all switches in the network, along with batch-oriented and automated device management capabilities, including:

  • Batch configuration deployment and modification
  • Unified configuration file management (view, import, export)
  • Remote command execution and automated script delivery
  • Device maintenance operations (reboot, upgrade, etc.)

Through centralized management, the system eliminates inefficiencies and configuration inconsistencies caused by per-device operations, enabling unified, network-wide operational management.

Figure 3-9 Unified Device Management

3.3.2 Network Observability Capabilities

As shown in Figures 3-10 to 3-15, the controller collects and analyzes device operational status and network traffic in a unified manner, and presents the overall network state through visualization.

  • Device status visualization
    Includes metrics such as CPU utilization, memory usage, interface status, and hardware information. A comprehensive health score is used to provide a global view of device conditions.
  • Service status monitoring
    Provides detailed status information for critical services, enabling rapid fault detection and troubleshooting.
  • Traffic and queue monitoring
    Supports visibility into interface traffic, queue buffer utilization, and RoCE-related statistical metrics. This helps administrators understand network load and traffic trends over time.
Figure 3-10 Device Status Overview
Figure 3-11 Device System Information
Figure 3-12 Device Service Details
Figure 3-13 Device Interface Traffic Statistics
Figure 3-14 Device Buffer Statistics
Figure 3-15 Device RoCE Statistics

3.3.3 Operational Assurance: Real-Time Alerts and Intelligent Inspection

As shown in Figures 3-16 to 3-19, the controller provides alerting and inspection mechanisms based on monitoring data, improving network stability and maintainability.

  • Alerting mechanism
    Supports customizable alert thresholds and notification policies. It enables real-time monitoring and alerting for device status, resource utilization, and hardware health conditions.
  • Inspection capability
    Supports on-demand one-click inspection and scheduled periodic inspections. The system automatically checks key device metrics, including CPU, memory, process status, and logs, and generates inspection reports.

Through integrated alerting and inspection mechanisms, the controller helps administrators identify potential issues in a timely manner and take corrective actions.

Figure 3-16 Alarm Notification Settings
Figure 3-17 Alarm Threshold Settings
Figure 3-18 Real-Time Device Alarms
Figure 3-19 Device Inspection
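Threshold-based alerting reduces to comparing each reported metric against configured rules. The sketch below is a minimal version of that check; the `Rule` class, metric names, severities, and thresholds are hypothetical examples.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    metric: str       # name of the monitored metric
    threshold: float  # alert when the sample exceeds this value
    severity: str


def evaluate(sample: dict[str, float], rules: list[Rule]) -> list[str]:
    """Return one alert message per rule whose threshold is exceeded."""
    alerts = []
    for rule in rules:
        value = sample.get(rule.metric)
        # A metric missing from the sample is skipped, not treated as healthy.
        if value is not None and value > rule.threshold:
            alerts.append(
                f"{rule.severity}: {rule.metric}={value} exceeds {rule.threshold}"
            )
    return alerts


rules = [Rule("cpu_pct", 90, "major"), Rule("mem_pct", 85, "minor")]
alerts = evaluate({"cpu_pct": 95, "mem_pct": 40}, rules)
```

A scheduled inspection can reuse the same rule evaluation on a timer and collect the resulting messages into a report instead of sending notifications.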

3.3.4 Multi-Tenant Organization and Access Control

As shown in Figures 3-20 and 3-21, the controller supports a multi-tenant organizational structure. Devices and resources can be hierarchically managed based on regions, departments, or business units, enabling fine-grained segmentation of devices and access permissions.

Through a hierarchical role-based access control (RBAC) mechanism, different administrators are restricted to operations within their respective permission domains, meeting the operational requirements of large-scale organizational environments.

Figure 3-20 Multi-Organization and Site Management
Figure 3-21 User Management
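One common way to model hierarchical RBAC is to grant a role at a node of the organization tree and let it apply to everything beneath that node. The sketch below follows that model; the role names, actions, and path-style organization identifiers are hypothetical and are not taken from the controller.

```python
from dataclasses import dataclass

# Hypothetical roles and the actions each one permits.
ROLE_ACTIONS = {
    "viewer": {"read"},
    "operator": {"read", "configure"},
    "admin": {"read", "configure", "manage_users"},
}


@dataclass(frozen=True)
class Grant:
    user: str
    role: str
    scope: str  # organization path, e.g. "acme/east"


def is_allowed(grants: list[Grant], user: str, action: str, target: str) -> bool:
    """Allow the action if any grant covers the target and permits the action."""
    target_parts = target.split("/")
    for grant in grants:
        scope_parts = grant.scope.split("/")
        # Compare path segments so "acme/east" does not cover "acme/eastern".
        in_scope = target_parts[: len(scope_parts)] == scope_parts
        if grant.user == user and in_scope and action in ROLE_ACTIONS[grant.role]:
            return True
    return False


grants = [Grant("alice", "operator", "acme/east"), Grant("bob", "viewer", "acme")]
```

With these grants, `alice` can configure devices under `acme/east` but not under `acme/west`, while `bob` can read everything under `acme` and change nothing.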

3.4 System Migration and Configuration Reuse

As shown in Figure 3-22, the Asterfusion AIDC Controller supports full export and import of system configurations, enabling fast migration between controllers and reuse of existing configurations.

When deploying a new controller or performing system migration, administrators can export the current controller configuration with a single operation and import it into the target controller. This enables rapid restoration of the network management environment and reduces repetitive configuration work.

This capability is applicable to scenarios such as system scaling, migration deployment, and disaster recovery. It improves deployment efficiency and increases operational flexibility.

Figure 3-22 Controller Configuration Import-Export

4. Appendix: Feature List

The AIDC Controller supports the following feature set:

Table 4-1 AIDC Controller Feature List
Feature | Level 1 | Level 2
Navigation | Map | Tree structure view showing all organizations and device information under the current account
Navigation | Map | Map-based hierarchical structure
Navigation | Map | Organization, site, and canvas operations
Navigation | Physical Topology | Topology generation and viewing; device basic information; interface rate and LACP negotiation status display
Dashboard | Aggregated Dashboard | Top 10 egress/ingress traffic flows, switch count, alarms
Dashboard | Site Dashboard | Switch count, historical egress throughput statistics, top interconnect interface bandwidth utilization
Dashboard | Custom Dashboard | Built-in multi-dimensional visualization components, component operations, dashboard reset
Configuration Management | Inventory Management | Inventory device information, import/export, device configuration, device classification
Configuration Management | Planned Topology | Network planning, validation between planned and physical topology, device health check for planned topology
Configuration Management | Alarm Configuration | Alarm severity, description, thresholds, and alarm suppression switch
Configuration Management | Alarm Configuration | Synchronization with subordinate organizations/sites
Configuration Management | Alarm Notification | Recipient settings, email enablement, alarm type/severity triggering, synchronization with subordinate organizations/sites
Configuration Management – Planned Topology & Service Provisioning | AIDC Backend Network Deployment | Basic network configuration: interface configuration, BGP configuration, import/export configuration
Configuration Management – Planned Topology & Service Provisioning | AIDC Backend Network Deployment | Service configuration: intelligent routing, RoCE, ARS, CSV batch import, filtering and batch push
Configuration Management – Planned Topology & Service Provisioning | AIDC Frontend Network Deployment | Basic network configuration: interface configuration, BGP configuration, MC-LAG configuration, import/export configuration
Configuration Management – Planned Topology & Service Provisioning | AIDC Frontend Network Deployment | Service configuration: service provisioning, RoCE, CSV batch import, filtering and batch push
Configuration Management – Planned Topology & Service Provisioning | AIDC Storage Network Deployment | Basic network configuration: interface configuration, BGP configuration, MC-LAG configuration, import/export configuration
Configuration Management – Planned Topology & Service Provisioning | AIDC Storage Network Deployment | Service configuration: service provisioning, RoCE, CSV batch import, filtering and batch push
Configuration Management – Planned Topology & Service Provisioning | Data Center Convergence Network Deployment | Basic network configuration: interface/sub-interface/VLAN/LAG, VRF/Route-Map/IP Prefix List/BGP/OSPF/static routing/SLA/Track, MC-LAG, import/export configuration
Configuration Management – Planned Topology & Service Provisioning | Data Center Convergence Network Deployment | Service configuration: service provisioning, RoCE, ACL, CSV batch import, filtering and batch push
Device Management | Device Basic Information | Name, type, IP address, MAC address, OS version, serial number, creation time, last update time, organization/site, Loopback IP
Device Management | Device Status Information | Controller connection status, uptime, last contact time, local time, load, real-time CPU/memory usage, health status
Device Management | Device Peripheral Information | Temperature, fan, and power supply status
Device Management | Device Alarm Information | -
Device Management | Interface Usage Information | Utilization rate, number of UP/DOWN interfaces
Device Management | Device Log Information | -
Device Management | Device Health Statistics | -
Device Management | Device Coredump Logs | -
Device Management | Switch LLDP Information | -
Device Management | Switch Detail Information | MAC, neighbors, PBR, RoCE, intelligent routing, ARS, MC-LAG, BGP, EVPN tunnel information
Device Management | Device Statistics | Interface rx/tx Bps and packet statistics; overall device rx/tx Bps and packet statistics; CPU usage history; memory usage history; interface information; optics module information; buffer statistics; RoCE metrics
Device Management | Device Operation Records | -
Device Management | Device Notes | Notes content and creator
Device Management | Remote Device Operations | Reboot, factory reset, packet capture, firmware upgrade, commands/scripts, configuration files, patch application, device inspection, controller IP modification
Inventory Management | Inventory List | MAC address, name, configuration tags, organization/site, device type, Loopback0 IP, interface type, license MD5, description, status
Inventory Management | Inventory Operations | Filtering, create/edit/delete inventory, change organization/site assignment, CSV export
Operations Management | Alarm Functions | Active and historical alarms, alarm settings, alarm notifications
Operations Management | Inspection Functions | One-click inspection, scheduled inspection, inspection record export/view/delete, threshold configuration
Operations Management | Firmware | Firmware management, patch management
Operations Management | Scripts | -
User Management | User Settings | -
User Management | Built-in Users | -
User Management | User Operations | Create, modify, delete users
User Management | Roles | -
User Management | User Actions | -
System | Controller Configuration | -
System | Controller Upgrade | Version deployment, version management, patch management, upgrade logs
System | Configuration Migration | -
System | System Logs | -
System | Service Management | -
System | Release Notes | -
