We’ve been using GenieACS on our network for a while now, managing around 30k active CPEs. However, we’re currently facing challenges with the CWMP process on our VM, where CPU and memory usage consistently hit 100%. We suspect that certain CPEs are contributing to this issue by ignoring the InformInterval, and we’re actively collaborating with a vendor to address this behavior.
The server hosting our GenieACS VM has the following specifications:
CPU(s): 80 x Intel(R) Xeon(R) E7-4850 @ 2.00GHz (4 Sockets)
Kernel Version: Linux 6.5.11-7-pve
Our GenieACS VM itself is equipped with 32 cores, 16GB RAM, and runs all the necessary services, including database, NBI, CWMP, etc.
While working to resolve the CPE-related problem, the idea of segmenting GenieACS services came up as a potential solution. However, we want to clarify that this is just a thought that occurred during our troubleshooting process, and we’re not certain whether segmenting the services is a necessary step.
We welcome any insights or suggestions you may have on this matter. Thank you!
Obviously PeriodicInformInterval must be set correctly on all devices. => Prio 1
Optimize provisions and check which events they need to run on (BOOTSTRAP, BOOT, PERIODIC, etc.) => Prio 2
cwmp carries the most load (separating fs, nbi and ui from cwmp usually doesn’t help). Moving mongodb out might help a bit; disk access, file system type, etc. also matter there.
Splitting cwmp can be done but increases complexity a lot. You might need a load balancer (haproxy), etc.
But looking at the number of devices and the server specs, you really need to focus on Prio 1 and 2.
Thank you for your prompt response! We’re actively addressing the PeriodicInformInterval settings on all devices as our top priority and optimizing provisions based on relevant events, as you suggested.
We’ve been in discussions with the vendor to resolve the issues with the specific CPEs causing CPU and memory spikes. The events and scripts are also undergoing improvements to streamline the provisioning process.
Considering our network is expected to scale to around 90k CPEs or more by the end of the year, we are exploring potential strategies to enhance GenieACS performance. Given your experience, do you think it would be necessary to consider segmenting GenieACS services at this scale, or would you recommend focusing on further optimizations within the current setup?
What inform interval do you have set for your CPEs? With a network of 30k devices, you should easily be able to have an inform set to once / hour with the hardware you have. 30k devices/hr works out to an avg of one every 120 milliseconds, or about 8 requests per second.
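The arithmetic above can be reproduced with a short Python sketch. It assumes informs are spread evenly over the interval, which a real fleet only approximates, and it also runs the projected 90k fleet size mentioned earlier in the thread. The function name is made up for illustration.

```python
def inform_load(devices: int, interval_s: float) -> tuple[float, float]:
    """Return (informs per second, average gap in ms) for an evenly spread fleet."""
    per_second = devices / interval_s
    gap_ms = 1000.0 / per_second
    return per_second, gap_ms


if __name__ == "__main__":
    # 3600 s = one inform per device per hour
    for devices in (30_000, 90_000):
        rate, gap = inform_load(devices, 3600)
        print(f"{devices} devices, hourly inform: {rate:.1f} req/s, one every {gap:.0f} ms")
```

Devices that ignore the interval break the even-spread assumption, so the real peak rate can be well above these averages.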
You need to take stock of all the provision scripts that run when a CPE informs and figure out what you can/should scale back. For example, if one of your provision scripts refreshes the InternetGatewayDevice.DeviceInfo.UpTime param every time it connects, do you really need to do that? Or can you scale it back to refreshing every 6 hours?
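To get a feel for how much a "refresh at most every 6 hours" rule saves, here is a rough Python sketch of the staleness check and the resulting number of reads per device per day. It is plain Python, not a GenieACS provision script; the function names and the hourly inform interval are assumptions for illustration only.

```python
SIX_HOURS_S = 6 * 3600
DAY_S = 24 * 3600


def needs_refresh(last_refresh_s: float, now_s: float, max_age_s: float) -> bool:
    """True if the cached value is at least max_age_s old and should be re-read from the CPE."""
    return now_s - last_refresh_s >= max_age_s


def refreshes_per_day(inform_interval_s: int, max_age_s: int) -> int:
    """Count how often one device gets refreshed in a day, walking through its informs."""
    last_refresh = float("-inf")  # never refreshed yet
    count = 0
    for now in range(0, DAY_S, inform_interval_s):
        if needs_refresh(last_refresh, now, max_age_s):
            count += 1
            last_refresh = now
    return count


if __name__ == "__main__":
    print("refresh on every inform:", refreshes_per_day(3600, 0))
    print("refresh every 6 hours:  ", refreshes_per_day(3600, SIX_HOURS_S))
```

With hourly informs this works out to 24 reads per device per day versus 4, and the difference is multiplied by the size of the fleet.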
One option would be to manually check whether the inform interval is honored: pick one device, note its last inform timestamp, wait until just before the next inform is due, and check whether the device has already checked in.
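If you note down a few inform timestamps for a device, the same check can be done with a small Python sketch like the one below. How you collect the timestamps (by hand, from logs, or from the last inform value shown for the device) is left open here; the function names, the 10% tolerance and the sample data are all made up for illustration.

```python
from datetime import datetime, timedelta


def inform_gaps(timestamps: list[datetime]) -> list[timedelta]:
    """Return the gaps between consecutive informs, oldest first."""
    ordered = sorted(timestamps)
    return [later - earlier for earlier, later in zip(ordered, ordered[1:])]


def ignores_interval(timestamps: list[datetime], expected: timedelta,
                     tolerance: float = 0.1) -> bool:
    """True if any gap is shorter than the expected interval by more than the tolerance."""
    threshold = expected * (1 - tolerance)
    return any(gap < threshold for gap in inform_gaps(timestamps))


if __name__ == "__main__":
    base = datetime(2024, 1, 1, 12, 0, 0)  # made-up sample data
    well_behaved = [base + timedelta(hours=i) for i in range(4)]
    chatty = [base, base + timedelta(minutes=5), base + timedelta(hours=1)]
    print("well behaved device ignores interval:", ignores_interval(well_behaved, timedelta(hours=1)))
    print("chatty device ignores interval:      ", ignores_interval(chatty, timedelta(hours=1)))
```

Keep in mind that boots, connection requests and value-change events also trigger informs, so a single short gap is not proof by itself; a device that is consistently early is the one to raise with the vendor.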
Double the RAM or add even more. Depending on the OS and so on, GenieACS can use a lot of RAM.
It is also quite important to check which variables are loaded when. It can make a major difference whether you only check WAN parameters once a day or load all connected hosts with MAC addresses every 10 minutes.