Connection dropped on multiple cpe connections at the same time

we are a small ISP provider and we are currently installing optical fiber connections in our city since about 3 months. So far we have about 250 registered devices on GenieACS.

Thing is, while developing a solution to handle PPPoE and WiFi credentials, we established 12 hours as the periodic inform interval. Last weekend we had a several blackouts at sections of the city (I believe they were scheduled) and as result, we had dozens of CPE doing Inform event on the ACS at the same time.

I was really unable to identify other source of the problem than the ACS by itself. The error I got from genieacs-cwmp log say:

Connection dropped
Channel has faulted; channel=“boot” retries=0 faultCode=“session_terminated” faultMessage=“The TR-069 session was unsuccessfully terminated”

below the section of the log. I was wondering if it has anything to do with the VALUE event

Summary

2021-06-25T17:29:48.801Z [INFO] 10.111.1.243 00259E-HG8145V5-485754432D90B2A3: Script: boot : : datos a grabar en CPE: {

“InternetGatewayDevice.ManagementServer.PeriodicInformInterval”: 43200
}
2021-06-25T17:29:48.801Z [INFO] 10.111.1.243 00259E-HG8145V5-485754432D90B2A3: Script: boot : grabando InternetGatewayDevice.ManagementServer.PeriodicInformInterval = 43200
2021-06-25T17:29:48.801Z [INFO] 10.111.1.243 00259E-HG8145V5-485754432D90B2A3: Script: boot : FIN ! ========================================
2021-06-25T17:29:48.804Z [INFO] 10.111.1.243 00259E-HG8145V5-485754432D90B2A3: ACS request; acsRequestId=“17a4437e5e40100” acsRequestName=“SetParameterValues”
2021-06-25T17:29:48.805Z [ERROR] 10.111.1.243 00259E-HG8145V5-485754432D90B2A3: Connection dropped
2021-06-25T17:29:48.805Z [WARN] 10.111.1.243 00259E-HG8145V5-485754432D90B2A3: Channel has faulted; channel=“boot” retries=0 faultCode=“session_terminated” faultMessage=“The TR-069 session was unsuccessfully terminated”
2021-06-25T17:30:01.946Z [INFO] 10.111.1.243 00259E-HG8145V5-485754432D90B2A3: Inform; cpeRequestId=“38” informEvent=“4 VALUE CHANGE” informRetryCount=2
2021-06-25T17:30:16.061Z [INFO] 10.111.1.243 00259E-HG8145V5-485754432D90B2A3: Inform; cpeRequestId=“39” informEvent=“2 PERIODIC” informRetryCount=0
2021-06-25T17:30:16.077Z [INFO] 10.111.1.243 00259E-HG8145V5-485754432D90B2A3: Script: inform : : INICIO ! dev id: 00259E-HG8145V5-485754432D90B2A3
2021-06-25T17:30:16.077Z [INFO] 10.111.1.243 00259E-HG8145V5-485754432D90B2A3: Script: inform : manufacturer: Huawei Technologies Co., Ltd
2021-06-25T17:30:16.360Z [INFO] 10.111.1.243 00259E-HG8145V5-485754432D90B2A3: Script: inform : parametros a pedir: {
“InternetGatewayDevice.ManagementServer.PeriodicInformInterval”: “”,

and I found this on the journalctrl

Error" exceptionMessage=“Lock expired” exceptionStack=“Error: Lock expired\n at cs (/usr/local/lib/node_modules/genieacs/bin/genieacs-cwmp:2:56400)\n at runMicrotasks ()\n at processTicksAndRejections (internal/process/task_queues.js:97:5)\n at async In (/usr/local/lib/node_modules/genieacs/bin/genieacs-cwmp:2:117461)\n at async Dn (/usr/local/lib/node_modules/genieacs/bin/genieacs-cwmp:2:106963)”

to deal with the issue of multiple connections at the same time I decided to switch the periodic inform times based on devid from 10 to 14 hours. So far so good.

As aditional background information, I could add the whole provisioning process takes some time, sometimes around 20 seconds to complete. I had to raise the timeout variable GENIEACS_EXT_TIMEOUT=150000 to archieve this. I am considering making a local copy of the DB API data to reduce the delays. Perhaps that should be my priority.

In case someone wonders, all the computers involved are virtual machines and I don’ t really know a lot about the network setup but will try to answer any question to clarify this.

What I would like to know is how do I address this problem.

thanks in advance. RM.

I had a similar issue. Most CPE’s allow to set or have a random wait period before informing the ACS the handle these kind of issues.

My solution was, wait a bit, until all backlogged devices were dealt with. The machine was also a virtual machine, with 20 GB RAM, so the hardware was more or less able to handle these devices.
Was also cloud help, running the machines on solid state drives.

To split the devices would be, depending on the device, making a inform request to the device the spread out the inform time. (If that does not work, try restarting )

1 Like

Had similar issue with “stock” 1.2.5, switched to GIT version (which includes fixes), looks like it resolves the issue.

@zaidka: can you update the “official” branch to include fixes like this?

in my case, after an extensive research I found out the problem wasn’t really ACS but the external restapi, which still takes too much time to answer the Extension script. It does work fine if it’s just a few connections. That, and the limit of 150 simultaneous connections on MariaDB.

As a general solution, current talks are about tagging CPE on boot event and process each one of them later using an external script. Also, we were talking about replacing the data retrieval from restapi with text files located on the same ACS computer for each CPE, that would speed things up, provided NodeJS does not cache them (I still have to research about that).

At the time, to solve the periodic inform interval issue, what I did was to write a single function which returns a number in seconds between 10 and 14 hours based on the last 4 digits of the serial number. Since then we had no more problems of collisions.

Where is that limit set? Is there a way to change it?

yes, both MySQL and MariaDB has it by default, it can be set on /etc/my.cnf or similar

https://dev.mysql.com/doc/refman/8.0/en/server-system-variables.html#sysvar_max_connections

what I understand from what I have read on the web, increasing that value is not really a good option, since it increases the amount of RAM required to run the SQL server. That’s why I am considering using text files as temporary data storage.

I did the tag thing as I mentioned and it’s working fine, at least it gives us some rest since we had several blackouts during this week due to several storm rains.

edit: gave wrong link.

1 Like