Connection dropped on multiple cpe connections at the same time

rudymartin · July 29, 2021, 4:50pm

we are a small ISP provider and we are currently installing optical fiber connections in our city since about 3 months. So far we have about 250 registered devices on GenieACS.

Thing is, while developing a solution to handle PPPoE and WiFi credentials, we established 12 hours as the periodic inform interval. Last weekend we had a several blackouts at sections of the city (I believe they were scheduled) and as result, we had dozens of CPE doing Inform event on the ACS at the same time.

I was really unable to identify other source of the problem than the ACS by itself. The error I got from genieacs-cwmp log say:

Connection dropped
Channel has faulted; channel=“boot” retries=0 faultCode=“session_terminated” faultMessage=“The TR-069 session was unsuccessfully terminated”

below the section of the log. I was wondering if it has anything to do with the VALUE event

Summary

2021-06-25T17:29:48.801Z [INFO] 10.111.1.243 00259E-HG8145V5-485754432D90B2A3: Script: boot : : datos a grabar en CPE: {

“InternetGatewayDevice.ManagementServer.PeriodicInformInterval”: 43200
}
2021-06-25T17:29:48.801Z [INFO] 10.111.1.243 00259E-HG8145V5-485754432D90B2A3: Script: boot : grabando InternetGatewayDevice.ManagementServer.PeriodicInformInterval = 43200
2021-06-25T17:29:48.801Z [INFO] 10.111.1.243 00259E-HG8145V5-485754432D90B2A3: Script: boot : FIN ! ========================================
2021-06-25T17:29:48.804Z [INFO] 10.111.1.243 00259E-HG8145V5-485754432D90B2A3: ACS request; acsRequestId=“17a4437e5e40100” acsRequestName=“SetParameterValues”
2021-06-25T17:29:48.805Z [ERROR] 10.111.1.243 00259E-HG8145V5-485754432D90B2A3: Connection dropped
2021-06-25T17:29:48.805Z [WARN] 10.111.1.243 00259E-HG8145V5-485754432D90B2A3: Channel has faulted; channel=“boot” retries=0 faultCode=“session_terminated” faultMessage=“The TR-069 session was unsuccessfully terminated”
2021-06-25T17:30:01.946Z [INFO] 10.111.1.243 00259E-HG8145V5-485754432D90B2A3: Inform; cpeRequestId=“38” informEvent=“4 VALUE CHANGE” informRetryCount=2
2021-06-25T17:30:16.061Z [INFO] 10.111.1.243 00259E-HG8145V5-485754432D90B2A3: Inform; cpeRequestId=“39” informEvent=“2 PERIODIC” informRetryCount=0
2021-06-25T17:30:16.077Z [INFO] 10.111.1.243 00259E-HG8145V5-485754432D90B2A3: Script: inform : : INICIO ! dev id: 00259E-HG8145V5-485754432D90B2A3
2021-06-25T17:30:16.077Z [INFO] 10.111.1.243 00259E-HG8145V5-485754432D90B2A3: Script: inform : manufacturer: Huawei Technologies Co., Ltd
2021-06-25T17:30:16.360Z [INFO] 10.111.1.243 00259E-HG8145V5-485754432D90B2A3: Script: inform : parametros a pedir: {
“InternetGatewayDevice.ManagementServer.PeriodicInformInterval”: “”,

and I found this on the journalctrl

Error" exceptionMessage=“Lock expired” exceptionStack=“Error: Lock expired\n at cs (/usr/local/lib/node_modules/genieacs/bin/genieacs-cwmp:2:56400)\n at runMicrotasks ()\n at processTicksAndRejections (internal/process/task_queues.js:97:5)\n at async In (/usr/local/lib/node_modules/genieacs/bin/genieacs-cwmp:2:117461)\n at async Dn (/usr/local/lib/node_modules/genieacs/bin/genieacs-cwmp:2:106963)”

to deal with the issue of multiple connections at the same time I decided to switch the periodic inform times based on devid from 10 to 14 hours. So far so good.

As aditional background information, I could add the whole provisioning process takes some time, sometimes around 20 seconds to complete. I had to raise the timeout variable GENIEACS_EXT_TIMEOUT=150000 to archieve this. I am considering making a local copy of the DB API data to reduce the delays. Perhaps that should be my priority.

In case someone wonders, all the computers involved are virtual machines and I don’ t really know a lot about the network setup but will try to answer any question to clarify this.

What I would like to know is how do I address this problem.

thanks in advance. RM.

JonasGhost · July 29, 2021, 7:33pm

I had a similar issue. Most CPE’s allow to set or have a random wait period before informing the ACS the handle these kind of issues.

My solution was, wait a bit, until all backlogged devices were dealt with. The machine was also a virtual machine, with 20 GB RAM, so the hardware was more or less able to handle these devices.
Was also cloud help, running the machines on solid state drives.

To split the devices would be, depending on the device, making a inform request to the device the spread out the inform time. (If that does not work, try restarting )

git-neo-si · August 31, 2021, 12:29pm

Had similar issue with “stock” 1.2.5, switched to GIT version (which includes fixes), looks like it resolves the issue.

@zaidka: can you update the “official” branch to include fixes like this?

rudymartin · September 7, 2021, 2:18pm

in my case, after an extensive research I found out the problem wasn’t really ACS but the external restapi, which still takes too much time to answer the Extension script. It does work fine if it’s just a few connections. That, and the limit of 150 simultaneous connections on MariaDB.

As a general solution, current talks are about tagging CPE on boot event and process each one of them later using an external script. Also, we were talking about replacing the data retrieval from restapi with text files located on the same ACS computer for each CPE, that would speed things up, provided NodeJS does not cache them (I still have to research about that).

At the time, to solve the periodic inform interval issue, what I did was to write a single function which returns a number in seconds between 10 and 14 hours based on the last 4 digits of the serial number. Since then we had no more problems of collisions.

JonasGhost · September 8, 2021, 8:40am

Where is that limit set? Is there a way to change it?

rudymartin · September 11, 2021, 3:15pm

yes, both MySQL and MariaDB has it by default, it can be set on /etc/my.cnf or similar

https://dev.mysql.com/doc/refman/8.0/en/server-system-variables.html#sysvar_max_connections

what I understand from what I have read on the web, increasing that value is not really a good option, since it increases the amount of RAM required to run the SQL server. That’s why I am considering using text files as temporary data storage.

I did the tag thing as I mentioned and it’s working fine, at least it gives us some rest since we had several blackouts during this week due to several storm rains.

edit: gave wrong link.

Topic		Replies	Views
Same event every time	5	509	October 6, 2023
Session unsuccessfully terminated	7	1890	September 12, 2023
Best estimation for MAX_CONCURRENT_REQUESTS	10	1297	September 27, 2023
Speading "2 PERIODIC" over time	6	668	October 19, 2024
Using extension scripts always leads to unsuccessfully terminated session	8	1148	September 15, 2021

Connection dropped on multiple cpe connections at the same time

Related topics