Worker failures from stream write contention

I decided to run some minor load tests on this app with small slices of our production network, and I'm seeing many failures pop up in the server-side logs:

cloud-acs_1  | 2020-08-07T17:03:08.327Z [INFO] Worker listening; pid=9275 address="0.0.0.0" port=7547
cloud-acs_1  | 2020-08-07T17:03:32.896Z [ERROR] Uncaught exception; pid=77 exceptionName="Error" exceptionMessage="EAGAIN: resource temporarily unavailable, write" exceptionStack="Error: EAGAIN: resource temporarily unavailable, write"
cloud-acs_1  | 2020-08-07T17:03:33.100Z [ERROR] Uncaught exception; pid=77 exceptionName="Error [ERR_STREAM_DESTROYED]" exceptionMessage="Cannot call write after a stream was destroyed" exceptionStack="Error [ERR_STREAM_DESTROYED]: Cannot call write after a stream was destroyed\n    at doWrite (_stream_writable.js:406:19)\n    at writeOrBuffer (_stream_writable.js:394:5)\n    at WriteStream.Writable.write (_stream_writable.js:294:11)\n    at accessLog (/app/genieacs/dist/bin/genieacs-cwmp:2:12986)\n    at accessInfo (/app/genieacs/dist/bin/genieacs-cwmp:2:13069)\n    at sendAcsRequest (/app/genieacs/dist/bin/genieacs-cwmp:2:127832)\n    at applyPresets (/app/genieacs/dist/bin/genieacs-cwmp:2:124675)\n    at process._tickCallback (internal/process/next_tick.js:68:7)"
cloud-acs_1  | 2020-08-07T17:03:33.158Z [ERROR] Worker died; pid=77 exitCode=0

Any ideas on what's happening here? Is this a known issue?

Note: those logs come from docker-compose; the application is running in a Docker container (Alpine) on a t2.micro EC2 host (Ubuntu 18.04). There are 3 nodes behind a TCP load balancer, and TLS is terminated in GenieACS itself.

I have confirmed this also happens with a single server (not behind the load balancer). I usually see this error within 24 hours of the server going live with active connections (fewer than 100 devices attempting to connect).

I had a similar issue. I had to upgrade to version 1.2 and it disappeared.

This happens to me on current master builds as well. Smells like a state management bug (trying to write to a socket or stream that no longer exists). Since this surfaces as an uncaught exception, on modern versions of Node the worker process exits.
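For what it's worth, the failure mode is easy to demonstrate in isolation. Here's a minimal, hypothetical sketch (plain Node/TypeScript, not GenieACS code) of a write to a destroyed stream turning into exactly this kind of uncaught exception:

// repro.ts (hypothetical): write to a stream after it has been destroyed.
// The write emits an 'error' event; with no 'error' listener attached,
// Node raises an uncaught exception and the worker dies, as in the logs above.
import { createWriteStream } from "fs";

const accessLog = createWriteStream("/tmp/access.log", { flags: "a" });

accessLog.destroy();                              // simulate the stream going away
accessLog.write("203.0.113.7 cwmp inform 200\n"); // -> Error [ERR_STREAM_DESTROYED]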

Just noting here that this bug still exists in v1.2.3. It happens roughly 3 times per 24-hour period on average with ~13k CPEs on a single server. My organization is moving on and building our own ACS solution due to the stability and deployment issues we’ve seen with Genie, but I figured I’d raise this here in case you want more debug information to help you solve the issue. (It seems at least a few others are seeing this problem.)

Happy to help provide more info if you need it.

Here’s the most recent stack trace

cloud-acs_1  | 2020-11-09T19:24:59.367Z [ERROR] Uncaught exception; pid=922 exceptionName="Error" exceptionMessage="Cannot call write after a stream was destroyed" exceptionStack="Error [ERR_STREAM_DESTROYED]: Cannot call write after a stream was destroyed\n    at doWrite (_stream_writable.js:399:19)\n    at writeOrBuffer (_stream_writable.js:387:5)\n    at WriteStream.Writable.write (_stream_writable.js:318:11)\n    at ae (/app/genieacs/dist/bin/genieacs-cwmp:2:10540)\n    at re (/app/genieacs/dist/bin/genieacs-cwmp:2:10587)\n    at Nn (/app/genieacs/dist/bin/genieacs-cwmp:2:116067)\n    at processTicksAndRejections (internal/process/task_queues.js:97:5)"
cloud-acs_1  | 2020-11-09T19:24:59.431Z [ERROR] Worker died; pid=922 exitCode=0
cloud-acs_1  | 2020-11-09T19:24:59.431Z [ERROR] Too many crashes, exiting; pid=10

Is this stack trace from a v1.2.3 install from npm? If not, can you share the genieacs-cwmp file? I need to extract the sourcemap from it to translate the stack trace.

The stack trace is indeed from 1.2.3 but we don’t install from npm for security reasons. We fork the repo and build it ourselves for our production container environments.

Is there an additional command or config you use to generate the TypeScript sourcemaps? Our build process currently follows the readme step of npm run build.

I could add "sourceMap": true to the tsconfig if that's the desired approach.

The sourcemaps are inlined within the executable file. Extract them like this:

tail -n 1 dist/bin/genieacs-cwmp

It can be quite large, so you can email it to me if you prefer.
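If you'd rather translate the frames yourself, something along these lines should work. It's only a rough sketch: it assumes the last line of the bundle is a base64 data-URL sourceMappingURL comment and that the source-map package (0.6.x, synchronous API) is available.

// translate-frame.ts (hypothetical helper, not part of GenieACS)
// Usage: ts-node translate-frame.ts dist/bin/genieacs-cwmp 2 127832
import { readFileSync } from "fs";
import { SourceMapConsumer } from "source-map";

const file = process.argv[2];
const line = Number(process.argv[3]);
const column = Number(process.argv[4]);

// The inline sourcemap is the base64 payload on the last line of the bundle.
const lastLine = readFileSync(file, "utf8").trimEnd().split("\n").pop() || "";
const rawMap = JSON.parse(
  Buffer.from(lastLine.replace(/^.*base64,/, ""), "base64").toString("utf8")
);

// Map the generated line:column from the stack trace back to the original source.
const consumer = new SourceMapConsumer(rawMap);
console.log(consumer.originalPositionFor({ line, column }));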

It wasn't too large, only 250 KB: https://filebin.net/f9hqb373sglee0zz/genieacs-cwmp.sourcemap?t=ya8n1kvh

Thanks. Looking at the trace I can see the exception is getting thrown because the access log stream is no longer available for some reason. Do you have your logs configured to go to stdout or to a file? If it’s the latter, what’s your logrotate config like?

All logs currently go to stdout. Because we use systemd to keep the container up, those end up in syslog downstream from the application, but the application itself only writes to stdout.

Do you have any custom patches applied? Are you by any chance setting the environment variable GENIEACS_CWMP_ACCESS_LOG_FILE to something like /dev/stdout?

We do have a small patch applied to remove a mongo privilege escalation security risk: https://github.com/StarryInternet/genieacs/commit/e2861d63bb95cbbdcb7c0ff13135e2869b26fd3e

We have this deployed against a shared DB cluster, and as a result we don't give admin privileges to applications (which you need in order to run that command to get hostInfo). I commented on this problem in another thread: Remove dependency from db.hostInfo()

As for environment variables, these are the only ones we’re setting:

GENIEACS_CWMP_SSL_CERT
GENIEACS_CWMP_SSL_KEY
GENIEACS_UI_SSL_CERT
GENIEACS_UI_SSL_KEY
GENIEACS_MONGODB_CONNECTION_URL

Hmm, so I suspect systemd/journald is terminating the stdout stream for some reason. I think some applications simply stop writing to stdout once it's closed, but I don't think that would make for a good fix here. The access log is, IMO, too important to silently ignore.
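For clarity, the kind of workaround I mean would look roughly like this (a hypothetical sketch, not something GenieACS does). It silently drops access-log lines once the stream is gone, which is exactly why I don't like it:

// Hypothetical defensive wrapper around the access-log stream.
// Without an 'error' listener, EAGAIN / EPIPE / ERR_STREAM_DESTROYED surface
// as uncaught exceptions and the worker dies, as in the logs above.
function tolerantAccessLog(stream: NodeJS.WriteStream): (line: string) => void {
  stream.on("error", (err) => {
    process.stderr.write(`access log write failed, dropping line: ${err.message}\n`);
  });
  return (line) => {
    if (stream.destroyed || !stream.writable) return; // stream already gone: drop silently
    stream.write(line);
  };
}

const accessLog = tolerantAccessLog(process.stdout);
accessLog("203.0.113.7 cwmp inform 200\n");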

Note that Genie uses stdout for the access log and stderr for process-related logs (e.g. service start, exceptions, etc.). My recommendation is to write the access log to a file and keep stderr in journald. Access logs don't belong in journald anyway.

That explanation doesn’t seem likely for a few different reasons:

  1. we have hundreds of processes deployed this way: container applications that write all their logs to stdout while systemd watches the Docker output. No problems in the 6 years since we founded the ISP (this deployment is identical to how we deploy everything else).

  2. even if systemd decided to close the container stream it's listening to, the application's access to stdout doesn't change, because it's namespaced inside that container. Even if a close() happens on the stream that systemd is using, that call wouldn't affect the stream the application is using in the container. They are entirely different file descriptors, which you can verify via lsof (or by reading /proc/self/fd from inside the container, as sketched below).
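As a quick sanity check from inside the container, you can also ask the process itself what fd 1 points at (a hypothetical snippet):

// check-stdout.ts (hypothetical): print what this process's fd 1 is connected to.
// On Linux, /proc/self/fd/1 is a symlink to the underlying pipe/socket/tty;
// the descriptor itself belongs to the container process, not to systemd on the host.
import { readlinkSync } from "fs";
console.log(readlinkSync("/proc/self/fd/1")); // e.g. "pipe:[123456]"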

See https://www.freedesktop.org/software/systemd/man/systemd-journald.service.html#Stream%20logging

You can kind of reproduce this issue by stopping and then immediately starting journald (but not ‘restart’, as described in the link above). I just tested that myself, and the genieacs-cwmp process crashes as soon as I start the client simulator after having restarted journald. I don't even get to see the exception log, though, because both stdout and stderr get terminated in this case.

It’s possible that the sheer volume of logs coming from the process is more than what journald is willing to handle so it just terminates the stream. This looks like it might be related: https://github.com/coreos/bugs/issues/990

Again, we have dozens of production services logging orders of magnitude more data than Genie does without this issue (including access logs for 400 million requests per day, all to stdout), so beyond conjecture I don't see any real evidence that systemd/journald is the problem here.

I'll do some more digging here to see if I can find anything more helpful.