Worker failures from stream write contention

I decided to run some minor load tests on this app with small slices of our production network, and I'm seeing many failures pop up in the server-side logs:

cloud-acs_1  | 2020-08-07T17:03:08.327Z [INFO] Worker listening; pid=9275 address="0.0.0.0" port=7547
cloud-acs_1  | 2020-08-07T17:03:32.896Z [ERROR] Uncaught exception; pid=77 exceptionName="Error" exceptionMessage="EAGAIN: resource temporarily unavailable, write" exceptionStack="Error: EAGAIN: resource temporarily unavailable, write"
cloud-acs_1  | 2020-08-07T17:03:33.100Z [ERROR] Uncaught exception; pid=77 exceptionName="Error [ERR_STREAM_DESTROYED]" exceptionMessage="Cannot call write after a stream was destroyed" exceptionStack="Error [ERR_STREAM_DESTROYED]: Cannot call write after a stream was destroyed\n    at doWrite (_stream_writable.js:406:19)\n    at writeOrBuffer (_stream_writable.js:394:5)\n    at WriteStream.Writable.write (_stream_writable.js:294:11)\n    at accessLog (/app/genieacs/dist/bin/genieacs-cwmp:2:12986)\n    at accessInfo (/app/genieacs/dist/bin/genieacs-cwmp:2:13069)\n    at sendAcsRequest (/app/genieacs/dist/bin/genieacs-cwmp:2:127832)\n    at applyPresets (/app/genieacs/dist/bin/genieacs-cwmp:2:124675)\n    at process._tickCallback (internal/process/next_tick.js:68:7)"
cloud-acs_1  | 2020-08-07T17:03:33.158Z [ERROR] Worker died; pid=77 exitCode=0

Any ideas on what's happening here? Is this a known issue?

Note: those logs come from docker-compose; the application is running in a Docker container (Alpine) on a t2.micro EC2 host (Ubuntu 18.04). There are 3 nodes behind a TCP load balancer, and TLS is terminated in GenieACS itself.

I have confirmed this also happens with a single server (not behind the load balancer). I usually see this error within 24 hours of the server going live with active connections (fewer than 100 devices attempting to connect).

I had a similar issue. I had to upgrade to version 1.2 and it disappeared.

This happens to me on current master builds as well. Smells like a state management bug (trying to write to a socket or stream that no longer exists). Since this surfaces as an uncaught exception, on modern versions of Node the worker process exits.
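For what it's worth, the failure mode is easy to demonstrate in isolation. Here's a minimal, hypothetical sketch (plain Node/TypeScript, not GenieACS code) of a write to a destroyed stream turning into exactly this kind of uncaught exception:

// repro.ts (hypothetical): write to a stream after it has been destroyed.
// The write emits an 'error' event; with no 'error' listener attached,
// Node raises an uncaught exception and the worker dies, as in the logs above.
import { createWriteStream } from "fs";

const accessLog = createWriteStream("/tmp/access.log", { flags: "a" });

accessLog.destroy();                              // simulate the stream going away
accessLog.write("203.0.113.7 cwmp inform 200\n"); // -> Error [ERR_STREAM_DESTROYED]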

Just noting here that this bug still exists in v1.2.3. It happens roughly 3 times per 24-hour period on average with ~13k CPEs on a single server. My organization is moving on and building our own ACS solution due to the stability and deployment issues we’ve seen with Genie, but I figured I’d raise this here in case you want more debug information to help you solve the issue. (It seems at least a few others are seeing this problem.)

Happy to help provide more info if you need it.

Here’s the most recent stack trace

cloud-acs_1  | 2020-11-09T19:24:59.367Z [ERROR] Uncaught exception; pid=922 exceptionName="Error" exceptionMessage="Cannot call write after a stream was destroyed" exceptionStack="Error [ERR_STREAM_DESTROYED]: Cannot call write after a stream was destroyed\n    at doWrite (_stream_writable.js:399:19)\n    at writeOrBuffer (_stream_writable.js:387:5)\n    at WriteStream.Writable.write (_stream_writable.js:318:11)\n    at ae (/app/genieacs/dist/bin/genieacs-cwmp:2:10540)\n    at re (/app/genieacs/dist/bin/genieacs-cwmp:2:10587)\n    at Nn (/app/genieacs/dist/bin/genieacs-cwmp:2:116067)\n    at processTicksAndRejections (internal/process/task_queues.js:97:5)"
cloud-acs_1  | 2020-11-09T19:24:59.431Z [ERROR] Worker died; pid=922 exitCode=0
cloud-acs_1  | 2020-11-09T19:24:59.431Z [ERROR] Too many crashes, exiting; pid=10

Is this stack trace from a v1.2.3 install from npm? If not, can you share the genieacs-cwmp file? I need to extract the sourcemap from it to translate the stack trace.

The stack trace is indeed from 1.2.3 but we don’t install from npm for security reasons. We fork the repo and build it ourselves for our production container environments.

Is there an additional command or config you use to generate the TypeScript sourcemaps? Our build process currently follows the readme step of npm run build.

I could add "sourceMap": true to the tsconfig if that's the desired approach.

The sourcemaps are inlined within the executable file. Extract them like this:

tail -n 1 dist/bin/genieacs-cwmp

It can be quite large, so you can email it to me if you prefer.
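If you'd rather translate the frames yourself, something along these lines should work. It's only a rough sketch: it assumes the last line of the bundle is a base64 data-URL sourceMappingURL comment and that the source-map package (0.6.x, synchronous API) is available.

// translate-frame.ts (hypothetical helper, not part of GenieACS)
// Usage: ts-node translate-frame.ts dist/bin/genieacs-cwmp 2 127832
import { readFileSync } from "fs";
import { SourceMapConsumer } from "source-map";

const file = process.argv[2];
const line = Number(process.argv[3]);
const column = Number(process.argv[4]);

// The inline sourcemap is the base64 payload on the last line of the bundle.
const lastLine = readFileSync(file, "utf8").trimEnd().split("\n").pop() || "";
const rawMap = JSON.parse(
  Buffer.from(lastLine.replace(/^.*base64,/, ""), "base64").toString("utf8")
);

// Map the generated line:column from the stack trace back to the original source.
const consumer = new SourceMapConsumer(rawMap);
console.log(consumer.originalPositionFor({ line, column }));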

It wasn't too large, only 250 KB: https://filebin.net/f9hqb373sglee0zz/genieacs-cwmp.sourcemap?t=ya8n1kvh

Thanks. Looking at the trace I can see the exception is getting thrown because the access log stream is no longer available for some reason. Do you have your logs configured to go to stdout or to a file? If it’s the latter, what’s your logrotate config like?

All logs currently go to stdout. Because we use systemd to keep the container up, those end up in syslog downstream from the application, but the application itself only writes to stdout.

Do you have any custom patches applied? Are you by any chance setting the environment variable GENIEACS_CWMP_ACCESS_LOG_FILE to something like /dev/stdout?

We do have a small patch applied to remove a mongo privilege escalation security risk: https://github.com/StarryInternet/genieacs/commit/e2861d63bb95cbbdcb7c0ff13135e2869b26fd3e

We have this deployed against a shared DB cluster, and as a result we don't give admin privileges to applications (which you need in order to run that command to get hostInfo). I commented on this problem in another thread: Remove dependency from db.hostInfo()

As for environment variables, these are the only ones we’re setting:

GENIEACS_CWMP_SSL_CERT
GENIEACS_CWMP_SSL_KEY
GENIEACS_UI_SSL_CERT
GENIEACS_UI_SSL_KEY
GENIEACS_MONGODB_CONNECTION_URL

Hmm, so I suspect systemd/journald is terminating the stdout stream for some reason. I think some applications simply stop writing to stdout once it's closed, but I don't think that would make for a good fix here. The access log is, IMO, too important to silently ignore.
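For clarity, the kind of workaround I mean would look roughly like this (a hypothetical sketch, not something GenieACS does). It silently drops access-log lines once the stream is gone, which is exactly why I don't like it:

// Hypothetical defensive wrapper around the access-log stream.
// Without an 'error' listener, EAGAIN / EPIPE / ERR_STREAM_DESTROYED surface
// as uncaught exceptions and the worker dies, as in the logs above.
function tolerantAccessLog(stream: NodeJS.WriteStream): (line: string) => void {
  stream.on("error", (err) => {
    process.stderr.write(`access log write failed, dropping line: ${err.message}\n`);
  });
  return (line) => {
    if (stream.destroyed || !stream.writable) return; // stream already gone: drop silently
    stream.write(line);
  };
}

const accessLog = tolerantAccessLog(process.stdout);
accessLog("203.0.113.7 cwmp inform 200\n");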

Note that Genie uses stdout for the access log and stderr for process-related logs (e.g. service start, exceptions, etc.). My recommendation is to write the access log to a file and keep stderr in journald. Access logs don't belong in journald anyway.

That explanation doesn’t seem likely for a few different reasons:

  1. we have hundreds of processes deployed this way: container applications that write all their logs to stdout while systemd watches the Docker output. No problems in the 6 years since we founded the ISP (this deployment is identical to how we deploy everything else).

  2. even if systemd decided to close the container stream it's listening to, the application's access to stdout doesn't change, because it's namespaced inside that container. Even if a close() happens on the stream that systemd is using, that call wouldn't affect the stream the application is using in the container. They are entirely different file descriptors, which you can verify via lsof (or by reading /proc/self/fd from inside the container, as sketched below).
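As a quick sanity check from inside the container, you can also ask the process itself what fd 1 points at (a hypothetical snippet):

// check-stdout.ts (hypothetical): print what this process's fd 1 is connected to.
// On Linux, /proc/self/fd/1 is a symlink to the underlying pipe/socket/tty;
// the descriptor itself belongs to the container process, not to systemd on the host.
import { readlinkSync } from "fs";
console.log(readlinkSync("/proc/self/fd/1")); // e.g. "pipe:[123456]"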

See https://www.freedesktop.org/software/systemd/man/systemd-journald.service.html#Stream%20logging

You can kind of reproduce this issue by stopping and then immediately starting journald (but not ‘restart’, as described in the link above). I just tested that myself, and the genieacs-cwmp process crashes as soon as I start the client simulator after having restarted journald. I don't even get to see the exception log, though, because both stdout and stderr get terminated in this case.

It’s possible that the sheer volume of logs coming from the process is more than what journald is willing to handle so it just terminates the stream. This looks like it might be related: https://github.com/coreos/bugs/issues/990

Again, we have dozens of production services logging orders of magnitude more data than Genie does without this issue (including access logs for 400 million requests per day, all to stdout), so beyond conjecture I don't see any real evidence that systemd/journald is the problem here.

I'll do some more digging here to see if I can find anything more helpful.