Invalid session with Apache proxy

We are trying to configure Apache as a transparent proxy for the CWMP process, but are encountering “Invalid session” issues. Our Apache config looks like this:

<VirtualHost public_ip:7548>
  ServerName acs.domain.net:7548
  ProxyPass / http://127.0.0.1:7547/
  ProxyPassReverse / http://127.0.0.1:7547/
  SSLEngine on
  SSLCertificateFile "/etc/ssl/certs/domain_net.crt"
  SSLCertificateKeyFile "/etc/ssl/certs/domain_net.key"
</VirtualHost>

Before we put the transparent proxy in place, we had zero “Invalid session” errors. After, we had over 50k before we pulled the plug and reverted.

#Before:
$ zgrep "Invalid session" /var/log/genieacs/genieacs-cwmp.log-20200429.gz | wc -l
0

#After:
$ grep "Invalid session" /var/log/genieacs/genieacs-cwmp.log | wc -l
54134

Here is our Apache info:
Server version: Apache/2.4.39 (Unix)
Server built: Apr 6 2019 14:30:06

This happens when the proxy shares the same upstream TCP connection for multiple TR-069 sessions. Genie caches data for an active session in memory (tied to the connection), so you need to configure the proxy such that TCP connections are never reused. I don’t remember how I used to do that in Apache. Try the options ‘disablereuse’ or ‘stickysession’.

Thank you for your help! The solution was:

ProxyPass / http://127.0.0.1:7547/ disablereuse=On
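
For anyone landing here later: the option just goes on the ProxyPass line of the VirtualHost from the original post. disablereuse=On forces mod_proxy to close the backend connection after each request instead of pooling it, so the full config looks roughly like this (same cert paths as before):

<VirtualHost public_ip:7548>
  ServerName acs.domain.net:7548
  ProxyPass / http://127.0.0.1:7547/ disablereuse=On
  ProxyPassReverse / http://127.0.0.1:7547/
  SSLEngine on
  SSLCertificateFile "/etc/ssl/certs/domain_net.crt"
  SSLCertificateKeyFile "/etc/ssl/certs/domain_net.key"
</VirtualHost>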

What was the performance goal of using the connection itself as a cache key? This makes the application very error-prone behind abstracted layer 7 load balancers (like AWS ELB, given there’s no way to tell it not to reuse connections apart from setting a 1s idle timeout).

In order to cache the current session in process memory rather than in the database. If we decouple the TR-069 session from the TCP connection, then different RPCs belonging to one session may end up being served by different processes (or servers). That’s a pretty significant performance penalty.

Not familiar with AWS ELB. You can essentially treat genieacs-cwmp as a TCP service and use a TCP load balancer.
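
If you’re already running nginx somewhere, a minimal sketch of that approach using its stream (layer 4) module would be something like the following; the listen port and backend address are assumptions to adapt to your environment. Because the stream proxy maps each client connection one-to-one onto an upstream connection, there is no connection reuse to break the session:

# nginx.conf (top level, outside the http block)
stream {
  upstream genieacs_cwmp {
    server 127.0.0.1:7547;    # genieacs-cwmp backend (assumed address)
  }
  server {
    listen 7548;              # port the CPEs connect to
    proxy_pass genieacs_cwmp;
  }
}

Note that a plain TCP proxy like this doesn’t terminate TLS, so in this setup HTTPS would have to be handled by genieacs-cwmp itself (or not used at all).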

Thanks for the context @zaidka

For posterity’s sake, AWS ELB = Amazon Web Services Elastic Load Balancing (https://aws.amazon.com/elasticloadbalancing)

It’s a very commonly used load balancer service. It’s pretty standard to do TLS offloading/termination at this layer, which, even if you configure it as a pure layer 4 proxy, has the same upstream connection reuse issue noted above. I’d honestly be surprised if you don’t get more feedback like this in the future.

A notable workaround is to do the TLS termination in the genieacs application itself, but that’s likely less performant than doing it in something written in C like nginx (in addition to the lack of connection reuse introducing a ton of socket overhead).
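
For reference, GenieACS 1.2 can terminate TLS itself via its configuration options, if I’m remembering the option names right (the cert/key paths below are placeholders):

# genieacs.env (or however you pass environment variables to genieacs-cwmp)
GENIEACS_CWMP_SSL_CERT=/path/to/cert.pem
GENIEACS_CWMP_SSL_KEY=/path/to/key.pem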

Node uses OpenSSL for TLS/SSL, so that puts it on an equal footing with nginx performance-wise.

I’m inclined to think that the overhead is negligible in the grand scheme of things. For perspective, the TLS handshake alone probably has an order of magnitude more overhead. I could be wrong though. Either way, the performance gains we get by imposing the restriction of one session per connection are just too good to miss.

That performance claim is simply not true. There is a reason that the battle-tested way to deploy Node.js applications is to put an nginx (or similar) reverse proxy in front of the application process, both for TLS offloading as well as for serving static content when applicable. EDIT: The reason is that C is just objectively better by every measure when CPU-bound tasks are what we’re evaluating.

This article is a little old, but it’s still relevant and illustrates my point with an accompanying load test: Why should I use a Reverse Proxy if Node.js is Production-Ready? | by Thomas Hunter II | intrinsic | Medium

nginx outperformed node in request throughput by 157% without SSL, and with SSL the difference was almost 30%.

When you combine that raw throughput with all the work nginx does under the hood to make socket use more efficient, these numbers add up to a significant performance benefit at scale. If it would be helpful for you, I’d be happy to set up a test suite you could use to validate your releases in this way. Respectfully, I think keeping your application layer more stateless (e.g. not storing session data in volatile memory) would yield better performance numbers down the road if you can leverage long-established tooling/patterns in this space to do so.

As it stands right now, people have to understand implementation details of how your session state is stored in order to put a proxy in front of the application. That seems like a poor design choice if you want wider adoption of this application and better performance.

Node is still using C/C++ for TLS by means of OpenSSL. As per the article, offloading TLS to nginx yields 16% higher throughput. The difference would be much greater if Node were doing the CPU-intensive processing in JavaScript!

When serving static content, a 16% performance improvement is very respectable for sure, but I’d be surprised if that accounted for more than 1% of the total load on the server. Meanwhile, the benefit we get from being able to cache session data in process memory is easily 3x higher throughput (and that’s a conservative guesstimate). Let me know if you’d like to test that; I’ll happily prepare a patch you can apply to make Genie cache sessions in the database. I’m sure you’ll come to agree that the current approach is a reasonable compromise for that much more performance.

Thanks for engaging me on this, and I’m happy to help test that. Let me know if you need any assistance from me; I don’t expect you to drop all your current work to appease my architecture concerns here. But I am very wary of using the current app in production as it is, without being able to terminate TLS at the load balancer with reusable socket connections. Stability is the major concern.

For some context to my questions here: in my 15 years of experience building distributed systems, hitting a high-performance database (using a primary key, constant-time lookup) for a session document has never been a performance issue. For example, one of the core services I currently manage handles over 70k requests per minute; the average session lookup accounts for roughly 1ms per request, and this is done in Mongo. Respectfully, in my experience, that sub-1ms overhead does not add up to your 3x performance guesstimate.

I realize everyone’s deployment strategies aren’t homogeneous and vary depending on business concerns, but as of right now putting a standard reverse proxy in front of your application breaks its state management layer, which really does feel like a suboptimal design choice given how cheaply session data can be managed in a database.

Hello, sorry for bumping this topic again.

We’re in a situation where GenieACS is returning “Invalid session” errors when it’s deployed in Google Kubernetes Engine (GKE).

We tried the latest versions (1.2.8 and 1.2.9), we tried a lot of changes in the LB and the Kubernetes ingress class (Nginx and Traefik), and we tried changing the keepalive, TCP header forwarding, etc.

When testing the application in local Docker, everything works correctly. When it’s deployed anywhere with an LB in front, this error starts to appear once we have a few hundred devices.

We believe the problem is what was reported in this topic: GenieACS is detecting several sessions from the same source and starts giving an “Invalid session” error. Did you manage to create a patch to change this? Or is there any configuration that can be done?

Thank you very much

Hello @zaidka and @spmurrayzzz. I hope you’re doing well!

I have a similar topology and cannot figure out a workaround; I tried changing cloud providers but that didn’t help.
I noticed you were talking about a patch. Was it released? Do you have a solution for this scenario?

Thanks in advance!

Best Regards.

I ended up building a custom in-house ACS from scratch instead, as there were other benefits to owning the code ourselves beyond just fixing the design flaw noted in this thread, so I didn’t move forward with a patch for this project.

Unfortunately though, if genieacs is still using the client socket/connection reference as a key for session state management, there’s not much you can do if you have a reverse proxy doing any sort of TLS termination on ingress.

You could try the workaround I noted above in the thread, where genie does the TLS termination and you disable socket reuse in the reverse proxy. That may or may not still work; it’s been a while since I’ve worked with this project. It may also defeat the purpose of using a proxy or load balancer in the first place, depending on your use case.

Just sharing an update: we gave up on using Genie in our Kubernetes clusters due to these problems. We are creating a dedicated server with direct access (without a load balancer) and now everything is working. It will create some difficulties and changes internally, but it’s working.