I know this is not a GenieACS problem, but if any GenieACS user who also runs MikroTik could help me here, I would appreciate it very much!
Our ISP is testing MikroTik devices as customer home routers. We currently have 60 hAP ac2 devices (ROS 6.47.7) running, and we manage them through TR-069.
Every time I change the config script and push it to our 60 devices, a small percentage (usually 2-3 devices) don’t boot up. The devices that become unresponsive are different each time, and the logs are always the same. These are the last log lines I get from an affected device (the log is identical on devices that boot up properly):
2020-11-18 04:05:29 -0700 MST system,info resetting system configuration
2020-11-18 04:05:29 -0700 MST tr069,info performing config overwrite
Then it fails to boot, and the customer has to physically power-cycle the device to bring it back. The watchdog doesn’t seem to trigger, because no matter how long we wait, the device won’t reboot by itself.
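Since the affected devices never recover on their own, it may help to flag them automatically instead of waiting for customer calls. Below is a minimal Python sketch, assuming you can export, per device, the time the overwrite was delivered and the time of the most recent inform seen by the ACS. The function name, the device IDs, and the 15-minute grace period are all hypothetical.

```python
from datetime import datetime, timedelta, timezone


def find_stuck_devices(pushed_at, last_inform, now, grace=timedelta(minutes=15)):
    """Return sorted IDs of devices that have not informed since their push.

    pushed_at   -- dict: device ID -> time the config overwrite was delivered
    last_inform -- dict: device ID -> most recent inform time seen by the ACS
    now         -- current time (passed in so the check is easy to test)
    grace       -- how long a device gets to reboot before it counts as stuck
    """
    stuck = []
    for device_id, push_time in pushed_at.items():
        if now - push_time < grace:
            continue  # still inside the grace period, too early to judge
        seen = last_inform.get(device_id)
        if seen is None or seen <= push_time:
            stuck.append(device_id)  # no inform after the overwrite
    return sorted(stuck)


# Hypothetical data, using the timestamp from the log lines above.
MST = timezone(timedelta(hours=-7))
push = datetime(2020, 11, 18, 4, 5, 29, tzinfo=MST)
pushed_at = {"hap-001": push, "hap-002": push, "hap-003": push}
last_inform = {
    "hap-001": push + timedelta(minutes=2),  # came back after the overwrite
    "hap-002": push - timedelta(seconds=5),  # last seen just before the overwrite
}
print(find_stuck_devices(pushed_at, last_inform, now=push + timedelta(hours=1)))
# prints ['hap-002', 'hap-003']
```

This only detects the failure; it does not explain or prevent it. It would at least let you contact the 2-3 affected customers proactively after each push.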
When I was testing in the lab, I had some episodes where a reboot or a config overwrite made the device unresponsive (no Winbox, no SSH, nothing at all until I unplugged it and plugged it back in), but it seemed random and I thought it was a problem with that specific device.
I really don’t know whether the crash happens while the device is shutting down or while it is booting up.
Does anyone have any ideas?
Something we have encountered a few times is that, for unknown reasons, devices sometimes boot up with a completely blank config after a config overwrite. In such cases, we are able to MAC telnet into the device from a MikroTik router in the same broadcast domain and reset the configuration to factory defaults, and the device then usually reapplies the GenieACS config overwrite. I haven’t tried a simple reboot in that case to see if the device comes back on its own. I always assumed that the device didn’t download the config file completely, tried to apply an empty or incomplete file, and ended up with a blank configuration. I’m not sure this is the same problem you are having, and our routers are running an older version (6.44.5), so I’m not sure how helpful it is.
What I might suggest is that you avoid doing overwrites for global changes to existing configurations and use configuration alterations (.alter files) for that instead. It is much safer that way. We only do a config overwrite when a device that still has the stage 1 boot loader default connects to the ACS and needs to download the full-blown config.
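The policy described above can be written down as a small decision function. This is only a sketch in Python: the `PushKind` enum, both function names, and the `.rsc` suffix for overwrite files are assumptions made for illustration; only the `.alter` suffix comes from this thread.

```python
from enum import Enum


class PushKind(Enum):
    OVERWRITE = "overwrite"  # full reset plus complete config
    ALTER = "alter"          # incremental change on top of the running config


def choose_push_kind(device_is_at_stage1_default: bool) -> PushKind:
    """Overwrite only freshly bootstrapped devices; alter everything else."""
    if device_is_at_stage1_default:
        return PushKind.OVERWRITE
    return PushKind.ALTER


def config_filename(base: str, kind: PushKind) -> str:
    """Build the file name to serve for the chosen push kind."""
    # ".alter" is the suffix mentioned in this thread; ".rsc" is an assumption.
    suffix = ".alter" if kind is PushKind.ALTER else ".rsc"
    return base + suffix


print(config_filename("home-router", choose_push_kind(False)))
# prints home-router.alter
```

How you detect that a device is still on the stage 1 default is up to your provisioning logic; the boolean here is just a placeholder for that check.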
I’m happy to talk to someone about this problem.
As you said, I’m not sure we have the same issue, because I’m not able to MAC telnet to the device after this failure (this happens every so often on my lab MikroTik, and I see no sign that the device came back, even with a blank config).
Thanks for your suggestion. Indeed, .alter is less scary/aggressive when you know these kinds of config failures can happen. I would need to give this some thought, because I designed our “update push system” (we don’t use Genie’s UI; we integrated it into our intranet instead) around complete reconfigs. That way customers can easily jump from very old configs to the newest one.
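One way to keep the "jump from a very old config to the newest one" property while using .alter files would be to chain incremental steps. Here is a minimal Python sketch, assuming every config release has a version number that is recorded per device and that each .alter file upgrades exactly one version to the next. The version numbers, file names, and `plan_alter_chain` itself are hypothetical.

```python
def plan_alter_chain(current_version, latest_version, steps):
    """Return the ordered .alter files that take a device to latest_version.

    steps -- dict: from_version -> (to_version, alter_file)
    Raises ValueError if the chain is broken or loops back on itself.
    """
    plan = []
    version = current_version
    seen = {version}
    while version != latest_version:
        if version not in steps:
            raise ValueError(f"no alter step defined from version {version!r}")
        version, alter_file = steps[version]
        if version in seen:
            raise ValueError(f"alter steps loop back to version {version!r}")
        seen.add(version)
        plan.append(alter_file)
    return plan


# Hypothetical release history: three incremental steps.
steps = {
    1: (2, "v1-to-v2.alter"),
    2: (3, "v2-to-v3.alter"),
    3: (4, "v3-to-v4.alter"),
}
print(plan_alter_chain(2, 4, steps))
# prints ['v2-to-v3.alter', 'v3-to-v4.alter']
```

If no chain exists for a device (the `ValueError` case), the push system could still fall back to a full overwrite for that one device, which would at least reduce how often overwrites happen across the fleet.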
It would be very nice, though, if MikroTik were more reliable when applying TR-069 reconfigs. It seems to me that this is something that should work every time, even if you don’t do it very often. I just don’t know what to do about it.
Anyway, thanks for answering.