You Need To Monitor Your Monitoring Service – A story about server failure

You Need Failover

If you are a self-hosting enthusiast like me, you have probably already heard of Uptime Kuma. It is fantastic. But you should not rely on a single Kuma. Run two Kumas, or monitor your Kuma with UptimeRobot. This is a story about redundancy and why everyone should have it. In the heat of the moment I forgot to take screenshots; I am sorry about that.

The plot of this story includes a server failure. And yes, I am talking about a VPS here, one of those computers that float in the cloud and that you pay to rent. Yes, it happens. It happened to me with an Oracle Cloud instance. And yes, I managed to restore it quite easily. How I managed that will be another story. Here I am going to tell you what I learned about dual redundancy.

A preview of my setup:

At that point I had 3 VPS instances and more than 20 online services, some of which are public facing (yes, my IP-Checker service and No-Number-Whatsapp service, both ad-free). On one of the instances I was running my own Pi-hole in the cloud, and all my devices connected to it using WireGuard (in a split tunnel for DNS traffic only). I was also using that WireGuard VPN instance as a gateway to SSH into my public-facing VMs for better security. So when I learned about Uptime Kuma, I installed it on the same instance that was running WireGuard and Pi-hole. Don’t worry, the instance was running fine (CPU at 6%, RAM at 40%).

Pi-hole + WireGuard

As I was running my combined VPN and monitoring server in the cloud, I was pretty confident that the hardware was not going to fail. But boy, was I wrong. I had set up my Kuma to check all my servers every 20 seconds and, if something went down, to send me an email and a Telegram message immediately. The setup was working great.

What happened:

But one fine morning it happened. I was not using my phone, so I did not notice, but my mom said that her internet was not working. I figured out that my VPN was down. I tried to SSH into the VM but got no response. I could not figure out why. I opened my Oracle Cloud dashboard and saw a strange thing: it said the VM was running, but the graphs told a different story. CPU usage spiked from a baseline of 6% to a maximum of 66%, and then the trace disappeared as if the server had stopped. All the traces, including CPU, RAM and network, just vanished, which indicated a server shutdown. I tried to stop the VM manually from the dashboard, but that failed. It was a hardware failure. A very rare thing, but it definitely happens. Anyway, I will tell you about it in another post. In the end, I lost my VPN connection, my internet and my monitoring server.

Switching off the VPN on the devices restored their internet connection, and of course I had some WhatsApp messages waiting from that time period. Nothing important, though. I do not know how long my internet was down. My Kuma was down too, so I received no notification of its death from Kuma itself either.

CPU Usage graph

What I learned:

Of course, I learned nothing and just went back to scrolling my Twitter feed as if nothing had happened. No, no, I am just joking. First I restored my service. I also learned the value of redundancy in critical infrastructure. Here is what I did, and what I suggest you do:

  • I launched another VM for a second Pi-hole and WireGuard combo. I switched half of my devices to the secondary Pi-hole, so that at least one of my devices still receives the email and Telegram notifications when something like this happens again.
  • I was too lazy to set up a second Uptime Kuma to monitor my first Uptime Kuma, so I created a free account with UptimeRobot. It checks my monitoring server every 5 minutes. I also have a better understanding of disaster recovery now, so recovery will be faster next time.
  • Set up a good monitoring system. Don’t just use ping to monitor your stuff. There are better options. In Uptime Kuma you can monitor HTTP(S) responses and SSL certificate expiry, and the best one is keyword monitoring: you can literally check whether a particular word is loading on your website.
  • Oracle says that it detects hardware failure within 5 minutes and migrates the services to new servers automatically. That was not the case for me. So be prepared to handle it yourself. Don’t blindly trust the cloud providers, especially if you are on the free tier.
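
To make the keyword monitoring idea concrete, here is a minimal Python sketch of what such a check does. This is not how Uptime Kuma or UptimeRobot implement it; the function names `evaluate` and `probe` and the URL in the usage note are hypothetical, made up for illustration. The point is that a ping only tells you the machine answers, while this tells you the page actually rendered the word you expect.

```python
import urllib.error
import urllib.request


def evaluate(status, body, keyword):
    """Decide whether one probe result counts as healthy.

    Returns (ok, reason). Healthy means HTTP 200 AND the keyword
    appears somewhere in the response body.
    """
    if status != 200:
        return False, f"unexpected HTTP status {status}"
    if keyword not in body:
        return False, f"keyword {keyword!r} not found"
    return True, "ok"


def probe(url, keyword, timeout=10.0):
    """Fetch url once and check it for keyword (hypothetical helper)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            body = resp.read().decode("utf-8", errors="replace")
            return evaluate(resp.status, body, keyword)
    except urllib.error.HTTPError as exc:
        # The server answered, but with an error status such as 500.
        return False, f"unexpected HTTP status {exc.code}"
    except (urllib.error.URLError, OSError) as exc:
        # DNS failure, refused connection, timeout: the dead-server case.
        return False, f"request failed: {exc}"
```

You could call something like `probe("https://status.example.com", "Uptime Kuma")` from cron on a different machine than the one running your monitor and send yourself a message when it returns `False`. Running it on the same machine defeats the purpose, which is exactly the mistake I made.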

In summary, I just want to say: monitor your stuff. Have some redundancy for critical infrastructure. Have a solid disaster recovery plan. And of course, comment down below about what you liked and what you did not like, and don’t forget to share this with like-minded people.

FAQ Section

  1. Why I use Pi-hole: Yes, I also depend on ad revenue. But most news outlets show more ads than content. I usually pay for a premium subscription whenever one is available. And for a small blog with good content, I disable my Pi-hole for a bit.
  2. Yes, I now have 4 VPS instances and I run my Pi-holes on them. That is because all my VPS instances are free tier and I don’t pay for them. Also, my internet connection is behind CG-NAT, so a VPS works better for my setup.
  3. Why Uptime Kuma: First, I don’t have the money to buy a premium plan for UptimeRobot. This blog itself is self-hosted, here is how. And second, if you don’t use Uptime Kuma I will not consider you a self-hosting enthusiast. It is awesome, man!

I will write a post about my experience with server failure and disaster recovery. I will also write more about my setups if you want to know.

Thanks for reading this far. I hope you enjoyed it. Have a nice day. To know more about me, visit here.

Also, I am new to writing blogs. So please help me improve by commenting below on the things you want me to do better next time. Thank you.
