Was hoping to post about more exciting things that have happened but for the moment, our servers are down due to unplanned outage. We will post an update when we have the servers backup and running along with a technical summary.
Update: Back online and as promised an explanation.
The problem has a number of contributors mostly consisting of stupid mistakes that should not have happened but lessons well learned.
First a little background:
Main login server authenticates off an ldap server so we can do central authentication with many different services (lots of logins, one password/user data etc). SSH servers run nslcd, a daemon that queries nss details from an ldap server (such as username, uid, shell, home directory etc). NSCD (different to nslcd), is a caching daemon that caches results from the ldap server.
Factors in downtime:
1:NSCD was turned off (caching daemon). By itself wasn’t that big an issue but still not recommended. The correct response was to turn it off only while debugging or to force the cache to update when we need it to update.
2: Debugging was enabled on the ldap server to a separate non syslog file with excess debugging information. I had turned it on to debug an issue and just left it on by accident.
3: /var/log or even /var /was not in a separate directory. Not anticipating logs to grow that fast before logrotate took care of it was my fault. openldap was logging directly to a file which didn’t go through syslog, getting around that point. I imagine that even if we had separated /var, it would have still caused issues as slapd (openldap server), still stores its real files for non special databases in /var/lib/ldap.
4: Even with all of these issues, it was processes on our main login server that started the ball rolling. When we tested it ourselves, because we weren’t causing the server to lookup much, we didn’t see much but as soon as people started using it and we started getting more members, even doing a simple ls -al of the home directories, causes the server to do 100s of queries, each for specific uidNumbers due to the amount of users and then of course, each user doing similar commands etc etc, you can see how it went.
Investigation
We didn’t actually know what the problem was at the start. The possible candidates came to mind: someone was brute forcing the server, someone was scraping or rapidly querying the webserver in such a way that caused it to query the ldap server. (wordpress is inter linked with ldap), something was misconfigured in nslcd on one of the servers, someone was querying our ldap server from inside our network excessively hard.
When I saw the hard drive was full on the ldap server, I took down the server, fired up a new vm, setup a separate /var/log partition and rebuilt the server. This was so I could properly examine the logs in case anything more malicious was at hand while letting the ldap service run. Pretty easy process with slapcat however, my documentation on rebuilding was slightly ambiguous on recent changes and caused me to take longer than I thought it would.
Mitigating the problem + future proofing it.
/var/log now has it’s own separate partition. NSCD is turned on and will continue to be tweaked to get the right cache time. Logging verbosity has been reduced to a more sensible level and only uses syslog. I’ve correctly fixed/updated the documentation on rebuilding. In the case that something happens again, i’m going to write a script to build the server in one go without prompting to save some time as well . We’re also looking into better monitoring solutions that will give us rapider response times so we can react faster.
This is a little embarrassing I admit, but I’d prefer to keep you all informed about what’s actually going on. If anyone is interested and wants to know more, hit us up.
Mark





0 Comments.