As regular readers know, I’m an advocate – often an annoyingly persistent advocate – for backing up.
As I write this, it’s been kind of a rough week here at Ask Leo! world headquarters – not for me, but for some of the organizations and people that I support in my spare time.
While the issues at hand are getting resolved – and I’ll walk through some of them as a bit of an example – it all served to remind me just how fragile our digital lives can seem at times.
And that beyond backing up, one of the most important things we can control is exactly how we react when things go wrong – because sooner or later things will go wrong.
Five days without mail
A couple of groups that I support receive all of their email through a spam-filtering service1. That means that when you send someone in that group email, it first goes to this service where it’s analyzed. Only the non-spam (or “ham” as it’s sometimes referred to) is then delivered to the group’s actual mail server and into the recipient’s email account.
On Thursday a system administrator made a very reasonable and innocuous change to the DNS for that group’s domain. DNS is the mapping of a name – like “askleo.com” – to the actual IP address of the server that domain resides on. It also includes the information on where the email destined for accounts on that domain should be delivered. The change, as I said, was, for all intents and purposes, completely justified and benign. It’s a change I could see myself making.
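To make those two kinds of DNS information concrete, here is a toy sketch in Python. It is not a real DNS lookup: the zone is just a dictionary, and every name and address in it (`example.org`, `mx.filter.example.net`, `192.0.2.10`) is a hypothetical placeholder, not the actual configuration of the groups described here. It only illustrates that one record type says where the domain's server lives and another says where its mail should go, which in a setup like this one would be the filtering service first.

```python
# Toy model of the DNS records for one domain. All names and addresses
# are hypothetical (example.org and 192.0.2.x are reserved for documentation).
ZONE = {
    "example.org": {
        "A": ["192.0.2.10"],                    # IP address of the domain's server
        "MX": [(10, "mx.filter.example.net")],  # (priority, host that accepts mail)
    },
}


def web_address(domain):
    """Return the first IP address recorded for the domain."""
    return ZONE[domain]["A"][0]


def mail_destination(domain):
    """Return the mail host with the lowest priority number (tried first)."""
    priority, host = min(ZONE[domain]["MX"])
    return host


print(web_address("example.org"))
print(mail_destination("example.org"))
```

In this simplified picture, editing either entry changes where traffic for the whole domain ends up, which is why even a small DNS change can have effects far from where it was made.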
But of course, it wasn’t benign. It had the unanticipated side-effect of confusing the heck out of the spam filter service.
Unfortunately, we didn’t realize that this change had been made until four days after it happened. Like I said, it was expected to be a completely benign change. On Thursday I got the call: “We stopped getting email!” For the next four days we worked on the problem without really knowing the root cause.
Only in a conversation with the system administrator on Monday did we both happen to mention Thursday – as the date of the change, and the date of the email stoppage. I’m not a big believer in coincidence, so it was clear we’d found our smoking gun.
But wait! There’s more!
You’d think that would be the end of it, but nope.
We got the email for that group up and running, and by Tuesday all was well.
On Wednesday I got a message from a different group that I’m also helping out (Corgi-related, this time :-) ) that mail sent to their discussion alias was suddenly bouncing.
Took a bit of investigating, but this, too, turned out to be related to the same anti-spam service. The chronology:
- Over the weekend, before we understood the root cause, the service “rebuilt” the anti-spam account for the problematic domain in the hopes that whatever was broken would get fixed.2
- An unanticipated side-effect was that two other domains associated with my billing account were “disassociated” from the domain we were working on. They became orphans. The discussion list that started bouncing was on one of those domains.
- On Wednesday, several days later, an audit process at the anti-spam service noted the orphan entries … and deleted them. Email to those domains started bouncing immediately.
Once again a phone call to the anti-spam service (they’re on speed dial by this time) and some investigative work on their part, and all is rebuilt and repaired – presumably including my bill.
Meanwhile in another corner of the world
As the events above were wrapping up I got an email from a friend for whom I am also the webmaster and occasional tech support. One day her laptop – the center of her business – wouldn’t turn on.
Now, we’re several miles and hours apart, so actually physically helping with that repair wasn’t in the cards. She did find someone local. They got the machine working, but, against all expectations, seemed unable to repair, reconstruct or otherwise reconfigure her email (Outlook) to its pre-failure condition.
She texted and emailed in a panic. Understandable, I think, because as I said – her business lives not just on her laptop, but in Outlook. Not getting that back the way it was could have some dire consequences.
I was able to gain access to her Carbonite online backup and confirm that her PST files – the repository of all her Outlook information – were present and backed up right up to the day of or before the failure. One way or another we should be able to restore her world.
I expect to run a remote session to her machine later using TeamViewer to see what that’ll take. (This’ll also be my first direct experience with Carbonite. I’m hoping to have good things to report.3)
And then 100 yards to the east
My neighbor had asked for some assistance with a printer. It was a wireless printer of some sort, but would only print from one of his two computers; interestingly enough the desktop and not the wireless laptop.
Printer setup can be amazingly complex and frustrating, and honestly is best dealt with in person. I let him know that I’d stop by some time. The problem was that I’d told him that two weeks ago, before all of the above (and a volunteer event) happened.
Finally as things calmed down I was able to have a look.
Sometimes the experience that I bring is nothing more than thinking “oh, this looks interesting, let’s poke it”. Years of experience poking things often lead me to solutions that others might not discover. The setup menu for the printer was anything but simple or user-friendly. But I poked at it enough in places that looked like they might relate, and sure enough after a few minutes we’d connected the printer to the network, the laptop to the printer, and were printing a test page.
Bullet, meet foot
On top of everything else, for a portion of the time that all the problems above were occurring, mail that I was sending out was silently going nowhere.
I’d made a change that had the unanticipated side-effect of confusing Gmail (the mail service I use) as to exactly how mail should get sent from me. The confusion resulted in the kind of failure that takes 24 hours to report, and then keeps trying for 5 days until giving up completely.
I thought everyone was just avoiding my messages. In reality my messages had gone exactly nowhere – and they wouldn’t, ever.
I’d cleanly and silently shot myself in the foot.
Fortunately I had an insight as to why I might be getting ignored by the world, and discovered and fixed the problem about 24 hours after I’d created it.
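The slow-motion nature of that failure is easier to see as a timeline. The sketch below is only an illustration in Python of the two thresholds mentioned above (a report after 24 hours, giving up after 5 days); the function name and status labels are made up for the example, and it is not a description of how Gmail or any particular mail server actually implements its queue.

```python
from datetime import timedelta

# Assumed thresholds, taken from the figures mentioned in the text.
WARN_AFTER = timedelta(hours=24)
GIVE_UP_AFTER = timedelta(days=5)


def delivery_status(time_in_queue):
    """Classify an undelivered message by how long it has been queued."""
    if time_in_queue >= GIVE_UP_AFTER:
        return "bounced"
    if time_in_queue >= WARN_AFTER:
        return "delayed, sender warned"
    return "retrying silently"


# Walk through a few points on the timeline, in hours since sending.
for hours in (1, 23, 24, 72, 120):
    print(hours, delivery_status(timedelta(hours=hours)))
```

The uncomfortable part is the first branch of the timeline: for a full day the sender sees nothing at all, which is exactly the window in which it looks as though everyone is simply not replying.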
Our fragile world
A common thread to this week is the unanticipated side effect of seemingly minor and inconsequential actions.
Yes, our world is fragile.
Our world is also complex.
While it might seem to some that people such as me can navigate it with ease, the reality is that there’s a lot of tap dancing and hand-wringing going on in the background that you often don’t see. Yes, I lost sleep this week – though not because I thought things wouldn’t get fixed, but rather because the service that I was providing was letting people down.
By far, the single most important thing you can do when things go wrong is simply not to panic. Of course, prepare beforehand, like my friend’s backup, but more importantly simply refuse to panic. I know that’s difficult, particularly when important data might be at stake, but honestly, panicking almost always makes things worse – often much worse.
Panicking never helps4.
And if there’s one thing you can count on in this fragile world, something will go wrong, go wrong, gowrong, gowr..asd1234123asd23e