Friday, December 31, 2021

Murphy's Law: Or why in the world did that have to go wrong?

 

     There is a "law" of the universe (similar to Newton's laws or the Department of Motor Vehicles' laws) that is known as "Murphy's Law" (or Murphy's first law; Murphy was so infamous that it was decided to tack on two additional "laws" to his name). All apologies to anyone named Murphy out there. (Actually, the roots of the adage are uncertain, but Murphy gets primary credit in US society.)

     So, why end the year's blog posts with one concerning "anything that can go wrong, will go wrong"? Some of you may have noticed a period of inactivity from me since mid-November 2021 (most probably didn't notice at all -- oh well). (Do, please, feel free to leave comments or suggestions for future topics.)

     Why end the year with this topic? Well, I certainly HOPE everyone had a fantastic year. For those who didn't ...

     Around November 20, the hard disk on my primary computer went bad (SMART tests started failing). Actually, that was only the start of analysis. It took a week to find out why it was failing, another week to decide how to approach the problem (repair or replacement -- I decided on replacement). "Supply chain issues", "chip shortages", etc. etc. It took a while to get the new computer. So, now to just use my backup drive to restore the system onto the new machine! Piece of cake, right?

     Nope. That backup hard disk, which was behaving fine (and happily responding to diagnostic programs) prior to the primary hard disk's collapse, now decided not to mount. Checks were performed. New tools were obtained. File system in great shape. Partition/volume test results erratic -- sometimes they passed, sometimes not. It mounted once -- on a different machine which I was using to perform tests on the hard disk. But I had a technician on the phone and they felt it would be useful to try one more test -- and that test cost me the one successful mount.

     Then, early this week, I performed an OS upgrade on my new machine. While updating, it took a left turn and had an unexpected crash/reboot. When it came back up, the backup disk was mounted. No known reason -- any more than there was a known reason why it STOPPED mounting.

     I started the process of restoration. It took a LOONG time. USB-C transfer rates were as slow as 3 KB/sec at times. But it kept moving (the data transfer rate fluctuated from 3 KB/sec to as much as 1.3 MB/sec) and, after about 28 hours, my system was restored to almost the same state as it was in on November 20.
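     For the curious, here is a back-of-the-envelope sketch of what those rates imply. It only bounds how much data could have moved in about 28 hours if the transfer had stayed at the slowest or the fastest observed rate the whole time. The function name and the decimal units (1 KB = 1000 bytes) are my own assumptions for illustration; the actual size of the restore is not something this calculation can tell you.

```python
# Rough bounds on data moved during a ~28 hour restore.
# Assumption: decimal units (1 KB = 1000 bytes, 1 MB = 1000 KB).
KB = 1000
MB = 1000 * KB
GB = 1000 * MB


def bytes_moved(hours: float, rate_bytes_per_sec: float) -> float:
    """Bytes transferred in `hours` at a constant rate (hypothetical helper)."""
    return hours * 3600 * rate_bytes_per_sec


slowest = bytes_moved(28, 3 * KB)    # whole time at 3 KB/sec
fastest = bytes_moved(28, 1.3 * MB)  # whole time at 1.3 MB/sec

print(f"{slowest / GB:.2f} GB")  # 0.30 GB
print(f"{fastest / GB:.2f} GB")  # 131.04 GB
```

     In other words, the real total sat somewhere between those two extremes, and the wide gap shows how much the slow stretches hurt.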

     Lessons learned? No, not really. I had done regular backups. How many of you are making backups of your backups? What is the chance of a primary hard disk and the primary backup disk failing at virtually the same time? Before it happened, that chance was awfully small. As always, however, once it had occurred, that chance was 100%. If you are in that small minority who can keep track of such things (I am not sure I am), then alternating disks for backups might be wise.
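     To put "awfully small" into rough numbers, here is a minimal sketch. Everything in it is an assumption for illustration: the 2% annual failure rate is made up (it is not a measurement of my disks), the two disks are treated as failing independently (which real disks sharing an age, a desk, and a power strip may not do), and the function names are hypothetical. The last function shows one simple way to alternate between backup disks.

```python
def window_failure_prob(annual_failure_rate: float, window_days: float) -> float:
    """Chance a disk fails within `window_days`, given an assumed annual rate."""
    return 1 - (1 - annual_failure_rate) ** (window_days / 365)


def both_fail_prob(afr_primary: float, afr_backup: float, window_days: float) -> float:
    """Chance both disks fail in the same window, ASSUMING independent failures."""
    return (window_failure_prob(afr_primary, window_days)
            * window_failure_prob(afr_backup, window_days))


def disk_for_backup(backup_number: int, disk_labels: list[str]) -> str:
    """Pick which disk gets the Nth backup by simple round-robin rotation."""
    return disk_labels[backup_number % len(disk_labels)]


# Hypothetical numbers: 2% annual failure rate for each disk, one-week window.
print(both_fail_prob(0.02, 0.02, window_days=7))

# Alternate between two (hypothetical) backup disks, "A" and "B".
for n in range(4):
    print(n, disk_for_backup(n, ["A", "B"]))
```

     With those made-up inputs the result is on the order of one in several million. The independence assumption is the weak spot, and alternating disks helps precisely because it means the loss of any one backup disk costs only one backup cycle.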

     Could I have continued work using alternative tools (tablet/smartphone/borrowed computers)? Sure, but there were a lot of things in my usual environment that just were no longer there. One of the things about which I had been saying "thank goodness" really wasn't true: the cloud backup of some vital files actually held different copies of some files, which were accessed differently under different user accounts. Any work that I did in the meantime was in danger of being overwritten if/when the restore did succeed. I had just barely decided to "go full speed ahead" when the backup disk decided to mount.

     Other things that fall into Murphy's category: The lower heating element of our oven broke (literally) while cooking Christmas breakfast. In our bathroom, the outlet stopped working, the main circuit breaker had NOT flipped, and the apparently affected switch did not have a GFI reset button -- I finally tracked down the outlet that DID have a tripped GFI, with reset button, this week. (Are you aware that a tripped GFI cuts power to ALL of the outlets wired downstream of it on the circuit -- not just the specific outlet? You probably are -- I wasn't.)

     In logistics, resource planning, and business procedures there is always an attempt to allow for things going wrong. But just what specifically goes wrong makes a big difference, and it is safe to assume that, if you have prepared for 98% of the likely problems, you will eventually get one in the 2% not taken care of. This is why recovery plans are important to have in place -- your prevention methods will not always succeed. This is a given in the world of cybersecurity. "Bad guys" are always going to be searching for newly exploitable vulnerabilities, and they will find them. Noticing, stopping, and recovering from attacks will always be needed. (But keep trying to protect, or eliminate, the vulnerabilities anyway.)

Interrupt Driven: Design and Alternatives

       It should not be surprising that there are many aspects of computer architecture which mirror how humans think and behave. Humans des...