Skip to main content

Engineering by Roomba’ing Around

A simple random walk algorithm:
  • Start out systematically
  • Hit an obstacle
  • Change direction
  • Hit another obstacle
  • Change direction
  • Eventually cover the problem space.

As applied to the problem of cleaning a floor, the algorithm seems to work OK, particularly if you are willing to ignore the parts of the problem space that the device cannot solve (corners, low furniture, complex spaces).

I sometimes see similar algorithms used by IT engineers. They start out systematically, hit an obstacle, head off in a random direction, hit an obstacle, head off in a different direction, and (usually) solve the problem (eventually). Unfortunately many IT engineers troubleshoot this way.

It could be worse – Some engineers start out systematically, hit an obstacle, and instead of changing direction, they just keep on banging into the obstacle. They haven’t figured out that even a random direction change is better than no direction change.

I also see IT engineers ignore the problems that their tool or project cannot solve. Roomba owners presumably understand that the device has limitations and they compensate for those limitations by using conventional cleaning techniques in places where the Roomba has no coverage. With a Roomba, the house isn’t clean until the corners are covered. With IT projects, I sometimes see the project declared a success long before the corners are covered, and I often see requirements that are not easily addressed by a particular tool, technique or workgroup either ignored or arbitrarily declared ‘out of scope’. (I.E. ‘we cannot solve this problem in the allotted time/budget, therefore the problem doesn’t exist.’)

Over the last month or so, Sun Oracle graciously provided us with another opportunity to learn far more about ZFS than we really ought to need to know. While getting sucked into debugging another obscure issue, I tried occasionally to step back and ask if we were systematically troubleshooting or just Roomba’ing around.

Sometimes it’s hard to tell.


Popular posts from this blog

Cargo Cult System Administration

Cargo Cult: …imitate the superficial exterior of a process or system without having any understanding of the underlying substance --Wikipedia During and after WWII, some native south pacific islanders erroneously associated the presence of war related technology with the delivery of highly desirable cargo. When the war ended and the cargo stopped showing up, they built crude facsimiles of runways, control towers, and airplanes in the belief that the presence of war technology caused the delivery of desirable cargo. From our point of view, it looks pretty amusing to see people build fake airplanes, runways and control towers  and wait for cargo to fall from the sky.
The question is, how amusing are we?We have cargo cult science[1], cargo cult management[2], cargo cult programming[3], how about cargo cult system management?Here’s some common system administration failures that might be ‘cargo cult’:
Failing to understand the difference between necessary and sufficient. A daily backup …

Ad-Hoc Versus Structured System Management

Structured system management is a concept that covers the fundamentals of building, securing, deploying, monitoring, logging, alerting, and documenting networks, servers and applications. Structured system management implies that you have those fundamentals in place, you execute them consistently, and you know all cases where you are inconsistent. The converse of structured system management is what I call ad hoc system management, where every system has it own plan, undocumented and inconsistent, and you don't know how inconsistent they are, because you've never looked.

In previous posts (here and here) I implied that structured system management was an integral part of improving system availability. Having inherited several platforms that had, at best, ad hoc system management, and having moved the platforms to something resembling structured system management, I've concluded that implementing basic structure around system management will be the best and fastest path to…

The Cloud – Provider Failure Modes

In The Cloud - Outsourcing Moved up the Stack[1] I compared the outsourcing that we do routinely (wide area networks) with the outsourcing of the higher layers of the application stack (processor, memory, storage). Conceptually they are similar:In both cases you’ve entrusted your bits to someone else, you’ve shared physical and logical resources with others, you’ve disassociated physical devices (circuits or servers) from logical devices (virtual circuits, virtual severs), and in exchange for what is hopefully better, faster, cheaper service, you give up visibility, manageability and control to a provider. There are differences though. In the case of networking, your cloud provider is only entrusted with your bits for the time it takes for those bits to cross the providers network, and the loss of a few bits is not catastrophic. For providers of higher layer services, the bits are entrusted to the provider for the life of the bits, and the loss of a few bits is a major problem. These …