Wednesday, April 30, 2008

Simple Steps to Improving Availability - Five Essential Transitions

A poorly managed high availability cluster will have lower availability than a properly managed non-redundant system.

That's a bold statement, but I'm pretty sure it's true. The bottom line is that the path to improving system availability begins with the fundamentals of system management, not with redundant or HA systems. Only after you have executed the fundamentals will clustering or high availability make a positive contribution to system availability.

Here are the five transitions that are the critical steps on the path to improved availability:

Transition #1: From ad-hoc system management to structured system management.

Structured system management implies that you understand the fundamentals of building, securing, deploying, monitoring, logging, alerting, and documenting networks, servers and applications, that you have those fundamentals in place, you execute them consistently, and you know all cases where you are inconsistent. Ad hoc system management doesn't cut it.

Transition #2: From ad-hoc changes to simple change management.

Simple change management means that you have controls around changes sufficient to determine who/what/when/why for any change to any system- or application-critical file. Changes are predicted. Changes are documented. Changes are not random, and they do not 'just happen'. A text file 'changes.txt' edited with notepad.exe and stored in c:/changelog/ is not as comprehensive as a million-dollar, consultant-driven enterprise CMDB that takes years to implement, but it is a huge step in the right direction, and it certainly provides more incremental value at less cost than the big solution.
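A change log that simple can even be captured with a one-line shell function. Here's a minimal sketch; the log path, field layout and the example entry are all illustrative, not prescriptive:

```shell
#!/bin/sh
# logchange - append a who/what/when/why record to a flat-file change log.
# The CHANGELOG path is illustrative; point it anywhere your team agrees on.
CHANGELOG="${CHANGELOG:-./changes.txt}"

logchange() {
    # $1 = system or file changed, $2 = reason for the change
    printf '%s\t%s\t%s\t%s\n' \
        "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
        "$(whoami)" \
        "$1" \
        "$2" >> "$CHANGELOG"
}

logchange "/etc/resolv.conf" "added secondary DNS server"
```

Each record carries the when (UTC timestamp), who, what and why. It isn't a CMDB, but `grep resolv.conf changes.txt` during an outage beats having nothing.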

Transition #3: From 'i dunno....maybe.....' to root cause analysis.

Failures have a cause. All of them. The 'cosmic ray did it' excuse is bullshit. Find the root cause. Fix the core problem. You need to be able to determine that 'the event was caused by .... and can be resolved by ... and can be prevented from ever happening again by ...'. If you cannot find the cause and you have to resort to killing a process or rebooting a server to restore service, then you must add instrumentation, monitoring or debugging to your system sufficient so that the next time the event happens, you will find the cause.

Transition #4: From 'try it...I think it will work..' to 'my tests show that......'

Comprehensive pre-production testing ensures that the systems that you build and the changes that you make will work as expected. You know that they will work because you tested them, and in the rare case that they do not work as expected, you will be able to determine the variation between test and production and devise a test that accommodates the differences.

Transition #5: From non-redundant systems to simple redundancy.

Finally, after you've made transitions one through four, you are ready for implementation of basic active/passive redundancy. Skipping ahead to transition #5 isn't going to get you to your availability goals any sooner.

Remember though, keep it simple. Complexity doesn't necessarily increase availability.

--Mike

Thursday, April 24, 2008

Virtualization Security - Reading List (update)

(Added and updated links) Here's a reading list of interesting posts in the virtualization+security space. There is a lot of thought going on about how virtualization in the data center affects application security. If you have VM's, you ought to be reading and thinking. Really thinking.

Start with Five Immutable Laws of Virtualization Security. Follow through to the Burton Blog for details. (Pete Lindstrom, Spire Security)

Maybe read Virtualization and Security: The Full Story (I'm still thinking about this one.) I'm done thinking. See comments. (Sara Peters, CSI)

Thoughts on auditing virtualized environments, separation of duties, and audit controls. Are your auditors going to declare all the vm's that share the same hardware cluster in scope for the audit of one of the vm's? Ours will. (Hoff, Rational Security)

Bits on PCI compliance (Hoff, Rational Security)

The operational issues surrounding virtualized servers, or - how are you going to manage the siloed operational domains of system management, network and security, when all of the above are in one box? (Hoff, Rational Security)

The performance issues that you should think about before dumping your virtualized network and security functions onto the same processors that serve up your application. Can we say context switches? I keep thinking about a spanning tree meltdown in a virtual switch. That would be amusing. Hint: There is no wire to pull. (Hoff, Rational Security)

An analysis of Patch frequency for VMWare ESX. Yep - you now have another platform to patch-manage, another patch repository, another patch management console, another set of patches to run through the patch test-QA-deploy cycle. (Ronald Oglesby and Dan Pianfetti @ GlassHouse Technologies)

A few bits on securing ESX. Really basic and obvious, but likely not followed by most vm system managers. Certainly not a substitute for separating vm's by security classification. (Amol Sarwate, SC Magazine)

A few basic rules for matching up your vm infrastructure to your security containers. Really important rules. (Rich Mogull, Securosis)

A summary of the four big issues surrounding virtualization. (Hoff, Rational Security)

And last, but not least, Theo's thoughts on virtualization.


"x86 virtualization is about basically placing another nearly full
kernel, full of new bugs, on top of a nasty x86 architecture which
barely has correct page protection. Then running your operating
system on the other side of this brand new pile of shit."


Priceless.

--Mike

ARS Technica Articles on Commodore Amiga

Jeremy Reimer of ars technica is running an interesting series of articles on the birth and death of the Commodore Amiga.

It handled graphics, sound, and video as easily as other computers of its time manipulated plain text. It was easily ten years ahead of its time. It was everything its designers imagined it could be, except for one crucial problem: the world was essentially unaware of its existence.

We pretty much laughed at the primitive video capabilities of 1985-1990 Mac's and PC's as compared to an Amiga. Those two platforms were pathetic. And when we combined a $1500 Amiga with a $1500 Video Toaster card, we had a decent video editing machine nearly as functional as dedicated editors costing far, far more. My guess is that few people today realize how far advanced that computer was compared to its contemporaries, and how long it took the rest of the industry to catch up.

I use Windows Movie Maker and iMovie to play around with video, and other than the vast improvements in storage that allow direct editing of on-disk digital video files, the capabilities of today's software aren't that much greater than those of the 20-year-old Amiga. We should really be ashamed of how little progress we've made in two decades.

Links to the series:

Part 1: Genesis

Part 2: The Birth of Amiga

Part 3: The First Prototype

Part 4: Enter Commodore

Part 5: Postlaunch Blues


Part 6: Stop the Bleeding


It's worth a read.

Thursday, April 17, 2008

Ad-Hoc Versus Structured System Management

Structured system management is a concept that covers the fundamentals of building, securing, deploying, monitoring, logging, alerting, and documenting networks, servers and applications. Structured system management implies that you have those fundamentals in place, you execute them consistently, and you know all cases where you are inconsistent. The converse of structured system management is what I call ad hoc system management, where every system has it own plan, undocumented and inconsistent, and you don't know how inconsistent they are, because you've never looked.

In previous posts (here and here) I implied that structured system management was an integral part of improving system availability. Having inherited several platforms that had, at best, ad hoc system management, and having moved the platforms to something resembling structured system management, I've concluded that implementing basic structure around system management will be the best and fastest path to improved system performance and availability. I currently place sound fundamental system management ahead of redundancy in the path to increased availability. In other words, a poorly managed redundant high availability system will have lower availability than a well managed non-redundant system.

This structure doesn't need to be a full ITIL framework that a million dollars worth of consultants dropped in your lap, but it has to exist, even if only in simple, straightforward wiki or paper based systems.

You know you have structured system management when:

You manage with scripts, not mouse clicks.
Management by mouse clicking around is certain to introduce human error into a process that needs to be error free. Configuring a dozen servers or switches with a GUI, where each config item takes a handful of mouse clicks, will inevitably introduce inconsistency into the configuration. All humans are human. All humans err. Even the best system manager will eventually err in any manual process. GUI's can be great for monitoring and troubleshooting, but for configuration - use scripts, not clicks.

You manage consistently across platforms.
The fundamentals of managing Windows, Unix, Linux, and other operating systems are essentially the same. Whatever processes you have for building, monitoring, managing, logging and auditing systems must be applied across all platforms. Databases are essentially all the same from the point of view of a DBA. Backup, recovery, logging, auditing and security can be consistent across database platforms, even if the details of implementation are not. In ad hoc system management, you have no fundamentals, or your various platforms are managed to different, unrelated standards, or to no standard at all.

You deploy servers from images, not install DVD's. Server installation from scratch is necessary for the first installation of a new major version of an operating system. That first installation gets documented, tested, and QA'd with a fine tooth comb. All your other servers of that platform get imaged from that master. And when you have a major upgrade cycle, you re-master your server golden images. I once took over the management of a platform where the twenty-odd servers clearly had been installed from whatever CD happened to be lying around, with what seemed to be random choices for the installation wizards. That wasn't fun at all.

You can re-install a server and its applications from your documentation. That implies that you understand where each server deviates from the golden image, it implies that you know what the application did to your server when you installed it, and it implies that you can re-create every change to the application since it was first installed. If the only way you can move the application to a new server is to tar or zip up an unknown, undocumented directory structure, or worse yet, you have to upgrade servers in place because you can't re-install the application and get it to work, you pretty much are in ad hoc land. This one is a tough one.

You install and configure to the least bit principle for all your devices, servers, operating systems and applications. (Read the post!)

You have full remote management of all your servers and devices. This means that your router and switch serial consoles are all connected to a console server, that your server 'lights out' boards are installed, licensed and working, that you have remote IP based KVM's on all servers that need a three fingered salute, and that your network people, when they toast your network, have out-of-band management sufficient to recover what they'd toasted. And it means that the people that need to get at the remote consoles can do it from home, in their sleep.

You build router, switch and firewall configurations from templates, not from scratch. Your network and firewall configs should be so consistent, that a 'diff' of any two configs will show a readable output. Network and firewall configuration is not a case where you want entropy.
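One way to get there is to stamp every config out of a single template, so that only the host-specific fields can ever differ. Here's a hedged sketch; the template fields, hostnames and addresses are made up for illustration:

```shell
#!/bin/sh
# Generate per-switch configs from one template so that a 'diff' of any
# two generated configs shows only the intended host-specific lines.
cat > switch.tmpl <<'EOF'
hostname %HOSTNAME%
ip default-gateway %GATEWAY%
logging host 10.0.0.50
ntp server 10.0.0.51
EOF

gen_config() {
    # $1 = hostname, $2 = default gateway
    sed -e "s/%HOSTNAME%/$1/" -e "s/%GATEWAY%/$2/" switch.tmpl > "$1.cfg"
}

gen_config sw-core-01 10.0.1.1
gen_config sw-core-02 10.0.2.1

# The diff of two templated configs is short and readable:
diff sw-core-01.cfg sw-core-02.cfg || true  # diff exits non-zero on differences
```

The diff shows exactly two changed lines - hostname and gateway - which is the point: no entropy, and any unexpected line in the diff is a red flag.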

You have version control and auditing on critical system and application configuration files. On the Windows platform, this is tough to do. On Unix-ish operating systems, network devices, firewalls and load balancers, it is trivial to do. (CVS or SVN and a couple of scripts. It is really simple, and there is no excuse for not doing it.) On databases, this means that you are using command line based tools and writing scripts to manage your databases, not the friendly, shiny, mouse-clicky, un-auditable GUI. You have version control and auditing when you can tell your security and forensics team exactly how a server, database or firewall was configured at 14:32 UTC on a Tuesday a year and a half ago. And you can demonstrate, with certainty, that that is how it was configured.
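CVS or SVN plus a cron job is the real answer. As a dependency-free illustration of the same idea, a script can archive timestamped, checksummed copies of a config file so you can answer "how was this configured at 14:32 UTC" later. The paths and the demo file are made up:

```shell
#!/bin/sh
# snapshot_config - keep an auditable, timestamped history of a config file.
# A minimal stand-in for 'cvs commit' or 'svn commit' run from cron: same
# idea, no VCS dependency. The ARCHIVE path is illustrative.
ARCHIVE="./config-history"
mkdir -p "$ARCHIVE"

snapshot_config() {
    f="$1"
    stamp=$(date -u +%Y%m%dT%H%M%SZ)
    sum=$(cksum "$f" | awk '{print $1}')
    # Timestamp + checksum in the name answers both "what did it look
    # like at time X" and "did it actually change".
    cp "$f" "$ARCHIVE/$(basename "$f").$stamp.$sum"
}

# Demo: snapshot a file, change it, snapshot the new version.
echo "PermitRootLogin no" > sshd_config.demo
snapshot_config sshd_config.demo
echo "MaxAuthTries 3" >> sshd_config.demo
snapshot_config sshd_config.demo
```

Run it from cron against your real config files and you have a crude but honest audit trail; graduate to a real version control system when you outgrow it.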

You automatically monitor, strip chart and alert on at least memory, network and disk I/O, and CPU. And you've identified and are charting at least a handful of other platform specific measurements. MRTG is your friend.

You automatically monitor, strip chart and alert on application response time and availability. Go get a web page, measure how long it takes, and strip chart it. Connect to your database, do a simple SELECT, measure how long it takes, and strip chart it.
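A first pass at this needs nothing more than a shell loop. Here's a sketch that times an arbitrary probe command and appends the result to a flat file you can strip chart later; the probes shown are placeholders for your real "fetch a page" or "run a SELECT" checks:

```shell
#!/bin/sh
# probe_time - run a health probe, log timestamp, status and duration.
# Swap the probe for 'curl -s -o /dev/null http://yourapp/' or a one-row
# SELECT through your database CLI; the probes below are placeholders.
LOG="./response-times.log"

probe_time() {
    start=$(date +%s)
    if "$@" >/dev/null 2>&1; then status=OK; else status=FAIL; fi
    end=$(date +%s)
    printf '%s %s %ss %s\n' \
        "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$status" "$((end - start))" "$*" >> "$LOG"
}

probe_time sleep 1
probe_time true
```

This records with one-second resolution; on GNU systems `date +%s%N` gets you milliseconds. Point MRTG or any strip-charting tool at the log and you have response-time trending for free.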

You have a structure and process for your documentation, even if it is a directory on a shared drive, or better yet, a wiki. Your documentation starts out with a document on 'how to document'.

You have change management and change auditing sufficient to determine who/what/when/why on any change to any system or application critical file.

You have a patch strategy, consistently applied across applications and platforms. You know what patch & rev each device and server is at, and you have a minimal range of variations between patch revs on a platform. You understand your platform vendors recommendations, you know how well they regression test, and you know how much risk is associated with patching each platform. You've had the 'old and stable' vs. 'patch regularly for security' arguments, and you have picked a side. It doesn't matter which side. It matters that you've had the arguments.

You have neat cabling and racks. You cannot reliably touch a rack that is a mess. You'll break something that wasn't supposed to be down, you'll disconnect the wrong wire and generally wreak havoc on your production systems. Tie wraps, coiled cables, labeled cables, color coded cat 5, color coded power cables, and cable databases are simple, easy and free to implement. I'll bet a pitcher of brew that if your racks are a mess, so are your servers. Read this post from Data Center Design.

You determine root cause of failures, outages and performance slowdowns more often than not. In the cases where you determine root cause, you take action to mitigate or eliminate future occurrences of that event. In cases where you do not determine root cause, you implement logging, diagnostics or measurements so that the next time the event occurs, you will have sufficient information to determine root cause. Every failure that is not tracked to a cause should result in changed or improved instrumentation or logging. Rebooting is NOT a valid troubleshooting technique. When you restart/reboot, you lose all possibility of gathering useful data that can be used to determine root cause and prevent future occurrences. If you are re-booting, you'd better have gotten a kernel dump to send to the vendor. Otherwise you wasted a re-boot and a root cause opportunity.

You have centralized logging, alerting, log rotation and basic log analysis tools. Your centralized logging can be as simple as a free SNMP collector, Snare, syslog, and a few perl scripts. Your log rotation, log archiving and log analysis tools can be the tools that come with your operating system, or even grep. Netflow and netflow collectors are free. Syslog is free.

You have installed, configured and are actually using your vendor provided platform management software. (HP SIM, IBM Director, etc.) Your servers and SAN's should be phoning home to HP or IBM when they are about to die, they should be sending you SMS's or traps, and you should be planning a maintenance window to replace the DIMM or drive that your platform's predictive failure analysis has alerted you on. You should be able to query your platform management database and determine versions, patches, CPU temperatures, and the price of coffee at Starbucks. Your vendor provided you with a wonderful toolkit. It's free. Use it.

Last rule:

Keep it simple at first. Once you've done it the simple way, and you know what you want, you can talk to vendors. Until then, stick to free, simple and open source. Stay away from expensive and difficult to implement tools until you've mastered the platform provided built in tools, free and open source tools.

--Mike

Thursday, April 10, 2008

Asynchronous Information Consumption - Random Thoughts

We spend lots of money on various forms of asynchronous media converters. People seem to be moving their media and information consumption habits, and perhaps even a portion of our social habits, toward asynchronous, store-and-forward type mechanisms.

I find the shift to asynchronous consumption interesting. When time independence is combined with the location independence that comes from being un-tethered, a fundamentally different lifestyle results.

Interesting examples, in no particular order:

DVR's convert synchronous media into asynchronous media. I find it wonderfully ironic that we send film crews to deserted islands for weeks, filming ordinary people doing un-ordinary things to each other, send the film back to a studio, where several months are spent slicing it up and gluing it back together to make it somehow more dramatic and presumably more watchable, then synchronously spew it out over the air every Tuesday at 8pm to an audience of twenty million people who are leaning forward from their couches, holding their breaths, remotes in hand, waiting for their DVR's to record the show so they can watch it on Thursday at 10pm.

Broadcast networks are extraordinarily good at delivering media synchronously to huge numbers of people at exactly the same time. Some of us have images from the 1950's of a family around the TV, waiting for the test pattern to morph into the Lone Ranger. The rest of us can google for images from the 1950's of a family around the TV waiting for the test pattern to morph into the Lone Ranger. And millions of media consumers have become extraordinarily good at synchronizing their wind-up clocks, their farm chores, their meals and their lives to the broadcast schedules of first AM radio, then television. Yes, my family, in the 1960's, scheduled our evening meal around the Huntley-Brinkley Report. Dinner never was allowed to intrude on the news. And the news was never a minute late. Producers would have gotten fired if that ever happened.

CNN's Headline News is a strange hybrid attempt to make a synchronous delivery mechanism appear to be asynchronous, by endlessly repeating the same 3 minutes of news and 7 minutes of advertising at 10 minute intervals.

Movies in the theater are synchronous. Video rentals are asynchronous. You don't have to arrive on time for a rental. Netflix is asynchronous, and you don't even have to arrive at all.

Magazines and newspapers
are some sort of a hybrid delivery mechanism. They are edited and printed and delivered synchronously, but consumed asynchronously. My 'to be read' pile often has a month's worth of weekly magazines piled up. The editors, press operators, truck drivers and postal carriers who broke their backs getting the press run printed, bound and delivered on time must not be happy with me for waiting a month to read their work.

Twitter is asynchronous, instant messaging is synchronous. Blogs are asynchronous, except for really popular blogs, where a small number of people with apparently nothing else to do lean forward from their chairs, holding their breaths, mouse in hand, waiting for the blogger whom they worship to post a new article so they can be the first with an irrelevant reply.

Podcasting is completely asynchronous. NPR and BBC podcasting effectively saves me the trouble of having to get a Tivo for my radio. In fact if media companies simply podcasted everything, appropriately wrapped up in DRM and embedded with advertising, we'd all be able to toss our DVR's.

E-mail is supposed to be asynchronous. Unfortunately masses of workers put e-mail pop-up notification on their desktops and immediately read and respond to all e-mails, effectively turning a store-and-forward asynchronous medium into a synchronous real-time conversation. Didn't we invent chat for that? E-mail threads, though, replace synchronous meetings, even when they shouldn't.

Second Life is some kind of anomaly. There is no store-and-forward at all, yet it is still popular.

Traditional Learning is horribly synchronous. Report to class every morning at 08:05am. Leave for your next class at 08:55am. Report to your next class at 09:05am. Repeat. Repeat. Repeat. Repeat. Assignment #16 is due Tuesday, April 11th at 09:55. If you get behind, too bad for you. If you get ahead, shame on you. You are being far too disruptive. You need to stop that. If you are groggy in the morning and brilliant at night, too bad for you. Sounds synchronous to me.

On-line learning can be as synchronous or asynchronous as we decide to make it. On-line courses can, in theory, be delivered asynchronously. Most are not. Students still have to structure their work around course start dates, end dates, and quiz and assignment deadlines arbitrarily imposed on them by faculty and bureaucrats.

It doesn't have to be that way. Hutchinson Technical College and a few other Minnesota tech colleges conducted a twenty year bold experiment in asynchronous education. They delivered high quality advanced education, including lectures, work assignments, quizzes and exams with almost no date-time restrictions, other than the restriction that they were only able to staff the labs between 7 am and 5 pm. Students could, if they so desired, start their semester any day of the week, any week of the year, complete any assignment, project, exam or quiz any time they felt ready, complete their program, get a degree and graduate the day that they met the program requirements. There were no arbitrary semester boundaries, other than an administrative billing boundary that occurred several times per year. (If you want to keep learning, you gotta keep paying....)

In other news, one of the largest higher education systems in the US finally achieved its decade old goal of a common calendar for all of its member colleges and universities. For the first time since its founding, all colleges and universities in the Minnesota State system will start and end their semesters on exactly the same date.

Wednesday, April 9, 2008

Solaris Live Upgrade, Thin Servers and Upgrade Strategies

Sun recently upgraded Solaris Live Upgrade to permit live upgrades of servers with zones, and from what I can tell, seems to be positioning Live Upgrade as the standard method for patching and upgrading Solaris servers. This gives Solaris sysadmins another useful tool in the toolkit. For those who don't know Solaris:
Solaris Live Upgrade provides a method of upgrading a system while the system continues to operate. While your current boot environment is running, you can duplicate the boot environment, then upgrade the duplicate. Or, rather than upgrading, you can install a Solaris Flash archive on a boot environment.
The short explanation is that a system administrator can create a new boot disk from a running server (or an image of a bootable server) while the server is running, then upgrade, patch or otherwise tweak the new boot disk, and during a maintenance window, re-boot the server from the new image. In theory, the downtime for a major Solaris upgrade, say from Solaris 9 to Solaris 10, is only the time that it takes to reboot the server. If the reboot fails, the old boot disk is still usable, untouched. The system administrator simply re-boots from the old boot disk. (Actually the required change window would be the time to reboot, test, reboot. But the fallback is simple and low risk, so the window can still be shorter than other strategies.)
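The cycle described above maps onto a short sequence of Live Upgrade commands. This is an illustrative sketch only - the disk and slice names are made up, and the exact flags vary by Solaris release, so check the lucreate/luupgrade man pages before trusting any of it:

```shell
# Create an alternate boot environment on a spare disk slice while the
# system stays in production (disk/slice names are illustrative).
lucreate -c Sol9 -n Sol10 -m /:/dev/dsk/c0t1d0s0:ufs

# Upgrade the inactive boot environment from install media.
luupgrade -u -n Sol10 -s /cdrom/sol_10_sparc

# Activate the new environment; the change takes effect at the next boot.
# (Live Upgrade wants 'init 6', not 'reboot', so its shutdown scripts run.)
luactivate Sol10
init 6

# If the new environment misbehaves, fall back:
# luactivate Sol9 ; init 6
```

The production environment is untouched until the final `init 6`, which is why the downtime shrinks to a reboot.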

Other high-end operating systems support something like this, and many system administrators maintain multiple bootable disks in a server to protect themselves against various operating system failures. Sun's Live Upgrade makes maintaining the alternate boot disks relatively painless.

If you are in Windows land, imagine this: While your Windows 2003 server is running, copy your C: drive, including all registry entries, system files, tweaks, user files and configuration bits to an empty partition. Then while the server is still up & in production, without rebooting, upgrade that new partition to Windows 2008, preserving configs, tweaks, installed software and the registry. Then reboot the server to the new 2008 boot partition, and have everything work. The down time would be the time it takes to re-boot, not the time it takes to do an in-place upgrade. And if it didn't work, a reboot back to the original 2003 installation puts you back where you were. Pretty cool.

We tried Live Upgrade when it first came out, but ran into enough limitations that we rarely used it for routine upgrades. Early versions couldn't upgrade logical volumes or servers with zones, and the workarounds for a live upgrade with either were pretty ugly. Also, by the time a server was ready for a major version upgrade, we usually were ready to replace hardware anyway. I've got one old dog of a server at home that has been live upgraded from Solaris 8, to 9, through the current Solaris 10, and a whole bunch of mid-cycle upgrades and patches in between, but because of the limitations and our aggressive hardware upgrade cycle, we've generally not used Live Upgrade on our critical production servers. We might re-try it, now that zones and logical volumes are supported.

Alternative Upgrade Strategies

Don't upgrade. Reinstall.
The thinner the server, the more attractive a non-upgrade strategy becomes. In the ideal thin server world, servers would be minimized to their bare essentials (the least-bit principle), configured identically, and deployed using standard operating system images. Applications would be similarly pre-configured and self-contained, with all configuration and customization contained in well defined and well known files and directory structures. Operating systems, application software and application data would all be logically separated, never mixed together.

In this ideal world, a redundant server could be taken offline, booted from a new master operating system image, fully patched and configured, and have required applications deployed to the new server with some form of semi automatic deployment process. Data would be on data volumes, untouched. Upgrades would be re-installs, not upgrades. Applications would be well understood and trivial to install and configure, and system administrators would know exactly what is required for an application to function. All the random cruft of dead files that accumulates on servers would get purged. Life would be good.

Obviously if you've ever installed to /usr/local/bin, or descended into CPAN hell, or if you've ever run the install wizard with your eyes closed and fingers crossed, you are not likely to have success with the 're-install everything' plan unless you've got duplicate hardware and can install and test in parallel.

We are starting to follow this strategy on a subset of our Solaris infrastructure. Some of our applications are literally deployed directly from Subversion to app servers, including binaries, war files, jars and all. So a WAN boot from flar, followed by a scripted zone creation and an SVN deploy of the entire application in theory will re-create the entire server, ready for production.

Throw New Hardware at the problem. For applications that use large numbers of redundant servers, a strategy is to build a new operating system on a new or spare server, re-install and test the application, and cut over to the new hardware and operating system. If the old hardware is still serviceable it can be re-used as the new hardware for the next server. Rolling upgrades, especially of redundant load balanced servers, can be pretty risk free.

Unfortunately major software and operating systems have conspired to make major upgrades difficult to do on clustered servers and databases. So dropping a Windows 2003 SQL server out of a cluster, reinstalling or upgrading it to 2008, and having it re-join the cluster, isn't an option. Workarounds that stop the cluster, manipulate data LUN's and bring up the old data on the new server are do-able though. That makes it possible to do major upgrades in short windows, but not 'zero window'.

Abandon in place. A valid strategy is to simply avoid major upgrades for the entire life cycle of the hardware. Operating system, application and hardware upgrades are decreed to be synchronous. Major operating system upgrades do not occur independently of hardware upgrades. Once a server is installed as O/S version N, it stays that way until it reaches the end of its life cycle, with patches only for security or bug fixes. This depends on the availability of long term vendor support, and requires that system managers maintain proficiency in older operating system versions. In this case, application life cycle and hardware life cycle are one and the same.


Upgrade in place, fingers crossed. This would probably be the least desirable option. Stop the server, insert the upgrade CD, and watch it upgrade (or trash) your server. The risk of failure is high, and the options for fall back are few. (Recover from tape anyone?). Odds are fair that your test lab server, if you have one, is just different enough that the upgrade tests aren't quite valid, and some upgrade or compatibility snafu will cause a headache. The risk can be mitigated somewhat by creating another bootable disk and storing it somewhere during the upgrade. The fall back in that case is to switch boot disks and reboot from the old disk.

Other options, or combinations of the above are possible.

Monday, April 7, 2008

System Management by the Least Bit Principle

System Management Principle Number 6:
If you remove one more bit, the system will fail.

Minimal installs, when applied to systems, result in systems that have higher security and availability.

This isn't new. This is just a restatement and perhaps an extension of normal securing and hardening of systems. The following are examples of how to apply least bit to various parts of your systems and applications.

File System Permissions (rights) are least bit when removing one more role, right or permission results in a failed application. The goal is to minimize permissions, either by starting with no permission bits and adding permissions until the application is fully functional, or by starting with a best guess on permissions and removing rights & permissions until the application fails. Ideally the vendor would have tested and configured the minimum permissions and rights for the application.

That rarely, if ever, happens. The best example I've encountered was a couple of decades ago, when installing WordPerfect Office (now Novell GroupWise). The WordPerfect Office instructions gave a whole page of detailed instructions on exactly what file system permissions were needed through the entire application directory structure (Read, Filescan, Create, Modify, Write, Erase, etc). I tested every one of them and found that they almost had the principle down perfectly. They had only one extra bit, on one directory.

Avoid, at all costs, the system admin shortcut of 'just give it full rights, we can figure out what to take away later'. Later never happens.

Oh - and for the Unix'ers reading this - Netware actually HAD file system permissions.

File Systems are full of abandoned applications, abandoned installations, dead scripts and unused Java run-times (how many of those do you have on your servers?). Every one of them is a violation of the least bit principle.
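A crude first pass at finding those abandoned bits is to hunt for files nobody has touched in ages. The paths and the one-year threshold below are assumptions; on a real server you'd point this at /opt, /usr/local, webroots, and the like. The demo builds its own test tree so it runs anywhere:

```shell
mkdir -p demo/opt/old-app demo/opt/live-app
touch -d '2005-01-01' demo/opt/old-app/run.sh   # stale, untouched for years
touch demo/opt/live-app/run.sh                  # recently used
# Files unmodified for over a year are deprovisioning candidates:
find demo/opt -type f -mtime +365 -print
```

Modification time isn't proof of abandonment (check access times and application docs too), but it builds the candidate list fast.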

Operating System Installations start with the most minimal installation possible. Use the 'Base' or 'Core' installation and add packages or options as needed for application functionality. Never, under any circumstances, do anything resembling a 'full install'. Vendors might balk. System managers need to un-balk the vendors. We were told that Sun E25K's must have the full operating system installed and that our minimal, stripped-down install wasn't supported. We responded by advising that we could, if we so desired, run our 32-core Oracle servers on HPUX.

Microsoft, with Windows Server Core, appears to be moving in this direction also. Thank you Redmond. I'm starting to like you.

Minimal operating system installs carry security bonuses. It's a fine day when you can read down the weekly or monthly list of security vulnerabilities and X them off with 'not vulnerable, package not installed'.
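That triage step can be mostly automated: given the packages named in this week's advisories, check each against what's actually installed. On RPM systems the real check is 'rpm -q <pkg>' (dpkg -s on Debian); the sketch below simulates the installed-package list with a flat file so it runs anywhere, and the package names are made up:

```shell
printf 'openssh\nhttpd\n' > installed.txt       # stand-in for rpm -qa output
for pkg in openssh sendmail telnet-server; do   # packages named in advisories
    if grep -qx "$pkg" installed.txt; then
        echo "$pkg: installed - patch required"
    else
        echo "$pkg: not vulnerable, package not installed"
    fi
done
```

The shorter installed.txt is, the shorter the 'patch required' column gets.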

This is a continuation of the age old hardening practice of minimizing the ports that a server listens on, and minimizing the number of services that are running. Except that the services are not even installed.

Databases have options, features and configuration that keep us busy full time figuring out what they do and why we need them. DBAs are pretty good at minimizing user rights and roles. The least bit principle applies not only to user rights and roles, but also to database feature installation. If the application isn't using a feature, the feature should not be installed. With a minimally installed and configured Oracle base, when you walk through the quarterly Oracle nightmare that they call a Critical Patch Update, half the time you get lucky: you get to mark yourself down as 'not vulnerable, package not installed'.

Applications ship with every feature enabled. Disable them all, and re-enable the ones that you are using. Delete sample applications and configs. Please. Remove all unnecessary bits from configuration files. If you cannot walk down through the application installation directory and know what each file & directory does, you've got to start reading & calling your vendor tech support. Tomcat does NOT need sample bindings, or whatever it comes pre-configured with. An Apache config can be a couple thousand bytes, instead of 40k. There is no reason to load 30 modules when your Apache instance is serving up plain old HTML. Your goal is 'not vulnerable, module not loaded'.
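For the Apache case, a static-HTML-only config really can be that small. The fragment below is a hypothetical sketch in Apache 2.2-era syntax; module names, paths, and the exact minimum set vary by version and distro, so treat it as an illustration of the target, not a drop-in config:

```apache
# Hypothetical minimal httpd.conf for serving plain HTML only.
Listen 80
User apache
Group apache
ServerRoot "/etc/httpd"
LoadModule mime_module modules/mod_mime.so        # three modules, not thirty
LoadModule dir_module modules/mod_dir.so
LoadModule log_config_module modules/mod_log_config.so
TypesConfig conf/mime.types
ErrorLog logs/error_log
DocumentRoot "/var/www/html"
DirectoryIndex index.html
<Directory "/var/www/html">
    Order allow,deny
    Allow from all
</Directory>
```

A few hundred bytes, three modules, and a vulnerability in mod_proxy or mod_cgi becomes someone else's problem.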

This also requires separation of application data from application code (or binaries, or executables). Any directory that has both executable code and data cannot be secured by the least bit principle.
Application vendors, including open source projects, that ship fully configured applications are not doing system administrators and security people a favor. They are forcing us to walk through an entire application and figure out whether all three deploy directories are really needed.

Application developers that write applications that mix executables and data should have their 401K's converted to Bear Stearns stock.

Firewalls and Load Balancers are obviously candidates for least bit. For firewalls, the least bit implies that every firewall rule can be traced back to required application functionality. That also implies that the application managers or system managers know exactly which server talks to whom, in what direction, and on what port(s). Time to call the vendor. If they cannot tell you exactly what ports and protocols are used by each component of their application, then you need to admit to yourself that you bought the wrong product from the wrong vendor. And if they wrote an application that uses randomly generated ephemeral ports that can't be firewalled, you know you bought the wrong product.
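A least-bit firewall policy ends up looking like a default-deny base plus one traceable exception per documented flow. The fragment below is a hypothetical iptables-save style sketch; the addresses, ports, and the vendor port matrix it cites are all made up:

```
# Hypothetical /etc/sysconfig/iptables fragment - default deny, and every
# ACCEPT maps back to a documented application flow.
*filter
:INPUT DROP [0:0]
# web tier -> this app server, HTTP (vendor port matrix, row 1)
-A INPUT -s 10.0.1.0/24 -p tcp --dport 8080 -j ACCEPT
# monitoring host -> SNMP (vendor port matrix, row 2)
-A INPUT -s 10.0.9.5 -p udp --dport 161 -j ACCEPT
COMMIT
```

A rule you can't annotate with its application flow is a rule you should be trying to delete.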

For load balancers, the least bit implies not only that old, dead load balancer configs are removed, but also that the load balancer is configured to forward only the URLs that are required by the application. If the application has an index.html in the application root and a half dozen jar/ear/war files, the load balancer should only forward index.html and /application/* for each jar/ear/war. At least then you know that if someone screws up and deploys a new, unsecured application, or forgets to de-provision an old application, the load balancer will toss the request rather than forward it on. In this case, the least bit principle results in a substantially longer and more complex configuration.
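The idea, sketched in nginx-style syntax purely for illustration (pool name and paths are assumptions; your load balancer's syntax will differ):

```nginx
# Hypothetical sketch: forward only the paths the application actually owns;
# anything else dies at the balancer instead of reaching the backend.
location = /index.html      { proxy_pass http://app_pool; }
location /application-one/  { proxy_pass http://app_pool; }
location /application-two/  { proxy_pass http://app_pool; }
location /                  { return 404; }   # forgotten or rogue apps stop here
```

The catch-all 404 is the whole point: an accidentally deployed or never-deprovisioned application is unreachable until someone deliberately adds its path.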

Network Devices. For switches and routers, this applies to the device firmware or operating system and the configuration. The least bit principle requires us to load the minimum operating system or feature set that is required for required functionality or application support. That means that you load base IOS images, not enterprise images, unless you have a specific requirement for enterprise features.

Treated as security devices, switches and routers should be configured to permit the minimum required traffic and to block protocols and services that are not required for the functionality of the attached devices. This means enabling more features on the switch or router (DHCP snooping, spanning tree root guard, etc.). In other words, the network devices pass only required traffic, blocking all other traffic.

This also applies to standard hardening of network devices, such as disabling unnecessary services, as is presumably already being done by network administrators.

Application Deprovisioning is the tail end of the least bit principle. The final task when taking an application out of service is to remove its bits from servers, file systems, load balancers, databases and firewalls. As long as those bits are still in your datacenter, all the security baggage that the application carried with it is still around your neck.
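A deprovisioning sweep can at least confirm that no bits remain on the file systems. The application name and search roots below are assumptions (and the load balancer, database, and firewall checks still have to happen in their own tools); the demo builds its own leftovers so it runs anywhere:

```shell
mkdir -p demo2/etc demo2/srv/oldapp             # simulate leftover bits
echo "leftover config" > demo2/etc/oldapp.conf
echo "checking for leftovers:"
find demo2 -name 'oldapp*' -print               # anything found = not done yet
rm -rf demo2/srv/oldapp demo2/etc/oldapp.conf   # the actual deprovisioning
echo "after cleanup: $(find demo2 -name 'oldapp*' | wc -l) leftovers"
```

Deprovisioning is done when the sweep comes back empty, not when the ticket is closed.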

If this is System Management Principle #6, I suppose I'll have to dream up #1 through #5. When I think of them, I'll blog them.

Friday, April 4, 2008

Privacy, Centralization and Databases

A fascinating article on the potential problems associated with privacy in a large government-run database was recently posted at ModernMechanix. The article appears to be in response to an effort to build a centralized data center that would contain personal records on US citizens. The interesting part is that the article appeared in The Atlantic in 1967. Reading it today makes it clear not only that the author, Arthur R. Miller, laid bare the fundamental issues surrounding centralized, government-managed data repositories, but that the issues have neither changed nor been addressed. Unfortunately the article is probably far more relevant today.

The author is concerned that:
With its insatiable appetite for information, its inability to forget anything that has been put into it, a central computer might become the heart of a government surveillance system that would lay bare our finances, our associations, or our mental and physical health to government inquisitors or even to casual observers.
As we discuss a national database of health records and a national identity card; and as we already have centralized employment reporting, mandatory bank transaction reporting, and centralized credit agencies that, though seemingly privately run, are protected from any reasonable privacy rules or laws by some unseen forces in DC; we can pretty much declare that, except for cleaning up a few loose ends, we already have what Arthur Miller feared in 1967.

This sounds familiar:
The great bulk of the information likely to find its way into the center will be gathered and processed by relatively unskilled and unimaginative people who lack discrimination and sensitivity.
Show me a governmental agency where some clerk hasn't snooped around in other people's tax or health records. We know it happens. And occasionally we even hear about it in the media.

And of course, once you are in the database, how easy is it to get inaccurate records corrected?
An untested, impersonal, and erroneous computer entry such as “associates with known criminals” has marked him, and he is helpless to rectify the situation. Indeed, it is likely that he would not even be aware that the entry existed.
Does anyone actually think that the No Fly list is accurate? Or that expunged records are expunged?

I'm still waiting for this to happen:
To ensure the accuracy of the center’s files, an individual should have an opportunity to correct errors in information concerning him. Perhaps a print-out of his computer file should be sent to him once a year.
Let me try that this weekend. I'll write a letter to every state and federal agency that has ever had any records on me and ask them for a copy. I'm sure that will work. While I'm at it, I'll ask for every photo from every traffic camera that I ever drove through, too.

Who hasn't heard this idea:
One solution may be to store information according to its sensitivity or its accessibility, or both.
Wow. I just paid a $300/hr consultant to tell me that the new enterprise security best practices will require me to classify our data and secure it according to its sensitivity. I could have read a 40 year old Atlantic for a $1 instead. (That was sarcasm. Data classification is obvious and self evident.)

And:
It probably will also be necessary to audit the programs controlling the manipulation of the files and access to the system to make sure that no one has inserted a secret “door” or a password permitting entry to the data by unauthorized personnel.
Fascinating. Code audits, account audits, integrity checking, intrusion detection.

If this is even the slightest bit interesting to you then page through this Harvard Law Review article also (PDF). The arguments are reiterated in greater detail. The notes in the margin are also very interesting.

I'm rather disappointed that my generation, the one that took computing from the early 1980s through today, seems to have neither come up with any significant new privacy issues, nor solved any longstanding privacy problems.

What's even more chilling is that the use of organized, automated data indexing and storage for nefarious purposes has an extraordinary precedent. Edwin Black has concluded that the efficiency of Hollerith punch cards and tabulating machines made possible the extremely "...efficient asset confiscation, ghettoization, deportation, enslaved labor, and, ultimately, annihilation..." of a large group of people that a particular political party found to be undesirable.

History repeats itself. We need to assume that the events of the first half of the twentieth century will re-occur someday, somewhere, with probably greater efficiency.

What are we doing to protect our future?