Sunday, April 25, 2010

Oracle/Sun ZFS Data Loss – Still Vulnerable

Last week I wrote about how we got bit by a bug and ended up with lost/corrupted Oracle archive logs and a major outage. Unfortunately, Oracle/Sun’s recommendation – to patch to MU8 – doesn’t resolve all of the ZFS data loss issues.

There are two distinct bugs, one fsync() related, the other sync() related. Update 8 may fix 6791160 “zfs has problems after a panic”, but

Bug ID 6880764 “fsync on zfs is broken if writes are greater than 32kb on a hard crash and no log attached”

is apparently not resolved until 142900-09 released on 2010-04-20.

DBAs, pay attention: Any Solaris 10 server kernel earlier than Update 8 + 142900-09 that is running any application that synchronously writes chunks larger than 32k is vulnerable to data loss on abnormal shutdown.

As best as I can figure – with no access to any information from Sun other than what’s publicly available – these bugs affect synchronous writes large enough to be written directly to the pool instead of indirectly via the ZIL. After an abnormal shutdown, on reboot the ZIL replay looks at the metadata in the ZIL and whacks the write (and your Oracle archive logs).

It appears that you can

  • limit database writes to 32k (and kill database performance)
  • or you can force writes larger than 32k to be written to the ZIL instead of the pool by setting zfs_immediate_write_sz larger than your largest database write (and kill database performance)
  • or you can use a separate intent log device (slog)
  • or you can update to 142900-09

Ironically, the ZFS Evil Tuning Guide recommends the opposite – set “the zfs_immediate_write_sz parameter to be lower than the database block size” so that all database writes take the broken direct path.
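The options above all come down to which path a synchronous write takes. Here is a toy Python model of that decision, built only from the public bug description and the reasoning in this post. It is not ZFS code: the function names, the 32 KB default and the `has_slog` and `patched` flags are illustrative assumptions.

```python
# Toy model of the write-path reasoning in this post. Not ZFS code.
# All names and the 32 KB default are illustrative assumptions.

DEFAULT_IMMEDIATE_WRITE_SZ = 32 * 1024  # assumed threshold, in bytes


def sync_write_path(write_size,
                    immediate_write_sz=DEFAULT_IMMEDIATE_WRITE_SZ,
                    has_slog=False):
    """Return the path a synchronous write is assumed to take."""
    if has_slog:
        # A separate log device is attached, so the write goes there.
        return "slog"
    if write_size > immediate_write_sz:
        # Large write: data goes straight to the pool, metadata to the ZIL.
        return "direct-to-pool"
    # Small write: the data itself is logged in the ZIL.
    return "zil"


def is_exposed(write_size,
               immediate_write_sz=DEFAULT_IMMEDIATE_WRITE_SZ,
               has_slog=False,
               patched=False):
    """True if this write would take the path the bug report describes."""
    if patched:
        return False
    path = sync_write_path(write_size, immediate_write_sz, has_slog)
    return path == "direct-to-pool"


if __name__ == "__main__":
    size = 128 * 1024  # a hypothetical 128 KB database write
    print(is_exposed(size))                                # default tuning
    print(is_exposed(size, immediate_write_sz=256 * 1024)) # raised threshold
    print(is_exposed(size, has_slog=True))                 # separate log
    print(is_exposed(size, patched=True))                  # 142900-09
```

Under this model, lowering the threshold below the database block size sends every database write down the direct path, which is why the tuning guide's advice works against you on an unpatched kernel.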

Another bug that is a consideration for an out of order patch cycle and rapid move to 142900-09:

Bug ID 6867095: “User applications that are using Shared Memory extensively or large pages extensively may see data corruption or an unexpected failure or receive a SIGBUS signal and terminate.”

This sounds like an Oracle killer.

I’m not crabby about a killer data loss bug in ZFS. I’m crabby because Oracle/Sun knew about the bug and its enormous consequences and didn’t do a damned thing to warn their customers. Unlike Entrust – who warned us that we had a bad cert even though it was our fault that our SSL certs had no entropy, and unlike Microsoft who warned its customers about potential data loss, Sun/Oracle really has their head in the sand on this.

Unfortunately – when your head is in the sand, your ass is in the air.

Saturday, April 24, 2010

We do not retest System [..] every time a new version of Java is released.

This post’s title is a quote from Oracle technical support on a ticket we opened to get help running one of their products on a current, patched JRE.

Oracle’s response:

“1. Please do not upgrade Java if you do not have to
2. If you have to upgrade Java, please test this on your test server before implemeting [sic] on production
3. On test and on production, please make a full backup of your environment (files and database) before upgrading Java and make sure you can roll back if any issue occurs.”

In other words – you are on your own. The hundreds of thousands of dollars in licensing fees and maintenance that you pay us don’t do you sh!t for security.

Let’s pretend that we have a simple, clear and unambiguous standard: ‘There will be no unpatched Java runtime on any server’.

There isn’t a chance in hell that standard can be met.

This seems to be a cross vendor problem. IBM’s remote server management requires a JRE on the system that has the application that connects to the chassis and allows remote chassis administration. As far as we can tell, and as far as IBM’s support is telling us, there is no possibility of managing an IBM xSeries using a patched JRE.

“It is not recommended to upgrade or change the JRE version that's built inside Director. Doing so will create an unsupported configuration as Director has only been tested to work with its built-in version.”

We have JREs everywhere. Most of them are embedded in products. The vendors of the products rarely if ever provide security related updates for their embedded JREs. When there are JRE updates, we open up support calls with them and watch them dance around while they tell us that we need to leave them unpatched.

My expectations? If a vendor bundles or requires third party software such as a JRE, that vendor will treat a security vulnerability in the dependent third party software as though it were a vulnerability in their own software, and they will not make me open up support requests for something this obvious.

It’s the least they could do.

Friday, April 16, 2010

Bit by a Bug – Data loss Running Oracle on ZFS on Solaris 10, pre 142900-09 (was: pre Update 8)

We recently hit a major ZFS bug, causing the worst system outage of my 20 year IT career. The root cause:

Synchronous writes on ZFS file systems prior to Solaris 10 Update 8 are not properly committed to stable media prior to returning from the fsync() call, as required by POSIX and expected by Oracle archive log writing processes.

On kernels earlier than MU8 + 142900-09[1] (was: pre Update 8), we believe that programs utilizing fsync() or O_DSYNC writes to disk are displaying buffered-write-like behavior rather than unbuffered synchronous write behavior. Additionally, when there is a disk/storage interruption on the zpool device and a subsequent system crash, we see a "rollback" of fsync() and O_DSYNC files. This should never occur, as writes with fsync() or O_DSYNC are supposed to be on stable media when the kernel call returns.

If there is a storage failure followed by a server crash[2], the file system is recovered to an inconsistent state. Either blocks of data that were supposedly synchronously written to disk are not, or the ZFS file system recovery process truncates or otherwise corrupts the blocks that were supposedly synchronously written. The affected files include Oracle archive logs.

We experienced the problem on an ERP database server when an OS crash caused the loss of an Oracle archive log, which in turn caused an unrecoverable Streams replication failure. We replicated the problem in a test lab using a v240 with the same FLAR, HBAs, device drivers and a scrubbed copy of the Oracle database. After hundreds of tests and crashes over a period of weeks, we were able to re-create the problem with a 50 line ‘C’ program that performs synchronous writes in a manner similar to the synchronous writes that Oracle uses to ensure that archive logs are always consistent, as verified by dtrace.

The corruption/data loss is seen under the following circumstances:

  • Run a program that synchronously writes to a file

OR

  • Run a program that asynchronously writes to a file with calls to fsync().

Followed by any of:[2]

  • SAN LUN un-present
  • SAN zoning error
  • local spindle pull

Then followed by:

  • system break or power outage or crash recovery

Post recovery, in about half of our test cases, blocks that were supposedly written by fsync are not on disk after reboot.
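For anyone who wants to try a similar test, here is a minimal Python sketch of the idea. It is not the 50 line ‘C’ program from our lab. The record layout, the 64 KB record size and every function name are assumptions made for illustration. The writer makes each record durable with fsync() (optionally with O_DSYNC as well) before acknowledging it, and the checker counts how many leading records survived.

```python
import os
import struct
import zlib

RECORD_SIZE = 64 * 1024  # assumption: chosen to be larger than 32k


def make_record(seq, size=RECORD_SIZE):
    """Build a fixed-size record: sequence number, CRC32, then filler."""
    payload = bytes([seq % 256]) * (size - 8)
    header = struct.pack(">II", seq, zlib.crc32(payload))
    return header + payload


def write_records(path, count, size=RECORD_SIZE, use_dsync=False,
                  on_ack=None):
    """Write `count` records, acknowledging each only after fsync()."""
    if on_ack is None:
        on_ack = lambda seq: print(f"acked {seq}", flush=True)
    flags = os.O_WRONLY | os.O_CREAT | os.O_TRUNC
    if use_dsync:
        # O_DSYNC is not defined on every platform, so fall back to 0.
        flags |= getattr(os, "O_DSYNC", 0)
    fd = os.open(path, flags, 0o600)
    try:
        for seq in range(count):
            record = make_record(seq, size)
            written = 0
            while written < len(record):
                written += os.write(fd, record[written:])
            os.fsync(fd)  # the record should be on stable media after this
            on_ack(seq)
    finally:
        os.close(fd)


def count_intact_records(path, size=RECORD_SIZE):
    """Return how many leading records are present and pass their CRC."""
    intact = 0
    with open(path, "rb") as f:
        while True:
            record = f.read(size)
            if len(record) < size:
                break
            seq, crc = struct.unpack(">II", record[:8])
            if seq != intact or zlib.crc32(record[8:]) != crc:
                break
            intact += 1
    return intact
```

To use it, run `write_records()` with the acknowledgements captured somewhere that survives the crash (an ssh session from another machine, for example), interrupt the storage and crash the server, then run `count_intact_records()` after reboot. If the count is lower than the number of acknowledged records, acknowledged synchronous writes were lost.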

As far as I can tell, Sun has not issued any sort of alert on the data loss bugs. There are public references to this issue, but most of them are obscure and don’t clearly indicate the potential problem:

From the Sun kernel engineer that worked our case:

Date 08-MAR-2010
Task Notes : […] Since S10U8 encompasses 10+ PSARC features and 300+ CR fixes for ZFS, and the fixes might be inter-related, it's hard to pinpoint exactly which ones resolve customer's problem.

For what it’s worth, Sun support provided no useful assistance on this case. We dtrace’d Oracle log writes, replicated the problem using an Oracle database, and then – to prevent Sun from blaming Oracle or our storage vendor - replicated the data loss with a trivial ‘C’ program on local spindles.

Once again, if you are on Solaris 10 earlier than Update 8 + 142900-09[1] (was: pre Update 8) and you have an application (such as a database) that expects synchronous writes to still be on disk after a crash, you really need to run a kernel from Update 8 + 142900-09 dated 2010-04-22 or newer[1] (was: Update 8 or newer, Oct 2009).


[1] 2010-04-25: Based on new information and a reading of 142900-09 released Apr/20/2010, MU8 alone doesn’t fully resolve the known critical data loss bugs in ZFS.

The read is that there are two distinct bugs, one fsync() related, the other sync() related. Update 8 may fix 6791160 “zfs has problems after a panic”, but

Bug ID 6880764 “fsync on zfs is broken if writes are greater than 32kb on a hard crash and no log attached”

is not resolved until 142900-09 on 2010-04-22.

Another bug that is a consideration for an out of order patch cycle and rapid move to 142900-09:

Bug ID 6867095: “User applications that are using Shared Memory extensively or large pages extensively may see data corruption or an unexpected failure or receive a SIGBUS signal and terminate.”

This sounds like an Oracle killer.

[2] Or apparently a server crash alone.

Phishing Attempt or Poor Customer Communications?

I’ve just run into what’s either a really poor customer communication from Hewlett-Packard, or a pretty good targeted phishing attempt.

The e-mail, as received earlier today:

From: gcss-case@hpordercenter.com

Subject: PCC-Cust_advisory

Dear MIKE JAHNKE,
HP has identified a potential, yet extremely rare issue with HP
BladeSystem c7000 Enclosure 2250W Hot-Plug Power Supplies manufactured prior to March 20, 2008. This issue is extremely rare; however, if it does occur, the power supply may fail and this may result in the unplanned shutdown of the enclosure, despite redundancy, and the enclosure may become inoperable.

HP strongly recommends performing this required action at the customer's earliest possible convenience. Neglecting to perform the required action could result in the potential for one or more of the failure symptoms listed in the advisory to occur. By disregarding this notification, the customer accepts the risk of incurring future power supply failures.
Thank you for taking our call today, as we discussed please find Hewlett Packard's Customer Advisory - Document ID: c01519680.
You will need to have a PDF viewer to view/print the attached document.
If you don't already have a PDF viewer, you can download a free version from Adobe Software, www.adobe.com

The interesting SMTP headers for the e-mail:

Received: from zoytoweb06 ([69.7.171.51]) by smtp1.orderz.com with Microsoft SMTPSVC(6.0.3790.3959);
Fri, 16 Apr 2010 10:22:02 -0500
Return-Path: gcss-case@hpordercenter.com

Message-ID: 5A7CB4C6E58C4D9696B5F867030D280C@domain.zoyto.com

The interesting observations:

  • They spelled my name wrong and used ‘Mike’ not ‘Michael’
  • The source of the e-mail is not hp.com, nor is hp.com in any SMTP headers. The headers reference hpordercenter.com, Zoyto and orderz.com
  • hpordercenter.com, Zoyto and orderz.com all have masked/private Whois information.
  • The subject is “PCC-Cust_advisory”, with – and _ for word spacing
  • Embedded in the e-mail is a link to an image from the Chinese language version of HP’s site: http://….hp-ww.com/country/cn/zh/img/….
  • There is inconsistent paragraph spacing in the message body
  • It references a “phone conversation from this morning” which didn’t occur. There was no phone call.
  • It attempts to convey urgency (“customer accepts risk…”)
  • It references an actual advisory, but the advisory is 18 months old and hasn’t been updated in 6 months.
  • Our HP account manager hasn’t seen the e-mail and wasn’t sure if it was legit.
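The header check in the list above is easy to script. Below is a rough Python sketch, assuming the raw message is available as text. The function names and the choice of headers to scan are my own illustrative assumptions, and a missing vendor domain only flags a message for a closer look. It does not prove anything by itself.

```python
import re
from email import message_from_string

# Matches host names that end in an alphabetic TLD, so bare IP addresses
# and version strings such as 6.0.3790.3959 are ignored.
DOMAIN_RE = re.compile(r"[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)*\.[A-Za-z]{2,}")

# Assumption: these are the headers worth scanning for sender domains.
HEADERS_TO_CHECK = ("From", "Return-Path", "Received", "Message-ID")

# The headers quoted in this post, reassembled as a raw message.
SAMPLE = (
    "Received: from zoytoweb06 ([69.7.171.51]) by smtp1.orderz.com "
    "with Microsoft SMTPSVC(6.0.3790.3959);\n"
    " Fri, 16 Apr 2010 10:22:02 -0500\n"
    "Return-Path: gcss-case@hpordercenter.com\n"
    "From: gcss-case@hpordercenter.com\n"
    "Message-ID: 5A7CB4C6E58C4D9696B5F867030D280C@domain.zoyto.com\n"
    "Subject: PCC-Cust_advisory\n"
    "\n"
    "body\n"
)


def header_domains(raw_message):
    """Collect every domain name that appears in the scanned headers."""
    msg = message_from_string(raw_message)
    domains = set()
    for name in HEADERS_TO_CHECK:
        for value in msg.get_all(name, []):
            domains.update(d.lower() for d in DOMAIN_RE.findall(value))
    return domains


def mentions_domain(domains, expected):
    """True if `expected` or one of its subdomains is in `domains`."""
    expected = expected.lower()
    return any(d == expected or d.endswith("." + expected) for d in domains)


if __name__ == "__main__":
    found = header_domains(SAMPLE)
    print(sorted(found))
    print("hp.com present:", mentions_domain(found, "hp.com"))
```

Note that the comparison is on whole domain labels, so hpordercenter.com does not count as hp.com even though it starts with the same two letters.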

Attached to the e-mail was a PDF.

The attached PDF (yes, I opened it…and no, I don’t know why…) has a URL across the top in a different font, as though it was generated from a web browser:

[Image: HP]

Did I get phished?

If so, there’s a fair chance that I’ve just been rooted, so I:

  • Uploaded the PDF to Wepawet at the UCSB Computer Security Lab. It showed the PDF as benign.
  • Checked firewall logs for any/all URLs and TCP/UDP connections from my desktop at the time that I opened the PDF and again after a re-boot. There are no network connections that aren’t associated with known activity.

I’m pretty sure that this is just a really poor e-mail from an outsourcer hired by HP. But just in case… I opened up a ticket with our security group, called desktop support & had them nuke it from orbit, MBR included.

Damn – what a waste of a Friday afternoon.

Monday, April 12, 2010

3.5 Tbps

Interesting stats from Akamai:

  • 12 million requests per second peak
  • 500 billion requests per day
  • 61,000 servers at 1000 service providers

The University hosts an Akamai cache. My organization uses the University as our upstream ISP, so we benefit from the cache.

The University’s Akamai cache also saw high utilization on Thursday and Friday of last week. Bandwidth from the cache to our combined networks nearly doubled, from about 1.2Gbps to just over 2Gbps.

The Akamai cache works something like this:

  • Akamai places a rack of gear on the University network in University address space, attached to University routers.
  • The Akamai rack contains cached content from Akamai customers. Akamai mangles DNS entries to point our users to the IP addresses of the Akamai servers at the University for Akamai cached content.
  • Akamai cached content is then delivered to us via their cache servers rather than via our upstream ISPs.

It works because:

  • the content provider doesn’t pay the Tier 1 ISPs for transport
  • the University (and us) do not pay the Tier 1 ISPs for transport
  • the University (and us) get much faster response times from cached content. The Akamai cache is connected to our networks via a 10Gig link and is physically close to most of our users, so that whole propagation delay thing pretty much goes away

The net result is that something like 15-20% of our inbound Internet content is served up locally from the Akamai cache, tariff free. A win for everyone (except the Tier 1’s).

This is one of the really cool things that makes the Internet work.

Via CircleID

Update: The University says that the amount of traffic we pull from Akamai would cost us approximately $10,000 a month or more to get from an ISP.  That’s pretty good for a rack of colo space and a 10G port on a router.

Saturday, April 3, 2010

The Internet is Unpatched – It’s Not Hard to See Why

It’s brutal. We have Internet Explorer vulnerabilities that need a chart to explain, a Mac OS X update that’s larger than a bootable Solaris image, a Java security update, two Firefox Updates, Adobe and Foxit! PDF readers that apparently are broken, as designed, and three flagship browsers that rolled over and died in one contest.

Responsible network administrators and home users have been placed into patch hell by software vendors that simply are not capable of writing software that can stand up to the Internet.

  • There is no operating system or platform that has built in patch management technology that is both comprehensive and easy for network administrators and home users to understand or use.
  • There is no reason to expect that even if software vendors were actually able to release good code, that the release would make it out to users’ desktops.
  • Some vendors (Microsoft) have robust and easy to use patch distribution systems, but those systems only distribute patches for their software. Each other vendor must re-invent the software distribution wheel, and each does it in a random and arbitrary way, with flags, popups, silent installs, noisy installs, click here to continue, arbitrary re-boots…

It’s not a Microsoft problem, it’s not an Adobe problem, it’s a software development problem, and as far as I can tell, all vendors have the problem.

So how did I get on this rant (other than the pathetic display of incompetence by the world’s major software vendors the last few months, or rather the last few decades)?

Google Analytics.

Presumably this blog is frequented by more-technical-than-average users. I can’t imagine non-technical users being interested in my most frequented posts on Structured System Management, MTTR, MTBF & Continuous Deployment. I would also assume that because the blog should only be interesting to techies:

  1. the distribution of browsers should be skewed towards Chrome and Firefox or other ‘nerdy’ browsers
  2. the operating systems should be weighted toward Linux and OS X
  3. the readers of this blog should be fully patched

Guess which two out of the three are correct?

Firefox & Chrome add up to more than IE:

[Image: browser-distribution]

The operating systems tend to be Linux and OS X heavy compared to the market as a whole:

[Image: browser-os-distribution]

And the readers of this blog tend to be fully patched:

[Image: Flash-Versions]

Oops.

[Hint – any Flash version other than r45 is out of date. Y’all are nerds, so y’all already knew that, right?].

If technical people either cannot or are not keeping up with patches, why would we expect ordinary users to keep up?

Broken, as designed.