Oracle/Sun ZFS Data Loss – Still Vulnerable

Last week I wrote about how we got bit by a bug and ended up with lost/corrupted Oracle archive logs and a major outage. Unfortunately, Oracle/Sun’s recommendation – to patch to MU8 – doesn’t resolve all of the ZFS data loss issues.

There are two distinct bugs, one fsync() related, the other sync() related. Update 8 may fix “6791160 zfs has problems after a panic”, but

Bug ID 6880764 “fsync on zfs is broken if writes are greater than 32kb on a hard crash and no log attached”

is apparently not resolved until 142900-09 released on 2010-04-20.

We do not retest System [..] every time a new version of Java is released.

This post’s title is a quote from Oracle technical support on a ticket we opened to get help running one of their products on a current, patched JRE.

Oracle’s response:

“1. Please do not upgrade Java if you do not have to
2. If you have to upgrade Java, please test this on your test server before implemeting [sic] on production
3. On test and on production, please make a full backup of your environment (files and database) before upgrading Java and make sure you can roll back if any issue occurs.”

In other words – you are on your own. The hundreds of thousands of dollars in licensing fees and maintenance that you pay us don’t buy you sh!t for security.

Let’s pretend that we have a simple, clear and unambiguous standard: ‘There will be no unpatched Java runtime on any server’.

There isn’t a chance in hell that standard can be met.

This seems to be a cross-vendor problem. IBM’s remote server management requires a JRE on the system that runs the application used to connect to the chassis for remote chassis administration. As far as we can tell, and as far as IBM’s support is telling us, there is no possibility of managing an IBM xSeries using a patched JRE.

“It is not recommended to upgrade or change the JRE version that's built inside Director. Doing so will create an unsupported configuration as Director has only been tested to work with its built-in version.”

We have JRE’s everywhere. Most of them are embedded in products. The vendors of the products rarely if ever provide security-related updates for their embedded JRE’s. When there are JRE updates, we open support calls with the vendors and watch them dance around while they tell us that we need to leave the JRE’s unpatched.

My expectations? If a vendor bundles or requires third party software such as a JRE, that vendor will treat a security vulnerability in the dependent third party software as though it were a vulnerability in their own software, and they will not make me open up support requests for something this obvious.

It’s the least they could do.

Bit by a Bug – Data Loss Running Oracle on ZFS on Solaris 10, pre 142900-09 (was: pre Update 8)

We recently hit a major ZFS bug, causing the worst system outage of my 20-year IT career. The root cause:

Synchronous writes on ZFS file systems prior to Solaris 10 Update 8 are not properly committed to stable media before the fsync() call returns, as required by POSIX and expected by Oracle’s archive log writing processes.

On pre MU8 + 142900-09[1], we believe that programs utilizing fsync() or O_DSYNC writes to disk are displaying buffered-write-like behavior rather than unbuffered synchronous write behavior. Additionally, when there is a disk/storage interruption on the zpool device and a subsequent system crash, we see a “rollback” of fsync() and O_DSYNC files. This should never occur, as writes made with fsync() or O_DSYNC are supposed to be on stable media when the kernel call returns.
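
For reference, here is a minimal sketch (not our actual test program; the file names are made up) of the two ways an application asks for this guarantee: a write() on a descriptor opened with O_DSYNC, or an ordinary write() followed by fsync(). In both cases, POSIX says the data must be on stable media before the call returns:

    /* minimal sketch of synchronous write semantics - not our test program */
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        char buf[65536];             /* 64kb, i.e. larger than the 32kb called out in 6880764 */
        memset(buf, 0xAB, sizeof(buf));

        /* method 1: O_DSYNC - each write() must be on stable media when it returns */
        int fd = open("testfile.dsync", O_WRONLY | O_CREAT | O_TRUNC | O_DSYNC, 0644);
        if (fd < 0 || write(fd, buf, sizeof(buf)) != (ssize_t)sizeof(buf)) {
            perror("O_DSYNC write");
            exit(1);
        }
        close(fd);

        /* method 2: buffered write() followed by fsync() - the data must be on
           stable media before fsync() returns */
        fd = open("testfile.fsync", O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0 || write(fd, buf, sizeof(buf)) != (ssize_t)sizeof(buf) || fsync(fd) != 0) {
            perror("write/fsync");
            exit(1);
        }
        close(fd);

        printf("both writes acknowledged as stable\n");
        return 0;
    }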

If there is a storage failure followed by a server crash[2], the file system is recovered to an inconsistent state. Either blocks of data that were supposedly written synchronously are not actually on disk, or the ZFS file system recovery process truncates or otherwise corrupts the blocks that were supposedly synchronously written. The affected files include Oracle archive logs.

We experienced the problem on an ERP database server when an OS crash caused the loss of an Oracle archive log, which in turn caused an unrecoverable Streams replication failure. We replicated the problem in a test lab using a v240 with the same FLAR, HBA’s, device drivers and a scrubbed copy of the Oracle database. After hundreds of tests and crashes over a period of weeks, we were able to re-create the problem with a 50-line ‘C’ program that performs synchronous writes in a manner similar to the synchronous writes that Oracle uses to ensure that archive logs are always consistent, as verified by dtrace.

The corruption/data loss is seen under the following circumstances:

  • Run a program that synchronously writes to a file, OR run a program that asynchronously writes to a file with calls to fsync()
  • Followed by any of:[2]
      • SAN LUN un-present
      • SAN zoning error
      • local spindle pull
  • Then followed by:
      • system break or power outage or crash recovery

Post recovery, in about half of our test cases, blocks that were supposedly written by fsync() are not on disk after reboot.
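
For illustration, here is a rough, hypothetical sketch of that kind of test harness (simplified, and not the actual 50-line program; the file name, record size and command-line interface are made up). The writer appends numbered 64kb records and fsync()’s after each one; after the crash and reboot, any record that fsync() acknowledged but that is missing or corrupt is exactly the data loss described above:

    /* hypothetical crash-consistency test sketch - not our actual program */
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/types.h>
    #include <unistd.h>

    #define RECSZ 65536     /* 64kb records, well over the 32kb threshold in 6880764 */

    int main(int argc, char **argv)
    {
        char rec[RECSZ];

        if (argc == 3 && strcmp(argv[1], "verify") == 0) {
            /* verify mode, run after reboot: argv[2] is the last record number
               that fsync() acknowledged before the crash */
            long last = atol(argv[2]), i, id;
            int fd = open("synctest.dat", O_RDONLY);
            if (fd < 0) { perror("open"); exit(1); }
            for (i = 0; i <= last; i++) {
                if (pread(fd, rec, RECSZ, (off_t)i * RECSZ) != RECSZ ||
                    (memcpy(&id, rec, sizeof(id)), id != i)) {
                    printf("record %ld acknowledged by fsync() but missing or corrupt\n", i);
                    exit(1);
                }
            }
            printf("all %ld acknowledged records are on disk\n", last + 1);
            return 0;
        }

        /* writer mode: run this, then un-present the LUN / pull the spindle
           and crash the box */
        int fd = open("synctest.dat", O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0) { perror("open"); exit(1); }
        for (long i = 0; ; i++) {
            memset(rec, 0, RECSZ);
            memcpy(rec, &i, sizeof(i));               /* stamp the record number */
            if (write(fd, rec, RECSZ) != RECSZ || fsync(fd) != 0)
                break;                                /* storage has gone away */
            fprintf(stderr, "acknowledged %ld\n", i); /* the last number printed must survive the crash */
        }
        return 0;
    }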

As far as I can tell, Sun has not issued any sort of alert on the data loss bugs. There are public references to this issue, but most of them are obscure and don’t clearly indicate the potential problem:

  • “MySQL cluster redo log gets truncated after a uadmin 1 0 if stored on zfs” (Sunsolve login required)
  • “zfs has problems after a panic”
  • “ZFS dataset sometimes ends up with stale data after a panic”
  • “kernel/zfs zfs has problems after a panic - part II”, which contains the description: "But sometimes after a panic you see that one of the datafile is completely broken, usually on a 128k boundary, sometimes it is stale data, sometimes it is complete rubbish (deadbeef/badcafe in it, their system has kmem_flags=0x1f), it seems if you have more load you get random rubbish."

And:

  • “fsync on zfs is broken if writes are greater than 32kb on a hard crash and no log attached”

From the Sun kernel engineer who worked our case:

Date 08-MAR-2010
Task Notes : […] Since S10U8 encompasses 10+ PSARC features and 300+ CR fixes for ZFS, and the fixes might be inter-related, it's hard to pinpoint exactly which ones resolve customer's problem.

For what it’s worth, Sun support provided no useful assistance on this case. We dtrace’d Oracle log writes, replicated the problem using an Oracle database, and then – to prevent Sun from blaming Oracle or our storage vendor – replicated the data loss with a trivial ‘C’ program on local spindles.

Once again, if you are on Solaris 10 pre Update 8 + 142900-09[1] and you have an application (such as a database) that expects synchronous writes to still be on disk after a crash, you really need to run a kernel from Update 8 + 142900-09, dated 2010-04-22, or newer[1].


[1] 2010-04-25: Based on new information and a reading of 142900-09, released 2010-04-20, MU8 alone doesn’t fully resolve the known critical data loss bugs in ZFS.

Our read is that there are two distinct bugs, one fsync() related, the other sync() related. Update 8 may fix “6791160 zfs has problems after a panic”, but

Bug ID 6880764 “fsync on zfs is broken if writes are greater than 32kb on a hard crash and no log attached”

is not resolved until 142900-09 on 2010-04-22.

Another bug that is a consideration for an out-of-order patch cycle and a rapid move to 142900-09:

Bug ID 6867095: “User applications that are using Shared Memory extensively or large pages extensively may see data corruption or an unexpected failure or receive a SIGBUS signal and terminate.”

This sounds like an Oracle killer.

[2] Or apparently a server crash alone.

Phishing Attempt or Poor Customer Communications?

I’ve just run into what’s either a really poor customer communication from Hewlett-Packard, or a pretty good targeted phishing attempt.

The e-mail, as received earlier today:
From: gcss-case@hpordercenter.com
Subject: PCC-Cust_advisory
Dear MIKE JAHNKE,
HP has identified a potential, yet extremely rare issue with HP
BladeSystem c7000 Enclosure 2250W Hot-Plug Power Supplies manufactured prior to March 20, 2008. This issue is extremely rare; however, if it does occur, the power supply may fail and this may result in the unplanned shutdown of the enclosure, despite redundancy, and the enclosure may become inoperable.
HP strongly recommends performing this required action at the customer's earliest possible convenience. Neglecting to perform the required action could result in the potential for one or more of the failure symptoms listed in the advisory to occur. By disregarding this notification, the customer accepts the risk of incurring future power supply failures.
Thank you for taking our call today, as we discussed please find Hewlett Packard's Customer Advisory - Document ID: c01519680.
You will need to have a PDF viewer to view/print the attached document.
If you don't already have a PDF viewer, you can download a free version from Adobe Software, www.adobe.com
The interesting SMTP headers for the e-mail:
Received: from zoytoweb06 ([69.7.171.51]) by smtp1.orderz.com with Microsoft SMTPSVC(6.0.3790.3959);
Fri, 16 Apr 2010 10:22:02 -0500
Return-Path: gcss-case@hpordercenter.com

Message-ID: 5A7CB4C6E58C4D9696B5F867030D280C@domain.zoyto.com
The interesting observations:
  • They spelled my name wrong and used ‘Mike’ not ‘Michael’
  • The source of the e-mail is not hp.com, nor is hp.com in any SMTP headers. The headers reference hpordercenter.com, Zoyto and orderz.com
  • hpordercenter.com, Zoyto and orderz.com all have masked/private Whois information.
  • The subject is “PCC-Cust_advisory”, with – and _ for word spacing
  • Embedded in the e-mail is a link to an image from the Chinese language version of HP’s site: http://….hp-ww.com/country/cn/zh/img/….
  • There is inconsistent paragraph spacing in the message body
  • It references a “phone conversation from this morning” which didn’t occur. There was no phone call.
  • It attempts to convey urgency (“customer accepts risk…”)
  • It references an actual advisory, but the advisory is 18 months old and hasn’t been updated in 6 months.
  • Our HP account manager hasn’t seen the e-mail and wasn’t sure if it was legit.
Attached to the e-mail was a PDF.
The attached PDF (yes, I opened it…and no, I don’t know why…) has a URL across the top in a different font, as though it was generated from a web browser.
Did I get phished?
If so, there’s a fair chance that I’ve just been rooted, so I:
  • Uploaded the PDF to Wepawet at the UCSB Computer Security Lab. It showed the PDF as benign.
  • Checked firewall logs for any/all URL’s and TCP/UDP connections from my desktop at the time that I opened the PDF and again after a re-boot. There are no network connections that aren’t associated with known activity.
I’m pretty sure that this is just a really poor e-mail from an outsourcer hired by HP. But just in case… I opened a ticket with our security group, called desktop support & had them nuke it from orbit, MBR included.

Damn – what a waste of a Friday afternoon.

3.5 Tbps

Interesting stats from Akamai:

  • 12 million requests per second peak
  • 500 billion requests per day
  • 61,000 servers at 1000 service providers

The University hosts an Akamai cache. My organization uses the University as our upstream ISP, so we benefit from the cache.

The University’s Akamai cache also saw high utilization on Thursday and Friday of last week. Bandwidth from the cache to our combined networks nearly doubled, from about 1.2Gbps to just over 2Gbps.

The Akamai cache works something like this:

  • Akamai places a rack of gear on the University network in University address space, attached to University routers.
  • The Akamai rack contains cached content from Akamai customers. For that content, Akamai mangles DNS entries to point our users at the IP addresses of the Akamai servers at the University (see the sketch below).
  • Akamai cached content is then delivered to us via their cache servers rather than via our upstream ISP’s.
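
A rough way to see that DNS mangling in action, sketched in C (the hostname below is a placeholder, not a real Akamai customer): resolve an Akamai-served name from inside our network, and the answers come back in the University’s address space where the cache rack sits, rather than at the content provider’s origin.

    /* rough sketch: print the addresses the resolver hands back for a CDN-served
       hostname - "www.example-cdn-customer.com" is a placeholder */
    #include <arpa/inet.h>
    #include <netdb.h>
    #include <netinet/in.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/socket.h>

    int main(void)
    {
        struct addrinfo hints, *res, *p;
        char addr[INET6_ADDRSTRLEN];

        memset(&hints, 0, sizeof(hints));
        hints.ai_family = AF_UNSPEC;
        hints.ai_socktype = SOCK_STREAM;

        if (getaddrinfo("www.example-cdn-customer.com", "80", &hints, &res) != 0) {
            fprintf(stderr, "lookup failed\n");
            return 1;
        }
        for (p = res; p != NULL; p = p->ai_next) {
            void *a = (p->ai_family == AF_INET)
                ? (void *)&((struct sockaddr_in *)p->ai_addr)->sin_addr
                : (void *)&((struct sockaddr_in6 *)p->ai_addr)->sin6_addr;
            inet_ntop(p->ai_family, a, addr, sizeof(addr));
            /* on a network hosting an Akamai cache, these land in local address
               space rather than at the distant origin */
            printf("%s\n", addr);
        }
        freeaddrinfo(res);
        return 0;
    }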

It works because:

  • the content provider doesn’t pay the Tier 1 ISP’s for transport
  • the University (and we) do not pay the Tier 1 ISP’s for transport
  • the University (and we) get much faster response times from cached content. The Akamai cache is connected to our networks via a 10Gig link and is physically close to most of our users, so that whole propagation delay thing pretty much goes away

The net result is that something like 15-20% of our inbound Internet content is served up locally from the Akamai cache, tariff free. A win for everyone (except the Tier 1’s).

This is one of the really cool things that makes the Internet work.

Via CircleID

Update: The University says that the amount of traffic we pull from Akamai would cost us approximately $10,000 a month or more to get from an ISP.  That’s pretty good for a rack of colo space and a 10G port on a router.

The Internet is Unpatched – It’s Not Hard to See Why

It’s brutal. We have Internet Explorer vulnerabilities that need a chart to explain, a Mac OS X update that’s larger than a bootable Solaris image, a Java security update, two Firefox updates, Adobe and Foxit PDF readers that are apparently broken as designed, and three flagship browsers that rolled over and died in one contest.