Last In - First Out: August 2008

Privacy, Centralization and Security Cameras

The hosting of the Republican National Convention here in St Paul has one interesting side effect. We finally have our various security and traffic cameras linked together:
http://www.twincities.com/ci_10339532

“The screens will also show feeds from security cameras controlled by the State Patrol, Minnesota Department of Transportation, and St. Paul, Minneapolis and Metro Transit police departments.
Before the RNC, there was no interface for all the agencies' cameras to be seen in one place. Local officials could continue to use the system after the RNC.” (Emphasis mine)

So now we have state and local traffic cameras, transit cameras and various police cameras all interconnected and viewable from a central place. This alone is inconsequential. When however, a minor thing like this is repeated many times across a broad range of places and technologies and over a long period of time, the sum of the actions are significant. In this case, what’s needed to turn this into something significant is a database to store the surveillance images and a way of connecting the ~~security and~~ ~~traffic~~ surveillance camera images to cell phone roaming data, WIFI roaming data, social network traffic data, Bluetooth scans and RFID data from automobile tires. Hmm…that actually doesn’t sound too difficult, or at least it doesn’t sound too much more difficult than security event correlation in a large open network. Is there any reason to think that something like that will not happen in the future?

If it did, J Edgar Hoover would be proud. The little bits and pieces that we are building to solve daily security and efficiency ‘problems’ are building the foundation of a system that will permit our government to efficiently track anyone, anywhere, anytime. Hoover tried, but his index card system wasn’t quite up to the task. He didn’t have Moore’s Law on his side.

As one of my colleagues indicates, hyper-efficient government is not necessarily a good thing. Institutional inefficiency has some positive properties. In the particular case of the USA there are many small overlapping and uncoordinated units of local, state and federal government and law enforcement. In many cases, these units don’t cooperate with each other and don’t even particularly like each other. There is an obvious inefficiency to this arrangement. But is that a bad thing?

Do we really want our government and police to function as a coordinated, efficient, centralized organization? Or is governmental inefficiency essential to the maintenance of a free society? Would we rather have a society where the efficiency and intrusiveness of the government is such that it is not possible to freely associate or freely communicate with the subversive elements of society? A society where all movements of all people are tracked all the time? Is it possible to have an efficient, centralized government and still have adequate safeguards against the use of centralized information by future governments that are hostile to the citizens?

As I wrote in Privacy, Centralization and Databases last April:

What's even more chilling is that the use of organized, automated data indexing and storage for nefarious purposes has an extraordinary precedent. Edwin Black has concluded that the efficiency of Hollerith punch cards and tabulating machines made possible the extremely "...efficient asset confiscation, ghettoization, deportation, enslaved labor, and, ultimately, annihilation..." of a large group of people that a particular political party found to be undesirable.
History repeats itself. We need to assume that the events of the first half of the twentieth century will re-occur someday, somewhere, with probably greater efficiency.
What are we doing to protect our future?

We are giving good guys full spectrum surveillance capability so that sometime in the future when they decide to be bad guys, they’ll be efficient bad guys.

There have always been bad governments. There always will be bad governments. We just don’t know when.

09/29-2008 - Updated to correct minor grammatical errors.

Scaling Online Learning - 14 Million Pages Per Day

Some notes on scaling a large online learning application.

09/29-2008 - Updated to correct minor grammatical errors.

Stats:

29 million hits per day, 700/second
14 million .NET active content pages per day^[1]
900 transactions per second
2000 database queries per second
20 million user created content files
Daily user population of over 100,000
Database server with 16 dual core x64 CPU's, 128GB RAM, Clustered
Nine IIS application servers, load balanced
The largest installation of the vendors product

Breadth and complexity. The application is similar to a comprehensive ERP application, with a couple thousand stored procedures and thousands of unique pages of active web content covering a full suite of online learning applications, including content creation and delivery, discussions, quizzing, etc. The application has both breadth and depth, and is approximately as complex as a typical ERP application. This makes tuning interesting. If a quarter million dollar query pops up, it can be tuned or re-designed, but if the load is spread more or less evenly across dozens or hundreds of queries & stored procedures, the opportunities for quick wins are few.

Design. Early design decisions by the vendor have been both blessings and curses. The application is not designed for horizontal scalability at the database tier. Many normal scaling options are therefore unavailable. Database scalability is currently limited to adding cores and memory to the server, and adding cores and memory doesn’t scale real well.

The user session state is stored in the database. The original version of the application made as many as ten database round trips per web page, shifting significant load back to the database. Later versions cached a significant fraction of the session state, reducing database load. The current version has stateless application servers that also cache session state, so database load is reduced by caching, but load balancing decisions can be still made without worrying about user session stickiness. (Best of both worlds. Very cool.)

Load Curve. The load curve peaks early in the semester after long periods of low load between semesters. Semester start has a very steep ramp up, with first day load as much as 10 times the load the day before (See chart). This reduces opportunity for tuning under moderate load. The app must be tuned under low load. Assumptions and extrapolations are used to predict performance at semester startup. There is no margin for error. The app goes from idle to peak load in about 3 hours on the morning of the first day of classes. Growth tends to be 30-50% per semester, so peak load is roughly predicted at 30-50% above last semester peak.

Early Problems

Unanticipated growth. We did not anticipate the number of courses offered by faculty the first semester. Hardware mitigated some of the problem. The database server grew from 4CPU/4GB RAM to 4CPU/24GB, then 8CPU/32GB in 7 weeks. App servers went from four to six to nine.

Database fundamentals: I/O, Memory, and problems like ‘don’t let the database engine use so much memory that the OS gets swapped to disk’ were not addressed early enough.

Poor monitoring tools. If you can’t see deep in to the application, operating system and database, you can’t solve problems.

Poor management decisions. Among other things, the project was not allowed to draw on existing DBA resources, so non-DBA staff were forced to a very steep database learning curve. Better take the book home tonight, 'cause tomorrow you're gonna be the DBA. Additionally, options for restricting growth by slowing the adoption rate of the new platform were declined, and critical hosting decisions were deferred or not made at all.

Unrealistic Budgeting. The initial budget was also very constrained. The vendor said 'You can get the hardware for this project for N dollars'. Unfortunately N had one too few zero’s on the end of it. Upper management compared the vendor's N with our estimate of N * 10. We ended up compromising at N * 3 dollars, having that hardware only last a month & within a year and a half spending N * 10 anyway.

Application bugs. We didn’t expect tempDB to grow to 5 times the size of the production database and we didn’t expect tempDB to be busier than the production database. We know from experience that SQL Server 2000 can handle 200 database logins/logouts per second. But just because it can, doesn’t mean it should. (The application broke its connection pooling.)

32 bits. We really were way beyond what could rationally be done with 32 bit operating systems and databases, but the application vendor would not support ~~Itanic~~ Itanium at all, and SQL 2005 in 64 bit mode wasn’t supported until recently. The app still does not support 64 bit .NET application servers.

Query Tuning and uneven index/key distribution. We had parts of the database were the cardinality looked like a classic long tail problem, making query tuning and optimization difficult. We often had to make a choice of optimizing for one end of the key distribution or the other, with performance problems at whatever end we didn't optimize.

Application Vendor Denial. It took a long time and lots of data to convince the app vendor that not all of the problems were the customer. Lots of e-mail, sometimes rather rude, was exchanged. As time went on, they started to accept our analysis of problems, and as of today, are very good to work with.

Redundancy. We saved money by not making the original file and database server clustered. That cost us in availability.

Later Problems

Moore's Law. Our requirements have tended to be ahead of where the hardware vendors were with easily implementable x86 servers. Moore's Law couldn’t quite keep up to our growth rate. Scaling x86 SQL server past 8 CPU’s in 2004 was hard. In 2005 there were not very many options for 16 processor x86 servers. Scaling to 32 cores in 2006 was not any easier. Scaling to 32 cores on a 32 bit x86 operating system was beyond painful. IBM’s x460 (x3950) was one of the few choices available, and it was a painfully immature hardware platform at the time that we bought it.

The “It Works” Effect. User load tended to ramp up quickly after a smooth, trouble free semester. The semester after a smooth semester tended to expose new problems as load increased. Faculty apparently wanted to use the application but were held back by real or perceived performance problems. When the problems went away for a while they jumped on board, and the next semester hit new scalability limits.

Poor App Design. We had a significant number of high volume queries that required re-parsing and re-optimization on each invocation. Several of the most frequently called queries were not parameterizable and hence had to be parsed each time they were called. At times we were parsing hundreds of new queries per second, using valuable CPU resources on parsing and optimizing queries that would likely never get called again. We spent person-months digging onto query optimization and building a toolkit to help dissect the problem.

Database bugs. Page latches killed us. Tier 3 database vendor support, complete with a 13 hour phone call and Gigs of data finally resolved a years old (but rarely occurring) latch wait state problem, and also uncovered a database engine bug that only showed up under a particularly odd set of circumstances (ours, of course). And did you know that when 32-bit SQL server sees more than 64GB of RAM it rolls over and dies? We didn't. Neither did Microsoft. We eventually figured it out after about 6 hours on the phone with IBM Advanced Support, MS operating system tier 3 and MSSQL database tier 3 all scratching their heads. /BURNMEM to the rescue.

High End Hardware Headaches. We ended up moving from a 4 way HP DL580 to an 8-way HP DL740 to 16-way IBM x460's (and then to 32 core x3950's). The x460's and x3950's ended up being a maintenance headache, beyond anything that I could have imagined. We hit motherboard firmware bugs, disk controller bugs, had bogus CPU overtemp alarms, hardware problems (bad voltage regulators on chassis interface boards), and even ended up with an IBM 'Top Gun' on site (That's her title. And no, there is no contact info on her business card. Just 'Top Gun'.)

File system management. Maintaining file systems with tens of millions of files is a pain, no matter how you slice it.

Things that went right.

We bought good load balancers right away. The Netscalers have performed nearly flawlessly for 4 years, dishing out a thousand pages per second of proxied, content switched, SSL’d and Gzip’d content.

The application server layer scales out horizontally quite easily. The combination of proxied load balancing, content switching and stateless application servers allows tremendous flexibility at the app server layer.

We eventually built very detailed database statistics and reporting engine, similar to Oracle AWR reports. We know, for example, what the top N queries are for CPU, logical I/O, physical I/O. etc, at ten minute intervals any time during the last 90 days.

HP Openview Storage Mirroring (Doubletake) works pretty well. It's keeping 20 million files in sync across a WAN with not too many headaches.

We had a few people who dedicated person-years of their life to the project, literally sleeping next to their laptops, going for years without ever being more than arms reach from the project. And they don’t get stock options.

I ended up with a couple quotable phrases to my credit.

On Windows 2003 and SQL server:

"It doesn't suck as bad as I thought it would"

and

"It's displayed an unexpected level of robustness"

Lessons:

Details matter. Enough said.

Horizontal beats vertical. We know that. So does everyone else in the world. Except perhaps our application vendor. The application is still not horizontally scalable at the database tier and the database vendor still doesn't provide a RAC like horizontally scalable option. Shards are not an option. That will limit future scalability.

Monitoring matters. Knowing what to monitor and when to monitor it is essential to both proactive and reactive application and database support.

AWR-like reports matter. We have consistently decreased the per-unit load on the back end database by continuously pounding down the top 'N' queries and stored procedures. The application vendor gets a steady supply of data from us. They roll tweaks and fixes from their customers into their normal maintenance release cycle. It took a few years, but they really do look at their customers' performance data and develop fixes. We fed the vendor data all summer. They responded with maintenance releases, hot fixes and patches that reduced database load by at least 30%. Other vendors take note. Please.

Vendor support matters. We had an application that had 100,000 users, and we were using per-incident support for the database and operating system rather than premier support. That didn't work. But it did let us make a somewhat amusing joke at the expense of some poor first tier help desk person.

Don’t be the largest installation. You’ll be the load test site.

The Quarter Million Dollar Query

Unlimited Resources

Naked Without Strip Charts
^[1] For our purposes, a page is a URL with active content that connects to the database and has at least some business logic.

Acronyms

Two acronyms worth remembering.

RGE:

RGE: (Resume Generating Event) – An event that forces a person, or the persons manger to generate an updated resume.

An RGE is something most of us don't want to experience, at least not too often. RGEs are often followed by changes in income, housing, marital status, etc.

HGE:

HGE: (Headline Generating Event) – An event that causes news reporters to write stories and generate headlines.

HGEs can be either positive or negative, depending on the causes and effects of the event, although with the exception of dot-com startups, most IT initiated HGEs are negative events related to system or project failures of some sort.

HGEs are often followed by RGEs.

Obviously a goal of system mangers, security people and IT folks in general is to make sure that that acronyms like the above don’t show up unexpectedly. Those of us in public service are particularly sensitive to HGEs. There are not too many circumstances where public service IT organizations can generate positive headlines. Odds are that if there are headlines, they are not good. There is no incentive for the local news broadcast to begin with a segment on your shiny and fast new servers or your four nine’s of application uptime.

We spend a lot of time analyzing risk in security decisions, system designs, deployments and upgrades. If we do it right, we can design, build, manage and maintain systems that meet user/customer requirements while minimizing the probability of triggering an HGE and the follow on RGEs.

And if we are REALLY doing it right, we'll have fun while we are doing it.

Design your Failure Modes

In his axiom 'Everything will ultimately fail', Michael Nygard writes that in IT systems, one must:

"Accept that, no matter what, your system will have a variety of failure modes. Deny that inevitability, and you lose your power to control and contain them. [....] If you do not design your failure modes, then you will get whatever unpredictable---and usually dangerous---ones happen to emerge."

I'm pretty sure that I've seen a whole bunch of systems and applications where that sort of thinking isn't on the top of the software architects or developers stack. For example, I've seen:

apps that spew out spurious error messages to critical logs files at a rate that makes 'tail -f' useless. Do the errors mean anything? Nope - just some unhandled exceptions. Could the app be written to handle the exceptions? Yeh, but we have deadlines......
apps that log critical application server/database connectivity error messages back to the database that caused the error. Ummm...if the app server can't connect to the database, why would you attempt to log that back to the database? Because that's how our error handler is designed. Doesn't that result in a recursive death spiral of connection errors that generate errors that get logged through the connections that are in an error state? Ummm.. let me think about that.....
apps that stop working when there are transient network errors, and need to be restarted to recover. Network errors are normal. Really? We never had that problem with our ISAM files!. Can you build your app to gracefully recover from them? Yeh, but we have deadlines......
apps that don't start up if there are leftover temp files from when they crashed and left temp files all over the place. Could you clean up old temp files on startup? How would I know which ones are old?

I suspect that mechanical engineers and metallurgists, when designing motorcycles, autos, and things that hurt people, have that sort of axiom embedded into their daily thought processes pretty deeply. I suspect that most software architects do not.

So the interesting question is - if there are many failure modes, how do you determine which failure modes that you need to engineer around and which ones you can safely ignore?

On the wide area network side of things, we have a pretty good idea what the failure modes are, and it is clearly a long tail sort of problem, something like:

We've seen that circuit failures, mostly due to construction, backhoes, and other human/mechanical problems, are by far the largest cause of failures and are also the slowest to get fixed. Second place, for us, is power failures at sites without building generator/UPS, and a distant third is hardware failure. In a case like that, if we care about availability, redundant hardware isn't anywhere near as important as redundant circuits and decent power.

Presumably each system has a large set of possible failure modes, and coming up with a rational response to the failure modes that are on the left side of the long tail is critical to building available systems, but it is important to keep in mind that not all failure modes are caused by non-animate things.

In system management, human failures, as in a human pushing the wrong button at the wrong time, are common and need to be engineered around just like mechanical or software failures. I suspect that is why we need things like change management or change control, documentation and the other no-so-fun parts of managing systems. And humans have the interesting property of being able to compound a failure by attempting to repair the problem, perhaps the reason why some form of outage/incident handling is important.

In any case, Nygard's axiom is worth the read.

Using the DNS Question to Carry Randomness - a Temporary Hack?

I read Mark Rothman's post "Boiling the DNS Ocean". This lead me to a thought (just a thought), that somewhere within the existing DNS protocol, there has to be a way of introducing more randomness in the DNS question and get that randomness back in the answer, simply to increase the probability that a resolver can trust an authoritative response. Of course having never written a resolver, I'm not qualified to analyze the problem -- but this being the blogosphere, that's not a reason to quit posting.

So at the risk of committing bloggo-suicide....Here goes......

Patch Now - What Does it Mean?

When security researchers/bloggers announce to the world 'patch now', are they are implying that the world should 'patch now without consideration for testing, QA, performance or availability'? Or are they advising an accelerated patch schedule, but in a change managed, tested, QA’d rollout of a patch that considers security and availability? And when they complain about others not patching fast enough, are they assuming that the foot draggers are incompetent? Or are they ignoring the operational realities of making untested changes to critical infrastructure?

Consider that:

All patches have a probability of introducing new bugs. That probability is always > 0 and <= 1. The probability is never equal to zero. (And for a certain large database vendor, our experience is that the probability of introducing new bugs is very close to one).
There are many, many bugs that are only relevant under high loads.
A patch that corrupts data, as in databases or file systems, can be impossible to back out or recover from without irretrievable data loss.
Building test cases that can put realistic real world loads on test servers is very difficult, very expensive, and may not uncover the new bugs anyway.
A failed system or application has known, documented consequences. It is not a game of probability or chance. An unpatched security vulnerability is a game of chance where in most cases the odds against you are not known.

As an operations person with real responsibilities, who is accountable to a very large group of paying customers, and who has to make security versus availability decisions almost every day, I need security researchers to uncover, analyze and communicate risks, threats, vulnerabilities and mitigation techniques. The best of the researchers already do that very well, and for that I am very grateful. To those who are doing that for public service, fame, fortune or personal ego, I sincerely thank you, no matter what your motivation. You are adding value to the Internet community.

But when security people push recommendations out to the world without consideration for availability and/or performance, their recommendations remove value from the Internet community.

Security Researchers add value when

Uncovering and analyzing vulnerabilities and active exploits. (Research)
Analyzing probable and improbable attack vectors and calculating and communicating probabilities. (Research)
Testing and verifying attack vectors. (Research)
Communicating to the community the relative and absolute risks of vulnerabilities and consequences of exploitation. (Public Service)
Developing and communicating mitigation options. (Research)

Security Researchers do not add value when

Making blanket patch advice without consideration for performance or availability. (Operations)
Complaining about enterprises that do not follow their advice. (Carping)

(non-exhaustive lists, of course.)

In that context, when I hear 'patch now' advice, You can bet that I will filter the advice through the prism of availability, performance and operational reality.

I'll listen to 'patch now no matter what' advice from a security researcher/blogger who has real time operational responsibility for a large customer base, perhaps 100,000 or more customers, and who, if the patch fails, would be responsible for interruption of service for those hundred thousand customers, and who, if the patch fails, could or would be terminated for non-performance.
I'll listen to 'patch now no matter what' advice from a security researcher/blogger who has had a system with a hundred thousand customers down hard, has escalated to the vendors highest support level, and who has been on a tech support conference call for 13 continuous hours or more.
I'll listen to 'patch now no matter what' advice from consultants who are putting the reputation and existence of their consultancy on the line every time they give a customer advice.
I'll listen to 'patch now no matter what' advice from our own security staff, who I know will not point fingers, duck and hide when the patch goes bad and my systems fail.

As far as I am concerned, if you are in a position like one of the above, you can complain about service providers who do not patch fast enough to suite your preferences. If you are not in that position, you cannot complain when I don't (or your service provider doesn't) patch fast enough for you.

The bottom line is that unless the people who give the world advice to 'patch now no matter what' are also going to write my e-mail's and presentations explaining why my systems failed, unless they will absorb the inevitable backlash from customers, senior management, governing boards and will stand up in front of representatives from my internal business units and get grilled, castigated, chewed up and spit out for my decision, I don't need them to complain that I am not 'patching now'.

I've been in the 7pm vendor conference call with vendor VP and development supervisor, where our CIO came to the meeting with his/her letter of resignation, to be turned in to our CEO should the vendor fail to deliver performance fixes for the business critical application by 7am the next day.
It was not a fun meeting.

'Patch now' advice must be filtered through the prism of availability, performance and operational reality.

Safe browsing - Websense says fuggetaboutit!

It would sure be nice if an ordinary mortal could buy a computer, plug it in, and safely surf the web. Websense doesn't think so. I don't either. Apparently neither does CNN.

According to Websense:

75 percent of Web sites with malicious code are legitimate sites that have been compromised [...]
60 percent of the top 100 most popular Web sites have either hosted or been involved in malicious activity in the first half of 2008.

Ordinary precautions, like 'don't surf pr0n' , 'don't run P2P', and 'don't download screen savers' are of marginal value when legitimate web sites are part of the malware content distribution network.

It's 2008. So now that we have the wonderful world of Web 2.0, Websense says:

The danger is that users typically associate the content they are viewing from the URL in the address bar, not the actual content source. The URL is no longer an accurate representation of the source content from the Web page. (Emphasis is mine.)

So even the wise old advice of simply making sure that you pay attention to your address bar is of limited value. Your address bar is really just the starting point for the adventure that your Web 2.0 browser will take you on without your knowledge or consent.

Obviously it is true that some people, some of the time, can surf the web with a mass produced, default installed operating system and browser. But for the general case, for most users, that's apparently not true.

One of my security mantras is 'if it can surf the web, it cannot be secured'. In my opinion, if your security model assumes that desktops and browsers are secure, your security model is broke. You still need to do everything you can to secure your desktops and browsers, but at the end of the day, after you've secured them as best as they can be, you still need to maintain a healthy distrust toward them.

Of course when security vendors report on the state of security, we need to put their data into the context of the increase revenue they see when everyone panics and buys their product.

(via Zdnet )

A Down Side to Open Wireless

An interesting byproduct of maintaining an open wireless network? Apparently a man in Mumbai had his life disrupted by a raid on his property when authorities suspected his open wireless as being the source for messages relating to a recent bombing.

So what happens when a crime is alleged to have been committed using your open network?

For individuals who happen to get caught up in a crime involving something like what happened in Mumbai, and who don't have the backing of corporate legal departments, I suspect that the process wouldn't be much fun.

A person has open wireless because:

They don't know it is open.
They know it is open but don't know how to secure it.
They know how, but are to lazy to bother securing it.
They know how to secure it, but they don't really care.
They know how, and they are leaving it unsecured on principle.

If someone gets caught up in something like this, I hope they are in one of the latter categories, rather than in one of the first couple categories.

The crud moved up the stack

A long time ago (in internet time), in a galaxy really nearby, a few large software companies attached their buggy, unsecured operating systems to the Internet.

Havoc ensued.