When Software Vendors Make Security Assumptions

Bob recently ran into a situation where, in order to run a vendor-provided tool, he had to either modify his security practices or spend a bunch of time working around the poor tool design. The synopsis of his problem:
"Problem with that, though, is that it wants to log in as root. All the documentation says to have it log in as root. But on my hosts nobody logs in as root, unless there’s some big crisis happening."
This wasn't a crisis.

This seems to be a common problem. We've had a fair number of situations where a vendor assumed that remote login as root was possible, that there were no firewalls anywhere, that all systems of the same platform shared the same credentials, and that unsafe practices were generally followed.

Examples:
  • Really expensive enterprise backup software that assumed there were no firewalls anywhere. The vendor advised us that technical support couldn't help if the customer was firewalled. (This was a while ago, but the product still requires the world's ugliest firewall rules.)
  • An 'appliance' (really a Linux box) that, because it used the brilliantly designed random-port-hopping Java RMI protocol for its management interface, couldn't be firewalled separately from its console (really a Windows server).
  • A financial reporting tool that required that the group 'Everyone' have 'Full Control' over the MS SQL Server data directories. No kidding - I have the vendor docs and the f*ugly audit finding.
  • A really expensive load testing product that assumed that netstat, rsh and other archaic, deprecated, unencrypted and insecure tools and protocols were enabled and available across the network.
Why are vendors so clueless?

Here are a couple of hypotheses.
  1. Bob and I are the only ones in the world with segmented networks, who have remote root login disabled and who have rational security practices. To the vendors, we are outliers who don't matter.
  2. The vendors’ developers, who insist that the only way that they can be productive is if they get a sandbox/dev environment where they are root and they don't have any security restrictions, actually get what they ask for. The code they write then works fine (in their unrestricted environment) so they ship it. Customers don't object, so the practice continues.
I suspect the latter.

(OK - maybe there are other possibilities, but they aren't as amusing as picking on developers...)
It doesn't have to be this way. A couple decades ago I installed a product that had specific, detailed instructions on the minimum file system permissions required for application functionality in each directory of the application tree, including things like write-only directories, directories with read but not file-scan privileges, etc. (an early version of what's now called GroupWise).
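That kind of documentation translates almost directly into something scriptable and auditable. As a minimal sketch of the idea in Python (with a hypothetical install path and made-up per-directory modes; POSIX has no exact equivalent of the old 'read but no file scan' privilege, so the mapping is loose), a vendor could ship something like:

# Sketch only: apply a vendor-documented minimum-permission matrix to an
# application tree instead of granting blanket full control. The path and
# modes below are illustrative, not from any real vendor document.
import os
import stat

APP_ROOT = "/opt/exampleapp"    # hypothetical install location

MINIMUM_PERMS = {
    # subdirectory : minimum mode needed for the application to function
    "bin":    stat.S_IRUSR | stat.S_IXUSR | stat.S_IRGRP | stat.S_IXGRP,  # read/execute, no write
    "config": stat.S_IRUSR | stat.S_IWUSR,                                # owner read/write only
    "spool":  stat.S_IWUSR | stat.S_IXUSR | stat.S_IWGRP | stat.S_IXGRP,  # drop-box: write/traverse, no listing
    "logs":   stat.S_IRWXU | stat.S_IWGRP | stat.S_IXGRP,                 # owner full, group may write into it
}

for subdir, mode in MINIMUM_PERMS.items():
    path = os.path.join(APP_ROOT, subdir)
    if os.path.isdir(path):
        os.chmod(path, mode)
        print(f"{path}: set to {oct(mode)}")
    else:
        print(f"{path}: missing, skipped")

The specific modes don't matter; what matters is that the vendor spells out the minimum, so the customer can apply it, audit it, and say no to anything broader.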

Today? I still see vendors that assume they have 'Full Control', remote root, unrestricted networks, etc.

My solution?
  • Escalate the brain-deadness through the vendors’ help desk, through tiers 1-2-3 to the duty manager, and don't let them close the ticket until you've annoyed them so badly that the product manager calls you and apologizes.
  • In meetings with the vendors’ sales team, emphasize the problem. Make it clear that future purchasing decisions are affected by their poor design. ‘You’ve got a great product, but unfortunately it’s not deployable in our environment’. The sales channel likely has more influence over the product than the support channel.
  • Ask for a written statement of indemnification for future security incidents that are shown to have exploited the vendors’ poor design. You obviously will not get it, but the vendors’ product support will have to interface with their own legal department, which is painful enough that they'll likely not forget your problem at their next 'product roadmap' meeting.
  • Do it nicely though. Things like "...man, I've got an audit finding here that makes your product look really bad, that's going to hurt us both..." are more effective than anything resembling hostility.
If enough customers make enough noise, will the vendors eventually get the message?

Continuous Deployment – the Debate

Apparently, IMVU is rolling out fifty deployments a day. Continuous deployment at its finest, perhaps.

Michael Bolton at Developsense decided to look at what they are deploying. He found a couple dozen bugs in about as many minutes and concluded:
...there's such a strong fetish for the technology—the automated deployment—that what is being deployed is incidental to the conversation. Yes, folks, you can deploy 50 times a day. If you don't care about the quality of what you're deploying, you can meet any other requirement, to paraphrase Jerry Weinberg. If you're willing to settle for a system that looks like this and accept the risk of the Black Swan that manifests as a privacy or security or database-clearing problem, you really don't need testers.
On the surface, it seems that the easier it is to deploy, the less time you'll spend on the quality of what you deploy. If a deploy is cheap, there is no reason to test. Just deploy. If the customers don't like it, pull it back. If it's a Mars lander? Send another one. This line of thinking is captured in a comment by ‘Sho’ on a recent Rail Spikes post:
Sho - "The users would rather have an error every now and again on a site which innovates and progresses quickly rather than never any error on a site which sits still for months at a time because the developers can’t do a thing without rewriting hundreds upon hundreds of mundane low-level tests."
Send it out the door, fix it later? I know my users would disagree, but perhaps there are user communities out there who are different.

On the other hand - if you dig into what Timothy Fritz wrote about IMVU's deployment process, you get an idea that it's not your grandfather's process:
"We have around 15k test cases, and they’re run around 70 times a day. That’s a million test cases a day."
Hmm... this just got interesting.
"code is rsync’d out to the hundreds of machines in our cluster. Load average, cpu usage, php errors and dies and more are sampled by the push script, as a basis line. A symlink is switched on a small subset of the machines throwing the code live to its first few customers. A minute later the push script again samples data across the cluster and if there has been a statistically significant regression then the revision is automatically rolled back. If not, then it gets pushed to 100% of the cluster and monitored in the same way for another five minutes. The code is now live and fully pushed. This whole process is simple enough that it’s implemented by a handfull of shell scripts. "
Ahh….. they’ve wrapped a very extensive automated process around a deployment. Really cool.
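To make the mechanics concrete, here's a rough sketch of that kind of push in Python. It's not IMVU's tooling; the host names, metric source, release paths and the 'statistically significant' test are all stand-ins, but the shape matches the description: stage the code, baseline, flip a symlink on a few canaries, compare, then roll back or roll forward.

# Hedged sketch of a canary-style push: rsync the code, baseline a health
# metric, go live on a small subset, and roll back if the metric regresses.
# Hosts, paths and thresholds are made up for illustration.
import statistics
import subprocess
import time

CLUSTER = [f"web{i:03d}.example.com" for i in range(1, 201)]   # hypothetical hosts
CANARY = CLUSTER[:5]
NEW_RELEASE = "/releases/rev1234"                              # hypothetical paths
OLD_RELEASE = "/releases/previous"

def run(host, cmd):
    return subprocess.run(["ssh", host, cmd], capture_output=True, text=True, check=True).stdout

def sample_error_rate(hosts):
    # Assume each host exposes a per-minute error counter in a file; one sample per host.
    return [float(run(h, "cat /var/run/app/errors_per_min")) for h in hosts]

def push_code(hosts):
    for h in hosts:
        subprocess.run(["rsync", "-a", NEW_RELEASE + "/", f"{h}:{NEW_RELEASE}/"], check=True)

def switch_symlink(hosts, release):
    for h in hosts:
        run(h, f"ln -sfn {release} /srv/app/current")

def regressed(baseline, sample, sigmas=3.0):
    # Crude stand-in for "statistically significant regression": flag when the
    # sample mean is more than `sigmas` standard deviations above the baseline mean.
    mean = statistics.mean(baseline)
    stdev = statistics.stdev(baseline) or 1e-9
    return statistics.mean(sample) > mean + sigmas * stdev

baseline = sample_error_rate(CLUSTER)
push_code(CLUSTER)                          # code staged everywhere, not yet live
switch_symlink(CANARY, NEW_RELEASE)         # live for the first few customers
time.sleep(60)

if regressed(baseline, sample_error_rate(CANARY)):
    switch_symlink(CANARY, OLD_RELEASE)             # automatic rollback
else:
    switch_symlink(CLUSTER, NEW_RELEASE)            # push to 100% and keep watching
    time.sleep(300)
    if regressed(baseline, sample_error_rate(CLUSTER)):
        switch_symlink(CLUSTER, OLD_RELEASE)

What makes this work isn't the scripting; it's that the go/no-go decision is made from measured data rather than from a developer's optimism.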

What’s the catch? I've always wondered what hooks instant deployment has that prevent code rollouts from breaking the database. Turns out that the database is an exception:
Schema changes are done out of band. Just deploying them can be a huge pain. ......  It’s a two day affair, not something you roll back from lightly. In the end we have relatively standard practices for schemas
What about performance problems?
... like ‘this query you just wrote is a table scan’ or ‘this algorithm won’t scale to 1 million users’. Those kinds of issues tend to be the ones that won’t set off an automatic rollback because you don’t feel the pain until hours, days or weeks after you push them out to your cluster. Right now we use monitoring to pick up the slack, but it costs our operations team a lot of time.
How much time? Sounds like they save development time, but the DBAs and operations staff make up the difference?
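The monitoring that picks up that slack probably looks less like the push script and more like a scheduled report. A hedged sketch, with made-up file names and thresholds: compare this week's per-query latency against last week's and flag the queries that quietly doubled.

# Sketch only: flag queries whose average latency has drifted upward since the
# last reporting period. Input files and thresholds are hypothetical.
import csv
from collections import defaultdict

def load_latencies(path):
    # expected CSV columns: query_id, avg_ms (a hypothetical export from the
    # slow-query log or a metrics store)
    latencies = defaultdict(float)
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            latencies[row["query_id"]] = float(row["avg_ms"])
    return latencies

last_week = load_latencies("latency_last_week.csv")
this_week = load_latencies("latency_this_week.csv")

for query_id, now_ms in sorted(this_week.items()):
    before_ms = last_week.get(query_id)
    if before_ms and now_ms > 2 * before_ms and now_ms > 100:
        print(f"{query_id}: {before_ms:.0f}ms -> {now_ms:.0f}ms  (possible table scan or scaling issue)")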

Where does QA fit into this?
....we do have a Quality Assurance staff. There are numerous quality engineering tasks that only a human can do, exploratory testing and complaining when something is too complex come to mind. They're just not in the "take code and put it into production" process; it can't scale to require QA to look at every change so we don't bother. When working on new features (not fixing bugs, not refactoring, not making performance enhancements, not solving scalability bottlenecks etc), we'll have a controlled deliberate roll out plan that involves manual QE checks along the way, as well as a gradual roll-out and A/B testing.
So the 50 rollouts per day is only for bug fixes, performance enhancements and scalability, not for new features or schema changes. QA exists but is not in the direct loop between the developers and production.

No word on security or compliance.

My old school conservatism tells me that the more often you let new code onto your servers, the more often you have an opportunity to shoot yourself in the foot. Continuous deployment looks to me like ready-fire-aim with the gun pointed at your privates. There may be a case for it though, given:
  • a bug-tolerant customer base with a short attention span (the IMVU customers are probably ideal for this)
  • no regulatory or compliance considerations
  • no or minimal security considerations
  • no database schema changes, major upgrades or new features
  • a small, highly motivated development team
  • extremely good, fully automated regression testing
  • fully automated deployment with automated A/B testing, baselining and rollback

Unfortunately, all that detail is lost in the conversation.



Related: Why Instant Deployment Matters and Deployment that Just Works, James @ Heroku

When Security Devices are Exploitable

I can't resist connecting this bit of info from "Security and Attack Surfaces of Modern Applications"

(Via Gunnar Peterson)
So, today’s Firewall is:
  • A Multi-Protocol parsing engine
  • Written in C
  • Running in Kernel space
  • Allowed full corporate network access
  • Holding cryptographic key material
…and still considered a security device?
With "Stealth Router-based Botnet Discovered" (via Cybersec).
...the first known botnet based on exploiting consumer network devices, such as home routers and cable/dsl modems.
When security devices are exploitable…

An ERP Database Conversion

We just finished a major ERP database conversion.

The conversion consisted of the following major tasks.

  1. Migrate a decade of data from 35 RDB databases into a single Oracle 10g database.
  2. Move from an ‘all-in-one’ design where database, batch and middleware for a customer are all on one server to a single dedicated database server for all customers with separate middleware servers.
  3. Merge duplicate person records from all databases into a single person record in the new database.
  4. Point thousands of existing batch, three tier client-server, business logic services and J2EE web services to the new database.
  5. Maintain continuous downstream replication to reporting databases.

Replication

JCC Logminer was used to replicate 600 or so tables from each of the 35 production RDB databases to a merged schema in the new Oracle ERP database. Oracle Streams was used to replicate from the ERP database to a reporting database. The replication was 35 databases into 1 database into 1 database.

    JCC Logminer            Streams
RDB ----------+---->> ERP ------>> Reporting
              |
RDB ----------+
              |
RDB x35 ------+
The replication was established and maintained for over a year while the application was prepared for the conversion. Replication from the RDB sources to the ERP and Reporting databases was normally maintained to within a few seconds. Significant effort was put into building replication monitoring tools to detect and repair broken or lagging replication. Row count reports checked for missing data. During that year, some read-only batch jobs, reports and J2EE functionality were migrated to either the Oracle ERP or reporting databases.
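The row count reports were conceptually simple. Something along these lines (a sketch only; the connections, schema names and table list are placeholders, and the real reports covered roughly 600 tables for each of the 35 sources):

# Sketch of a row-count style replication check: count rows per table on a
# source and on the replicated copy, and report any drift.
TABLES = ["PERSON", "ACCOUNT", "ENROLLMENT"]    # illustrative subset

def row_counts(conn, schema, tables):
    counts = {}
    cur = conn.cursor()
    for t in tables:
        cur.execute(f"SELECT COUNT(*) FROM {schema}.{t}")
        counts[t] = cur.fetchone()[0]
    return counts

def replication_report(source_conn, target_conn, source_schema, target_schema):
    src = row_counts(source_conn, source_schema, TABLES)
    dst = row_counts(target_conn, target_schema, TABLES)
    for table in TABLES:
        drift = dst[table] - src[table]
        status = "OK" if drift == 0 else f"DRIFT {drift:+d}"
        print(f"{table:<12} source={src[table]:>10} target={dst[table]:>10}  {status}")

# usage (any DB-API connections): replication_report(rdb_conn, ora_conn, "PROD", "ERP")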

Preparation

During the year+ that we maintained the database replication, individual screens, forms, reports, batches and middleware components were ported to Oracle and tested against a full copy of the ERP database. Dual code bases were maintained for converted components. Additionally, practicing for the merging of person-records triggered significant data cleanup activities and introduced significant business logic changes that had to be coded and tested.

Testing

We set up three major testing initiatives. The first test was to ensure that the various batch, client-server, middleware and J2EE applications would function with 10g as the backend database. Simple functionality tests were combined with database optimization testing. Each screen, batch job or application had to function as expected and also had to pass a performance test. A package was built that generated an Oracle execution plan and TKProf trace. As developers and testers checked functionality, the package automatically generated Oracle query plans and SQL traces for each function and e-mailed the results back to the developer. This allowed developers and testers to determine if the converted functionality was rationally optimized. In many cases, functionality had to be re-factored to work efficiently on the new database.
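The package itself was PL/SQL; as a rough illustration of the idea only, here is a Python analogue with hypothetical connection details and addresses (and without the TKProf half): capture the execution plan for a statement a tester just exercised and mail it back to the developer who owns it.

# Sketch, not the real package: generate an execution plan for a statement
# and e-mail it to the developer. Credentials, DSN, addresses and the sample
# query are placeholders.
import smtplib
from email.message import EmailMessage

import cx_Oracle    # assumes the Oracle client libraries are available

def plan_for(conn, sql_text):
    cur = conn.cursor()
    cur.execute("EXPLAIN PLAN FOR " + sql_text)
    cur.execute("SELECT plan_table_output FROM TABLE(DBMS_XPLAN.DISPLAY())")
    return "\n".join(row[0] for row in cur)

def mail_plan(developer, sql_text, plan_text):
    msg = EmailMessage()
    msg["Subject"] = "Execution plan for your converted query"
    msg["From"] = "conversion-testing@example.com"
    msg["To"] = developer
    msg.set_content(sql_text + "\n\n" + plan_text)
    with smtplib.SMTP("mailhost.example.com") as smtp:
        smtp.send_message(msg)

conn = cx_Oracle.connect("appowner", "secret", "erpdb")     # hypothetical credentials/DSN
sql = "SELECT * FROM person WHERE last_name = :1"
mail_plan("developer@example.com", sql, plan_for(conn, sql))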

The second testing effort was directed at schema changes required to support merged databases, and at building programs that could reliably merge the person-records (identity) of millions of customers by comparing and merging attributes such as SSN, address, etc. The merge process was tested on full database copies about a dozen times before the merge results were considered accurate.
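As an illustration of the attribute-matching idea (not the actual merge code, which was far more involved), a sketch like the following groups records that agree on a normalized SSN, or on normalized name plus address; each multi-record group becomes a merge candidate for the person-merge process. Field names are illustrative.

# Hedged sketch of identity matching for the person-merge.
import re
from collections import defaultdict

def normalize(text):
    return re.sub(r"[^a-z0-9]", "", (text or "").lower())

def match_keys(person):
    # person is a dict like {"ssn": ..., "last": ..., "first": ..., "addr": ...}
    keys = []
    if person.get("ssn"):
        keys.append(("ssn", normalize(person["ssn"])))
    if person.get("last") and person.get("addr"):
        keys.append(("name+addr", normalize(person["last"]) + "|" +
                                  normalize(person["first"]) + "|" +
                                  normalize(person["addr"])))
    return keys

def group_candidates(people):
    # Group record indexes that share any match key; groups of 2+ are merge candidates.
    groups = defaultdict(set)
    for i, person in enumerate(people):
        for key in match_keys(person):
            groups[key].add(i)
    return [idxs for idxs in groups.values() if len(idxs) > 1]

people = [
    {"ssn": "123-45-6789", "last": "Smith", "first": "Jo", "addr": "100 Main St"},
    {"ssn": "123456789",   "last": "Smith", "first": "Jo", "addr": "100 Main Street"},
]
print(group_candidates(people))   # the two records collide on the normalized SSN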

The third testing effort was directed at simulating load by running critical parts of the application at near anticipated peak loads using load test software and agents. Historical data was used to estimate peak load for a particular function. Oracle MTS was tested, with a final configuration of both dedicated and shared Oracle processes. Transient client connections were configured to use MTS (shared) database connections; persistent J2EE and middleware connections were configured to use dedicated database connections.

Early design decisions (for example, using timestamp instead of datetime) caused a fair amount of bug-hunting on the client-server-to-Oracle connections. Archaic COBOL coding standards introduced cursor issues, network latency issues and Oracle optimization issues. A significant number of Oracle client and Compuware Uniface bugs were detected and eventually either patched or worked around.

Cutover

Cutover and failback checklists were developed and tested. An XMPP chat room was used to coordinate and track the cutover progress and serve as the authoritative record of actions taken (Starting Step 77… … … Step 77 Complete.)

Actual cutover consisted of:

  1. Production shutdown
  2. Replication catch-up and data verification
  3. Database backups
  4. Schema changes to support merged databases
  5. A lengthy person-merge process to merge identities and accounts for persons who exist in more than one database
  6. Code rollouts and configuration changes to point all application components at the new database
  7. Full database backups
  8. Sanity testing and bug fixes for newly uncovered bugs
  9. Re-enable access to the new production environment
  10. Post-production testing and bug fixes

Total down time was about 36 hours. The pre-merge preparation and post-merge testing took about 1/3 of that time. The person-merge and schema changes took half of the time.

Post-cutover

The first business day at production loads uncovered significant performance issues on parts of the application that had not been load tested and a few dozen minor functionality related bugs. Service was not significantly affected by either the performance issues or bugs.

Pre-cutover tension and stress level was high. Once a cutover like this goes live, there is no realistic fall back strategy. The assumption was that failure of this cutover would have been an RGE (Resume Generating Event) for senior IT leadership.

We’ve declared victory. Donuts for all.

Spoke too soon. Damn.

Securing Real Things

Here’s a podcast worth a listen. Brian Contos @ Imperva interviews Joseph Weiss on the topic of control system security.

Quotes:

  • "Running the operational systems comes before any security requirements."
  • "Power [industry] does not take this seriously"
  • "Are we more secure [than 10 years ago]? No."

Notes:

  • Systems that run Windows 95 and NT 4.0 even after the upgrades.
  • Can't patch them without breaking them.
  • Custom-written TCP stacks that crash when you ping them.
  • Field devices with built-in Bluetooth or wireless modems.
  • Bolted-on security. Very old platforms, not designed with security in mind.
  • Two cyber incidents on systems with brand new control systems. Very significant equipment damage.
  • Significant environmental discharge. Three deaths.

I have no particular knowledge of this general topic, other than a couple decades ago I made a living programming CNC machine tools, wrote a textbook on the topic, and occasionally played with PLCs. Oh - and I tried to write a Windows 3.x app that communicated with HART field instruments at 1200 baud. That was pre-internet, pre-almost-everything. You can see from the 30-year-old pic of me that we were hardly advanced past stone knives and bear skins.

Starting from the point of view of someone with almost no knowledge, here are my -

Random thoughts:

A bunch of years ago when I lived in a small town, I woke up one winter morning to a cold house, no heat and no hot water. A bit of investigating and a knock on the door from a city utility worker cued me in to the cause. A couple teenagers thought it would be amusing to climb over a fence and put a big wrench on a big valve and shut off the main natural gas line coming into town. It didn't take long for the underground gas distribution pipes to empty out and run the whole town out of natural gas. This ended up being much more than an amusing prank. The city utility workers had to:

  • shut off and lock the gas meter at every house and business in town. Gas meters in older houses were still indoors, so the workers had to get access to those houses. (a couple of days' work)
  • re-fill the pipeline and city distribution network
  • go back to every house and business, contact the residents, locate every gas appliance in each house, turn on the gas to the house, re-light and test each appliance. (a couple more days' work)

The tasks needed to be done serially. In other words, until all the gas was shut off, no gas lines could be re-pressurized. If memory serves me correctly, I had no gas, and therefore no stove, heat or hot water, for 3 or 4 days in late winter (in Minnesota, where winter is really winter). This was a minor incident, nothing more than a prank gone bad, but one with significant downstream consequences. Safely restoring service after a simple prank took significant resources and disrupted an entire community.

CNC machine tools and industrial robots, most of which are now networked, can be damaged easily by a simple bug in their programming, and when damaged can take weeks to repair (don't ask me how I know…).

Programmable road signs apparently are amusingly easy to re-program.

If a bad guy can embed a trojan at the heart of a payment processor network, then presumably doing the equivalent with the networks that control infrastructure like power, water, chemical, oil and similar facilities shouldn't be much more difficult. The consequences though, could be far worse. Nobody dies when their payment card gets caught up in fraud.

If you are going to declare cyber war on a nation, don't play around with network DDoS's. They are annoying and disruptive, but the damage is transient and whatever service was disrupted is easily restored after the attack. Servers reboot, routes heal, big deal. 'Rebooting' a pipeline, refinery or similar infrastructure is non-trivial, and repairing physical damage caused by disrupting complex control systems is orders of magnitude more expensive and difficult than repairing virtual damage from hacking web sites or DDoS'ing political entities. Crank around on the PLCs that control the valves that mix things together in a municipal water system, refinery or chemical plant and really bad things that hurt real people can happen.

In a recent Schneier post, Bryan Singer commented:

I feel pretty confident in saying that any of us that have been working in this space for any time probably have the knowledge required to stop a significant amount of manufacturing, disable infrastructure, stop utilities, turn off the lights, water, etc without a lot of effort. If we know how to do it, so do the proverbial "bad guys" (or they shortly will).

Firewall Rule (Mis)management

The ISSA released an interesting study on Firewall rule (mis)management[1].

Among their conclusions are: 
  • Firewalls have gotten more complex over time
  • Firewall administrators routinely make errors
  • Firewall administrators are not following best practices
  • Firewall training materials do not focus on management practices

Broken Windows – System Administration and Security

A recent study suggests that the ‘Broken Windows’[1] crime theory might be valid. As reported in the Boston Globe:

“It is seen as strong scientific evidence that the long-debated "broken windows" theory really works—that disorderly conditions breed bad behavior, and that fixing them can help prevent crime”[2]

Does the theory also apply to system administration, security, servers, networks and firewalls? How about application code?

I grew up knowing that you always cleaned and washed your car before you took it to the mechanic. Why? Because if the mechanic saw that you had a neat, well kept car, he’d do a better job of fixing it. I’ve seen that in other places, like when you visit someone with a neat house versus a messy house, or hang out in a messy, smoky bar with cigarette butts and peanut shells on the floor, or a gated community versus a slum. Let’s assume that it’s simply part of human nature.

Quoted in the Boston Globe article:

"One of the implications certainly is that efforts that invest in improving the environment in terms of cleanliness may actually help in reducing moral transgressions because people perceive higher moral standards," said Chen-Bo Zhong, assistant professor of management at the Rotman School of Management at the University of Toronto.

Higher moral standards reduce moral transgressions. Disorderly conditions breed bad behavior.

Does the theory apply to applications, servers, networks and firewalls?

  • If system administrators, developers and network administrators consistently and carefully maintain a system or application by tracking down and cleaning up all the little bugs, error messages and minor day-to-day cruft, will the system perform better, have higher availability and better security?
  • If a firewall rule set is organized, systematic and structured instead of random and disordered, will the firewall administrators pay closer attention to the firewall and be less likely to cause an error that results in misconfiguration or unavailability?
  • If a server has applications and data neatly organized instead of scattered all over the file system, and the root directory has only system files in it instead of random leftovers from past projects, will the sysadmin pay closer attention to configuration, change management and security?
  • If an application's code base is organized, structured and generally neat, will the code maintainers do a better job of maintaining the code?

I speculate that this is true, based only on observation and anecdote.


[1] Broken Windows, George L. Kelling and James Q. Wilson, The Atlantic, March 1982
[2] Breakthrough on Broken Windows, The Boston Globe, Feb 8, 2009

Via: The "Broken Windows" Theory of Crimefighting, Bruce Schneier

Cafe Crack – Instant Man in the Middle

Things like this[1] make me wonder how we’ll ever get some semblance of sanity over the security and identity protection of mobile users.
Cafe Crack provides a platform built from open source software for deploying rogue access points and sophisticated Man-in-the-Middle attacks.
They make it look easy:
Using only a laptop, the attacker can sit unassumingly in a public location to steal personal information. Perhaps the most alarming aspect of this demonstration is that it was accomplished with only a laptop and existing open-source software.
I knew it could be done, but I thought it was harder than that.
There are things that corporations can do, like spinning up VPNs:
However, the good news is that it is just as easy to protect oneself against Man-in-the-Middle attacks on an unsecure wireless connection. By using DNSSEC or VPN services, the user can bypass the attacker and keep their information secure.
But for ordinary users?
In the end, it is up to the user to be knowledgeable and safe around unsecure technology like public wireless.
I think ordinary users don’t have a chance.

Update (07/06/2012): The FBI warned about this type of attack.

[1] Cafe Cracks: Attacks on Unsecured Wireless Networks, Paul Moceri and Troy Ruths