Solaris 11 – The F-35 of UNIX && Name Resolver Lameness

•October 4, 2016

As I worked through some client issues with Solaris, I noticed something disturbing. It seems that Solaris 11 has bifurcated the name resolution. Observe the following:

root@solaris11-test:~# pkg update
pkg: 0/1 catalogs successfully updated:
Unable to contact valid package repository
Encountered the following error(s):
Unable to contact any configured publishers.
This is likely a network configuration problem.
Framework error: code: 6 reason: Couldn't resolve host ''
URL: '' (happened 4 times)

root@solaris11-test:~# host is an alias for has address

Soooooo, the host utility can perform DNS resolution but the pkg command cannot. You should know that /etc/resolv.conf is fully set up on this system. So, what happened? Well, Solaris 11 mostly just trashed any previous concepts of Solaris network configuration. I’d invite you to compare that process with Solaris 8. It seems they have bifurcated the nameservice calls in libc as well. I don’t have the code in front of me, but I’d be willing to bet that pkg isn’t using the traditional gethostbyname() or gethostbyaddr() calls or it would have worked. It’s using some kind of new jiggery pokery that wants you to have gone through the dladm, ipadm, and netadm wringer to configure networking.

So, I went and ran the gauntlet of Cisco-like tools in Solaris 11 and got nameservice (or at least the nsswitch “import”) configured, (yes, again). Then pkg was happy and things worked.
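If you want to check whether two lookup paths disagree on a box like this, here is a minimal sketch. It assumes getent and host are both installed; getent goes through the libc name service switch, while host asks the DNS servers directly. Whether pkg really takes the switch path is my guess, not something I have verified, and the function names are made up for illustration.

```shell
#!/bin/sh
# Look a name up through the libc name service switch (the path that
# gethostbyname()/getaddrinfo() callers take).
resolve_via_nss() {
    getent hosts "$1" >/dev/null 2>&1
}

# Look a name up by asking the DNS servers directly, bypassing the switch.
resolve_via_dns() {
    host "$1" >/dev/null 2>&1
}

# Report which of the two lookup paths works for a given name.
compare_resolution() {
    name=$1
    if resolve_via_nss "$name"; then nss=ok; else nss=FAIL; fi
    if resolve_via_dns "$name"; then dns=ok; else dns=FAIL; fi
    echo "$name: nss=$nss dns=$dns"
}

# Default to localhost when no name is given on the command line.
compare_resolution "${1:-localhost}"
```

A result of nss=FAIL with dns=ok would point at the switch configuration rather than at resolv.conf.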

So, they left both systems active at once? That makes troubleshooting a nightmare, folks. I see it as yet another “screw you Mr. Sysadmin” from Oracle. I’m sure someone will argue this is some kind of feature. However, this isn’t backwards compatibility, or the pkg tool would have also worked. This is just shoddy work done by people with no respect for Solaris’s traditional strengths. You can’t chalk this one up to someone’s “vision.”

No wonder that, during the construction of Solaris 10, Sun decided to isolate the ZFS team and dedicate resources to it, and that Apple, while considering how to either copy or clone ZFS, has done the exact same thing. Had they not, it would undoubtedly have been sucked into the wrongheadedness of SMF and other horrors, since it’d have been poisoned by some MBA who wanted “corporate standards followed to the letter”. It was like Skunkworks building the SR-71 Blackbird. They had to dedicate and mostly isolate those engineers from the rest of Sun (I know people personally who worked on the project) just like Lockheed did with Skunkworks. Otherwise, the same corporate poison would have contaminated them. If you’d like to see what happens to aerospace companies who don’t do that, then just check in on the F-35 and see how one trillion dollars can, in fact, go up in smoke while attempting a boondoggle to distill crap into highly refined crap. SMF was designed by a committee, like the F-35. Oh, so was the OSI protocol stack, which also happens to suck rather badly (at least according to me and the IEEE). Hey, at least all those hours of committee yammering got us the OSI Model, which is about the only usable work product. Anyone running TP4 who wants to fight about that? Didn’t think so…

Solaris 10 & 11 CLI wants to be like Cisco IOS and SMF wants to be like Systemd … but why?

•October 4, 2016

It’s quite obvious with all the new features added to Solaris 10 that Oracle (and probably Sun before them) wants to change the way their users interact with their favored tools on the command line. Previously, say in Solaris 8 (we won’t talk about Solaris 9… it barely existed), you’d have standard UNIX tools which took arguments and switches in the normal fashion. For example:

$ foobar -a -b -z somearg

Now Oracle wants everything to be in what I recognize as “Cisco Style” where there are no flags or switches and every command has to take place inside of a “context”. In this sort of style:

$ foobar sub-command -z somearg

You can see this all over the place. The ZFS toolset (zpool, zfs, etc.) is this way, the virtualization tools are this way (for LDOMs and Solaris Containers), and the networking toolset is also set up this way in Solaris 11 (ipadm & dladm).

My impression of Solaris 10 was “I Love ZFS and I hate SMF”. I love ZFS because it’s about a million times easier to administer than Solstice DiskSuite (later Solaris Volume Manager), which frankly is garbage. I remember having to re-jigger RAID arrays because SDS/SVM didn’t write enough metadata to them to reconstruct metadevices that had been physically moved around. This was back in the Solaris 8 days, but nonetheless, it was way more pain than folks had to endure in LVM or VxVM. ZFS was a great thing and I don’t find the toolkit particularly odious. However, I do have some critiques of the approach in general.

There are two concerns that I have over this new “style” of command line tools. First is that the desire to load one binary up with a million different features and functions is not the UNIX way and is unenlightened. Ever work with the “net” tool on Windows for administering SMB shares and users? It’s a mighty powerful tool, but it has about a zillion different features and functions. Some might argue that having one tool is easy to remember, but I consider that a bogus argument since you’ll still have to remember all the sub-commands; so what difference does it make? The problem with bundling too many features into one tool is that the main logic path of the tool becomes a marathon for the developers to keep up with. New folks on the project have to learn where all the land mines are buried, and the longer the logic path, the more terrain they have to know. That’s why one of the fundamental principles of UNIX is KISS (Keep it Small and Simple). Discard and deprecate UNIX core principles at your peril, corporate kiss-ups. You’ll be lucky to stay in business long enough to regret it.

The other issue is that it’s simply weird looking and non-standard but doesn’t give you anything in return. I teach classes in various flavors of UNIX and Linux and one of the most common annoyances new students cite is how non-standard UNIX commands can be. They see “ps aux” and ask why there is no option specifier (the hyphen). They see commands which create captive interfaces and ask “how can this be used in a pipe” (which of course they can’t). I have to explain that every now and then, either some jerkhole thinks they have “a better way” (wrongly), they come from DOS or Windows and use clueless conventions from those OSs, or they just get plain lazy. This annoys folks to no end. It especially annoys the VMS crowd, who are used to high degrees of regularity in their DCL CLI. They are rarely amused by “ps -ef” vs “ps aux” hand-waving (i.e. SysV vs BSD conventions). Sun/Oracle didn’t make this better with Solaris 10 & 11; they made it far, far worse (and for no good reason).

Lastly, let me give you one more example of why this method is inferior and has real drawbacks. It has to do with the “contextual” nature of the command strings when used in scripts. Consider an example where a resource of some type is created then configured. This happens a lot with tools in Solaris that manage storage objects, network interfaces, CPUs, etc. The old way would involve creating the resource, then using other tools to further configure it. Here is a pseudo-example that creates a widget:

# Make a frobozz widget as an example. Set attribute foo to equal "bar" while we are at it.  
$ mkthing -a foo=bar -n frobozz

That’s not so bad. Now let’s do some further configuration to it. As a thought experiment, say for example that ‘-s’ is to set the state of the frobozz thingy to “active”.

$ modthing -n frobozz -s active

Now consider the same thing but done using the Cisco / Solaris 10||11 way. It would likely look more like this:

$ superthing create-obj -n frobozz
superthing> set context obj=frobozz
superthing.frobozz> set state=active
superthing.frobozz> ..
superthing> save
INFO: Stuff was saved
superthing> quit

At first blush, you might say “Big deal, they are about the same.” However, for those of you who write scripts, ask yourself this: would you rather drive individual command line programs where you can check each one’s status and not have to do screen-scraping, or would you rather have to interact with something Expect-style where you will need to interpret the slave-application’s every response (via parsing its output, not using return values)? Many programming languages can’t do it at all. Some of the more astute defenders of this brand of stupid will say “Wait! You can feed those options on the command line.” Well, yes, most of the time you could. However, I’ve noticed several places in Solaris’s new toolbox where there is no way to do that. Thus, if that facility is purely there by the goodness of their hearts, then the only way you are going to be able to drive things programmatically is if they let you do it! If the commands were discrete (the way $Deity intended) then you wouldn’t have that issue.
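To make the scripting point concrete, here is a minimal sketch of driving the discrete style from a shell script. mkthing and modthing are the made-up pseudo-commands from the example above; they are defined here as stand-in shell functions (my assumption, purely so the sketch runs), and the return codes 1 and 2 are arbitrary choices.

```shell
#!/bin/sh
# Hypothetical stand-ins for the mkthing/modthing pseudo-commands.
# They are shell functions here only so the sketch runs anywhere.
mkthing() {
    # Usage: mkthing -n NAME ; succeeds when a non-empty name is given.
    [ "$1" = "-n" ] && [ -n "$2" ]
}
modthing() {
    # Usage: modthing -n NAME -s STATE ; succeeds only for state "active".
    [ "$1" = "-n" ] && [ "$3" = "-s" ] && [ "$4" = "active" ]
}

# Create a widget, then set its state, stopping at the first failure.
# Each step reports success or failure through its exit status, so
# there is no output to parse.
make_widget() {
    name=$1
    state=$2
    if ! mkthing -n "$name"; then
        echo "create failed for '$name'" >&2
        return 1
    fi
    if ! modthing -n "$name" -s "$state"; then
        echo "could not set $name to $state" >&2
        return 2
    fi
    echo "$name is $state"
}

make_widget frobozz active
```

The caller learns which step failed from the return code alone. With the captive-interface style, the same script would have to read prompts and match message text instead.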

It’s not an issue of old school versus new school. It’s an issue of people who have learned better through experience and pay attention vs those who want to plunge ahead and disregard the work of the past with hubris and lack of planning. I love Solaris’s ZFS and the new IPS system as much as the next guy. However, I think Oracle (and Sun before them) has done an awful lot of damage to Solaris in other places. I think SMF was really a design pattern that Linux’s systemd is now following. Both are showing themselves to be weak and ill-thought-out. Most of my support calls for Solaris are for SMF. They are not coming from old greybeards who hate it, either. They are from non-UNIX folks who are baffled and beaten by it. Just like many of my recent calls for Linux are for systemd issues where it’s either eating CPU for no apparent reason or stymieing users seeking to set up simple startup and shutdown procedures. I don’t give a damn what the market-droid cheerleaders say to spin the issue, SMF was just as bad an idea as systemd. In fact, I’d say it’s even worse because significant portions use binary-opaque sqlite backends (not to knock sqlite, but this is a bad place to put it). The other half of SMF configuration goes into XML files. All this used to be human-readable scripts you could tweak/tune/read individually. They turned a traditional simple text-based UNIX feature into a complex binary opaque nightmare.

The decision for SMF and Systemd (in Linux) was probably made on the basis that developers who use an IDE more than the CLI would be more comfortable editing XML or “unit files” than writing shell scripts. They’d be free from having to learn shell scripting, too. Lastly, they’d have more determinism over startup and shutdown since they’d moved the primary startup logic out of scripts and into their own tool. This allowed folks to parallelize the startup and add other nifty features. The only trouble with these arguments is that they are either fool’s gold (“oh it’s going to be so much easier”) or the features being touted are not tied to the ill-advised/foolish decisions and thus invalid as an argument for implementing them (i.e. there already existed parallelized “init” implementations – it didn’t need a re-write or replacement to do that).

The only valid argument I can see being made is the one about XML and unit-files being more deterministic. However, I think this needs to be put in the context of losing the features available to the user in the shell scripts used for SysV init. In this case you’re dealing with a zero-sum game. What you give to the developers, you take away from the sysadmins.

While there are certainly those who can force us to go along, don’t expect folks to go along quietly. Furthermore, don’t expect them to stay on platforms like Linux and Solaris just because of some perceived momentum or acceptance when they become too onerous to bear anymore. The SMF and the Cisco-izing of Solaris are unwelcome from my perspective. While some Solaris die-hards are still drinking the kool-aid, most Solaris sysadmins have moved on. I personally don’t expect anyone to care when Solaris 12 comes out (if it even gets that far). All the cool features are being taken up and enhanced by more enlightened projects (such as ZFS on FreeBSD). Anyone who isn’t tied to some legacy environment is going to have jumped ship by then. Legacy environments only get smaller and suffer attrition. That’s Oracle Solaris’s future, if you ask me.


You Heard Me: Die SSL. Die Now.

•March 3, 2016

It’s time to dismantle another orthodoxy that will probably annoy a lot of people: Secure Sockets Layer or “SSL”. Let me first say that I completely understand the need for both authorization and authentication. I really do get it. However, I do not believe SSL is the right solution for either one in just about any application.

For me it comes down to these few issues: trust, track record, and design. I believe that SSL has a flawed trust model, an (extremely) checkered track record, and operates on risky design & coding principles.

Let’s first talk about trust. Ask yourself these questions. Do you tend to trust corporations more than people? Do you think, in general, big companies are just a good bunch of folks doing the right things? Are you in the 1% and believe that companies should have the same rights as humans? Do you believe most companies would stand up to the NSA? You see, personally, I do not trust corporations. That is my default position: zero trust. I trust corporations much less than a shady stranger sharpening a knife in a train station.  I also think corporate personhood should be abolished unless of course corporations can be charged criminally just like an actual human when/if they harm or kill someone, which of course, isn’t the case today.

So, do you trust Verisign, Comodo, GoDaddy and friends with information you’d traditionally have put in your safe (i.e. banking info, etc.)? I sure as hell don’t. They are just corporations. They can be bought and sold if their policies don’t suit one corporate puppet master or another. They can be easily bullied by law enforcement to issue a fake certificate. There is absolutely nothing stopping the NSA from walking in the door with a subpoena and demanding a signed cert for any domain they wish (or even a signed signing-cert). There is no way for normal citizens to know if that happened.

Now another point about “trust”. I said I don’t trust corporations and I don’t trust Verisign to pat me on the head and “ensure” that the cert actually belongs to the real company. Crazy me. I don’t trust that one big corporation that got paid to say I should trust another big corporation. Here’s the point on trust: the whole model points to entities I feel I can never actually trust.

Last point on the issue of trust is this. Have you ever been “verified” by one of these commercial certificate authorities? I have. I was working for a very well known and now defunct bank at the time. I paid very close attention to the steps they took to do the verification. I assert my opinion that with a fairly small budget and a small amount of social engineering, their process can be completely derailed. Yes, they make a few phone calls. This security measure is not insurmountable by any stretch of the imagination. You really think even a small group of cops or the well-heeled couldn’t make that happen?

So, moving along, let’s talk about SSL’s track record. Frankly, it’s tough to make any fair comparison since there aren’t any viable alternatives with any track time. However, in my opinion, there is a close relative we can compare with and that’s PGP. It’s older than SSL. It handles encryption, has a trust model, and has seen 20 years or more of usage. No, it’s not a perfect apples-to-apples comparison, but it’s pretty fair nonetheless.

Now we’ve established that the context for comparison is “other encryption schemes that have a tough job to do and have been doing it a very long time”. So, I’ll just say it. SSL has an extremely terrible track record. I don’t really see anyone with credibility trying to argue against this. The best the devil’s advocate could do would probably be to point out the many band-aids that have been applied. Okay, great, it’s all fixed now, right? Uhh, maybe. There were press reports in 2014, denied by the agency, that the NSA knew about the Heartbleed bug for years before it was publicly disclosed.

It’s true that the NSA and any group of spooks would probably (okay, definitely) warehouse as many zero-days as they can in secret, SSL or otherwise. But OpenSSL not only has a checkered past; in my opinion, its software architecture almost ensures that hacks will happen again and again.

Writing secure software isn’t easy, but there are some design guidelines which certainly help you. Let me give you my top three and we’ll examine each in the context of SSL.

  1. Keep it as simple as you can. Complexity is the enemy of security.
  2. Design strength beats design secrecy every time.
  3. Validate any trust you depend on for security.

How does SSL stack up? Well, right up front let’s be clear, they nail the second one. The standards and the implementations are open. So, hey, at least that’s good. However, the first point about the KISS principle is a big problem, since SSL is very complex. I’ve been coding since I was six years old. I am not a crypto expert, but I’ve also done my fair share of implementing cryptographic algorithms in C, which is the same language OpenSSL is written in. My impression of SSL’s level of complexity is somewhere between ridiculous and “to the moon, Alice!” I’m not the first person to ask “How can one ever secure something that complex?” That’s an especially pressing question when one considers the troubling criticisms of the OpenSSL project’s code quality.

What would I propose to replace SSL? Well, I’d say that for one we should completely change the trust model. It needs to be decentralized (eliminating a sweet target for state and corporate actors). In my opinion, trust should be allocated more like it works in PGP. If I trust person X, company Y, and non-profit org Z then I would allocate greater trust to any key all three of them have signed. Also, I’d allow for some greater variability in the way trust is distributed and by whom. For example, if Mozilla puts trust (and embeds it in their browser) in keys signed by the freedom-loving non-profit organizations but Microsoft puts their trust in other big nasty corporations, then it properly reflects the value-system of the users and gives greater choice in exactly who I’m trusting.
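As a toy illustration of that idea (not how PGP actually computes validity), trust could be scored by counting how many of the signers on a key are ones I have personally chosen to trust. The signer names and the trust_score function below are made up, and the sketch assumes a key's signer list contains no duplicates.

```shell
#!/bin/sh
# Signers this user has chosen to trust (hypothetical names).
TRUSTED="person_x company_y nonprofit_z"

# Print how many of the given signers are in the trusted list.
# Usage: trust_score SIGNER...
# Assumes the signer list has no duplicates.
trust_score() {
    score=0
    for signer in "$@"; do
        for t in $TRUSTED; do
            if [ "$signer" = "$t" ]; then
                score=$((score + 1))
            fi
        done
    done
    echo "$score"
}

trust_score person_x company_y nonprofit_z   # key signed by all three
trust_score company_y stranger               # only one trusted signer
```

A browser vendor or an individual would each ship their own TRUSTED list, which is where the variability in who gets trusted comes from.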

The bottom line is this: SSL sucks and must die. Some would say the only way out is through, and we should improve SSL. I disagree. It should be phased out and eventually completely discarded. As an individual you should start distrusting SSL as much as can be borne in your personal situation. I already know that appears to be a long shot these days, but I’m a geek, I don’t care. Like most geeks I’m more interested in building a technical meritocracy rather than continuing to support the corporatist status quo.



Why I Dislike PulseAudio

•February 17, 2016

PulseAudio is unfriendly toward all Unix-like platforms (including and especially Linux). I have had every problem in the book with PulseAudio on Linux. I haven’t the disrespect to defile my NetBSD system with it (perish the thought!). I personally consider it hostile software and counterproductive to even engage with it. It’s not limited to taking up 90+ % of the CPU. It also:

  • Has buffering issues which cause sound to stutter, jitter, and basically sound like crap. Today. Now. On at least two Linux systems I’m forced to use (forced if I want to play these modest games under Linux). One is Ubuntu and the other is Fedora and they are patched to the hilt. It often takes more CPU than the games I play under Linux do!
  • Doesn’t actually support some of the sampling rates it advertises to clients, i.e. insufficient checking of its operational parameters. Then it won’t re-sample, or if it does, does so in non-realtime (i.e. poorly).
  • Almost never works with the hardware mixer on the systems I use. Ie.. changing volume either creates noise, doesn’t work, or mutes the whole channel and makes it UN-adjustable.
  • Skips and jitters in bully-victim scenarios with other system daemons as they try to do (often very little) tasks on the system.
  • Mis-identifies mixer controls (ie.. headphone port is wrong or missing, line out is mismatched etc..)
  • Bugs out or overreacts to sound events like removing a headphone from the headphone jack.
  • Dies with SIGSEGV, then takes some client apps with it (Chrome, Skype, etc.).
  • Often skips or jitters the audio while adjusting volume.
  • Breaks or jitters input streams when adjusting line-in or mic volume.
  • Often has a freak out when one client plays at one rate then another client uses a faster rate. The first (slower) client then goes at the higher rate and you get Alvin and the Chipmunks.
  • Tends to “ruin” other configurations, i.e. it creates dependencies in Ubuntu that (especially recently) make it impossible to uninstall and replace with alternatives without custom compiling packages (firefox depends on it). Another example: you can’t get esound anymore since they force an (inferior and fake) replacement, the pulseaudio-esound compatibility crapware (that shares all of PulseAudio’s issues plus adds a few of its own). You’ll have big problems just going back to ALSA or OSS. Especially with their version of mplayer (where they’ve married it to PulseAudio too closely). I love mplayer. How dare they sully it with this.

PulseAudio is way too over-complicated. It’s some dream of Lennart’s while he was at a rave or something. Maybe he fancies himself some kind of sound engineer. *YAWN* *SHRUG*. Here’s me … wholly unimpressed despite the clubware he sports at conferences. News flash, Lenny, while you are a brogrammer you aren’t a member of The Crystal Method, “G”.

Every single other solution in the same or similar space is MUCH better. I’m thinking Jack, Arts, Esound, etc.. They may not have all the features, but they WORK, generally. I’ve had Pulseaudio on at least 7 machines. It worked acceptably on ONE. Hand waving about “it’s better now” was old 3 years ago. It’s not better, just more bloated.  I have (way) more complaints and buggy, anger inducing experiences with Pulseaudio, but I guess I’ll end here. I consider it one of the biggest PoS parts of Linux overall. It’s like the Kyoto Climate accords. It sucks the air out of the room and provides a very sub-par “solution” (that isn’t) that occupies the space where something much better should be. I doubt anyone is still reading this, so I’ll leave my further issues as an exercise for those who still choose to drink the Lennartix^H^H^H Uhh, I mean Linux kool-aid. I’m an enemy of Lennart’s way of doing things and I’m not the least bit ashamed to say so (long and loud) as I listen to flawless playback via Esound (esd).

PS: Stop writing comments saying it’s all fixed and magic now. It’s not. If you think I’m wrong, wonderful: write your own blog entry about what an idiot I am.

App Store == Crap Store

•January 25, 2012

So not only is Apple bringing the App Store to the desktop OS but now Microsoft is planning to do the same with Windows 8. My response is: meh, *shrug*. 

No, I have not been living in a cave for the last 8 years. However, I don’t give a damn about the app store concept or reality for the same reason I don’t care about software piracy anymore. I don’t need either of them. I have something so much better: Free Software. 

Yep. You read me correctly. Perhaps some folks couldn’t imagine life without hunting up the perfect app to aid in your every activity. However, I have a different viewpoint which contrasts the “app store” “solution” with free software and I can articulate it in bullet point clarity. 

  • Just about all the apps in anybody’s “app store” are complete garbage. Just because the store features a slick format and the device you run it on has a shiny glass screen does not mean the app will be well designed, appropriate, or remotely useful. In contrast, free software is designed by people who had a specific need or idea. If the “market” for free software is smaller, it still exists for a purpose. The authors aren’t trying to get paid, they are trying to scratch an itch. 
  • Free software typically doesn’t make you sign contracts, get spied on, or be used for marketing reasons. There might be a few exceptions, but the ratio isn’t anywhere close with what you’d get from an “app store”.
  • Support for free software is also free. If you must have paid 3rd party support that’s still an option, too. Support for commercial applications is typically an additional cost outside the budget for simply acquiring the software.
  • Documentation for free software is written by people in a hurry who’d rather be writing code. Good. I’m in a hurry, too. Let’s get this application fired up and working pronto and learn the key features post-haste. Give me a readme.txt, online mini-how-to, man page, install.txt, quickstart, or irc channel for a quick setup any day over a corporate professional tech-writer-written document.
  • Free software has had the “app store” concept down for a very, very, long time. It’s called a package repository. You can search for applications, get their descriptions, and automatically install or remove them and all their dependencies without one feature that the “app store” has: the price. 
  • There are no corporate censors of free software package repositories. You are also free to set up your own, for free. 
  •  So, if Microsoft or Apple decide that I don’t really need access to some security related application because only a software terrorist would need to play Postal 2, Grand Theft Auto or some other controversial title then I get the shaft? 
  • Sure, you can still install standalone applications – for now. How long do you think a company like Microsoft will make that easy?


Jump up and down and do three cheers for the app store? I’ll pass. 

NetApp, Spinnaker, GX, CoralFS … and the Pantywaist “Product Managers”

•March 13, 2011

Could Have Been a Fanboy

I’ve been a NetApp customer in one form or another for 12 years. I know their products very well and I work in a large NetApp environment with over 18 filers. My Filers are very busy and some are really scaled out. I know OnTap better than a lot of guys who work at NetApp who are friends of mine. I wish them well, but I feel a bit like a neglected spouse. When are they going to stop patting themselves on the back for doing mainly hardware refreshes and get down to the business of integrating the Spinnaker technology they’ve spent years marginalizing and internally fighting?

The promise of GX – Squandering Spinnaker

When NetApp announced it would buy Spinnaker in late 2003 a lot of execs, market droids, and product managers told the press that they intended to use their scale-out approach to strengthen NetApp’s own product line. Rather than truly integrate it with OnTap they created the “GX” line. They saw the global namespace scale out model as only being appealing to HPC and entertainment markets. They failed to get the feedback that “limiting us to slice-and-dice 32-bit 16TB aggregates really sucks” from their customers. I can tell you that we were saying it. I know I wasn’t the only one. NetApp’s OnTap continued to provide great performance and stability so we hung out and tried to be patient while they promised that OnTap would soon brandish the power of the Spinnaker model without affecting OnTap’s performance and stability.

Fast forward to 2011. They’ve had seven years to integrate the technology and meanwhile lots of other players such as Isilon, IBM SoNAS, Panasas, and others have matured in NetApp’s traditional areas of strength. Those players started off with a scale out model and were not held back by any legacy requirements for backward compatibility or upgrade paths, it’s true. However, NetApp has made huge profits during those seven years. If they’d really wanted to get GX line integrated with the traditional filers, it would have been a done deal.

Why the Hesitation?

Having worked for a lot of large companies I’ve seen similar opportunities wasted. Big companies get political. Infighting and silos often keep them from truly integrating the goals that the company visionaries have. What happened at NetApp? Why have they failed to deliver on promises that sysadmins like me still haven’t forgotten? Well, I wonder, was it:

  • Engineers who were “loyal” to OnTap rejected the Spinnaker approach?
  • Spinnaker engineers were too far from the R&D action and geographically dispersed away from the “old school” OnTap folks?
  • Marketing folks didn’t think they could get customers to understand the scale-out model and thought they’d be accused of being HPC-only or creating Movie-Maker-Filers?
  • Some bad blood and silo wars between camps inside the company?
  • Product managers didn’t have the stones to offer up a truly different next-gen version of OnTap. This is my personal opinion.

I do personally blame their product managers. No matter which of these excuses you favor, it’s ultimately their fault. Having worked with many of them in the past (not at NetApp, mind you), I find that about 80% are incompetent folks who think of themselves as technology gurus, but lack enough skill to “make it” as a line-level geek. If they talk enough, someone gets the idea that maybe they’d be better off putting them in charge of the geeks rather than expecting them to write code or otherwise produce results. The hard truth is that you need to be educated in the school of hard knocks to be a good product manager. Few of them finished their degrees at that prestigious institution and fewer still want to leave once they have. So, it’s rare that they see what needs to be done and simply do it rather than making excuses to drag the product along at a snail’s pace, hoping that if they don’t change things much, nobody will fire them.

OnTap 8.1 – No CoralFS or Striped Aggregates

Okay, we get 64-bit aggregates, which will give us roughly 100TB aggregates. Nowadays, that’s not nearly good enough. Yes, we’ll get a clumsily unified namespace that I still have to manage behind the scenes. It’s too little and too late. Perhaps 8.1.x or 8.2, huh? Wait a few more years? Is this seriously the strategy in the era of 3TB drives and fierce competition from folks who already solved these problems and can match or exceed OnTap’s stability? What’s worse is that 8.1 isn’t offering striped aggregates or CoralFS. This is the WAFL alternative secret-sauce that Spinnaker already had in production 9 years ago. This is the scale-out formula NetApp promised us they would integrate in their press release in November of 2003. Sorry, NetApp, I have a long memory. I was excited by that announcement and hoped my favorite storage vendor was about to get that much better with the introduction of some new blood. I have to admit, I’m still waiting, but without as much hope that they can deliver.

Someone, tell me I’m wrong about 8.1. I’d love to retract my accusations.

Why Not Show Some Leadership?

NetApp, why don’t you fire your product managers and bring in some new folks who can make it happen more quickly? It’s not for lack of cash that you guys have failed thus far. However, you don’t truly fail until you quit trying. So, if I were the CTO, I might consider the following:

  • Screw it. We are bringing CoralFS back into 8.1 and delaying the product launch. We are going to activate the market droids and inform them of the value of doing this. Customers already don’t need convincing that “buckets” (aggregates) are not the ideal approach. Tell the coders, testers, and documentation folks you’ll give them a 10% bonus if they can pull it off by Q4 2011.
  • Flush the whole mess. We’ll freeze 7.3 and you can either buy that on new hardware, or you can buy some kind of OnTap-GX-enabled kit. You can use the same hardware, but you have to upgrade to the new OS. NetApp could provide great deals on swing hardware and re-invigorate their professional services folks to do the heavy lifting instead of trying to figure out the best way to offshore them.   People can take the pain if you can provide a clear path and some clear benefits of doing it.

Either way, promises are getting thin these days; call it the seven year itch.

Linux Virtualization – A Sysadmin’s Survey of the Scene

•January 2, 2011 • 9 Comments

The VMware Question

I’ve done quite a bit of work with VMware ESX the last few years, and even though I have some serious purist concerns about the management toolset, I have to admit that it’s the product to beat in the enterprise these days. VMware is currently the state of the art for manageability and beats the pants off most everyone else out there right now.

Yes, you can run 10-50 virtual machines without any fancy toolset to manage them. I’ve done as much at home using the venerable qemu and nothing more than shell scripts. However, I’ve also worked in environments with thousands of virtualized hosts. When things get this scaled up, you need tools with some serious horsepower to keep things running smoothly. Questions like “What host is that VM on?” and “What guests were running on that physical server when it crashed?” and “How can we shut down all the VMs during our maintenance window?” become harder and harder to answer without a good management toolset. So, before we continue, dig the perspective I’m coming from. Picture 4000 guests running on 150 beefy physical hosts connected to 2-3 SANs across 3 data centers. This is not all that uncommon anymore. There are plenty of hosting companies that run even larger environments. Right now they pretty much all run VMware, and that’s not going to change until someone can get at least close to their level of manageability.
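
To make those three questions concrete, here is a minimal Python sketch of the bookkeeping a management layer has to keep straight. Everything in it (the `Inventory` class, the host and guest names) is made up for illustration; it is not how vCenter or any other product does it, just the smallest thing that could answer the questions.

```python
from collections import defaultdict


class Inventory:
    """Toy guest-to-host index (hypothetical; illustration only)."""

    def __init__(self):
        self.host_of = {}                    # guest name -> host name
        self.guests_on = defaultdict(set)    # host name -> set of guest names

    def place(self, guest, host):
        """Record that a guest now runs on a host (first start or migration)."""
        old = self.host_of.get(guest)
        if old is not None:
            self.guests_on[old].discard(guest)
        self.host_of[guest] = host
        self.guests_on[host].add(guest)

    def where(self, guest):
        """What host is that VM on?"""
        return self.host_of.get(guest)

    def casualties(self, host):
        """What guests were running on that physical server when it crashed?"""
        return sorted(self.guests_on.get(host, ()))

    def shutdown_order(self):
        """Every (host, guest) pair to stop during a maintenance window."""
        return sorted((h, g) for g, h in self.host_of.items())


inv = Inventory()
inv.place("web01", "esx-a")
inv.place("db01", "esx-a")
inv.place("web02", "esx-b")
inv.place("web01", "esx-b")   # web01 gets migrated to another host
```

The hard part at 4000 guests is not the data structure, it’s keeping it true while guests migrate and hosts die. That is exactly the part shell scripts stop being good at.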

The VMware Gold Standard

Well, first let’s talk about what VMware gets right so we can see where everyone else falls. VMware’s marketing-speak is so thick, and they change the names of their products so often, that it’ll make your head spin. So, keep in mind that I’m just going to keep it real and speak from the sysadmin point of view, not the CIO Magazine angle. Here’s what makes VMware so attractive from my point of view:

  • Live Migration (vMotion)
  • Live storage migration (Storage vMotion or whatever they call it)
  • Good SAN integration with lots of HBA drivers and very easy to add datastores from the SAN
  • Easy to setup multipathing for various SAN types
  • The “shares” concept for slicing CPU time on the physical box down to priority levels on the guests (think mainframe domains)
  • Easy to administer one machine (vSphere client) or many machines in a cluster (vCenter)
  • Distributed resource management (DRS) allows you to move busy VMs off to balance the load
  • Supports high availability clustering for guests (HA) and log-shipping-disk-and-memory-writes fault tolerant clones (FT). The latter is something I don’t think anyone else does just yet.
  • Allows you to over-commit both memory (using page sharing in the hypervisor plus ballooning via their vmware-tools guest additions) and disk (using “thin provisioning”)
  • Allows easy and integrated access to guest consoles across a large number of clustered machines (vCenter)
  • Allows easy “evacuation” of hosts. Guests can spread themselves over the other nodes or all migrate to other hosts without a lot of administrative fuss. This allows you to do hardware maintenance on a host machine without taking downtime for the guests.
  • Customers in hosted environments can get access to a web-based console for just their host allowing them to reboot or view stats without getting support involved.
  • Some nice performance statistics are gathered both per-host and per-guest.
  • VMware is very efficient. In my testing the hypervisor only eats about 1-3% of the host’s CPU. The VMs actually get the rest. In some rare cases, they can even perform better than bare metal (like cases where short bursty I/O allows the dual buffer caches of the OS and hypervisor to help out).
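
The “shares” idea from the list above is easier to see with numbers. Here’s a rough Python sketch of proportional-share arithmetic: when the box is contended, each guest gets CPU in proportion to its share value. The function name, guest names, and values are all hypothetical, and this is deliberately simplified. A real scheduler also deals with reservations, limits, and idle guests, none of which are modeled here.

```python
def cpu_entitlement(shares, capacity_mhz):
    """Split contended CPU capacity among guests in proportion to their shares."""
    total = sum(shares.values())
    if total == 0:
        return {name: 0.0 for name in shares}
    return {name: capacity_mhz * s / total for name, s in shares.items()}


# Hypothetical guests and share values; only the ratios matter.
guests = {"prod-db": 2000, "prod-web": 1000, "test-box": 1000}
for name, mhz in cpu_entitlement(guests, capacity_mhz=8000).items():
    print(f"{name}: {mhz:.0f} MHz")
```

With those made-up numbers the database guest is entitled to half of the box and the other two split the rest evenly. Shares only matter when guests are actually fighting over the CPU; on an idle host anybody can burst.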

Why I Want to See Them Fail (VMware)

Want reasons besides the fact that they are a big evil corporation making very expensive non-free proprietary software with very expensive support contracts? Well, let’s see:

  1. They refuse to make a Linux native or open-source client (vSphere client). It also doesn’t work at all with Wine (in fact WineHQ rates it GARBAGE and I agree). Want to see the console of a guest in Linux? Forget it. The closest you can get is to run it inside Windows in a desktop virtualization app like VirtualBox or VMware Workstation for Linux. I’ve also done it via SeamlessRDP to a “real” Windows server. Don’t even leave a comment saying you can use the web-console to view guest consoles. You can’t, period. The web console has about 40% of the functionality of the fat client (and it’s not the most used functionality for big environments) and is good for turning guests on and off. That’s about it. The web console has a LONG way to go. If they do beef it up, I’m afraid they’ll use Java or ActiveX to make it slow and clumsy.
  2. They are removing the Linux service console. Yes, you can still get a Redhat-a-like service console for ESX 4.0 but not for ESXi 4.0. Also, they are planning to move all future releases toward a “bare metal hypervisor” (aka ESXi). Say goodbye to the service console and hello to what they call “vMA”. The latter is a not-as-cool pre-packaged VM appliance that remotely runs commands on your ESXi boxes. Did you like “vmware-cmd” and “esxcfg-mpath”? Well, now you can federate your ESXi servers to this appliance and run the same tools from there against all the servers in your environment. The only problem is that the vMA kind of sucks and includes a lot of kludgy Perl scripts, not to mention that it can’t do everything you might want to do or script up directly on the host machine (it’s not a 100% functional replacement for ESX). The bottom line is that it’s not as Unixy anymore. They are moving toward a sort of domain-specific operating system (ESXi). I know I’m not the only one who will miss the ESX version when they can it. I’m friends with a couple of ex-VMware support folks who told me that they hated getting called on ESXi because it tied their hands. Customers never even knew about the vMA, and frequently support had to wait while the clueless MCSE fumbled through putting together the vMA, wasting time that could have been spent troubleshooting if they’d been using ESX.

Redhat’s Latest – RHEV

Redhat has been making noise lately about its RHEL-based virtualization product offerings. I’ve been wondering when they’d add something to the mix that would compete with VMware’s vCenter. I really hoped they’d do it right. The story so far was that, in order to manage a large cluster of virtual machine host servers remotely from your sysadmin’s workstation, you needed to VNC or XDMCP to the box and run virt-manager (Virtman), or you could use command line tools. Anyone who has seen the level of consolidation and configuration options that vCenter offers to VMware admins would choke, roll their eyes, and/or laugh at those options. I’m a self-confessed Unix bigot and even I know that “option” is a joke. Virtman is extremely limited and can only manage one server at a time.

Okay, so enter RHEV. Ready for this? The console runs on Windows only. Seriously! So you put up with a much less mature virtualization platform and you get stuck with Windows to manage it anyhow. I’ve never run it with thousands of machines, but even with a few it was buggy, exhibiting interface lockups and showing about 60% of what vCenter can do. So, the only real advantage of having a true Unix-like platform to run on gets basically nullified by Redhat pulling this stunt. Do us all a favor, Redhat: sell your KVM development unit to someone with a clue. KVM has some real potential, but it gets lost in the suck of RHEV.

XenServer – Now 0wn3d by Citrix!

Well, I had high hopes for Xen back in their “college days” before they got scooped up by Citrix. Now it’s a bizarre hybrid of an RPM-based distro (though they claim to be moving to a Debian base), a monstrous web-application platform (which isn’t all bad), and a whole lot of abstraction from the metal. My experience with their platform is about a year old and I wasn’t at all impressed. The web GUI had several serious issues like losing track of registered VMs when moving them around. It also had a lot of brain-damaged Java-programmerish crap under the hood. I’m talking about tons of XML files to track VM configuration, naming, and location. Very little was traditional I-can-read-it-and-edit-it-just-fine-without-an-XML-viewer text files or key-value pairs (à la an INI file). This, and the fact that the virtual hard disks have big unreadable hashed names, made popping the hood on XenServer a real mess.

Xen – In the buff on SuSE and Wrappered On Oracle VM Server

Well, SuSE 11 was the last time I played with Xen “in the raw”. Novell would like to sell you this thing called “Orchestrator” to try to give you something more than just a Virtman interface to manage your Xen guests. I watched a demo by the Novell folks for Orchestrator and was not at all impressed. Half the functionality was something they said you’d basically have to script yourself. Well, news flash, Novell: I don’t need Orchestrator to write scripts to manage Xen. It may have changed since I last saw it, but IMHO as a long-time old-school sysadmin it added very little value.

So you want to try to script the management of Xen yourself? Well, it can be done. The problem is that almost all the CLI Xen tools are scripts themselves and are prone to hanging when the going gets tough. I had a fairly large Xen environment for a while and had a ton of problems with having to restart ‘xend’ to get the CLI tools unstuck. When they get stuck and you have cron-driven scripts depending on them, you tend to get a train wreck or, at best, a non-functional script. It’s also very easy to step on your virtual machines when using shared storage. There isn’t any locking mechanism that prevents you from starting the same guest on two separate boxes using NFS or a clustered filesystem on a SAN. You have to use scripts and lock-files to overcome this. If you don’t, you end up with badly corrupted guests. Additionally, the qcow2 format was very badly behaved when I last used Xen. Crashing SuSE 11 virtual servers resulted in more than a few corrupt qcow images. I had one that was a sparse file claiming to be 110TB on a 2TB LUN.
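
For the curious, the lock-file trick looks roughly like this. It’s a minimal Python sketch, not the scripts I actually ran: the function names and the lock directory layout are made up for illustration, and it assumes the lock directory lives on the same shared storage every host can see. The whole thing leans on exclusive file creation (`O_CREAT | O_EXCL`) being atomic, which has historically been unreliable on older NFS versions, so test it on your own storage before trusting it.

```python
import os
import socket


def acquire_guest_lock(lock_dir, guest):
    """Try to claim a guest; return the lock path on success, None if already held."""
    path = os.path.join(lock_dir, guest + ".lock")
    try:
        # O_CREAT | O_EXCL makes the create fail if the file already exists.
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o644)
    except FileExistsError:
        return None
    with os.fdopen(fd, "w") as f:
        # Record who holds the lock so a human can chase down stale ones.
        f.write(f"{socket.gethostname()} {os.getpid()}\n")
    return path


def release_guest_lock(path):
    """Drop the claim once the guest has been shut down."""
    os.remove(path)
```

The wrapper script calls `acquire_guest_lock()` before starting a guest and refuses to start it if it gets `None` back. Note what this sketch does not do: it does nothing about stale locks left behind when a host crashes, so somebody (or some other script) still has to clean those up.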

What about OVM? Well if you want Oracle support I guess you could brave it. I tried it once and found it to be awful. Not only does it have some complicated three-tier setup, it’s also unstable as heck. I had it crash several times before I gave up and looked elsewhere. The GUI is web-based but it’s about as intuitive as a broken Rubik’s cube. You can download it for free after signing away your life to the Oracle Network. I didn’t spend much time on it after the first few terrible impressions I got.

Xen has potential, but until the CLI tools are more reliable it’s not worth it. The whole rig is a big hassle. That was my opinion about a year ago, anyhow.

Virtualbox – Now 0wn3d by Oracle!

Well, no, it’s not an enterprise VM server. If they went down that path, it’d compete with OVM, which is their absolutely horrible Xen-based offering. However, I would like to say that VirtualBox has a few really good things going for it.

  1. It’s fast and friendly. The management interface is just as good as VMware workstation, IMHO.
  2. It does support live migration, though most folks don’t know that.
  3. It has a few projects like VboxWeb that might really bear fruit in large environments.
  4. It doesn’t use stupid naming conventions for its hard disk images. It names them the same as the machine they got created for.
  5. There is a decent set of CLI tools that come with it.

I have some real serious doubts about where Oracle will allow the product to go. It’s also half-heartedly open source. They keep the RDP and USB functionality away from you unless you buy it. For a workstation application, it’s pretty darn good. For an enterprise virtualization platform it might be even better than Xen, but nowhere near VMware.


Let’s keep this short and sweet. Fabrice Bellard is a genius and his work on Qemu (and the hacked up version for Linux called KVM) is outstanding. As a standalone Unix-friendly virtualization tool it’s impressive in its performance and flexibility. However, outside of RHEV (which is currently awful) there aren’t any enterprise tools to manage large numbers of Qemu or KVM boxes. I also haven’t really seen any way to do what VMware and XenServer can do with “shares” of CPU and memory between multiple machines. There is duplicate-page sharing now (KSM), but that’s a long way from the huge feature set in VMware. I have the most hope for the Qemu family, but I really wish there were some great Unix-friendly open source management tools out there for it outside the spattering of immature web-based single-maintainer efforts.

Proxmox VE the Small Dark Knight

There is Proxmox VE, which is a real underdog with serious potential. It supports accelerated KVM and also has operating system virtualization in the form of OpenVZ. Both are manageable via its web interface. It also has clustering and a built-in snapshot backup capability (and the latter in a form that VMware doesn’t offer out of the box). It does lack a lot of features that VMware has such as VMware’s “fault tolerance” (log shipping), DRS (for load balancing), Site Recovery Manager (a DR toolsuite), and the whole “shares” thing. However, considering it’s based on free-as-in-free open-source software and it works darn good for what it does do, I’d say it’s got great potential. I’ve introduced it to a lot of folks who were surprised at how robust it was after trying it.

Virtuozzo and OpenVZ

Virtuozzo is operating system virtualization. It’s limited in that you can only host like-on-like virtual machines. Linux machines can only host Linux and Windows can only host Windows. I have tried it out quite a bit in lab environments but never in production. I was impressed with how high the consolidation ratios can get. If you have small machines that don’t do much you can pack a TON of them on one physical box. It also has a terrific web-GUI that works great in open-source browsers. It has an impressive level of resource management and sharing capabilities, too. It offers a much better individual VM management web-interface than VMware (by far). It also has a lot of chargeback features for hosting companies (their primary users).

I have a friend who worked for a VERY large hosting company and used it on a daily basis. His anecdotes were not as rosy. He told me that folks were often able to crash not only their own box but the physical host server, too. This caused him major painful outages. I didn’t like hearing that one bit. However, I have definitely seen VMware and even Qemu crash hard. I’ve seen VMware crash and corrupt its VMs once, too (in the 3.x days). That was painful. Still, I wouldn’t take such stories about Virtuozzo lightly. Another negative was their pricing. The folks at Parallels were quite proud of the product and the pricing wasn’t much better than for VMware. You’d think they’d want the business *shrug*.


There’s a nice panoply of choices out there now, but nobody is really giving VMware a run for their money outside of niche areas like OS virtualization. I’d love to see something like Proxmox take off and give VMware some headaches. I’d also like to see a much higher level of Unix-friendly focus from the big boys. We aren’t all MCSEs and lobotomy recipients out here in sysadmin land, and a few decent Unix tools on the CLI and native-GUI front would be well received from the non-VMware players. I know it’s about market share, but it doesn’t excuse moves like the Windows-only management for RHEV stunt (*disgusted*). Here’s hoping the future for the free, open, and clueful VM platforms is brighter.