NetApp, Spinnaker, GX, CoralFS … and the Pantywaist “Product Managers”

Could Have Been a Fanboy

I’ve been a NetApp customer in one form or another for 12 years. I know their products very well and I work in a large NetApp environment with over 18 filers. My Filers are very busy and some are really scaled out. I know OnTap better than a lot of guys who work at NetApp who are friends of mine. I wish them well, but I feel a bit like a neglected spouse. When are they going to stop patting themselves on the back for doing mainly hardware refreshes and get down to the business of integrating the Spinnaker technology they’ve spent years marginalizing and internally fighting?

The promise of GX – Squandering Spinnaker

When NetApp announced it would buy Spinnaker in late 2003 a lot of execs, market droids, and product managers told the press that they intended to use Spinnaker’s scale-out approach to strengthen NetApp’s own product line. Rather than truly integrate it with OnTap they created the “GX” line. They saw the global namespace scale-out model as only being appealing to HPC and entertainment markets. They failed to get the feedback that “limiting us to slice-and-dice 32-bit 16TB aggregates really sucks” from their customers. I can tell you that we were saying it. I know I wasn’t the only one. NetApp’s OnTap continued to provide great performance and stability so we hung out and tried to be patient while they promised that OnTap would soon brandish the power of the Spinnaker model without affecting OnTap’s performance and stability.

Fast forward to 2011. They’ve had seven years to integrate the technology and meanwhile lots of other players such as Isilon, IBM SoNAS, Panasas, and others have matured in NetApp’s traditional areas of strength. Those players started off with a scale-out model and were not held back by any legacy requirements for backward compatibility or upgrade paths, it’s true. However, NetApp has made huge profits during those seven years. If they’d really wanted to get the GX line integrated with the traditional filers, it would have been a done deal.

Why the Hesitation?

Having worked for a lot of large companies I’ve seen similar opportunities wasted. Big companies get political. Infighting and silos often keep them from truly delivering on the goals that the company visionaries have. What happened at NetApp? Why have they failed to deliver on promises that sysadmins like me still haven’t forgotten? Well, I wonder, was it:

  • Engineers who were “loyal” to OnTap rejected the Spinnaker approach?
  • Spinnaker engineers were too far from the R&D action and geographically dispersed away from the “old school” OnTap folks?
  • Marketing folks didn’t think they could get customers to understand the scale-out model and thought they’d be accused of being HPC-only or creating Movie-Maker-Filers?
  • Some bad blood and silo wars between camps inside the company?
  • Product managers didn’t have the stones to offer up a truly different next-gen version of OnTap. This is my personal opinion.

I do personally blame their product managers. No matter which of these excuses you favor, it’s ultimately their fault. Having worked with many of them in the past (not at NetApp, mind you), I find that about 80% are incompetent folks who think of themselves as technology gurus, but lack enough skill to “make it” as a line-level geek. If they talk enough, someone gets the idea that maybe they’d be better putting them in charge of the geeks rather than expecting them to write code or otherwise produce results. The hard truth is that you need to be educated in the school of hard knocks to be a good product manager. Few of them finished their degrees at that prestigious institution and fewer still want to leave once they have. So, it’s rare that they see what needs to be done and simply do it rather than making excuses to drag the product along at a snail’s pace, hoping that if they don’t change things much, nobody will fire them.

OnTap 8.1 – No CoralFS or Striped Aggregates

Okay, we get 64-bit aggregates, which will give us aggregates of roughly 100TB. Nowadays, that’s not nearly good enough. Yes, we’ll get a clumsily unified namespace that I still have to manage behind the scenes. It’s too little and too late. Perhaps 8.1.x or 8.2, huh? Wait a few more years? Is this seriously the strategy in the era of 3TB drives and fierce competition from folks who already solved these problems and can match or exceed OnTap’s stability? What’s worse is that 8.1 isn’t offering striped aggregates or CoralFS. This is the WAFL alternative secret-sauce that Spinnaker already had in production 9 years ago. This is the scale-out formula NetApp promised to integrate in their press release in November of 2003. Sorry, NetApp, I have a long memory. I was excited by that announcement and hoped my favorite storage vendor was about to get that much better with the introduction of some new blood. I have to admit, I’m still waiting, but without as much hope that they can deliver.

Someone, tell me I’m wrong about 8.1. I’d love to retract my accusations.

Why Not Show Some Leadership?

NetApp, why don’t you fire your product managers and bring in some new folks who can make it happen more quickly? It’s not for lack of cash that you guys have failed thus far. However, you don’t truly fail until you quit trying. So, if I were the CTO, I might consider the following:

  • Screw it. We are bringing CoralFS back into 8.1 and delaying the product launch. We are going to activate the market droids and inform them of the value of doing this. Customers already don’t need convincing that “buckets” (aggregates) are not the ideal approach. Tell the coders, testers, and documentation folks you’ll give them a 10% bonus if they can pull it off by Q4 2011.
  • Flush the whole mess. We’ll freeze 7.3 and you can either buy that on new hardware, or you can buy some kind of OnTap-GX-enabled kit. You can use the same hardware, but you have to upgrade to the new OS. NetApp could provide great deals on swing hardware and re-invigorate their professional services folks to do the heavy lifting instead of trying to figure out the best way to offshore them. People can take the pain if you can provide a clear path and some clear benefits of doing it.

Either way, promises are getting thin these days; call it the seven-year itch.

~ by aliver on March 13, 2011.

9 Responses to “NetApp, Spinnaker, GX, CoralFS … and the Pantywaist “Product Managers””

  1. I need to be careful with what I say on this since I’m very much under NDA and don’t want to disadvantage Netapp by revealing information that I shouldn’t. For that reason, I’ll keep this as high level as I can and based on public information.

    Netapp have 2 customer bases – their traditional 7G storage customer and their scale out GX customers. Their challenge is that they have to service both customer bases at the same time. Us GX customers see all the advances they’re making and want to benefit from them; We’re all demanding feature parity as soon as possible.

    Their use of 2 code bases will hinder that, so their goal of moving toward a single code base is the correct approach.

    However, doing that whilst retaining duplicate components isn’t much better than not doing it at all (i.e. in 8.0 and possibly future releases there is much duplication: network stacks, replication engines, etc.).

    The GX architecture is fundamentally different from 7G so the work of unpicking the network vs data access interdependencies of 7G and moving towards a split architecture like GX is no easy task.

    One thing Netapp did do with GX is effectively replace the data part of SpinFS with WAFL (which actually appears to be a cut down version of OnTAP). I’d guess the networking portion is the major amount of work and the reason it’s taken as long as it has for 8.1 and ultimately (and much later) a converged product, and that’s not even mentioning their move to FreeBSD.

    So, back to my original point – bells and whistles such as files striped across multiple nodes are all well and good, but not at the expense of converging the products.

    Coral from Spinnaker has some fundamental flaws that would have needed fixing to make it genuinely competitive. For example, not striping the metadata across multiple nodes/vols creates a performance bottleneck, which could actually mean that in some cases you lose performance by using it. Also, with no resilience in the striped data you could lose 2 nodes in a 24 node cluster and be dead in the water.

    Netapp will no doubt highlight those issues and ask whether waiting even longer is worth its inclusion in their near term roadmap. Even when fixed, you’d then need to think about how to replicate, backup, dedup, clone those volumes. Would the extra manpower required to solve those problems be worthwhile? In the short term, I don’t think so.

    Netapp need to get a product out of the door which brings scale out clustering and their current 7 mode feature set to market as quickly as possible to show that the Spinnaker acquisition actually added value and that they can continue to innovate on a single and well designed platform.

    The rest can follow. Admittedly, they need to make sure transitioning to any new volume type is done online to make sure us customers don’t pay for their lack of development speed.

    • Great info. I wasn’t aware of the GX/CoralFS weaknesses that you highlight. That actually makes it seem like the situation is worse than I believed at first. The lack of striped/distributed metadata is particularly concerning. This is the same complaint I have with Lustre and pNFS. The fact that OneFS has solved this is a big plus for them. On the open source side, I’m interested to see what HAMMER (DragonFly BSD) and Ceph end up doing with their metadata when they reach maturity.

  2. I hate it when people do this, as I’m sure it’s a result of typing too fast but 128bit aggregates? You mean 64bit, right?

  3. OK, further depression then somewhat of a green shoot from the wreckage:

    Coral:

    1) No data resilience other than that offered within a pair of nodes; RAID DP and a pair of filers.

    2) No striped metadata

    3) No ability to restripe – just restriping ‘lite’. Adding nodes and additional constituents is possible but existing data doesn’t get restriped across them – that’s just for new data.

    4) Uses the backend cluster network to access most of the data within a stripe (presumably until pNFS). At the moment that’s ethernet compared to Infiniband of other solutions with the associated increased latency/reduced bandwidth.

    5) Caching at the data layer so you’ve got to cross the cluster network to get to remote caches (assuming no Flexcache in front of the cluster). The caches are also independent so a file could be in-cache on some nodes and not in others.

    6) No ability to mirror/replicate/DR striped vols

    Depressed yet? However, those facts relate to a very old version of their technology. Did they fix any of those issues before deciding not to include the next generation iteration of this in 8? I suppose only Netapp and a lucky few with inside information would know. However, if they did then the gap may not be quite so large as it appears against others such as Isilon.

    Speaking of which – can Isilon stripe files across nodes? The answer appears to be ‘some of the time’. I don’t pretend to understand the implementation details. However, there is obviously a challenge in having no RAID on a node and a block size of 8k. How do you store a 3 byte file when you have a 3 node cluster? My understanding is that <128k files aren't stored too efficiently; data gets mirrored rather than striped, so you'd end up consuming 16k on disk for that 3 byte file with only minimum resilience (i.e. mirrored). If you need N+2 or above, it's going to cost you….
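The small-file arithmetic above can be sketched in a few lines of Python. This is only a rough model of the commenter's description, not of how OneFS actually lays out data: the 8k block size and the <128k mirroring cutoff come from the comment, while the function name, the default of two copies, and the choice to ignore inode and metadata overhead are assumptions made for illustration.

```python
import math

BLOCK_SIZE = 8 * 1024          # 8k block size, as stated in the comment
SMALL_FILE_LIMIT = 128 * 1024  # files below this are described as mirrored, not striped


def mirrored_on_disk_bytes(file_size: int, copies: int = 2) -> int:
    """Estimate on-disk bytes for a small mirrored file (hypothetical model).

    Assumes every file occupies at least one whole block per copy and
    ignores inode/metadata overhead entirely.
    """
    if file_size < 0 or copies < 1:
        raise ValueError("file_size must be >= 0 and copies >= 1")
    if file_size >= SMALL_FILE_LIMIT:
        raise ValueError("this model only covers files under 128k")
    blocks = max(1, math.ceil(file_size / BLOCK_SIZE))
    return blocks * BLOCK_SIZE * copies


if __name__ == "__main__":
    # A 3 byte file with two mirrored copies: one 8k block per copy.
    print(mirrored_on_disk_bytes(3) // 1024, "k on disk")
```

Under these assumptions each extra level of protection adds another full 8k block for the same 3 byte file, which is the "it's going to cost you" effect described above.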

    You then consider the challenge of deduplication across Isilon nodes. It's a much larger challenge without any constraints on file system size and no subdivision by volumes/aggregates. Can they do it? I'm sure they can, eventually. Netapp were first to market with ASIS so they have lots of experience. They haven't delivered a cross-node (or even volume) solution yet. Maybe they know something the others don't?

  4. Isilon does stripe files across nodes. Check this out: http://www.isilon.com/sites/default/files/tracked_assets/OneFS_Overview_final%2008.10.pdf

  5. I’ve struggled with NetApp the last couple years in general for similar reasons – there’s a lot of promise in what they show, but a lot of stumbling blocks to get there. I’m making more investments with EMC instead – good to keep everyone on their toes 🙂

    Isilon has been interesting – but not terribly price-competitive.

    • Hey Steve!

      You are either getting a great deal on NetApp, a lot better at layout on your aggregates, or getting gouged on Isilon. For X-series gear I’m paying around $1k/TB. For S-Series it’s about $2.5k/TB or so. For NetApp I’m up around $4.5k/TB on DS4243 based SATA and vulgar amounts of money for SAS. My current environment has about 7 clusters (mostly FAS 6000 series). Even if I could get them both on price parity, the WAFL reserve, snap reserve, and “storage buckets” effect pushes my usable space on NetApp just under 50%. On my Isilons I’m getting 90% because there is no snap reserve, the OneFS recommended fill level is 90%, and there are no “buckets” (à la 16TB 32-bit aggregates).
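To put the prices and usable-space figures above on one footing, here is a back-of-the-envelope sketch in Python. It assumes the quoted $/TB numbers are per raw TB (the comment does not say), rounds "just under 50%" to 0.50, and uses a made-up helper name; treat the results as rough illustrations rather than real quotes.

```python
def cost_per_usable_tb(price_per_raw_tb: float, usable_fraction: float) -> float:
    """Effective price per usable TB, given a raw price and the usable fraction."""
    if not 0 < usable_fraction <= 1:
        raise ValueError("usable_fraction must be in (0, 1]")
    return price_per_raw_tb / usable_fraction


# (price per raw TB in dollars, usable fraction), taken from the figures above.
# Treating the prices as raw-capacity prices is an assumption.
quotes = {
    "NetApp SATA (DS4243)": (4500.0, 0.50),
    "Isilon X-series": (1000.0, 0.90),
    "Isilon S-series": (2500.0, 0.90),
}

if __name__ == "__main__":
    for name, (price, usable) in quotes.items():
        print(f"{name}: ${cost_per_usable_tb(price, usable):,.0f} per usable TB")
```

On these assumptions the gap widens once usable space is factored in: the roughly 4.5x raw price difference between the NetApp SATA shelves and the X-series becomes about 8x per usable TB.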

      That being said, I inherited a lot of stupid practices. When I first started this gig everything was thick. Volume guarantee was on everywhere, LUN space reservation was fully committed across the board, ASIS wasn’t on anywhere, and everything was created in its own volume (killing dedupe). These guys were stuck in 1999. It’s been pretty hard to steer the beast that is our storage environment to better practices, but we have made a lot of progress. Some of the old practices are hard to shake (like everything being in a separate volume) because of how a hosted company has to segregate everything and the fact that we use accursed vfilers that make everything more painful.

      All in all, Isilon is delivering on the promise that NetApp made in 2003 better than NetApp is, but YMMV. Not everyone has the same requirements as I do.

  6. […] purchased Spinnaker with the promise of integrating clustering technologies into its core platform. This practitioner sums up the frustrations we’ve heard time and time again from NetApp […]
