Bad Experiences with Sun 7410 Unified Storage Appliance Filers

If you want the short version: run away screaming as fast as you can. You will find all kinds of magazine reviews for the unified storage line that includes the 7410 as its flagship. You will see Fishworks developer blogs at Sun telling you that you can get insanely high speeds from these filers, and you will see lots of slick marketing for them on Sun's website. Let me, the guy who's worked with eight of them over a period of about six months (very close to their release), provide probably the only voice of contention you are going to encounter during your googling on these turds. I'll give a rundown of the 7410's features and then we'll cut the crap and talk real.

  1. Use of ZFS as a back-end storage filesystem and all the associated benefits that come with it: storage pools, snapshots, compression, good performance, RAID-Z (RAID 6), volume-management-like capabilities, replication, and self-healing. (A quick command-line sketch of these follows after this list.)
  2. Use of commodity SATA disk drives. In my case, simple Seagate 1TB disks with no custom firmware or EMC-like microcode crap to keep you from replacing them with off-the-shelf (OTS) disks.
  3. Multi-path SATA JBODs and LSI SAS controllers that connect to SAS directors on the back end of the JBODs. Sounds great, right?
  4. Use of standard Sun Galaxy-class servers as heads, thus ensuring that as newer servers come out and the "Fishworks" filer software is ported to them, you can get better performance.
  5. A GUI even a Windows MCSE could use, offering a lot of very pretty analytics that cover some actual real-world usage scenarios.

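For anyone who hasn't touched ZFS, here's a minimal sketch of what those features look like from a plain Solaris shell. The pool, disk, and filesystem names are made up for illustration, and on the 7410 you never type these directly; the appliance kit drives ZFS for you behind its own CLI and GUI.

    # create a double-parity (RAID 6-style) pool from a handful of commodity disks
    zpool create tank raidz2 c2t0d0 c2t1d0 c2t2d0 c2t3d0 c2t4d0 c2t5d0

    # carve out a filesystem, enable compression, and take a snapshot
    zfs create tank/projects
    zfs set compression=on tank/projects
    zfs snapshot tank/projects@nightly

    # scrub the pool to exercise the self-healing checksums, then check its health
    zpool scrub tank
    zpool status tank
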
Now I must admit that all that sounds great. In fact, it is great. The filers do, in fact, have these features, and they do work out of the box. They don't nickel-and-dime you like NetApp or EMC over features like replication or compression, and the price is very competitive compared to NetApp, and especially compared to EMC or HP storage.

Now for the bad news

The 7410 has endemic instability problems and a terrible internal design that will probably ensure they stay that way.

  1. They crash more or less constantly. I'd like to say it was a problem localized to one set of filers we've used. However, it's continuous and chronic and happens to all 7 of our filers. We've filed many novel bugs with Sun, on everything from GUI lockups (which nearly always coincide with CLI lockouts and kill your ability to administer the filers) to old-fashioned kernel panics with all kinds of nice ZFS calls in the backtrace. These bugs are repeatable and constant.
  2. Their interface is so painfully slow and inefficient that it can cause problems of real magnitude. The GUI can lock up the CLI? Check. The GUI is full of CSS, JavaScript, and DHTML issues? Check. The CLI hangs and freezes on simple operations like showing the network and storage configuration? Check.
  3. The command-line interface is written in JavaScript. Form your own opinions on that one.
  4. Cluster join, failover, and rejoin times take FOREVER compared to the competition. The fastest failover I've ever seen is 4.5 minutes, and that's with the minimum number of disks (48). Add more disks and it's even slower. Not to mention that if the cluster actually manages to fail over without locking up, you can count yourself very fortunate. Kind of defeats the whole point of having a cluster at all, wouldn't you say, Sun?
  5. Simple operations in the GUI can crash not only an individual filer but the whole cluster, too. I've had it crash due to a simple network reconfiguration or a storage rebuild. How about a crash from stopping replication, or both filer heads in a cluster crashing while trying to fail over? Yep. I've seen all of that, many, many times.
  6. They had the bright idea of using the Solaris Express (beta) code instead of the mainstream Solaris 10 codebase.
  7. The whiz-bang analytics are very often simply wrong. I've compared sniffer output and nfsstat results to what the analytics say, and it's as simple as this: they lie. (A rough sketch of that kind of cross-check follows after this list.)

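To give a sense of the kind of cross-check I mean, here's a rough sketch using standard Solaris tooling; these aren't the exact commands I ran, and the interface name and capture file are just examples.

    # NFS client-side operation counts; sample before and after the test run
    nfsstat -c

    # capture the NFS traffic on the wire for an independent count
    snoop -d e1000g0 -o /tmp/nfs.cap port 2049

    # summarize the capture afterwards and count the NFS packets
    snoop -i /tmp/nfs.cap | grep -c NFS

When the client-side counters and the packet capture agree with each other but not with the pretty graphs, it isn't hard to decide which one is lying.
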
This product seems to have become a victim of the Solaris 10 mentality that what's been working for Unix for the last 40 years is all wrong and broken. We need XML config files, JavaScript-coded core applications, and GUIs these days, right? Wrong. This is an enterprise product, and it's made as if it's going to be run by 5th graders. The marketing wants your manager to believe that buying this thing will let him reduce his headcount of SAN guys. Sorry, but the gear has to actually stay running before you can do that. When you predicate a storage appliance on XML, JavaScript, and other web toys for the core functionality, and not just the GUI, you're asking for trouble. These guys should have taken a lesson from NetApp and followed the KISS principle, using ZFS to beat them at their own game. Instead, I'm left wondering how I can make excuses not to deploy anything on these boat anchors, which crash so often (often kernel panics, not just interface lockups) that customers are blaming me for data corruption (due to the crashes) and the general instability of the system. Had I had anything substantive to do with the selection of these units, I'd have said "No thanks, Sun."

~ by aliver on June 20, 2009.

9 Responses to “Bad Experiences with Sun 7410 Unified Storage Appliance Filers”

  1. While you certainly make valid points, we have two pairs of 7410Cs.

    The Q3 release fixed almost all of the bugs you mentioned. Also, cluster failover with six full shelves is now done in under a minute…

    Most of the problems lie within the appliance kit software, which is brand new. IMHO Sun had to push the product to market too early, but overall, I like the product much more than our NetApp boxes.

  2. Still seeing multi-minute failover times on my 7410 and 7310 filers. I can't wait to see them on the loading dock heading back to Sun. We've gotten rid of our roadblock CTO, who was in bed with Sun, and that's cleared the way for Isilon and NetApp. All hail HR. I've spent years with Solaris, and though I'm not a huge fan of the OS, it was always very stable ('cept for the eCache days). With the Fishworks gear, it feels like they went out of their way to trash their own reputation. SMF and the rest of the junk in Solaris 10 was already doing that well enough. I guess they weren't satisfied. I hope Oracle buries them before they start to smell.

  3. The 7410s are junk – we crashed and lost all of our systems on that storage within 30 days of deployment. Stay away!!!

  4. Here’s another interesting thread about the 7410s. Overall, it sounds pretty negative.

    http://forums.sun.com/thread.jspa?threadID=5392327

  5. Can anyone comment on the stability of the S7320 Storage Appliance midrange device? This would be NFS protocol, using up to 70 TB of raw capacity out of the rated 192 TB.

    • So far, the one 7320 I've had a chance to play with has been less than impressive. I can do faster NFS from my old Dell laptop running Solaris 10. The filesystems lock up all the time on both Linux and Solaris hosts, for no logical reason and with no information on the server itself. If the choice were mine, I would go with Nexenta, from whom you can at least get support. Or, if the budget is there, go with a NetApp.

  6. I don't know which is worse: all the business fxxxed up by this piece of junk, or Sun (Oracle) telling us that it will be fixed soon. Oracle, please…stay in the DB area and never cross the line into the hardware world again…NEVER!

    • Obviously the 7410 series was a product of Sun…one that, from all appearances, was rushed out the door prior to the sale.

      Like everyone here, I've been dealing with similar issues since these 7410s came out. While I like the "ease" of configuration and the analytics (accurate or not, they're without comparison to other vendors), there are so many gotchas around the entire product. The issues surrounding ZFS, and Sun/Oracle not really taking ownership of the applications/services their boxes provide (i.e., the hatchet-job SMB implementation), are very far from enterprise class.

      Last week, we had to add an FC card to our NetApp 3250. We did a failover and only lost a single ping. No services were impacted. At the same time, we were replacing an HBA on one of the 7410c heads. We failed it over and ended up with no resources for about 30 minutes. Management of the second head was lost, and the first head is still shut off as we work through the issues. I will admit that our 7410c systems were wired up incorrectly by Sun when they were originally installed…a configuration we've been following ever since. That's our bad, and it could be the cause of some of the failover issues. Proper wiring should be verified first:
      http://docs.oracle.com/cd/E19548-01/pdf/821-1386.pdf (Page 57)

      While I truly appreciate the hard work and dedication of their support engineers, it has to be a dead-end proposition for them trying to support these storage systems.

      Sometimes you do get what you pay for.

  7. We have a pair of 7410s in a cluster with two full disk chassis. Pile…Of…Crap.

    In the last 12 months, we too have had:

    * Deleting snapshots in a pool crashes the hosting filer; it fails over to the other filer and crashes that one too. Yay!

    * We are on a shared shell with Oracle right now for this one: both Readzilla disks (solid-state drives configured as cache for the pools) have been marked as Removed by ZFS, but this has not triggered any alerts or warnings on the system. The only reason it was noticed was that the engineer happened to run a zpool status, which, under the support contract, we're not allowed to do ourselves. The engineer was in to look at protecting us against the problem above, since it's a bug that hasn't been fixed.

    * Firmware upgrade –> kernel panics

    * On a more general note with Oracle: the last four replacement parts, including disks, arrived DOA – including a DOA motherboard for one of the filers.

    It's gotten to the point now where we may as well just raise a call with Oracle about this storage appliance as soon as we get any problem with any system that uses it for storage. 9 times out of 10, it'll be the storage.
