How not to provide robust clustered storage with Linux and GFS

(The title is a bit strong, on purpose)

LWN links to an article describing how to provide robust clustered storage with Linux and GFS.

While explaining how to set up GFS can be nice, the premise made me jump.

The author writes:

Load balancing is difficult; often we need to share file systems via NFS or other mechanisms to provide a central location for the data. While you may be protected against a Web server node failure, you are still sharing fate with the central storage node. Using GFS, the free clustered file system in Linux, you can create a truly robust cluster that does not depend on other servers. In this article, we show you how to properly configure GFS.

In case you don't know, GFS is not exactly "clustered storage". It is more of a "shared storage" system: you have a single storage array, with several clients accessing it. Compared to the NFS case, where you have one central server for the data, you have one central storage array. But what is a storage array, except a special (and expensive) kind of server? You supposedly don't depend on other servers, yet you still depend on one? How is that supposed to be different?

Conceptually, a clustered file system allows multiple operating systems to mount the same file system, and write to it at the same time. There are many clustered file systems available including Sun's Lustre, OCFS from Oracle, and GFS for Linux.

OCFS and GFS are in the same class of file systems, but Lustre is in a different league altogether, and would, actually, provide a truly robust cluster that does not depend on other servers. Lustre is a truly clustered filesystem that distributes data across several nodes, such that losing some of them doesn't make you lose access to the data.

Given the premise stated by the author, and considering he lists Lustre as an example, I would actually have preferred an article about setting up Lustre.

2009-04-07 20:29:04+0900

miscellaneous, p.d.o


13 Responses to “How not to provide robust clustered storage with Linux and GFS”

  1. Np237 Says:

    Lustre doesn’t solve this problem at all; it only moves it to the MDS and OSS – much like NFS does.

    Each MDT or OST is a filesystem on local storage, and this storage has to be shared between two MDSes or OSSes which act as failovers. And failover is always less reliable than a “pure” distributed filesystem like OCFS2 or GFS.

    Lustre is a filesystem for performance, certainly not for reliability – although its reliability has much improved over the years.

    If what you are looking for is real reliability, you should try a pair of NetApps, or something of the sort. Not really fast, very expensive, but with true 99.9999% availability, 365 days a year.

  2. Andre Felipe Machado Says:

    Hello,
    GlusterFS has some interesting plugins and flexibility that allow high availability.
    Its design, using each host’s (brick’s) own filesystem, is very clever.
    And simple.
    It deserves a closer look.
    http://www.gluster.org
    Regards.

  3. Marcos Dione Says:

    It’s also incredible that NFS is catalogued as a “Distributed File System” in several places, like the Wikipedia page on distributed filesystems[1], and not as a simple network file system, much closer to SMB. The lack of differentiation between these categories (network file system, shared storage, distributed file system, etc.) has led several research groups to deploy the wrong filesystem in their clusters.


    [1] http://en.wikipedia.org/wiki/List_of_file_systems#Distributed_file_systems

  4. glandium Says:

    Np237: The point is: OCFS2 and GFS are not distributed. And you can certainly use your existing servers to act as OSTs and MDTs (you can with NFS, too, but that would still be a SPOF).

  5. Np237 Says:

    You’re all mixing up (and I am too, a bit) “distributed file system” with “parallel file system”. NFS and CIFS are called distributed file systems (that’s bad terminology, but I didn’t invent it); however, they are not parallel. OCFS2, GFS and Lustre are all parallel filesystems, but Lustre has a client/server model while OCFS2 and GFS don’t.

    Anyway, my point has nothing to do with terminology. If you think you will have more reliability with Lustre, you’re fooling yourself. Lustre uses a number of MDTs and OSTs, each of which is a SPOF. The MDSes and OSSes that serve these targets are all made redundant, but in the very same way an active/passive cluster of NFS servers accessing the same disks would be. As such, Lustre is worse for reliability than NFS (several SPOFs instead of one), and there is no possible comparison with OCFS2/GFS (for which you can consider the storage a SPOF, but there is no failover).

    Furthermore, considering the storage a SPOF is a bit of a stretch. There are several ways to make it not so, from redundant controllers with multipath I/O to software RAID 1 over several devices.

  6. Jeff Schroeder Says:

    And you, sir, should take a look at GlusterFS:
    http://www.gluster.org

    It is pretty cool stuff.

  7. glandium Says:

    Np237: AFAIK, Lustre tolerates some degree of failure, so that each server, individually, is not a SPOF.

    As for multipath I/O, multiple controllers and such, sure, you can do that, but then you’re relying even more on “other servers”, which breaks the original article’s premise even further.

    Andre, Jeff: I read about Gluster a while ago, but it totally slipped my mind. I’ll have to give it a try some day.

  8. Np237 Says:

    Each Lustre server is not a SPOF because of failover. Only because of that. And without multiple controllers or whatever is needed to keep the targets available, the availability will remain mediocre.

    OTOH, OCFS2 over a software RAID 1 array will have very high availability, and very few failure paths – since there is no failover mechanism to trigger.

  9. glandium Says:

    Np237: I don’t know what you mean by OCFS2 over a software RAID 1 array, but AFAIK, it is not possible to have server-based software RAID except under very limited conditions and with server cooperation, so let’s assume you are talking about a storage system doing mirroring. The only way I see that working for high availability is with multiple controllers and multiple paths. How is no failover mechanism involved here?
    Maybe Lustre has a bad failover mechanism, but that doesn’t make other setups exempt from a failover mechanism.

  10. Justin Says:

    You can run OCFS or GFS on top of DRBD and have no SPOF….

  11. Np237 Says:

    Exactly as Justin says. With DRBD in active/active mode, not only is there no SPOF, but there is no failover involved (and failover is the enemy of availability). With OCFS2 on top of it, there is no failover involved at the filesystem level either. There is neither a SPOF nor a failover mechanism at any point in the chain.

    Lustre’s failover mechanism is not bad, but it is there. The way it works for each MDT/OST is exactly like a pair of NFS servers sharing an array of disks with Heartbeat for failover. You still have a SPOF on your disk array, and you still have a failover mechanism at the filesystem level. Except that you have several MDTs and several OSTs, each of which is critical for availability.

  12. glandium Says:

    DRBD’s primary-primary mode with a shared disk file system (GFS, OCFS2). These systems are very sensitive to failures of the replication network. Currently we cannot generally recommend this for production use.

    From http://www.drbd.org/home/mirroring/
    Though, I admit, it is a nice setup, and doubling the replication network over two different network interfaces should reduce the risk of a replication network failure.

  13. Andre Felipe Machado Says:

    Hello,
    Some time ago, I actually built a cluster with GFS
    http://www.techforce.com.br/index.php/news/linux_blog/red_hat_cluster_suite_debian_etch
    and tried using DRBD and other approaches
    http://www.techforce.com.br/index.php/news/linux_blog/virtualizacao_e_servico_de_arquivos_em_cluster_ha_com_debian_etch_parte_1
    http://www.techforce.com.br/index.php/news/linux_blog/virtualizacao_e_servico_de_arquivos_em_cluster_ha_com_debian_etch_parte_2
    http://www.techforce.com.br/index.php/news/linux_blog/virtualizacao_e_servico_de_arquivos_em_cluster_ha_com_debian_etch_parte_3

    After some time of use, it became clear (as of today) that the Gluster approach is the most elegant and flexible one.
    Its array of installable plugins allows configurations suitable for various use scenarios.
    It is showing its value here.
    Some configurations do not have a SPOF, with some performance or complexity trade-offs, of course.

    The 1.x version is already in the Debian repositories.
    I hope the upcoming 2.x version becomes available in backports.
    Regards.
    Andre Felipe Machado