Storage Cluster: A Challenge to LJ Staff and Readers
For a few years I have been trying to create a "distributed cluster storage system" (see below) on standard Linux hardware, without success. I have looked into buying one, and they do exist, but they are so expensive I can't afford one; they also are designed for much larger enterprises and have tons of features I don't want or need. I am hoping the Linux community can help me create this low-cost "distributed cluster storage system", which I think other small businesses could use as well. Please help me solve this so we can publish the solution to the open-source community.
I am open to any reasonable solution (including buying one) that I can afford (under $3,000). I already have some hardware for this project, which includes all the cluster nodes and two data servers: two Supermicro systems, each with dual 3.0GHz Xeon CPUs, 8GB of RAM and four 750GB Seagate SATA hard drives.
I have tried all of these technologies at one point or another, in various combinations, to create my solution, but have not succeeded: DRBD, NFS, GFS, OCFS, AoE, iSCSI, Heartbeat, ldirectord, round-robin DNS, PVFS, CMAN, CLVM, GlusterFS and several fencing solutions.
Description of my "distributed cluster storage system":
- Data servers: two units (appliances/servers) that each have four or more drives in a RAID5 set (three active, one hot spare). These two units can be active/passive or active/active; I don't care which. They should mirror each other in real time. If one unit fails for any reason, the other picks up the load and carries on without any delay or hang time on the clients. When a failed unit comes back up, I want the data to be re-synced automatically, and only then should the unit "come back on-line" (assuming its normal state is active). It would be even more ideal if the data servers could scale from one to N units instead of just one to two.
- Data clients: each cluster node (the clients) in the server farm (running CentOS 5.4) will mount one or more data partitions in read/write mode provided by the data server(s). Multiple clients will mount the same partition at the same time in read/write, so network file locking is needed. If a server goes down (multiple hard-drive failure, network issue, power supply and so on), the other server takes over 100% of the traffic and the client machines never know (a minimal sketch of such a client mount follows the recap below).
To re-cap:
- 2 data servers mirroring each other in real time.
- Automatic failover to the working server if one fails (without the clients needing to be restarted, or even being interrupted).
- Automatic re-sync of the data when a failed unit comes back on-line; once the sync is done, the unit goes active again (assuming its normal state is active).
- Multiple machines mounting the same partition in read/write mode (some kind of network file system).
- Linux CentOS will be used on the cluster nodes.
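To make the client side concrete, here is a minimal sketch of what each cluster node's /etc/fstab entry might look like if the shared storage were exported over NFS through a floating virtual IP. The IP address, export path and mount point are placeholders, not part of my actual setup:

# /etc/fstab on each cluster node (sketch; 10.0.0.100 is a hypothetical
# virtual IP that follows whichever data server is currently active)
10.0.0.100:/export/data   /data   nfs   rw,hard,intr   0 0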
What follows is a (partial) description of what I've tried and why it's failed to live up to the requirements:
For the most part, I got all of the technology listed above working as advertised. The problems mostly appear when one server fails: any solution that relies on Heartbeat or round-robin DNS for failover will hang the network file system for three seconds or more. While this is not a problem for services like HTTP, FTP and SSH (which is what Heartbeat is designed for), it poses a big problem for a file system. If a web page takes three extra seconds to load, you may not even notice, or you hit reload and it works; a file system, however, is nowhere near that tolerant of delays. Lots of things break when a mount point is unavailable for three seconds.
So with DRBD in active/passive mode and heartbeat I set up a "Highly Available NFS Server" with these instructions.
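The core of that setup looked something like the following (a sketch, not my exact files; the hostnames, IP addresses and device names are placeholders):

# /etc/drbd.conf (excerpt) - one mirrored resource between the two data servers
resource r0 {
  protocol C;                       # synchronous replication
  on server1 {
    device    /dev/drbd0;
    disk      /dev/sda3;            # local RAID partition (placeholder)
    address   10.0.0.1:7788;
    meta-disk internal;
  }
  on server2 {
    device    /dev/drbd0;
    disk      /dev/sda3;
    address   10.0.0.2:7788;
    meta-disk internal;
  }
}

# /etc/ha.d/haresources - Heartbeat brings up DRBD, the file system,
# the virtual IP and the NFS service on whichever node is active
server1 drbddisk::r0 Filesystem::/dev/drbd0::/export::ext3 10.0.0.100 nfs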
With Linux's NFS implementation, the mount will hang and require manual intervention to recover, even when server #2 takes over the virtual IP. Sometimes you even have to reboot the client machines to recover the NFS mounts, which means it is not "highly available" anymore.
With iSCSI, I could not find a good way to replicate the LUNs across multiple machines. As I understand the technology, it is a one-to-one relationship, not one-to-many. And again, it would rely on Heartbeat or round-robin DNS for redundancy, which would hang if one of the servers was down.
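For what it's worth, exporting a DRBD-backed LUN from the active server with scsi-target-utils looks roughly like the following (a sketch; the target name and backing device are placeholders), and it still leaves the clients pointing at a single IP address that has to fail over:

# Define an iSCSI target and attach the replicated DRBD device as LUN 1
tgtadm --lld iscsi --op new --mode target --tid 1 \
       --targetname iqn.2010-04.local.cluster:store0
tgtadm --lld iscsi --op new --mode logicalunit --tid 1 --lun 1 \
       --backing-store /dev/drbd0
tgtadm --lld iscsi --op bind --mode target --tid 1 --initiator-address ALL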
Moving on to GFS, I find that almost all of the fencing solutions available for GFS make it nearly unusable. With a power-switch fence, your client machines get cold-rebooted when the actual problem might be a flaky network cable or an overloaded network switch. Cold rebooting any server is very dangerous (ask any experienced sysadmin). A simple example: if your boot loader was damaged after the machine came up, the machine could run for years without a problem, but the moment you reboot, the boot loader will bite you. With a managed-switch fence, you could lose all communication with a server just because the network file system was down, even though many servers need the network for other things that don't depend on that file system. Again, this is very high risk, in my opinion, for something that could and should be solved another way. The one solution I do like is the AoE fencing method. It simply removes the MAC address of the unreachable client from the list of allowed MACs on the AoE server, which should not affect anything on the client machine except the network file system.
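As a sketch of how that works (the MAC addresses and backing device below are placeholders, and I'm assuming vblade's -m access-list option here): the export answers only clients whose MACs are in the list, so fencing a node just means re-exporting without its MAC:

# Export /dev/drbd0 as AoE shelf 0, slot 1 on eth0, visible to two client MACs only
vbladed -m 00:11:22:33:44:55,00:11:22:33:44:66 0 1 eth0 /dev/drbd0
# To fence the second client, stop the export and restart it with that MAC removed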
I did get a DRBD, AoE, Heartbeat, GFS and AoE-fence combination working, but again, when a server goes down, there is a hang time of at least three seconds on the network file system.
Finally, there is GlusterFS. This seemed like the ideal solution, and I am still working with the GlusterFS community to get it going. The two problems with this solution are:
- When a server goes down, there is still a delay/timeout during which the mount point is unavailable.
- When a failed server comes back up it has no way to "re-sync" with the working server.
The reason for the second item is that GlusterFS is a client-side replication solution: each client is responsible for writing its files to every server, so the servers are essentially unaware of each other. The advantage of client-side replication is scalability; according to the GlusterFS community, it can scale to hundreds of servers and petabytes of storage.
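To show what client-side replication means in practice, here is a minimal sketch of a GlusterFS client volume file from that era (the server names and brick names are placeholders): the replicate translator runs on the client and writes every file to both servers, which is why the servers never talk to each other:

volume server1-brick
  type protocol/client
  option transport-type tcp
  option remote-host server1
  option remote-subvolume brick
end-volume

volume server2-brick
  type protocol/client
  option transport-type tcp
  option remote-host server2
  option remote-subvolume brick
end-volume

volume mirror
  type cluster/replicate              # client-side mirroring (AFR)
  subvolumes server1-brick server2-brick
end-volume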
A final note on all of this: I run one simple test on every network file system I try, because I know it gives consistent results. What I have found is that any network file system will be, at best, 10-25% the speed of a local file system (note that this is over gigabit copper Ethernet, not Fibre Channel). I run the following, which creates a 1GB test file:
dd of=/network/mount/point/test.out if=/dev/zero bs=1024 count=1M
I get about 11MB/s on NFS, 25MB/s on GlusterFS and 170+MB/s on the local file system. So, you have to be okay with the performance hit, which is acceptable for what I am doing.
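One caveat on those numbers: a plain dd run like this can be flattered by the client's page cache. If your version of GNU dd is new enough to support conv=fdatasync (I believe recent coreutils releases are), a run like the following forces the data out to the server before the rate is reported:

dd of=/network/mount/point/test.out if=/dev/zero bs=1M count=1024 conv=fdatasync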
P.S. I just learned about Lustre, now under the Sun/Oracle brand. I will be testing it out soon.
Chad Columbus is a freelance IT consultant and application developer. Feel free to contact him if you have a need for his services.