ATA over Ethernet optimizations

Page alignment problems

An alignment problem arises when an AoE (or possibly iSCSI or Fibre Channel) device is partitioned.

Loather on the #aoe channel of irc.freenode.net has written a short paper explaining the issue in detail.

If I write blocks directly to the AoE disk device etherd/e0.0, for instance, the block writes are aligned with pages and I can stream at full speed to the underlying device (in this case I'm saturating my gigabit Ethernet interface on a write).

If I add a partition table and write to the partition, the writes are offset by 512 bytes, which is not a multiple of the page size (and, incidentally, is the exact size of an x86 partition table).

That is to say, if I write blocks directly to etherd/e0.0p1, the writes cause page cache reads every time.

I have discovered that this alignment error causes every write() on the initiator to be offset into two cache pages on the target. So, even if the cache reads don't happen on the initiator, they do happen on the target.
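The two-page effect can be checked with a little arithmetic. This is only a sketch, assuming 4096-byte pages and 512-byte sectors; `pages_touched` is a helper name made up for illustration, not part of any AoE tool.

```shell
#!/bin/sh
# Count how many 4096-byte pages a write touches, given its byte
# offset on the underlying device and its length in bytes.
PAGE_SIZE=4096

pages_touched() {
    offset=$1
    length=$2
    first_page=$(( offset / PAGE_SIZE ))
    last_page=$(( (offset + length - 1) / PAGE_SIZE ))
    echo $(( last_page - first_page + 1 ))
}

# A page-sized write at the start of the raw device stays in one page.
pages_touched 0 4096
# The same write shifted by a 63-sector partition start straddles two pages.
pages_touched $(( 63 * 512 )) 4096
```

The first call prints 1 and the second prints 2, which is why the target ends up reading and rewriting two cache pages for every page the initiator writes.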

The off-by-one error is a result of the default CHS geometry:

AoE devices default to a geometry of 255 heads and 63 sectors per track (SPT). This causes the first partition to start at sector 63 of the disk.

A 4096-byte page holds eight 512-byte sectors, and 63 isn't a multiple of 8.

Solution: I edited the partition table to align the disk with the geometry of 256 heads and 32 sectors per track. I also had to change the number of cylinders on the virtual disk to align with the change in bytes per cylinder.

Powers of two align nicely with 4096-byte pages. This way, pages on the underlying device (target) directly correspond with pages on the virtual device (initiator). Having anything other than a multiple of eight as the number of sectors per track will result in a misalignment.
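To compare geometries quickly, something like the following can help. It is a rough sketch that assumes the first partition starts one full track into the disk (so the start sector equals the sectors-per-track value), with 512-byte sectors and 4096-byte pages; `partition_alignment` is a hypothetical helper name.

```shell
#!/bin/sh
# Report whether a partition starting at a given sector lands on a
# 4096-byte page boundary of the underlying device.
SECTOR_SIZE=512
PAGE_SIZE=4096

partition_alignment() {
    start_sector=$1
    byte_offset=$(( start_sector * SECTOR_SIZE ))
    remainder=$(( byte_offset % PAGE_SIZE ))
    if [ "$remainder" -eq 0 ]; then
        echo "sector $start_sector: aligned"
    else
        echo "sector $start_sector: misaligned by $remainder bytes"
    fi
}

partition_alignment 63   # default geometry, 63 sectors per track
partition_alignment 32   # edited geometry, 32 sectors per track
```

With 63 sectors per track the start lands 3584 bytes into a page, which is 512 bytes short of the next page boundary. With 32 sectors per track it lands exactly on a boundary.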

With this change, I now have a filesystem on a partitioned AoE device that can write at full speed (minus filesystem overhead) without the cache alignment penalty. This is definitely a huge step in the right direction!

Hardware flow control and jumbo frames

Use ethtool to enable flow control on both sides, for both TX and RX. That helps tremendously:

/sbin/ethtool --pause eth1 autoneg off rx on
/sbin/ethtool --pause eth1 autoneg off tx on

Check the maximum size of the receive (RX) buffer:

/sbin/ethtool -g eth0

It is probably set to 128 or something similarly small for receive. Set the RX buffer to the maximum size:

/sbin/ethtool -G eth0 rx 4096
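The maximum differs from card to card, so 4096 will not be right everywhere. As a sketch, the pre-set maximum can be pulled out of the `ethtool -g` output instead of being typed by hand. This assumes the usual "Pre-set maximums:" / "Current hardware settings:" layout of that output; `max_rx_ring` is a made-up helper name.

```shell
#!/bin/sh
# Print the pre-set maximum RX ring size from `ethtool -g` output
# supplied on stdin.
max_rx_ring() {
    awk '
        /^Pre-set maximums:/          { in_max = 1; next }
        /^Current hardware settings:/ { in_max = 0 }
        in_max && $1 == "RX:"         { print $2; exit }
    '
}

# Intended use (needs root and a real interface):
#   /sbin/ethtool -G eth0 rx "$(/sbin/ethtool -g eth0 | max_rx_ring)"
```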

Also, increase the interface MTU to 9000 on both sides:

ifconfig eth0 mtu 9000

And set net.core.rmem_default, net.core.rmem_max, net.core.wmem_default and net.core.wmem_max to 262144. The MTU change enables jumbo frames, and these settings increase the kernel socket buffer sizes to compensate for the larger frames. These settings should be put into /etc/sysctl.conf or your distribution's equivalent.

I put this at the bottom of my /etc/sysctl.conf:

# Optimizations to increase speed for AoE
net.core.rmem_default = 262144
net.core.rmem_max = 262144
net.core.wmem_default = 262144
net.core.wmem_max = 262144
net.core.netdev_max_backlog = 10000

This is not really relevant to AoE, but a lot of the tuning options are the same, so I put it here: if you are doing iSCSI or a similar TCP-based protocol, you can try these settings in sysctl.conf as well:

net.ipv4.tcp_rmem = 32768 436600 873200
net.ipv4.tcp_wmem = 8192 436600 873200

Then run /sbin/sysctl -p to have /etc/sysctl.conf re-read.
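To confirm the values actually took effect, they can be read back from /proc/sys, where each dot in a sysctl key becomes a slash in the path. The following is a sketch only; `sysctl_path` and `check_sysctl` are hypothetical helpers, and the optional root-directory argument exists purely so the check can be tried against a scratch directory.

```shell
#!/bin/sh
# Map a sysctl key such as net.core.rmem_max to its file under /proc/sys.
# An alternative root directory may be given as the second argument.
sysctl_path() {
    echo "${2:-/proc/sys}/$(echo "$1" | tr . /)"
}

# Compare the current value of a single-valued sysctl key with the
# value we expect, and say which one differs.
check_sysctl() {
    key=$1
    expected=$2
    path=$(sysctl_path "$key" "${3:-}")
    if [ ! -r "$path" ]; then
        echo "$key: not readable"
        return 1
    fi
    actual=$(cat "$path")
    if [ "$actual" = "$expected" ]; then
        echo "$key: ok ($actual)"
    else
        echo "$key: expected $expected, found $actual"
        return 1
    fi
}

# Intended use after running sysctl -p:
#   for key in net.core.rmem_default net.core.rmem_max \
#              net.core.wmem_default net.core.wmem_max; do
#       check_sysctl "$key" 262144
#   done
```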

If you are using an MTU of 9000 you have to tell the switch as well. To set up a Cisco 3560G switch (and possibly others) to use jumbo frames with an MTU of 9000, do this:

3750(config)# system mtu jumbo 9000
3750(config)# exit
3750# reload

Note that this requires a reload which will take the switch down as it reboots.

Use a quality gigabit switch

Even for two devices, it will make a big difference because of the port buffers. This means either Cisco or HP. Definitely NOT Dell, Netgear, Linksys, or anything of that ilk.

Use "deadline" IO scheduler

On the server, see if you can change the default I/O scheduler. Mainline kernels prior to 2.6.18 use the 'as' (anticipatory) scheduler by default. This scheduler prefers to stay in the same area of the disk: it services adjacent requests near the current head position before moving on to requests for other areas. On desktop systems that typically do one or two things at a time, this is fine. On servers it is a different story and can result in a lot of I/O starvation. From 2.6.18 on, the default I/O scheduler is 'cfq' (Completely Fair Queuing), which does a much better job of avoiding I/O starvation. In my tests its performance was acceptable, but not exactly what I was looking for. The best scheduler I've found for this is 'deadline', which sets a deadline for each request to be serviced. This can result in some extra seeks, but it avoids I/O starvation, because requests are serviced within a preset amount of time after they come in.

There are two ways to change the I/O scheduler. Reboot the machine with elevator=deadline on the kernel command line or, if your kernel is new enough (I think this was added in 2.6.14), echo the name of the desired scheduler to /sys/block/<device>/queue/scheduler on a per-device basis. I recommend the reboot, just because it involves less work, but if your machine can't go down for whatever reason, set it manually using sysfs.

Just to make sure it really always takes effect I like to put this in my rc.local:

for device in /sys/block/sd*; do
        echo "deadline" > "$device/queue/scheduler"
        echo "$device"
done
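To see whether the change stuck, read the same sysfs file back: it lists the available schedulers with the active one in square brackets. A small sketch for pulling out the active name follows; `active_scheduler` is a made-up helper, and the sample line in the comment is only an example of what the file may contain.

```shell
#!/bin/sh
# Extract the active scheduler (the bracketed name) from the contents
# of /sys/block/<device>/queue/scheduler supplied on stdin,
# e.g. "noop anticipatory [deadline] cfq".
active_scheduler() {
    sed -n 's/.*\[\([^]]*\)\].*/\1/p'
}

# Intended use:
#   for device in /sys/block/sd*; do
#       echo "$device: $(active_scheduler < "$device/queue/scheduler")"
#   done
```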

Increase readahead

Consider increasing your read-ahead setting (which defaults to a very small value for today's hardware) using something like:

# /sbin/blockdev --getra /dev/sdb
256
# /sbin/blockdev --setra 16384 /dev/sdb

as described here: http://portal.itauth.com/2007/11/20/howto-linux-double-your-disk-read-performance-single-command
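The numbers passed to blockdev --setra are counts of 512-byte sectors, not bytes or kilobytes, which is easy to forget. A quick conversion sketch (the helper name `ra_sectors_to_kib` is invented for this example):

```shell
#!/bin/sh
# Convert a blockdev read-ahead value (512-byte sectors) to KiB.
ra_sectors_to_kib() {
    echo $(( $1 * 512 / 1024 ))
}

ra_sectors_to_kib 256     # the default shown above
ra_sectors_to_kib 16384   # the increased setting
```

This prints 128 and 8192, so the change above raises read-ahead from 128 KiB to 8 MiB per device.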

Again, I put this in my rc.local:

# Also configure all of our disk devices to use larger read-ahead
for device in /sys/block/sd*; do
        base=$(basename "$device")
        /sbin/blockdev --setra 16384 "/dev/$base"
done

Change IO scheduling per device

The I/O scheduler tunables are per-device, so you'd be changing a lot of parameters to tweak them. You can find them under /sys/block/<device>/queue (the scheduler-specific ones are in its iosched subdirectory). It might make sense to adjust them depending on whether that particular device (AoE device) is going to back a database server, media streaming server, small file server (Maildirs), etc. You may have to be using the deadline I/O scheduler to do this.
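Before changing anything it helps to dump what is there. The sketch below prints every tunable in a directory as name = value; `show_iosched` is a hypothetical helper, and the device name and the value in the usage comment are placeholders rather than recommendations.

```shell
#!/bin/sh
# Print each tunable file in a directory as "name = value".
show_iosched() {
    dir=$1
    for f in "$dir"/*; do
        [ -f "$f" ] || continue
        echo "$(basename "$f") = $(cat "$f")"
    done
}

# Intended use (sda is a placeholder device name):
#   show_iosched /sys/block/sda/queue/iosched
# With the deadline scheduler active, a tunable such as read_expire
# (in milliseconds) can then be changed as root, for example:
#   echo 250 > /sys/block/sda/queue/iosched/read_expire
```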

Change swap parameters

There are swappiness tweaks that make the system swap unneeded memory out to disk, leaving more room for disk cache. You might want to do this in the Xen VM more than on the AoE server. Also look into the pdflush settings. A lot of what lives under /proc/sys/vm can be tweaked to help with this.

Choose your SATA controller carefully

From IRC:

<tessier> I am pondering how to lay out the disks in the disk nodes I intend to build in the next month or two...
<tessier> I am going to buy the same kit you guys have.
<loather-work> i'd strongly suggest a different SATA controller, though
<tessier> Might need to do some experimentation.
<tessier> Yeah, I was thinking that too.
<loather-work> the marvell support under linux is less than great
<loather-work> a lot of the LSI boards use the fusion MPT architecture, which is very well supported and full-featured under linux
<tessier> I am pondering doing RAID 1 over RAID 5 like the first disk nodes I set up (not so attractive because it involves RAID 5 over a bunch of disks) or doing RAID 1 across each (but then what happens if I lose one of the disks, does it take down that whole disk node?) or do I do mirror on top of mirror (4 disks with the same data? Maybe great performance and reliability but now we are tying up 4 disks, overkill?) etc.
<tessier> Yeah, I would look at LSI boards
<tessier> linagee uses them and has had good success
<loather-work> i've used LSI in the past too -- it's rock solid reliable, and very fast.
<tessier> I think I might make a set of disks specifically for swap too. In their own volume group.
<loather-work> to be honest? i think you'd be better off striping across multiple RAID5 sets
<loather-work> we're doing that as well
<tessier> Because if one vm starts swapping hard I don't want to penalize everyone's normal disk access
<tessier> Maybe strip across 3 disk RAID 5 sets?
<loather-work> currently it's a single disk for all instances (bad daniel!) -- i'd rather it be a well-performing raid10
<tessier> The supermicro cases have 15 disk slots iirc. That is 5 3 disk RAID sets.
<tessier> hehehe
<tessier> I am just pleased that the disk is still the bottleneck and not the network
<loather-work> yea, i'd go with two four disk sets and leave the remaining disks for swap and whatnot
<tessier> Look at all of the companies who have no more disk than this but spend tons on fibrechannel. For no better performance. Maybe a little less latency on the wire. That's it.

Choose 10k+ RPM drives

Higher disk RPM helps a lot. 10,000 rpm seems to be the fastest SATA drive you can buy at the moment. SCSI goes to 15k. But I still recommend SATA over SCSI because you can buy two SATA disks (effectively 20,000 rpm?) for the price of a SCSI rig.

Use SMALL disks

Just buy the best price per gigabyte. You need SPINDLES, not storage, if you want performance.

Also look at how long a RAID rebuild is going to take. It seems to be around 5 hours for 750 GB disks.
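That figure can be turned into a rough rebuild rate and used to estimate other disk sizes. This is only back-of-the-envelope arithmetic, assuming decimal gigabytes, integer rounding, and a rebuild time that scales linearly with capacity; both helper names are made up.

```shell
#!/bin/sh
# Average rebuild rate in MB/s for a disk of size_gb rebuilt in hours.
rebuild_rate_mb_s() {
    size_gb=$1
    hours=$2
    echo $(( size_gb * 1000 / (hours * 3600) ))
}

# Estimated rebuild time in minutes for a disk of size_gb at rate_mb_s.
rebuild_minutes() {
    size_gb=$1
    rate_mb_s=$2
    echo $(( size_gb * 1000 / rate_mb_s / 60 ))
}

rebuild_rate_mb_s 750 5   # rate implied by 750 GB in 5 hours
rebuild_minutes 250 41    # a 250 GB disk at that rate
```

This prints 41 (MB/s) and 101 (minutes), so a smaller disk spends far less time in the degraded state.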

Avoid RAID 5

I know it is tempting to try to RAID 5 all of your disks together. But disks are cheap and RAID 5 kills performance.

Consider joining the Battle Against Any Raid Five: http://www.miracleas.com/BAARF/

Or at least read the anti-RAID 5 papers and essays they link to.

You don't want to pay the small-write penalty. You don't want to lose 80% of your performance during a rebuild. You don't want to risk blowing up your whole array and losing all of your data when you accidentally try to replace the wrong disk.
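The small-write penalty can be put into rough numbers with the usual textbook model: a small random write costs four disk I/Os on RAID 5 (read data, read parity, write data, write parity) and two on a mirror. The sketch below applies that model; the 8 disks and 100 IOPS per disk are made-up illustration values, not measurements, and `random_write_iops` is a hypothetical helper.

```shell
#!/bin/sh
# Estimate random-write IOPS of an array from the number of disks,
# the IOPS of one disk, and the write penalty of the RAID level.
random_write_iops() {
    disks=$1
    per_disk_iops=$2
    penalty=$3
    echo $(( disks * per_disk_iops / penalty ))
}

random_write_iops 8 100 4   # RAID 5: four I/Os per small write
random_write_iops 8 100 2   # RAID 1 / RAID 10: two I/Os per write
```

Under those assumptions the same eight disks give 200 random-write IOPS as RAID 5 and 400 as RAID 10.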

You want to let those heads move independently of each other and seek all over the disk. When you use RAID 5 you slave all of the heads together. When one moves to read in, say, 100 blocks of data, they all move. This kills your access times.

Instead of RAID 5, just use RAID 1 or, even better, RAID 10. Linux has an awesome RAID10 implementation. The Linux RAID10 mode is not simply RAID 1+0, so read up on it. Unlike RAID 1+0, it can run on a two-disk array. You will want the far or the offset layout, because the near layout doesn't provide a speed-up on a two-disk array.
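With mdadm the layout is picked at creation time through --layout, where n2, f2 and o2 mean near, far and offset with two copies. Because creating an array destroys data, the sketch below only prints the command instead of running it. The device names are placeholders and `raid10_create_cmd` is a made-up helper.

```shell
#!/bin/sh
# Print (do not run) an mdadm command that creates a Linux RAID10
# array with the given layout on the given member devices.
raid10_create_cmd() {
    layout=$1
    md=$2
    shift 2
    echo "mdadm --create $md --level=10 --layout=$layout --raid-devices=$# $*"
}

# Two-disk array using the far layout (placeholder device names).
raid10_create_cmd f2 /dev/md0 /dev/sdb1 /dev/sdc1
```

Review the printed command, then run it by hand as root once you are sure the member devices are the right ones.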

This is better performing AND safer than RAID 5. If you are using a mirrored set of disk nodes I recommend mirroring a pair across the disk nodes instead of within the disk node.

If you are doing a 5 hour RAID rebuild on your 8 disk RAID 5 array and another disk fails, you are going to be one very unhappy hacker. This isn't unlikely: drive failures are correlated according to a CMU study, and a RAID rebuild really hammers the remaining good disks. Using mirroring greatly reduces the chance of this happening.

Seagate Disks

Make sure you remove the go-slow jumper from Seagate SATA disks. They come from the factory jumpered to 1.5Gb/s SATA-I. If you have a SATA-II capable controller and disks, you will want to remove the jumper so you can get 3.0Gb/s. I did this and got a 50% speed increase. You won't double your speed because at the moment most SATA disks can't push out 3.0Gb/s of data even though the bus can take it.
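One way to check what each port negotiated is the kernel log, where libata reports the link speed when a disk comes up. The sketch below filters those lines. It assumes messages of the form "ata1: SATA link up 3.0 Gbps (...)", and `sata_link_speeds` is an invented helper name.

```shell
#!/bin/sh
# From kernel log text on stdin, print the port name and negotiated
# speed for every "SATA link up" message.
sata_link_speeds() {
    sed -n 's/.*\(ata[0-9][0-9]*\): SATA link up \([0-9.]* Gbps\).*/\1 \2/p'
}

# Intended use:
#   dmesg | sata_link_speeds
```

A disk that still has the jumper fitted should show up at 1.5 Gbps even on a 3.0 Gbps controller.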


