cat brain.log | less

Getting it down on `paper`

Performance Problems on an MD1000 with PERC/6 Disk Array

One of the machines that I use at work was exhibiting severely slow disk I/O. I noticed that any directory I visited for the first time took 30-50 seconds to `cd` into. After that first time, however, the response was speedy. The quick repeat visits looked like caching at work, which pointed at the hardware underneath as the slow part. This is my story.

  1. Check the lights. All green, now what?
  2. Check the load. Normal.
  3. Try hdparm. What? man hdparm
    // normal machine 1:
    $ sudo /sbin/hdparm -Tt /dev/md0
    /dev/md0:
     Timing cached reads:   15348 MB in  2.00 seconds = 7688.21 MB/sec
     Timing buffered disk reads:  808 MB in  3.19 seconds = 253.58 MB/sec
    
    // normal machine 2:
    $ sudo /sbin/hdparm -Tt /dev/md0
    /dev/md0:
     Timing cached reads:   18448 MB in  1.99 seconds = 9254.38 MB/sec
     Timing buffered disk reads:  1654 MB in  3.00 seconds = 551.33 MB/sec
    
    // problem machine:
    $ sudo /sbin/hdparm -Tt /dev/sdc
    /dev/sdc:
     Timing cached reads:   27660 MB in  1.99 seconds = 13883.97 MB/sec
     Timing buffered disk reads:    8 MB in 14.64 seconds = 559.51 kB/sec

    Yes, we definitely have a problem. 500 kB/sec is cable modem speed, and I know my hard drive can store data faster than that… ok, now what?

  4. Check the logs.
    1. Reboot into bios
    2. Check the event history
    3. Reboot into PERC bios
    4. Check the event history.
    5. Call Dell, the ones with all the answers and none of the obligation to share. They know the track record of the hardware they sell, so when you call them up, they just send you replacement parts for whatever everyone else’s problem was. This time, it worked.
  5. The phone call to Dell went through the obvious: Reboot? Logs? Firmware? blah blah… Dell says they saw errors with disk 9. Ok, that’s something we didn’t see the first time around. Anyway…
  6. Dell decided to send a new PERC card, new PERC cable, and because of the disk error, a new disk.
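
Comparing the hdparm numbers from step 3 by eye is easy to get wrong, because hdparm switches from MB/sec to kB/sec when a disk is slow. Here is a minimal sketch of a helper that puts the buffered-read figure on one scale. The function name `buffered_mb_per_sec` is my own invention, and the 1024 kB per MB conversion is an assumption that happens to match the problem machine's output (8 MB in 14.64 seconds works out to roughly 559 kB/sec only with 1024).

```shell
# Sketch: normalize hdparm's "buffered disk reads" line to MB/sec.
# Assumes the line ends in "= <number> <unit>/sec" with unit kB or MB,
# the two forms seen in the output above, and 1 MB = 1024 kB.
buffered_mb_per_sec() {
    awk '/Timing buffered disk reads/ {
        rate = $(NF - 1)            # the number before the unit
        unit = $NF                  # "MB/sec" or "kB/sec"
        if (unit == "kB/sec") rate = rate / 1024
        printf "%.2f\n", rate
    }'
}

# Try it on the problem machine's line from this post:
printf '%s\n' ' Timing buffered disk reads:    8 MB in 14.64 seconds = 559.51 kB/sec' \
    | buffered_mb_per_sec

# On a real box (needs root; the device name is whatever yours is):
#   sudo /sbin/hdparm -t /dev/sdc | buffered_mb_per_sec
```

With that, the problem machine shows up as about half a MB/sec next to 253.58 and 551.33 on the healthy ones, which makes the gap hard to miss.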

We decided to only replace the PERC card first, to see if that would solve the problem. It worked!

// problem machine after new PERC card:
$ sudo /sbin/hdparm -Tt /dev/sdc
 Timing cached reads:   25368 MB in  1.99 seconds = 12727.72 MB/sec
 Timing buffered disk reads:  256 MB in  3.00 seconds =  85.33 MB/sec

That’s two orders of magnitude of improvement (roughly 0.55 MB/sec up to 85 MB/sec), but it’s not good enough. The other machines are doing better. Let’s change the read-ahead setting.

// Set read-ahead to 8192 sectors (512 bytes each, so 4 MiB):
$ sudo /sbin/blockdev --setra 8192 /dev/sdc
$ sudo /sbin/hdparm -Tt /dev/sdc
/dev/sdc:
 Timing cached reads:   23988 MB in  1.99 seconds = 12035.05 MB/sec
 Timing buffered disk reads:  1202 MB in  3.00 seconds = 400.48 MB/sec

Now that’s what I’m talkin’ about!! Yes, 400 MB/sec is what we’re expecting for local disks. Call it a day, grab me a beer, celebrate — disaster thwarted.
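
For anyone repeating this, it helps to know what the `--setra` number means before changing it. `blockdev` counts read-ahead in 512-byte sectors, so here is a small sketch that converts a sector count to KiB. The helper name `ra_sectors_to_kib` is made up for illustration, and the 256 in the second call is only an example of a smaller value, not something I recorded from this machine.

```shell
# Sketch: convert a blockdev read-ahead value (512-byte sectors) to KiB.
ra_sectors_to_kib() {
    echo $(( $1 * 512 / 1024 ))
}

ra_sectors_to_kib 8192   # the value set above: 4096 KiB, i.e. 4 MiB
ra_sectors_to_kib 256    # an example smaller value: 128 KiB

# To see what a device is currently using (needs root):
#   sudo /sbin/blockdev --getra /dev/sdc
```

As far as I know, a value set with `--setra` does not survive a reboot, so it would have to be reapplied from a boot script if you want it to stick.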

When all was said and done, we double-checked the logs and didn’t find anything having to do with disk 9, so we figured it was all probably the card. It’s better not to rebuild the array in this case because we don’t use striping or mirroring… we need the space, and the “data” is regenerable (it’s processed output, not source data). I would have replaced the cable too, since there’s no penalty for doing so, but I wasn’t the tinkerer today.

So that’s it: bad PERC/6 card. It’s not the first bad enterprise part we’ve received from Dell. They need to work on that.

 
