Software-RAID HOWTO: High-Availability RAID

9. High-Availability RAID

Q: RAID can help protect me against data loss. But how can I also ensure that the system stays up as long as possible and is not prone to breakdown? Ideally, I want a system that is up 24 hours a day, 7 days a week, 365 days a year.
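A: High availability is harder and more expensive. In brief, the following hints, tips, ideas, and rumors may help:
- If one disk on an IDE ribbon cable fails, both disks on that cable can appear to have failed. Put only one disk on each IDE cable.
- On a SCSI chain, too, a single failed disk can block access to every disk. Do not put disks of the same RAID array on the same SCSI chain.
- Use multiple disk controllers as well.
- Do not use the same brand and model for all of the disks. A mix is less likely to fail all at once when something such as physical shock hits the disks.
- To guard against CPU or controller failure, it should be possible to set up SCSI in a "twin-tailed" configuration that is connected to two computers (see the full description below).
- Always use a UPS, and shut down cleanly.
- SCSI cables are known to be very temperamental and prone to problems. Use the best-quality cables you can buy.
- Look at SSA (Serial Storage Architecture). It is rather expensive, but it is said to be more reliable.
- Enjoy yourself; it's later than you think.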
High-Availability is difficult and expensive. The harder you try to make a system fault tolerant, the harder and more expensive it gets. The following hints, tips, ideas and unsubstantiated rumors may help you with this quest.
- IDE disks can fail in such a way that the failed disk on an IDE ribbon can also prevent the good disk on the same ribbon from responding, thus making it look as if two disks have failed. Since RAID does not protect against two-disk failures, one should either put only one disk on an IDE cable, or if there are two disks, they should belong to different RAID sets.
- SCSI disks can fail in such a way that the failed disk on a SCSI chain can prevent any device on the chain from being accessed. The failure mode involves a short of the common (shared) device ready pin; since this pin is shared, no arbitration can occur until the short is removed. Thus, no two disks on the same SCSI chain should belong to the same RAID array.
- Similar remarks apply to the disk controllers. Don't load up the channels on one controller; use multiple controllers.
- Don't use the same brand or model number for all of the disks. It is not uncommon for severe electrical storms to take out two or more disks. (Yes, we all use surge suppressors, but these are not perfect either). Heat & poor ventilation of the disk enclosure are other disk killers. Cheap disks often run hot. Using different brands of disk & controller decreases the likelihood that whatever took out one disk (heat, physical shock, vibration, electrical surge) will also damage the others on the same date.
- To guard against controller or CPU failure, it should be possible to build a SCSI disk enclosure that is "twin-tailed": i.e. is connected to two computers. One computer will mount the file-systems read-write, while the second computer will mount them read-only, and act as a hot spare. When the hot-spare is able to determine that the master has failed (e.g. through a watchdog), it will cut the power to the master (to make sure that it's really off), and then fsck & remount read-write. If anyone gets this working, let me know.
- Always use a UPS, and perform clean shutdowns. Although an unclean shutdown may not damage the disks, running ckraid on even small-ish arrays is painfully slow. You want to avoid running ckraid as much as possible. Or you can hack on the kernel and get the hot-reconstruction code debugged ...
- SCSI cables are well-known to be very temperamental creatures, and prone to cause all sorts of problems. Use the highest quality cabling that you can find for sale. Use e.g. bubble-wrap to make sure that ribbon cables do not get too close to one another and cross-talk. Rigorously observe cable-length restrictions.
- Take a look at SSA (Serial Storage Architecture). Although it is rather expensive, it is rumored to be less prone to the failure modes that SCSI exhibits.
- Enjoy yourself, it's later than you think.
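
The placement rules in the first three hints (no two members of one RAID array on the same IDE cable or SCSI chain, and arrays spread over more than one controller) can be checked mechanically before an array is built. The following is only a minimal Python sketch of such a check. The Disk record, the function names, and the example device, controller and bus names are all made up for illustration; nothing here probes real hardware, so the layout has to be written down by hand.

```python
from collections import defaultdict
from typing import NamedTuple


class Disk(NamedTuple):
    """Hypothetical description of one disk; all fields are typed in by hand."""
    name: str        # device name, e.g. "hda" (illustrative)
    controller: str  # controller the disk is attached to
    bus: str         # IDE cable or SCSI chain on that controller
    array: str       # RAID array the disk belongs to


def shared_bus_conflicts(disks):
    """Return (array, controller, bus, names) for every cable or chain
    that carries two or more members of the same array."""
    groups = defaultdict(list)
    for d in disks:
        groups[(d.array, d.controller, d.bus)].append(d.name)
    return [
        (array, controller, bus, sorted(names))
        for (array, controller, bus), names in sorted(groups.items())
        if len(names) > 1
    ]


def single_controller_arrays(disks):
    """Return the arrays whose members all sit on one controller."""
    controllers = defaultdict(set)
    for d in disks:
        controllers[d.array].add(d.controller)
    return sorted(a for a, used in controllers.items() if len(used) == 1)


# Example layout (assumed, not taken from a real machine).
layout = [
    Disk("hda", "ide0", "primary", "md0"),
    Disk("hdb", "ide0", "primary", "md0"),  # same ribbon as hda
    Disk("sda", "scsi0", "chain0", "md1"),
    Disk("sdb", "scsi1", "chain0", "md1"),  # separate controller and chain
]

for array, controller, bus, names in shared_bus_conflicts(layout):
    print(f"{array}: {', '.join(names)} share {controller}/{bus}")
for array in single_controller_arrays(layout):
    print(f"{array}: all members are on one controller")
```

With the example layout, md0 is reported twice (both disks share one ribbon, and the array depends on a single controller), while md1 passes both checks.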
