Q:
RAID can help protect me against data loss. But how can I also
ensure that the system is up as long as possible, and not prone
to breakdown? Ideally, I want a system that is up 24 hours a
day, 7 days a week, 365 days a year.
A:
High-Availability is difficult and expensive. The harder
you try to make a system be fault tolerant, the harder
and more expensive it gets. The following hints, tips,
ideas and unsubstantiated rumors may help you with this
quest.
- IDE disks can fail in such a way that the failed disk
on an IDE ribbon can also prevent the good disk on the
same ribbon from responding, thus making it look as
if two disks have failed. Since RAID does not
protect against two-disk failures, one should either
put only one disk on an IDE cable, or if there are two
disks, they should belong to different RAID sets.
- SCSI disks can fail in such a way that the failed disk
on a SCSI chain can prevent any device on the chain
from being accessed. The failure mode involves a
short of the common (shared) device ready pin;
since this pin is shared, no arbitration can occur
until the short is removed. Thus, no two disks on the
same SCSI chain should belong to the same RAID array.
- Similar remarks apply to the disk controllers.
Don't load up the channels on one controller; use
multiple controllers.
- Don't use the same brand or model number for all of
the disks. It is not uncommon for severe electrical
storms to take out two or more disks. (Yes, we
all use surge suppressors, but these are not perfect
either). Heat & poor ventilation of the disk
enclosure are other disk killers. Cheap disks
often run hot.
Using different brands of disk & controller
decreases the likelihood that whatever took out one disk
(heat, physical shock, vibration, electrical surge)
will also damage the others at the same time.
- To guard against controller or CPU failure,
it should be possible to build a SCSI disk enclosure
that is "twin-tailed": i.e. is connected to two
computers. One computer will mount the file-systems
read-write, while the second computer will mount them
read-only, and act as a hot spare. When the hot-spare
is able to determine that the master has failed (e.g.
through a watchdog), it will cut the power to the
master (to make sure that it's really off), and then
fsck & remount read-write. If anyone gets
this working, let me know.
- Always use a UPS, and perform clean shutdowns.
Although an unclean shutdown may not damage the disks,
running ckraid on even small-ish arrays is painfully
slow. You want to avoid running ckraid as much as
possible. Or you can hack on the kernel and get the
hot-reconstruction code debugged ...
- SCSI cables are well-known to be very temperamental
creatures, and prone to cause all sorts of problems.
Use the highest quality cabling that you can find for
sale. Use e.g. bubble-wrap to make sure that ribbon
cables do not get too close to one another and
cross-talk. Rigorously observe cable-length
restrictions.
- Take a look at SSA (Serial Storage Architecture).
Although it is rather expensive, it is rumored
to be less prone to the failure modes that SCSI
exhibits.
- Enjoy yourself, it's later than you think.
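
The cabling rules above (no two disks of one RAID array on the same
IDE ribbon or SCSI chain) are easy to check mechanically. What follows
is only a minimal sketch in Python, not a tool that ships with the
RAID software: the function name shared_bus_conflicts, the layout
table, and the disk, array and bus names in it are all made up for
illustration, and you would have to fill in your own layout by hand.

```python
from collections import defaultdict


def shared_bus_conflicts(layout):
    """Find buses that carry two or more disks of the same RAID array.

    layout maps a disk name to an (array, bus) pair. All names are
    hypothetical; nothing here is read from the running system.
    Returns a dict mapping (array, bus) to the list of offending disks.
    """
    groups = defaultdict(list)
    # Sort by disk name so the reported disk lists are in a stable order.
    for disk, (array, bus) in sorted(layout.items()):
        groups[(array, bus)].append(disk)
    # A bus is a problem only if one array has several members on it.
    return {key: disks for key, disks in groups.items() if len(disks) > 1}


# Hypothetical example layout: md0 is spread over two SCSI chains,
# while md1 has both of its disks on the same chain.
layout = {
    "sda": ("md0", "scsi0"),
    "sdb": ("md0", "scsi1"),
    "sdc": ("md1", "scsi0"),
    "sdd": ("md1", "scsi0"),
}

if __name__ == "__main__":
    for (array, bus), disks in shared_bus_conflicts(layout).items():
        print(f"{array}: {', '.join(disks)} share {bus}")
```

With the example table, only md1 is reported, since sdc and sdd sit
on the same chain. Note that sda and sdc sharing scsi0 is not flagged:
they belong to different arrays, which is exactly the arrangement the
IDE hint above allows.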