Replacing a Disk in a RAID1 Array: A Practical Guide

9 Testing

Now let's simulate a disk failure. It does not matter whether you pick /dev/hda or /dev/hdb. The example below assumes that /dev/hdb has failed.

To simulate the failure, either shut down the machine and physically remove /dev/hdb, or mark its partitions as failed in software with the following commands:

mdadm --manage /dev/md0 --fail /dev/hdb1  
mdadm --manage /dev/md1 --fail /dev/hdb5  
mdadm --manage /dev/md2 --fail /dev/hdb6

Then remove the partitions from their arrays:

mdadm --manage /dev/md0 --remove /dev/hdb1  
mdadm --manage /dev/md1 --remove /dev/hdb5  
mdadm --manage /dev/md2 --remove /dev/hdb6

Shut down the system:

shutdown -h now

Then install a new disk as /dev/hdb (if you simulated a failure of /dev/hda, move the surviving disk into /dev/hda's place and connect the new disk as /dev/hdb) and boot the machine. The system should start normally.

Now check the degraded state of the arrays:

cat /proc/mdstat

Example output:

[root@server1 ~]# cat /proc/mdstat  
Personalities : [raid1]  
md1 : active raid1 hda5[0]  
      417536 blocks [2/1] [U_]  
  
md0 : active raid1 hda1[0]  
      176576 blocks [2/1] [U_]  
  
md2 : active raid1 hda6[0]  
      4642688 blocks [2/1] [U_]  
  
unused devices: <none>
[root@server1 ~]#
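
On a system with many arrays, reading the [U_] markers by eye gets tedious. The following is a minimal sketch of a helper that lists the degraded arrays; the function name degraded_arrays is hypothetical and not part of mdadm, and the parsing assumes the /proc/mdstat layout shown in the example above.

```shell
#!/bin/sh
# Hypothetical helper (not part of mdadm): print the name of every md array
# whose member status field, such as [U_], reports a missing member.
# Reads /proc/mdstat by default; a file path may be passed instead.
degraded_arrays() {
    awk '
        /^md[0-9]+ :/ { name = $1 }        # remember the array being described
        match($0, /\[[U_]+\]/) {           # member status field, e.g. [UU] or [U_]
            if (index(substr($0, RSTART, RLENGTH), "_")) print name
        }
    ' "${1:-/proc/mdstat}"
}
```

Called without arguments on the degraded system shown above, degraded_arrays would print md1, md0 and md2, one per line.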

The output of fdisk -l should look similar to this:

[root@server1 ~]# fdisk -l  
  
Disk /dev/hda: 5368 MB, 5368709120 bytes  
255 heads, 63 sectors/track, 652 cylinders  
Units = cylinders of 16065 * 512 = 8225280 bytes  
Disk identifier: 0x00000000  
  
   Device Boot      Start         End      Blocks   Id  System  
/dev/hda1   *           1          22      176683+  fd  Linux raid autodetect  
/dev/hda2             23         652     5060475    5  Extended  
/dev/hda5             23          74      417658+  fd  Linux raid autodetect  
/dev/hda6             75         652     4642753+  fd  Linux raid autodetect  
  
Disk /dev/hdb: 5368 MB, 5368709120 bytes  
16 heads, 63 sectors/track, 10402 cylinders  
Units = cylinders of 1008 * 512 = 516096 bytes  
Disk identifier: 0x00000000  
  
Disk /dev/hdb doesn't contain a valid partition table  
  
Disk /dev/md2: 4754 MB, 4754112512 bytes  
2 heads, 4 sectors/track, 1160672 cylinders  
Units = cylinders of 8 * 512 = 4096 bytes  
Disk identifier: 0x00000000  
  
Disk /dev/md2 doesn't contain a valid partition table  
  
Disk /dev/md0: 180 MB, 180813824 bytes  
2 heads, 4 sectors/track, 44144 cylinders  
Units = cylinders of 8 * 512 = 4096 bytes  
Disk identifier: 0x00000000  
  
Disk /dev/md0 doesn't contain a valid partition table  
  
Disk /dev/md1: 427 MB, 427556864 bytes  
2 heads, 4 sectors/track, 104384 cylinders  
Units = cylinders of 8 * 512 = 4096 bytes  
Disk identifier: 0x00000000  
  
Disk /dev/md1 doesn't contain a valid partition table  
[root@server1 ~]#

Now copy the partition table from /dev/hda to /dev/hdb:

sfdisk -d /dev/hda | sfdisk --force /dev/hdb

Example output of the command:

[root@server1 ~]# sfdisk -d /dev/hda | sfdisk --force /dev/hdb  
Warning: extended partition does not start at a cylinder boundary.  
DOS and Linux will interpret the contents differently.  
Checking that no-one is using this disk right now ...  
OK  
  
Disk /dev/hdb: 10402 cylinders, 16 heads, 63 sectors/track  
  
sfdisk: ERROR: sector 0 does not have an msdos signature  
 /dev/hdb: unrecognized partition table type  
Old situation:  
No partitions found  
New situation:  
Units = sectors of 512 bytes, counting from 0  
  
   Device Boot    Start       End   #sectors  Id  System  
/dev/hdb1   *        63    353429     353367  fd  Linux raid autodetect  
/dev/hdb2        353430  10474379   10120950   5  Extended  
/dev/hdb3             0         -         0   0  Empty  
/dev/hdb4             0         -         0   0  Empty  
/dev/hdb5        353493   1188809     835317  fd  Linux raid autodetect  
/dev/hdb6       1188873  10474379    9285507  fd  Linux raid autodetect  
Warning: partition 1 does not end at a cylinder boundary  
Successfully wrote the new partition table  
  
Re-reading the partition table ...  
  
If you created or changed a DOS partition, /dev/foo7, say, then use dd(1)  
to zero the first 512 bytes:  dd if=/dev/zero of=/dev/foo7 bs=512 count=1  
(See fdisk(8).)  
[root@server1 ~]#
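
Before adding the new partitions to the arrays, it can be reassuring to confirm that both disks now report the same layout. The sketch below compares the two sfdisk -d dumps after replacing the device name with a placeholder. The function names normalize_dump and same_layout are hypothetical, and the comparison assumes that the dumps differ only in the device name, which may not hold for every sfdisk version.

```shell
#!/bin/sh
# Hypothetical helpers (not part of sfdisk).

# Replace the device name given as $1 with the placeholder DISK in a
# partition dump read from standard input.
normalize_dump() {
    sed "s#$1#DISK#g"
}

# Succeed only when both disks produce a non-empty, identical dump once the
# device names are masked. Needs root privileges and sfdisk.
same_layout() {
    a=$(sfdisk -d "$1" | normalize_dump "$1")
    b=$(sfdisk -d "$2" | normalize_dump "$2")
    [ -n "$a" ] && [ "$a" = "$b" ]
}
```

A possible call is: same_layout /dev/hda /dev/hdb && echo "layouts match"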

After writing the partition table, clear any leftover RAID superblocks on /dev/hdb:

mdadm --zero-superblock /dev/hdb1  
mdadm --zero-superblock /dev/hdb5  
mdadm --zero-superblock /dev/hdb6

Then add the partitions to the RAID arrays:

mdadm -a /dev/md0 /dev/hdb1  
mdadm -a /dev/md1 /dev/hdb5  
mdadm -a /dev/md2 /dev/hdb6

Check the synchronization status:

cat /proc/mdstat

Example during the rebuild:

[root@server1 ~]# cat /proc/mdstat  
Personalities : [raid1]  
md1 : active raid1 hdb5[2] hda5[0]  
      417536 blocks [2/1] [U_]  
        resync=DELAYED  
  
md0 : active raid1 hdb1[1] hda1[0]  
      176576 blocks [2/2] [UU]  
  
md2 : active raid1 hdb6[2] hda6[0]  
      4642688 blocks [2/1] [U_]  
      [===========>.........]  recovery = 59.9% (2784512/4642688) finish=7.5min speed=4076K/sec  
  
unused devices: <none>
[root@server1 ~]#

Wait for the synchronization to finish. Once an array is healthy, its status shows [UU]:

[root@server1 ~]# cat /proc/mdstat  
Personalities : [raid1]  
md1 : active raid1 hdb5[1] hda5[0]  
      417536 blocks [2/2] [UU]  
  
md0 : active raid1 hdb1[1] hda1[0]  
      176576 blocks [2/2] [UU]  
  
md2 : active raid1 hdb6[1] hda6[0]  
      4642688 blocks [2/2] [UU]  
  
unused devices: <none>
[root@server1 ~]#
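
Instead of re-running cat /proc/mdstat by hand, the wait can be scripted. This is a rough sketch under stated assumptions: the function names arrays_healthy and wait_for_sync and the POLL_SECONDS variable are hypothetical, the 30-second default interval is arbitrary, and an array counts as unhealthy when its status field contains an underscore or when a recovery or resync line is present.

```shell
#!/bin/sh
# Hypothetical helpers (not part of mdadm).

# Return 0 when no array is degraded or rebuilding, 1 when at least one is,
# and 2 when the status file cannot be read.
arrays_healthy() {
    f=${1:-/proc/mdstat}
    [ -r "$f" ] || return 2
    ! grep -Eq '\[[U_]*_[U_]*\]|recovery|resync' "$f"
}

# Poll until every array is healthy, sleeping POLL_SECONDS between checks.
wait_for_sync() {
    while :; do
        arrays_healthy "$@"
        case $? in
            0) return 0 ;;
            2) echo "cannot read mdstat file" >&2; return 2 ;;
        esac
        sleep "${POLL_SECONDS:-30}"
    done
}
```

A possible call is: wait_for_sync && echo "all arrays are in sync"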

Finally, reinstall the GRUB bootloader on both bootable disks. Start the GRUB shell:

grub

At the grub prompt, run:

root (hd0,0)  
setup (hd0)  
root (hd1,0)  
setup (hd1)  
quit

Important: always double-check device names before running destructive commands. A mistake with sfdisk or dd can erase data.

Quick playbook

Identify the failed disk with cat /proc/mdstat and fdisk -l
Simulate the failure with mdadm --manage --fail and --remove
Shut down and replace the physical disk if necessary
Copy the partition table: sfdisk -d /dev/hda | sfdisk --force /dev/hdb
Run mdadm --zero-superblock on the new partitions
Run mdadm -a to add the partitions to the RAID arrays
Monitor with cat /proc/mdstat until every array shows [UU]
Reinstall GRUB on both disks
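
Because a wrong device name is the main risk in this procedure, one option is to print the commands for review before typing any of them. The sketch below only echoes the commands and executes nothing. The function name print_rebuild_plan is hypothetical, and the array-to-partition pairs (md0 with partition 1, md1 with 5, md2 with 6) are taken from this guide's example layout, so they would need adjusting on any other system.

```shell
#!/bin/sh
# Hypothetical helper: print, without executing, the commands that re-add a
# replaced disk. $1 is the healthy disk, $2 is the new disk.
print_rebuild_plan() {
    good=$1
    new=$2
    echo "sfdisk -d $good | sfdisk --force $new"
    # Each input line pairs an md array with its partition number
    # (this guide's example layout).
    while read -r md part; do
        echo "mdadm --zero-superblock $new$part"
        echo "mdadm -a /dev/$md $new$part"
    done <<EOF
md0 1
md1 5
md2 6
EOF
}
```

A possible call is: print_rebuild_plan /dev/hda /dev/hdb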

Checklist by role

Administrator: verify that recent backups exist; approve the procedure; announce the maintenance window
Operator: run the commands, document any errors, monitor the rebuild
QA verifier: confirm that the service starts and that the filesystems are mounted

Acceptance criteria

Every RAID array shows [UU]
The system boots correctly from either disk (GRUB installed on both)
dmesg shows no error messages related to devices or superblocks

When this procedure can fail, and alternatives

The new disk is defective: replace it with a different model or try a different slot
Disk geometry differences cause problems: create correctly aligned partitions and verify them with fdisk or parted
A superblock cannot be cleared: run mdadm --zero-superblock against the correct device, or reboot into single-user mode

Alternatives:

Use a live CD/USB to work on the disks with no filesystems mounted
In enterprise environments, hot-swap the disk and let the hardware controller handle the rebuild

Quick troubleshooting

If the rebuild does not start, check /var/log/messages or dmesg for SCSI or ATA errors
If sfdisk refuses to write, check that the disk is not hardware write-protected
If GRUB does not install, use grub-install or reinstall it from a live environment

10 Links

The Software-RAID Howto: http://tldp.org/HOWTO/Software-RAID-HOWTO.html
Mandriva: http://www.mandriva.com

Summary

The key steps are simulating the failure, copying the partition table, and adding the new disk to the RAID so that the arrays rebuild.
Check the status with cat /proc/mdstat and reinstall the bootloader on both disks.
Keep a checklist and clear roles to minimize operational risk.
