Kernel failure with software RAID-1 due to single drive read errors

I am running Fedora 19 (kernel 3.11.3-201.fc19.x86_64) with a software RAID-1 (mdadm) setup on two identical Seagate 1GB drives:

# cat /proc/mdstat 
Personalities : [raid1] 
md1 : active raid1 sdb3[1] sda3[0]
      973827010 blocks super 1.2 [2/2] [UU]

unused devices: <none>

Recently, some errors are presenting on one of the two drives: smartd has detected "1 Currently unreadable (pending) sectors" and "1 Offline uncorrectable sectors". The RAID array "rescheduled" some sectors. Then about a day later, the kernel produced various I/O messages/exceptions:

Oct 18 06:39:20 x smartd[461]: Device: /dev/sdb [SAT], 1 Currently unreadable (pending) sectors
Oct 18 06:39:20 x smartd[461]: Device: /dev/sdb [SAT], 1 Offline uncorrectable sectors
...
Oct 18 07:09:20 x smartd[461]: Device: /dev/sdb [SAT], 1 Currently unreadable (pending) sectors
Oct 18 07:09:20 x smartd[461]: Device: /dev/sdb [SAT], 1 Offline uncorrectable sectors
...
Oct 18 07:30:28 x kernel: [467502.192792] md/raid1:md1: sdb3: rescheduling sector 1849689328
Oct 18 07:30:28 x kernel: [467502.192822] md/raid1:md1: sdb3: rescheduling sector 1849689336
Oct 18 07:30:28 x kernel: [467502.192846] md/raid1:md1: sdb3: rescheduling sector 1849689344
Oct 18 07:30:28 x kernel: [467502.192870] md/raid1:md1: sdb3: rescheduling sector 1849689352
Oct 18 07:30:28 x kernel: [467502.192895] md/raid1:md1: sdb3: rescheduling sector 1849689360
Oct 18 07:30:28 x kernel: [467502.192919] md/raid1:md1: sdb3: rescheduling sector 1849689368
Oct 18 07:30:28 x kernel: [467502.192943] md/raid1:md1: sdb3: rescheduling sector 1849689376
Oct 18 07:30:28 x kernel: [467502.192966] md/raid1:md1: sdb3: rescheduling sector 1849689384
Oct 18 07:30:28 x kernel: [467502.192991] md/raid1:md1: sdb3: rescheduling sector 1849689392
Oct 18 07:30:28 x kernel: [467502.193035] md/raid1:md1: sdb3: rescheduling sector 1849689400
...
Oct 19 06:26:08 x kernel: [550035.944191] ata3.01: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Oct 19 06:26:08 x kernel: [550035.944224] ata3.01: BMDMA stat 0x64
Oct 19 06:26:08 x kernel: [550035.944248] ata3.01: failed command: READ DMA EXT
Oct 19 06:26:08 x kernel: [550035.944274] ata3.01: cmd 25/00:08:15:fb:9c/00:00:6c:00:00/f0 tag 0 dma 4096 in
Oct 19 06:26:08 x kernel: [550035.944274]          res 51/40:00:1c:fb:9c/40:00:6c:00:00/10 Emask 0x9 (media error)
Oct 19 06:26:08 x kernel: [550035.944322] ata3.01: status: { DRDY ERR }
Oct 19 06:26:08 x kernel: [550035.944340] ata3.01: error: { UNC }
Oct 19 06:26:08 x kernel: [550036.573438] ata3.00: configured for UDMA/133
Oct 19 06:26:08 x kernel: [550036.621444] ata3.01: configured for UDMA/133
Oct 19 06:26:08 x kernel: [550036.621507] sd 2:0:1:0: [sdb] Unhandled sense code
Oct 19 06:26:08 x kernel: [550036.621516] sd 2:0:1:0: [sdb]  
Oct 19 06:26:08 x kernel: [550036.621523] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Oct 19 06:26:08 x kernel: [550036.621530] sd 2:0:1:0: [sdb]  
Oct 19 06:26:08 x kernel: [550036.621537] Sense Key : Medium Error [current] [descriptor]
Oct 19 06:26:08 x kernel: [550036.621555] Descriptor sense data with sense descriptors (in hex):
Oct 19 06:26:08 x kernel: [550036.621562]         72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00 
Oct 19 06:26:08 x kernel: [550036.621606]         6c 9c fb 1c 
Oct 19 06:26:08 x kernel: [550036.621626] sd 2:0:1:0: [sdb]  
Oct 19 06:26:08 x kernel: [550036.621638] Add. Sense: Unrecovered read error - auto reallocate failed
Oct 19 06:26:08 x kernel: [550036.621646] sd 2:0:1:0: [sdb] CDB: 
Oct 19 06:26:08 x kernel: [550036.621653] Read(10): 28 00 6c 9c fb 15 00 00 08 00
Oct 19 06:26:08 x kernel: [550036.621692] end_request: I/O error, dev sdb, sector 1822227228
Oct 19 06:26:08 x kernel: [550036.621719] raid1_end_read_request: 9 callbacks suppressed
Oct 19 06:26:08 x kernel: [550036.621727] md/raid1:md1: sdb3: rescheduling sector 1816361448
Oct 19 06:26:08 x kernel: [550036.621782] ata3: EH complete
Oct 19 06:26:08 x kernel: [550041.155637] ata3.01: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Oct 19 06:26:08 x kernel: [550041.155669] ata3.01: BMDMA stat 0x64
Oct 19 06:26:08 x kernel: [550041.155694] ata3.01: failed command: READ DMA EXT
Oct 19 06:26:08 x kernel: [550041.155719] ata3.01: cmd 25/00:08:15:fb:9c/00:00:6c:00:00/f0 tag 0 dma 4096 in
Oct 19 06:26:08 x kernel: [550041.155719]          res 51/40:00:1c:fb:9c/40:00:6c:00:00/10 Emask 0x9 (media error)
Oct 19 06:26:08 x kernel: [550041.155767] ata3.01: status: { DRDY ERR }
Oct 19 06:26:08 x kernel: [550041.155785] ata3.01: error: { UNC }
Oct 19 06:26:08 x kernel: [550041.343437] ata3.00: configured for UDMA/133
Oct 19 06:26:08 x kernel: [550041.391438] ata3.01: configured for UDMA/133
Oct 19 06:26:08 x kernel: [550041.391501] sd 2:0:1:0: [sdb] Unhandled sense code
Oct 19 06:26:08 x kernel: [550041.391510] sd 2:0:1:0: [sdb]  
Oct 19 06:26:08 x kernel: [550041.391517] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Oct 19 06:26:08 x kernel: [550041.391525] sd 2:0:1:0: [sdb]  
Oct 19 06:26:08 x kernel: [550041.391532] Sense Key : Medium Error [current] [descriptor]
Oct 19 06:26:08 x kernel: [550041.391546] Descriptor sense data with sense descriptors (in hex):
Oct 19 06:26:08 x kernel: [550041.391553]         72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00 
Oct 19 06:26:08 x kernel: [550041.391596]         6c 9c fb 1c 
Oct 19 06:26:08 x kernel: [550041.391616] sd 2:0:1:0: [sdb]  
Oct 19 06:26:08 x kernel: [550041.391624] Add. Sense: Unrecovered read error - auto reallocate failed
Oct 19 06:26:08 x kernel: [550041.391636] sd 2:0:1:0: [sdb] CDB: 
Oct 19 06:26:08 x kernel: [550041.391643] Read(10): 28 00 6c 9c fb 15 00 00 08 00
Oct 19 06:26:08 x kernel: [550041.391681] end_request: I/O error, dev sdb, sector 1822227228
Oct 19 06:26:08 x kernel: [550041.391737] ata3: EH complete
Oct 19 06:26:08 x kernel: [550041.409686] md/raid1:md1: read error corrected (8 sectors at 1816363496 on sdb3)
Oct 19 06:26:08 x kernel: [550041.409705] handle_read_error: 9 callbacks suppressed
Oct 19 06:26:08 x kernel: [550041.409709] md/raid1:md1: redirecting sector 1816361448 to other mirror: sda3

The computer continued to log some entries into syslog until about an hour later, and then finally became completely unresponsive. No "kernel oops" was recorded in syslog. The computer had to be rebooted, at which point the raid array was in a resync status. After the resync, all appears well and the drives seem to be working normally.

I also noted that all the rescheduled sectors are exactly 8 sectors apart, which seems very strange to me.

Lastly, after reboot and approximately a day or two later but significantly after the raid resync was complete, the drive reset the unreadable (pending) and offline uncorrectable sector count, which I believe is normal as the drive takes those sectors offline and remaps them:

Oct 20 01:05:42 x kernel: [    2.186400] md: bind<sda3>
Oct 20 01:05:42 x kernel: [    2.204826] md: bind<sdb3>
Oct 20 01:05:42 x kernel: [    2.209618] md: raid1 personality registered for level 1
Oct 20 01:05:42 x kernel: [    2.210079] md/raid1:md1: not clean -- starting background reconstruction
Oct 20 01:05:42 x kernel: [    2.210087] md/raid1:md1: active with 2 out of 2 mirrors
Oct 20 01:05:42 x kernel: [    2.210122] md1: detected capacity change from 0 to 997198858240
Oct 20 01:05:42 x kernel: [    2.210903] md: resync of RAID array md1
Oct 20 01:05:42 x kernel: [    2.210911] md: minimum _guaranteed_  speed: 1000 KB/sec/disk.
Oct 20 01:05:42 x kernel: [    2.210915] md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for resync.
Oct 20 01:05:42 x kernel: [    2.210920] md: using 128k window, over a total of 973827010k.
Oct 20 01:05:42 x kernel: [    2.241676]  md1: unknown partition table
...
Oct 20 06:33:10 x kernel: [19672.235467] md: md1: resync done.
...
Oct 21 05:35:50 x smartd[455]: Device: /dev/sdb [SAT], No more Currently unreadable (pending) sectors, warning condition reset after 1 email
Oct 21 05:35:50 x smartd[455]: Device: /dev/sdb [SAT], No more Offline uncorrectable sectors, warning condition reset after 1 email

smartctl -a /dev/sdb now shows

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
...
  5 Reallocated_Sector_Ct   0x0033   096   096   036    Pre-fail  Always       -       195
...
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
...

Some questions arise:

1) Why would all the rescheduled sectors be exactly 8 apart?

2) Why would the kernel become unresponsive and require a reboot? Shouldn't this be exactly the situation RAID-1 is meant to handle without the system blowing up?

3) Why would the unreadable and offline uncorrectable counts reset only 23 hours after the raid resync was complete?

4
2022-07-25 20:45:23
Source Share
Answers: 0