Lustre / LU-9068

Hardware problem resulting in bad blocks

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Blocker
    • Affects Version: Lustre 2.5.5

    Description

      We encountered a hardware problem on the MDT storage device (a DDN 7700) that resulted in bad blocks. The file system continued to operate, but yesterday it went read-only when it stumbled over a bad sector.

      We ran fsck against the file system with the most current e2fsprogs, which repaired the file system but dumped 90 objects/files into lost+found. All but two belonged to one user. One of the remaining files/objects, however, belongs to root and has a low inode number (#5749), and it appears to be a data file.

      We are very concerned that this particular file may be Lustre-relevant and would like your guidance on what we should do. (We are able to mount the file system as ldiskfs.)

      Attachments

      Issue Links

      Activity
            jamervi Joe Mervini added a comment -

            Note: All the OSS have remained up with all OSTs mounted.

            jamervi Joe Mervini added a comment - - edited

            We rebooted the node and started lustre. The file system mounted and was in the process of connecting clients when it panicked with an LBUG. (This is before trying to rename the file.)

Jan 31 11:51:57 gmds1 kernel: [944595.780330] LustreError: 86284:0:(ldlm_lib.c:2324:target_queue_recovery_request()) @@@ dropping resent queued req  req@ffff880bd2005800 x1557048164247136/t0(177528794042) o35->0e269f32-c7c8-9037-e29a-c1f762282b43@10.1.9.69@o2ib:0/0 lens 392/0 e 0 to 0 dl 1485888778 ref 2 fl Interpret:/6/ffffffff rc 0/-1
            Jan 31 11:51:57 gmds1 kernel: [944596.013130] LustreError: 86284:0:(ldlm_lib.c:2324:target_queue_recovery_request()) Skipped 202 previous similar messages
            Jan 31 11:52:12 gmds1 kernel: [944610.729259] Lustre: gscratch-MDT0000: Client 5a2a05db-5f33-aecf-9cce-44f4bfe225ab (at 10.1.6.230@o2ib) reconnecting, waiting for 7326 clients in recovery for 3:28
            Jan 31 11:52:12 gmds1 kernel: [944610.887017] Lustre: Skipped 431 previous similar messages
Jan 31 11:52:14 gmds1 kernel: [944612.182223] LustreError: 85847:0:(ldlm_lib.c:2324:target_queue_recovery_request()) @@@ dropping resent queued req  req@ffff880c1c85c850 x1556994744167148/t0(177528744813) o35->ea447cc5-9dd6-7b57-077b-df9f767a42e4@10.1.1.156@o2ib:0/0 lens 392/0 e 0 to 0 dl 1485888795 ref 2 fl Interpret:/6/ffffffff rc 0/-1
            Jan 31 11:52:14 gmds1 kernel: [944612.415138] LustreError: 85847:0:(ldlm_lib.c:2324:target_queue_recovery_request()) Skipped 127 previous similar messages
Jan 31 11:52:14 gmds1 kernel: [944612.589958] Lustre: gscratch-MDT0000: Denying connection for new client 4bb27966-87a7-7a95-6e3e-1007a28f70bd (at 10.1.7.205@o2ib), waiting for all 7326 known clients (2152 recovered, 1857 in progress, and 0 evicted) to recover in 3:26
            Jan 31 11:52:14 gmds1 kernel: [944612.756886] Lustre: Skipped 12 previous similar messages
            Jan 31 11:52:14 gmds1 kernel: [944612.896489] LustreError: 86016:0:(llog_cat.c:195:llog_cat_id2handle()) gscratch-OST004a-osc-MDT0000: error opening log id 0x19cee:1:0: rc = -2
            Jan 31 11:52:15 gmds1 kernel: [944613.046320] LustreError: 86016:0:(llog_cat.c:586:llog_cat_process_cb()) gscratch-OST004a-osc-MDT0000: cannot find handle for llog 0x19cee:1: -2
            Jan 31 11:52:24 gmds1 kernel: [944622.532283] LustreError: 86182:0:(llog_osd.c:757:llog_osd_next_block()) gscratch-MDT0000-osd: missed desired record? 45953 > 39475
            Jan 31 11:52:24 gmds1 kernel: [944622.681106] LustreError: 86182:0:(osp_sync.c:872:osp_sync_thread()) ASSERTION( rc == 0 || rc == LLOG_PROC_BREAK ) failed: 0 changes, 6 in progress, 0 in flight: -2
            Jan 31 11:52:24 gmds1 kernel: [944622.833222] LustreError: 86182:0:(osp_sync.c:872:osp_sync_thread()) LBUG
            Jan 31 11:52:24 gmds1 kernel: [944622.908388] Pid: 86182, comm: osp-syn-89-0
            Jan 31 11:52:24 gmds1 kernel: [944622.979766]
            Jan 31 11:52:24 gmds1 kernel: [944622.979768] Call Trace:
            Jan 31 11:52:25 gmds1 kernel: [944623.113187]  [<ffffffffa05ed8f5>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
            Jan 31 11:52:25 gmds1 kernel: [944623.185016]  [<ffffffffa05edef7>] lbug_with_loc+0x47/0xb0 [libcfs]
            Jan 31 11:52:25 gmds1 kernel: [944623.255067]  [<ffffffffa15e6e03>] osp_sync_thread+0x753/0x7d0 [osp]
            Jan 31 11:52:25 gmds1 kernel: [944623.325388]  [<ffffffff8154b79e>] ? schedule+0x3ee/0xb70
            Jan 31 11:52:25 gmds1 kernel: [944623.395727]  [<ffffffffa15e66b0>] ? osp_sync_thread+0x0/0x7d0 [osp]
            Jan 31 11:52:25 gmds1 kernel: [944623.466450]  [<ffffffff810a6d1e>] kthread+0x9e/0xc0
            Jan 31 11:52:25 gmds1 kernel: [944623.534286]  [<ffffffff8100c2ca>] child_rip+0xa/0x20
            Jan 31 11:52:25 gmds1 kernel: [944623.601056]  [<ffffffff810a6c80>] ? kthread+0x0/0xc0
            Jan 31 11:52:25 gmds1 kernel: [944623.666502]  [<ffffffff8100c2c0>] ? child_rip+0x0/0x20
            Jan 31 11:52:25 gmds1 kernel: [944623.731052]
            Jan 31 11:52:26 gmds1 kernel: [944623.790477] Kernel panic - not syncing: LBUG
            

            adilger Andreas Dilger added a comment -

            It looks like this is indeed a Lustre internal file, based on the FID reported in the stat output:

            lma: fid=[0x1:0x19c6b:0x0]
            

            This object is an llog file, and can be renamed to O/1/d11/105579 (the FID number in decimal). However, it may be that the MDS has moved on once it detected this log file was missing (there may be some error messages on the MDS console to that effect). At worst this may mean a few OST objects are orphaned, but they will be cleaned up when LFSCK is run with Lustre 2.7 or later (it isn't mentioned in this ticket what version you are running).

            So, in summary, it is possible to repair this file, which may be helpful, but it isn't critical if repairing it would take the system offline.
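            For anyone following along, the mapping Andreas describes can be sketched as a small helper (an illustrative sketch, not a Lustre tool; the `llog_path` name is made up here). An llog object with FID [seq:oid:ver] lives under O/&lt;seq&gt;/d&lt;oid mod 32&gt;/&lt;oid in decimal&gt; on the ldiskfs-mounted MDT, which is how 0x19c6b becomes O/1/d11/105579:

            ```python
            def llog_path(seq: int, oid: int) -> str:
                """Map an llog FID [seq:oid:ver] to its path under O/ on the
                ldiskfs-mounted MDT: oid mod 32 selects the d* subdirectory,
                and the file name is the object ID in decimal."""
                return f"O/{seq}/d{oid % 32}/{oid}"

            # FID from the stat output above: [0x1:0x19c6b:0x0]
            print(llog_path(0x1, 0x19c6b))  # O/1/d11/105579
            ```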

            jamervi Joe Mervini added a comment -

            We examined the file with debugfs and these are the stats (I'm attaching the full output):

            debugfs 1.42.13.wc5 (15-Apr-2016)
            /dev/mapper/360001ff0a096f0000000000d88530000: catastrophic mode - not reading inode or group bitmaps
            Inode: 5749 Type: regular Mode: 0644 Flags: 0x0
            Generation: 3182218171 Version: 0x00000000:00000000
            User: 0 Group: 0 Size: 4153344
            File ACL: 0 Directory ACL: 0
            Links: 1 Blockcount: 8120
            Fragment: Address: 0 Number: 0 Size: 0
            ctime: 0x588e2a64:6eccc4e4 – Sun Jan 29 10:46:12 2017
            atime: 0x589093a1:254e4128 – Tue Jan 31 06:39:45 2017
            mtime: 0x588e2a64:6eccc4e4 – Sun Jan 29 10:46:12 2017
            crtime: 0x588e2a64:6eccc4e4 – Sun Jan 29 10:46:12 2017
            Size of extra inode fields: 28
            Extended attributes stored in inode body:
            lma = "00 00 00 00 00 00 00 00 01 00 00 00 00 00 00 00 6b 9c 01 00 00 00 00 00 " (24)
            lma: fid=[0x1:0x19c6b:0x0] compat=0 incompat=0

            Since the create time occurred after we encountered the hardware hiccup, we believe it might be benign, but we would like confirmation.
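            The fid= line that debugfs prints can be recovered by hand from the raw lma bytes shown above. A minimal sketch, assuming the on-disk struct lustre_mdt_attrs layout (little-endian u32 compat, u32 incompat, then the FID as u64 sequence, u32 oid, u32 version):

            ```python
            import struct

            # Raw trusted.lma xattr bytes as printed by debugfs for inode 5749.
            lma_hex = ("00 00 00 00 00 00 00 00 01 00 00 00 00 00 00 00 "
                       "6b 9c 01 00 00 00 00 00")
            raw = bytes.fromhex(lma_hex.replace(" ", ""))

            # <IIQII: compat, incompat, fid sequence, fid oid, fid version.
            compat, incompat, seq, oid, ver = struct.unpack("<IIQII", raw)

            print(f"lma: fid=[{seq:#x}:{oid:#x}:{ver:#x}] "
                  f"compat={compat} incompat={incompat}")
            # lma: fid=[0x1:0x19c6b:0x0] compat=0 incompat=0
            ```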

            jamervi Joe Mervini added a comment -

            Thanks Peter. Yes - It's coming up. The site is really sluggish here at Sandia. Don't know if it's just us...

            pjones Peter Jones added a comment -

            Joe

            I created this ticket on your behalf. Please confirm whether you are able to post updates OK.

            Peter


            People

              pjones Peter Jones
              jamervi Joe Mervini
              Votes: 0
              Watchers: 4