ldiskfs_map_blocks: comm mdt00_100: lblock 0 mapped to illegal pblock

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Affects Version: Lustre 2.10.4
    • Environment: CentOS 7.4
    • Severity: 3
    Description

      We hit the following issue today on Oak's MDT0. We added another MDT (MDT1) a few days ago, so it was still empty until today, when I started to lfs migrate a test directory from MDT0 to MDT1; that is when this happened on MDT0. lfs migrate actually worked for a while (about 40k inodes were migrated) until MDT0 did this:

      Oct 26 17:26:13 oak-md1-s2 kernel: LDISKFS-fs error (device dm-0): ldiskfs_map_blocks:594: inode #659619751: block 774843950: comm mdt00_100: lblock 0 mapped to illegal pblock (length 1)
      Oct 26 17:26:13 oak-md1-s2 kernel: Aborting journal on device dm-0-8.
      Oct 26 17:26:13 oak-md1-s2 kernel: LustreError: 3844:0:(osd_handler.c:1586:osd_trans_commit_cb()) transaction @0xffff881b024c6a80 commit error: 2
      Oct 26 17:26:13 oak-md1-s2 kernel: LDISKFS-fs (dm-0): Remounting filesystem read-only
      

      I performed an fsck with the new e2fsprogs-1.44.3.wc1-0.el7.x86_64.
      I'm not sure this fixed our issue, as I don't see any reference to inode 659619751 in the output. I'm attaching the full fsck logs.

      Fortunately, MDT0 then restarted without problems, but I haven't touched the partially migrated directory for now, to avoid further issues on a Friday night on this production system. Could lfs migrate have hit some migrated inode that somehow got corrupted on MDT0? The system has been working fine for weeks (the only remaining issue is the changelog_clear errors reported in LU-11205), so I assume this is due to my running lfs migrate. I can do more troubleshooting next week. Any recommendation to avoid this in the future is welcome, thanks much!
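
      For reference, a minimal sketch of the commands involved; the test directory path is just a placeholder, and the MDT0 device name is the one that appears elsewhere in this ticket:

      # migrate a test directory's metadata from MDT0 to MDT1 (path is a placeholder)
      lfs migrate --mdt-index 1 /oak/path/to/testdir

      # spot-check which MDT a given inode now lives on (assumes lfs getstripe --mdt-index reports it)
      lfs getstripe --mdt-index /oak/path/to/testdir/somefile

      # offline check of the MDT0 backing device (target unmounted), with e2fsprogs-1.44.3.wc1
      e2fsck -f /dev/mapper/md1-rbod1-mdt0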

      Stephane


          Activity

            [LU-11578] ldiskfs_map_blocks: comm mdt00_100: lblock 0 mapped to illegal pblock

            sthiell Stephane Thiell added a comment -

            Hey YangSheng,

            Have you been able to take a look at this? Let me know if you need anything else from me.

            Thanks!
            Stephane


            sthiell Stephane Thiell added a comment -

            Hi YangSheng,

            Sure, thanks for taking a look. I just uploaded the vmcore and matching debuginfo rpms:

            • lustre-debuginfo-2.10.4-1.el7.centos.x86_64.rpm
            • kernel-debuginfo-3.10.0-693.2.2.el7_lustre.pl1.x86_64.rpm
            • kernel-debuginfo-common-x86_64-3.10.0-693.2.2.el7_lustre.pl1.x86_64.rpm
            • and one of the vmcores as LU-11578-vmcore-oak-md1-s2-2018-11-06-11-00-23
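
            A hedged sketch of how such an upload is typically done with curl; the /uploads/LU-11578 destination directory is an assumption based on the usual convention for ftp.whamcloud.com, not something confirmed in this ticket:

            curl -T lustre-debuginfo-2.10.4-1.el7.centos.x86_64.rpm ftp://ftp.whamcloud.com/uploads/LU-11578/
            curl -T LU-11578-vmcore-oak-md1-s2-2018-11-06-11-00-23 ftp://ftp.whamcloud.com/uploads/LU-11578/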

            Thanks!
            Stephane

            ys Yang Sheng added a comment -

            Hi, Stephane,

            Sorry for the late reply. Could you please upload the vmcore to our FTP site, ftp.whamcloud.com? This crash is really strange to me. Please also upload the debuginfo rpms if you still have them. If you don't, then just to confirm what is noted in this ticket: you are running 2.10.4 with the 3.10.0-693.2.2 kernel on your MDS?

            Thanks,
            YangSheng


            sthiell Stephane Thiell added a comment -

            Hi YangSheng,

            Any ideas of what could have happened here?

            Thanks!
            Stephane


            sthiell Stephane Thiell added a comment -

            I tried to start MDT0 a second time after shutting down the client that had run the lfs migrate, but the same LBUG happened. I was able to put it back into production on the third attempt by doing an abort_recovery after most of the clients had recovered.
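
            A minimal sketch of how that abort_recovery is typically issued on the MDS (the target name is taken from the logs below; the mount point is a placeholder):

            # while oak-MDT0000 is still waiting in recovery:
            lctl --device oak-MDT0000 abort_recovery

            # alternatively, skip recovery entirely at mount time:
            # mount -t lustre -o abort_recov /dev/mapper/md1-rbod1-mdt0 /mnt/mdt0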

            [  156.180372] Lustre: oak-MDT0000: Denying connection for new client cdb722e9-f2fd-2e1e-3be0-ae3ca5c64dbb(at 10.8.15.2@o2ib6), waiting for 1417 known clients (1406 recovered, 9 in progress, and 0 evicted) to recover in 8:45
            [  156.202310] Lustre: Skipped 1 previous similar message
            [  159.651156] LustreError: 3530:0:(mdt_handler.c:6167:mdt_iocontrol()) oak-MDT0000: Aborting recovery for device
            [  159.662323] LustreError: 3530:0:(ldlm_lib.c:2606:target_stop_recovery_thread()) oak-MDT0000: Aborting recovery
            [  159.673500] Lustre: 3850:0:(ldlm_lib.c:2059:target_recovery_overseer()) recovery is aborted, evict exports in recovery
            [  159.685661] Lustre: oak-MDT0000: disconnecting 11 stale clients
            [  159.720906] Lustre: 3850:0:(ldlm_lib.c:1619:abort_req_replay_queue()) @@@ aborted:  req@ffff88100d339200 x1616374803034688/t0(279195624051) o36->ebc6458d-1c97-7178-4cec-be2598c570b5@10.9.114.12@o2ib4:319/0 lens 624/0 e 0 to 0 dl 1541532139 ref 1 fl Complete:/4/ffffffff rc 0/-1
            [  159.748315] LustreError: 3850:0:(ldlm_lib.c:1640:abort_lock_replay_queue()) @@@ aborted:  req@ffff88102435a100 x1610729933017152/t0(0) o101->10a6d338-b751-f002-23d5-4f33526aa200@10.12.4.29@o2ib:537/0 lens 328/0 e 0 to 0 dl 1541532357 ref 1 fl Complete:/40/ffffffff rc 0/-1
            [  159.789840] Lustre: oak-MDT0000: Recovery over after 0:55, of 1417 clients 1406 recovered and 11 were evicted.
            

            I have two vmcores of the above LBUG, and system logs are also available if needed.
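
            For reference, a sketch of how such a vmcore is typically opened for inspection with the crash utility; the vmcore path and the debuginfo install location are assumptions:

            # requires the matching kernel-debuginfo for 3.10.0-693.2.2.el7_lustre.pl1.x86_64
            crash /usr/lib/debug/lib/modules/3.10.0-693.2.2.el7_lustre.pl1.x86_64/vmlinux \
                  /var/crash/<dump-dir>/vmcore
            # then e.g.: bt, log, ps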

            Best,
            Stephane


            sthiell Stephane Thiell added a comment -

            The MDT crashed again when starting:

             

            Nov 06 10:59:34 oak-md1-s2 kernel: Lustre: oak-MDT0000: Recovery already passed deadline 0:29. If you do not want to wait more, please abort the recovery by force.
            Nov 06 10:59:46 oak-md1-s2 kernel: Lustre: oak-MDT0000: Denying connection for new client 8b854071-cc99-990d-3c95-75ea796b9ff2(at 10.8.13.14@o2ib6), waiting for 1417 known clients (1391 recovered, 9 in progress, and 0 evicted) to recover in 0:16
            Nov 06 10:59:46 oak-md1-s2 kernel: Lustre: Skipped 1 previous similar message
            Nov 06 11:00:03 oak-md1-s2 kernel: Lustre: oak-MDT0000: recovery is timed out, evict stale exports
            Nov 06 11:00:03 oak-md1-s2 kernel: Lustre: oak-MDT0000: disconnecting 17 stale clients
            Nov 06 11:00:06 oak-md1-s2 kernel: LustreError: 5052:0:(osd_handler.c:1982:osd_object_release()) LBUG
            Nov 06 11:00:06 oak-md1-s2 kernel: Pid: 5052, comm: tgt_recover_0
            Nov 06 11:00:06 oak-md1-s2 kernel: 
                                               Call Trace:
            Nov 06 11:00:06 oak-md1-s2 kernel:  [<ffffffffc07e47ae>] libcfs_call_trace+0x4e/0x60 [libcfs]
            
            Message from syslogd@oak-md1-s2 at Nov  6 11:00:06 ...
             kernel:LustreError: 5052:0:(osd_handler.c:1982:osd_object_release()) LBUG
            Nov 06 11:00:06 oak-md1-s2 kernel:  [<ffffffffc07e483c>] lbug_with_loc+0x4c/0xb0 [libcfs]
            Nov 06 11:00:06 oak-md1-s2 kernel:  [<ffffffffc1017610>] osd_get_ldiskfs_dirent_param+0x0/0x130 [osd_ldiskfs]
            Nov 06 11:00:06 oak-md1-s2 kernel:  [<ffffffffc099cc4d>] lu_object_put+0x17d/0x3d0 [obdclass]
            Nov 06 11:00:06 oak-md1-s2 kernel:  [<ffffffffc0bf3778>] distribute_txn_replay_handle+0x418/0xc00 [ptlrpc]
            Nov 06 11:00:06 oak-md1-s2 kernel:  [<ffffffffc0b39987>] replay_request_or_update.isra.21+0x247/0x8c0 [ptlrpc]
            Nov 06 11:00:06 oak-md1-s2 kernel:  [<ffffffff810ba588>] ? __wake_up_common+0x58/0x90
            Nov 06 11:00:06 oak-md1-s2 kernel:  [<ffffffffc0b3a616>] target_recovery_thread+0x616/0x1360 [ptlrpc]
            Nov 06 11:00:06 oak-md1-s2 kernel:  [<ffffffff816a8fad>] ? __schedule+0x39d/0x8b0
            Nov 06 11:00:06 oak-md1-s2 kernel:  [<ffffffffc0b3a000>] ? target_recovery_thread+0x0/0x1360 [ptlrpc]
            Nov 06 11:00:06 oak-md1-s2 kernel:  [<ffffffff810b098f>] kthread+0xcf/0xe0
            Nov 06 11:00:06 oak-md1-s2 kernel:  [<ffffffff810b08c0>] ? kthread+0x0/0xe0
            Nov 06 11:00:06 oak-md1-s2 kernel:  [<ffffffff816b4f58>] ret_from_fork+0x58/0x90
            Nov 06 11:00:06 oak-md1-s2 kernel:  [<ffffffff810b08c0>] ? kthread+0x0/0xe0
            Nov 06 11:00:06 oak-md1-s2 kernel: 
            Nov 06 11:00:06 oak-md1-s2 kernel: Kernel panic - not syncing: LBUG
            

            sthiell Stephane Thiell added a comment -

            Attached the results of the fsck on MDT0000 as oak-MDT0000-fsck-2018-11-06.log.


            sthiell Stephane Thiell added a comment -

            OK, this is not good. I tried to launch lfs migrate again:

            lfs migrate --mdt-index 1 ./sthiell
            

            Immediately the MDT0 crashed with:

            [5516611.089815] LustreError: 15595:0:(osd_handler.c:1982:osd_object_release()) LBUG
            [5516611.114441] Pid: 15595, comm: mdt00_105
            [5516611.127614] 
            Call Trace:
            [5516611.141658]  [<ffffffffc08587ae>] libcfs_call_trace+0x4e/0x60 [libcfs]
            [5516611.163987]  [<ffffffffc085883c>] lbug_with_loc+0x4c/0xb0 [libcfs]
            [5516611.185171]  [<ffffffffc1019610>] osd_get_ldiskfs_dirent_param+0x0/0x130 [osd_ldiskfs]
            [5516611.212390]  [<ffffffffc099ec4d>] lu_object_put+0x17d/0x3d0 [obdclass]
            [5516611.235012]  [<ffffffffc116d233>] mdt_reint_migrate_internal.isra.38+0x793/0x16e0 [mdt]
            [5516611.261959]  [<ffffffffc0b44851>] ? ldlm_cli_enqueue_local+0x271/0x860 [ptlrpc]
            [5516611.287397]  [<ffffffffc116e3e5>] mdt_reint_rename_or_migrate.isra.39+0x265/0x860 [mdt]
            [5516611.315164]  [<ffffffffc0b47ee0>] ? ldlm_blocking_ast+0x0/0x170 [ptlrpc]
            [5516611.337804]  [<ffffffffc0b42340>] ? ldlm_completion_ast+0x0/0x920 [ptlrpc]
            [5516611.361002]  [<ffffffffc116e9f0>] mdt_reint_migrate+0x10/0x20 [mdt]
            [5516611.382451]  [<ffffffffc1172a40>] mdt_reint_rec+0x80/0x210 [mdt]
            [5516611.402775]  [<ffffffffc115431b>] mdt_reint_internal+0x5fb/0x9c0 [mdt]
            [5516611.425379]  [<ffffffffc115fe47>] mdt_reint+0x67/0x140 [mdt]
            [5516611.444583]  [<ffffffffc0bdad95>] tgt_request_handle+0x925/0x1370 [ptlrpc]
            [5516611.467794]  [<ffffffffc0b83bf6>] ptlrpc_server_handle_request+0x236/0xa90 [ptlrpc]
            [5516611.494998]  [<ffffffffc0b80228>] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc]
            [5516611.517896]  [<ffffffff810c4822>] ? default_wake_function+0x12/0x20
            [5516611.539085]  [<ffffffff810ba588>] ? __wake_up_common+0x58/0x90
            [5516611.558854]  [<ffffffffc0b87332>] ptlrpc_main+0xa92/0x1e40 [ptlrpc]
            [5516611.580053]  [<ffffffffc0b868a0>] ? ptlrpc_main+0x0/0x1e40 [ptlrpc]
            [5516611.602365]  [<ffffffff810b098f>] kthread+0xcf/0xe0
            [5516611.618967]  [<ffffffff810b08c0>] ? kthread+0x0/0xe0
            [5516611.635856]  [<ffffffff816b4f58>] ret_from_fork+0x58/0x90
            [5516611.654178]  [<ffffffff810b08c0>] ? kthread+0x0/0xe0
            [5516611.671066] 
            [5516611.676514] Kernel panic - not syncing: LBUG
            [5516611.691113] CPU: 2 PID: 15595 Comm: mdt00_105 Tainted: G           OE  ------------   3.10.0-693.2.2.el7_lustre.pl1.x86_64 #1
            [5516611.728879] Hardware name: Dell Inc. PowerEdge R630/0CNCJW, BIOS 2.6.0 10/26/2017
            [5516611.754060]  ffff880e12696e00 00000000e8661d26 ffff8816667b3a08 ffffffff816a3db1
            [5516611.778993]  ffff8816667b3a88 ffffffff8169dc74 ffffffff00000008 ffff8816667b3a98
            [5516611.803930]  ffff8816667b3a38 00000000e8661d26 00000000e8661d26 ffff88103ce4f8b8
            [5516611.828860] Call Trace:
            [5516611.837456]  [<ffffffff816a3db1>] dump_stack+0x19/0x1b
            [5516611.854920]  [<ffffffff8169dc74>] panic+0xe8/0x20d
            [5516611.871240]  [<ffffffffc0858854>] lbug_with_loc+0x64/0xb0 [libcfs]
            [5516611.892136]  [<ffffffffc1019610>] osd_object_release+0x60/0x60 [osd_ldiskfs]
            [5516611.915909]  [<ffffffffc099ec4d>] lu_object_put+0x17d/0x3d0 [obdclass]
            [5516611.942513]  [<ffffffffc116d233>] mdt_reint_migrate_internal.isra.38+0x793/0x16e0 [mdt]
            [5516611.972007]  [<ffffffffc0b44851>] ? ldlm_cli_enqueue_local+0x271/0x860 [ptlrpc]
            [5516611.996623]  [<ffffffffc116e3e5>] mdt_reint_rename_or_migrate.isra.39+0x265/0x860 [mdt]
            [5516612.024113]  [<ffffffffc0b47ee0>] ? ldlm_blocking_ast_nocheck+0x310/0x310 [ptlrpc]
            [5516612.049600]  [<ffffffffc0b42340>] ? ldlm_expired_completion_wait+0x240/0x240 [ptlrpc]
            [5516612.075932]  [<ffffffffc116e9f0>] mdt_reint_migrate+0x10/0x20 [mdt]
            [5516612.097115]  [<ffffffffc1172a40>] mdt_reint_rec+0x80/0x210 [mdt]
            [5516612.118587]  [<ffffffffc115431b>] mdt_reint_internal+0x5fb/0x9c0 [mdt]
            [5516612.140628]  [<ffffffffc115fe47>] mdt_reint+0x67/0x140 [mdt]
            [5516612.160410]  [<ffffffffc0bdad95>] tgt_request_handle+0x925/0x1370 [ptlrpc]
            [5516612.183611]  [<ffffffffc0b83bf6>] ptlrpc_server_handle_request+0x236/0xa90 [ptlrpc]
            [5516612.209387]  [<ffffffffc0b80228>] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc]
            [5516612.232549]  [<ffffffff810c4822>] ? default_wake_function+0x12/0x20
            [5516612.253726]  [<ffffffff810ba588>] ? __wake_up_common+0x58/0x90
            [5516612.273494]  [<ffffffffc0b87332>] ptlrpc_main+0xa92/0x1e40 [ptlrpc]
            [5516612.294971]  [<ffffffffc0b868a0>] ? ptlrpc_register_service+0xe30/0xe30 [ptlrpc]
            [5516612.322155]  [<ffffffff810b098f>] kthread+0xcf/0xe0
            [5516612.338755]  [<ffffffff810b08c0>] ? insert_kthread_work+0x40/0x40
            [5516612.360224]  [<ffffffff816b4f58>] ret_from_fork+0x58/0x90
            [5516612.378539]  [<ffffffff810b08c0>] ? insert_kthread_work+0x40/0x40
            

            After that, when trying to remount MDT0, I got:

            Nov 06 09:56:39 oak-md1-s2 kernel: Lustre: Lustre: Build Version: 2.10.4
            Nov 06 09:56:39 oak-md1-s2 kernel: LNet: Using FMR for registration
            Nov 06 09:56:39 oak-md1-s2 kernel: LNet: Added LNI 10.0.2.52@o2ib5 [8/256/0/180]
            Nov 06 09:56:41 oak-md1-s2 kernel: LDISKFS-fs warning (device dm-1): ldiskfs_multi_mount_protect:322: MMP interval 42 higher than expected, please wait.
            Nov 06 09:57:33 oak-md1-s2 kernel: LDISKFS-fs (dm-1): recovery complete
            Nov 06 09:57:33 oak-md1-s2 kernel: LDISKFS-fs (dm-1): mounted filesystem with ordered data mode. Opts: user_xattr,errors=remount-ro,acl,no_mbcache,nodelalloc
            Nov 06 09:57:36 oak-md1-s2 kernel: LDISKFS-fs error (device dm-1): ldiskfs_lookup:1809: inode #11: comm mount.lustre: deleted inode referenced: 460450747
            Nov 06 09:57:36 oak-md1-s2 kernel: Aborting journal on device dm-1-8.
            Nov 06 09:57:36 oak-md1-s2 kernel: LDISKFS-fs (dm-1): Remounting filesystem read-only
            Nov 06 09:57:36 oak-md1-s2 kernel: LustreError: 3564:0:(osd_scrub.c:1962:osd_ios_lookup_one_len()) Fail to find #460450747 in lost+found (11/0): rc = -5
            Nov 06 09:57:36 oak-md1-s2 kernel: LDISKFS-fs error (device dm-1): ldiskfs_lookup:1809: inode #11: comm mount.lustre: deleted inode referenced: 460450866
            Nov 06 09:57:36 oak-md1-s2 kernel: LDISKFS-fs error (device dm-1): ldiskfs_lookup:1809: inode #11: comm mount.lustre: deleted inode referenced: 460450930
            Nov 06 09:57:37 oak-md1-s2 kernel: LDISKFS-fs error (device dm-1): ldiskfs_lookup:1809: inode #11: comm mount.lustre: deleted inode referenced: 460450942
            Nov 06 09:57:37 oak-md1-s2 kernel: LDISKFS-fs error (device dm-1): ldiskfs_lookup:1809: inode #11: comm mount.lustre: deleted inode referenced: 460450954
            Nov 06 09:57:37 oak-md1-s2 kernel: LDISKFS-fs error (device dm-1): ldiskfs_lookup:1809: inode #11: comm mount.lustre: deleted inode referenced: 460450963
            Nov 06 09:57:37 oak-md1-s2 kernel: LDISKFS-fs error (device dm-1): ldiskfs_lookup:1809: inode #11: comm mount.lustre: deleted inode referenced: 460450965
            Nov 06 09:57:37 oak-md1-s2 kernel: LDISKFS-fs error (device dm-1): ldiskfs_lookup:1809: inode #11: comm mount.lustre: deleted inode referenced: 460450970
            Nov 06 09:57:37 oak-md1-s2 kernel: LDISKFS-fs error (device dm-1): ldiskfs_lookup:1809: inode #11: comm mount.lustre: deleted inode referenced: 460451000
            Nov 06 09:57:37 oak-md1-s2 kernel: LDISKFS-fs error (device dm-1): ldiskfs_lookup:1809: inode #11: comm mount.lustre: deleted inode referenced: 460451002
            Nov 06 09:57:37 oak-md1-s2 kernel: LustreError: 137-5: oak-MDT0000_UUID: not available for connect from 10.9.107.16@o2ib4 (no target). If you are running an HA pair check that the target is mounted on the other server.
            Nov 06 09:57:37 oak-md1-s2 kernel: LustreError: Skipped 1 previous similar message
            Nov 06 09:57:37 oak-md1-s2 kernel: LustreError: 3564:0:(obd_mount_server.c:304:server_mgc_set_fs()) can't set_fs -30
            Nov 06 09:57:37 oak-md1-s2 kernel: LustreError: 3564:0:(obd_mount_server.c:1879:server_fill_super()) Unable to start targets: -30
            Nov 06 09:57:37 oak-md1-s2 kernel: LustreError: 3564:0:(obd_mount_server.c:1589:server_put_super()) no obd oak-MDT0000
            Nov 06 09:57:37 oak-md1-s2 kernel: LustreError: 3564:0:(obd_mount_server.c:135:server_deregister_mount()) oak-MDT0000 not registered
            Nov 06 09:57:37 oak-md1-s2 kernel: Lustre: server umount oak-MDT0000 complete
            Nov 06 09:57:37 oak-md1-s2 kernel: LustreError: 3564:0:(obd_mount.c:1582:lustre_fill_super()) Unable to mount  (-30)
            

            Oak is down. I'm running fsck on it at the moment.
            Thanks,
            Stephane


            sthiell Stephane Thiell added a comment -

            Hi YangSheng,

            Thanks for the explanation, this is helpful and makes sense. It would also explain why fsck didn't find anything interesting. I'll try to check what exactly has been migrated and possibly launch lfs migrate again today to see how it goes. I'll report back any useful info.
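
            A minimal sketch of one way to check which files have already moved, assuming lfs getstripe --mdt-index reports the MDT index of each inode (the directory path is a placeholder):

            # print "<mdt-index>  <path>" for every file in the test directory
            find /oak/path/to/testdir -type f | while read -r f; do
                printf '%s\t%s\n' "$(lfs getstripe --mdt-index "$f" 2>/dev/null)" "$f"
            done | sort -n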

            Best,
            Stephane

            ys Yang Sheng added a comment -

            Hi, Stephane,

            I don't think this inode could have been moved elsewhere, since the failing ldiskfs_map_blocks was issued by a read syscall, and MDT0 was remounted read-only immediately after the failure. From the failure location we can see that the block number (774843950) comes from the extent status tree, which is a cache managed internally by ldiskfs. That means this block number must have been valid earlier (otherwise it could not have been cached) and became invalid when it was read again. So I think this issue is related to memory corruption rather than to the migration.
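
            To cross-check the cached mapping against what is actually on disk, one could dump the inode's extent tree with debugfs on the (read-only) device; a sketch, using the device name that appears elsewhere in this ticket and assuming debugfs's ex/dump_extents command:

            debugfs -R "ex <659619751>" /dev/mapper/md1-rbod1-mdt0
            debugfs -R "stat <659619751>" /dev/mapper/md1-rbod1-mdt0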

            Thanks,
            YangSheng


            sthiell Stephane Thiell added a comment -

            Hi YangSheng,

            Thanks! Looks like the inode can't be found anymore on MDT 0:

            [root@oak-md1-s2 ~]# debugfs /dev/mapper/md1-rbod1-mdt0
            debugfs 1.44.3.wc1 (23-July-2018)
            debugfs:  ncheck 659619751
            Inode	Pathname
            debugfs:  
            

            The command took at least two hours to run, so I guess it worked. Perhaps the file was migrated to MDT1 after all. But how to know... the inode number is probably different on MDT1 after migration. Anyway, what do you think?
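
            One way to answer that would be to resolve the path to its FID and then check which MDT now holds the inode; a sketch from a client, with placeholder paths (the FID shown is made up):

            lfs getstripe --mdt-index /oak/path/to/testdir/somefile    # prints the MDT index
            lfs path2fid /oak/path/to/testdir/somefile
            lfs fid2path /oak [0x200000401:0x1:0x0]                    # example FID, not real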

            Thanks again,
            Stephane


            People

              Assignee: Yang Sheng (ys)
              Reporter: Stephane Thiell (sthiell)
              Votes: 0
              Watchers: 5
