[LU-12295] MDS Panic on DNE2 directory removing Created: 13/May/19 Updated: 19/Dec/20 Resolved: 12/Sep/20 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.10.5 |
| Fix Version/s: | Lustre 2.14.0 |
| Type: | Bug | Priority: | Minor |
| Reporter: | Tatsushi Takamura | Assignee: | Lai Siyao |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Issue Links: |
|
||||
| Epic/Theme: | dne | ||||
| Severity: | 3 | ||||
| Rank (Obsolete): | 9223372036854775807 | ||||
| Description |
|
The MDS panics when handling of a remote object fails. Steps to reproduce are as follows:

1) Create/delete files and directories under a striped directory:

[client]# lfs mkdir -c 2 -i 0 /mnt/lustre/dir
[client]# lfs mkdir -c 2 -i 0 -D /mnt/lustre/dir
[client]# while :; do rm -rf /mnt/lustre/dir/*; ./mdtest -v -n 1000 -p 1 -i 3 -d /mnt/lustre/dir; done

2) Simulate an ENOSPC error during remote object handling (that is, in the out_tx_write_exec() function):

[MDS1]# while :; do sysctl lnet.fail_loc=0x1704 ; sleep 3; sysctl lnet.fail_loc=0; sleep 5; done

MDS console and dump:

Message from syslogd@rx200-076 at May 10 20:08:27 ...
kernel:LustreError: 20269:0:(osd_handler.c:3229:osd_destroy()) ASSERTION( osd_inode_unlinked(inode) || inode->i_nlink == 1 || inode->i_nlink == 2 ) failed:
Message from syslogd@rx200-076 at May 10 20:08:27 ...
kernel:LustreError: 20269:0:(osd_handler.c:3229:osd_destroy()) LBUG
[9798957.173503] Call Trace:
[9798957.190509] [<ffffffffb3b0d78e>] dump_stack+0x19/0x1b
[9798957.223630] [<ffffffffb3b07a90>] panic+0xe8/0x21f
[9798957.254673] [<ffffffffc0ad18cb>] lbug_with_loc+0x9b/0xa0 [libcfs]
[9798957.294020] [<ffffffffc1133dd0>] osd_destroy+0x710/0x750 [osd_ldiskfs]
[9798957.335950] [<ffffffffc1132bcd>] ? osd_ref_del+0x1ad/0x6a0 [osd_ldiskfs]
[9798957.378897] [<ffffffffc1132141>] ? osd_attr_set+0x201/0xae0 [osd_ldiskfs]
[9798957.422331] [<ffffffffb3b120d2>] ? down_write+0x12/0x3d
[9798957.456457] [<ffffffffc0f6c851>] out_obj_destroy+0x101/0x2c0 [ptlrpc]
[9798957.497826] [<ffffffffc0f6cac0>] out_tx_destroy_exec+0x20/0x190 [ptlrpc]
[9798957.540746] [<ffffffffc0f67591>] out_tx_end+0xe1/0x5c0 [ptlrpc]
[9798957.578950] [<ffffffffc0f6b6d3>] out_handle+0x1453/0x1bc0 [ptlrpc]
[9798957.618701] [<ffffffffc0efbf72>] ? lustre_msg_get_opc+0x22/0xf0 [ptlrpc]
[9798957.661558] [<ffffffffc0f5fc69>] ? tgt_request_preprocess.isra.26+0x299/0x790 [ptlrpc]
[9798957.711684] [<ffffffffc0f6138a>] tgt_request_handle+0x92a/0x1370 [ptlrpc]
[9798957.755032] [<ffffffffc0f09e4b>] ptlrpc_server_handle_request+0x23b/0xaa0 [ptlrpc]
[9798957.803047] [<ffffffffc0f06478>] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc]
[9798957.845811] [<ffffffffb34cee92>] ? default_wake_function+0x12/0x20
[9798957.885436] [<ffffffffb34c4abb>] ? __wake_up_common+0x5b/0x90
[9798957.922487] [<ffffffffc0f0d592>] ptlrpc_main+0xa92/0x1e40 [ptlrpc]
[9798957.962103] [<ffffffffc0f0cb00>] ? ptlrpc_register_service+0xe30/0xe30 [ptlrpc]
[9798958.008436] [<ffffffffb34bae31>] kthread+0xd1/0xe0
[9798958.039672] [<ffffffffb34bad60>] ? insert_kthread_work+0x40/0x40
[9798958.078163] [<ffffffffb3b1f5f7>] ret_from_fork_nospec_begin+0x21/0x21
[9798958.119234] [<ffffffffb34bad60>] ? insert_kthread_work+0x40/0x40

Could you please look into this one? |
| Comments |
| Comment by Oleg Drokin [ 14/May/19 ] |
|
hm, it looks like I hit a very similar failure in master-next two days ago and yesterday:

[ 5930.469393] LustreError: 9370:0:(osd_handler.c:3573:osd_destroy()) ASSERTION( osd_inode_unlinked(inode) || inode->i_nlink == 1 || inode->i_nlink == 2 ) failed:
[ 5930.502768] LustreError: 9370:0:(osd_handler.c:3573:osd_destroy()) LBUG
[ 5930.505164] Pid: 9370, comm: mdt_rdpg07_003 3.10.0-7.6-debug #1 SMP Wed Nov 7 21:55:08 EST 2018
[ 5930.509233] Call Trace:
[ 5930.511319] [<ffffffffa02b27dc>] libcfs_call_trace+0x8c/0xc0 [libcfs]
[ 5930.514891] [<ffffffffa02b288c>] lbug_with_loc+0x4c/0xa0 [libcfs]
[ 5930.522770] [<ffffffffa0c4eeb3>] osd_destroy+0x713/0x750 [osd_ldiskfs]
[ 5930.527762] [<ffffffffa0e8f83b>] lod_sub_destroy+0x1bb/0x450 [lod]
[ 5930.531206] [<ffffffffa0e777a0>] lod_destroy+0x140/0x820 [lod]
[ 5930.546681] [<ffffffffa0d39e26>] mdd_close+0x846/0xf30 [mdd]
[ 5930.549991] [<ffffffffa0db7aab>] mdt_mfd_close+0x3fb/0x850 [mdt]
[ 5930.555677] [<ffffffffa0dbd401>] mdt_close_internal+0xb1/0x220 [mdt]
[ 5930.560137] [<ffffffffa0dbd790>] mdt_close+0x220/0x740 [mdt]
[ 5930.564650] [<ffffffffa072eb05>] tgt_request_handle+0x915/0x15c0 [ptlrpc]
[ 5930.567750] [<ffffffffa06d12b9>] ptlrpc_server_handle_request+0x259/0xad0 [ptlrpc]
[ 5930.584402] [<ffffffffa06d52bc>] ptlrpc_main+0xb6c/0x20b0 [ptlrpc]
[ 5930.585599] [<ffffffff810b4ed4>] kthread+0xe4/0xf0
[ 5930.587608] [<ffffffff817c4c5d>] ret_from_fork_nospec_begin+0x7/0x21
[ 5930.588809] [<ffffffffffffffff>] 0xffffffffffffffff
[ 5930.589680] Kernel panic - not syncing: LBUG

and

[13720.662563] LustreError: 14705:0:(osd_handler.c:3573:osd_destroy()) ASSERTION( osd_inode_unlinked(inode) || inode->i_nlink == 1 || inode->i_nlink == 2 ) failed:
[13720.683253] LustreError: 14705:0:(osd_handler.c:3573:osd_destroy()) LBUG
[13720.684186] Pid: 14705, comm: mdt04_003 3.10.0-7.6-debug #1 SMP Wed Nov 7 21:55:08 EST 2018
[13720.685838] Call Trace:
[13720.686625] [<ffffffffa02cb7dc>] libcfs_call_trace+0x8c/0xc0 [libcfs]
[13720.688731] [<ffffffffa02cb88c>] lbug_with_loc+0x4c/0xa0 [libcfs]
[13720.690977] [<ffffffffa0c2aeb3>] osd_destroy+0x713/0x750 [osd_ldiskfs]
[13720.701737] [<ffffffffa0e6b83b>] lod_sub_destroy+0x1bb/0x450 [lod]
[13720.707438] [<ffffffffa0e537a0>] lod_destroy+0x140/0x820 [lod]
[13720.712593] [<ffffffffa0d0aa63>] mdd_finish_unlink+0x123/0x410 [mdd]
[13720.714811] [<ffffffffa0d0cce4>] mdd_unlink+0x9c4/0xad0 [mdd]
[13720.719251] [<ffffffffa0dc177f>] mdo_unlink+0x43/0x45 [mdt]
[13720.721165] [<ffffffffa0d83c15>] mdt_reint_unlink+0xb25/0x13e0 [mdt]
[13720.728197] [<ffffffffa0d8a7c0>] mdt_reint_rec+0x80/0x210 [mdt]
[13720.734164] [<ffffffffa0d66a40>] mdt_reint_internal+0x780/0xb50 [mdt]
[13720.736305] [<ffffffffa0d71aa7>] mdt_reint+0x67/0x140 [mdt]
[13720.744742] [<ffffffffa0727b05>] tgt_request_handle+0x915/0x15c0 [ptlrpc]
[13720.758897] [<ffffffffa06ca2b9>] ptlrpc_server_handle_request+0x259/0xad0 [ptlrpc]
[13720.798963] [<ffffffffa06ce2bc>] ptlrpc_main+0xb6c/0x20b0 [ptlrpc]
[13720.801378] [<ffffffff810b4ed4>] kthread+0xe4/0xf0
[13720.822348] [<ffffffff817c4c5d>] ret_from_fork_nospec_begin+0x7/0x21
[13720.824379] [<ffffffffffffffff>] 0xffffffffffffffff
[13720.826530] Kernel panic - not syncing: LBUG

I have crashdumps too. |
| Comment by Olaf Faaland [ 25/Feb/20 ] |
|
I don't recall seeing this specific bug at LLNL, but we've seen a variety of failures when MDTs run out of space. It would be nice to fix these cases so that users can recover on their own by deleting files/directories, and so that readdir/stat/open/close succeed while the housecleaning is being done. |
| Comment by Gerrit Updater [ 26/Aug/20 ] |
|
Lai Siyao (lai.siyao@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/39734 |
| Comment by Gerrit Updater [ 12/Sep/20 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/39734/ |
| Comment by Peter Jones [ 12/Sep/20 ] |
|
Landed for 2.14 |