[LU-12295] MDS Panic on DNE2 directory removal Created: 13/May/19  Updated: 19/Dec/20  Resolved: 12/Sep/20

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.10.5
Fix Version/s: Lustre 2.14.0

Type: Bug Priority: Minor
Reporter: Tatsushi Takamura Assignee: Lai Siyao
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Related
Epic/Theme: dne
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

The MDS panics when remote object handling fails.

Steps to reproduce are as follows (a combined reproducer sketch follows step 2):

1) create/delete files and directories under a striped directory
[client]# lfs mkdir -c 2 -i 0 /mnt/lustre/dir
[client]# lfs mkdir -c 2 -i 0 -D /mnt/lustre/dir
[client]# while :; do rm -rf /mnt/lustre/dir/*;  ./mdtest -v -n 1000 -p 1 -i 3 -d /mnt/lustre/dir; done

2) simulate an ENOSPC error during remote object handling (i.e. in the out_tx_write_exec() function) on MDS1
[MDS1]# while :; do sysctl lnet.fail_loc=0x1704 ; sleep 3; sysctl lnet.fail_loc=0; sleep 5; done
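
For reference, the two steps above can be combined into one script run from the client. This is only a sketch, assuming passwordless ssh from the client to MDS1 (called "mds1" below) and an mdtest binary in the working directory:

#!/bin/bash
# Sketch of a combined reproducer for LU-12295.  The hostname "mds1" and
# the ./mdtest path are assumptions; the individual commands are the ones
# from steps 1) and 2) above.
MDS=mds1                    # MDS1 hostname (placeholder)
DIR=/mnt/lustre/dir         # directory striped across two MDTs

lfs mkdir -c 2 -i 0 "$DIR"
lfs mkdir -c 2 -i 0 -D "$DIR"

# step 2: periodically simulate ENOSPC in out_tx_write_exec() on MDS1
# (kill this ssh and reset lnet.fail_loc by hand once the MDS has crashed)
ssh "$MDS" 'while :; do sysctl lnet.fail_loc=0x1704; sleep 3;
                        sysctl lnet.fail_loc=0;      sleep 5; done' &

# step 1: create/delete load on the striped directory until the MDS LBUGs
while :; do
    rm -rf "$DIR"/*
    ./mdtest -v -n 1000 -p 1 -i 3 -d "$DIR"
done

Left running, this load plus fault injection produces the osd_destroy() assertion shown in the console output below.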

MDS console and dump:

Message from syslogd@rx200-076 at May 10 20:08:27 ...
 kernel:LustreError: 20269:0:(osd_handler.c:3229:osd_destroy()) ASSERTION( osd_inode_unlinked(inode) || inode->i_nlink == 1 || inode->i_nlink == 2 ) failed:

Message from syslogd@rx200-076 at May 10 20:08:27 ...
 kernel:LustreError: 20269:0:(osd_handler.c:3229:osd_destroy()) LBUG

 [9798957.173503] Call Trace:
[9798957.190509]  [<ffffffffb3b0d78e>] dump_stack+0x19/0x1b
[9798957.223630]  [<ffffffffb3b07a90>] panic+0xe8/0x21f
[9798957.254673]  [<ffffffffc0ad18cb>] lbug_with_loc+0x9b/0xa0 [libcfs]
[9798957.294020]  [<ffffffffc1133dd0>] osd_destroy+0x710/0x750 [osd_ldiskfs]
[9798957.335950]  [<ffffffffc1132bcd>] ? osd_ref_del+0x1ad/0x6a0 [osd_ldiskfs]
[9798957.378897]  [<ffffffffc1132141>] ? osd_attr_set+0x201/0xae0 [osd_ldiskfs]
[9798957.422331]  [<ffffffffb3b120d2>] ? down_write+0x12/0x3d
[9798957.456457]  [<ffffffffc0f6c851>] out_obj_destroy+0x101/0x2c0 [ptlrpc]
[9798957.497826]  [<ffffffffc0f6cac0>] out_tx_destroy_exec+0x20/0x190 [ptlrpc]
[9798957.540746]  [<ffffffffc0f67591>] out_tx_end+0xe1/0x5c0 [ptlrpc]
[9798957.578950]  [<ffffffffc0f6b6d3>] out_handle+0x1453/0x1bc0 [ptlrpc]
[9798957.618701]  [<ffffffffc0efbf72>] ? lustre_msg_get_opc+0x22/0xf0 [ptlrpc]
[9798957.661558]  [<ffffffffc0f5fc69>] ? tgt_request_preprocess.isra.26+0x299/0x790 [ptlrpc]
[9798957.711684]  [<ffffffffc0f6138a>] tgt_request_handle+0x92a/0x1370 [ptlrpc]
[9798957.755032]  [<ffffffffc0f09e4b>] ptlrpc_server_handle_request+0x23b/0xaa0 [ptlrpc]
[9798957.803047]  [<ffffffffc0f06478>] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc]
[9798957.845811]  [<ffffffffb34cee92>] ? default_wake_function+0x12/0x20
[9798957.885436]  [<ffffffffb34c4abb>] ? __wake_up_common+0x5b/0x90
[9798957.922487]  [<ffffffffc0f0d592>] ptlrpc_main+0xa92/0x1e40 [ptlrpc]
[9798957.962103]  [<ffffffffc0f0cb00>] ? ptlrpc_register_service+0xe30/0xe30 [ptlrpc]
[9798958.008436]  [<ffffffffb34bae31>] kthread+0xd1/0xe0
[9798958.039672]  [<ffffffffb34bad60>] ? insert_kthread_work+0x40/0x40
[9798958.078163]  [<ffffffffb3b1f5f7>] ret_from_fork_nospec_begin+0x21/0x21
[9798958.119234]  [<ffffffffb34bad60>] ? insert_kthread_work+0x40/0x40

Could you please look into this one?



 Comments   
Comment by Oleg Drokin [ 14/May/19 ]

hm, it looks like I hit a very similar failure in master-next two days ago and yesterday:

[ 5930.469393] LustreError: 9370:0:(osd_handler.c:3573:osd_destroy()) ASSERTION( osd_inode_unlinked(inode) || inode->i_nlink == 1 || inode->i_nlink == 2 ) failed: 
[ 5930.502768] LustreError: 9370:0:(osd_handler.c:3573:osd_destroy()) LBUG
[ 5930.505164] Pid: 9370, comm: mdt_rdpg07_003 3.10.0-7.6-debug #1 SMP Wed Nov 7 21:55:08 EST 2018
[ 5930.509233] Call Trace:
[ 5930.511319]  [<ffffffffa02b27dc>] libcfs_call_trace+0x8c/0xc0 [libcfs]
[ 5930.514891]  [<ffffffffa02b288c>] lbug_with_loc+0x4c/0xa0 [libcfs]
[ 5930.522770]  [<ffffffffa0c4eeb3>] osd_destroy+0x713/0x750 [osd_ldiskfs]
[ 5930.527762]  [<ffffffffa0e8f83b>] lod_sub_destroy+0x1bb/0x450 [lod]
[ 5930.531206]  [<ffffffffa0e777a0>] lod_destroy+0x140/0x820 [lod]
[ 5930.546681]  [<ffffffffa0d39e26>] mdd_close+0x846/0xf30 [mdd]
[ 5930.549991]  [<ffffffffa0db7aab>] mdt_mfd_close+0x3fb/0x850 [mdt]
[ 5930.555677]  [<ffffffffa0dbd401>] mdt_close_internal+0xb1/0x220 [mdt]
[ 5930.560137]  [<ffffffffa0dbd790>] mdt_close+0x220/0x740 [mdt]
[ 5930.564650]  [<ffffffffa072eb05>] tgt_request_handle+0x915/0x15c0 [ptlrpc]
[ 5930.567750]  [<ffffffffa06d12b9>] ptlrpc_server_handle_request+0x259/0xad0 [ptlrpc]
[ 5930.584402]  [<ffffffffa06d52bc>] ptlrpc_main+0xb6c/0x20b0 [ptlrpc]
[ 5930.585599]  [<ffffffff810b4ed4>] kthread+0xe4/0xf0
[ 5930.587608]  [<ffffffff817c4c5d>] ret_from_fork_nospec_begin+0x7/0x21
[ 5930.588809]  [<ffffffffffffffff>] 0xffffffffffffffff
[ 5930.589680] Kernel panic - not syncing: LBUG

and

[13720.662563] LustreError: 14705:0:(osd_handler.c:3573:osd_destroy()) ASSERTION( osd_inode_unlinked(inode) || inode->i_nlink == 1 || inode->i_nlink == 2 ) failed: 
[13720.683253] LustreError: 14705:0:(osd_handler.c:3573:osd_destroy()) LBUG
[13720.684186] Pid: 14705, comm: mdt04_003 3.10.0-7.6-debug #1 SMP Wed Nov 7 21:55:08 EST 2018
[13720.685838] Call Trace:
[13720.686625]  [<ffffffffa02cb7dc>] libcfs_call_trace+0x8c/0xc0 [libcfs]
[13720.688731]  [<ffffffffa02cb88c>] lbug_with_loc+0x4c/0xa0 [libcfs]
[13720.690977]  [<ffffffffa0c2aeb3>] osd_destroy+0x713/0x750 [osd_ldiskfs]
[13720.701737]  [<ffffffffa0e6b83b>] lod_sub_destroy+0x1bb/0x450 [lod]
[13720.707438]  [<ffffffffa0e537a0>] lod_destroy+0x140/0x820 [lod]
[13720.712593]  [<ffffffffa0d0aa63>] mdd_finish_unlink+0x123/0x410 [mdd]
[13720.714811]  [<ffffffffa0d0cce4>] mdd_unlink+0x9c4/0xad0 [mdd]
[13720.719251]  [<ffffffffa0dc177f>] mdo_unlink+0x43/0x45 [mdt]
[13720.721165]  [<ffffffffa0d83c15>] mdt_reint_unlink+0xb25/0x13e0 [mdt]
[13720.728197]  [<ffffffffa0d8a7c0>] mdt_reint_rec+0x80/0x210 [mdt]
[13720.734164]  [<ffffffffa0d66a40>] mdt_reint_internal+0x780/0xb50 [mdt]
[13720.736305]  [<ffffffffa0d71aa7>] mdt_reint+0x67/0x140 [mdt]
[13720.744742]  [<ffffffffa0727b05>] tgt_request_handle+0x915/0x15c0 [ptlrpc]
[13720.758897]  [<ffffffffa06ca2b9>] ptlrpc_server_handle_request+0x259/0xad0 [ptlrpc]
[13720.798963]  [<ffffffffa06ce2bc>] ptlrpc_main+0xb6c/0x20b0 [ptlrpc]
[13720.801378]  [<ffffffff810b4ed4>] kthread+0xe4/0xf0
[13720.822348]  [<ffffffff817c4c5d>] ret_from_fork_nospec_begin+0x7/0x21
[13720.824379]  [<ffffffffffffffff>] 0xffffffffffffffff
[13720.826530] Kernel panic - not syncing: LBUG

I have crashdumps too.

Comment by Olaf Faaland [ 25/Feb/20 ]

I don't recall seeing this specific bug at LLNL, but we've seen a variety of failures when MDTs run out of space. It would be nice to fix them so that users can recover on their own by deleting files/directories, and so that readdir/stat/open/close succeed while the housecleaning is being done.
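
As a practical aside, the first step in that kind of manual recovery is usually to find out which MDT is out of space or inodes before deleting anything. A minimal sketch from a client, assuming the usual /mnt/lustre mount point and a placeholder directory path:

[client]# lfs df -h /mnt/lustre                  # per-MDT/OST block usage
[client]# lfs df -i /mnt/lustre                  # per-MDT/OST inode usage
[client]# lfs getdirstripe /mnt/lustre/some/dir  # which MDTs back a striped directory

Entries that live on the full MDT can then be removed to release inodes and space there.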

Comment by Gerrit Updater [ 26/Aug/20 ]

Lai Siyao (lai.siyao@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/39734
Subject: LU-12295 osd-ldiskfs: don't LBUG() if dir nlink is wrong
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 1f563d379c6415b93fbc50d5613e532ebd6a9d34

Comment by Gerrit Updater [ 12/Sep/20 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/39734/
Subject: LU-12295 mdd: don't LBUG() if dir nlink is wrong
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: afa39b3cceabccd19e7c412ff90667e95cbfe3e8

Comment by Peter Jones [ 12/Sep/20 ]

Landed for 2.14
