Details
- Type: Bug
- Resolution: Unresolved
- Priority: Critical
- Affects Version/s: Lustre 2.14.0, Lustre 2.15.1
- Environment: Linux 5.4.0-1091-azure #96~18.04.1-Ubuntu SMP Tue Aug 30 19:15:32 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
Description
Reproducible on 2.15.1 and 2.14.0. Both clients and servers are running Ubuntu 18.04 as shown in Environment.
Steps to reproduce:
# confirm hsm is enabled
mds-node:~# lctl get_param mdt.lustrefs-MDT0000.hsm_control
mdt.lustrefs-MDT0000.hsm_control=enabled
# setup pcc on client 0
client-0:~# mkdir /pcc
client-0:~# chmod 777 /pcc /lustre
client-0:~# lhsmtool_posix --daemon --hsm-root /pcc --archive=2 /lustre < /dev/null > /tmp/copytool_log 2>&1
client-0:~# lctl pcc add /lustre /pcc -p "gid={0},gid={2001} rwid=2"
# setup pcc on client 1
client-1:~# mkdir /pcc
client-1:~# chmod 777 /pcc /lustre
client-1:~# lhsmtool_posix --daemon --hsm-root /pcc --archive=3 /lustre < /dev/null > /tmp/copytool_log 2>&1
client-1:~# lctl pcc add /lustre /pcc -p "gid={0},gid={2001} rwid=3"
# create file on client 0 and confirm in-cache
client-0:~# echo "test" > /lustre/test
client-0:~# lfs pcc state /lustre/test
file: /lustre/test, type: readwrite, PCC file: /pcc/0001/0000/0402/0000/0002/0000/0x200000402:0x1:0x0, user number: 0, flags: 0
# read file from client 1
client-1:~# lfs pcc state /lustre/test
file: /lustre/test, type: none
client-1:~# cat /lustre/test
cat: /lustre/test: No data available
client-1:~# cat /lustre/test
test
client-1:~# lfs pcc state /lustre/test
file: /lustre/test, type: none
# check pcc state, and attempt to attach again on client 0
client-0:~# lfs pcc state /lustre/test
file: /lustre/test, type: none
client-0:~# lfs pcc attach -i 2 /lustre/test
^C^C^C^C^C^C^C^C^C <---- hang
# while client 0 is hanging, check state on client 1
client-1:~# lfs pcc state /lustre/test
^C^C^C^C <---- hang
Minutes later the stuck commands return. Examining the MDS shows that it crashed and rebooted. Relevant output from dmesg:
[ 3266.211270] LustreError: 11458:0:(mdt_handler.c:960:mdt_big_xattr_get()) ASSERTION( info->mti_big_lmm_used == 0 ) failed:
[ 3266.217023] LustreError: 11458:0:(mdt_handler.c:960:mdt_big_xattr_get()) LBUG
[ 3266.220653] Pid: 11458, comm: mdt_rdpg02_001 5.4.0-1091-azure #96~18.04.1-Ubuntu SMP Tue Aug 30 19:15:32 UTC 2022
[ 3266.220653] Call Trace TBD:
[ 3266.220654] Kernel panic - not syncing: LBUG
[ 3266.222778] CPU: 8 PID: 11458 Comm: mdt_rdpg02_001 Kdump: loaded Tainted: P OE 5.4.0-1091-azure #96~18.04.1-Ubuntu
[ 3266.224582] Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS 090008 12/07/2018
[ 3266.224582] Call Trace:
[ 3266.224582] dump_stack+0x57/0x6d
[ 3266.224582] panic+0xf8/0x2d4
[ 3266.224582] lbug_with_loc+0x89/0x2c0 [libcfs]
[ 3266.224582] mdt_big_xattr_get+0x398/0x8b0 [mdt]
[ 3266.224582] ? mdd_read_unlock+0x2d/0xc0 [mdd]
[ 3266.224582] ? mdd_readpage+0x1919/0x1ed0 [mdd]
[ 3266.224582] __mdt_stripe_get+0x1d4/0x430 [mdt]
[ 3266.224582] mdt_attr_get_complex+0x56e/0x1af0 [mdt]
[ 3266.224582] mdt_mfd_close+0x2062/0x41c0 [mdt]
[ 3266.224582] ? lustre_msg_buf+0x17/0x50 [ptlrpc]
[ 3266.224582] ? __req_capsule_offset+0x5ae/0x6e0 [ptlrpc]
[ 3266.224582] mdt_close_internal+0x1f0/0x250 [mdt]
[ 3266.259003] mdt_close+0x483/0x13f0 [mdt]
[ 3266.259003] tgt_request_handle+0xc9a/0x1950 [ptlrpc]
[ 3266.259003] ? lustre_msg_get_transno+0x22/0xe0 [ptlrpc]
[ 3266.259003] ptlrpc_register_service+0x25e6/0x4610 [ptlrpc]
[ 3266.259003] ? __switch_to_asm+0x34/0x70
[ 3266.259003] kthread+0x121/0x140
[ 3266.259003] ? ptlrpc_register_service+0x1590/0x4610 [ptlrpc]
[ 3266.259003] ? kthread_park+0x90/0x90
[ 3266.259003] ret_from_fork+0x35/0x40
[ 3266.259003] Kernel Offset: 0x1be00000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
Issue Links
- is related to:
  - LU-13816 LustreError: 18408:0:(mdt_handler.c:892:mdt_big_xattr_get()) ASSERTION( info->mti_big_lmm_used == 0 ) failed: (Resolved)
  - LU-13599 LustreError: 30166:0:(service.c:189:ptlrpc_save_lock()) ASSERTION( rs->rs_nlocks < 8 ) failed (Resolved)
  - LU-13615 mdt_big_xattr_get()) ASSERTION( info->mti_big_lmm_used == 0 ) failed (Closed)
I've linked three potentially related bugs. The last one has a description that's particularly enlightening:
"This is result of inappropriate usage of mti_big_lmm buffer in various places. Originally it was introduced to be used for getting big LOV/LMV EA and passing them to reply buffers. Meanwhile it is widely used now for internal server needs. These cases should be distinguished and if there is no intention to return EA in reply then flag {mti_big_lmm_used}} should not be set. Maybe it is worth to rename it as mti_big_lmm_keep to mark that is to be kept until reply is packed."
This aligns with a comment in the non-internal version of __mdt_stripe_get():
LASSERT(!info->mti_big_lmm_used);
rc = __mdt_stripe_get(info, o, ma, name);
/* since big_lmm is always used here, clear 'used' flag to avoid
 * assertion in mdt_big_xattr_get().
 */
info->mti_big_lmm_used = 0;
I wonder if a code path is being exercised that (ab)uses mti_big_lmm_used in a similar fashion but is not covered the way this code path is.
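To make that suspicion concrete, here is a minimal toy model in plain C (not Lustre code; the struct and helpers are made up and only mirror the Lustre field names) of how a shared per-thread buffer with a keep-until-reply flag trips an assertion when a second user runs on the same handler thread while the flag is still set:

#include <assert.h>
#include <stdio.h>

struct thread_info {
	char big_lmm[64];   /* per-thread scratch buffer shared by all users */
	int  big_lmm_used;  /* "result must be kept until the reply is packed" */
};

/* Stand-in for mdt_big_xattr_get(): assumes nobody else holds the buffer. */
static void big_xattr_get(struct thread_info *info)
{
	assert(info->big_lmm_used == 0);   /* analogue of the LBUG in the dmesg */
	snprintf(info->big_lmm, sizeof(info->big_lmm), "xattr payload");
	info->big_lmm_used = 1;            /* keep the result for the reply */
}

/* Stand-in for the non-internal wrapper quoted above: internal use only,
 * so it clears the flag again right after the fetch. */
static void stripe_get_internal(struct thread_info *info)
{
	big_xattr_get(info);
	info->big_lmm_used = 0;
}

int main(void)
{
	struct thread_info info = { .big_lmm_used = 0 };

	stripe_get_internal(&info);  /* safe: flag cleared afterwards */
	big_xattr_get(&info);        /* safe: result kept for the reply */
	big_xattr_get(&info);        /* second user on the same thread while the
	                              * flag is still set -> assertion trips */
	return 0;
}

If something like this is happening in the close path from the stack trace (mdt_mfd_close() -> mdt_attr_get_complex() -> __mdt_stripe_get() -> mdt_big_xattr_get()), an earlier caller on the same service thread leaving mti_big_lmm_used set would be enough to hit the assertion.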