Details
- Bug
- Resolution: Fixed
- Minor
- Lustre 2.13.0, Lustre 2.14.0, Lustre 2.12.5
- VMs with Lustre 2.12.5/master on ldiskfs
- 3
- 9223372036854775807
Description
I am creating this ticket to track the issue seen by @apercher (cf. the comment in LU-8346).
Here are the commands/configs to reproduce the issue:
fstab:
<serv1@ib1>:<serv2@ib1>:/fs1 /mnt/fs1 lustre defaults,_netdev,noauto,x-systemd.requires=lnet.service,flock,user_xattr,nosuid 0 0
<serv1@ib1>:<serv2@ib1>:/fs1/home /mnt/home lustre defaults,_netdev,noauto,x-systemd.requires=lnet.service,flock,user_xattr,nosuid 0 0
commands:
while true; do
    mount /mnt/home & mount /mnt/fs1
    umount /mnt/home
    umount /mnt/fs1
    lustre_rmmod
done
After some iterations, "rmmod lustre" hangs in "lu_context_key_degister".
dmesg (master branch):
[ 1560.484463] INFO: task rmmod:6430 blocked for more than 120 seconds.
[ 1560.484480] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1560.484496] rmmod D ffff9ddbdfd9acc0 0 6430 6396 0x00000080
[ 1560.484499] Call Trace:
[ 1560.484504] [<ffffffff8b0266d2>] ? kmem_cache_free+0x1e2/0x200
[ 1560.484508] [<ffffffff8b585da9>] schedule+0x29/0x70
[ 1560.484531] [<ffffffffc0a0284d>] lu_context_key_degister+0xcd/0x150 [obdclass]
[ 1560.484534] [<ffffffff8aec7880>] ? wake_bit_function_rh+0x40/0x40
[ 1560.484548] [<ffffffffc0a02a72>] lu_context_key_degister_many+0x72/0xb0 [obdclass]
[ 1560.484550] [<ffffffff8b0266d2>] ? kmem_cache_free+0x1e2/0x200
[ 1560.484564] [<ffffffffc0d67347>] vvp_type_fini+0x27/0x30 [lustre]
[ 1560.484577] [<ffffffffc09fc01b>] lu_device_type_fini+0x1b/0x20 [obdclass]
[ 1560.484586] [<ffffffffc0d68d75>] vvp_global_fini+0x15/0x30 [lustre]
[ 1560.484596] [<ffffffffc0d7beb4>] lustre_exit+0x31/0x17d [lustre]
[ 1560.484599] [<ffffffff8af1c46e>] SyS_delete_module+0x19e/0x310
[ 1560.484601] [<ffffffff8b592e09>] ? system_call_after_swapgs+0x96/0x13a
[ 1560.484603] [<ffffffff8b592e15>] ? system_call_after_swapgs+0xa2/0x13a
[ 1560.484604] [<ffffffff8b592e09>] ? system_call_after_swapgs+0x96/0x13a
[ 1560.484606] [<ffffffff8b592e15>] ? system_call_after_swapgs+0xa2/0x13a
[ 1560.484607] [<ffffffff8b592e09>] ? system_call_after_swapgs+0x96/0x13a
[ 1560.484609] [<ffffffff8b592ed2>] system_call_fastpath+0x25/0x2a
[ 1560.484611] [<ffffffff8b592e15>] ? system_call_after_swapgs+0xa2/0x13a
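When running the reproducer unattended, it can help to detect the hung-task report in the kernel log automatically. The following is only a minimal sketch: the helper name hung_tasks is hypothetical, and it assumes the message keeps the "INFO: task NAME:PID blocked for more than N seconds" wording shown in the dmesg output above and that the task name contains no spaces.

```shell
# Hypothetical helper: read kernel log text on stdin and print
# "NAME PID SECONDS" for every hung-task report it finds.
# Assumes the message format seen in the dmesg output above.
hung_tasks() {
    sed -n 's/.*INFO: task \([^ ]*\):\([0-9][0-9]*\) blocked for more than \([0-9][0-9]*\) seconds.*/\1 \2 \3/p'
}
```

For example, piping the output of dmesg into hung_tasks after each loop iteration and stopping when a line for rmmod appears would preserve the state of the node for crash analysis.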
crash backtrace (master branch):
crash> bt -F 6430
PID: 6430 TASK: ffff9ddbd5c0c1c0 CPU: 3 COMMAND: "rmmod"
#0 [ffff9ddbd5d2bd18] __schedule at ffffffff8b5858fa
ffff9ddbd5d2bd20: 0000000000000082 ffff9ddbd5d2bfd8
ffff9ddbd5d2bd30: ffff9ddbd5d2bfd8 ffff9ddbd5d2bfd8
ffff9ddbd5d2bd40: 000000000001acc0 [task_struct]
ffff9ddbd5d2bd50: kmem_cache_free+482 [dm_rq_target_io]
ffff9ddbd5d2bd60: 0000000000000000 00000000a8325962
ffff9ddbd5d2bd70: 0000000000000246 ll_thread_key
ffff9ddbd5d2bd80: bit_wait_table+2664 ffff9ddbd5d2bdd8
ffff9ddbd5d2bd90: 0000000000000000 0000000000000000
ffff9ddbd5d2bda0: ffff9ddbd5d2bdb0 schedule+41
#1 [ffff9ddbd5d2bda8] schedule at ffffffff8b585da9
ffff9ddbd5d2bdb0: ffff9ddbd5d2be20 lu_context_key_degister+205
#2 [ffff9ddbd5d2bdb8] lu_context_key_degister at ffffffffc0a0284d [obdclass]
ffff9ddbd5d2bdc0: ll_thread_key+36 00000000ffffffff
ffff9ddbd5d2bdd0: 0000000000000000 0000000000000000
ffff9ddbd5d2bde0: [task_struct] var_wake_function
ffff9ddbd5d2bdf0: bit_wait_table+2672 bit_wait_table+2672
ffff9ddbd5d2be00: 00000000a8325962 fffffffffffffff5
ffff9ddbd5d2be10: __this_module 0000000000000800
ffff9ddbd5d2be20: ffff9ddbd5d2be80 lu_context_key_degister_many+114
#3 [ffff9ddbd5d2be28] lu_context_key_degister_many at ffffffffc0a02a72 [obdclass]
ffff9ddbd5d2be30: ffff9ddb00000008 ffff9ddbd5d2be90
ffff9ddbd5d2be40: ffff9ddbd5d2be50 00000000a8325962
ffff9ddbd5d2be50: kmem_cache_free+482 vvp_session_key
ffff9ddbd5d2be60: vvp_thread_key 0000000000000000
crash> sym ll_thread_key
ffffffffc0da4a00 (D) ll_thread_key [lustre]
crash> struct lu_context_key ll_thread_key
struct lu_context_key {
  lct_tags = 1073741832,
  lct_init = 0xffffffffc0d67d20 <ll_thread_key_init>,
  lct_fini = 0xffffffffc0d67e30 <ll_thread_key_fini>,
  lct_exit = 0x0,
  lct_index = 14,
  lct_used = {
    counter = 1
  },
  lct_owner = 0xffffffffc0da8b80 <__this_module>,
  lct_reference = {<No data fields>}
}
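crash prints lct_tags in decimal, which makes the individual tag bits hard to read. As a small sketch (the function name decode_bits is hypothetical), the value can be converted to hexadecimal and a list of set bit positions with plain shell arithmetic. Mapping those positions to the symbolic tag names requires the headers of the Lustre build in question; that mapping is not assumed here.

```shell
# Hypothetical helper: print a decimal flag word as hex,
# followed by the positions of the bits that are set.
decode_bits() {
    value=$1
    bit=0
    bits=""
    while [ "$value" -gt 0 ]; do
        if [ $((value & 1)) -eq 1 ]; then
            # Append the bit position, space-separated.
            bits="${bits}${bits:+ }${bit}"
        fi
        value=$((value >> 1))
        bit=$((bit + 1))
    done
    printf '0x%x bits: %s\n' "$1" "$bits"
}

decode_bits 1073741832   # prints: 0x40000008 bits: 3 30
```

For the ll_thread_key dump above, this shows that exactly two tag bits (3 and 30) are set.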
The issue seems to occur more frequently on the b2_12 branch.
Issue Links
- is related to
- LU-14547 santyn/109 fails on a local setup
- Resolved