[LU-14110] Race during several client mount instances (--> rmmod lustre hang) Created: 03/Nov/20  Updated: 26/Apr/22  Resolved: 22/Mar/21

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.13.0, Lustre 2.14.0, Lustre 2.12.5
Fix Version/s: Lustre 2.15.0

Type: Bug Priority: Minor
Reporter: Etienne Aujames Assignee: Etienne Aujames
Resolution: Fixed Votes: 0
Labels: obdclass
Environment:

VMs with Lustre 2.12.5/master on ldiskfs


Issue Links:
Related
is related to LU-14547 sanityn/109 fails on a local setup Resolved
Epic/Theme: client
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

I created this ticket to track the issue seen by @apercher (cf. the comment on LU-8346).

Here are the commands/configs to reproduce the issue:

fstab:

<serv1@ib1>:<serv2@ib1>:/fs1 /mnt/fs1 lustre defaults,_netdev,noauto,x-systemd.requires=lnet.service,flock,user_xattr,nosuid 0 0
<serv1@ib1>:<serv2@ib1>:/fs1/home /mnt/home lustre defaults,_netdev,noauto,x-systemd.requires=lnet.service,flock,user_xattr,nosuid 0 0

commands:

while true; do
 # mount both filesystems in parallel: two concurrent client mount
 # instances racing against each other is what triggers the issue
 mount /mnt/home & mount /mnt/fs1
 umount /mnt/home
 umount /mnt/fs1
 # unload all Lustre/LNet modules; this is where rmmod eventually hangs
 lustre_rmmod
done

After some iterations, "rmmod lustre" hangs in lu_context_key_degister().

dmesg (master branch):

 [ 1560.484463] INFO: task rmmod:6430 blocked for more than 120 seconds.
 [ 1560.484480] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
 [ 1560.484496] rmmod D ffff9ddbdfd9acc0 0 6430 6396 0x00000080
 [ 1560.484499] Call Trace:
 [ 1560.484504] [<ffffffff8b0266d2>] ? kmem_cache_free+0x1e2/0x200
 [ 1560.484508] [<ffffffff8b585da9>] schedule+0x29/0x70
 [ 1560.484531] [<ffffffffc0a0284d>] lu_context_key_degister+0xcd/0x150 [obdclass]
 [ 1560.484534] [<ffffffff8aec7880>] ? wake_bit_function_rh+0x40/0x40
 [ 1560.484548] [<ffffffffc0a02a72>] lu_context_key_degister_many+0x72/0xb0 [obdclass]
 [ 1560.484550] [<ffffffff8b0266d2>] ? kmem_cache_free+0x1e2/0x200
 [ 1560.484564] [<ffffffffc0d67347>] vvp_type_fini+0x27/0x30 [lustre]
 [ 1560.484577] [<ffffffffc09fc01b>] lu_device_type_fini+0x1b/0x20 [obdclass]
 [ 1560.484586] [<ffffffffc0d68d75>] vvp_global_fini+0x15/0x30 [lustre]
 [ 1560.484596] [<ffffffffc0d7beb4>] lustre_exit+0x31/0x17d [lustre]
 [ 1560.484599] [<ffffffff8af1c46e>] SyS_delete_module+0x19e/0x310
 [ 1560.484601] [<ffffffff8b592e09>] ? system_call_after_swapgs+0x96/0x13a
 [ 1560.484603] [<ffffffff8b592e15>] ? system_call_after_swapgs+0xa2/0x13a
 [ 1560.484604] [<ffffffff8b592e09>] ? system_call_after_swapgs+0x96/0x13a
 [ 1560.484606] [<ffffffff8b592e15>] ? system_call_after_swapgs+0xa2/0x13a
 [ 1560.484607] [<ffffffff8b592e09>] ? system_call_after_swapgs+0x96/0x13a
 [ 1560.484609] [<ffffffff8b592ed2>] system_call_fastpath+0x25/0x2a
 [ 1560.484611] [<ffffffff8b592e15>] ? system_call_after_swapgs+0xa2/0x13a

crash backtrace (master branch):

crash> bt -F 6430
PID: 6430   TASK: ffff9ddbd5c0c1c0  CPU: 3   COMMAND: "rmmod"
 #0 [ffff9ddbd5d2bd18] __schedule at ffffffff8b5858fa
    ffff9ddbd5d2bd20: 0000000000000082 ffff9ddbd5d2bfd8
    ffff9ddbd5d2bd30: ffff9ddbd5d2bfd8 ffff9ddbd5d2bfd8
    ffff9ddbd5d2bd40: 000000000001acc0 [task_struct]
    ffff9ddbd5d2bd50: kmem_cache_free+482 [dm_rq_target_io]
    ffff9ddbd5d2bd60: 0000000000000000 00000000a8325962
    ffff9ddbd5d2bd70: 0000000000000246 ll_thread_key
    ffff9ddbd5d2bd80: bit_wait_table+2664 ffff9ddbd5d2bdd8
    ffff9ddbd5d2bd90: 0000000000000000 0000000000000000
    ffff9ddbd5d2bda0: ffff9ddbd5d2bdb0 schedule+41
 #1 [ffff9ddbd5d2bda8] schedule at ffffffff8b585da9
    ffff9ddbd5d2bdb0: ffff9ddbd5d2be20 lu_context_key_degister+205
 #2 [ffff9ddbd5d2bdb8] lu_context_key_degister at ffffffffc0a0284d [obdclass]
    ffff9ddbd5d2bdc0: ll_thread_key+36 00000000ffffffff
    ffff9ddbd5d2bdd0: 0000000000000000 0000000000000000
    ffff9ddbd5d2bde0: [task_struct]    var_wake_function
    ffff9ddbd5d2bdf0: bit_wait_table+2672 bit_wait_table+2672
    ffff9ddbd5d2be00: 00000000a8325962 fffffffffffffff5
    ffff9ddbd5d2be10: __this_module    0000000000000800
    ffff9ddbd5d2be20: ffff9ddbd5d2be80 lu_context_key_degister_many+114
 #3 [ffff9ddbd5d2be28] lu_context_key_degister_many at ffffffffc0a02a72 [obdclass]
    ffff9ddbd5d2be30: ffff9ddb00000008 ffff9ddbd5d2be90
    ffff9ddbd5d2be40: ffff9ddbd5d2be50 00000000a8325962
    ffff9ddbd5d2be50: kmem_cache_free+482 vvp_session_key
    ffff9ddbd5d2be60: vvp_thread_key   0000000000000000
crash> sym ll_thread_key
ffffffffc0da4a00 (D) ll_thread_key [lustre]
crash> struct lu_context_key ll_thread_key
struct lu_context_key {
  lct_tags = 1073741832,
  lct_init = 0xffffffffc0d67d20 <ll_thread_key_init>,
  lct_fini = 0xffffffffc0d67e30 <ll_thread_key_fini>,
  lct_exit = 0x0,
  lct_index = 14,
  lct_used = {
    counter = 1
  },
  lct_owner = 0xffffffffc0da8b80 <__this_module>,
  lct_reference = {<No data fields>}
}

 

The issue seems to reproduce more frequently on the b2_12 branch.
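
For context on the hang: rmmod is asleep in lu_context_key_degister(), which waits for every context that instantiated ll_thread_key to drop its reference (the lct_used counter visible in the crash dump above). Below is a minimal userspace sketch of that refcount-and-wait pattern with two concurrent "mount" instances; the struct, function names and the usleep()-based wait are hypothetical simplifications for illustration, not the actual obdclass code.

/*
 * Minimal userspace sketch of the refcount-and-wait pattern that
 * lu_context_key_degister() relies on.  Hypothetical simplification,
 * NOT the obdclass source.
 *
 * Build: cc -pthread -o key_race key_race.c
 */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <unistd.h>

struct context_key {
	atomic_int  used;      /* 1 == registered only, >1 == live per-context values */
	atomic_bool quiesced;  /* set once teardown has started */
};

static struct context_key key = { .used = 1, .quiesced = false };

/* "mount" side: instantiate a per-context value (stand-in for keys_fill()) */
static void *mount_instance(void *arg)
{
	(void)arg;
	atomic_fetch_add(&key.used, 1);   /* reference taken for this context */
	/* ... per-context value is used here ... */
	atomic_fetch_sub(&key.used, 1);   /* stand-in for lu_context_key_fini() */
	return NULL;
}

/* "rmmod" side: stand-in for lu_context_key_degister() */
static void key_degister(void)
{
	atomic_store(&key.quiesced, true);   /* stand-in for key quiesce */
	/*
	 * Wait until only the registration reference is left.  The hung
	 * rmmod task in this ticket is blocked in the equivalent wait:
	 * a concurrent mount instance filling its keys left the counter
	 * above the expected value, so the condition never becomes true.
	 */
	while (atomic_load(&key.used) > 1)
		usleep(1000);
	printf("key degistered, used = %d\n", atomic_load(&key.used));
}

int main(void)
{
	pthread_t t1, t2;

	/* two concurrent "mount" instances, as in the fstab reproducer */
	pthread_create(&t1, NULL, mount_instance, NULL);
	pthread_create(&t2, NULL, mount_instance, NULL);
	pthread_join(t1, NULL);
	pthread_join(t2, NULL);

	key_degister();   /* returns here; hangs if a reference is leaked */
	return 0;
}

In this sketch every reference is properly dropped before key_degister() runs, so it terminates; the hang in the ticket corresponds to a reference that is taken (or re-taken) by a racing fill and never released before the wait starts.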

 



 Comments   
Comment by Gerrit Updater [ 06/Nov/20 ]

Etienne AUJAMES (eaujames@ddn.com) uploaded a new patch: https://review.whamcloud.com/40561
Subject: LU-14110 obdclass: Protect keys_fill instances
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 1697925b59996bc88690eb57270a443d0ea44214
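
The patch subject suggests serializing concurrent keys_fill() instances. As a rough illustration only (not the content of the patch above), one generic way to close such a window is to make the fill path check the quiesce state and take its reference under a lock, reusing the hypothetical struct context_key from the sketch in the description:

#include <errno.h>

static pthread_mutex_t keys_lock = PTHREAD_MUTEX_INITIALIZER;

/* hypothetical "protected" fill: refuse to take a reference once teardown
 * (quiesce) has started, and make the check+increment atomic w.r.t. it */
static int keys_fill_protected(struct context_key *k)
{
	int rc = 0;

	pthread_mutex_lock(&keys_lock);
	if (atomic_load(&k->quiesced))
		rc = -ESHUTDOWN;          /* teardown in progress */
	else
		atomic_fetch_add(&k->used, 1);
	pthread_mutex_unlock(&keys_lock);

	return rc;
}

For this to actually close the window, the quiesce/degister side would of course have to take the same lock before setting quiesced and before starting its wait.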

Comment by Gerrit Updater [ 06/Nov/20 ]

Etienne AUJAMES (eaujames@ddn.com) uploaded a new patch: https://review.whamcloud.com/40565
Subject: LU-14110 obdclass: Protect keys_fill instances
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: d3999d5c2ea32593623df6c7dc7313646dcc5cbe

Comment by Gerrit Updater [ 22/Mar/21 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/40565/
Subject: LU-14110 obdclass: Protect cl_env_percpu[]
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 881551fbb7335694b89a877072bcda0aeaf8705c

Comment by Peter Jones [ 22/Mar/21 ]

Landed for 2.15
