Details
- Bug
- Resolution: Fixed
- Major
- Lustre 2.16.0
- 3
- 9223372036854775807
Description
crash> bt
PID: 95475 TASK: ffff9d51fe3fc740 CPU: 9 COMMAND: "umount"
 #0 [ffffb6218eb478b0] __schedule at ffffffff8b74e1d4
 #1 [ffffb6218eb47910] schedule at ffffffff8b74e648
 #2 [ffffb6218eb47920] schedule_timeout at ffffffff8b751cd3
 #3 [ffffb6218eb479b8] ptlrpc_set_wait at ffffffffc1679185 [ptlrpc]
 #4 [ffffb6218eb47a30] ptlrpc_queue_wait at ffffffffc1679371 [ptlrpc]
 #5 [ffffb6218eb47a48] ptlrpc_disconnect_import at ffffffffc16a5165 [ptlrpc]
 #6 [ffffb6218eb47ac8] osp_disconnect at ffffffffc1d348d2 [osp]
 #7 [ffffb6218eb47ae8] osp_process_config at ffffffffc1d35a7f [osp]
 #8 [ffffb6218eb47b18] lod_sub_process_config at ffffffffc1abf901 [lod]
 #9 [ffffb6218eb47b58] lod_process_config at ffffffffc1ac7b2e [lod]
#10 [ffffb6218eb47ba8] mdd_process_config at ffffffffc1b5ec8f [mdd]
#11 [ffffb6218eb47bd8] mdt_stack_pre_fini at ffffffffc1bdfc19 [mdt]
#12 [ffffb6218eb47c10] mdt_device_fini at ffffffffc1be5e97 [mdt]
#13 [ffffb6218eb47c60] class_cleanup at ffffffffc12f3ed1 [obdclass]
#14 [ffffb6218eb47ce0] class_process_config at ffffffffc12f4e35 [obdclass]
#15 [ffffb6218eb47d50] class_manual_cleanup at ffffffffc12f6f15 [obdclass]
#16 [ffffb6218eb47df0] server_put_super at ffffffffc1331143 [obdclass]
#17 [ffffb6218eb47e98] generic_shutdown_super at ffffffff8b11bdcc
The log shows disconnect requests timing out, after about 71 seconds each.
00000100:00000400:0.0:1712841383.691084:0:95475:0:(client.c:2310:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1712841312/real 1712841312] req@000000000cc2b990 x1796043989671424/t0(0) o39->work2-MDT0001-osp-MDT0003@23421@kfi:24/4 lens 224/224 e 0 to 1 dl 1712841383 ref 2 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:''
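The 71-second figure can be read off the log line itself. A minimal sketch, assuming that `sent` is the send time and `dl` is the request deadline, both in epoch seconds; the helper name `request_wait_seconds` is made up for this illustration and is not part of Lustre:

```python
import re

# Log line copied from this ticket (ptlrpc_expire_one_request message).
LOG_LINE = (
    "00000100:00000400:0.0:1712841383.691084:0:95475:0:"
    "(client.c:2310:ptlrpc_expire_one_request()) @@@ Request sent has timed out "
    "for slow reply: [sent 1712841312/real 1712841312] req@000000000cc2b990 "
    "x1796043989671424/t0(0) o39->work2-MDT0001-osp-MDT0003@23421@kfi:24/4 "
    "lens 224/224 e 0 to 1 dl 1712841383 ref 2 fl Rpc:XNQr/0/ffffffff rc 0/-1 job:''"
)


def request_wait_seconds(line):
    """Return (deadline - sent) in seconds, or None if either field is missing.

    Assumption: 'sent' and 'dl' are epoch seconds for send time and deadline.
    """
    sent = re.search(r"\[sent (\d+)/real \d+\]", line)
    deadline = re.search(r"\bdl (\d+)\b", line)
    if not sent or not deadline:
        return None
    return int(deadline.group(1)) - int(sent.group(1))


print(request_wait_seconds(LOG_LINE))
```

For the line above the difference is 1712841383 - 1712841312 = 71 seconds.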
Four or five disconnect timeouts are enough to reach the 5-minute limit. On a cluster with 200 OSP devices that happens easily.
With HA disabled, the umount takes about 7 minutes.
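The arithmetic behind the "4-5 timeouts" estimate, as a rough sketch. It assumes the disconnects are waited on one after another (the backtrace shows umount blocked in a synchronous ptlrpc_queue_wait) and that each timed-out disconnect costs the full 71 seconds; the function name is made up for this illustration:

```python
import math

LIMIT_S = 5 * 60        # the 5-minute limit mentioned above
PER_TIMEOUT_S = 71      # wait per timed-out disconnect, from the log


def timeouts_to_reach(limit_s, per_timeout_s):
    """Smallest number of serial timeouts whose total wait reaches the limit."""
    return math.ceil(limit_s / per_timeout_s)


for n in range(1, 7):
    # Cumulative wait if n disconnects time out back to back.
    print(n, n * PER_TIMEOUT_S)

print(timeouts_to_reach(LIMIT_S, PER_TIMEOUT_S))
```

Four timeouts add up to 284 seconds, just under the limit, and the fifth pushes the total to 355 seconds, so any other cleanup work on top of four timeouts is already enough.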
[root@work2n005 ~]# dmesg -T | grep work2-MDT0003 | egrep 'Failing|complete'
[Thu Apr 11 09:18:18 2024] Lustre: Failing over work2-MDT0003
[Thu Apr 11 09:25:25 2024] Lustre: server umount work2-MDT0003 complete
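The umount duration can be computed from the two dmesg timestamps. A small sketch, assuming the `dmesg -T` timestamp format shown above and an English/C locale for the day and month names; `parse_dmesg_time` is a helper made up for this illustration:

```python
from datetime import datetime

# Format of the bracketed timestamp printed by `dmesg -T` in the output above.
DMESG_FORMAT = "%a %b %d %H:%M:%S %Y"


def parse_dmesg_time(line):
    """Extract and parse the leading [timestamp] of a `dmesg -T` line."""
    stamp = line[line.index("[") + 1 : line.index("]")]
    return datetime.strptime(stamp, DMESG_FORMAT)


start = parse_dmesg_time(
    "[Thu Apr 11 09:18:18 2024] Lustre: Failing over work2-MDT0003"
)
end = parse_dmesg_time(
    "[Thu Apr 11 09:25:25 2024] Lustre: server umount work2-MDT0003 complete"
)
elapsed = end - start
print(elapsed, elapsed.total_seconds())
```

The gap is 7 minutes 7 seconds (427 seconds), which is close to six 71-second waits, although the log excerpt here does not confirm how many disconnects actually timed out.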
Issue Links
- is blocked by: LU-18045 MDT unmount can stuck on waiting for pending OSP locks (Resolved)