[LU-10630] recovery-random-scale test_fail_client_mds: client cannot connect to MDS
Created: 07/Feb/18  Updated: 24/Nov/21  Resolved: 24/Nov/21

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.11.0
Fix Version/s: None

Type: Bug
Priority: Minor
Reporter: Sarah Liu
Assignee: WC Triage
Resolution: Cannot Reproduce
Votes: 0
Labels: None

Severity: 3
 Description   

recovery-random-scale test_fail_client_mds - Timeout occurred after 1444 mins, last suite running was recovery-random-scale, restarting cluster to continue tests
^^^^^^^^^^^^^ DO NOT REMOVE LINE ABOVE ^^^^^^^^^^^^^

This issue was created by maloo for sarah_lw <wei3.liu@intel.com>

This issue relates to the following test suite run:
https://testing.hpdd.intel.com/test_sets/b98d4034-ff51-11e7-a7cd-52540065bddc

test_fail_client_mds failed with the following error:

Timeout occurred after 1444 mins, last suite running was recovery-random-scale, restarting cluster to continue tests
[   18.684976] LNet: Accept all, port 7988
[  113.727780] LustreError: 11-0: lustre-MDT0000-mdc-ffff88007b744000: operation mds_connect to node 10.2.8.168@tcp failed: rc = -11
[  128.332376] random: crng init done
[  263.727780] LustreError: 11-0: lustre-MDT0000-mdc-ffff88007b744000: operation mds_connect to node 10.2.8.168@tcp failed: rc = -11
[  413.727721] LustreError: 11-0: lustre-MDT0000-mdc-ffff88007b744000: operation mds_connect to node 10.2.8.168@tcp failed: rc = -11
[  563.727711] LustreError: 11-0: lustre-MDT0000-mdc-ffff88007b744000: operation mds_connect to node 10.2.8.168@tcp failed: rc = -11
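The repeated `mds_connect` failures above can be summarized with a short script. This is a minimal sketch, not part of the original report: it assumes `rc` follows the usual Linux kernel convention of a negated errno value, and the small `LINUX_ERRNO` table and the `parse_failures` helper are hypothetical names introduced only for illustration. The sample text is copied from the log lines above.

```python
import re

# Log lines copied verbatim from the dmesg excerpt in this ticket.
DMESG = """\
[   18.684976] LNet: Accept all, port 7988
[  113.727780] LustreError: 11-0: lustre-MDT0000-mdc-ffff88007b744000: operation mds_connect to node 10.2.8.168@tcp failed: rc = -11
[  128.332376] random: crng init done
[  263.727780] LustreError: 11-0: lustre-MDT0000-mdc-ffff88007b744000: operation mds_connect to node 10.2.8.168@tcp failed: rc = -11
[  413.727721] LustreError: 11-0: lustre-MDT0000-mdc-ffff88007b744000: operation mds_connect to node 10.2.8.168@tcp failed: rc = -11
[  563.727711] LustreError: 11-0: lustre-MDT0000-mdc-ffff88007b744000: operation mds_connect to node 10.2.8.168@tcp failed: rc = -11
"""

# Assumption: rc is a negated Linux errno. Only a few values are listed here;
# the table is hard-coded so the result does not depend on the host platform.
LINUX_ERRNO = {11: "EAGAIN", 107: "ENOTCONN", 110: "ETIMEDOUT"}

LINE_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+LustreError: .*?operation (?P<op>\w+) "
    r"to node (?P<nid>\S+) failed: rc = (?P<rc>-?\d+)"
)


def parse_failures(text):
    """Return (timestamp, operation, nid, rc, errno_name) for each failed operation."""
    failures = []
    for m in LINE_RE.finditer(text):
        rc = int(m.group("rc"))
        name = LINUX_ERRNO.get(-rc, "UNKNOWN")
        failures.append((float(m.group("ts")), m.group("op"), m.group("nid"), rc, name))
    return failures


failures = parse_failures(DMESG)
# Seconds between consecutive failures, rounded to the nearest whole second.
intervals = [round(b[0] - a[0]) for a, b in zip(failures, failures[1:])]

for ts, op, nid, rc, name in failures:
    print(f"{ts:10.3f}s  {op} -> {nid}  rc={rc} ({name})")
print("retry intervals (s):", intervals)
```

For the lines in this ticket, every failure decodes to `EAGAIN` and the attempts are spaced about 150 seconds apart, which suggests the client keeps retrying while the MDS is not yet ready to accept the connection.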


 Comments   
Comment by James Nunez (Inactive) [ 08/Feb/18 ]

There's not much to look at in the dmesg logs, but the MDS1 (vm12) console log shows the following stack traces:

[  618.123147] LNet: Service thread pid 13443 was inactive for 60.04s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
[  618.126713] Pid: 13443, comm: mdt00_003
[  618.127520] 
[  618.127520] Call Trace:
[  618.128418]  [<ffffffff816ab6b9>] schedule+0x29/0x70
[  618.129587]  [<ffffffff816a9004>] schedule_timeout+0x174/0x2c0
[  618.130846]  [<ffffffffc0abdd47>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
[  618.132274]  [<ffffffff8109a6c0>] ? process_timeout+0x0/0x10
[  618.133532]  [<ffffffffc0ab2eb1>] ? cfs_block_sigsinv+0x71/0xa0 [libcfs]
[  618.134850]  [<ffffffffc13838d0>] osp_precreate_reserve+0x2e0/0x810 [osp]
[  618.136193]  [<ffffffff810c6440>] ? default_wake_function+0x0/0x20
[  618.137320]  [<ffffffffc1378c53>] osp_declare_create+0x193/0x590 [osp]
[  618.138609]  [<ffffffffc0bea619>] ? lprocfs_counter_add+0xf9/0x160 [obdclass]
[  618.139946]  [<ffffffffc12ca1dc>] lod_sub_declare_create+0xdc/0x210 [lod]
[  618.141282]  [<ffffffffc12c353e>] lod_qos_declare_object_on+0xbe/0x3a0 [lod]
[  618.142569]  [<ffffffffc12c44ba>] lod_alloc_rr.constprop.18+0x70a/0x1000 [lod]
[  618.143974]  [<ffffffffc12c8a8f>] lod_qos_prep_create+0xc0f/0x1830 [lod]
[  618.145265]  [<ffffffffc12c9c0d>] lod_prepare_create+0x25d/0x360 [lod]
[  618.146636]  [<ffffffffc12bbdce>] lod_declare_striped_create+0x1ee/0x970 [lod]
[  618.148097]  [<ffffffffc12ca1dc>] ? lod_sub_declare_create+0xdc/0x210 [lod]
[  618.149544]  [<ffffffffc12c00e4>] lod_declare_create+0x204/0x590 [lod]
[  618.150866]  [<ffffffffc0bea619>] ? lprocfs_counter_add+0xf9/0x160 [obdclass]
[  618.152333]  [<ffffffffc133139f>] mdd_declare_create_object_internal+0xdf/0x2f0 [mdd]
[  618.153810]  [<ffffffffc1321b53>] mdd_declare_create+0x53/0xe20 [mdd]
[  618.155154]  [<ffffffffc1325e69>] mdd_create+0x879/0x1410 [mdd]
[  618.156268]  [<ffffffffc11db305>] mdt_reint_open+0x1a45/0x2890 [mdt]
[  618.157529]  [<ffffffffc0c1e087>] ? upcall_cache_get_entry+0x3f7/0x8f0 [obdclass]
[  618.158905]  [<ffffffffc11beb53>] ? ucred_set_jobid+0x53/0x70 [mdt]
[  618.160143]  [<ffffffffc11cf410>] mdt_reint_rec+0x80/0x210 [mdt]
[  618.161253]  [<ffffffffc11aef8b>] mdt_reint_internal+0x5fb/0x9c0 [mdt]
[  618.162529]  [<ffffffffc11bb457>] mdt_intent_reint+0x157/0x420 [mdt]
[  618.163720]  [<ffffffffc11b20b2>] mdt_intent_opc+0x442/0xad0 [mdt]
[  618.165008]  [<ffffffffc0e3bb90>] ? lustre_swab_ldlm_intent+0x0/0x20 [ptlrpc]
[  618.166314]  [<ffffffffc11b9c73>] mdt_intent_policy+0x1a3/0x360 [mdt]
[  618.167583]  [<ffffffffc0dea2fa>] ldlm_lock_enqueue+0x38a/0x970 [ptlrpc]
[  618.168942]  [<ffffffffc0e13a33>] ldlm_handle_enqueue0+0x8f3/0x1400 [ptlrpc]
[  618.170447]  [<ffffffffc0e3bc10>] ? lustre_swab_ldlm_request+0x0/0x30 [ptlrpc]
[  618.171912]  [<ffffffffc0e99752>] tgt_enqueue+0x62/0x210 [ptlrpc]
[  618.173114]  [<ffffffffc0ea1965>] tgt_request_handle+0x925/0x13b0 [ptlrpc]
[  618.173927]  [<ffffffffc0e45c7e>] ptlrpc_server_handle_request+0x24e/0xab0 [ptlrpc]
[  618.174816]  [<ffffffff810bc0f8>] ? __wake_up_common+0x58/0x90
[  618.175639]  [<ffffffffc0e49422>] ptlrpc_main+0xa92/0x1e40 [ptlrpc]
[  618.176374]  [<ffffffff810c0d30>] ? finish_task_switch+0x50/0x160
[  618.177202]  [<ffffffffc0e48990>] ? ptlrpc_main+0x0/0x1e40 [ptlrpc]
[  618.177932]  [<ffffffff810b252f>] kthread+0xcf/0xe0
[  618.178689]  [<ffffffff810b2460>] ? kthread+0x0/0xe0
[  618.179434]  [<ffffffff816b8798>] ret_from_fork+0x58/0x90
[  618.180131]  [<ffffffff810b2460>] ? kthread+0x0/0xe0
[  618.180732] 
[  618.180936] LustreError: dumping log to /tmp/lustre-log.1516576126.13443
[  618.379102] Pid: 12418, comm: mdt00_002
[  618.379607] 
[  618.379607] Call Trace:
[  618.380078]  [<ffffffff816ab6b9>] schedule+0x29/0x70
[  618.380662]  [<ffffffff816a9004>] schedule_timeout+0x174/0x2c0
[  618.381362]  [<ffffffffc0abdd47>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
[  618.382286]  [<ffffffff8109a6c0>] ? process_timeout+0x0/0x10
[  618.382976]  [<ffffffffc0ab2eb1>] ? cfs_block_sigsinv+0x71/0xa0 [libcfs]
[  618.383754]  [<ffffffffc13838d0>] osp_precreate_reserve+0x2e0/0x810 [osp]
[  618.384636]  [<ffffffff810c6440>] ? default_wake_function+0x0/0x20
[  618.385399]  [<ffffffffc1378c53>] osp_declare_create+0x193/0x590 [osp]
[  618.386245]  [<ffffffffc0bea619>] ? lprocfs_counter_add+0xf9/0x160 [obdclass]
[  618.387066]  [<ffffffffc12ca1dc>] lod_sub_declare_create+0xdc/0x210 [lod]
[  618.387854]  [<ffffffffc12c353e>] lod_qos_declare_object_on+0xbe/0x3a0 [lod]
[  618.388780]  [<ffffffffc12c44ba>] lod_alloc_rr.constprop.18+0x70a/0x1000 [lod]
[  618.389774]  [<ffffffffc07009d5>] ? dbuf_find+0x1d5/0x1e0 [zfs]
[  618.390494]  [<ffffffffc0604487>] ? tsd_get+0x37/0x60 [spl]
[  618.391235]  [<ffffffffc12c8a8f>] lod_qos_prep_create+0xc0f/0x1830 [lod]
[  618.392049]  [<ffffffffc12c989a>] ? lod_prepare_inuse+0x1ea/0x300 [lod]
[  618.392812]  [<ffffffffc12c9c0d>] lod_prepare_create+0x25d/0x360 [lod]
[  618.393661]  [<ffffffffc12bbdce>] lod_declare_striped_create+0x1ee/0x970 [lod]
[  618.394541]  [<ffffffffc12ca1dc>] ? lod_sub_declare_create+0xdc/0x210 [lod]
[  618.395386]  [<ffffffffc12c00e4>] lod_declare_create+0x204/0x590 [lod]
[  618.396203]  [<ffffffffc133139f>] mdd_declare_create_object_internal+0xdf/0x2f0 [mdd]
[  618.397112]  [<ffffffffc1321b53>] mdd_declare_create+0x53/0xe20 [mdd]
[  618.397925]  [<ffffffffc1325e69>] mdd_create+0x879/0x1410 [mdd]
[  618.398641]  [<ffffffffc11db305>] mdt_reint_open+0x1a45/0x2890 [mdt]
[  618.399495]  [<ffffffffc0c1e087>] ? upcall_cache_get_entry+0x3f7/0x8f0 [obdclass]
[  618.400369]  [<ffffffffc11beb53>] ? ucred_set_jobid+0x53/0x70 [mdt]
[  618.401176]  [<ffffffffc11cf410>] mdt_reint_rec+0x80/0x210 [mdt]
[  618.401861]  [<ffffffffc11aef8b>] mdt_reint_internal+0x5fb/0x9c0 [mdt]
[  618.402628]  [<ffffffffc11bb457>] mdt_intent_reint+0x157/0x420 [mdt]
[  618.403504]  [<ffffffffc11b20b2>] mdt_intent_opc+0x442/0xad0 [mdt]
[  618.404332]  [<ffffffffc0e3bb90>] ? lustre_swab_ldlm_intent+0x0/0x20 [ptlrpc]
[  618.405167]  [<ffffffffc11b9c73>] mdt_intent_policy+0x1a3/0x360 [mdt]
[  618.405990]  [<ffffffffc0dea2fa>] ldlm_lock_enqueue+0x38a/0x970 [ptlrpc]
[  618.406786]  [<ffffffffc0e13a33>] ldlm_handle_enqueue0+0x8f3/0x1400 [ptlrpc]
[  618.407723]  [<ffffffffc0e3bc10>] ? lustre_swab_ldlm_request+0x0/0x30 [ptlrpc]
[  618.408607]  [<ffffffffc0e99752>] tgt_enqueue+0x62/0x210 [ptlrpc]
[  618.409414]  [<ffffffffc0ea1965>] tgt_request_handle+0x925/0x13b0 [ptlrpc]
[  618.410241]  [<ffffffffc0e45c7e>] ptlrpc_server_handle_request+0x24e/0xab0 [ptlrpc]
[  618.411197]  [<ffffffff810bc0f8>] ? __wake_up_common+0x58/0x90
[  618.411936]  [<ffffffffc0e49422>] ptlrpc_main+0xa92/0x1e40 [ptlrpc]
[  618.412664]  [<ffffffff810c0d30>] ? finish_task_switch+0x50/0x160
[  618.413477]  [<ffffffffc0e48990>] ? ptlrpc_main+0x0/0x1e40 [ptlrpc]
[  618.414215]  [<ffffffff810b252f>] kthread+0xcf/0xe0
[  618.414856]  [<ffffffff810b2460>] ? kthread+0x0/0xe0
[  618.415446]  [<ffffffff816b8798>] ret_from_fork+0x58/0x90
[  618.416169]  [<ffffffff810b2460>] ? kthread+0x0/0xe0
[  618.416780] 
[  618.416974] LustreError: dumping log to /tmp/lustre-log.1516576127.12418
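To compare the two dumps, one option is to drop the frames the kernel marks with `?` (unreliable entries found on the stack) and look at the first frame that belongs to a loadable module in each thread. The sketch below is illustrative only: `reliable_frames` and `first_module_frame` are hypothetical helpers, and the sample text is an abridged copy of the top frames from the two traces above.

```python
import re

# Abridged copy of the top frames of both traces from this ticket.
CONSOLE = """\
[  618.126713] Pid: 13443, comm: mdt00_003
[  618.128418]  [<ffffffff816ab6b9>] schedule+0x29/0x70
[  618.129587]  [<ffffffff816a9004>] schedule_timeout+0x174/0x2c0
[  618.130846]  [<ffffffffc0abdd47>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
[  618.134850]  [<ffffffffc13838d0>] osp_precreate_reserve+0x2e0/0x810 [osp]
[  618.137320]  [<ffffffffc1378c53>] osp_declare_create+0x193/0x590 [osp]
[  618.142569]  [<ffffffffc12c44ba>] lod_alloc_rr.constprop.18+0x70a/0x1000 [lod]
[  618.379102] Pid: 12418, comm: mdt00_002
[  618.380078]  [<ffffffff816ab6b9>] schedule+0x29/0x70
[  618.380662]  [<ffffffff816a9004>] schedule_timeout+0x174/0x2c0
[  618.382976]  [<ffffffffc0ab2eb1>] ? cfs_block_sigsinv+0x71/0xa0 [libcfs]
[  618.383754]  [<ffffffffc13838d0>] osp_precreate_reserve+0x2e0/0x810 [osp]
[  618.385399]  [<ffffffffc1378c53>] osp_declare_create+0x193/0x590 [osp]
"""

PID_RE = re.compile(r"Pid: (?P<pid>\d+), comm: (?P<comm>\S+)")
FRAME_RE = re.compile(
    r"\]\s+\[<[0-9a-f]+>\]\s+(?P<unsure>\? )?"
    r"(?P<sym>[\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+(?: \[(?P<mod>\w+)\])?"
)


def reliable_frames(text):
    """Map (pid, comm) to its list of (symbol, module) frames, skipping '?' frames."""
    traces = {}
    current = None
    for line in text.splitlines():
        m = PID_RE.search(line)
        if m:
            current = (int(m.group("pid")), m.group("comm"))
            traces[current] = []
            continue
        m = FRAME_RE.search(line)
        if m and current is not None and not m.group("unsure"):
            traces[current].append((m.group("sym"), m.group("mod")))
    return traces


def first_module_frame(frames):
    """Return the first frame that lives in a module (module is None for core kernel)."""
    return next((f for f in frames if f[1] is not None), None)


traces = reliable_frames(CONSOLE)
for (pid, comm), frames in traces.items():
    print(pid, comm, first_module_frame(frames))
```

Applied to the excerpt, both `mdt00_003` and `mdt00_002` report `osp_precreate_reserve` in the `osp` module as the first module frame, so both service threads are sleeping in the same place while handling an open/create request.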