[LU-17377] racer test_1: Timeout occurred Created: 19/Dec/23  Updated: 25/Jan/24

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.15.4
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Maloo Assignee: WC Triage
Resolution: Unresolved Votes: 0
Labels: None

Severity: 3
Rank (Obsolete): 9223372036854775807

Description

This issue was created by maloo for sarah <sarah@whamcloud.com>

This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/36a3b3ec-c78e-4534-b151-7a623637c44b

test_1 failed with the following error:

Timeout occurred after 784 minutes, last suite running was racer

Test session details:
clients: https://build.whamcloud.com/job/lustre-b2_15/77 - 4.18.0-477.15.1.el8_8.x86_64
servers: https://build.whamcloud.com/job/lustre-b2_15/77 - 4.18.0-477.15.1.el8_lustre.x86_64
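For reference, a minimal local reproduction sketch, assuming the standard lustre/tests harness on an already-configured test cluster (the duration matches the DURATION=900 debug marker below; the exact invocation is an assumption, not taken from this report):

  # run only the racer suite with the same 15-minute duration (assumed invocation)
  cd lustre/tests
  DURATION=900 ./auster -v racer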

MDS dmesg:

[Tue Nov 28 11:19:35 2023] Lustre: DEBUG MARKER: /usr/sbin/lctl mark == racer test 1: racer on clients: trevis-51vm4.trevis.whamcloud.com,trevis-51vm5 DURATION=900 ========================================================== 11:19:36 \(1701170376\)
[Tue Nov 28 11:19:35 2023] Lustre: DEBUG MARKER: == racer test 1: racer on clients: trevis-51vm4.trevis.whamcloud.com,trevis-51vm5 DURATION=900 ========================================================== 11:19:36 (1701170376)
[Tue Nov 28 11:19:36 2023] Lustre: 2189800:0:(mdt_recovery.c:200:mdt_req_from_lrd()) @@@ restoring transno  req@00000000780159d6 x1783787460031360/t399431959257(0) o101->8cf5fb4b-4b34-4192-888d-2669cccef68f@10.240.40.111@tcp:608/0 lens 376/816 e 0 to 0 dl 1701170383 ref 1 fl Interpret:H/2/0 rc 0/0 job:'dd.0'
[Tue Nov 28 11:20:31 2023] LustreError: 2200161:0:(mdt_handler.c:754:mdt_pack_acl2body()) lustre-MDT0000: unable to read [0x20002d2c1:0x431:0x0] ACL: rc = -2
[Tue Nov 28 11:20:37 2023] LustreError: 2200160:0:(mdt_handler.c:754:mdt_pack_acl2body()) lustre-MDT0000: unable to read [0x20002d2c1:0x540:0x0] ACL: rc = -2
[Tue Nov 28 11:30:07 2023] LustreError: 2188308:0:(osd_handler.c:224:osd_trans_start()) lustre-MDT0000: can't assign tx: rc = -2
[Tue Nov 28 11:31:32 2023] Lustre: mdt00_007: service thread pid 2200131 was inactive for 42.604 seconds. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
[Tue Nov 28 11:31:32 2023] Pid: 2200143, comm: mdt00_019 4.18.0-477.15.1.el8_lustre.x86_64 #1 SMP Thu Nov 23 04:41:42 UTC 2023
[Tue Nov 28 11:31:32 2023] Lustre: Skipped 1 previous similar message
[Tue Nov 28 11:31:32 2023] Call Trace TBD:
[Tue Nov 28 11:31:32 2023] [<0>] ldlm_completion_ast+0x7b0/0x900 [ptlrpc]
[Tue Nov 28 11:31:32 2023] [<0>] ldlm_cli_enqueue_local+0x30b/0x870 [ptlrpc]
[Tue Nov 28 11:31:32 2023] [<0>] mdt_object_local_lock+0x50e/0xb30 [mdt]
[Tue Nov 28 11:31:32 2023] [<0>] mdt_object_lock_internal+0x191/0x4a0 [mdt]
[Tue Nov 28 11:31:32 2023] [<0>] mdt_getattr_name_lock+0x8f3/0x2380 [mdt]
[Tue Nov 28 11:31:32 2023] [<0>] mdt_intent_getattr+0x27e/0x470 [mdt]
[Tue Nov 28 11:31:32 2023] [<0>] mdt_intent_opc+0x12c/0xbf0 [mdt]
[Tue Nov 28 11:31:32 2023] [<0>] mdt_intent_policy+0x20b/0x3a0 [mdt]
[Tue Nov 28 11:31:32 2023] [<0>] ldlm_lock_enqueue+0x4e8/0xaa0 [ptlrpc]
[Tue Nov 28 11:31:32 2023] [<0>] ldlm_handle_enqueue0+0x634/0x1750 [ptlrpc]
[Tue Nov 28 11:31:32 2023] [<0>] tgt_enqueue+0xa4/0x220 [ptlrpc]
[Tue Nov 28 11:31:32 2023] [<0>] tgt_request_handle+0xccd/0x1b10 [ptlrpc]
[Tue Nov 28 11:31:32 2023] [<0>] ptlrpc_server_handle_request+0x323/0xbe0 [ptlrpc]
[Tue Nov 28 11:31:32 2023] [<0>] ptlrpc_main+0xc0f/0x1570 [ptlrpc]
[Tue Nov 28 11:31:32 2023] [<0>] kthread+0x134/0x150
[Tue Nov 28 11:31:32 2023] [<0>] ret_from_fork+0x35/0x40
[Tue Nov 28 11:31:32 2023] Pid: 2200131, comm: mdt00_007 4.18.0-477.15.1.el8_lustre.x86_64 #1 SMP Thu Nov 23 04:41:42 UTC 2023
[Tue Nov 28 11:31:32 2023] Call Trace TBD:
[Tue Nov 28 11:31:32 2023] [<0>] ldlm_completion_ast+0x7b0/0x900 [ptlrpc]
[Tue Nov 28 11:31:32 2023] [<0>] ldlm_cli_enqueue_local+0x30b/0x870 [ptlrpc]
[Tue Nov 28 11:31:32 2023] [<0>] mdt_object_local_lock+0x50e/0xb30 [mdt]
[Tue Nov 28 11:31:32 2023] [<0>] mdt_object_lock_internal+0x191/0x4a0 [mdt]
[Tue Nov 28 11:31:32 2023] [<0>] mdt_getattr_name_lock+0x8f3/0x2380 [mdt]
[Tue Nov 28 11:31:32 2023] [<0>] mdt_intent_getattr+0x27e/0x470 [mdt]
[Tue Nov 28 11:31:32 2023] [<0>] mdt_intent_opc+0x12c/0xbf0 [mdt]
[Tue Nov 28 11:31:32 2023] [<0>] mdt_intent_policy+0x20b/0x3a0 [mdt]
[Tue Nov 28 11:31:32 2023] [<0>] ldlm_lock_enqueue+0x4e8/0xaa0 [ptlrpc]
[Tue Nov 28 11:31:32 2023] [<0>] ldlm_handle_enqueue0+0x634/0x1750 [ptlrpc]
[Tue Nov 28 11:31:32 2023] [<0>] tgt_enqueue+0xa4/0x220 [ptlrpc]
[Tue Nov 28 11:31:32 2023] [<0>] tgt_request_handle+0xccd/0x1b10 [ptlrpc]
[Tue Nov 28 11:31:32 2023] [<0>] ptlrpc_server_handle_request+0x323/0xbe0 [ptlrpc]
[Tue Nov 28 11:31:32 2023] [<0>] ptlrpc_main+0xc0f/0x1570 [ptlrpc]
[Tue Nov 28 11:31:32 2023] [<0>] kthread+0x134/0x150
[Tue Nov 28 11:31:32 2023] [<0>] ret_from_fork+0x35/0x40
[Tue Nov 28 11:31:32 2023] Pid: 2200133, comm: mdt00_009 4.18.0-477.15.1.el8_lustre.x86_64 #1 SMP Thu Nov 23 04:41:42 UTC 2023
[Tue Nov 28 11:31:32 2023] Call Trace TBD:
[Tue Nov 28 11:31:32 2023] [<0>] ldlm_completion_ast+0x7b0/0x900 [ptlrpc]
[Tue Nov 28 11:31:32 2023] [<0>] ldlm_cli_enqueue_local+0x30b/0x870 [ptlrpc]
[Tue Nov 28 11:31:32 2023] Lustre: mdt00_002: service thread pid 2188295 was inactive for 42.652 seconds. Watchdog stack traces are limited to 3 per 300 seconds, skipping this one.
[Tue Nov 28 11:31:32 2023] [<0>] mdt_object_local_lock+0x50e/0xb30 [mdt]
[Tue Nov 28 11:31:32 2023] [<0>] mdt_object_lock_internal+0x191/0x4a0 [mdt]
[Tue Nov 28 11:31:32 2023] [<0>] mdt_getattr_name_lock+0x8f3/0x2380 [mdt]
[Tue Nov 28 11:31:32 2023] [<0>] mdt_intent_getattr+0x27e/0x470 [mdt]
[Tue Nov 28 11:31:32 2023] [<0>] mdt_intent_opc+0x12c/0xbf0 [mdt]
[Tue Nov 28 11:31:32 2023] [<0>] mdt_intent_policy+0x20b/0x3a0 [mdt]
[Tue Nov 28 11:31:32 2023] [<0>] ldlm_lock_enqueue+0x4e8/0xaa0 [ptlrpc]
[Tue Nov 28 11:31:32 2023] [<0>] ldlm_handle_enqueue0+0x634/0x1750 [ptlrpc]
[Tue Nov 28 11:31:32 2023] [<0>] tgt_enqueue+0xa4/0x220 [ptlrpc]
[Tue Nov 28 11:31:32 2023] [<0>] tgt_request_handle+0xccd/0x1b10 [ptlrpc]
[Tue Nov 28 11:31:32 2023] [<0>] ptlrpc_server_handle_request+0x323/0xbe0 [ptlrpc]
[Tue Nov 28 11:31:32 2023] [<0>] ptlrpc_main+0xc0f/0x1570 [ptlrpc]
[Tue Nov 28 11:31:32 2023] [<0>] kthread+0x134/0x150
[Tue Nov 28 11:31:32 2023] [<0>] ret_from_fork+0x35/0x40
[Tue Nov 28 11:36:06 2023] LustreError: 2200143:0:(ldlm_request.c:124:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1701171050, 317s ago); not entering recovery in server code, just going back to sleep ns: mdt-lustre-MDT0000_UUID lock: 000000008db36cb4/0x6256bf2f7e02c3f1 lrc: 3/1,0 mode: --/PR res: [0x20002d2c2:0x9:0x0].0x0 bits 0x13/0x0 rrc: 32 type: IBT gid 0 flags: 0x40210000000000 nid: local remote: 0x0 expref: -99 pid: 2200143 timeout: 0 lvb_type: 0
[Tue Nov 28 11:36:06 2023] LustreError: 2200143:0:(ldlm_request.c:124:ldlm_expired_completion_wait()) Skipped 1 previous similar message
[Tue Nov 28 11:36:06 2023] LustreError: dumping log to /tmp/lustre-log.1701171367.2200131
[Tue Nov 28 11:40:44 2023] Lustre: 2200163:0:(service.c:1437:ptlrpc_at_send_early_reply()) @@@ Could not add any time (4/4), not sending early reply  req@00000000e8b46235 x1783763196782336/t0(0) o101->e5dd43cf-d40b-4b07-91f8-6b6b6a2c6a1c@10.240.40.112@tcp:365/0 lens 576/3928 e 24 to 0 dl 1701171650 ref 2 fl Interpret:/0/0 rc 0/0 job:'ls.0'
[Tue Nov 28 11:40:50 2023] Lustre: lustre-MDT0000: Client 94c002db-c56e-4796-b0f1-e4af0f5485a1 (at 10.240.40.112@tcp) reconnecting
[Tue Nov 28 11:40:50 2023] Lustre: Skipped 1 previous similar message
[Tue Nov 28 11:50:51 2023] Lustre: lustre-MDT0000: Client 94c002db-c56e-4796-b0f1-e4af0f5485a1 (at 10.240.40.112@tcp) reconnecting
[Tue Nov 28 11:50:51 2023] Lustre: Skipped 3 previous similar messages
[Tue Nov 28 11:56:35 2023] Lustre: mdt00_031: service thread pid 2200158 was inactive for 42.147 seconds. Watchdog stack traces are limited to 3 per 300 seconds, skipping this one.
[Tue Nov 28 11:56:35 2023] ptlrpc_watchdog_fire: 19 callbacks suppressed
[Tue Nov 28 11:56:35 2023] Lustre: mdt00_036: service thread pid 2200163 was inactive for 42.094 seconds. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
[Tue Nov 28 11:56:35 2023] Lustre: Skipped 18 previous similar messages
[Tue Nov 28 11:56:35 2023] Lustre: Skipped 1 previous similar message
[Tue Nov 28 11:56:35 2023] Pid: 2200163, comm: mdt00_036 4.18.0-477.15.1.el8_lustre.x86_64 #1 SMP Thu Nov 23 04:41:42 UTC 2023
[Tue Nov 28 11:56:35 2023] Call Trace TBD:
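The MDS also dumped a binary debug log at 11:36:06 (/tmp/lustre-log.1701171367.2200131). A minimal sketch for decoding it on the MDS, assuming the dump file is still present on the node (the output path is arbitrary):

  # convert the binary Lustre debug dump into readable text for analysis
  lctl debug_file /tmp/lustre-log.1701171367.2200131 /tmp/lustre-log.1701171367.2200131.txt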

VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV
racer test_1 - Timeout occurred after 784 minutes, last suite running was racer



Comments
Comment by Sarah Liu [ 19/Dec/23 ]

OSS dmesg:

[Tue Nov 28 11:19:35 2023] Lustre: DEBUG MARKER: /usr/sbin/lctl mark == racer test 1: racer on clients: trevis-51vm4.trevis.whamcloud.com,trevis-51vm5 DURATION=900 ========================================================== 11:19:36 \(1701170376\)
[Tue Nov 28 11:19:35 2023] Lustre: DEBUG MARKER: == racer test 1: racer on clients: trevis-51vm4.trevis.whamcloud.com,trevis-51vm5 DURATION=900 ========================================================== 11:19:36 (1701170376)
[Tue Nov 28 11:31:28 2023] LustreError: 8:0:(ofd_obd.c:501:ofd_destroy_export()) lustre-OST0001: cli 8cf5fb4b-4b34-4192-888d-2669cccef68f/00000000d36a2675 has 1703936 pending on destroyed export
[Tue Nov 28 11:31:28 2023] LustreError: 8:0:(tgt_grant.c:257:tgt_grant_sanity_check()) ofd_destroy_export: tot_granted 1096876032 != fo_tot_granted 1098579968
[Tue Nov 28 11:31:28 2023] LustreError: 8:0:(tgt_grant.c:260:tgt_grant_sanity_check()) ofd_destroy_export: tot_pending 10223616 != fo_tot_pending 11927552
[Tue Nov 28 11:31:31 2023] LustreError: 1392352:0:(tgt_grant.c:257:tgt_grant_sanity_check()) ofd_statfs: tot_granted 1096876032 != fo_tot_granted 1098579968
[Tue Nov 28 11:31:31 2023] LustreError: 1392352:0:(tgt_grant.c:260:tgt_grant_sanity_check()) ofd_statfs: tot_pending 13631488 != fo_tot_pending 15335424
[Tue Nov 28 11:31:36 2023] LustreError: 1392351:0:(tgt_grant.c:257:tgt_grant_sanity_check()) ofd_statfs: tot_granted 1129250816 != fo_tot_granted 1130954752
[Tue Nov 28 11:31:36 2023] LustreError: 1392351:0:(tgt_grant.c:260:tgt_grant_sanity_check()) ofd_statfs: tot_pending 42598400 != fo_tot_pending 44302336
[Tue Nov 28 11:31:41 2023] LustreError: 1392352:0:(tgt_grant.c:257:tgt_grant_sanity_check()) ofd_statfs: tot_granted 1105395712 != fo_tot_granted 1107099648