[LU-16287] replay-single: test_102d timeout Created: 02/Nov/22  Updated: 03/Nov/22

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Maloo Assignee: WC Triage
Resolution: Unresolved Votes: 0
Labels: None

Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

This issue was created by maloo for Lai Siyao <lai.siyao@whamcloud.com>

This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/aec823a8-3961-47ad-b2b5-d8a97f1a242f

[Wed Nov  2 05:35:20 2022] Lustre: DEBUG MARKER: /usr/sbin/lctl mark == replay-single test 102d: check replay & reconstruction with multiple mod RPCs in flight ========================================================== 05:35:36 \(1667367336\)
[Wed Nov  2 05:35:21 2022] Lustre: DEBUG MARKER: == replay-single test 102d: check replay
[Wed Nov  2 05:35:21 2022] Lustre: DEBUG MARKER: /usr/sbin/lctl set_param fail_loc=0x15a
[Wed Nov  2 05:35:21 2022] Lustre: *** cfs_fail_loc=15a, val=0***
[Wed Nov  2 05:35:21 2022] Lustre: Skipped 5 previous similar messages
[Wed Nov  2 05:35:23 2022] Lustre: DEBUG MARKER: sync; sync; sync
[Wed Nov  2 05:35:24 2022] Lustre: DEBUG MARKER: /usr/sbin/lctl set_param fail_loc=0
[Wed Nov  2 05:35:24 2022] Lustre: DEBUG MARKER: grep -c /mnt/lustre-mds4' ' /proc/mounts || true
[Wed Nov  2 05:35:25 2022] Lustre: DEBUG MARKER: umount -d /mnt/lustre-mds4
[Wed Nov  2 05:35:25 2022] Lustre: lustre-MDT0003: Not available for connect from 10.240.29.171@tcp (stopping)
[Wed Nov  2 05:35:25 2022] Lustre: Skipped 7 previous similar messages
[Wed Nov  2 05:35:29 2022] Lustre: lustre-MDT0003: Not available for connect from 10.240.29.168@tcp (stopping)
[Wed Nov  2 05:35:29 2022] Lustre: Skipped 10 previous similar messages
[Wed Nov  2 05:35:34 2022] Lustre: lustre-MDT0003: Not available for connect from 10.240.29.168@tcp (stopping)
[Wed Nov  2 05:35:34 2022] Lustre: Skipped 12 previous similar messages
[Wed Nov  2 05:35:42 2022] Lustre: lustre-MDT0003: Not available for connect from 0@lo (stopping)
[Wed Nov  2 05:35:42 2022] Lustre: Skipped 24 previous similar messages
[Wed Nov  2 05:35:45 2022] LustreError: 190662:0:(import.c:355:ptlrpc_invalidate_import()) lustre-MDT0001_UUID: timeout waiting for callback (1 != 0)
[Wed Nov  2 05:35:45 2022] LustreError: 190662:0:(import.c:383:ptlrpc_invalidate_import()) @@@ still on delayed list  req@00000000e4286085 x1748352857944832/t0(0) o41->lustre-MDT0001-osp-MDT0003@0@lo:24/4 lens 224/224 e 0 to 0 dl 1670132019 ref 1 fl Rpc:RESQU/0/0 rc -5/-107 job:'osp-pre-1-3.0'
[Wed Nov  2 05:35:45 2022] LustreError: 190662:0:(import.c:389:ptlrpc_invalidate_import()) lustre-MDT0001_UUID: Unregistering RPCs found (0). Network is sluggish? Waiting for them to error out.

The log shows that the umount of /mnt/lustre-mds4 is stuck in ptlrpc_invalidate_import(), and the remaining request is a statfs between MDTs (the o41 request from lustre-MDT0001-osp-MDT0003 shown above).
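For anyone rerunning this locally, the failing subtest can be invoked in isolation through the test framework's ONLY filter. The sketch below assumes a standard llmount.sh test setup; aside from that, it only uses the commands already visible in the debug markers above (the script path and ONLY usage follow the usual lustre/tests conventions and are not taken from this log):

  # rerun just the failing subtest against an already-mounted test filesystem
  cd lustre/tests
  ONLY=102d sh replay-single.sh

  # the sequence that hangs, per the markers above:
  lctl set_param fail_loc=0x15a    # hold replies so multiple mod RPCs stay in flight
  sync; sync; sync
  lctl set_param fail_loc=0
  umount -d /mnt/lustre-mds4       # blocks in ptlrpc_invalidate_import() on the leftover statfs RPC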



 Comments   
Comment by Andreas Dilger [ 03/Nov/22 ]

I think this failure is caused by the patch under test, https://review.whamcloud.com/48584 "LU-16159 lod: cancel update llogs upon recovery abort".

In addition to the 2x autotest failures in replay-single, there were 20x failures in the same replay-single subtest from the Gerrit Janitor:
https://testing.whamcloud.com/test_sessions/ecfb2f85-8d9a-4804-9ce0-ea9e55d4e253
https://testing.whamcloud.com/test_sessions/9afe0fe3-0729-46a8-89c7-b1bcce54a7c4
https://testing.whamcloud.com/gerrit-janitor/26180/results.html

Comment by Lai Siyao [ 03/Nov/22 ]

Indeed, but the Janitor failures are an exception: replay-single test_100c should be skipped on 2.15.52 (as it is in autotest), but it is not skipped by the Janitor.
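(For context, skips of this kind are normally expressed through the test framework's exception lists; a rough sketch, where the variable names are the usual test-framework.sh ones rather than anything taken from this ticket:)

  # inside replay-single.sh, a version-dependent skip is typically added to the exception list:
  ALWAYS_EXCEPT="$ALWAYS_EXCEPT 100c"
  # or passed in when launching the suite by hand:
  EXCEPT="100c" sh replay-single.sh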
