[LU-16176] replay-dual test_26: timeout "Network is sluggish?" Created: 20/Sep/22  Updated: 20/Sep/22

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Maloo Assignee: WC Triage
Resolution: Unresolved Votes: 0
Labels: None

Severity: 3
Rank (Obsolete): 9223372036854775807

Description

This issue was created by maloo for eaujames <eaujames@ddn.com>

This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/aab59261-7e97-4100-80ca-8ce857853721

test_26 failed with the following error:

Timeout occurred after 125 minutes, last suite running was replay-dual

Test log:

11:57:35 (1663329455) targets are mounted
11:57:35 (1663329455) facet_failover done
CMD: onyx-99vm1.onyx.whamcloud.com,onyx-99vm2 PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/usr/lib64/lustre/utils/gss:/usr/lib64/lustre/utils:/usr/lib64/openmpi/bin:/usr/share/Modules/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/usr/sbin:/sbin:/bin::/sbin:/bin:/usr/sbin: NAME=autotest_config bash rpc.sh wait_import_state_mount \(FULL\|IDLE\) mdc.lustre-MDT0000-mdc-*.mds_server_uuid 
onyx-99vm1: onyx-99vm1.onyx.whamcloud.com: executing wait_import_state_mount (FULL|IDLE) mdc.lustre-MDT0000-mdc-*.mds_server_uuid
onyx-99vm2: onyx-99vm2.onyx.whamcloud.com: executing wait_import_state_mount (FULL|IDLE) mdc.lustre-MDT0000-mdc-*.mds_server_uuid
onyx-99vm1: CMD: onyx-99vm1.onyx.whamcloud.com lctl get_param -n at_max
onyx-99vm1: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
onyx-99vm2: CMD: onyx-99vm2.onyx.whamcloud.com lctl get_param -n at_max
onyx-99vm2: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
/usr/lib64/lustre/tests/replay-dual.sh: line 936: 119703 Killed                  ( local tar_dir=$DIR/$tdir/run_tar; while true; do
    test_mkdir -p -c$MDSCOUNT $tar_dir || break; if [ $MDSCOUNT -ge 2 ]; then
        $LFS setdirstripe -D -c$MDSCOUNT $tar_dir || error "set default dirstripe failed";
    fi; cd $tar_dir || break; tar cf - /etc | tar xf - || error "tar failed"; cd $DIR/$tdir || break; rm -rf $tar_dir || break;
done )
/usr/lib64/lustre/tests/replay-dual.sh: line 936: 119704 Killed                  ( local dbench_dir=$DIR2/$tdir/run_dbench; while true; do
    test_mkdir -p -c$MDSCOUNT $dbench_dir || break; if [ $MDSCOUNT -ge 2 ]; then
        $LFS setdirstripe -D -c$MDSCOUNT $dbench_dir || error "set default dirstripe failed";
    fi; cd $dbench_dir || break; rundbench 1 -D $dbench_dir -t 100 &> /dev/null || break; cd $DIR/$tdir || break; rm -rf $dbench_dir || break;
done )
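
For readability, the two processes killed above are test_26's background workloads: a tar loop and a dbench loop that run while mds1 is failed over five times. Below is a reformatted sketch of the same loops, reconstructed directly from the log lines above (test_mkdir, error and rundbench are test-framework.sh helpers; $MDSCOUNT, $DIR, $DIR2, $tdir and $LFS come from the test environment):

    # tar workload: repeatedly untar /etc into a striped dir, then remove it
    run_tar_loop() {
        local tar_dir=$DIR/$tdir/run_tar
        while true; do
            test_mkdir -p -c$MDSCOUNT $tar_dir || break
            if [ $MDSCOUNT -ge 2 ]; then
                # default-stripe new subdirectories across all MDTs
                $LFS setdirstripe -D -c$MDSCOUNT $tar_dir ||
                    error "set default dirstripe failed"
            fi
            cd $tar_dir || break
            tar cf - /etc | tar xf - || error "tar failed"
            cd $DIR/$tdir || break
            rm -rf $tar_dir || break
        done
    }

    # dbench workload: same create/run/remove pattern on the second mount ($DIR2)
    run_dbench_loop() {
        local dbench_dir=$DIR2/$tdir/run_dbench
        while true; do
            test_mkdir -p -c$MDSCOUNT $dbench_dir || break
            if [ $MDSCOUNT -ge 2 ]; then
                $LFS setdirstripe -D -c$MDSCOUNT $dbench_dir ||
                    error "set default dirstripe failed"
            fi
            cd $dbench_dir || break
            rundbench 1 -D $dbench_dir -t 100 &> /dev/null || break
            cd $DIR/$tdir || break
            rm -rf $dbench_dir || break
        done
    }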

Client hang + evictions:

[Fri Sep 16 11:57:02 2022] Lustre: DEBUG MARKER: test_26 fail mds1 5 times
[Fri Sep 16 11:57:28 2022] LustreError: 128420:0:(ldlm_resource.c:1125:ldlm_resource_complain()) MGC10.240.28.47@tcp: namespace resource [0x65727473756c:0x2:0x0].0x0 (0000000042d378ae) refcount nonzero (2) after lock cleanup; forcing cleanup.
[Fri Sep 16 11:57:35 2022] Lustre: DEBUG MARKER: PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/us
[Fri Sep 16 11:57:37 2022] Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-99vm1.onyx.whamcloud.com: executing wait_import_state_mount \(FULL\|IDLE\) mdc.lustre-MDT0000-mdc-\*.mds_server_uuid
[Fri Sep 16 11:57:37 2022] Lustre: DEBUG MARKER: onyx-99vm1.onyx.whamcloud.com: executing wait_import_state_mount (FULL|IDLE) mdc.lustre-MDT0000-mdc-*.mds_server_uuid
[Fri Sep 16 11:57:37 2022] Lustre: DEBUG MARKER: lctl get_param -n at_max
[Fri Sep 16 11:57:38 2022] Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-\*.mds_server_uuid in FULL state after 0 sec
[Fri Sep 16 11:57:38 2022] Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
[Fri Sep 16 11:57:58 2022] LustreError: 167-0: lustre-OST0000-osc-ffff970f6a66e000: This client was evicted by lustre-OST0000; in progress operations using this service will fail.
[Fri Sep 16 11:58:19 2022] LustreError: 129930:0:(import.c:355:ptlrpc_invalidate_import()) lustre-OST0003_UUID: timeout waiting for callback (1 != 0)
[Fri Sep 16 11:58:19 2022] LustreError: 129930:0:(import.c:378:ptlrpc_invalidate_import()) @@@ still on sending list  req@00000000bafea35b x1744123210643072/t0(0) o4->lustre-OST0003-osc-ffff970f6a66e000@10.240.28.46@tcp:6/4 lens 4584/448 e 0 to 0 dl 1663329411 ref 2 fl Rpc:REQ/0/ffffffff rc -5/-1 job:'dbench.0'
[Fri Sep 16 11:58:19 2022] LustreError: 129930:0:(import.c:389:ptlrpc_invalidate_import()) lustre-OST0003_UUID: Unregistering RPCs found (0). Network is sluggish? Waiting for them to error out.
[Fri Sep 16 11:58:27 2022] INFO: task kworker/u4:2:75980 blocked for more than 120 seconds.
[Fri Sep 16 11:58:27 2022]       Tainted: G           OE    --------- -  - 4.18.0-348.7.1.el8_5.x86_64 #1
[Fri Sep 16 11:58:27 2022] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Fri Sep 16 11:58:27 2022] task:kworker/u4:2    state:D stack:    0 pid:75980 ppid:     2 flags:0x80004080
[Fri Sep 16 11:58:27 2022] Workqueue: writeback wb_workfn (flush-lustre-ffff970f6a)
[Fri Sep 16 11:58:27 2022] Call Trace:
[Fri Sep 16 11:58:27 2022]  __schedule+0x2bd/0x760
[Fri Sep 16 11:58:27 2022]  schedule+0x37/0xa0
[Fri Sep 16 11:58:27 2022]  rwsem_down_read_slowpath+0x360/0x3d0
[Fri Sep 16 11:58:27 2022]  lov_io_init+0xbe/0x330 [lov]
[Fri Sep 16 11:58:27 2022]  cl_io_init0.isra.14+0x86/0x150 [obdclass]
[Fri Sep 16 11:58:27 2022]  cl_sync_file_range+0x247/0x340 [lustre]
[Fri Sep 16 11:58:27 2022]  ll_writepages+0x70/0x1e0 [lustre]
[Fri Sep 16 11:58:27 2022]  do_writepages+0x41/0xd0
[Fri Sep 16 11:58:27 2022]  __writeback_single_inode+0x39/0x2f0
[Fri Sep 16 11:58:27 2022]  writeback_sb_inodes+0x1e6/0x450
[Fri Sep 16 11:58:27 2022]  __writeback_inodes_wb+0x5f/0xc0
[Fri Sep 16 11:58:27 2022]  wb_writeback+0x25b/0x2f0
[Fri Sep 16 11:58:27 2022]  ? get_nr_inodes+0x35/0x50
[Fri Sep 16 11:58:27 2022]  wb_workfn+0x37a/0x4c0
[Fri Sep 16 11:58:27 2022]  ? __switch_to_asm+0x35/0x70
[Fri Sep 16 11:58:27 2022]  ? __switch_to_asm+0x41/0x70
[Fri Sep 16 11:58:27 2022]  ? __switch_to_asm+0x35/0x70
[Fri Sep 16 11:58:27 2022]  ? __switch_to_asm+0x41/0x70
[Fri Sep 16 11:58:27 2022]  ? __switch_to_asm+0x35/0x70
[Fri Sep 16 11:58:27 2022]  ? __switch_to_asm+0x41/0x70
[Fri Sep 16 11:58:27 2022]  ? __switch_to_asm+0x35/0x70
[Fri Sep 16 11:58:27 2022]  ? __switch_to_asm+0x41/0x70
[Fri Sep 16 11:58:27 2022]  process_one_work+0x1a7/0x360
[Fri Sep 16 11:58:27 2022]  worker_thread+0x30/0x390
[Fri Sep 16 11:58:27 2022]  ? create_worker+0x1a0/0x1a0
[Fri Sep 16 11:58:27 2022]  kthread+0x116/0x130
[Fri Sep 16 11:58:27 2022]  ? kthread_flush_work_fn+0x10/0x10
[Fri Sep 16 11:58:27 2022]  ret_from_fork+0x35/0x40
[Fri Sep 16 11:58:39 2022] LustreError: 129930:0:(import.c:355:ptlrpc_invalidate_import()) lustre-OST0003_UUID: timeout waiting for callback (1 != 0)
[Fri Sep 16 11:58:39 2022] LustreError: 129930:0:(import.c:378:ptlrpc_invalidate_import()) @@@ still on sending list  req@00000000bafea35b x1744123210643072/t0(0) o4->lustre-OST0003-osc-ffff970f6a66e000@10.240.28.46@tcp:6/4 lens 4584/448 e 0 to 0 dl 1663329411 ref 2 fl Rpc:REQ/0/ffffffff rc -5/-1 job:'dbench.0'
[Fri Sep 16 11:58:39 2022] LustreError: 129930:0:(import.c:389:ptlrpc_invalidate_import()) lustre-OST0003_UUID: Unregistering RPCs found (0). Network is sluggish? Waiting for them to error out.
[Fri Sep 16 11:59:00 2022] LustreError: 129930:0:(import.c:355:ptlrpc_invalidate_import()) lustre-OST0003_UUID: timeout waiting for callback (1 != 0)
[Fri Sep 16 11:59:00 2022] LustreError: 129930:0:(import.c:378:ptlrpc_invalidate_import()) @@@ still on sending list  req@00000000bafea35b x1744123210643072/t0(0) o4->lustre-OST0003-osc-ffff970f6a66e000@10.240.28.46@tcp:6/4 lens 4584/448 e 0 to 0 dl 1663329411 ref 2 fl Rpc:REQ/0/ffffffff rc -5/-1 job:'dbench.0'
[Fri Sep 16 11:59:00 2022] LustreError: 129930:0:(import.c:389:ptlrpc_invalidate_import()) lustre-OST0003_UUID: Unregistering RPCs found (0). Network is sluggish? Waiting for them to error out.
[Fri Sep 16 11:59:15 2022] LustreError: 129934:0:(ldlm_resource.c:1125:ldlm_resource_complain()) MGC10.240.28.47@tcp: namespace resource [0x65727473756c:0x2:0x0].0x0 (0000000042d378ae) refcount nonzero (1) after lock cleanup; forcing cleanup.
[Fri Sep 16 11:59:20 2022] LustreError: 129930:0:(import.c:355:ptlrpc_invalidate_import()) lustre-OST0003_UUID: timeout waiting for callback (1 != 0)
[Fri Sep 16 11:59:20 2022] LustreError: 129930:0:(import.c:378:ptlrpc_invalidate_import()) @@@ still on sending list  req@00000000bafea35b x1744123210643072/t0(0) o4->lustre-OST0003-osc-ffff970f6a66e000@10.240.28.46@tcp:6/4 lens 4584/448 e 0 to 0 dl 1663329411 ref 2 fl Rpc:REQ/0/ffffffff rc -5/-1 job:'dbench.0'
[Fri Sep 16 11:59:20 2022] LustreError: 129930:0:(import.c:389:ptlrpc_invalidate_import()) lustre-OST0003_UUID: Unregistering RPCs found (0). Network is sluggish? Waiting for them to error out.
[Fri Sep 16 11:59:30 2022] LustreError: 167-0: lustre-OST0004-osc-ffff970f6a66e000: This client was evicted by lustre-OST0004; in progress operations using this service will fail.
[Fri Sep 16 11:59:30 2022] LustreError: Skipped 3 previous similar messages
[Fri Sep 16 11:59:41 2022] LustreError: 129930:0:(import.c:355:ptlrpc_invalidate_import()) lustre-OST0003_UUID: timeout waiting for callback (1 != 0)
[Fri Sep 16 11:59:41 2022] LustreError: 129930:0:(import.c:378:ptlrpc_invalidate_import()) @@@ still on sending list  req@00000000bafea35b x1744123210643072/t0(0) o4->lustre-OST0003-osc-ffff970f6a66e000@10.240.28.46@tcp:6/4 lens 4584/448 e 0 to 0 dl 1663329411 ref 2 fl Rpc:REQ/0/ffffffff rc -5/-1 job:'dbench.0'
[Fri Sep 16 11:59:41 2022] LustreError: 129930:0:(import.c:389:ptlrpc_invalidate_import()) lustre-OST0003_UUID: Unregistering RPCs found (0). Network is sluggish? Waiting for them to error out.
[Fri Sep 16 12:00:01 2022] LustreError: 129930:0:(import.c:355:ptlrpc_invalidate_import()) lustre-OST0003_UUID: timeout waiting for callback (1 != 0)
[Fri Sep 16 12:00:01 2022] LustreError: 129930:0:(import.c:378:ptlrpc_invalidate_import()) @@@ still on sending list  req@00000000bafea35b x1744123210643072/t0(0) o4->lustre-OST0003-osc-ffff970f6a66e000@10.240.28.46@tcp:6/4 lens 4584/448 e 0 to 0 dl 1663329411 ref 2 fl Rpc:REQ/0/ffffffff rc -5/-1 job:'dbench.0'
[Fri Sep 16 12:00:01 2022] LustreError: 129930:0:(import.c:389:ptlrpc_invalidate_import()) lustre-OST0003_UUID: Unregistering RPCs found (0). Network is sluggish? Waiting for them to error out.
[Fri Sep 16 12:00:22 2022] LustreError: 129930:0:(import.c:355:ptlrpc_invalidate_import()) lustre-OST0003_UUID: timeout waiting for callback (1 != 0)
[Fri Sep 16 12:00:22 2022] LustreError: 129930:0:(import.c:378:ptlrpc_invalidate_import()) @@@ still on sending list  req@00000000bafea35b x1744123210643072/t0(0) o4->lustre-OST0003-osc-ffff970f6a66e000@10.240.28.46@tcp:6/4 lens 4584/448 e 0 to 0 dl 1663329411 ref 2 fl Rpc:REQ/0/ffffffff rc -5/-1 job:'dbench.0'
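
Note that the same write RPC (xid x1744123210643072, job 'dbench.0') stays on the sending list across every ptlrpc_invalidate_import() retry above, so invalidation of the evicted lustre-OST0003 import never completes for as long as the excerpt runs, and writeback stays blocked behind it (the hung kworker in cl_sync_file_range). That appears to be what eventually trips the 125-minute suite timeout. If this reproduces, something like the following could capture the client-side import state for triage (a minimal sketch, not from the original report; the lctl parameter paths are the standard osc ones):

    # Run on the hung client. Parameter paths are standard lctl ones, but
    # this exact invocation is a suggestion, not part of the original report.
    lctl get_param osc.lustre-OST0003-osc-*.state    # connection state history
    lctl get_param osc.lustre-OST0003-osc-*.import   # in-flight/unregistering RPCs
    lctl get_param osc.*.rpc_stats                   # per-OSC RPC distribution
    echo t > /proc/sysrq-trigger                     # dump all task stacks to dmesg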

Several other test sessions show similar symptoms:
https://testing.whamcloud.com/sub_tests/c66f8021-f1c7-4a32-aeb4-c4089e285d40
https://testing.whamcloud.com/sub_tests/0cc71756-9e3d-435a-b9cc-abdfb7092970
https://testing.whamcloud.com/sub_tests/3ee2a550-fc29-4096-bb82-b15dc3dbb151
https://testing.whamcloud.com/sub_tests/aa2fae70-180d-491a-82d0-aa5b65482493
https://testing.whamcloud.com/sub_tests/89610b7b-87e2-48de-9f3d-ba73a946f08c

VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV
replay-dual test_26 - Timeout occurred after 125 minutes, last suite running was replay-dual

