Details
- Type: Bug
- Resolution: Unresolved
- Priority: Minor
- Fix Version/s: None
- Affects Version/s: Lustre 2.10.6, Lustre 2.12.1
- Severity: 3
Description
replay-single test_70f hangs in lctl. Looking at a recent hang, https://testing.whamcloud.com/test_sets/fd5c3ddc-557d-11e9-9720-52540065bddc , the last thing seen in the client test log is:
…
Started lustre-OST0000
CMD: onyx-46vm1.onyx.whamcloud.com,onyx-46vm2 PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/usr/lib64/lustre/utils/gss:/usr/lib64/lustre/utils:/usr/lib64/qt-3.3/bin:/usr/lib64/compat-openmpi16/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/usr/sbin:/sbin:/bin::/sbin:/bin:/usr/sbin: NAME=autotest_config bash rpc.sh wait_import_state_mount \(FULL\|IDLE\) osc.lustre-OST0000-osc-ffff*.ost_server_uuid
onyx-46vm1: == rpc test complete, duration -o sec ================================================================ 05:09:05 (1554181745)
onyx-46vm2: == rpc test complete, duration -o sec ================================================================ 05:09:05 (1554181745)
onyx-46vm1: onyx-46vm1.onyx.whamcloud.com: executing wait_import_state_mount (FULL|IDLE) osc.lustre-OST0000-osc-ffff*.ost_server_uuid
onyx-46vm2: onyx-46vm2.onyx.whamcloud.com: executing wait_import_state_mount (FULL|IDLE) osc.lustre-OST0000-osc-ffff*.ost_server_uuid
onyx-46vm1: CMD: onyx-46vm1.onyx.whamcloud.com lctl get_param -n at_max
onyx-46vm1: osc.lustre-OST0000-osc-ffff*.ost_server_uuid in FULL state after 0 sec
onyx-46vm2: CMD: onyx-46vm2.onyx.whamcloud.com lctl get_param -n at_max
onyx-46vm2: osc.lustre-OST0000-osc-ffff*.ost_server_uuid in FULL state after 0 sec
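For reference, the import state that the rpc.sh helper above is polling can also be checked by hand on the client. The parameter name is taken from the log above; the exact output format may differ between 2.10 and 2.12 clients, so treat this as a rough sketch rather than the framework's exact check:

# Show the OST server UUID together with the current import state; the test
# loops until this reports FULL (or IDLE on newer clients with idle disconnect).
lctl get_param osc.lustre-OST0000-osc-*.ost_server_uuid
# A fuller view of the same import, including its state and connection flags:
lctl get_param osc.lustre-OST0000-osc-*.import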
Looking at the console log on client 1 (vm1), we see lctl hanging:
…
[ 6712.545693] Lustre: DEBUG MARKER: lctl get_param -n at_max
[ 6712.769788] Lustre: DEBUG MARKER: /usr/sbin/lctl mark osc.lustre-OST0000-osc-ffff*.ost_server_uuid in FULL state after 0 sec
[ 6712.985691] Lustre: DEBUG MARKER: osc.lustre-OST0000-osc-ffff*.ost_server_uuid in FULL state after 0 sec
[ 6720.276410] Lustre: DEBUG MARKER: mcreate /mnt/lustre/fsa-$(hostname); rm /mnt/lustre/fsa-$(hostname)
[ 6720.621479] Lustre: DEBUG MARKER: if [ -d /mnt/lustre2 ]; then mcreate /mnt/lustre2/fsa-$(hostname); rm /mnt/lustre2/fsa-$(hostname); fi
[ 6723.273530] Lustre: DEBUG MARKER: local REPLAY BARRIER on lustre-OST0000
[ 6724.493700] Lustre: DEBUG MARKER: /usr/sbin/lctl mark test_70f failing OST 2 times
[ 6724.691853] Lustre: DEBUG MARKER: test_70f failing OST 2 times
[ 6727.305612] Lustre: lustre-OST0000-osc-ffff96191a900000: Connection to lustre-OST0000 (at 10.2.8.199@tcp) was lost; in progress operations using this service will wait for recovery to complete
[ 6727.307316] Lustre: Skipped 8 previous similar messages
[ 6840.242867] INFO: task lctl:18670 blocked for more than 120 seconds.
[ 6840.243565] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 6840.244388] lctl D ffff961939174100 0 18670 18669 0x00000080
[ 6840.245230] Call Trace:
[ 6840.245537] [<ffffffffb5b94a9d>] ? list_del+0xd/0x30
[ 6840.246074] [<ffffffffb5f67bc9>] schedule+0x29/0x70
[ 6840.246655] [<ffffffffb5f656a1>] schedule_timeout+0x221/0x2d0
[ 6840.247264] [<ffffffffb58d64f0>] ? try_to_wake_up+0x190/0x390
[ 6840.247981] [<ffffffffb5f67f7d>] wait_for_completion+0xfd/0x140
[ 6840.248574] [<ffffffffb58d67b0>] ? wake_up_state+0x20/0x20
[ 6840.249239] [<ffffffffc0c9e6fd>] __ldlm_bl_to_thread+0xad/0x150 [ptlrpc]
[ 6840.250015] [<ffffffffc0c9ed2b>] ldlm_bl_to_thread+0x33b/0x510 [ptlrpc]
[ 6840.250706] [<ffffffffc0ca33d9>] ldlm_bl_to_thread_list+0x19/0x20 [ptlrpc]
[ 6840.251431] [<ffffffffc0c9d0c6>] ldlm_cancel_lru+0x76/0x170 [ptlrpc]
[ 6840.252173] [<ffffffffc0c8dbe5>] lru_size_store+0x145/0x480 [ptlrpc]
[ 6840.252910] [<ffffffffc0abc09a>] lustre_attr_store+0x1a/0x20 [obdclass]
[ 6840.253661] [<ffffffffb5acb6e2>] sysfs_kf_write+0x42/0x50
[ 6840.254232] [<ffffffffb5acacbb>] kernfs_fop_write+0xeb/0x160
[ 6840.254926] [<ffffffffb5a410a0>] vfs_write+0xc0/0x1f0
[ 6840.255456] [<ffffffffb5f74d21>] ? system_call_after_swapgs+0xae/0x146
[ 6840.256119] [<ffffffffb5a41ebf>] SyS_write+0x7f/0xf0
[ 6840.256708] [<ffffffffb5f74d21>] ? system_call_after_swapgs+0xae/0x146
[ 6840.257385] [<ffffffffb5f74ddb>] system_call_fastpath+0x22/0x27
[ 6840.258055] [<ffffffffb5f74d21>] ? system_call_after_swapgs+0xae/0x146
…
[ 7276.489872] LustreError: 6024:0:(osc_cache.c:955:osc_extent_wait()) extent ffff96191b551dc0@{[0 -> 0/1023], [3|0|+|rpc|wiuY|ffff961924371280], [28672|1|+|-|ffff96191d8b2fc0|1024|ffff96193a76d140]} lustre-OST0000-osc-ffff96191a900000: wait ext to 0 timedout, recovery in progress?
[ 7276.492281] LustreError: 6024:0:(osc_cache.c:955:osc_extent_wait()) ### extent: ffff96191b551dc0 ns: lustre-OST0000-osc-ffff96191a900000 lock: ffff96191d8b2fc0/0x40b5a0904351ea00 lrc: 3/0,0 mode: PW/PW res: [0x2782:0x0:0x0].0x0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->4095) flags: 0x29400000000 nid: local remote: 0xafa2f3157ef8581d expref: -99 pid: 13958 timeout: 0 lvb_type: 1
[ 7302.224848] Lustre: 14779:0:(client.c:2134:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1554182067/real 1554182067] req@ffff96191b307600 x1629670629270288/t0(0) o4->lustre-OST0000-osc-ffff96191a900000@10.2.8.199@tcp:6/4 lens 4584/448 e 4 to 1 dl 1554182175 ref 2 fl Rpc:X/2/ffffffff rc 0/-1
[ 7302.227597] Lustre: 14779:0:(client.c:2134:ptlrpc_expire_one_request()) Skipped 4 previous similar messages
[ 7320.304861] INFO: task lctl:18670 blocked for more than 120 seconds.
[ 7320.305552] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 7320.306332] lctl D ffff961939174100 0 18670 18669 0x00000080
[ 7320.307084] Call Trace:
[ 7320.307424] [<ffffffffb5b94a9d>] ? list_del+0xd/0x30
[ 7320.307938] [<ffffffffb5f67bc9>] schedule+0x29/0x70
[ 7320.308481] [<ffffffffb5f656a1>] schedule_timeout+0x221/0x2d0
[ 7320.309063] [<ffffffffb58d64f0>] ? try_to_wake_up+0x190/0x390
[ 7320.309782] [<ffffffffb5f67f7d>] wait_for_completion+0xfd/0x140
[ 7320.310382] [<ffffffffb58d67b0>] ? wake_up_state+0x20/0x20
[ 7320.311029] [<ffffffffc0c9e6fd>] __ldlm_bl_to_thread+0xad/0x150 [ptlrpc]
[ 7320.311701] [<ffffffffc0c9ed2b>] ldlm_bl_to_thread+0x33b/0x510 [ptlrpc]
[ 7320.312385] [<ffffffffc0ca33d9>] ldlm_bl_to_thread_list+0x19/0x20 [ptlrpc]
[ 7320.313158] [<ffffffffc0c9d0c6>] ldlm_cancel_lru+0x76/0x170 [ptlrpc]
[ 7320.313793] [<ffffffffc0c8dbe5>] lru_size_store+0x145/0x480 [ptlrpc]
[ 7320.314453] [<ffffffffc0abc09a>] lustre_attr_store+0x1a/0x20 [obdclass]
[ 7320.315184] [<ffffffffb5acb6e2>] sysfs_kf_write+0x42/0x50
[ 7320.315720] [<ffffffffb5acacbb>] kernfs_fop_write+0xeb/0x160
[ 7320.316292] [<ffffffffb5a410a0>] vfs_write+0xc0/0x1f0
[ 7320.316961] [<ffffffffb5f74d21>] ? system_call_after_swapgs+0xae/0x146
[ 7320.317602] [<ffffffffb5a41ebf>] SyS_write+0x7f/0xf0
[ 7320.318106] [<ffffffffb5f74d21>] ? system_call_after_swapgs+0xae/0x146
[ 7320.318813] [<ffffffffb5f74ddb>] system_call_fastpath+0x22/0x27
[ 7320.319404] [<ffffffffb5f74d21>] ? system_call_after_swapgs+0xae/0x146
[ 7409.400854] Lustre: lustre-OST0000-osc-ffff96191a900000: Connection to lustre-OST0000 (at 10.2.8.199@tcp) was lost; in progress operations using this service will wait for recovery to complete
[ 7409.402543] Lustre: Skipped 4 previous similar messages
[ 7409.405522] Lustre: lustre-OST0000-osc-ffff96191a900000: Connection restored to 10.2.8.199@tcp (at 10.2.8.199@tcp)
[ 7409.406583] Lustre: Skipped 5 previous similar messages
[ 7440.319864] INFO: task lctl:18670 blocked for more than 120 seconds.
[ 7440.320764] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
…
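The stack above shows lctl blocked in lru_size_store() -> ldlm_cancel_lru() -> ldlm_bl_to_thread_list(), i.e. a write to the ldlm lru_size tunable waiting for the blocking thread to finish cancelling LRU locks; that cancel in turn appears stuck behind the osc extent flush that times out in the osc_extent_wait() messages above while the OST is failed over. My guess, unverified, is that the hung lctl comes from the test framework's cancel_lru_locks helper. The same code path can be driven by hand with something like:

# Rough reproduction of the write the hung lctl is performing; writing "clear"
# to lru_size asks the client to cancel all unused DLM locks in its LRU.
lctl set_param -n ldlm.namespaces.*osc*.lru_size=clear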
The console log for the OSS (vm3) has many errors. Here are just a few:
…
[ 6629.884030] Lustre: DEBUG MARKER: /usr/sbin/lctl mark osc.lustre-OST0000-osc-ffff*.ost_server_uuid in FULL state after 0 sec
[ 6629.981017] Lustre: DEBUG MARKER: /usr/sbin/lctl mark osc.lustre-OST0000-osc-ffff*.ost_server_uuid in FULL state after 0 sec
[ 6630.095178] Lustre: DEBUG MARKER: osc.lustre-OST0000-osc-ffff*.ost_server_uuid in FULL state after 0 sec
[ 6630.232869] Lustre: DEBUG MARKER: osc.lustre-OST0000-osc-ffff*.ost_server_uuid in FULL state after 0 sec
[ 6636.601540] Lustre: DEBUG MARKER: sync; sync; sync
[ 6638.070508] Lustre: DEBUG MARKER: /usr/sbin/lctl --device lustre-OST0000 notransno
[ 6638.421691] Lustre: DEBUG MARKER: modprobe dm-flakey;
[ 6638.421691] dmsetup targets | grep -q flakey
[ 6638.775038] Lustre: DEBUG MARKER: dmsetup table /dev/mapper/ost1_flakey
[ 6639.121066] Lustre: DEBUG MARKER: dmsetup suspend --nolockfs --noflush /dev/mapper/ost1_flakey
[ 6639.467783] Lustre: DEBUG MARKER: dmsetup load /dev/mapper/ost1_flakey --table "0 4194304 flakey 252:0 0 0 1800 1 drop_writes"
[ 6639.818180] Lustre: DEBUG MARKER: dmsetup resume /dev/mapper/ost1_flakey
[ 6640.180823] Lustre: DEBUG MARKER: /usr/sbin/lctl mark ost1 REPLAY BARRIER on lustre-OST0000
[ 6640.357370] Lustre: DEBUG MARKER: ost1 REPLAY BARRIER on lustre-OST0000
[ 6641.603284] Lustre: DEBUG MARKER: /usr/sbin/lctl mark test_70f failing OST 2 times
[ 6641.800703] Lustre: DEBUG MARKER: test_70f failing OST 2 times
[ 6641.985657] Lustre: DEBUG MARKER: grep -c /mnt/lustre-ost1' ' /proc/mounts || true
[ 6642.334386] Lustre: DEBUG MARKER: umount -d /mnt/lustre-ost1
[ 6642.516911] Lustre: Failing over lustre-OST0000
[ 6643.823863] Lustre: lustre-OST0000: Not available for connect from 10.2.8.198@tcp (stopping)
[ 6643.824741] Lustre: Skipped 1 previous similar message
[ 6646.530498] Lustre: lustre-OST0000: Not available for connect from 10.2.8.200@tcp (stopping)
[ 6646.531419] Lustre: Skipped 1 previous similar message
[ 6653.837143] Lustre: lustre-OST0000: Not available for connect from 10.2.8.198@tcp (stopping)
[ 6653.838059] Lustre: Skipped 2 previous similar messages
[ 6656.520352] Lustre: lustre-OST0000 is waiting for obd_unlinked_exports more than 8 seconds. The obd refcount = 4. Is it stuck?
[ 6656.521512] Lustre: lustre-OST0000: UNLINKED ffff9d8d17319c00 f1ce44c5-b64b-e307-df96-edafd36f5e17 10.2.8.197@tcp 2 (1 0 0) 1 0 0 0: (null) 34359849384 stale:0
[ 6663.853170] Lustre: lustre-OST0000: Not available for connect from 10.2.8.198@tcp (stopping)
[ 6663.854170] Lustre: Skipped 5 previous similar messages
[ 6666.624361] LNet: Service thread pid 23196 was inactive for 40.13s. The thread might be hung, or it might only be slow and will resume later.
Dumping the stack trace for debugging purposes:
[ 6666.626018] Pid: 23196, comm: ll_ost_io00_030 3.10.0-957.el7_lustre.x86_64 #1 SMP Wed Dec 12 15:03:08 UTC 2018
[ 6666.626965] Call Trace:
[ 6666.627234] [<ffffffffc0c3054e>] target_bulk_io+0x4ae/0xac0 [ptlrpc]
[ 6666.627927] [<ffffffffc0cdc5df>] tgt_brw_write+0x11af/0x17d0 [ptlrpc]
[ 6666.628618] [<ffffffffc0cd842a>] tgt_request_handle+0x92a/0x1370 [ptlrpc]
[ 6666.629320] [<ffffffffc0c80e5b>] ptlrpc_server_handle_request+0x23b/0xaa0 [ptlrpc]
[ 6666.630107] [<ffffffffc0c845a2>] ptlrpc_main+0xa92/0x1e40 [ptlrpc]
[ 6666.630764] [<ffffffffbd8c1c31>] kthread+0xd1/0xe0
[ 6666.631281] [<ffffffffbdf74c37>] ret_from_fork_nospec_end+0x0/0x39
[ 6666.632221] [<ffffffffffffffff>] 0xffffffffffffffff
[ 6666.632797] LustreError: dumping log to /tmp/lustre-log.1554181623.23196
[ 6672.522355] Lustre: lustre-OST0000 is waiting for obd_unlinked_exports more than 16 seconds. The obd refcount = 4. Is it stuck?
[ 6672.523567] Lustre: lustre-OST0000: UNLINKED ffff9d8d17319c00 f1ce44c5-b64b-e307-df96-edafd36f5e17 10.2.8.197@tcp 2 (1 0 0) 1 0 0 0: (null) 34359849384 stale:0
[ 6681.324388] Lustre: lustre-OST0000: Not available for connect from 10.2.8.200@tcp (stopping)
[ 6681.325429] Lustre: Skipped 10 previous similar messages
[ 6704.524362] Lustre: lustre-OST0000 is waiting for obd_unlinked_exports more than 32 seconds. The obd refcount = 4. Is it stuck?
[ 6704.525590] Lustre: lustre-OST0000: UNLINKED ffff9d8d17319c00 f1ce44c5-b64b-e307-df96-edafd36f5e17 10.2.8.197@tcp 2 (1 0 0) 1 0 0 0: (null) 34359849384 stale:0
[ 6713.933190] Lustre: lustre-OST0000: Not available for connect from 10.2.8.198@tcp (stopping)
[ 6713.934636] Lustre: Skipped 18 previous similar messages
[ 6726.487412] LNet: Service thread pid 23196 completed after 100.00s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
[ 6743.633474] LustreError: 25479:0:(client.c:1166:ptlrpc_import_delay_req()) @@@ IMP_CLOSED req@ffff9d8d14d60f00 x1629670662416544/t0(0) o101->lustre-MDT0000-lwp-OST0000@10.2.8.200@tcp:23/10 lens 456/496 e 0 to 0 dl 0 ref 2 fl Rpc:/0/ffffffff rc 0/-1
[ 6743.635773] LustreError: 25479:0:(qsd_reint.c:56:qsd_reint_completion()) lustre-OST0000: failed to enqueue global quota lock, glb fid:[0x200000006:0x20000:0x0], rc:-5
[ 6743.637213] LustreError: 25479:0:(qsd_reint.c:56:qsd_reint_completion()) Skipped 7 previous similar messages
[ 6768.526352] Lustre: lustre-OST0000 is waiting for obd_unlinked_exports more than 64 seconds. The obd refcount = 2. Is it stuck?
[ 6768.556672] Lustre: server umount lustre-OST0000 complete
…
[ 6826.496386] LNet: Service thread pid 23196 was inactive for 40.12s. The thread might be hung, or it might only be slow and will resume later.
Dumping the stack trace for debugging purposes:
[ 6826.498063] Pid: 23196, comm: ll_ost_io00_030 3.10.0-957.el7_lustre.x86_64 #1 SMP Wed Dec 12 15:03:08 UTC 2018
[ 6826.499022] Call Trace:
[ 6826.499284] [<ffffffffc0c3054e>] target_bulk_io+0x4ae/0xac0 [ptlrpc]
[ 6826.499983] [<ffffffffc0cdc5df>] tgt_brw_write+0x11af/0x17d0 [ptlrpc]
[ 6826.500678] [<ffffffffc0cd842a>] tgt_request_handle+0x92a/0x1370 [ptlrpc]
[ 6826.501429] [<ffffffffc0c80e5b>] ptlrpc_server_handle_request+0x23b/0xaa0 [ptlrpc]
[ 6826.502405] [<ffffffffc0c845a2>] ptlrpc_main+0xa92/0x1e40 [ptlrpc]
[ 6826.503075] [<ffffffffbd8c1c31>] kthread+0xd1/0xe0
[ 6826.503580] [<ffffffffbdf74c37>] ret_from_fork_nospec_end+0x0/0x39
[ 6826.504242] [<ffffffffffffffff>] 0xffffffffffffffff
[ 6826.504841] LustreError: dumping log to /tmp/lustre-log.1554181783.23196
[ 6886.380373] LustreError: 23196:0:(ldlm_lib.c:3237:target_bulk_io()) @@@ timeout on bulk WRITE after 100+0s req@ffff9d8d1a5dac50 x1629670629270288/t0(0) o4->f1ce44c5-b64b-e307-df96-edafd36f5e17@10.2.8.197@tcp:5/0 lens 4584/448 e 0 to 0 dl 1554181850 ref 1 fl Interpret:/2/0 rc 0/0
[ 6886.382825] LustreError: 23196:0:(ldlm_lib.c:3237:target_bulk_io()) Skipped 1 previous similar message
[ 6886.383776] Lustre: lustre-OST0000: Bulk IO write error with f1ce44c5-b64b-e307-df96-edafd36f5e17 (at 10.2.8.197@tcp), client will retry: rc = -110
…
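The OSS log above also shows how the test arms its replay barrier: the OST sits on a dm-flakey device that is reloaded with drop_writes before the facet is failed over, so anything written after the barrier is silently dropped on disk and has to be replayed by the client after recovery. Reconstructed from the DEBUG MARKER lines above (the ost1_flakey device name, 4194304-sector size, and 252:0 backing device are specific to this VM and will differ elsewhere):

sync; sync; sync                                 # flush anything already dirty
lctl --device lustre-OST0000 notransno           # stop assigning new transnos
dmsetup suspend --nolockfs --noflush /dev/mapper/ost1_flakey
dmsetup load /dev/mapper/ost1_flakey --table "0 4194304 flakey 252:0 0 0 1800 1 drop_writes"
dmsetup resume /dev/mapper/ost1_flakey           # from here on, writes are dropped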
So far, we are only seeing this replay-single test_70f hang in interop testing (server version/client version):
2.10.6/2.12.0.52 - https://testing.whamcloud.com/test_sets/fd5c3ddc-557d-11e9-9720-52540065bddc
2.10.6/2.12.0.28 - https://testing.whamcloud.com/test_sets/890b500a-4e0a-11e9-b98a-52540065bddc
2.10.6/2.12.50.85 - https://testing.whamcloud.com/test_sets/6167f5d8-25a3-11e9-b901-52540065bddc
2.10.6/2.12.50.83 - https://testing.whamcloud.com/test_sets/8633b650-2552-11e9-b54c-52540065bddc
2.10.6/2.12.0 RC4 - https://testing.whamcloud.com/test_sets/2f9059a2-0635-11e9-a97c-52540065bddc
2.10.5/2.11.55.65 - https://testing.whamcloud.com/test_sets/93246f30-cbca-11e8-82f2-52540065bddc
2.10.4 RC2/2.11.53.32 - https://testing.whamcloud.com/test_sets/7060aef2-9815-11e8-b0aa-52540065bddc