Lustre / LU-12156

interop: replay-single test 70f hangs in lctl


Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Minor
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.10.6, Lustre 2.12.1
    • Severity: 3

    Description

      replay-single test_70f hangs in lctl. Looking at a recent hang, https://testing.whamcloud.com/test_sets/fd5c3ddc-557d-11e9-9720-52540065bddc, the last thing seen in the client test log is

      …
      Started lustre-OST0000
      CMD: onyx-46vm1.onyx.whamcloud.com,onyx-46vm2 PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/usr/lib64/lustre/utils/gss:/usr/lib64/lustre/utils:/usr/lib64/qt-3.3/bin:/usr/lib64/compat-openmpi16/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/usr/sbin:/sbin:/bin::/sbin:/bin:/usr/sbin: NAME=autotest_config bash rpc.sh wait_import_state_mount \(FULL\|IDLE\) osc.lustre-OST0000-osc-ffff*.ost_server_uuid 
      onyx-46vm1: == rpc test complete, duration -o sec ================================================================ 05:09:05 (1554181745)
      onyx-46vm2: == rpc test complete, duration -o sec ================================================================ 05:09:05 (1554181745)
      onyx-46vm1: onyx-46vm1.onyx.whamcloud.com: executing wait_import_state_mount (FULL|IDLE) osc.lustre-OST0000-osc-ffff*.ost_server_uuid
      onyx-46vm2: onyx-46vm2.onyx.whamcloud.com: executing wait_import_state_mount (FULL|IDLE) osc.lustre-OST0000-osc-ffff*.ost_server_uuid
      onyx-46vm1: CMD: onyx-46vm1.onyx.whamcloud.com lctl get_param -n at_max
      onyx-46vm1: osc.lustre-OST0000-osc-ffff*.ost_server_uuid in FULL state after 0 sec
      onyx-46vm2: CMD: onyx-46vm2.onyx.whamcloud.com lctl get_param -n at_max
      onyx-46vm2: osc.lustre-OST0000-osc-ffff*.ost_server_uuid in FULL state after 0 sec
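
      For reference, the wait_import_state_mount check that rpc.sh runs on each client boils down to polling the osc import state with lctl. A minimal sketch of the equivalent check (the real loop is in test-framework.sh; the cut/uniq detail is an assumption about its form in this era):

      # the param prints "<target uuid>	<import state>"; keep only the state field
      state=$(lctl get_param -n osc.lustre-OST0000-osc-*.ost_server_uuid | cut -f2 | uniq)
      # the wrapper retries until $state matches FULL or IDLE; here it reported FULL after 0 sec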
      

      Looking at the console log for client 1 (vm1), we see lctl hanging

      …
      [ 6712.545693] Lustre: DEBUG MARKER: lctl get_param -n at_max
      [ 6712.769788] Lustre: DEBUG MARKER: /usr/sbin/lctl mark osc.lustre-OST0000-osc-ffff*.ost_server_uuid in FULL state after 0 sec
      [ 6712.985691] Lustre: DEBUG MARKER: osc.lustre-OST0000-osc-ffff*.ost_server_uuid in FULL state after 0 sec
      [ 6720.276410] Lustre: DEBUG MARKER: mcreate /mnt/lustre/fsa-$(hostname); rm /mnt/lustre/fsa-$(hostname)
      [ 6720.621479] Lustre: DEBUG MARKER: if [ -d /mnt/lustre2 ]; then mcreate /mnt/lustre2/fsa-$(hostname); rm /mnt/lustre2/fsa-$(hostname); fi
      [ 6723.273530] Lustre: DEBUG MARKER: local REPLAY BARRIER on lustre-OST0000
      [ 6724.493700] Lustre: DEBUG MARKER: /usr/sbin/lctl mark test_70f failing OST 2 times
      [ 6724.691853] Lustre: DEBUG MARKER: test_70f failing OST 2 times
      [ 6727.305612] Lustre: lustre-OST0000-osc-ffff96191a900000: Connection to lustre-OST0000 (at 10.2.8.199@tcp) was lost; in progress operations using this service will wait for recovery to complete
      [ 6727.307316] Lustre: Skipped 8 previous similar messages
      [ 6840.242867] INFO: task lctl:18670 blocked for more than 120 seconds.
      [ 6840.243565] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [ 6840.244388] lctl            D ffff961939174100     0 18670  18669 0x00000080
      [ 6840.245230] Call Trace:
      [ 6840.245537]  [<ffffffffb5b94a9d>] ? list_del+0xd/0x30
      [ 6840.246074]  [<ffffffffb5f67bc9>] schedule+0x29/0x70
      [ 6840.246655]  [<ffffffffb5f656a1>] schedule_timeout+0x221/0x2d0
      [ 6840.247264]  [<ffffffffb58d64f0>] ? try_to_wake_up+0x190/0x390
      [ 6840.247981]  [<ffffffffb5f67f7d>] wait_for_completion+0xfd/0x140
      [ 6840.248574]  [<ffffffffb58d67b0>] ? wake_up_state+0x20/0x20
      [ 6840.249239]  [<ffffffffc0c9e6fd>] __ldlm_bl_to_thread+0xad/0x150 [ptlrpc]
      [ 6840.250015]  [<ffffffffc0c9ed2b>] ldlm_bl_to_thread+0x33b/0x510 [ptlrpc]
      [ 6840.250706]  [<ffffffffc0ca33d9>] ldlm_bl_to_thread_list+0x19/0x20 [ptlrpc]
      [ 6840.251431]  [<ffffffffc0c9d0c6>] ldlm_cancel_lru+0x76/0x170 [ptlrpc]
      [ 6840.252173]  [<ffffffffc0c8dbe5>] lru_size_store+0x145/0x480 [ptlrpc]
      [ 6840.252910]  [<ffffffffc0abc09a>] lustre_attr_store+0x1a/0x20 [obdclass]
      [ 6840.253661]  [<ffffffffb5acb6e2>] sysfs_kf_write+0x42/0x50
      [ 6840.254232]  [<ffffffffb5acacbb>] kernfs_fop_write+0xeb/0x160
      [ 6840.254926]  [<ffffffffb5a410a0>] vfs_write+0xc0/0x1f0
      [ 6840.255456]  [<ffffffffb5f74d21>] ? system_call_after_swapgs+0xae/0x146
      [ 6840.256119]  [<ffffffffb5a41ebf>] SyS_write+0x7f/0xf0
      [ 6840.256708]  [<ffffffffb5f74d21>] ? system_call_after_swapgs+0xae/0x146
      [ 6840.257385]  [<ffffffffb5f74ddb>] system_call_fastpath+0x22/0x27
      [ 6840.258055]  [<ffffffffb5f74d21>] ? system_call_after_swapgs+0xae/0x146
      …
      [ 7276.489872] LustreError: 6024:0:(osc_cache.c:955:osc_extent_wait()) extent ffff96191b551dc0@{[0 -> 0/1023], [3|0|+|rpc|wiuY|ffff961924371280], [28672|1|+|-|ffff96191d8b2fc0|1024|ffff96193a76d140]} lustre-OST0000-osc-ffff96191a900000: wait ext to 0 timedout, recovery in progress?
      [ 7276.492281] LustreError: 6024:0:(osc_cache.c:955:osc_extent_wait()) ### extent: ffff96191b551dc0 ns: lustre-OST0000-osc-ffff96191a900000 lock: ffff96191d8b2fc0/0x40b5a0904351ea00 lrc: 3/0,0 mode: PW/PW res: [0x2782:0x0:0x0].0x0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->4095) flags: 0x29400000000 nid: local remote: 0xafa2f3157ef8581d expref: -99 pid: 13958 timeout: 0 lvb_type: 1
      [ 7302.224848] Lustre: 14779:0:(client.c:2134:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1554182067/real 1554182067]  req@ffff96191b307600 x1629670629270288/t0(0) o4->lustre-OST0000-osc-ffff96191a900000@10.2.8.199@tcp:6/4 lens 4584/448 e 4 to 1 dl 1554182175 ref 2 fl Rpc:X/2/ffffffff rc 0/-1
      [ 7302.227597] Lustre: 14779:0:(client.c:2134:ptlrpc_expire_one_request()) Skipped 4 previous similar messages
      [ 7320.304861] INFO: task lctl:18670 blocked for more than 120 seconds.
      [ 7320.305552] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [ 7320.306332] lctl            D ffff961939174100     0 18670  18669 0x00000080
      [ 7320.307084] Call Trace:
      [ 7320.307424]  [<ffffffffb5b94a9d>] ? list_del+0xd/0x30
      [ 7320.307938]  [<ffffffffb5f67bc9>] schedule+0x29/0x70
      [ 7320.308481]  [<ffffffffb5f656a1>] schedule_timeout+0x221/0x2d0
      [ 7320.309063]  [<ffffffffb58d64f0>] ? try_to_wake_up+0x190/0x390
      [ 7320.309782]  [<ffffffffb5f67f7d>] wait_for_completion+0xfd/0x140
      [ 7320.310382]  [<ffffffffb58d67b0>] ? wake_up_state+0x20/0x20
      [ 7320.311029]  [<ffffffffc0c9e6fd>] __ldlm_bl_to_thread+0xad/0x150 [ptlrpc]
      [ 7320.311701]  [<ffffffffc0c9ed2b>] ldlm_bl_to_thread+0x33b/0x510 [ptlrpc]
      [ 7320.312385]  [<ffffffffc0ca33d9>] ldlm_bl_to_thread_list+0x19/0x20 [ptlrpc]
      [ 7320.313158]  [<ffffffffc0c9d0c6>] ldlm_cancel_lru+0x76/0x170 [ptlrpc]
      [ 7320.313793]  [<ffffffffc0c8dbe5>] lru_size_store+0x145/0x480 [ptlrpc]
      [ 7320.314453]  [<ffffffffc0abc09a>] lustre_attr_store+0x1a/0x20 [obdclass]
      [ 7320.315184]  [<ffffffffb5acb6e2>] sysfs_kf_write+0x42/0x50
      [ 7320.315720]  [<ffffffffb5acacbb>] kernfs_fop_write+0xeb/0x160
      [ 7320.316292]  [<ffffffffb5a410a0>] vfs_write+0xc0/0x1f0
      [ 7320.316961]  [<ffffffffb5f74d21>] ? system_call_after_swapgs+0xae/0x146
      [ 7320.317602]  [<ffffffffb5a41ebf>] SyS_write+0x7f/0xf0
      [ 7320.318106]  [<ffffffffb5f74d21>] ? system_call_after_swapgs+0xae/0x146
      [ 7320.318813]  [<ffffffffb5f74ddb>] system_call_fastpath+0x22/0x27
      [ 7320.319404]  [<ffffffffb5f74d21>] ? system_call_after_swapgs+0xae/0x146
      [ 7409.400854] Lustre: lustre-OST0000-osc-ffff96191a900000: Connection to lustre-OST0000 (at 10.2.8.199@tcp) was lost; in progress operations using this service will wait for recovery to complete
      [ 7409.402543] Lustre: Skipped 4 previous similar messages
      [ 7409.405522] Lustre: lustre-OST0000-osc-ffff96191a900000: Connection restored to 10.2.8.199@tcp (at 10.2.8.199@tcp)
      [ 7409.406583] Lustre: Skipped 5 previous similar messages
      [ 7440.319864] INFO: task lctl:18670 blocked for more than 120 seconds.
      [ 7440.320764] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      …
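
      A plausible reading of the hung-task stack: lctl is writing to the ldlm namespace lru_size attribute via sysfs (lru_size_store()), which calls ldlm_cancel_lru() and then blocks in __ldlm_bl_to_thread() waiting for the blocking thread to finish the lock cancels while the OST is mid-failover. In the test framework this write typically comes from cancel_lru_locks(); a minimal sketch of the equivalent command (assumed, not taken from this log):

      # cancel all unused client-side ldlm locks in the OSC namespaces;
      # on a 2.12 client this goes through sysfs and lands in lru_size_store()
      lctl set_param -n "ldlm.namespaces.*osc*.lru_size=clear"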
      

      The console log for the OSS (vm3) contains many errors; here are just a few

      …
      [ 6629.884030] Lustre: DEBUG MARKER: /usr/sbin/lctl mark osc.lustre-OST0000-osc-ffff*.ost_server_uuid in FULL state after 0 sec
      [ 6629.981017] Lustre: DEBUG MARKER: /usr/sbin/lctl mark osc.lustre-OST0000-osc-ffff*.ost_server_uuid in FULL state after 0 sec
      [ 6630.095178] Lustre: DEBUG MARKER: osc.lustre-OST0000-osc-ffff*.ost_server_uuid in FULL state after 0 sec
      [ 6630.232869] Lustre: DEBUG MARKER: osc.lustre-OST0000-osc-ffff*.ost_server_uuid in FULL state after 0 sec
      [ 6636.601540] Lustre: DEBUG MARKER: sync; sync; sync
      [ 6638.070508] Lustre: DEBUG MARKER: /usr/sbin/lctl --device lustre-OST0000 notransno
      [ 6638.421691] Lustre: DEBUG MARKER: modprobe dm-flakey;
      [ 6638.421691] 			 dmsetup targets | grep -q flakey
      [ 6638.775038] Lustre: DEBUG MARKER: dmsetup table /dev/mapper/ost1_flakey
      [ 6639.121066] Lustre: DEBUG MARKER: dmsetup suspend --nolockfs --noflush /dev/mapper/ost1_flakey
      [ 6639.467783] Lustre: DEBUG MARKER: dmsetup load /dev/mapper/ost1_flakey --table "0 4194304 flakey 252:0 0 0 1800 1 drop_writes"
      [ 6639.818180] Lustre: DEBUG MARKER: dmsetup resume /dev/mapper/ost1_flakey
      [ 6640.180823] Lustre: DEBUG MARKER: /usr/sbin/lctl mark ost1 REPLAY BARRIER on lustre-OST0000
      [ 6640.357370] Lustre: DEBUG MARKER: ost1 REPLAY BARRIER on lustre-OST0000
      [ 6641.603284] Lustre: DEBUG MARKER: /usr/sbin/lctl mark test_70f failing OST 2 times
      [ 6641.800703] Lustre: DEBUG MARKER: test_70f failing OST 2 times
      [ 6641.985657] Lustre: DEBUG MARKER: grep -c /mnt/lustre-ost1' ' /proc/mounts || true
      [ 6642.334386] Lustre: DEBUG MARKER: umount -d /mnt/lustre-ost1
      [ 6642.516911] Lustre: Failing over lustre-OST0000
      [ 6643.823863] Lustre: lustre-OST0000: Not available for connect from 10.2.8.198@tcp (stopping)
      [ 6643.824741] Lustre: Skipped 1 previous similar message
      [ 6646.530498] Lustre: lustre-OST0000: Not available for connect from 10.2.8.200@tcp (stopping)
      [ 6646.531419] Lustre: Skipped 1 previous similar message
      [ 6653.837143] Lustre: lustre-OST0000: Not available for connect from 10.2.8.198@tcp (stopping)
      [ 6653.838059] Lustre: Skipped 2 previous similar messages
      [ 6656.520352] Lustre: lustre-OST0000 is waiting for obd_unlinked_exports more than 8 seconds. The obd refcount = 4. Is it stuck?
      [ 6656.521512] Lustre: lustre-OST0000: UNLINKED ffff9d8d17319c00 f1ce44c5-b64b-e307-df96-edafd36f5e17 10.2.8.197@tcp 2 (1 0 0) 1 0 0 0:           (null)  34359849384 stale:0
      [ 6663.853170] Lustre: lustre-OST0000: Not available for connect from 10.2.8.198@tcp (stopping)
      [ 6663.854170] Lustre: Skipped 5 previous similar messages
      [ 6666.624361] LNet: Service thread pid 23196 was inactive for 40.13s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
      [ 6666.626018] Pid: 23196, comm: ll_ost_io00_030 3.10.0-957.el7_lustre.x86_64 #1 SMP Wed Dec 12 15:03:08 UTC 2018
      [ 6666.626965] Call Trace:
      [ 6666.627234]  [<ffffffffc0c3054e>] target_bulk_io+0x4ae/0xac0 [ptlrpc]
      [ 6666.627927]  [<ffffffffc0cdc5df>] tgt_brw_write+0x11af/0x17d0 [ptlrpc]
      [ 6666.628618]  [<ffffffffc0cd842a>] tgt_request_handle+0x92a/0x1370 [ptlrpc]
      [ 6666.629320]  [<ffffffffc0c80e5b>] ptlrpc_server_handle_request+0x23b/0xaa0 [ptlrpc]
      [ 6666.630107]  [<ffffffffc0c845a2>] ptlrpc_main+0xa92/0x1e40 [ptlrpc]
      [ 6666.630764]  [<ffffffffbd8c1c31>] kthread+0xd1/0xe0
      [ 6666.631281]  [<ffffffffbdf74c37>] ret_from_fork_nospec_end+0x0/0x39
      [ 6666.632221]  [<ffffffffffffffff>] 0xffffffffffffffff
      [ 6666.632797] LustreError: dumping log to /tmp/lustre-log.1554181623.23196
      [ 6672.522355] Lustre: lustre-OST0000 is waiting for obd_unlinked_exports more than 16 seconds. The obd refcount = 4. Is it stuck?
      [ 6672.523567] Lustre: lustre-OST0000: UNLINKED ffff9d8d17319c00 f1ce44c5-b64b-e307-df96-edafd36f5e17 10.2.8.197@tcp 2 (1 0 0) 1 0 0 0:           (null)  34359849384 stale:0
      [ 6681.324388] Lustre: lustre-OST0000: Not available for connect from 10.2.8.200@tcp (stopping)
      [ 6681.325429] Lustre: Skipped 10 previous similar messages
      [ 6704.524362] Lustre: lustre-OST0000 is waiting for obd_unlinked_exports more than 32 seconds. The obd refcount = 4. Is it stuck?
      [ 6704.525590] Lustre: lustre-OST0000: UNLINKED ffff9d8d17319c00 f1ce44c5-b64b-e307-df96-edafd36f5e17 10.2.8.197@tcp 2 (1 0 0) 1 0 0 0:           (null)  34359849384 stale:0
      [ 6713.933190] Lustre: lustre-OST0000: Not available for connect from 10.2.8.198@tcp (stopping)
      [ 6713.934636] Lustre: Skipped 18 previous similar messages
      [ 6726.487412] LNet: Service thread pid 23196 completed after 100.00s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
      [ 6743.633474] LustreError: 25479:0:(client.c:1166:ptlrpc_import_delay_req()) @@@ IMP_CLOSED   req@ffff9d8d14d60f00 x1629670662416544/t0(0) o101->lustre-MDT0000-lwp-OST0000@10.2.8.200@tcp:23/10 lens 456/496 e 0 to 0 dl 0 ref 2 fl Rpc:/0/ffffffff rc 0/-1
      [ 6743.635773] LustreError: 25479:0:(qsd_reint.c:56:qsd_reint_completion()) lustre-OST0000: failed to enqueue global quota lock, glb fid:[0x200000006:0x20000:0x0], rc:-5
      [ 6743.637213] LustreError: 25479:0:(qsd_reint.c:56:qsd_reint_completion()) Skipped 7 previous similar messages
      [ 6768.526352] Lustre: lustre-OST0000 is waiting for obd_unlinked_exports more than 64 seconds. The obd refcount = 2. Is it stuck?
      [ 6768.556672] Lustre: server umount lustre-OST0000 complete
      …
      [ 6826.496386] LNet: Service thread pid 23196 was inactive for 40.12s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
      [ 6826.498063] Pid: 23196, comm: ll_ost_io00_030 3.10.0-957.el7_lustre.x86_64 #1 SMP Wed Dec 12 15:03:08 UTC 2018
      [ 6826.499022] Call Trace:
      [ 6826.499284]  [<ffffffffc0c3054e>] target_bulk_io+0x4ae/0xac0 [ptlrpc]
      [ 6826.499983]  [<ffffffffc0cdc5df>] tgt_brw_write+0x11af/0x17d0 [ptlrpc]
      [ 6826.500678]  [<ffffffffc0cd842a>] tgt_request_handle+0x92a/0x1370 [ptlrpc]
      [ 6826.501429]  [<ffffffffc0c80e5b>] ptlrpc_server_handle_request+0x23b/0xaa0 [ptlrpc]
      [ 6826.502405]  [<ffffffffc0c845a2>] ptlrpc_main+0xa92/0x1e40 [ptlrpc]
      [ 6826.503075]  [<ffffffffbd8c1c31>] kthread+0xd1/0xe0
      [ 6826.503580]  [<ffffffffbdf74c37>] ret_from_fork_nospec_end+0x0/0x39
      [ 6826.504242]  [<ffffffffffffffff>] 0xffffffffffffffff
      [ 6826.504841] LustreError: dumping log to /tmp/lustre-log.1554181783.23196
      [ 6886.380373] LustreError: 23196:0:(ldlm_lib.c:3237:target_bulk_io()) @@@ timeout on bulk WRITE after 100+0s  req@ffff9d8d1a5dac50 x1629670629270288/t0(0) o4->f1ce44c5-b64b-e307-df96-edafd36f5e17@10.2.8.197@tcp:5/0 lens 4584/448 e 0 to 0 dl 1554181850 ref 1 fl Interpret:/2/0 rc 0/0
      [ 6886.382825] LustreError: 23196:0:(ldlm_lib.c:3237:target_bulk_io()) Skipped 1 previous similar message
      [ 6886.383776] Lustre: lustre-OST0000: Bulk IO write error with f1ce44c5-b64b-e307-df96-edafd36f5e17 (at 10.2.8.197@tcp), client will retry: rc = -110
      …
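
      For context, the marker sequence above (notransno, dm-flakey drop_writes, umount) is the framework's replay-barrier-plus-failover dance. In outline, and assuming the usual test-framework.sh helpers (names hedged; details vary by version):

      replay_barrier ost1   # sync; lctl --device lustre-OST0000 notransno;
                            # load a dm-flakey table with drop_writes so nothing new reaches disk
      fail ost1             # umount /mnt/lustre-ost1 and mount it again,
                            # forcing connected clients into recovery/replay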
      

      So far, we are only seeing this replay-single test 70f hang in interop testing (server version/client version):
      2.10.6/2.12.0.52 - https://testing.whamcloud.com/test_sets/fd5c3ddc-557d-11e9-9720-52540065bddc
      2.10.6/2.12.0.28 - https://testing.whamcloud.com/test_sets/890b500a-4e0a-11e9-b98a-52540065bddc
      2.10.6/2.12.50.85 - https://testing.whamcloud.com/test_sets/6167f5d8-25a3-11e9-b901-52540065bddc
      2.10.6/2.12.50.83 - https://testing.whamcloud.com/test_sets/8633b650-2552-11e9-b54c-52540065bddc
      2.10.6/2.12.0 RC4 - https://testing.whamcloud.com/test_sets/2f9059a2-0635-11e9-a97c-52540065bddc
      2.10.5/2.11.55.65 - https://testing.whamcloud.com/test_sets/93246f30-cbca-11e8-82f2-52540065bddc
      2.10.4 RC2/2.11.53.32 - https://testing.whamcloud.com/test_sets/7060aef2-9815-11e8-b0aa-52540065bddc

            People

              Assignee: WC Triage
              Reporter: James Nunez (Inactive)