[LU-12220] replay-single test_130a: timeout, MDT umount stuck in obd_exports_barrier() Created: 24/Apr/19  Updated: 11/Apr/23

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.13.0, Lustre 2.12.6
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Bruno Faccini (Inactive) Assignee: WC Triage
Resolution: Unresolved Votes: 0
Labels: None

Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

This issue was created by maloo for bfaccini <bruno.faccini@intel.com>

This issue relates to the following tests suites run:
https://testing.whamcloud.com/test_sets/ac9d40dc-62a0-11e9-92d8-52540065bddc
https://testing.whamcloud.com/sub_tests/6ef1565c-6619-11e9-a6f9-52540065bddc
https://testing.whamcloud.com/sub_tests/5ce256f8-66a2-11e9-a6f2-52540065bddc

test_130a failed with the following error:

Timeout occurred after 214 mins, last suite running was replay-single, restarting cluster to continue tests

MMDS Console :

[ 9269.958840] Lustre: DEBUG MARKER: /usr/sbin/lctl mark == replay-single test 130a: DoM file create \(setstripe\) replay ======================================= 09:01:59 \(1555664519\)
[ 9270.176313] Lustre: DEBUG MARKER: == replay-single test 130a: DoM file create (setstripe) replay ======================================= 09:01:59 (1555664519)
[ 9270.379677] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n version 2>/dev/null ||
				/usr/sbin/lctl lustre_build_version 2>/dev/null ||
				/usr/sbin/lctl --version 2>/dev/null | cut -d' ' -f2
[ 9270.695930] Lustre: DEBUG MARKER: sync; sync; sync
[ 9271.703775] Lustre: DEBUG MARKER: /usr/sbin/lctl --device lustre-MDT0000 notransno
[ 9272.010818] Lustre: DEBUG MARKER: /usr/sbin/lctl --device lustre-MDT0000 readonly
[ 9272.361465] Lustre: DEBUG MARKER: /usr/sbin/lctl mark mds1 REPLAY BARRIER on lustre-MDT0000
[ 9272.515462] Lustre: DEBUG MARKER: mds1 REPLAY BARRIER on lustre-MDT0000
[ 9272.688112] Lustre: DEBUG MARKER: grep -c /mnt/lustre-mds1' ' /proc/mounts || true
[ 9272.992505] Lustre: DEBUG MARKER: umount -d /mnt/lustre-mds1
[ 9274.621764] LustreError: 18079:0:(client.c:1175:ptlrpc_import_delay_req()) @@@ IMP_CLOSED   req@ffff8b478eaa1f80 x1631222875733264/t0(0) o41->lustre-MDT0001-osp-MDT0000@10.9.25.5@tcp:24/4 lens 224/368 e 0 to 0 dl 0 ref 1 fl Rpc:/0/ffffffff rc 0/-1
[ 9274.625498] LustreError: 18079:0:(client.c:1175:ptlrpc_import_delay_req()) Skipped 1 previous similar message
[ 9274.686206] Lustre: lustre-MDT0000: Not available for connect from 0@lo (stopping)
[ 9274.687681] Lustre: Skipped 24 previous similar messages
[ 9287.251726] Lustre: lustre-MDT0000 is waiting for obd_unlinked_exports more than 8 seconds. The obd refcount = 4. Is it stuck?
[ 9287.253710] Lustre: lustre-MDT0000: UNLINKED ffff8b479f68c800 2a0a0e03-7e4d-227a-03c2-bc66c4ed138c 10.9.25.1@tcp 1 (1 0 0) 1 0 1 0:           (null)  554050781187 stale:0
[ 9301.677791] Lustre: mdt00_001: service thread pid 11852 was inactive for 40.031 seconds. Watchdog stack traces are limited to 3 per 300 seconds, skipping this one.
[ 9303.255723] Lustre: lustre-MDT0000 is waiting for obd_unlinked_exports more than 16 seconds. The obd refcount = 4. Is it stuck?
[ 9303.257733] Lustre: lustre-MDT0000: UNLINKED ffff8b479f68c800 2a0a0e03-7e4d-227a-03c2-bc66c4ed138c 10.9.25.1@tcp 1 (1 0 0) 1 0 1 0:           (null)  554050781187 stale:0
[ 9335.259742] Lustre: lustre-MDT0000 is waiting for obd_unlinked_exports more than 32 seconds. The obd refcount = 4. Is it stuck?
[ 9335.261875] Lustre: lustre-MDT0000: UNLINKED ffff8b479f68c800 2a0a0e03-7e4d-227a-03c2-bc66c4ed138c 10.9.25.1@tcp 1 (1 0 0) 1 0 1 0:           (null)  554050781187 stale:0
[ 9399.263733] Lustre: lustre-MDT0000 is waiting for obd_unlinked_exports more than 64 seconds. The obd refcount = 4. Is it stuck?
[ 9399.265718] Lustre: lustre-MDT0000: UNLINKED ffff8b479f68c800 2a0a0e03-7e4d-227a-03c2-bc66c4ed138c 10.9.25.1@tcp 1 (1 0 0) 1 0 1 0:           (null)  554050781187 stale:0
[ 9527.267724] Lustre: lustre-MDT0000 is waiting for obd_unlinked_exports more than 128 seconds. The obd refcount = 4. Is it stuck?
[ 9527.269735] Lustre: lustre-MDT0000: UNLINKED ffff8b479f68c800 2a0a0e03-7e4d-227a-03c2-bc66c4ed138c 10.9.25.1@tcp 1 (1 0 0) 1 0 1 0:           (null)  554050781187 stale:0
[ 9720.312753] INFO: task umount:29172 blocked for more than 120 seconds.
[ 9720.314046] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 9720.315363] umount          D ffff8b47bac3f6c0     0 29172  29171 0x00000080
[ 9720.316759] Call Trace:
[ 9720.317225]  [<ffffffffc0ccfe94>] ? print_export_data.isra.18+0x224/0x240 [obdclass]
[ 9720.318535]  [<ffffffff9c367c49>] schedule+0x29/0x70
[ 9720.319550]  [<ffffffff9c365668>] schedule_timeout+0x168/0x2d0
[ 9720.320532]  [<ffffffff9bca98e0>] ? __internal_add_timer+0x130/0x130
[ 9720.321705]  [<ffffffffc0cd77f9>] obd_exports_barrier+0xa9/0x1a0 [obdclass]
[ 9720.322942]  [<ffffffffc1300e22>] mdt_device_fini+0xe2/0x930 [mdt]
[ 9720.324052]  [<ffffffffc0d01583>] ? lu_context_init+0xd3/0x1f0 [obdclass]
[ 9720.325301]  [<ffffffffc0ceead2>] class_cleanup+0x862/0xbd0 [obdclass]
[ 9720.326435]  [<ffffffffc0cefacc>] class_process_config+0x65c/0x2830 [obdclass]
[ 9720.327791]  [<ffffffffc0ba3f17>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
[ 9720.328989]  [<ffffffffc0cf1e66>] class_manual_cleanup+0x1c6/0x720 [obdclass]
[ 9720.330295]  [<ffffffffc0d2259e>] server_put_super+0x8de/0xcd0 [obdclass]
[ 9720.331480]  [<ffffffff9be43dbd>] generic_shutdown_super+0x6d/0x100
[ 9720.332660]  [<ffffffff9be441b2>] kill_anon_super+0x12/0x20
[ 9720.333624]  [<ffffffffc0cf4a82>] lustre_kill_super+0x32/0x50 [obdclass]
[ 9720.334830]  [<ffffffff9be4456e>] deactivate_locked_super+0x4e/0x70
[ 9720.335912]  [<ffffffff9be44cf6>] deactivate_super+0x46/0x60
[ 9720.336962]  [<ffffffff9be6327f>] cleanup_mnt+0x3f/0x80
[ 9720.337861]  [<ffffffff9be63312>] __cleanup_mnt+0x12/0x20
[ 9720.338903]  [<ffffffff9bcbe79b>] task_work_run+0xbb/0xe0
[ 9720.339830]  [<ffffffff9bc2bc65>] do_notify_resume+0xa5/0xc0
[ 9720.340869]  [<ffffffff9c375124>] int_signal+0x12/0x17
[ 9783.271763] Lustre: lustre-MDT0000 is waiting for obd_unlinked_exports more than 256 seconds. The obd refcount = 4. Is it stuck?
[ 9783.273856] Lustre: lustre-MDT0000: UNLINKED ffff8b479f68c800 2a0a0e03-7e4d-227a-03c2-bc66c4ed138c 10.9.25.1@tcp 1 (1 0 0) 1 0 1 0:           (null)  554050781187 stale:0
[ 9875.645997] Lustre: lustre-MDT0000: Not available for connect from 0@lo (stopping)
[ 9875.647925] Lustre: Skipped 1919 previous similar messages
[ 9960.341752] INFO: task umount:29172 blocked for more than 120 seconds.
[ 9960.342930] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 9960.344222] umount          D ffff8b4794725140     0 29172  29171 0x00000080
[ 9960.345637] Call Trace:
[ 9960.346100]  [<ffffffffc0ccfe94>] ? print_export_data.isra.18+0x224/0x240 [obdclass]
[ 9960.347457]  [<ffffffff9c367c49>] schedule+0x29/0x70
[ 9960.348293]  [<ffffffff9c365668>] schedule_timeout+0x168/0x2d0
[ 9960.349327]  [<ffffffff9bca98e0>] ? __internal_add_timer+0x130/0x130
[ 9960.350416]  [<ffffffffc0cd77f9>] obd_exports_barrier+0xa9/0x1a0 [obdclass]
[ 9960.351657]  [<ffffffffc1300e22>] mdt_device_fini+0xe2/0x930 [mdt]
[ 9960.352701]  [<ffffffffc0d01583>] ? lu_context_init+0xd3/0x1f0 [obdclass]
[ 9960.353977]  [<ffffffffc0ceead2>] class_cleanup+0x862/0xbd0 [obdclass]
[ 9960.355070]  [<ffffffffc0cefacc>] class_process_config+0x65c/0x2830 [obdclass]
[ 9960.356351]  [<ffffffffc0ba3f17>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
[ 9960.357456]  [<ffffffffc0cf1e66>] class_manual_cleanup+0x1c6/0x720 [obdclass]
[ 9960.358726]  [<ffffffffc0d2259e>] server_put_super+0x8de/0xcd0 [obdclass]
[ 9960.359859]  [<ffffffff9be43dbd>] generic_shutdown_super+0x6d/0x100
[ 9960.360972]  [<ffffffff9be441b2>] kill_anon_super+0x12/0x20
[ 9960.361932]  [<ffffffffc0cf4a82>] lustre_kill_super+0x32/0x50 [obdclass]
[ 9960.363113]  [<ffffffff9be4456e>] deactivate_locked_super+0x4e/0x70
[ 9960.364150]  [<ffffffff9be44cf6>] deactivate_super+0x46/0x60
[ 9960.365255]  [<ffffffff9be6327f>] cleanup_mnt+0x3f/0x80
[ 9960.366130]  [<ffffffff9be63312>] __cleanup_mnt+0x12/0x20
[ 9960.367114]  [<ffffffff9bcbe79b>] task_work_run+0xbb/0xe0
[ 9960.368029]  [<ffffffff9bc2bc65>] do_notify_resume+0xa5/0xc0
[ 9960.369027]  [<ffffffff9c375124>] int_signal+0x12/0x17

...........................

[11040.569776] INFO: task umount:29172 blocked for more than 120 seconds.
[11040.571000] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[11040.572289] umount          D ffff8b4794725140     0 29172  29171 0x00000080
[11040.573636] Call Trace:
[11040.574114]  [<ffffffffc0ccfe94>] ? print_export_data.isra.18+0x224/0x240 [obdclass]
[11040.575532]  [<ffffffff9c367c49>] schedule+0x29/0x70
[11040.576402]  [<ffffffff9c365668>] schedule_timeout+0x168/0x2d0
[11040.577443]  [<ffffffff9bca98e0>] ? __internal_add_timer+0x130/0x130
[11040.578514]  [<ffffffffc0cd77f9>] obd_exports_barrier+0xa9/0x1a0 [obdclass]
[11040.579776]  [<ffffffffc1300e22>] mdt_device_fini+0xe2/0x930 [mdt]
[11040.580827]  [<ffffffffc0d01583>] ? lu_context_init+0xd3/0x1f0 [obdclass]
[11040.582065]  [<ffffffffc0ceead2>] class_cleanup+0x862/0xbd0 [obdclass]
[11040.583183]  [<ffffffffc0cefacc>] class_process_config+0x65c/0x2830 [obdclass]
[11040.584501]  [<ffffffffc0ba3f17>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
[11040.585630]  [<ffffffffc0cf1e66>] class_manual_cleanup+0x1c6/0x720 [obdclass]
[11040.586971]  [<ffffffffc0d2259e>] server_put_super+0x8de/0xcd0 [obdclass]
[11040.588287]  [<ffffffff9be43dbd>] generic_shutdown_super+0x6d/0x100
[11040.589406]  [<ffffffff9be441b2>] kill_anon_super+0x12/0x20
[11040.590422]  [<ffffffffc0cf4a82>] lustre_kill_super+0x32/0x50 [obdclass]
[11040.591706]  [<ffffffff9be4456e>] deactivate_locked_super+0x4e/0x70
[11040.592915]  [<ffffffff9be44cf6>] deactivate_super+0x46/0x60
[11040.593923]  [<ffffffff9be6327f>] cleanup_mnt+0x3f/0x80
[11040.594901]  [<ffffffff9be63312>] __cleanup_mnt+0x12/0x20
[11040.595850]  [<ffffffff9bcbe79b>] task_work_run+0xbb/0xe0
[11040.596956]  [<ffffffff9bc2bc65>] do_notify_resume+0xa5/0xc0
[11040.597930]  [<ffffffff9c375124>] int_signal+0x12/0x17
[11077.566009] Lustre: lustre-MDT0000: Not available for connect from 0@lo (stopping)
[11077.567489] Lustre: Skipped 1916 previous similar messages
[11319.279765] Lustre: lustre-MDT0000 is waiting for obd_unlinked_exports more than 1024 seconds. The obd refcount = 4. Is it stuck?
[11319.281778] Lustre: lustre-MDT0000: UNLINKED ffff8b479f68c800 2a0a0e03-7e4d-227a-03c2-bc66c4ed138c 10.9.25.1@tcp 1 (1 0 0) 1 0 1 0:           (null)  554050781187 stale:0
[11678.526106] Lustre: lustre-MDT0000: Not available for connect from 0@lo (stopping)
[11678.528152] Lustre: Skipped 1918 previous similar messages
[12279.485876] Lustre: lustre-MDT0000: Not available for connect from 0@lo (stopping)
[12279.487604] Lustre: Skipped 1916 previous similar messages
[12880.445916] Lustre: lustre-MDT0000: Not available for connect from 0@lo (stopping)
[12880.448085] Lustre: Skipped 1916 previous similar messages


 Comments   
Comment by Jian Yu [ 22/Sep/19 ]

+1 on master branch: https://testing.whamcloud.com/test_sets/3c709bba-dce9-11e9-b62b-52540065bddc

Comment by Etienne Aujames [ 04/Aug/21 ]

+1 on b2_12 branch: https://testing.whamcloud.com/sub_tests/4d6f16c3-1dc8-41c2-b557-9c41df1b34f4

Generated at Sat Feb 10 02:50:41 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.