Details
-
Bug
-
Resolution: Fixed
-
Critical
-
Lustre 2.11.0, Lustre 2.12.0
-
None
-
3
Description
mds-survey test_2 hangs. The last thing we see in the test output is:
== mds-survey test 2: Metadata survey with stripe_count = 1 ========================================== 14:28:15 (1517437695)
CMD: trevis-12vm12 /usr/sbin/lctl dl
+ file_count=94322 thrlo=1 thrhi=8 dir_count=4 layer=mdd stripe_count=1 rslt_loc=/tmp targets="trevis-12vm12:lustre-MDT0000" /usr/bin/mds-survey
Wed Jan 31 14:28:17 PST 2018 /usr/bin/mds-survey from trevis-12vm9
mdt 1 file 94322 dir 4 thr 4 create 11863.49 [ 0.00, 23978.87] lookup 372653.14 [ 372653.14, 372653.14] md_getattr
There are several examples of this hang in Maloo, but many of the call traces seem incomplete. In the MDS console log from the test suite logs at https://testing.hpdd.intel.com/test_sets/00a4d694-2252-11e8-a4b1-52540065bddc, we see two lctl threads (PIDs 24353 and 24355) hit the same assertion in echo_big_lmm_get(), with their call traces interleaved:
[55686.709160] Lustre: DEBUG MARKER: == mds-survey test 2: Metadata survey with stripe_count = 1 ========================================== 14:28:15 (1517437695)
[55686.770937] Lustre: DEBUG MARKER: /usr/sbin/lctl dl
[55687.328593] Lustre: Echo OBD driver; http://www.lustre.org/
[55688.487749] LustreError: 23896:0:(echo_client.c:1795:echo_md_lookup()) lookup MDT0000-tests: rc = -2
[55688.487759] LustreError: 23896:0:(echo_client.c:2027:echo_md_destroy_internal()) Can't find child MDT0000-tests: rc = -2
[55689.008253] LustreError: 24007:0:(echo_client.c:1795:echo_md_lookup()) lookup MDT0000-tests3: rc = -2
[55689.008264] LustreError: 24007:0:(echo_client.c:1795:echo_md_lookup()) Skipped 2 previous similar messages
[55689.008267] LustreError: 24007:0:(echo_client.c:2027:echo_md_destroy_internal()) Can't find child MDT0000-tests3: rc = -2
[55689.008268] LustreError: 24007:0:(echo_client.c:2027:echo_md_destroy_internal()) Skipped 2 previous similar messages
[55699.028489] LustreError: 24353:0:(echo_client.c:1397:echo_big_lmm_get()) ASSERTION( ma->ma_need & (MA_LOV | MA_LMV) ) failed:
[55699.032133] LustreError: 24355:0:(echo_client.c:1397:echo_big_lmm_get()) ASSERTION( ma->ma_need & (MA_LOV | MA_LMV) ) failed:
[55699.032134] LustreError: 24355:0:(echo_client.c:1397:echo_big_lmm_get()) LBUG
[55699.032135] Pid: 24355, comm: lctl
[55699.032135]
[55699.032135] Call Trace:
[55699.043010] LustreError: 24353:0:(echo_client.c:1397:echo_big_lmm_get()) LBUG
[55699.044901] [<ffffffff81019b19>] dump_trace+0x59/0x310
[55699.046838] Pid: 24353, comm: lctl
[55699.048565]
[55699.048565] Call Trace:
[55699.051619] [<ffffffff81019b19>] dump_trace+0x59/0x310
[55699.082944] [<ffffffffa08616ca>] libcfs_call_trace+0x4a/0x60 [libcfs]
[55699.082947] [<ffffffffa08616ca>] libcfs_call_trace+0x4a/0x60 [libcfs]
[55699.091042] [<ffffffffa0861741>] lbug_with_loc+0x41/0xa0 [libcfs]
[55699.091043] [<ffffffffa0861741>] lbug_with_loc+0x41/0xa0 [libcfs]
[55699.091056] [<ffffffffa0833397>] echo_big_lmm_get+0x637/0x7a0 [obdecho]
[55699.091076] [<ffffffffa0834a68>] echo_attr_get_complex+0x518/0x6b0 [obdecho]
[55699.091085] [<ffffffffa0838dee>] echo_md_handler.isra.45+0x1a3e/0x2b30 [obdecho]
[55699.091092] [<ffffffffa083aec9>] echo_client_iocontrol+0xfe9/0x1ab0 [obdecho]
[55699.101840] [<ffffffffa0833397>] echo_big_lmm_get+0x637/0x7a0 [obdecho]
[55699.103625] [<ffffffffa0834a68>] echo_attr_get_complex+0x518/0x6b0 [obdecho]
[55699.105445] [<ffffffffa0838dee>] echo_md_handler.isra.45+0x1a3e/0x2b30 [obdecho]
[55699.107243] [<ffffffffa083aec9>] echo_client_iocontrol+0xfe9/0x1ab0 [obdecho]
[55699.135047] [<ffffffffa0bc9ed2>] class_handle_ioctl+0x1822/0x1d20 [obdclass]
[55699.135049] [<ffffffffa0bc9ed2>] class_handle_ioctl+0x1822/0x1d20 [obdclass]
[55699.135114] [<ffffffffa0bb054a>] obd_class_ioctl+0xba/0x150 [obdclass]
[55699.140337] [<ffffffffa0bb054a>] obd_class_ioctl+0xba/0x150 [obdclass]
[55699.146489] [<ffffffff81219f7d>] do_vfs_ioctl+0x2cd/0x4a0
[55699.146489] [<ffffffff81219f7d>] do_vfs_ioctl+0x2cd/0x4a0
[55699.169646] [<ffffffff8121a1c4>] SyS_ioctl+0x74/0x80
[55699.169647] [<ffffffff8121a1c4>] SyS_ioctl+0x74/0x80
[55699.169684] [<ffffffff8160d1be>] entry_SYSCALL_64_fastpath+0x12/0xaa
[55699.174486] [<ffffffff8160d1be>] entry_SYSCALL_64_fastpath+0x12/0xaa
[55699.196143] (null)
[55699.196146] (null)DWARF2 unwinder stuck at entry_SYSCALL_64_fastpath+0x12/0xaa
[55699.196147]
[55699.196147] (null)Leftover inexact backtrace:
[55699.196147]
[55699.196156]
[55699.196164] Kernel panic - not syncing: LBUG
[55699.196172] CPU: 1 PID: 24355 Comm: lctl Tainted: G OE N 4.4.103-6.38_lustre-default #1
[55699.196173] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2007
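Because many of the Maloo call traces are incomplete, it may help to check other console logs for the same assertion directly. The following is a minimal sketch, not part of the Lustre test framework. It assumes the console log is available as plain text with one entry per line, in the `[timestamp] LustreError: pid:0:(file:line:func()) message` layout seen above. The function name `find_fatal_events` and the returned field names are hypothetical.

```python
import re

# Matches a Lustre console line of the form seen in this ticket, e.g.
# [55699.032134] LustreError: 24355:0:(echo_client.c:1397:echo_big_lmm_get()) LBUG
LUSTRE_ERR = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+LustreError:\s+(?P<pid>\d+):\d+:"
    r"\((?P<file>[\w.]+):(?P<line>\d+):(?P<func>\w+)\(\)\)\s+(?P<msg>.*)"
)


def find_fatal_events(console_text):
    """Return the ASSERTION and LBUG entries found in a console log.

    Assumes one log entry per line; other LustreError lines are ignored.
    """
    events = []
    for raw in console_text.splitlines():
        match = LUSTRE_ERR.search(raw)
        if match is None:
            continue
        msg = match.group("msg").strip()
        # Keep only failed assertions and the LBUG that follows them.
        if msg.startswith("ASSERTION(") or msg == "LBUG":
            events.append({
                "timestamp": float(match.group("ts")),
                "pid": int(match.group("pid")),
                "location": "{}:{}".format(match.group("file"), match.group("line")),
                "function": match.group("func"),
                "message": msg,
            })
    return events
```

Running this over the log above and grouping the results by `pid` should separate the events of the two lctl threads, which makes the interleaved traces easier to attribute.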
This test started failing on 2018-01-31.
Here are a few of the logs for failures:
https://testing.hpdd.intel.com/test_sets/42ff8372-071d-11e8-a6ad-52540065bddc
https://testing.hpdd.intel.com/test_sets/110bab20-18a2-11e8-bd00-52540065bddc
https://testing.hpdd.intel.com/test_sets/ff52db72-2762-11e8-b74b-52540065bddc