[LU-10818] mds-survey test 2 hangs with “ASSERTION( ma->ma_need & (MA_LOV | MA_LMV) ) failed” Created: 14/Mar/18 Updated: 27/Feb/19 Resolved: 04/Sep/18 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.11.0, Lustre 2.12.0 |
| Fix Version/s: | Lustre 2.12.0, Lustre 2.10.7 |
| Type: | Bug | Priority: | Critical |
| Reporter: | James Nunez (Inactive) | Assignee: | Lai Siyao |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
mds-survey test_2 hangs. The last thing we see in the test output is:
== mds-survey test 2: Metadata survey with stripe_count = 1 ========================================== 14:28:15 (1517437695)
CMD: trevis-12vm12 /usr/sbin/lctl dl
+ file_count=94322 thrlo=1 thrhi=8 dir_count=4 layer=mdd stripe_count=1 rslt_loc=/tmp targets="trevis-12vm12:lustre-MDT0000" /usr/bin/mds-survey
Wed Jan 31 14:28:17 PST 2018 /usr/bin/mds-survey from trevis-12vm9
mdt 1 file 94322 dir 4 thr 4 create 11863.49 [ 0.00, 23978.87] lookup 372653.14 [ 372653.14, 372653.14] md_getattr
There are several examples of this hang in Maloo, but many of the call traces seem incomplete. Looking at the MDS console log for the test suite logs at https://testing.hpdd.intel.com/test_sets/00a4d694-2252-11e8-a4b1-52540065bddc, we see:
[55686.709160] Lustre: DEBUG MARKER: == mds-survey test 2: Metadata survey with stripe_count = 1 ========================================== 14:28:15 (1517437695)
[55686.770937] Lustre: DEBUG MARKER: /usr/sbin/lctl dl
[55687.328593] Lustre: Echo OBD driver; http://www.lustre.org/
[55688.487749] LustreError: 23896:0:(echo_client.c:1795:echo_md_lookup()) lookup MDT0000-tests: rc = -2
[55688.487759] LustreError: 23896:0:(echo_client.c:2027:echo_md_destroy_internal()) Can't find child MDT0000-tests: rc = -2
[55689.008253] LustreError: 24007:0:(echo_client.c:1795:echo_md_lookup()) lookup MDT0000-tests3: rc = -2
[55689.008264] LustreError: 24007:0:(echo_client.c:1795:echo_md_lookup()) Skipped 2 previous similar messages
[55689.008267] LustreError: 24007:0:(echo_client.c:2027:echo_md_destroy_internal()) Can't find child MDT0000-tests3: rc = -2
[55689.008268] LustreError: 24007:0:(echo_client.c:2027:echo_md_destroy_internal()) Skipped 2 previous similar messages
[55699.028489] LustreError: 24353:0:(echo_client.c:1397:echo_big_lmm_get()) ASSERTION( ma->ma_need & (MA_LOV | MA_LMV) ) failed:
[55699.032133] LustreError: 24355:0:(echo_client.c:1397:echo_big_lmm_get()) ASSERTION( ma->ma_need & (MA_LOV | MA_LMV) ) failed:
[55699.032134] LustreError: 24355:0:(echo_client.c:1397:echo_big_lmm_get()) LBUG
[55699.032135] Pid: 24355, comm: lctl
[55699.032135]
[55699.032135] Call Trace:
[55699.043010] LustreError: 24353:0:(echo_client.c:1397:echo_big_lmm_get()) LBUG
[55699.044901] [<ffffffff81019b19>] dump_trace+0x59/0x310
[55699.046838] Pid: 24353, comm: lctl
[55699.048565]
[55699.048565] Call Trace:
[55699.051619] [<ffffffff81019b19>] dump_trace+0x59/0x310
[55699.082944] [<ffffffffa08616ca>] libcfs_call_trace+0x4a/0x60 [libcfs]
[55699.082947] [<ffffffffa08616ca>] libcfs_call_trace+0x4a/0x60 [libcfs]
[55699.091042] [<ffffffffa0861741>] lbug_with_loc+0x41/0xa0 [libcfs]
[55699.091043] [<ffffffffa0861741>] lbug_with_loc+0x41/0xa0 [libcfs]
[55699.091056] [<ffffffffa0833397>] echo_big_lmm_get+0x637/0x7a0 [obdecho]
[55699.091076] [<ffffffffa0834a68>] echo_attr_get_complex+0x518/0x6b0 [obdecho]
[55699.091085] [<ffffffffa0838dee>] echo_md_handler.isra.45+0x1a3e/0x2b30 [obdecho]
[55699.091092] [<ffffffffa083aec9>] echo_client_iocontrol+0xfe9/0x1ab0 [obdecho]
[55699.101840] [<ffffffffa0833397>] echo_big_lmm_get+0x637/0x7a0 [obdecho]
[55699.103625] [<ffffffffa0834a68>] echo_attr_get_complex+0x518/0x6b0 [obdecho]
[55699.105445] [<ffffffffa0838dee>] echo_md_handler.isra.45+0x1a3e/0x2b30 [obdecho]
[55699.107243] [<ffffffffa083aec9>] echo_client_iocontrol+0xfe9/0x1ab0 [obdecho]
[55699.135047] [<ffffffffa0bc9ed2>] class_handle_ioctl+0x1822/0x1d20 [obdclass]
[55699.135049] [<ffffffffa0bc9ed2>] class_handle_ioctl+0x1822/0x1d20 [obdclass]
[55699.135114] [<ffffffffa0bb054a>] obd_class_ioctl+0xba/0x150 [obdclass]
[55699.140337] [<ffffffffa0bb054a>] obd_class_ioctl+0xba/0x150 [obdclass]
[55699.146489] [<ffffffff81219f7d>] do_vfs_ioctl+0x2cd/0x4a0
[55699.146489] [<ffffffff81219f7d>] do_vfs_ioctl+0x2cd/0x4a0
[55699.169646] [<ffffffff8121a1c4>] SyS_ioctl+0x74/0x80
[55699.169647] [<ffffffff8121a1c4>] SyS_ioctl+0x74/0x80
[55699.169684] [<ffffffff8160d1be>] entry_SYSCALL_64_fastpath+0x12/0xaa
[55699.174486] [<ffffffff8160d1be>] entry_SYSCALL_64_fastpath+0x12/0xaa
[55699.196143] (null)
[55699.196146] (null)DWARF2 unwinder stuck at entry_SYSCALL_64_fastpath+0x12/0xaa
[55699.196147]
[55699.196147] (null)Leftover inexact backtrace:
[55699.196147]
[55699.196156]
[55699.196164] Kernel panic - not syncing: LBUG
[55699.196172] CPU: 1 PID: 24355 Comm: lctl Tainted: G OE N 4.4.103-6.38_lustre-default #1
[55699.196173] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2007
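The console log above interleaves messages from two lctl processes (PIDs 24353 and 24355) that hit the same assertion, which makes the traces hard to read. As a rough aid, the following Python sketch groups the LustreError lines by PID. The regular expression is an assumption inferred only from the message layout seen in this report, not from any Lustre format specification, and the function name `group_errors_by_pid` is hypothetical.

```python
import re
from collections import defaultdict

# Assumed layout, inferred from the lines in this ticket:
#   [timestamp] LustreError: PID:N:(file.c:LINE:function()) message
# The message runs until the next "[timestamp]" marker or the end of input,
# so the pattern works on one-entry-per-line logs and on flattened text.
LUSTRE_ERROR = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+LustreError:\s+(?P<pid>\d+):\d+:"
    r"\((?P<file>[\w.]+):(?P<line>\d+):(?P<func>\w+)\(\)\)\s*"
    r"(?P<msg>.*?)(?=\s*\[\s*\d+\.\d+\]|\s*$)",
    re.DOTALL,
)


def group_errors_by_pid(log_text):
    """Return {pid: [(timestamp, function, message), ...]} for LustreError entries."""
    grouped = defaultdict(list)
    for m in LUSTRE_ERROR.finditer(log_text):
        grouped[int(m.group("pid"))].append(
            (float(m.group("ts")), m.group("func"), m.group("msg"))
        )
    return dict(grouped)


# Four entries copied from the console log in this report.
SAMPLE = "\n".join([
    "[55699.028489] LustreError: 24353:0:(echo_client.c:1397:echo_big_lmm_get()) "
    "ASSERTION( ma->ma_need & (MA_LOV | MA_LMV) ) failed:",
    "[55699.032133] LustreError: 24355:0:(echo_client.c:1397:echo_big_lmm_get()) "
    "ASSERTION( ma->ma_need & (MA_LOV | MA_LMV) ) failed:",
    "[55699.032134] LustreError: 24355:0:(echo_client.c:1397:echo_big_lmm_get()) LBUG",
    "[55699.043010] LustreError: 24353:0:(echo_client.c:1397:echo_big_lmm_get()) LBUG",
])

if __name__ == "__main__":
    for pid, entries in sorted(group_errors_by_pid(SAMPLE).items()):
        print(pid)
        for ts, func, msg in entries:
            print(f"  {ts:.6f} {func}: {msg}")
```

Grouping this way shows each process logging the failed assertion followed by its own LBUG. It does not separate the stack frames, since those lines carry no PID.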
This test started failing on 2018-01-31. Here are a few of the logs for failures:
https://testing.hpdd.intel.com/test_sets/42ff8372-071d-11e8-a6ad-52540065bddc
https://testing.hpdd.intel.com/test_sets/110bab20-18a2-11e8-bd00-52540065bddc
https://testing.hpdd.intel.com/test_sets/ff52db72-2762-11e8-b74b-52540065bddc |
| Comments |
| Comment by Sarah Liu [ 23/Apr/18 ] |
|
This failure blocks master testing tag-2.11.51:
on EL7 server/client: https://testing.hpdd.intel.com/test_sets/d79b14a4-471b-11e8-95c0-52540065bddc
on EL7 server/SLES12sp3 client: https://testing.hpdd.intel.com/test_sets/7be267e6-4718-11e8-960d-52540065bddc |
| Comment by Peter Jones [ 29/Aug/18 ] |
|
Lai, could you please investigate? Thanks, Peter |
| Comment by Gerrit Updater [ 30/Aug/18 ] |
|
Lai Siyao (lai.siyao@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/33092 |
| Comment by Gerrit Updater [ 31/Aug/18 ] |
|
Nikitas Angelinas (nangelinas@cray.com) uploaded a new patch: https://review.whamcloud.com/33097 |
| Comment by Nikitas Angelinas [ 31/Aug/18 ] |
|
I had been meaning to upload the patch we had for this issue, but I have been away from work lately. We may not want to land it since Lai has already posted a patch, but I am uploading it because it includes some additional code rework that we might want to merge into Lai's patch. |
| Comment by Gerrit Updater [ 04/Sep/18 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/33097/ |
| Comment by Peter Jones [ 04/Sep/18 ] |
|
Landed for 2.12 |
| Comment by Gerrit Updater [ 07/Jan/19 ] |
|
Minh Diep (mdiep@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/33976 |
| Comment by Gerrit Updater [ 16/Jan/19 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/33976/ |