Lustre / LU-10818

mds-survey test 2 hangs with “ASSERTION( ma->ma_need & (MA_LOV | MA_LMV) ) failed”

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Critical
    • Fix Version/s: Lustre 2.12.0, Lustre 2.10.7
    • Affects Version/s: Lustre 2.11.0, Lustre 2.12.0

    Description

      mds-survey test_2 hangs. The last thing we see in the test output is:

      == mds-survey test 2: Metadata survey with stripe_count = 1 ========================================== 14:28:15 (1517437695)
      CMD: trevis-12vm12 /usr/sbin/lctl dl
      + file_count=94322 thrlo=1 thrhi=8 dir_count=4 layer=mdd stripe_count=1 rslt_loc=/tmp targets="trevis-12vm12:lustre-MDT0000" /usr/bin/mds-survey
      Wed Jan 31 14:28:17 PST 2018 /usr/bin/mds-survey from trevis-12vm9
      mdt 1 file   94322 dir    4 thr    4 create 11863.49 [    0.00, 23978.87] lookup 372653.14 [ 372653.14, 372653.14] md_getattr 

      There are several examples of this hang in Maloo, but many of the call traces appear incomplete. The MDS console log from the test suite run at https://testing.hpdd.intel.com/test_sets/00a4d694-2252-11e8-a4b1-52540065bddc shows:

      [55686.709160] Lustre: DEBUG MARKER: == mds-survey test 2: Metadata survey with stripe_count = 1 ========================================== 14:28:15 (1517437695)
      [55686.770937] Lustre: DEBUG MARKER: /usr/sbin/lctl dl
      [55687.328593] Lustre: Echo OBD driver; http://www.lustre.org/
      [55688.487749] LustreError: 23896:0:(echo_client.c:1795:echo_md_lookup()) lookup MDT0000-tests: rc = -2
      [55688.487759] LustreError: 23896:0:(echo_client.c:2027:echo_md_destroy_internal()) Can't find child MDT0000-tests: rc = -2
      [55689.008253] LustreError: 24007:0:(echo_client.c:1795:echo_md_lookup()) lookup MDT0000-tests3: rc = -2
      [55689.008264] LustreError: 24007:0:(echo_client.c:1795:echo_md_lookup()) Skipped 2 previous similar messages
      [55689.008267] LustreError: 24007:0:(echo_client.c:2027:echo_md_destroy_internal()) Can't find child MDT0000-tests3: rc = -2
      [55689.008268] LustreError: 24007:0:(echo_client.c:2027:echo_md_destroy_internal()) Skipped 2 previous similar messages
      [55699.028489] LustreError: 24353:0:(echo_client.c:1397:echo_big_lmm_get()) ASSERTION( ma->ma_need & (MA_LOV | MA_LMV) ) failed: 
      [55699.032133] LustreError: 24355:0:(echo_client.c:1397:echo_big_lmm_get()) ASSERTION( ma->ma_need & (MA_LOV | MA_LMV) ) failed: 
      [55699.032134] LustreError: 24355:0:(echo_client.c:1397:echo_big_lmm_get()) LBUG
      [55699.032135] Pid: 24355, comm: lctl
      [55699.032135] 
      [55699.032135] Call Trace:
      [55699.043010] LustreError: 24353:0:(echo_client.c:1397:echo_big_lmm_get()) LBUG
      [55699.044901]  [<ffffffff81019b19>] dump_trace+0x59/0x310
      [55699.046838] Pid: 24353, comm: lctl
      [55699.048565] 
      [55699.048565] Call Trace:
      [55699.051619]  [<ffffffff81019b19>] dump_trace+0x59/0x310
      [55699.082944]  [<ffffffffa08616ca>] libcfs_call_trace+0x4a/0x60 [libcfs]
      [55699.082947]  [<ffffffffa08616ca>] libcfs_call_trace+0x4a/0x60 [libcfs]
      [55699.091042]  [<ffffffffa0861741>] lbug_with_loc+0x41/0xa0 [libcfs]
      [55699.091043]  [<ffffffffa0861741>] lbug_with_loc+0x41/0xa0 [libcfs]
      [55699.091056]  [<ffffffffa0833397>] echo_big_lmm_get+0x637/0x7a0 [obdecho]
      [55699.091076]  [<ffffffffa0834a68>] echo_attr_get_complex+0x518/0x6b0 [obdecho]
      [55699.091085]  [<ffffffffa0838dee>] echo_md_handler.isra.45+0x1a3e/0x2b30 [obdecho]
      [55699.091092]  [<ffffffffa083aec9>] echo_client_iocontrol+0xfe9/0x1ab0 [obdecho]
      [55699.101840]  [<ffffffffa0833397>] echo_big_lmm_get+0x637/0x7a0 [obdecho]
      [55699.103625]  [<ffffffffa0834a68>] echo_attr_get_complex+0x518/0x6b0 [obdecho]
      [55699.105445]  [<ffffffffa0838dee>] echo_md_handler.isra.45+0x1a3e/0x2b30 [obdecho]
      [55699.107243]  [<ffffffffa083aec9>] echo_client_iocontrol+0xfe9/0x1ab0 [obdecho]
      [55699.135047]  [<ffffffffa0bc9ed2>] class_handle_ioctl+0x1822/0x1d20 [obdclass]
      [55699.135049]  [<ffffffffa0bc9ed2>] class_handle_ioctl+0x1822/0x1d20 [obdclass]
      [55699.135114]  [<ffffffffa0bb054a>] obd_class_ioctl+0xba/0x150 [obdclass]
      [55699.140337]  [<ffffffffa0bb054a>] obd_class_ioctl+0xba/0x150 [obdclass]
      [55699.146489]  [<ffffffff81219f7d>] do_vfs_ioctl+0x2cd/0x4a0
      [55699.146489]  [<ffffffff81219f7d>] do_vfs_ioctl+0x2cd/0x4a0
      [55699.169646]  [<ffffffff8121a1c4>] SyS_ioctl+0x74/0x80
      [55699.169647]  [<ffffffff8121a1c4>] SyS_ioctl+0x74/0x80
      [55699.169684]  [<ffffffff8160d1be>] entry_SYSCALL_64_fastpath+0x12/0xaa
      [55699.174486]  [<ffffffff8160d1be>] entry_SYSCALL_64_fastpath+0x12/0xaa
      [55699.196143] (null)
      [55699.196146] (null)DWARF2 unwinder stuck at entry_SYSCALL_64_fastpath+0x12/0xaa
      [55699.196147] 
      [55699.196147] (null)Leftover inexact backtrace:
      [55699.196147] 
      [55699.196156] 
      [55699.196164] Kernel panic - not syncing: LBUG
      [55699.196172] CPU: 1 PID: 24355 Comm: lctl Tainted: G           OE   N  4.4.103-6.38_lustre-default #1
      [55699.196173] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2007
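
The failed expression in the log, ma->ma_need & (MA_LOV | MA_LMV), is a bitmask test: it is non-zero only when at least one of the two flags is set in ma_need. The following is a minimal sketch of that kind of check, not the obdecho code itself. The SKETCH_* flag values, the md_attr_sketch struct, and both helper functions are hypothetical names chosen for illustration; the real definitions are in the Lustre headers and may differ.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative flag values only (assumption); the real MA_* values are
 * defined in the Lustre source and may differ. */
enum sketch_ma_flags {
    SKETCH_MA_INODE = 1 << 0,
    SKETCH_MA_LOV   = 1 << 1,
    SKETCH_MA_LMV   = 1 << 2,
};

/* Hypothetical stand-in that carries only the field named in the assertion. */
struct md_attr_sketch {
    uint64_t ma_need;
};

/* True when at least one of the LOV or LMV bits is set in ma_need. */
static bool wants_lov_or_lmv(const struct md_attr_sketch *ma)
{
    return (ma->ma_need & (SKETCH_MA_LOV | SKETCH_MA_LMV)) != 0;
}

/* Aborts when neither bit is set, loosely analogous to an assertion
 * that ends in an LBUG. */
static void require_lov_or_lmv(const struct md_attr_sketch *ma)
{
    assert(wants_lov_or_lmv(ma));
}
```

In this sketch, a request with only SKETCH_MA_INODE set fails the check, while one that also sets SKETCH_MA_LOV or SKETCH_MA_LMV passes. The console log shows only that the expression evaluated to zero in echo_big_lmm_get(); it does not show which flags the caller actually passed.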

      This test started failing on 2018-01-31. 

      Here are a few of the logs for failures:

      https://testing.hpdd.intel.com/test_sets/42ff8372-071d-11e8-a6ad-52540065bddc

      https://testing.hpdd.intel.com/test_sets/110bab20-18a2-11e8-bd00-52540065bddc

      https://testing.hpdd.intel.com/test_sets/ff52db72-2762-11e8-b74b-52540065bddc 

            People

              Assignee: Lai Siyao (laisiyao)
              Reporter: James Nunez (jamesanunez, Inactive)