[LU-10818] mds-survey test 2 hangs with “ASSERTION( ma->ma_need & (MA_LOV | MA_LMV) ) failed”
Created: 14/Mar/18  Updated: 27/Feb/19  Resolved: 04/Sep/18

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.11.0, Lustre 2.12.0
Fix Version/s: Lustre 2.12.0, Lustre 2.10.7

Type: Bug
Priority: Critical
Reporter: James Nunez (Inactive)
Assignee: Lai Siyao
Resolution: Fixed
Votes: 0
Labels: None

Severity: 3

Description

mds-survey test_2 hangs. The last lines of the test output are:

== mds-survey test 2: Metadata survey with stripe_count = 1 ========================================== 14:28:15 (1517437695)
CMD: trevis-12vm12 /usr/sbin/lctl dl
+ file_count=94322 thrlo=1 thrhi=8 dir_count=4 layer=mdd stripe_count=1 rslt_loc=/tmp targets="trevis-12vm12:lustre-MDT0000" /usr/bin/mds-survey
Wed Jan 31 14:28:17 PST 2018 /usr/bin/mds-survey from trevis-12vm9
mdt 1 file   94322 dir    4 thr    4 create 11863.49 [    0.00, 23978.87] lookup 372653.14 [ 372653.14, 372653.14] md_getattr 

 

There are several examples of this hang in Maloo, but many of the call traces seem incomplete. The MDS console log for the test session at https://testing.hpdd.intel.com/test_sets/00a4d694-2252-11e8-a4b1-52540065bddc is shown below. Two lctl processes (PIDs 24353 and 24355) hit the assertion at nearly the same time, so their call traces are interleaved:

[55686.709160] Lustre: DEBUG MARKER: == mds-survey test 2: Metadata survey with stripe_count = 1 ========================================== 14:28:15 (1517437695)
[55686.770937] Lustre: DEBUG MARKER: /usr/sbin/lctl dl
[55687.328593] Lustre: Echo OBD driver; http://www.lustre.org/
[55688.487749] LustreError: 23896:0:(echo_client.c:1795:echo_md_lookup()) lookup MDT0000-tests: rc = -2
[55688.487759] LustreError: 23896:0:(echo_client.c:2027:echo_md_destroy_internal()) Can't find child MDT0000-tests: rc = -2
[55689.008253] LustreError: 24007:0:(echo_client.c:1795:echo_md_lookup()) lookup MDT0000-tests3: rc = -2
[55689.008264] LustreError: 24007:0:(echo_client.c:1795:echo_md_lookup()) Skipped 2 previous similar messages
[55689.008267] LustreError: 24007:0:(echo_client.c:2027:echo_md_destroy_internal()) Can't find child MDT0000-tests3: rc = -2
[55689.008268] LustreError: 24007:0:(echo_client.c:2027:echo_md_destroy_internal()) Skipped 2 previous similar messages
[55699.028489] LustreError: 24353:0:(echo_client.c:1397:echo_big_lmm_get()) ASSERTION( ma->ma_need & (MA_LOV | MA_LMV) ) failed: 
[55699.032133] LustreError: 24355:0:(echo_client.c:1397:echo_big_lmm_get()) ASSERTION( ma->ma_need & (MA_LOV | MA_LMV) ) failed: 
[55699.032134] LustreError: 24355:0:(echo_client.c:1397:echo_big_lmm_get()) LBUG
[55699.032135] Pid: 24355, comm: lctl
[55699.032135] 
[55699.032135] Call Trace:
[55699.043010] LustreError: 24353:0:(echo_client.c:1397:echo_big_lmm_get()) LBUG
[55699.044901]  [<ffffffff81019b19>] dump_trace+0x59/0x310
[55699.046838] Pid: 24353, comm: lctl
[55699.048565] 
[55699.048565] Call Trace:
[55699.051619]  [<ffffffff81019b19>] dump_trace+0x59/0x310
[55699.082944]  [<ffffffffa08616ca>] libcfs_call_trace+0x4a/0x60 [libcfs]
[55699.082947]  [<ffffffffa08616ca>] libcfs_call_trace+0x4a/0x60 [libcfs]
[55699.091042]  [<ffffffffa0861741>] lbug_with_loc+0x41/0xa0 [libcfs]
[55699.091043]  [<ffffffffa0861741>] lbug_with_loc+0x41/0xa0 [libcfs]
[55699.091056]  [<ffffffffa0833397>] echo_big_lmm_get+0x637/0x7a0 [obdecho]
[55699.091076]  [<ffffffffa0834a68>] echo_attr_get_complex+0x518/0x6b0 [obdecho]
[55699.091085]  [<ffffffffa0838dee>] echo_md_handler.isra.45+0x1a3e/0x2b30 [obdecho]
[55699.091092]  [<ffffffffa083aec9>] echo_client_iocontrol+0xfe9/0x1ab0 [obdecho]
[55699.101840]  [<ffffffffa0833397>] echo_big_lmm_get+0x637/0x7a0 [obdecho]
[55699.103625]  [<ffffffffa0834a68>] echo_attr_get_complex+0x518/0x6b0 [obdecho]
[55699.105445]  [<ffffffffa0838dee>] echo_md_handler.isra.45+0x1a3e/0x2b30 [obdecho]
[55699.107243]  [<ffffffffa083aec9>] echo_client_iocontrol+0xfe9/0x1ab0 [obdecho]
[55699.135047]  [<ffffffffa0bc9ed2>] class_handle_ioctl+0x1822/0x1d20 [obdclass]
[55699.135049]  [<ffffffffa0bc9ed2>] class_handle_ioctl+0x1822/0x1d20 [obdclass]
[55699.135114]  [<ffffffffa0bb054a>] obd_class_ioctl+0xba/0x150 [obdclass]
[55699.140337]  [<ffffffffa0bb054a>] obd_class_ioctl+0xba/0x150 [obdclass]
[55699.146489]  [<ffffffff81219f7d>] do_vfs_ioctl+0x2cd/0x4a0
[55699.146489]  [<ffffffff81219f7d>] do_vfs_ioctl+0x2cd/0x4a0
[55699.169646]  [<ffffffff8121a1c4>] SyS_ioctl+0x74/0x80
[55699.169647]  [<ffffffff8121a1c4>] SyS_ioctl+0x74/0x80
[55699.169684]  [<ffffffff8160d1be>] entry_SYSCALL_64_fastpath+0x12/0xaa
[55699.174486]  [<ffffffff8160d1be>] entry_SYSCALL_64_fastpath+0x12/0xaa
[55699.196143] (null)
[55699.196146] (null)DWARF2 unwinder stuck at entry_SYSCALL_64_fastpath+0x12/0xaa
[55699.196147] 
[55699.196147] (null)Leftover inexact backtrace:
[55699.196147] 
[55699.196156] 
[55699.196164] Kernel panic - not syncing: LBUG
[55699.196172] CPU: 1 PID: 24355 Comm: lctl Tainted: G           OE   N  4.4.103-6.38_lustre-default #1
[55699.196173] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2007
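
The assertion in the console log above is a bitmask precondition: at least one of the MA_LOV or MA_LMV bits must be set in ma_need when echo_big_lmm_get() is entered. The following is a minimal C sketch of that kind of check, not the obdecho source. The flag values, the struct md_attr_sketch type, and the helper big_lmm_precondition_ok() are hypothetical and exist only for illustration.

```c
#include <assert.h>   /* for assert() when exercising the check */
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical flag values, for illustration only. The real
 * definitions live in the Lustre headers and may differ.
 */
enum {
	MA_INODE = 1u << 0,
	MA_LOV   = 1u << 1,
	MA_LMV   = 1u << 2,
};

/* Hypothetical stand-in for the attribute request structure. */
struct md_attr_sketch {
	uint64_t ma_need;	/* bitmask of attributes the caller wants */
};

/*
 * Returns true when the precondition quoted in the log holds:
 * the caller asked for LOV or LMV data (or both).
 */
static bool big_lmm_precondition_ok(const struct md_attr_sketch *ma)
{
	return (ma->ma_need & (MA_LOV | MA_LMV)) != 0;
}
```

Under these assumed values, a request mask that contains only MA_INODE fails the check, which is the condition under which the real code would raise the LBUG; a mask that also contains MA_LOV or MA_LMV passes.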

 

This test started failing on 2018-01-31. 

Here are a few of the logs for failures:

https://testing.hpdd.intel.com/test_sets/42ff8372-071d-11e8-a6ad-52540065bddc

https://testing.hpdd.intel.com/test_sets/110bab20-18a2-11e8-bd00-52540065bddc

https://testing.hpdd.intel.com/test_sets/ff52db72-2762-11e8-b74b-52540065bddc 



Comments
Comment by Sarah Liu [ 23/Apr/18 ]

This failure blocks master testing

tag-2.11.51

on EL7 server/client

https://testing.hpdd.intel.com/test_sets/d79b14a4-471b-11e8-95c0-52540065bddc

on EL7 server/SLES12sp3 client

https://testing.hpdd.intel.com/test_sets/7be267e6-4718-11e8-960d-52540065bddc

Comment by Peter Jones [ 29/Aug/18 ]

Lai

Could you please investigate?

Thanks

Peter

Comment by Gerrit Updater [ 30/Aug/18 ]

Lai Siyao (lai.siyao@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/33092
Subject: LU-10818 obdecho: ma_need is mistakenly set
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: db71e8411535635cadd9818ccbc87f8794b22835

Comment by Gerrit Updater [ 31/Aug/18 ]

Nikitas Angelinas (nangelinas@cray.com) uploaded a new patch: https://review.whamcloud.com/33097
Subject: LU-10818 obdecho: don't set ma_need in echo_attr_get_complex()
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 2b29f356a3b2696874eeb83a886fc693166cb7a4

Comment by Nikitas Angelinas [ 31/Aug/18 ]

I had been meaning to upload the patch we had for this issue, but I have been away from work lately. We may not want to land it, since Lai has already posted a patch, but I am uploading it because it includes some additional code rework that we might want to merge into Lai's patch.

Comment by Gerrit Updater [ 04/Sep/18 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/33097/
Subject: LU-10818 obdecho: don't set ma_need in echo_attr_get_complex()
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 40f70cd4cb1bb33c754607862dece7c6c1c30d38

Comment by Peter Jones [ 04/Sep/18 ]

Landed for 2.12

Comment by Gerrit Updater [ 07/Jan/19 ]

Minh Diep (mdiep@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/33976
Subject: LU-10818 obdecho: don't set ma_need in echo_attr_get_complex()
Project: fs/lustre-release
Branch: b2_10
Current Patch Set: 1
Commit: 4d2a41a71ceb306b09609be41ac4d96e39f07b9d

Comment by Gerrit Updater [ 16/Jan/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/33976/
Subject: LU-10818 obdecho: don't set ma_need in echo_attr_get_complex()
Project: fs/lustre-release
Branch: b2_10
Current Patch Set:
Commit: 0920f54a0866f77b49afd2308b798d4db3b69802
