[LU-15846] sanity-quota interop fail: set_param: setting debug=+quota,trace: Invalid argument Created: 11/May/22  Updated: 07/Dec/23  Resolved: 04/Aug/22

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.16.0

Type: Bug Priority: Minor
Reporter: Maloo Assignee: Andreas Dilger
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Related
is related to LU-13305 osc_cache_truncate_start()) ASSERTION... Open
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

This issue was created by maloo for Cliff White <cwhite@whamcloud.com>

This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/f26e6614-1b8b-4b79-a70f-ee4ce7252e9f

Client appears to have been evicted at start or prior to start of test.

[11019.259413] Lustre: DEBUG MARKER: == rpc test complete, duration -o sec ================================================================ 02:16:49 (1651976209)
[11019.601614] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n version 2>/dev/null
[11020.000907] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n version 2>/dev/null
[11020.920108] Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-39vm2.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
[11021.156356] Lustre: DEBUG MARKER: trevis-39vm2.trevis.whamcloud.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
[11021.251916] LustreError: 7705:0:(ldlm_lockd.c:681:ldlm_handle_ast_error()) ### client (nid 0@lo) returned error from blocking AST (req@ffff971c467da880 x1732211603062336 status -107 rc -107), evict it ns: mdt-lustre-MDT0000_UUID lock: ffff971ccc36d680/0x1c652b3f19039ba8 lrc: 4/0,0 mode: PR/PR res: [0x20001a9e2:0x5a6a:0x0].0x0 bits 0x13/0x0 rrc: 5 type: IBT flags: 0x60200400000020 nid: 0@lo remote: 0x1c652b3f19039b9a expref: 934 pid: 10604 timeout: 11121 lvb_type: 0
[11021.259068] LustreError: 138-a: lustre-MDT0000: A client on nid 0@lo was evicted due to a lock blocking callback time out: rc -107
[11021.261382] LustreError: 10594:0:(ldlm_lockd.c:256:expired_lock_main()) ### lock callback timer expired after 0s: evicting client at 0@lo  ns: mdt-lustre-MDT0000_UUID lock: ffff971ccc36d680/0x1c652b3f19039ba8 lrc: 3/0,0 mode: PR/PR res: [0x20001a9e2:0x5a6a:0x0].0x0 bits 0x13/0x0 rrc: 5 type: IBT flags: 0x60200400000020 nid: 0@lo remote: 0x1c652b3f19039b9a expref: 935 pid: 10604 timeout: 0 lvb_type: 0
[11021.369209] LustreError: 7705:0:(client.c:1213:ptlrpc_import_delay_req()) @@@ IMP_CLOSED   req@ffff971c467da880 x1732211603062400/t0(0) o104->lustre-MDT0000@0@lo:15/16 lens 296/224 e 0 to 0 dl 0 ref 1 fl Rpc:/0/ffffffff rc 0/-1
[11022.532353] Lustre: DEBUG MARKER: /usr/sbin/lctl conf_param lustre.quota.mdt=ug
[11022.923428] Lustre: DEBUG MARKER: /usr/sbin/lctl conf_param lustre.quota.ost=ug
[11023.418437] Lustre: DEBUG MARKER: /usr/sbin/lctl mark -----============= acceptance-small: sanity-quota ============----- Sun May  8 02:16:53 UTC 2022
[11023.673072] Lustre: DEBUG MARKER: -----============= acceptance-small: sanity-quota ============----- Sun May 8 02:16:53 UTC 2022
[11022.550571] LustreError: 17675:0:(ldlm_lockd.c:681:ldlm_handle_ast_error()) ### client (nid 10.240.39.127@tcp) returned error from blocking AST (req@ffff8a9638ad8d80 x1732211273986240 status -107 rc -107), evict it ns: filter-lustre-OST0002_UUID lock: ffff8a95b6db2240/0x1ad8c47013a5028e lrc: 4/0,0 mode: PW/PW res: [0x161ba:0x0:0x0].0x0 rrc: 3 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x60000400010020 nid: 10.240.39.127@tcp remote: 0x1c652b3f190530c1 expref: 63 pid: 12899 timeout: 11122 lvb_type: 0
[11022.558744] LustreError: 138-a: lustre-OST0002: A client on nid 10.240.39.127@tcp was evicted due to a lock blocking callback time out: rc -107
[11022.563250] LustreError: 20721:0:(ldlm_lockd.c:256:expired_lock_main()) ### lock callback timer expired after 0s: evicting client at 10.240.39.127@tcp  ns: filter-lustre-OST0003_UUID lock: ffff8a960f060b40/0x1ad8c47013a4c8f8 lrc: 3/0,0 mode: PW/PW res: [0x1615e:0x0:0x0].0x0 rrc: 3 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x60000400030020 nid: 10.240.39.127@tcp remote: 0x1c652b3f18fe90fa expref: 64 pid: 17700 timeout: 0 lvb_type: 0


 Comments   
Comment by Andreas Dilger [ 12/May/22 ]

This may also be an interop/test-framework.sh bug when running on older servers (2.12.6.ddn66 in this case) introduced by patch https://review.whamcloud.com/43741 "LU-13055 libcfs: allow comma-separated masks" that was landed as 2.14.0-ddn37-36-gba4d37b9fc.

The last lines of the test log are:

trevis-39vm4: error: set_param: setting /sys/kernel/debug/lnet/debug=+quota,trace: Invalid argument
trevis-39vm3: error: set_param: setting /sys/kernel/debug/lnet/debug=+quota,trace: Invalid argument

which is caused by a change in sanity-quota.sh:

quota_init() {
        do_nodes $(comma_list $(nodes_list)) \
                "$LCTL set_param -n debug=+quota,trace"
}

And indeed, running a Maloo search for b_es6_0 clients and b_es5_2 servers shows that sanity-quota.sh has been failing since 2022-03-10, which is the day after that patch landed.

Comment by Gerrit Updater [ 12/May/22 ]

"Andreas Dilger <adilger@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/47308
Subject: LU-15846 tests: don't use comma-separated debug flags
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 12282314e2567973b1a7cb4d6806e987bf1a6755

Comment by Gerrit Updater [ 04/Aug/22 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/47308/
Subject: LU-15846 tests: don't use comma-separated debug flags
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: fe8315c25ed093d77a6f366e2a4849aba008b680

Comment by Peter Jones [ 04/Aug/22 ]

Landed for 2.16

Generated at Sat Feb 10 03:21:48 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.