[LU-7802] set_param lru_size fails with 'error: set_param: setting /proc/fs/lustre/ldlm/namespaces/lustre-OST0000-osc-*/lru_size=clear: Invalid argument' Created: 22/Feb/16  Updated: 24/Oct/17  Resolved: 22/Sep/17

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.9.0, Lustre 2.10.0, Lustre 2.10.1, Lustre 2.11.0
Fix Version/s: Lustre 2.11.0, Lustre 2.10.2

Type: Bug Priority: Minor
Reporter: James Nunez (Inactive) Assignee: Patrick Farrell (Inactive)
Resolution: Fixed Votes: 0
Labels: patch
Environment:

autotest and manual testing


Issue Links:
Related
is related to LU-7437 "lctl list_param -R" can't list the p... Resolved
is related to LU-7796 "lctl set_param jobid_var" should ret... Resolved
is related to LU-9438 sanity-lfsck test_17: (1.2) f1 (wrong... Resolved
is related to LU-8276 Make lru clear always discard read lo... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

lctl set_param -n ldlm.namespaces.*$1*.lru_size=clear fails with error message

error: set_param: setting /proc/fs/lustre/ldlm/namespaces/lustre-OST0000-osc-ffff880077f04000/lru_size=clear: Invalid argument

I've seen this error message in the test_log for a few sanity tests. The error does not seem to make the test fail (should it?), and it is intermittent: a test can hit the error on one run and not on the next.

Here are a few instances of this error I've come across:
sanity test_127a at https://testing.hpdd.intel.com/test_sets/2f35cef8-d8c8-11e5-83e2-5254006e85c2
sanity test_241, which hits this a little more regularly, at https://testing.hpdd.intel.com/sub_tests/79078936-d8e1-11e5-83e2-5254006e85c2

The error comes from a call to 'cancel_lru_locks osc'. From tests/test-framework.sh, we see

cancel_lru_locks() {
	#$LCTL mark "cancel_lru_locks $1 start"
	$LCTL set_param -n ldlm.namespaces.*$1*.lru_size=clear
	$LCTL get_param ldlm.namespaces.*$1*.lock_unused_count | grep -v '=0'
	#$LCTL mark "cancel_lru_locks $1 stop"
}

It's not clear what is causing this error. Since the error does not cause the test to fail, it's hard to find other occurrences or to determine when it first started.



 Comments   
Comment by Oleg Drokin [ 22/Feb/16 ]
        if (strncmp(dummy, "clear", 5) == 0) {
                CDEBUG(D_DLMTRACE,
                       "dropping all unused locks from namespace %s\n",
                       ldlm_ns_name(ns));
                if (ns_connect_lru_resize(ns)) {
                        int canceled, unused  = ns->ns_nr_unused;

                        /* Try to cancel all @ns_nr_unused locks. */
                        canceled = ldlm_cancel_lru(ns, unused, 0,
                                                   LDLM_LRU_FLAG_PASSED);
                        if (canceled < unused) {
                                CDEBUG(D_DLMTRACE,
                                       "not all requested locks are canceled, "
                                       "requested: %d, canceled: %d\n", unused,
                                       canceled);
                                return -EINVAL;
                        }

This seems racy; perhaps there were other cancellers running in parallel or something? We probably need to revisit that code.

Comment by James A Simmons [ 09/Feb/17 ]

With the migration to sysfs I can take a look at it.

Comment by Sarah Liu [ 27/Mar/17 ]

another instance on master branch: https://testing.hpdd.intel.com/test_sets/b24483ae-0a02-11e7-9053-5254006e85c2
tag-2.9.54 server: el7 client: sles12sp2

Comment by Bob Glossman (Inactive) [ 14/Jul/17 ]

another on master:
https://testing.hpdd.intel.com/test_sets/76d956d6-68d6-11e7-baf7-5254006e85c2

Comment by James Nunez (Inactive) [ 28/Jul/17 ]

It looks like sanity test 101g also suffers from this issue and, from the test log, fails with

...
error: set_param: setting /sys/fs/lustre/ldlm/namespaces/lustre-OST0000-osc-ffff880068917800/lru_size=clear: Invalid argument
ldlm.namespaces.lustre-OST0000-osc-ffff880068917800.lock_unused_count=1
10+0 records in
10+0 records out
41943040 bytes (42 MB) copied, 0.00800729 s, 5.2 GB/s
 sanity test_101g: @@@@@@ FAIL: 0 != 10 read RPCs 

Logs are at:
https://testing.hpdd.intel.com/test_sets/1e288200-7281-11e7-a0a2-5254006e85c2

Comment by Bob Glossman (Inactive) [ 07/Aug/17 ]

another on master:
https://testing.hpdd.intel.com/test_sets/5e303066-7aef-11e7-9b8f-5254006e85c2

Comment by James A Simmons [ 14/Aug/17 ]

Removed the LU-8066 link since this is a race condition and not a sysfs issue. What I do see is a potential patch from LU-8276 that might fix this issue. I added a link to LU-8276 here.

Comment by Steve Guminski (Inactive) [ 15/Aug/17 ]

Another on master:

https://testing.hpdd.intel.com/test_sessions/d7870a08-73b3-4f95-898b-f4f0908c9214

Comment by Patrick Farrell (Inactive) [ 15/Aug/17 ]

This isn't racy so much as just wrong. Sometimes locks are in use, so we don't cancel them. That's intended behavior.

The fix for this is just not to return -EINVAL. This isn't a condition that should generate that sort of error.

I'll push a patch.
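
For illustration only, here is a minimal userspace sketch of the behavior described in this comment. It is not the Lustre code and not the patch itself: the toy_lock and toy_ns types and all toy_* functions are hypothetical stand-ins, and the "in use" flag is an assumed simplification of whatever makes the real cancel path skip a lock. It contrasts the old return rule (error when fewer locks were canceled than requested) with the proposed one (skipping a busy lock is not an error).

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-ins for illustration; not the real Lustre structures. */
struct toy_lock {
	int in_use;	/* nonzero: lock still has users and must not be canceled */
	int canceled;	/* nonzero: lock has already been dropped */
};

struct toy_ns {
	struct toy_lock *locks;
	size_t nr_locks;
};

/* Count the locks that have not been canceled yet. */
int toy_nr_unused(const struct toy_ns *ns)
{
	int n = 0;

	assert(ns != NULL);
	for (size_t i = 0; i < ns->nr_locks; i++)
		if (!ns->locks[i].canceled)
			n++;
	return n;
}

/* Cancel up to 'max' locks, skipping any that are in use. */
int toy_cancel_lru(struct toy_ns *ns, int max)
{
	int canceled = 0;

	assert(ns != NULL);
	for (size_t i = 0; i < ns->nr_locks && canceled < max; i++) {
		struct toy_lock *lk = &ns->locks[i];

		if (lk->canceled || lk->in_use)
			continue;	/* intended: busy locks are left alone */
		lk->canceled = 1;
		canceled++;
	}
	return canceled;
}

/* Old rule: report -EINVAL whenever fewer locks were canceled than requested. */
int toy_lru_clear_old(struct toy_ns *ns)
{
	int unused = toy_nr_unused(ns);
	int canceled = toy_cancel_lru(ns, unused);

	return canceled < unused ? -EINVAL : 0;
}

/* Proposed rule: a skipped busy lock is not an error, so always return 0. */
int toy_lru_clear_new(struct toy_ns *ns)
{
	int unused = toy_nr_unused(ns);

	(void)toy_cancel_lru(ns, unused);
	return 0;
}
```

With three locks of which one is in use, both variants cancel two locks and leave one behind; only the old rule turns that into "Invalid argument". The leftover lock is still visible to the caller through the unused count, which is what cancel_lru_locks prints via lock_unused_count.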

Comment by Gerrit Updater [ 15/Aug/17 ]

Patrick Farrell (paf@cray.com) uploaded a new patch: https://review.whamcloud.com/28560
Subject: LU-7802 ldlm: No -EINVAL for lock in use
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 70b92f98c56a128894009aa608dcfa589836fe47

Comment by Bob Glossman (Inactive) [ 24/Aug/17 ]

another on master:
https://testing.hpdd.intel.com/test_sets/f5187be6-8878-11e7-b3ca-5254006e85c2

Comment by Sebastien Buisson (Inactive) [ 30/Aug/17 ]

another on master:
https://testing.hpdd.intel.com/test_sets/40e74f6a-8cb2-11e7-b4ee-5254006e85c2

Comment by Gerrit Updater [ 13/Sep/17 ]

Patrick Farrell (paf@cray.com) uploaded a new patch: https://review.whamcloud.com/28975
Subject: LU-7802 ldlm: No -EINVAL for lock in use
Project: fs/lustre-release
Branch: b2_10
Current Patch Set: 1
Commit: 01add10e6eab833054fb232d9e12cb48b5a63301

Comment by Gerrit Updater [ 22/Sep/17 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/28560/
Subject: LU-7802 ldlm: No -EINVAL for canceled != unused
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: a5081b7362e44b8d38aee1112f9a7d3aae1642c0

Comment by Peter Jones [ 22/Sep/17 ]

Landed for 2.11

Comment by Gerrit Updater [ 24/Oct/17 ]

John L. Hammond (john.hammond@intel.com) merged in patch https://review.whamcloud.com/28975/
Subject: LU-7802 ldlm: No -EINVAL for canceled != unused
Project: fs/lustre-release
Branch: b2_10
Current Patch Set:
Commit: 9a38fcb07dadc6f6b4c55e24feae004175c906e9
