[LU-10038] sanity test 133g fails with "mds1 find /proc/fs/lustre/ /proc/sys/lnet/ /sys/fs/lustre/ /sys/kernel/debug/lnet/ /sys/kernel/debug/lustre/ failed" Created: 27/Sep/17  Updated: 17/Jul/18  Resolved: 28/Mar/18

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.11.0
Fix Version/s: Lustre 2.11.0

Type: Bug Priority: Minor
Reporter: James Nunez (Inactive) Assignee: WC Triage
Resolution: Fixed Votes: 0
Labels: tests

Issue Links:
Related
is related to LU-9700 Interop 2.9.0<->master sanity test_13... Open
is related to LU-8066 Move lustre procfs handling to sysfs ... Open
is related to LU-11152 sanity test_133g: ost1 find /proc/fs... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

sanity test_133g fails on the call to find on the MDS. From the test_log, we see the directory list passed to find and the resulting failure:

mds1_proc_dirs='/proc/fs/lustre/
/proc/sys/lnet/
/sys/fs/lustre/
/sys/kernel/debug/lnet/
/sys/kernel/debug/lustre/'
 sanity test_133g: @@@@@@ FAIL: mds1 find /proc/fs/lustre/
/proc/sys/lnet/
/sys/fs/lustre/
/sys/kernel/debug/lnet/
/sys/kernel/debug/lustre/ failed 
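
For context, a condensed sketch of the relevant test_133g logic, paraphrased from the commands visible in the logs in this ticket (the exact script may differ between versions; do_facet and error are the test framework's remote-execution and failure helpers):

        # build the candidate list, keeping only directories that exist
        local proc_regexp="/{proc,sys}/{fs,sys,kernel/debug}/{lustre,lnet}/"
        local proc_dirs=$(eval \ls -d $proc_regexp 2>/dev/null)

        # rebuild the list on each server facet, then write garbage into
        # every tunable file found there
        local facet_proc_dirs=$(do_facet $facet \ls -d $proc_regexp 2>/dev/null)
        do_facet $facet find $facet_proc_dirs -type f \
                -not -name force_lbug -not -name changelog_mask \
                -exec badarea_io {} \; ||
                error "$facet find $facet_proc_dirs failed"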

Looking at the client2 dmesg, we see the output from 133g:

[ 3573.106070] Lustre: DEBUG MARKER: == sanity test 133g: Check for Oopses on bad io area writes/reads in /proc =========================== 21:44:29 (1504907069)
[ 3573.249527] Lustre: 18911:0:(lprocfs_status.c:2483:lprocfs_wr_root_squash()) lustre: failed to set root_squash due to bad address, rc = -14
[ 3573.254669] Lustre: 18911:0:(lprocfs_status.c:2479:lprocfs_wr_root_squash()) lustre: failed to set root_squash to "", needs uid:gid format, rc = -22
[ 3573.263113] Lustre: 18916:0:(lprocfs_status.c:2551:lprocfs_wr_nosquash_nids()) lustre: failed to set nosquash_nids to "", bad address rc = -14
[ 3573.268474] Lustre: 18916:0:(lprocfs_status.c:2555:lprocfs_wr_nosquash_nids()) lustre: failed to set nosquash_nids due to string too long rc = -22
[ 3573.349716] LustreError: 18934:0:(gss_cli_upcall.c:240:gss_do_ctx_init_rpc()) ioctl size 5, expect 80, please check lgss_keyring version
[ 3573.379338] LustreError: 18984:0:(ldlm_resource.c:355:lru_size_store()) lru_size: invalid value written
[ 3573.422121] Lustre: 19067:0:(libcfs_string.c:127:cfs_str2mask()) unknown mask ' '.
mask usage: [+|-]<all|type> ...
[ 3574.553074] Lustre: DEBUG MARKER: /usr/sbin/lctl mark  sanity test_133g: @@@@@@ FAIL: mds1 find \/proc\/fs\/lustre\/

Looking at the dmesg log on MDS1, we see similar output, but with a few more lines:

[ 3479.763450] Lustre: DEBUG MARKER: find /proc/fs/lustre/ /proc/sys/lnet/ /sys/fs/lustre/ /sys/kernel/debug/lnet/ /sys/kernel/debug/lustre/ -type f -not -name force_lbug -not -name changelog_mask -exec badarea_io {} \;
[ 3479.911266] Lustre: 17021:0:(mdt_coordinator.c:1944:mdt_hsm_policy_seq_write()) lustre-MDT0000: ' ' is unknown, supported policies are:
[ 3479.944620] LustreError: 17069:0:(mdt_coordinator.c:2097:mdt_hsm_cdt_control_seq_write()) lustre-MDT0000: Valid coordinator control commands are: enabled shutdown disabled purge help
[ 3479.950356] Lustre: 17071:0:(lprocfs_status.c:2483:lprocfs_wr_root_squash()) lustre-MDT0000: failed to set root_squash due to bad address, rc = -14
[ 3479.954690] Lustre: 17071:0:(lprocfs_status.c:2479:lprocfs_wr_root_squash()) lustre-MDT0000: failed to set root_squash to "", needs uid:gid format, rc = -22
[ 3479.960431] LustreError: 17072:0:(genops.c:1540:obd_export_evict_by_uuid()) lustre-MDT0000: can't disconnect : no exports found
[ 3479.965980] LustreError: 17074:0:(mdt_lproc.c:366:lprocfs_identity_info_seq_write()) lustre-MDT0000: invalid data count = 5, size = 1048
[ 3479.970389] LustreError: 17074:0:(mdt_lproc.c:383:lprocfs_identity_info_seq_write()) lustre-MDT0000: MDS identity downcall bad params
[ 3479.975746] Lustre: 17075:0:(lprocfs_status.c:2551:lprocfs_wr_nosquash_nids()) lustre-MDT0000: failed to set nosquash_nids to "", bad address rc = -14
[ 3479.980510] Lustre: 17075:0:(lprocfs_status.c:2555:lprocfs_wr_nosquash_nids()) lustre-MDT0000: failed to set nosquash_nids due to string too long rc = -22
[ 3479.988148] LustreError: 17079:0:(mdt_lproc.c:298:mdt_identity_upcall_seq_write()) lustre-MDT0000: identity upcall too long
[ 3480.066568] LustreError: 17187:0:(lproc_fid.c:177:lprocfs_server_fid_width_seq_write()) ctl-lustre-MDT0000: invalid FID sequence width: rc = -14
[ 3480.104133] LustreError: 17240:0:(ldlm_resource.c:104:seq_watermark_write()) Failed to set lock_limit_mb, rc = -14.
[ 3480.122289] LustreError: 17250:0:(nodemap_storage.c:420:nodemap_idx_nodemap_update()) cannot add nodemap config to non-existing MGS.
[ 3480.128801] LustreError: 17252:0:(nodemap_handler.c:1049:nodemap_create()) cannot add nodemap: ' ': rc = -22
[ 3480.250476] LustreError: 17363:0:(ldlm_resource.c:355:lru_size_store()) lru_size: invalid value written
[ 3480.333333] Lustre: 17493:0:(libcfs_string.c:127:cfs_str2mask()) unknown mask ' '.
mask usage: [+|-]<all|type> ...
[ 3480.468566] Lustre: DEBUG MARKER: /usr/sbin/lctl mark  sanity test_133g: @@@@@@ FAIL: mds1 find \/proc\/fs\/lustre\/

This ticket was opened because we are seeing this failure when testing with a separate MGS and MDS. Logs for these failures are at:
https://testing.hpdd.intel.com/test_sets/308ca8f6-97d7-11e7-b944-5254006e85c2
https://testing.hpdd.intel.com/test_sets/774c9f26-97d7-11e7-b944-5254006e85c2

We have also seen this test fail frequently during interop testing.



 Comments   
Comment by Oleg Drokin [ 27/Sep/17 ]

So /proc/sys/lustre and /proc/sys/lnet no longer exist, and it seems the logic in the test cannot catch that?

this bit:

        local proc_dirs=$(eval \ls -d $proc_regexp 2>/dev/null)

is supposed to reduce the list from proc_regexp="/{proc,sys}/{fs,sys,kernel/debug}/{lustre,lnet}/" to:

# ls -ld /{proc,sys}/{fs,sys,kernel/debug}/{lustre,lnet}/ 2>/dev/null
dr-xr-xr-x 26 root root 0 Sep 27 14:13 /proc/fs/lustre/
drwxr-xr-x 21 root root 0 Sep 27 14:02 /sys/fs/lustre/
drwxr-xr-x  2 root root 0 Sep 27 14:02 /sys/kernel/debug/lnet/
drwxr-xr-x  2 root root 0 Sep 27 14:02 /sys/kernel/debug/lustre/

but in the test output we see that /proc/sys/lnet somehow made it into the list too.
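
To be explicit, brace expansion alone produces all twelve candidates; it is the ls existence check that is supposed to prune them:

# echo /{proc,sys}/{fs,sys,kernel/debug}/{lustre,lnet}/
/proc/fs/lustre/ /proc/fs/lnet/ /proc/sys/lustre/ /proc/sys/lnet/ /proc/kernel/debug/lustre/ /proc/kernel/debug/lnet/ /sys/fs/lustre/ /sys/fs/lnet/ /sys/sys/lustre/ /sys/sys/lnet/ /sys/kernel/debug/lustre/ /sys/kernel/debug/lnet/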

Interop testing should not be affected because we recheck the list of dirs from every facet separately (since LU-7092).

Comment by Andreas Dilger [ 27/Sep/17 ]

Is it possible that the pathnames are being expanded on the client instead of on the server? Looking at the debug logs, the "ls -d ..." appears to be working properly, but it is possible.

In any case, this seems like a bug in the test and not a functionality problem. I suspect that sufficient escaping of $proc_regexp might help.
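
A sketch of the failure mode that hypothesis implies (plain ssh stands in for the test's do_facet helper here; this illustrates the suspicion, not a confirmed trace). What matters is which host runs the ls that prunes the list:

# wrong: ls runs on the client, so the pruned list reflects the client's
# filesystem, and /proc/sys/lnet (present on the client) leaks into the
# list handed to find on the server
dirs=$(\ls -d /{proc,sys}/{fs,sys,kernel/debug}/{lustre,lnet}/ 2>/dev/null)
ssh mds1 "find $dirs -type f"

# right: ls runs on mds1, so only directories that exist there survive
dirs=$(ssh mds1 '\ls -d /{proc,sys}/{fs,sys,kernel/debug}/{lustre,lnet}/ 2>/dev/null')
ssh mds1 "find $dirs -type f"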

Comment by James Nunez (Inactive) [ 27/Sep/17 ]

I can show that /proc/sys/lnet does have files, and that /proc/sys/lustre does not make it into the list of directories to run find in. On the MDS of a test system:

# ls -d /{proc,sys}/{fs,sys,kernel/debug}/{lustre,lnet}
ls: cannot access /proc/fs/lnet: No such file or directory
ls: cannot access /proc/sys/lustre: No such file or directory
ls: cannot access /proc/kernel/debug/lustre: No such file or directory
ls: cannot access /proc/kernel/debug/lnet: No such file or directory
ls: cannot access /sys/fs/lnet: No such file or directory
ls: cannot access /sys/sys/lustre: No such file or directory
ls: cannot access /sys/sys/lnet: No such file or directory
/proc/fs/lustre  /sys/fs/lustre          /sys/kernel/debug/lustre
/proc/sys/lnet   /sys/kernel/debug/lnet

I'm running the latest master, 2.10.53, on kernel 3.10.0-693.1.1.el7_lustre.x86_64.

I'm a little confused by the output from the find command in this test. Why, and from where, are we getting these Lustre and LustreError messages:

...
[ 3480.122289] LustreError: 17250:0:(nodemap_storage.c:420:nodemap_idx_nodemap_update()) cannot add nodemap config to non-existing MGS.
[ 3480.128801] LustreError: 17252:0:(nodemap_handler.c:1049:nodemap_create()) cannot add nodemap: ' ': rc = -22
[ 3480.250476] LustreError: 17363:0:(ldlm_resource.c:355:lru_size_store()) lru_size: invalid value written
[ 3480.333333] Lustre: 17493:0:(libcfs_string.c:127:cfs_str2mask()) unknown mask ' '.
mask usage: [+|-]<all|type> ...
[ 3480.468566] Lustre: DEBUG MARKER: /usr/sbin/lctl mark  sanity test_133g: @@@@@@ FAIL: mds1 find \/proc\/fs\/lustre\/

I get the same output (those errors in the MDS dmesg log) on my test system, but the test doesn't fail.

Comment by James A Simmons [ 28/Sep/17 ]

One of the tests is to scribble random data into the proc/sysfs/debugfs entries. This is to ensure we don't oops or touch user-space memory incorrectly.
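
In shell terms that boils down to the find/-exec loop shown in the MDS log above; badarea_io itself issues read() and write() calls with deliberately invalid user buffers, which is what produces the "bad address, rc = -14" (-EFAULT) lines:

        find $proc_dirs -type f -not -name force_lbug \
                -not -name changelog_mask -exec badarea_io {} \;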

Comment by Oleg Drokin [ 28/Sep/17 ]

Yes, as James Simmons said, the errors are normal - this is because we write garbage to random files.
The failure happens because when you run find on a list of directories, some of which don't exist, you get an error. So we need to find which directory does not exist and why it was not filtered out.
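
The underlying behaviour is easy to reproduce; GNU find exits non-zero as soon as any of its starting points is missing, which is what the test's error check trips on:

# mkdir -p /tmp/exists
# find /tmp/exists /tmp/missing -type f
find: ‘/tmp/missing’: No such file or directory
# echo $?
1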

Comment by Andreas Dilger [ 04/Nov/17 ]

Hit this again:
https://testing.hpdd.intel.com/test_sets/5cdcc7f8-c08e-11e7-8cd9-52540065bddc

Comment by Gerrit Updater [ 15/Nov/17 ]

John L. Hammond (john.hammond@intel.com) uploaded a new patch: https://review.whamcloud.com/30105
Subject: LU-10038 test: improve error reporting in sanity test_133g()
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 4bfa1ce0881d3dbe737a2a1e5b2d85679fb41993

Comment by Gerrit Updater [ 22/Nov/17 ]

Jinshan Xiong (jinshan.xiong@intel.com) uploaded a new patch: https://review.whamcloud.com/30219
Subject: LU-10038 test: improve error reporting in sanity test_133g()
Project: fs/lustre-release
Branch: flr
Current Patch Set: 1
Commit: 4cef38d724a05ce0ae386cb9d2a618d187cbe8d1

Comment by Gerrit Updater [ 23/Nov/17 ]

Jinshan Xiong (jinshan.xiong@intel.com) merged in patch https://review.whamcloud.com/30219/
Subject: LU-10038 test: improve error reporting in sanity test_133g()
Project: fs/lustre-release
Branch: flr
Current Patch Set:
Commit: f1420059ac7d33cba65ab1b14fd5eade3c889684

Comment by Gerrit Updater [ 29/Nov/17 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/30105/
Subject: LU-10038 test: improve error reporting in sanity test_133g()
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: e5eaaff6e378b8c95d0a809f4dd3b4817d9fd492

Comment by John Hammond [ 08/Dec/17 ]

Here's a failure that occurred after the error reporting patch:
https://testing.hpdd.intel.com/sub_tests/ed882ff6-d54a-11e7-a066-52540065bddc

It looks like one of the export directories is going away while find is running on the MDT:

== sanity test 133g: Check for Oopses on bad io area writes/reads in /proc =========================== 20:28:46 (1511987326)
proc_dirs='/proc/fs/lustre/
/sys/fs/lustre/
/sys/kernel/debug/lnet/
/sys/kernel/debug/lustre/'
CMD: onyx-38vm9 /usr/sbin/lctl get_param -n version 2>/dev/null ||
				/usr/sbin/lctl lustre_build_version 2>/dev/null ||
				/usr/sbin/lctl --version 2>/dev/null | cut -d' ' -f2
mds1_proc_dirs='/proc/fs/lustre/
/sys/fs/lustre/
/sys/kernel/debug/lnet/
/sys/kernel/debug/lustre/'
CMD: onyx-38vm9 find /proc/fs/lustre/ /sys/fs/lustre/ /sys/kernel/debug/lnet/ /sys/kernel/debug/lustre/ -type f -not -name force_lbug -not -name changelog_mask -exec badarea_io {} \;
onyx-38vm9: find: ‘/proc/fs/lustre/mdt/lustre-MDT0000/exports/10.2.8.104@tcp’: No such file or directory

Comment by John Hammond [ 08/Dec/17 ]

This must be because we are writing to the .../exports/clear file between readdir and accessing the 10.2.8.104@tcp subdir. find has an -ignore_readdir_race option that should address this.
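
Presumably that means adding the option ahead of the tests in the existing invocation, something like (a sketch; the actual change is in the patch below):

        find $facet_proc_dirs -ignore_readdir_race -type f \
                -not -name force_lbug -not -name changelog_mask \
                -exec badarea_io {} \;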

Comment by Gerrit Updater [ 08/Dec/17 ]

John L. Hammond (john.hammond@intel.com) uploaded a new patch: https://review.whamcloud.com/30451
Subject: LU-10038 test: ignore readdir races in sanity 133g
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 7e1322643470688cb0c306dc866e3c2f84ad4c26

Comment by Andreas Dilger [ 08/Dec/17 ]

Very interesting. I wouldn’t have thought of this as the cause.

Comment by Gerrit Updater [ 17/Dec/17 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/30451/
Subject: LU-10038 test: ignore readdir races in sanity 133g
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 45f99f6562349f77be6b47bf1bc5a94abf9fd11d

Comment by Joseph Gmitter (Inactive) [ 28/Mar/18 ]

Patches have landed to master for 2.11.0.
