[LU-10038] sanity test 133g fails with "mds1 find /proc/fs/lustre/ /proc/sys/lnet/ /sys/fs/lustre/ /sys/kernel/debug/lnet/ /sys/kernel/debug/lustre/ failed" Created: 27/Sep/17 Updated: 17/Jul/18 Resolved: 28/Mar/18 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.11.0 |
| Fix Version/s: | Lustre 2.11.0 |
| Type: | Bug | Priority: | Minor |
| Reporter: | James Nunez (Inactive) | Assignee: | WC Triage |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | tests | ||
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
sanity test_133g fails on the call to find on the MDS. From the test_log, we see the call to find:
mds1_proc_dirs='/proc/fs/lustre/ /proc/sys/lnet/ /sys/fs/lustre/ /sys/kernel/debug/lnet/ /sys/kernel/debug/lustre/'
sanity test_133g: @@@@@@ FAIL: mds1 find /proc/fs/lustre/ /proc/sys/lnet/ /sys/fs/lustre/ /sys/kernel/debug/lnet/ /sys/kernel/debug/lustre/ failed
Looking at the client2 dmesg, we see the output from 133g:
[ 3573.106070] Lustre: DEBUG MARKER: == sanity test 133g: Check for Oopses on bad io area writes/reads in /proc =========================== 21:44:29 (1504907069)
[ 3573.249527] Lustre: 18911:0:(lprocfs_status.c:2483:lprocfs_wr_root_squash()) lustre: failed to set root_squash due to bad address, rc = -14
[ 3573.254669] Lustre: 18911:0:(lprocfs_status.c:2479:lprocfs_wr_root_squash()) lustre: failed to set root_squash to "", needs uid:gid format, rc = -22
[ 3573.263113] Lustre: 18916:0:(lprocfs_status.c:2551:lprocfs_wr_nosquash_nids()) lustre: failed to set nosquash_nids to "", bad address rc = -14
[ 3573.268474] Lustre: 18916:0:(lprocfs_status.c:2555:lprocfs_wr_nosquash_nids()) lustre: failed to set nosquash_nids due to string too long rc = -22
[ 3573.349716] LustreError: 18934:0:(gss_cli_upcall.c:240:gss_do_ctx_init_rpc()) ioctl size 5, expect 80, please check lgss_keyring version
[ 3573.379338] LustreError: 18984:0:(ldlm_resource.c:355:lru_size_store()) lru_size: invalid value written
[ 3573.422121] Lustre: 19067:0:(libcfs_string.c:127:cfs_str2mask()) unknown mask ' '. mask usage: [+|-]<all|type> ...
[ 3574.553074] Lustre: DEBUG MARKER: /usr/sbin/lctl mark sanity test_133g: @@@@@@ FAIL: mds1 find \/proc\/fs\/lustre\/
Looking at the dmesg log on the MDS1, we see similar output, but a few more lines:
[ 3479.763450] Lustre: DEBUG MARKER: find /proc/fs/lustre/ /proc/sys/lnet/ /sys/fs/lustre/ /sys/kernel/debug/lnet/ /sys/kernel/debug/lustre/ -type f -not -name force_lbug -not -name changelog_mask -exec badarea_io {} \;
[ 3479.911266] Lustre: 17021:0:(mdt_coordinator.c:1944:mdt_hsm_policy_seq_write()) lustre-MDT0000: ' ' is unknown, supported policies are:
[ 3479.944620] LustreError: 17069:0:(mdt_coordinator.c:2097:mdt_hsm_cdt_control_seq_write()) lustre-MDT0000: Valid coordinator control commands are: enabled shutdown disabled purge help
[ 3479.950356] Lustre: 17071:0:(lprocfs_status.c:2483:lprocfs_wr_root_squash()) lustre-MDT0000: failed to set root_squash due to bad address, rc = -14
[ 3479.954690] Lustre: 17071:0:(lprocfs_status.c:2479:lprocfs_wr_root_squash()) lustre-MDT0000: failed to set root_squash to "", needs uid:gid format, rc = -22
[ 3479.960431] LustreError: 17072:0:(genops.c:1540:obd_export_evict_by_uuid()) lustre-MDT0000: can't disconnect : no exports found
[ 3479.965980] LustreError: 17074:0:(mdt_lproc.c:366:lprocfs_identity_info_seq_write()) lustre-MDT0000: invalid data count = 5, size = 1048
[ 3479.970389] LustreError: 17074:0:(mdt_lproc.c:383:lprocfs_identity_info_seq_write()) lustre-MDT0000: MDS identity downcall bad params
[ 3479.975746] Lustre: 17075:0:(lprocfs_status.c:2551:lprocfs_wr_nosquash_nids()) lustre-MDT0000: failed to set nosquash_nids to "", bad address rc = -14
[ 3479.980510] Lustre: 17075:0:(lprocfs_status.c:2555:lprocfs_wr_nosquash_nids()) lustre-MDT0000: failed to set nosquash_nids due to string too long rc = -22
[ 3479.988148] LustreError: 17079:0:(mdt_lproc.c:298:mdt_identity_upcall_seq_write()) lustre-MDT0000: identity upcall too long
[ 3480.066568] LustreError: 17187:0:(lproc_fid.c:177:lprocfs_server_fid_width_seq_write()) ctl-lustre-MDT0000: invalid FID sequence width: rc = -14
[ 3480.104133] LustreError: 17240:0:(ldlm_resource.c:104:seq_watermark_write()) Failed to set lock_limit_mb, rc = -14.
[ 3480.122289] LustreError: 17250:0:(nodemap_storage.c:420:nodemap_idx_nodemap_update()) cannot add nodemap config to non-existing MGS.
[ 3480.128801] LustreError: 17252:0:(nodemap_handler.c:1049:nodemap_create()) cannot add nodemap: ' ': rc = -22
[ 3480.250476] LustreError: 17363:0:(ldlm_resource.c:355:lru_size_store()) lru_size: invalid value written
[ 3480.333333] Lustre: 17493:0:(libcfs_string.c:127:cfs_str2mask()) unknown mask ' '.
mask usage: [+|-]<all|type> ...
[ 3480.468566] Lustre: DEBUG MARKER: /usr/sbin/lctl mark sanity test_133g: @@@@@@ FAIL: mds1 find \/proc\/fs\/lustre\/
This ticket was opened because we are seeing this in testing a separate MGS and MDS. Logs for these failures are at:
We have also seen this test fail frequently during interop testing. |
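For reference, the failing step can be run by hand on the MDS; the command below is the same find invocation quoted from the MDS dmesg above (badarea_io is the helper binary from the Lustre test suite that performs the bad read/write against each file).
# Same scan the test performs on the server: feed every writable tunable a
# bad user-space buffer, skipping files that would deliberately crash the
# node (force_lbug) or change the debug mask (changelog_mask).
find /proc/fs/lustre/ /proc/sys/lnet/ /sys/fs/lustre/ \
     /sys/kernel/debug/lnet/ /sys/kernel/debug/lustre/ \
     -type f -not -name force_lbug -not -name changelog_mask \
     -exec badarea_io {} \;
echo "find rc=$?"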
| Comments |
| Comment by Oleg Drokin [ 27/Sep/17 ] |
|
So /proc/sys/lustre and /proc/sys/lnet no longer exist, and it seems the logic in the test cannot catch that. This bit:
local proc_dirs=$(eval \ls -d $proc_regexp 2>/dev/null)
is supposed to reduce the list from proc_regexp="/{proc,sys}/{fs,sys,kernel/debug}/{lustre,lnet}/" to:
# ls -ld /{proc,sys}/{fs,sys,kernel/debug}/{lustre,lnet}/ 2>/dev/null
dr-xr-xr-x 26 root root 0 Sep 27 14:13 /proc/fs/lustre/
drwxr-xr-x 21 root root 0 Sep 27 14:02 /sys/fs/lustre/
drwxr-xr-x 2 root root 0 Sep 27 14:02 /sys/kernel/debug/lnet/
drwxr-xr-x 2 root root 0 Sep 27 14:02 /sys/kernel/debug/lustre/
but in the test output we see /proc/sys/lnet somehow made it to the list too. Interop testing should not be affected because we recheck the list of dirs from every facet separately (since -- |
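As a rough sketch of what that filtering step is meant to do (the proc_regexp value is the one quoted above; the rest is illustrative, not the actual sanity.sh code):
# eval forces the shell to brace-expand $proc_regexp, and ls -d silently
# drops every candidate directory that does not exist on this node, so
# only the paths present on the local kernel survive into the list.
proc_regexp="/{proc,sys}/{fs,sys,kernel/debug}/{lustre,lnet}/"
proc_dirs=$(eval \ls -d $proc_regexp 2>/dev/null)
echo "$proc_dirs"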
| Comment by Andreas Dilger [ 27/Sep/17 ] |
|
Is it possible that the pathnames are being expanded on the client instead of on the server? Looking at the debug logs, it looks like the "ls -d ..." is working properly, but it is possible. In any case, this seems like a bug in the test and not a functionality problem. I suspect that sufficient escaping of $proc_regexp might help. |
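A hedged sketch of that idea, assuming the test-framework helper do_facet is used to run the command on the MDS (the exact quoting in sanity.sh may differ):
# Keep the brace pattern unexpanded until it reaches the server, then let
# the remote shell's eval + ls -d do the expansion and the existence
# filtering there, so the list reflects the MDS's /proc and /sys rather
# than the client's.
proc_regexp="/{proc,sys}/{fs,sys,kernel/debug}/{lustre,lnet}/"
mds1_proc_dirs=$(do_facet mds1 "eval ls -d $proc_regexp 2>/dev/null")
echo "$mds1_proc_dirs"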
| Comment by James Nunez (Inactive) [ 27/Sep/17 ] |
|
I show that /proc/sys/lnet does have files, while /proc/sys/lustre does not make it into the list of dirs to run find in. On the MDS of a test system:
# ls -d /{proc,sys}/{fs,sys,kernel/debug}/{lustre,lnet}
ls: cannot access /proc/fs/lnet: No such file or directory
ls: cannot access /proc/sys/lustre: No such file or directory
ls: cannot access /proc/kernel/debug/lustre: No such file or directory
ls: cannot access /proc/kernel/debug/lnet: No such file or directory
ls: cannot access /sys/fs/lnet: No such file or directory
ls: cannot access /sys/sys/lustre: No such file or directory
ls: cannot access /sys/sys/lnet: No such file or directory
/proc/fs/lustre /sys/fs/lustre /sys/kernel/debug/lustre
/proc/sys/lnet /sys/kernel/debug/lnet
I'm running on latest master, 2.10.53, with kernel 3.10.0-693.1.1.el7_lustre.x86_64. I'm a little confused by the output from the find command in this test. Why, or from where, are we getting those Lustre and LustreError messages:
...
[ 3480.122289] LustreError: 17250:0:(nodemap_storage.c:420:nodemap_idx_nodemap_update()) cannot add nodemap config to non-existing MGS.
[ 3480.128801] LustreError: 17252:0:(nodemap_handler.c:1049:nodemap_create()) cannot add nodemap: ' ': rc = -22
[ 3480.250476] LustreError: 17363:0:(ldlm_resource.c:355:lru_size_store()) lru_size: invalid value written
[ 3480.333333] Lustre: 17493:0:(libcfs_string.c:127:cfs_str2mask()) unknown mask ' '. mask usage: [+|-]<all|type> ...
[ 3480.468566] Lustre: DEBUG MARKER: /usr/sbin/lctl mark sanity test_133g: @@@@@@ FAIL: mds1 find \/proc\/fs\/lustre\/
I get the same output, those errors in the MDS dmesg log, on my test system, but the test doesn't fail. |
| Comment by James A Simmons [ 28/Sep/17 ] |
|
One of the tests is to scribble random data into the proc/sysfs/debugfs entries. This is to ensure we don't oops or touch user-space memory incorrectly. |
| Comment by Oleg Drokin [ 28/Sep/17 ] |
|
Yes, like James Simmons said, the errors are normal - this is because we write garbage to random files. |
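For illustration only (a hedged sketch; the lru_size path matches the ldlm_resource.c message quoted above, but the exact location of the tunable depends on the Lustre version), this is the kind of write the test performs and the console warning it is expected to trigger:
# Write an obviously invalid value into a tunable and check that the kernel
# only complains in the log; an oops here would be the bug test_133g hunts.
# The path below is illustrative and may live under /proc/fs/lustre instead.
for f in /sys/fs/lustre/ldlm/namespaces/*/lru_size; do
    echo ' ' > "$f" 2>/dev/null || true
done
dmesg | grep 'lru_size: invalid value written' | tail -1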
| Comment by Andreas Dilger [ 04/Nov/17 ] |
|
Hit this again: |
| Comment by Gerrit Updater [ 15/Nov/17 ] |
|
John L. Hammond (john.hammond@intel.com) uploaded a new patch: https://review.whamcloud.com/30105 |
| Comment by Gerrit Updater [ 22/Nov/17 ] |
|
Jinshan Xiong (jinshan.xiong@intel.com) uploaded a new patch: https://review.whamcloud.com/30219 |
| Comment by Gerrit Updater [ 23/Nov/17 ] |
|
Jinshan Xiong (jinshan.xiong@intel.com) merged in patch https://review.whamcloud.com/30219/ |
| Comment by Gerrit Updater [ 29/Nov/17 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/30105/ |
| Comment by John Hammond [ 08/Dec/17 ] |
|
Here's a failure that occurred after the error reporting patch. It looks like one of the export directories is going away while find is running on the MDT:
== sanity test 133g: Check for Oopses on bad io area writes/reads in /proc =========================== 20:28:46 (1511987326)
proc_dirs='/proc/fs/lustre/
/sys/fs/lustre/
/sys/kernel/debug/lnet/
/sys/kernel/debug/lustre/'
CMD: onyx-38vm9 /usr/sbin/lctl get_param -n version 2>/dev/null ||
/usr/sbin/lctl lustre_build_version 2>/dev/null ||
/usr/sbin/lctl --version 2>/dev/null | cut -d' ' -f2
mds1_proc_dirs='/proc/fs/lustre/
/sys/fs/lustre/
/sys/kernel/debug/lnet/
/sys/kernel/debug/lustre/'
CMD: onyx-38vm9 find /proc/fs/lustre/ /sys/fs/lustre/ /sys/kernel/debug/lnet/ /sys/kernel/debug/lustre/ -type f -not -name force_lbug -not -name changelog_mask -exec badarea_io {} \;
onyx-38vm9: find: ‘/proc/fs/lustre/mdt/lustre-MDT0000/exports/10.2.8.104@tcp’: No such file or directory
|
| Comment by John Hammond [ 08/Dec/17 ] |
|
This must be because we are writing to the .../exports/clear file between readdir and accessing the 10.2.8.104@tcp subdir. find has a -ignore_readdir_race option that should address this. |
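A sketch of the adjusted command (the base invocation is the one from the test log above; adding -ignore_readdir_race is the proposed change, though the exact patch in Gerrit may differ):
# -ignore_readdir_race makes find ignore entries that vanish between
# readdir() and stat(), e.g. an export directory removed because badarea_io
# just wrote to .../exports/clear, so the race no longer fails the test.
find /proc/fs/lustre/ /sys/fs/lustre/ /sys/kernel/debug/lnet/ /sys/kernel/debug/lustre/ \
     -ignore_readdir_race \
     -type f -not -name force_lbug -not -name changelog_mask \
     -exec badarea_io {} \;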
| Comment by Gerrit Updater [ 08/Dec/17 ] |
|
John L. Hammond (john.hammond@intel.com) uploaded a new patch: https://review.whamcloud.com/30451 |
| Comment by Andreas Dilger [ 08/Dec/17 ] |
|
Very interesting. I wouldn’t have thought of this as the cause. |
| Comment by Gerrit Updater [ 17/Dec/17 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/30451/ |
| Comment by Joseph Gmitter (Inactive) [ 28/Mar/18 ] |
|
Patches have landed to master for 2.11.0. |