Lustre / LU-10038

sanity test 133g fails with 'mds1 find /proc/fs/lustre/ /proc/sys/lnet/ /sys/fs/lustre/ /sys/kernel/debug/lnet/ /sys/kernel/debug/lustre/ failed'

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Affects Version: Lustre 2.11.0
    • Fix Version: Lustre 2.11.0
    • Severity: 3

    Description

      sanity test_133g fails on the call to find on the MDS. From the test log, we see the call to find:

      mds1_proc_dirs='/proc/fs/lustre/
      /proc/sys/lnet/
      /sys/fs/lustre/
      /sys/kernel/debug/lnet/
      /sys/kernel/debug/lustre/'
       sanity test_133g: @@@@@@ FAIL: mds1 find /proc/fs/lustre/
      /proc/sys/lnet/
      /sys/fs/lustre/
      /sys/kernel/debug/lnet/
      /sys/kernel/debug/lustre/ failed 
      

      Looking at the client2 dmesg, we see the output from 133g

      [ 3573.106070] Lustre: DEBUG MARKER: == sanity test 133g: Check for Oopses on bad io area writes/reads in /proc =========================== 21:44:29 (1504907069)
      [ 3573.249527] Lustre: 18911:0:(lprocfs_status.c:2483:lprocfs_wr_root_squash()) lustre: failed to set root_squash due to bad address, rc = -14
      [ 3573.254669] Lustre: 18911:0:(lprocfs_status.c:2479:lprocfs_wr_root_squash()) lustre: failed to set root_squash to "", needs uid:gid format, rc = -22
      [ 3573.263113] Lustre: 18916:0:(lprocfs_status.c:2551:lprocfs_wr_nosquash_nids()) lustre: failed to set nosquash_nids to "", bad address rc = -14
      [ 3573.268474] Lustre: 18916:0:(lprocfs_status.c:2555:lprocfs_wr_nosquash_nids()) lustre: failed to set nosquash_nids due to string too long rc = -22
      [ 3573.349716] LustreError: 18934:0:(gss_cli_upcall.c:240:gss_do_ctx_init_rpc()) ioctl size 5, expect 80, please check lgss_keyring version
      [ 3573.379338] LustreError: 18984:0:(ldlm_resource.c:355:lru_size_store()) lru_size: invalid value written
      [ 3573.422121] Lustre: 19067:0:(libcfs_string.c:127:cfs_str2mask()) unknown mask ' '.
      mask usage: [+|-]<all|type> ...
      [ 3574.553074] Lustre: DEBUG MARKER: /usr/sbin/lctl mark  sanity test_133g: @@@@@@ FAIL: mds1 find \/proc\/fs\/lustre\/
      

      Looking at the dmesg log on MDS1, we see similar output, with a few more lines:

      [ 3479.763450] Lustre: DEBUG MARKER: find /proc/fs/lustre/ /proc/sys/lnet/ /sys/fs/lustre/ /sys/kernel/debug/lnet/ /sys/kernel/debug/lustre/ -type f -not -name force_lbug -not -name changelog_mask -exec badarea_io {} \;
      [ 3479.911266] Lustre: 17021:0:(mdt_coordinator.c:1944:mdt_hsm_policy_seq_write()) lustre-MDT0000: ' ' is unknown, supported policies are:
      [ 3479.944620] LustreError: 17069:0:(mdt_coordinator.c:2097:mdt_hsm_cdt_control_seq_write()) lustre-MDT0000: Valid coordinator control commands are: enabled shutdown disabled purge help
      [ 3479.950356] Lustre: 17071:0:(lprocfs_status.c:2483:lprocfs_wr_root_squash()) lustre-MDT0000: failed to set root_squash due to bad address, rc = -14
      [ 3479.954690] Lustre: 17071:0:(lprocfs_status.c:2479:lprocfs_wr_root_squash()) lustre-MDT0000: failed to set root_squash to "", needs uid:gid format, rc = -22
      [ 3479.960431] LustreError: 17072:0:(genops.c:1540:obd_export_evict_by_uuid()) lustre-MDT0000: can't disconnect : no exports found
      [ 3479.965980] LustreError: 17074:0:(mdt_lproc.c:366:lprocfs_identity_info_seq_write()) lustre-MDT0000: invalid data count = 5, size = 1048
      [ 3479.970389] LustreError: 17074:0:(mdt_lproc.c:383:lprocfs_identity_info_seq_write()) lustre-MDT0000: MDS identity downcall bad params
      [ 3479.975746] Lustre: 17075:0:(lprocfs_status.c:2551:lprocfs_wr_nosquash_nids()) lustre-MDT0000: failed to set nosquash_nids to "", bad address rc = -14
      [ 3479.980510] Lustre: 17075:0:(lprocfs_status.c:2555:lprocfs_wr_nosquash_nids()) lustre-MDT0000: failed to set nosquash_nids due to string too long rc = -22
      [ 3479.988148] LustreError: 17079:0:(mdt_lproc.c:298:mdt_identity_upcall_seq_write()) lustre-MDT0000: identity upcall too long
      [ 3480.066568] LustreError: 17187:0:(lproc_fid.c:177:lprocfs_server_fid_width_seq_write()) ctl-lustre-MDT0000: invalid FID sequence width: rc = -14
      [ 3480.104133] LustreError: 17240:0:(ldlm_resource.c:104:seq_watermark_write()) Failed to set lock_limit_mb, rc = -14.
      [ 3480.122289] LustreError: 17250:0:(nodemap_storage.c:420:nodemap_idx_nodemap_update()) cannot add nodemap config to non-existing MGS.
      [ 3480.128801] LustreError: 17252:0:(nodemap_handler.c:1049:nodemap_create()) cannot add nodemap: ' ': rc = -22
      [ 3480.250476] LustreError: 17363:0:(ldlm_resource.c:355:lru_size_store()) lru_size: invalid value written
      [ 3480.333333] Lustre: 17493:0:(libcfs_string.c:127:cfs_str2mask()) unknown mask ' '.
      mask usage: [+|-]<all|type> ...
      [ 3480.468566] Lustre: DEBUG MARKER: /usr/sbin/lctl mark  sanity test_133g: @@@@@@ FAIL: mds1 find \/proc\/fs\/lustre\/
      

      This ticket was opened because we are seeing this failure while testing with a separate MGS and MDS. Logs for these failures are at:
      https://testing.hpdd.intel.com/test_sets/308ca8f6-97d7-11e7-b944-5254006e85c2
      https://testing.hpdd.intel.com/test_sets/774c9f26-97d7-11e7-b944-5254006e85c2

      We have also seen this test fail frequently during interop testing.
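For reference, the probe test_133g performs is the find invocation captured in the MDS dmesg above: walk the Lustre proc/sys/debugfs trees and run badarea_io on every file except force_lbug and changelog_mask. Below is a minimal, Lustre-free sketch of the same pattern; the temporary tree and the use of true in place of badarea_io are stand-ins, not part of the real test.

```shell
#!/bin/sh
# Stand-in tree; a real run walks /proc/fs/lustre/, /sys/fs/lustre/, etc.
tmp=$(mktemp -d)
mkdir -p "$tmp/mdt/lustre-MDT0000"
touch "$tmp/mdt/lustre-MDT0000/stats" \
      "$tmp/mdt/lustre-MDT0000/force_lbug" \
      "$tmp/mdt/lustre-MDT0000/changelog_mask"

# Same shape as the logged command; 'true' substitutes for badarea_io.
find "$tmp" -type f \
    -not -name force_lbug -not -name changelog_mask \
    -exec true {} \;
rc=$?
echo "rc=$rc"

rm -rf "$tmp"
```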


          Activity


            adilger Andreas Dilger added a comment -

            Very interesting. I wouldn't have thought of this as the cause.

            gerrit Gerrit Updater added a comment -

            John L. Hammond (john.hammond@intel.com) uploaded a new patch: https://review.whamcloud.com/30451
            Subject: LU-10038 test: ignore readdir races in sanity 133g
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 7e1322643470688cb0c306dc866e3c2f84ad4c26
            jhammond John Hammond added a comment -

            This must be because we are writing to the .../exports/clear file between readdir and accessing the 10.2.8.104@tcp subdir. find has a -ignore_readdir_race option that should address this.
            jhammond John Hammond added a comment -

            Here's a failure that occurred after the error reporting patch:
            https://testing.hpdd.intel.com/sub_tests/ed882ff6-d54a-11e7-a066-52540065bddc

            It looks like one of the export directories is going away while find is running on the MDT:

            == sanity test 133g: Check for Oopses on bad io area writes/reads in /proc =========================== 20:28:46 (1511987326)
            proc_dirs='/proc/fs/lustre/
            /sys/fs/lustre/
            /sys/kernel/debug/lnet/
            /sys/kernel/debug/lustre/'
            CMD: onyx-38vm9 /usr/sbin/lctl get_param -n version 2>/dev/null ||
            				/usr/sbin/lctl lustre_build_version 2>/dev/null ||
            				/usr/sbin/lctl --version 2>/dev/null | cut -d' ' -f2
            mds1_proc_dirs='/proc/fs/lustre/
            /sys/fs/lustre/
            /sys/kernel/debug/lnet/
            /sys/kernel/debug/lustre/'
            CMD: onyx-38vm9 find /proc/fs/lustre/ /sys/fs/lustre/ /sys/kernel/debug/lnet/ /sys/kernel/debug/lustre/ -type f -not -name force_lbug -not -name changelog_mask -exec badarea_io {} \;
            onyx-38vm9: find: ‘/proc/fs/lustre/mdt/lustre-MDT0000/exports/10.2.8.104@tcp’: No such file or directory
            
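The fix John mentions can be illustrated as follows. This is only a sketch, not the actual sanity.sh change: it shows GNU find's -ignore_readdir_race swallowing the error for a directory entry that is removed after being read but before being visited, which is the exports/ race described above (the a/b names are invented).

```shell
#!/bin/sh
tmp=$(mktemp -d)
mkdir -p "$tmp/exports/a" "$tmp/exports/b"

# While visiting 'a', remove its sibling 'b', which find may have already
# read from the directory but not yet visited -- like an export directory
# vanishing mid-walk. With -ignore_readdir_race the vanished entry is
# skipped silently instead of failing the whole run.
find "$tmp/exports" -ignore_readdir_race -mindepth 1 -type d \
    -exec sh -c 'case "$1" in */a) rm -rf "${1%/a}/b" ;; esac; true' _ {} \;
rc=$?
echo "rc=$rc"

rm -rf "$tmp"
```

Whether the race actually fires depends on readdir order (b must still be pending when a is visited), but with the flag the exit status is 0 either way.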

            gerrit Gerrit Updater added a comment -

            Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/30105/
            Subject: LU-10038 test: improve error reporting in sanity test_133g()
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: e5eaaff6e378b8c95d0a809f4dd3b4817d9fd492

            gerrit Gerrit Updater added a comment -

            Jinshan Xiong (jinshan.xiong@intel.com) merged in patch https://review.whamcloud.com/30219/
            Subject: LU-10038 test: improve error reporting in sanity test_133g()
            Project: fs/lustre-release
            Branch: flr
            Current Patch Set:
            Commit: f1420059ac7d33cba65ab1b14fd5eade3c889684

            gerrit Gerrit Updater added a comment -

            Jinshan Xiong (jinshan.xiong@intel.com) uploaded a new patch: https://review.whamcloud.com/30219
            Subject: LU-10038 test: improve error reporting in sanity test_133g()
            Project: fs/lustre-release
            Branch: flr
            Current Patch Set: 1
            Commit: 4cef38d724a05ce0ae386cb9d2a618d187cbe8d1

            gerrit Gerrit Updater added a comment -

            John L. Hammond (john.hammond@intel.com) uploaded a new patch: https://review.whamcloud.com/30105
            Subject: LU-10038 test: improve error reporting in sanity test_133g()
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 4bfa1ce0881d3dbe737a2a1e5b2d85679fb41993
            adilger Andreas Dilger added a comment -

            Hit this again: https://testing.hpdd.intel.com/test_sets/5cdcc7f8-c08e-11e7-8cd9-52540065bddc
            green Oleg Drokin added a comment -

            Yes, as James Simmons said, the errors are normal - this is because we write garbage to random files.
            The real problem behind the failure is that when you run find on a list of directories, some of which don't exist, you get an error. So we need to find which directory does not exist and why it was not filtered out.
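Oleg's point is easy to demonstrate: GNU find exits non-zero if any starting point named on its command line does not exist, so the directory list has to be filtered before the call. A small illustration (the present/missing names are invented):

```shell
#!/bin/sh
tmp=$(mktemp -d)
mkdir "$tmp/present"

# A missing starting point fails the whole invocation ...
find "$tmp/present" "$tmp/missing" -type f 2>/dev/null
echo "unfiltered rc=$?"   # non-zero

# ... so keep only the directories that actually exist.
dirs=""
for d in "$tmp/present" "$tmp/missing"; do
    [ -d "$d" ] && dirs="$dirs $d"
done
# $dirs is deliberately unquoted so the list word-splits.
find $dirs -type f
rc=$?
echo "filtered rc=$rc"    # rc=0

rm -rf "$tmp"
```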

            simmonsja James A Simmons added a comment -

            One of the tests is to scribble random data into the proc/sysfs/debugfs entries. This is to ensure we don't oops or touch user-space memory incorrectly.
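To make James's description concrete, here is a conceptual stand-in only (do not point writes like this at a live /proc outside the test harness; the scratch file merely plays the role of a tunable). The point of the real test is that each garbage write is rejected cleanly, as in the rc = -14 / rc = -22 dmesg lines above, rather than oopsing the kernel:

```shell
#!/bin/sh
# A scratch file stands in for a proc/sysfs/debugfs tunable.
tmp=$(mktemp -d)
printf '0' > "$tmp/tunable"

# Scribble random bytes into every "tunable"; a failed write (bad format,
# bad address, ...) is an acceptable outcome -- only a crash would be a bug.
rejected=0
for f in "$tmp"/*; do
    head -c 16 /dev/urandom > "$f" 2>/dev/null || rejected=$((rejected + 1))
done
echo "writes done, rejected=$rejected"

rm -rf "$tmp"
```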

            People

              Assignee: wc-triage WC Triage
              Reporter: jamesanunez James Nunez (Inactive)
              Votes: 0
              Watchers: 7

              Dates

                Created:
                Updated:
                Resolved: