
sanity test 133g fails with "mds1 find /proc/fs/lustre/ /proc/sys/lnet/ /sys/fs/lustre/ /sys/kernel/debug/lnet/ /sys/kernel/debug/lustre/ failed"

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Affects Version: Lustre 2.11.0
    • Fix Version: Lustre 2.11.0
    • Severity: 3

    Description

      sanity test_133g fails on the call to find on the MDS. From the test log, we see the directory list passed to find:

      mds1_proc_dirs='/proc/fs/lustre/
      /proc/sys/lnet/
      /sys/fs/lustre/
      /sys/kernel/debug/lnet/
      /sys/kernel/debug/lustre/'
       sanity test_133g: @@@@@@ FAIL: mds1 find /proc/fs/lustre/
      /proc/sys/lnet/
      /sys/fs/lustre/
      /sys/kernel/debug/lnet/
      /sys/kernel/debug/lustre/ failed 
      

      Looking at the client2 dmesg, we see the output from test 133g:

      [ 3573.106070] Lustre: DEBUG MARKER: == sanity test 133g: Check for Oopses on bad io area writes/reads in /proc =========================== 21:44:29 (1504907069)
      [ 3573.249527] Lustre: 18911:0:(lprocfs_status.c:2483:lprocfs_wr_root_squash()) lustre: failed to set root_squash due to bad address, rc = -14
      [ 3573.254669] Lustre: 18911:0:(lprocfs_status.c:2479:lprocfs_wr_root_squash()) lustre: failed to set root_squash to "", needs uid:gid format, rc = -22
      [ 3573.263113] Lustre: 18916:0:(lprocfs_status.c:2551:lprocfs_wr_nosquash_nids()) lustre: failed to set nosquash_nids to "", bad address rc = -14
      [ 3573.268474] Lustre: 18916:0:(lprocfs_status.c:2555:lprocfs_wr_nosquash_nids()) lustre: failed to set nosquash_nids due to string too long rc = -22
      [ 3573.349716] LustreError: 18934:0:(gss_cli_upcall.c:240:gss_do_ctx_init_rpc()) ioctl size 5, expect 80, please check lgss_keyring version
      [ 3573.379338] LustreError: 18984:0:(ldlm_resource.c:355:lru_size_store()) lru_size: invalid value written
      [ 3573.422121] Lustre: 19067:0:(libcfs_string.c:127:cfs_str2mask()) unknown mask ' '.
      mask usage: [+|-]<all|type> ...
      [ 3574.553074] Lustre: DEBUG MARKER: /usr/sbin/lctl mark  sanity test_133g: @@@@@@ FAIL: mds1 find \/proc\/fs\/lustre\/
      

      Looking at the dmesg log on MDS1, we see similar output, but with a few more lines:

      [ 3479.763450] Lustre: DEBUG MARKER: find /proc/fs/lustre/ /proc/sys/lnet/ /sys/fs/lustre/ /sys/kernel/debug/lnet/ /sys/kernel/debug/lustre/ -type f -not -name force_lbug -not -name changelog_mask -exec badarea_io {} \;
      [ 3479.911266] Lustre: 17021:0:(mdt_coordinator.c:1944:mdt_hsm_policy_seq_write()) lustre-MDT0000: ' ' is unknown, supported policies are:
      [ 3479.944620] LustreError: 17069:0:(mdt_coordinator.c:2097:mdt_hsm_cdt_control_seq_write()) lustre-MDT0000: Valid coordinator control commands are: enabled shutdown disabled purge help
      [ 3479.950356] Lustre: 17071:0:(lprocfs_status.c:2483:lprocfs_wr_root_squash()) lustre-MDT0000: failed to set root_squash due to bad address, rc = -14
      [ 3479.954690] Lustre: 17071:0:(lprocfs_status.c:2479:lprocfs_wr_root_squash()) lustre-MDT0000: failed to set root_squash to "", needs uid:gid format, rc = -22
      [ 3479.960431] LustreError: 17072:0:(genops.c:1540:obd_export_evict_by_uuid()) lustre-MDT0000: can't disconnect : no exports found
      [ 3479.965980] LustreError: 17074:0:(mdt_lproc.c:366:lprocfs_identity_info_seq_write()) lustre-MDT0000: invalid data count = 5, size = 1048
      [ 3479.970389] LustreError: 17074:0:(mdt_lproc.c:383:lprocfs_identity_info_seq_write()) lustre-MDT0000: MDS identity downcall bad params
      [ 3479.975746] Lustre: 17075:0:(lprocfs_status.c:2551:lprocfs_wr_nosquash_nids()) lustre-MDT0000: failed to set nosquash_nids to "", bad address rc = -14
      [ 3479.980510] Lustre: 17075:0:(lprocfs_status.c:2555:lprocfs_wr_nosquash_nids()) lustre-MDT0000: failed to set nosquash_nids due to string too long rc = -22
      [ 3479.988148] LustreError: 17079:0:(mdt_lproc.c:298:mdt_identity_upcall_seq_write()) lustre-MDT0000: identity upcall too long
      [ 3480.066568] LustreError: 17187:0:(lproc_fid.c:177:lprocfs_server_fid_width_seq_write()) ctl-lustre-MDT0000: invalid FID sequence width: rc = -14
      [ 3480.104133] LustreError: 17240:0:(ldlm_resource.c:104:seq_watermark_write()) Failed to set lock_limit_mb, rc = -14.
      [ 3480.122289] LustreError: 17250:0:(nodemap_storage.c:420:nodemap_idx_nodemap_update()) cannot add nodemap config to non-existing MGS.
      [ 3480.128801] LustreError: 17252:0:(nodemap_handler.c:1049:nodemap_create()) cannot add nodemap: ' ': rc = -22
      [ 3480.250476] LustreError: 17363:0:(ldlm_resource.c:355:lru_size_store()) lru_size: invalid value written
      [ 3480.333333] Lustre: 17493:0:(libcfs_string.c:127:cfs_str2mask()) unknown mask ' '.
      mask usage: [+|-]<all|type> ...
      [ 3480.468566] Lustre: DEBUG MARKER: /usr/sbin/lctl mark  sanity test_133g: @@@@@@ FAIL: mds1 find \/proc\/fs\/lustre\/
      

      This ticket was opened because we are seeing this failure when testing with a separate MGS and MDS. Logs for these failures are at:
      https://testing.hpdd.intel.com/test_sets/308ca8f6-97d7-11e7-b944-5254006e85c2
      https://testing.hpdd.intel.com/test_sets/774c9f26-97d7-11e7-b944-5254006e85c2

      We have also seen this test fail frequently during interop testing.

          Activity

            [LU-10038] sanity test 133g fails with "mds1 find /proc/fs/lustre/ /proc/sys/lnet/ /sys/fs/lustre/ /sys/kernel/debug/lnet/ /sys/kernel/debug/lustre/ failed"

            Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/30105/
            Subject: LU-10038 test: improve error reporting in sanity test_133g()
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: e5eaaff6e378b8c95d0a809f4dd3b4817d9fd492

            gerrit Gerrit Updater added a comment

            Jinshan Xiong (jinshan.xiong@intel.com) merged in patch https://review.whamcloud.com/30219/
            Subject: LU-10038 test: improve error reporting in sanity test_133g()
            Project: fs/lustre-release
            Branch: flr
            Current Patch Set:
            Commit: f1420059ac7d33cba65ab1b14fd5eade3c889684

            gerrit Gerrit Updater added a comment

            Jinshan Xiong (jinshan.xiong@intel.com) uploaded a new patch: https://review.whamcloud.com/30219
            Subject: LU-10038 test: improve error reporting in sanity test_133g()
            Project: fs/lustre-release
            Branch: flr
            Current Patch Set: 1
            Commit: 4cef38d724a05ce0ae386cb9d2a618d187cbe8d1

            gerrit Gerrit Updater added a comment

            John L. Hammond (john.hammond@intel.com) uploaded a new patch: https://review.whamcloud.com/30105
            Subject: LU-10038 test: improve error reporting in sanity test_133g()
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 4bfa1ce0881d3dbe737a2a1e5b2d85679fb41993

            gerrit Gerrit Updater added a comment
            adilger Andreas Dilger added a comment - Hit this again: https://testing.hpdd.intel.com/test_sets/5cdcc7f8-c08e-11e7-8cd9-52540065bddc
            green Oleg Drokin added a comment -

            Yes, like James Simmons said, the errors are normal - this is because we write garbage to random files.
            The failure happens because when you run find on a list of directories, some of which don't exist, you get an error. So we need to find which directory does not exist and why it was not filtered out.
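Oleg's point can be reproduced with a plain shell sketch. The paths below are throwaway /tmp names, not the real /proc entries: find still scans the directories that exist, but exits non-zero as soon as any argument in its list is missing, which is the condition the test's error check trips over.

```shell
# Throwaway /tmp paths standing in for the real /proc and /sys directories.
mkdir -p /tmp/lu10038_demo/exists
rm -rf /tmp/lu10038_demo/missing

# find scans the directory that exists, but exits non-zero because one
# argument in the list is missing.
find /tmp/lu10038_demo/exists /tmp/lu10038_demo/missing -type f 2>/dev/null
echo "exit status with a missing dir: $?"    # non-zero

find /tmp/lu10038_demo/exists -type f 2>/dev/null
echo "exit status when all dirs exist: $?"   # 0
```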


            One of the tests is to scribble random data into the proc/sysfs/debugfs entries. This is to ensure we don't oops or touch user-space memory incorrectly.

            simmonsja James A Simmons added a comment
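For context, the scribbling step can be sketched as follows. This is only an illustration: badarea_io is the real helper, and here a shell one-liner writes random bytes into a throwaway /tmp tree instead of the live proc/sysfs/debugfs entries.

```shell
# Throwaway files standing in for proc/sysfs/debugfs tunables.
mkdir -p /tmp/lu10038_tree
printf 'x' > /tmp/lu10038_tree/tunable_a
printf 'y' > /tmp/lu10038_tree/tunable_b

# Mirror the shape of the real invocation, `find ... -exec badarea_io {} \;`,
# with a random-garbage write standing in for the helper. Per-file write
# errors are ignored: surviving them is the whole point of the real test.
find /tmp/lu10038_tree -type f \
    -exec sh -c 'head -c 8 /dev/urandom > "$1" 2>/dev/null || true' sh {} \;
echo "scribbled every entry without incident"
```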
            jamesanunez James Nunez (Inactive) added a comment - - edited

            I see that /proc/sys/lnet does have files, and that /proc/sys/lustre does not make it into the list of directories to run find in. On the MDS on a test system:

            # ls -d /{proc,sys}/{fs,sys,kernel/debug}/{lustre,lnet}
            ls: cannot access /proc/fs/lnet: No such file or directory
            ls: cannot access /proc/sys/lustre: No such file or directory
            ls: cannot access /proc/kernel/debug/lustre: No such file or directory
            ls: cannot access /proc/kernel/debug/lnet: No such file or directory
            ls: cannot access /sys/fs/lnet: No such file or directory
            ls: cannot access /sys/sys/lustre: No such file or directory
            ls: cannot access /sys/sys/lnet: No such file or directory
            /proc/fs/lustre  /sys/fs/lustre          /sys/kernel/debug/lustre
            /proc/sys/lnet   /sys/kernel/debug/lnet
            

            I'm running the latest master (2.10.53) on kernel 3.10.0-693.1.1.el7_lustre.x86_64.

            I'm a little confused by the output from the find command in this test. Why, and from where, are we getting these Lustre and LustreError messages:

            ...
            [ 3480.122289] LustreError: 17250:0:(nodemap_storage.c:420:nodemap_idx_nodemap_update()) cannot add nodemap config to non-existing MGS.
            [ 3480.128801] LustreError: 17252:0:(nodemap_handler.c:1049:nodemap_create()) cannot add nodemap: ' ': rc = -22
            [ 3480.250476] LustreError: 17363:0:(ldlm_resource.c:355:lru_size_store()) lru_size: invalid value written
            [ 3480.333333] Lustre: 17493:0:(libcfs_string.c:127:cfs_str2mask()) unknown mask ' '.
            mask usage: [+|-]<all|type> ...
            [ 3480.468566] Lustre: DEBUG MARKER: /usr/sbin/lctl mark  sanity test_133g: @@@@@@ FAIL: mds1 find \/proc\/fs\/lustre\/
            

            I get the same output (those errors in the MDS dmesg log) on my test system, but the test doesn't fail.


            Is it possible that the pathnames are being expanded on the client instead of on the server? Looking at the debug logs, it looks like the "ls -d ..." is working properly, but it is possible.

            In any case, this seems like a bug in the test and not a functionality problem. I suspect that sufficient escaping of $proc_regexp might help.
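A minimal sketch of the escaping question, with placeholder paths rather than the real $proc_regexp: unescaped braces are expanded by the first shell that sees them, so if the pattern is interpolated before the do_facet/ssh hop, the client's shell rather than the MDS's decides the word list.

```shell
# Brace expansion is a bash feature, so go through bash explicitly.
bash -c 'echo /x/{a,b}'      # unescaped: expanded right here -> /x/a /x/b
bash -c 'echo /x/\{a,b\}'    # escaped: the literal /x/{a,b} survives this
                             # shell and would reach the remote side intact
```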

            adilger Andreas Dilger added a comment
            green Oleg Drokin added a comment - - edited

            So /proc/sys/lustre and /proc/sys/lnet no longer exist, and it seems the logic in the test cannot catch that?

            this bit:

                    local proc_dirs=$(eval \ls -d $proc_regexp 2>/dev/null)
            
            

            is supposed to reduce the list from proc_regexp="/{proc,sys}/{fs,sys,kernel/debug}/{lustre,lnet}/" to:

            # ls -ld /{proc,sys}/{fs,sys,kernel/debug}/{lustre,lnet}/ 2>/dev/null
            dr-xr-xr-x 26 root root 0 Sep 27 14:13 /proc/fs/lustre/
            drwxr-xr-x 21 root root 0 Sep 27 14:02 /sys/fs/lustre/
            drwxr-xr-x  2 root root 0 Sep 27 14:02 /sys/kernel/debug/lnet/
            drwxr-xr-x  2 root root 0 Sep 27 14:02 /sys/kernel/debug/lustre/
            
            

            but in the test output we see that /proc/sys/lnet somehow made it into the list too.

            Interop testing should not be affected, because we recheck the list of directories from every facet separately (since LU-7092).
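The pruning Oleg describes can be reproduced with a made-up /tmp tree in place of the real pattern (assuming bash, which sanity.sh uses): eval forces brace expansion, and ls -d with stderr discarded keeps only the combinations that actually exist.

```shell
# Create only two of the eight combinations the pattern can produce.
mkdir -p /tmp/lu10038_f/proc/fs/lustre /tmp/lu10038_f/sys/fs/lustre

proc_regexp="/tmp/lu10038_f/{proc,sys}/{fs,sys}/{lustre,lnet}/"

# eval triggers brace expansion; ls -d then silently drops the six
# combinations that do not exist -- the filtering the test relies on.
proc_dirs=$(eval \ls -d $proc_regexp 2>/dev/null)
echo "$proc_dirs"    # only the two directories created above
```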


            People

              wc-triage WC Triage
              jamesanunez James Nunez (Inactive)