Lustre / LU-10038

sanity test 133g fails with 'mds1 find /proc/fs/lustre/ /proc/sys/lnet/ /sys/fs/lustre/ /sys/kernel/debug/lnet/ /sys/kernel/debug/lustre/ failed'

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Affects Version: Lustre 2.11.0
    • Fix Version: Lustre 2.11.0
    • Severity: 3

    Description

      sanity test_133g fails on the call to find on the MDS. From the test log, we see the call to find:

      mds1_proc_dirs='/proc/fs/lustre/
      /proc/sys/lnet/
      /sys/fs/lustre/
      /sys/kernel/debug/lnet/
      /sys/kernel/debug/lustre/'
       sanity test_133g: @@@@@@ FAIL: mds1 find /proc/fs/lustre/
      /proc/sys/lnet/
      /sys/fs/lustre/
      /sys/kernel/debug/lnet/
      /sys/kernel/debug/lustre/ failed 
      

      Looking at the client2 dmesg, we see the output from 133g

      [ 3573.106070] Lustre: DEBUG MARKER: == sanity test 133g: Check for Oopses on bad io area writes/reads in /proc =========================== 21:44:29 (1504907069)
      [ 3573.249527] Lustre: 18911:0:(lprocfs_status.c:2483:lprocfs_wr_root_squash()) lustre: failed to set root_squash due to bad address, rc = -14
      [ 3573.254669] Lustre: 18911:0:(lprocfs_status.c:2479:lprocfs_wr_root_squash()) lustre: failed to set root_squash to "", needs uid:gid format, rc = -22
      [ 3573.263113] Lustre: 18916:0:(lprocfs_status.c:2551:lprocfs_wr_nosquash_nids()) lustre: failed to set nosquash_nids to "", bad address rc = -14
      [ 3573.268474] Lustre: 18916:0:(lprocfs_status.c:2555:lprocfs_wr_nosquash_nids()) lustre: failed to set nosquash_nids due to string too long rc = -22
      [ 3573.349716] LustreError: 18934:0:(gss_cli_upcall.c:240:gss_do_ctx_init_rpc()) ioctl size 5, expect 80, please check lgss_keyring version
      [ 3573.379338] LustreError: 18984:0:(ldlm_resource.c:355:lru_size_store()) lru_size: invalid value written
      [ 3573.422121] Lustre: 19067:0:(libcfs_string.c:127:cfs_str2mask()) unknown mask ' '.
      mask usage: [+|-]<all|type> ...
      [ 3574.553074] Lustre: DEBUG MARKER: /usr/sbin/lctl mark  sanity test_133g: @@@@@@ FAIL: mds1 find \/proc\/fs\/lustre\/
      

      Looking at the dmesg log on MDS1, we see similar output, with a few more lines:

      [ 3479.763450] Lustre: DEBUG MARKER: find /proc/fs/lustre/ /proc/sys/lnet/ /sys/fs/lustre/ /sys/kernel/debug/lnet/ /sys/kernel/debug/lustre/ -type f -not -name force_lbug -not -name changelog_mask -exec badarea_io {} \;
      [ 3479.911266] Lustre: 17021:0:(mdt_coordinator.c:1944:mdt_hsm_policy_seq_write()) lustre-MDT0000: ' ' is unknown, supported policies are:
      [ 3479.944620] LustreError: 17069:0:(mdt_coordinator.c:2097:mdt_hsm_cdt_control_seq_write()) lustre-MDT0000: Valid coordinator control commands are: enabled shutdown disabled purge help
      [ 3479.950356] Lustre: 17071:0:(lprocfs_status.c:2483:lprocfs_wr_root_squash()) lustre-MDT0000: failed to set root_squash due to bad address, rc = -14
      [ 3479.954690] Lustre: 17071:0:(lprocfs_status.c:2479:lprocfs_wr_root_squash()) lustre-MDT0000: failed to set root_squash to "", needs uid:gid format, rc = -22
      [ 3479.960431] LustreError: 17072:0:(genops.c:1540:obd_export_evict_by_uuid()) lustre-MDT0000: can't disconnect : no exports found
      [ 3479.965980] LustreError: 17074:0:(mdt_lproc.c:366:lprocfs_identity_info_seq_write()) lustre-MDT0000: invalid data count = 5, size = 1048
      [ 3479.970389] LustreError: 17074:0:(mdt_lproc.c:383:lprocfs_identity_info_seq_write()) lustre-MDT0000: MDS identity downcall bad params
      [ 3479.975746] Lustre: 17075:0:(lprocfs_status.c:2551:lprocfs_wr_nosquash_nids()) lustre-MDT0000: failed to set nosquash_nids to "", bad address rc = -14
      [ 3479.980510] Lustre: 17075:0:(lprocfs_status.c:2555:lprocfs_wr_nosquash_nids()) lustre-MDT0000: failed to set nosquash_nids due to string too long rc = -22
      [ 3479.988148] LustreError: 17079:0:(mdt_lproc.c:298:mdt_identity_upcall_seq_write()) lustre-MDT0000: identity upcall too long
      [ 3480.066568] LustreError: 17187:0:(lproc_fid.c:177:lprocfs_server_fid_width_seq_write()) ctl-lustre-MDT0000: invalid FID sequence width: rc = -14
      [ 3480.104133] LustreError: 17240:0:(ldlm_resource.c:104:seq_watermark_write()) Failed to set lock_limit_mb, rc = -14.
      [ 3480.122289] LustreError: 17250:0:(nodemap_storage.c:420:nodemap_idx_nodemap_update()) cannot add nodemap config to non-existing MGS.
      [ 3480.128801] LustreError: 17252:0:(nodemap_handler.c:1049:nodemap_create()) cannot add nodemap: ' ': rc = -22
      [ 3480.250476] LustreError: 17363:0:(ldlm_resource.c:355:lru_size_store()) lru_size: invalid value written
      [ 3480.333333] Lustre: 17493:0:(libcfs_string.c:127:cfs_str2mask()) unknown mask ' '.
      mask usage: [+|-]<all|type> ...
      [ 3480.468566] Lustre: DEBUG MARKER: /usr/sbin/lctl mark  sanity test_133g: @@@@@@ FAIL: mds1 find \/proc\/fs\/lustre\/
      

      This ticket was opened because we are seeing this failure while testing with a separate MGS and MDS. Logs for these failures are at:
      https://testing.hpdd.intel.com/test_sets/308ca8f6-97d7-11e7-b944-5254006e85c2
      https://testing.hpdd.intel.com/test_sets/774c9f26-97d7-11e7-b944-5254006e85c2

      We have also seen this test fail frequently during interop testing.
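For reference, the probe test_133g performs is the find invocation captured in the MDS dmesg above: walk the Lustre proc/sys/debugfs trees and run badarea_io on every file except force_lbug and changelog_mask. Below is a minimal, Lustre-free sketch of the same pattern; the temporary tree and the use of true in place of badarea_io are stand-ins, not part of the real test.

```shell
#!/bin/sh
# Stand-in tree; a real run walks /proc/fs/lustre/, /sys/fs/lustre/, etc.
tmp=$(mktemp -d)
mkdir -p "$tmp/mdt/lustre-MDT0000"
touch "$tmp/mdt/lustre-MDT0000/stats" \
      "$tmp/mdt/lustre-MDT0000/force_lbug" \
      "$tmp/mdt/lustre-MDT0000/changelog_mask"

# Same shape as the logged command; 'true' substitutes for badarea_io.
find "$tmp" -type f \
    -not -name force_lbug -not -name changelog_mask \
    -exec true {} \;
rc=$?
echo "rc=$rc"

rm -rf "$tmp"
```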


          Activity


            adilger Andreas Dilger added a comment -

            Very interesting. I wouldn't have thought of this as the cause.

            gerrit Gerrit Updater added a comment -

            John L. Hammond (john.hammond@intel.com) uploaded a new patch: https://review.whamcloud.com/30451
            Subject: LU-10038 test: ignore readdir races in sanity 133g
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 7e1322643470688cb0c306dc866e3c2f84ad4c26
            jhammond John Hammond added a comment -

            This must be because we are writing to the .../exports/clear file between readdir and accessing the 10.2.8.104@tcp subdir. find has a -ignore_readdir_race option that should address this.
            jhammond John Hammond added a comment -

            Here's a failure that occurred after the error reporting patch:
            https://testing.hpdd.intel.com/sub_tests/ed882ff6-d54a-11e7-a066-52540065bddc

            It looks like one of the export directories is going away while find is running on the MDT:

            == sanity test 133g: Check for Oopses on bad io area writes/reads in /proc =========================== 20:28:46 (1511987326)
            proc_dirs='/proc/fs/lustre/
            /sys/fs/lustre/
            /sys/kernel/debug/lnet/
            /sys/kernel/debug/lustre/'
            CMD: onyx-38vm9 /usr/sbin/lctl get_param -n version 2>/dev/null ||
            				/usr/sbin/lctl lustre_build_version 2>/dev/null ||
            				/usr/sbin/lctl --version 2>/dev/null | cut -d' ' -f2
            mds1_proc_dirs='/proc/fs/lustre/
            /sys/fs/lustre/
            /sys/kernel/debug/lnet/
            /sys/kernel/debug/lustre/'
            CMD: onyx-38vm9 find /proc/fs/lustre/ /sys/fs/lustre/ /sys/kernel/debug/lnet/ /sys/kernel/debug/lustre/ -type f -not -name force_lbug -not -name changelog_mask -exec badarea_io {} \;
            onyx-38vm9: find: ‘/proc/fs/lustre/mdt/lustre-MDT0000/exports/10.2.8.104@tcp’: No such file or directory
            
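The fix John mentions can be illustrated as follows. This is only a sketch, not the actual sanity.sh change: it shows GNU find's -ignore_readdir_race swallowing the error for a directory entry that is removed after being read but before being visited, which is the exports/ race described above (the a/b names are invented).

```shell
#!/bin/sh
tmp=$(mktemp -d)
mkdir -p "$tmp/exports/a" "$tmp/exports/b"

# While visiting 'a', remove its sibling 'b', which find may have already
# read from the directory but not yet visited -- like an export directory
# vanishing mid-walk. With -ignore_readdir_race the vanished entry is
# skipped silently instead of failing the whole run.
find "$tmp/exports" -ignore_readdir_race -mindepth 1 -type d \
    -exec sh -c 'case "$1" in */a) rm -rf "${1%/a}/b" ;; esac; true' _ {} \;
rc=$?
echo "rc=$rc"

rm -rf "$tmp"
```

Whether the race actually fires depends on readdir order (b must still be pending when a is visited), but with the flag the exit status is 0 either way.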

            gerrit Gerrit Updater added a comment -

            Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/30105/
            Subject: LU-10038 test: improve error reporting in sanity test_133g()
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: e5eaaff6e378b8c95d0a809f4dd3b4817d9fd492

            gerrit Gerrit Updater added a comment -

            Jinshan Xiong (jinshan.xiong@intel.com) merged in patch https://review.whamcloud.com/30219/
            Subject: LU-10038 test: improve error reporting in sanity test_133g()
            Project: fs/lustre-release
            Branch: flr
            Current Patch Set:
            Commit: f1420059ac7d33cba65ab1b14fd5eade3c889684

            gerrit Gerrit Updater added a comment -

            Jinshan Xiong (jinshan.xiong@intel.com) uploaded a new patch: https://review.whamcloud.com/30219
            Subject: LU-10038 test: improve error reporting in sanity test_133g()
            Project: fs/lustre-release
            Branch: flr
            Current Patch Set: 1
            Commit: 4cef38d724a05ce0ae386cb9d2a618d187cbe8d1

            gerrit Gerrit Updater added a comment -

            John L. Hammond (john.hammond@intel.com) uploaded a new patch: https://review.whamcloud.com/30105
            Subject: LU-10038 test: improve error reporting in sanity test_133g()
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 4bfa1ce0881d3dbe737a2a1e5b2d85679fb41993
            adilger Andreas Dilger added a comment -

            Hit this again: https://testing.hpdd.intel.com/test_sets/5cdcc7f8-c08e-11e7-8cd9-52540065bddc
            green Oleg Drokin added a comment -

            Yes, as James Simmons said, the errors are normal - this is because we write garbage to random files.
            The real problem behind the failure is that when you run find on a list of directories, some of which don't exist, you get an error. So we need to find which directory does not exist and why it was not filtered out.
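Oleg's point is easy to demonstrate: GNU find exits non-zero if any starting point named on its command line does not exist, so the directory list has to be filtered before the call. A small illustration (the present/missing names are invented):

```shell
#!/bin/sh
tmp=$(mktemp -d)
mkdir "$tmp/present"

# A missing starting point fails the whole invocation ...
find "$tmp/present" "$tmp/missing" -type f 2>/dev/null
echo "unfiltered rc=$?"   # non-zero

# ... so keep only the directories that actually exist.
dirs=""
for d in "$tmp/present" "$tmp/missing"; do
    [ -d "$d" ] && dirs="$dirs $d"
done
# $dirs is deliberately unquoted so the list word-splits.
find $dirs -type f
rc=$?
echo "filtered rc=$rc"    # rc=0

rm -rf "$tmp"
```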

            simmonsja James A Simmons added a comment -

            One of the tests is to scribble random data into the proc/sysfs/debugfs entries. This is to ensure we don't oops or touch user-space memory incorrectly.
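To make James's description concrete, here is a conceptual stand-in only (do not point writes like this at a live /proc outside the test harness; the scratch file merely plays the role of a tunable). The point of the real test is that each garbage write is rejected cleanly, as in the rc = -14 / rc = -22 dmesg lines above, rather than oopsing the kernel:

```shell
#!/bin/sh
# A scratch file stands in for a proc/sysfs/debugfs tunable.
tmp=$(mktemp -d)
printf '0' > "$tmp/tunable"

# Scribble random bytes into every "tunable"; a failed write (bad format,
# bad address, ...) is an acceptable outcome -- only a crash would be a bug.
rejected=0
for f in "$tmp"/*; do
    head -c 16 /dev/urandom > "$f" 2>/dev/null || rejected=$((rejected + 1))
done
echo "writes done, rejected=$rejected"

rm -rf "$tmp"
```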

            People

              Assignee: wc-triage WC Triage
              Reporter: jamesanunez James Nunez (Inactive)
              Votes: 0
              Watchers: 7

              Dates

                Created:
                Updated:
                Resolved: