Lustre / LU-10038

sanity test 133g fails with "mds1 find /proc/fs/lustre/ /proc/sys/lnet/ /sys/fs/lustre/ /sys/kernel/debug/lnet/ /sys/kernel/debug/lustre/ failed"

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Fix Version/s: Lustre 2.11.0
    • Affects Version/s: Lustre 2.11.0
    • Severity: 3
    • Rank (Obsolete): 9223372036854775807

    Description

      sanity test_133g fails on the call to find on the MDS. From the test_log, we see the call to find:

      mds1_proc_dirs='/proc/fs/lustre/
      /proc/sys/lnet/
      /sys/fs/lustre/
      /sys/kernel/debug/lnet/
      /sys/kernel/debug/lustre/'
       sanity test_133g: @@@@@@ FAIL: mds1 find /proc/fs/lustre/
      /proc/sys/lnet/
      /sys/fs/lustre/
      /sys/kernel/debug/lnet/
      /sys/kernel/debug/lustre/ failed 
      

      Looking at the client2 dmesg, we see the output from 133g:

      [ 3573.106070] Lustre: DEBUG MARKER: == sanity test 133g: Check for Oopses on bad io area writes/reads in /proc =========================== 21:44:29 (1504907069)
      [ 3573.249527] Lustre: 18911:0:(lprocfs_status.c:2483:lprocfs_wr_root_squash()) lustre: failed to set root_squash due to bad address, rc = -14
      [ 3573.254669] Lustre: 18911:0:(lprocfs_status.c:2479:lprocfs_wr_root_squash()) lustre: failed to set root_squash to "", needs uid:gid format, rc = -22
      [ 3573.263113] Lustre: 18916:0:(lprocfs_status.c:2551:lprocfs_wr_nosquash_nids()) lustre: failed to set nosquash_nids to "", bad address rc = -14
      [ 3573.268474] Lustre: 18916:0:(lprocfs_status.c:2555:lprocfs_wr_nosquash_nids()) lustre: failed to set nosquash_nids due to string too long rc = -22
      [ 3573.349716] LustreError: 18934:0:(gss_cli_upcall.c:240:gss_do_ctx_init_rpc()) ioctl size 5, expect 80, please check lgss_keyring version
      [ 3573.379338] LustreError: 18984:0:(ldlm_resource.c:355:lru_size_store()) lru_size: invalid value written
      [ 3573.422121] Lustre: 19067:0:(libcfs_string.c:127:cfs_str2mask()) unknown mask ' '.
      mask usage: [+|-]<all|type> ...
      [ 3574.553074] Lustre: DEBUG MARKER: /usr/sbin/lctl mark  sanity test_133g: @@@@@@ FAIL: mds1 find \/proc\/fs\/lustre\/
      

      Looking at the dmesg log on MDS1, we see similar output, with a few more lines:

      [ 3479.763450] Lustre: DEBUG MARKER: find /proc/fs/lustre/ /proc/sys/lnet/ /sys/fs/lustre/ /sys/kernel/debug/lnet/ /sys/kernel/debug/lustre/ -type f -not -name force_lbug -not -name changelog_mask -exec badarea_io {} \;
      [ 3479.911266] Lustre: 17021:0:(mdt_coordinator.c:1944:mdt_hsm_policy_seq_write()) lustre-MDT0000: ' ' is unknown, supported policies are:
      [ 3479.944620] LustreError: 17069:0:(mdt_coordinator.c:2097:mdt_hsm_cdt_control_seq_write()) lustre-MDT0000: Valid coordinator control commands are: enabled shutdown disabled purge help
      [ 3479.950356] Lustre: 17071:0:(lprocfs_status.c:2483:lprocfs_wr_root_squash()) lustre-MDT0000: failed to set root_squash due to bad address, rc = -14
      [ 3479.954690] Lustre: 17071:0:(lprocfs_status.c:2479:lprocfs_wr_root_squash()) lustre-MDT0000: failed to set root_squash to "", needs uid:gid format, rc = -22
      [ 3479.960431] LustreError: 17072:0:(genops.c:1540:obd_export_evict_by_uuid()) lustre-MDT0000: can't disconnect : no exports found
      [ 3479.965980] LustreError: 17074:0:(mdt_lproc.c:366:lprocfs_identity_info_seq_write()) lustre-MDT0000: invalid data count = 5, size = 1048
      [ 3479.970389] LustreError: 17074:0:(mdt_lproc.c:383:lprocfs_identity_info_seq_write()) lustre-MDT0000: MDS identity downcall bad params
      [ 3479.975746] Lustre: 17075:0:(lprocfs_status.c:2551:lprocfs_wr_nosquash_nids()) lustre-MDT0000: failed to set nosquash_nids to "", bad address rc = -14
      [ 3479.980510] Lustre: 17075:0:(lprocfs_status.c:2555:lprocfs_wr_nosquash_nids()) lustre-MDT0000: failed to set nosquash_nids due to string too long rc = -22
      [ 3479.988148] LustreError: 17079:0:(mdt_lproc.c:298:mdt_identity_upcall_seq_write()) lustre-MDT0000: identity upcall too long
      [ 3480.066568] LustreError: 17187:0:(lproc_fid.c:177:lprocfs_server_fid_width_seq_write()) ctl-lustre-MDT0000: invalid FID sequence width: rc = -14
      [ 3480.104133] LustreError: 17240:0:(ldlm_resource.c:104:seq_watermark_write()) Failed to set lock_limit_mb, rc = -14.
      [ 3480.122289] LustreError: 17250:0:(nodemap_storage.c:420:nodemap_idx_nodemap_update()) cannot add nodemap config to non-existing MGS.
      [ 3480.128801] LustreError: 17252:0:(nodemap_handler.c:1049:nodemap_create()) cannot add nodemap: ' ': rc = -22
      [ 3480.250476] LustreError: 17363:0:(ldlm_resource.c:355:lru_size_store()) lru_size: invalid value written
      [ 3480.333333] Lustre: 17493:0:(libcfs_string.c:127:cfs_str2mask()) unknown mask ' '.
      mask usage: [+|-]<all|type> ...
      [ 3480.468566] Lustre: DEBUG MARKER: /usr/sbin/lctl mark  sanity test_133g: @@@@@@ FAIL: mds1 find \/proc\/fs\/lustre\/
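The rc values in the messages above are negated Linux errno codes: -14 is EFAULT ("Bad address") and -22 is EINVAL ("Invalid argument"), which matches the wording of the log lines themselves. As a reading aid, here is a minimal sketch of a decoder for them. The name decode_rc is hypothetical and not part of Lustre or its test suite, and only the two codes seen in these logs are mapped.

```shell
#!/bin/sh
# decode_rc is a hypothetical helper for reading logs, not a Lustre tool.
# It maps only the two errno values that appear in the dmesg output above.
decode_rc() {
	# Strip a leading minus sign so "-14" and "14" are treated alike.
	case "${1#-}" in
	14) echo "EFAULT (Bad address)" ;;
	22) echo "EINVAL (Invalid argument)" ;;
	*) echo "unknown ($1)" ;;
	esac
}

decode_rc -14
decode_rc -22
```

Messages of this kind are expected output of the test, since it deliberately reads and writes each tunable with a bad I/O area.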
      

      This ticket was opened because we are seeing this failure when testing with a separate MGS and MDS. Logs for these failures are at:
      https://testing.hpdd.intel.com/test_sets/308ca8f6-97d7-11e7-b944-5254006e85c2
      https://testing.hpdd.intel.com/test_sets/774c9f26-97d7-11e7-b944-5254006e85c2

      We have also seen this test fail frequently during interop testing.

          Activity


            jgmitter Joseph Gmitter (Inactive) added a comment -

            Patches have landed to master for 2.11.0.

            gerrit Gerrit Updater added a comment -

            Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/30451/
            Subject: LU-10038 test: ignore readdir races in sanity 133g
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 45f99f6562349f77be6b47bf1bc5a94abf9fd11d

            adilger Andreas Dilger added a comment -

            Very interesting. I wouldn’t have thought of this as the cause.

            gerrit Gerrit Updater added a comment -

            John L. Hammond (john.hammond@intel.com) uploaded a new patch: https://review.whamcloud.com/30451
            Subject: LU-10038 test: ignore readdir races in sanity 133g
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 7e1322643470688cb0c306dc866e3c2f84ad4c26

            jhammond John Hammond added a comment -

            This must be because we are writing to the .../exports/clear file between readdir and accessing the 10.2.8.104@tcp subdir. find has an -ignore_readdir_race option that should address this.

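To illustrate the option mentioned in this comment, here is a minimal sketch of the same find invocation with -ignore_readdir_race added. It assumes GNU find. A scratch directory stands in for /proc/fs/lustre/, and echo stands in for badarea_io, which is available only in the Lustre test environment. It shows where the option goes on the command line and does not reproduce the race itself or the merged patch.

```shell
#!/bin/sh
# Sketch only: a scratch tree stands in for /proc/fs/lustre/, and "echo"
# stands in for badarea_io. Assumes GNU find.
scratch=$(mktemp -d)
mkdir -p "$scratch/exports/10.2.8.104@tcp"
touch "$scratch/exports/clear" \
	"$scratch/exports/10.2.8.104@tcp/stats" \
	"$scratch/force_lbug"

# -ignore_readdir_race tells find not to report an error when an entry
# returned by readdir has vanished by the time find tries to stat it.
# It is an option rather than a test, so it is placed before -type.
find "$scratch" -ignore_readdir_race -type f \
	-not -name force_lbug -not -name changelog_mask \
	-exec echo {} \; > "$scratch.out"
status=$?

found=$(sort "$scratch.out")
echo "$found"
echo "find exit status: $status"

rm -rf "$scratch" "$scratch.out"
```

In the real test, the entry that vanishes is an export directory under .../exports/ that is removed while find is walking the tree.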
            jhammond John Hammond added a comment -

            Here's a failure that occurred after the error reporting patch:
            https://testing.hpdd.intel.com/sub_tests/ed882ff6-d54a-11e7-a066-52540065bddc

            It looks like one of the export directories is going away while find is running on the mdt:

            == sanity test 133g: Check for Oopses on bad io area writes/reads in /proc =========================== 20:28:46 (1511987326)
            proc_dirs='/proc/fs/lustre/
            /sys/fs/lustre/
            /sys/kernel/debug/lnet/
            /sys/kernel/debug/lustre/'
            CMD: onyx-38vm9 /usr/sbin/lctl get_param -n version 2>/dev/null ||
            				/usr/sbin/lctl lustre_build_version 2>/dev/null ||
            				/usr/sbin/lctl --version 2>/dev/null | cut -d' ' -f2
            mds1_proc_dirs='/proc/fs/lustre/
            /sys/fs/lustre/
            /sys/kernel/debug/lnet/
            /sys/kernel/debug/lustre/'
            CMD: onyx-38vm9 find /proc/fs/lustre/ /sys/fs/lustre/ /sys/kernel/debug/lnet/ /sys/kernel/debug/lustre/ -type f -not -name force_lbug -not -name changelog_mask -exec badarea_io {} \;
            onyx-38vm9: find: ‘/proc/fs/lustre/mdt/lustre-MDT0000/exports/10.2.8.104@tcp’: No such file or directory
            
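The log above is useful because find's own error message names the path that disappeared. Here is a minimal sketch of that style of reporting, which captures find's stderr and prints it when the exit status is nonzero. The wrapper run_find and the variable find_errors are hypothetical names for illustration and are not the code from the error reporting patch. A start point that does not exist is used to trigger the same "No such file or directory" error deterministically.

```shell
#!/bin/sh
# Sketch only: run_find is a hypothetical wrapper, not code from Lustre.
# It saves find's stderr so the failing path can be shown on failure.
run_find() {
	errfile=$(mktemp)
	find "$@" -type f 2> "$errfile"
	rc=$?
	find_errors=$(cat "$errfile")
	rm -f "$errfile"
	if [ "$rc" -ne 0 ]; then
		echo "find failed with status $rc:" >&2
		echo "$find_errors" >&2
	fi
	return "$rc"
}

scratch=$(mktemp -d)
touch "$scratch/present"

# The missing start point makes find print an error and exit nonzero,
# while files under the valid start point are still listed.
run_find "$scratch" "$scratch/missing" || echo "exit status: $?"

rm -rf "$scratch"
```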

            gerrit Gerrit Updater added a comment -

            Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/30105/
            Subject: LU-10038 test: improve error reporting in sanity test_133g()
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: e5eaaff6e378b8c95d0a809f4dd3b4817d9fd492

            gerrit Gerrit Updater added a comment -

            Jinshan Xiong (jinshan.xiong@intel.com) merged in patch https://review.whamcloud.com/30219/
            Subject: LU-10038 test: improve error reporting in sanity test_133g()
            Project: fs/lustre-release
            Branch: flr
            Current Patch Set:
            Commit: f1420059ac7d33cba65ab1b14fd5eade3c889684

            gerrit Gerrit Updater added a comment -

            Jinshan Xiong (jinshan.xiong@intel.com) uploaded a new patch: https://review.whamcloud.com/30219
            Subject: LU-10038 test: improve error reporting in sanity test_133g()
            Project: fs/lustre-release
            Branch: flr
            Current Patch Set: 1
            Commit: 4cef38d724a05ce0ae386cb9d2a618d187cbe8d1

            gerrit Gerrit Updater added a comment -

            John L. Hammond (john.hammond@intel.com) uploaded a new patch: https://review.whamcloud.com/30105
            Subject: LU-10038 test: improve error reporting in sanity test_133g()
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 4bfa1ce0881d3dbe737a2a1e5b2d85679fb41993

            People

              Assignee: wc-triage WC Triage
              Reporter: jamesanunez James Nunez (Inactive)
              Votes: 0
              Watchers: 7
