Lustre / LU-2407

Interop 2.1.3<->2.4 Failure on test suite conf-sanity test_35a: conf_param: No such device

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Fix Version/s: Lustre 2.4.1, Lustre 2.5.0
    • Affects Version/s: Lustre 2.4.0
    • None
    • server: 2.3 RHEL6
      client: lustre master build #1065 RHEL6
    • 3
    • 5713

    Description

      This issue was created by maloo for sarah <sarah@whamcloud.com>

      This issue relates to the following test suite run: https://maloo.whamcloud.com/test_sets/013a4c58-3980-11e2-9fda-52540035b04c.

      The sub-test test_35a failed with the following error:

      test_35a failed with 4

      test log shows:

      Set up a fake failnode for the MDS
      CMD: client-26vm7 lctl get_param -n devices
      CMD: client-26vm7 /usr/sbin/lctl conf_param lustre-MDT0000.failover.node= 127.0.0.2@tcp
      client-26vm7: error: conf_param: No such device
       conf-sanity test_35a: @@@@@@ FAIL: test_35a failed with 4 
      

      MDS dmesg:

      Lustre: DEBUG MARKER: Set up a fake failnode for the MDS
      Lustre: DEBUG MARKER: lctl get_param -n devices
      Lustre: DEBUG MARKER: /usr/sbin/lctl conf_param lustre-MDT0000.failover.node= 127.0.0.2@tcp
      LustreError: 11886:0:(mgs_llog.c:2684:mgs_write_log_param()) err -19 on param 'failover.node='
      LustreError: 11886:0:(mgs_handler.c:1147:mgs_iocontrol()) setparam err -19
      Lustre: DEBUG MARKER: /usr/sbin/lctl mark  conf-sanity test_35a: @@@@@@ FAIL: test_35a failed with 4 
      

        Activity

          [LU-2407] Interop 2.1.3<->2.4 Failure on test suite conf-sanity test_35a: conf_param: No such device
          yujian Jian Yu added a comment -

          Patch was cherry-picked to Lustre b2_4 branch.

          yujian Jian Yu added a comment -

          Patch for Lustre master branch to remove the extra space: http://review.whamcloud.com/6779. The patch also needs to be cherry-picked to Lustre b2_4 branch.

          Hi Oleg,
          Could you please cherry-pick the above patch to Lustre b2_4 branch? Thanks.

          utopiabound Nathaniel Clark added a comment -

          Patches merged to master

          yujian Jian Yu added a comment -

          Patch for Lustre master branch to remove the extra space: http://review.whamcloud.com/6779. The patch also needs to be cherry-picked to Lustre b2_4 branch.

          yujian Jian Yu added a comment -

          The issue was introduced by change http://review.whamcloud.com/4247, which added an extra space between "=" and "$(h2$NETTYPE $FAKENID)". As a result, the "lctl conf_param" command became:

          /usr/sbin/lctl conf_param lustre-MDT0000.failover.node= 127.0.0.2@tcp
          

          As we can see, the fake failover NID was never actually assigned to the "failover.node" parameter: the shell passes it as a separate argument, so the parameter value is empty. We therefore need a patch on the master and b2_4 branches to fix this script issue.
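
          The effect of the stray space can be reproduced with a minimal bash sketch. Note that show_args is a hypothetical helper standing in for lctl: it only reports the words the shell hands to a command and says nothing about how lctl itself parses them.

```shell
#!/bin/bash
# Hypothetical helper (not part of Lustre): prints the argument count
# and each argument exactly as the shell delivered it.
show_args() {
    echo "argc=$#"
    local arg
    for arg in "$@"; do
        echo "arg=[$arg]"
    done
}

PARAM="lustre-MDT0000.failover.node"
NID="127.0.0.2@tcp"

# With the stray space the shell produces two words, so the
# parameter assignment ends at "=" and carries an empty value.
broken_out=$(show_args $PARAM= $NID)

# Without the space the NID is part of the same word.
fixed_out=$(show_args $PARAM=$NID)

echo "with space:"
echo "$broken_out"
echo "without space:"
echo "$fixed_out"
```

          In the first call the parameter word ends in "=", which matches the 'failover.node=' value reported in the MDS dmesg above.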

          That said, if the test script was broken, why did the same test pass on the master and b2_4 branches but fail on master/b2_4<->b2_3/b2_1 interop combinations?
          This is because the master and b2_4 branches include the patch from http://review.whamcloud.com/3670, which improves the error handling in mgs_modify() and mgs_write_log_failnid_internal() so that setting "failover.node" to an empty value removes all failover NIDs. If "failover.node" was already empty, mgs_modify() returns 1, meaning no modification was made. On the b2_3 and b2_1 branches, which lack the patch, setting an empty value makes mgs_modify() return -ENODEV if "failover.node" had no value before.
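
          The difference between the branches can be summarised with a toy bash model. This is only a sketch of the return values described above, not MGS code: model_mgs_modify and its arguments are hypothetical, and the "0 for any other update" branch is an assumption made to keep the example complete.

```shell
#!/bin/bash
# Toy model only: mimics the return values described in this comment.
# Usage: model_mgs_modify <has_patch_3670: yes|no> <current_value> <new_value>
model_mgs_modify() {
    local has_patch=$1 current=$2 new=$3
    if [ -z "$new" ] && [ -z "$current" ]; then
        if [ "$has_patch" = "yes" ]; then
            # master/b2_4: nothing to remove, "no modification is done"
            printf '%s\n' "1"
        else
            # b2_3/b2_1: -ENODEV (errno 19), reported as "No such device"
            printf '%s\n' "-19"
        fi
        return
    fi
    # Assumption: every other update is treated as success in this model.
    printf '%s\n' "0"
}

# Empty value written over an empty "failover.node", as in test_35a.
new_server=$(model_mgs_modify yes "" "")
old_server=$(model_mgs_modify no "" "")
echo "master/b2_4 server: $new_server"
echo "b2_3/b2_1 server:   $old_server"
```

          The -19 in the second case corresponds to the "err -19" lines in the MDS dmesg and to the "No such device" error printed by lctl.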

          So, to fix the interop issue, we need to either backport the error-handling code from the master branch to the b2_3 and b2_1 branches, or wait for the test script fix to land on the master and b2_4 branches.

          sarah Sarah Liu added a comment -

          Also seen in 2.1.4 server vs 2.4 client:
          https://maloo.whamcloud.com/test_sets/ae414df4-5f35-11e2-b507-52540035b04c

          sarah Sarah Liu added a comment -

          Another instance seen in 2.3.0 server vs 2.4 client:
          https://maloo.whamcloud.com/test_sets/19d3d0ea-5b54-11e2-8985-52540035b04c

          sarah Sarah Liu added a comment -

          Hit the same issue in interop testing between a 2.1.3 server and a 2.4 client. The client build is lustre-master #1127, which should include the fix for LU-2406.

          https://maloo.whamcloud.com/test_sets/351a6028-4a86-11e2-8a7b-52540035b04c

          pjones Peter Jones added a comment -

          duplicate of LU-2406


          People

            utopiabound Nathaniel Clark
            maloo Maloo