  1. Lustre
  2. LU-4243

multiple servicenodes or failnids: wrong client llog registration

Details

    • Bug
    • Resolution: Fixed
    • Major
    • Lustre 2.4.1, Lustre 2.5.0
    • None
    • failover MDS/MGS, failover OSTs
    • 3
    • 11555

    Description

      Since Lustre 2.4.0 we have had problems with clients that could not connect after, e.g., an MGS failover. We ran most of our experiments with a client on the standby MGS server; the symptom was that the client worked only from the active MGS/MDS node, not from the passive one!

      The reason for the problem seems to be commit d9d27cad and the following hunk of the patch:
      @@ -1447,13 +1481,11 @@ static int mgs_write_log_failnids(const struct lu_env *env,
                                     failnodeuuid, cliname);
                              rc = record_add_uuid(env, llh, nid, failnodeuuid);
                      }
      -               if (failnodeuuid) {
      +               if (failnodeuuid)
                              rc = record_add_conn(env, llh, cliname, failnodeuuid);
      -                       name_destroy(&failnodeuuid);
      -                       failnodeuuid = NULL;
      -               }
              }

      +       name_destroy(&failnodeuuid);
              return rc;
       }

      This leads to a wrong Lustre client llog when the Lustre block devices are formatted with multiple --servicenode options! Here is an example:

      #09 (224)marker 6 (flags=0x01, v2.5.0.0) lnec-MDT0000 'add mdc' Mon Nov 11 17:03:21 2013-
      #10 (080)add_uuid nid=10.3.0.34@o2ib(0x500000a030022) 0: 1:10.3.0.34@o2ib
      #11 (128)attach 0:lnec-MDT0000-mdc 1:mdc 2:lnec-clilmv_UUID
      #12 (136)setup 0:lnec-MDT0000-mdc 1:lnec-MDT0000_UUID 2:10.3.0.34@o2ib
      #13 (080)add_uuid nid=10.3.0.34@o2ib(0x500000a030022) 0: 1:10.3.0.34@o2ib
      #14 (104)add_conn 0:lnec-MDT0000-mdc 1:10.3.0.34@o2ib
      #15 (080)add_uuid nid=10.3.0.35@o2ib(0x500000a030023) 0: 1:10.3.0.34@o2ib
      #16 (104)add_conn 0:lnec-MDT0000-mdc 1:10.3.0.34@o2ib
      #17 (160)modify_mdc_tgts add 0:lnec-clilmv 1:lnec-MDT0000_UUID 2:0 3:1 4:lnec-MDT0000-mdc_UUID
      #18 (224)marker 6 (flags=0x02, v2.5.0.0) lnec-MDT0000 'add mdc' Mon Nov 11 17:03:21 2013-

      The last add_uuid should have 1:10.3.0.35@o2ib instead of 1:10.3.0.34@o2ib.

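      The reason is that only the first NID of the first --servicenode (a.k.a. --failnode) entry is considered: since failnodeuuid is no longer destroyed and reset after each add_conn, the UUID derived from that first NID appears to be reused for every later entry.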
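
      To illustrate the effect, here is a minimal sketch in C. It is a simplified model, not the Lustre source: write_failnids, struct record and the reset_uuid flag are hypothetical names, and each --servicenode entry is assumed to carry a single NID. The UUID is taken from the first NID seen while it is empty, which mirrors the "if (failnodeuuid)" logic in the hunk above.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MAX_RECORDS 16
#define NAME_LEN 64

/* Hypothetical stand-in for an llog record; not a Lustre structure. */
enum record_kind { ADD_UUID, ADD_CONN };

struct record {
    enum record_kind kind;
    char nid[NAME_LEN];   /* NID of the record, empty for ADD_CONN */
    char uuid[NAME_LEN];  /* failover node UUID written to the log */
};

/*
 * Simplified model of the failover NID loop (assumption: one NID per
 * --servicenode entry). reset_uuid != 0 models the old behavior, where
 * the UUID is dropped after each add_conn; reset_uuid == 0 models the
 * behavior after the quoted hunk. Returns the number of records written.
 */
static int write_failnids(const char *const *nids, int count,
                          int reset_uuid, struct record *out)
{
    char uuid[NAME_LEN] = "";
    int n = 0;

    assert(count >= 0 && 2 * count <= MAX_RECORDS);

    for (int i = 0; i < count; i++) {
        /* The UUID is only derived from a NID while it is unset. */
        if (uuid[0] == '\0')
            snprintf(uuid, sizeof(uuid), "%s", nids[i]);

        out[n].kind = ADD_UUID;
        snprintf(out[n].nid, NAME_LEN, "%s", nids[i]);
        snprintf(out[n].uuid, NAME_LEN, "%s", uuid);
        n++;

        out[n].kind = ADD_CONN;
        out[n].nid[0] = '\0';
        snprintf(out[n].uuid, NAME_LEN, "%s", uuid);
        n++;

        /* Old behavior: forget the UUID so the next entry gets its own. */
        if (reset_uuid)
            uuid[0] = '\0';
    }
    return n;
}
```

      Called with the two NIDs from the log above and reset_uuid = 0, the third record pairs NID 10.3.0.35@o2ib with UUID 10.3.0.34@o2ib, as in record #15; with reset_uuid = 1 the same record gets UUID 10.3.0.35@o2ib.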

      Please revert that small change to mgs_llog.c.

      Regards,
      Erich

            People

              hongchao.zhang Hongchao Zhang
              efocht Erich Focht