
IO Errors during the failover - SLES 11 SP2 - Lustre 2.4.2

Details

    • Bug
    • Resolution: Fixed
    • Major
    • None
    • Lustre 2.4.2
    • SLES 11 SP2
      Lustre 2.4.2
    • 3
    • 12978

    Description

      We have applied the patch provided in LU-3645, but the customer reports that the issue can still be reproduced.

      Attaching the latest set of logs.

      The issue recurred on 18th Feb.

      Attachments

        Activity

          [LU-4722] IO Errors during the failover - SLES 11 SP2 - Lustre 2.4.2
          pjones Peter Jones added a comment -

          Thanks Rajesh. My concern is that this patch has not yet been landed to newer versions of Lustre so if the customer were to upgrade it might mean that this issue reoccurs for them.

          Hongchao

          This fix was for SLES11 SP2 clients. Is it still necessary for SLES11 SP3 clients which is what is supported on master and b2_5? If so, please could you push this fix firstly against master to get it finalized and landed. If this issue is specific to SLES11 SP2 only then I agree that it is ok to mark the ticket as Resolved.

          Thanks

          Peter


          rganesan@ddn.com Rajeshwaran Ganesan added a comment -

          We have upgraded both the server side and the client side.

          1. On the server side, the customer upgraded to 2.4.3 with the patch.

          They no longer see the issue, and the ticket can be closed.
          pjones Peter Jones added a comment -

          Rajesh?

          pjones Peter Jones added a comment -

          Rajesh

          To be clear, do you mean upgraded to a newer Lustre version or upgraded to use the patch supplied?

          Peter


          rganesan@ddn.com Rajeshwaran Ganesan added a comment -

          We can close this LU. The customer has upgraded the server and the clients, and they no longer see this issue.
          pjones Peter Jones added a comment -

          Any update Rajesh?


          rganesan@ddn.com Rajeshwaran Ganesan added a comment -

          We are in the process of applying the patch. I will get back to you with the results.

          hongchao.zhang Hongchao Zhang added a comment -

          Hi Rajesh,

          What is the result of the test?

          Thanks.

          hongchao.zhang Hongchao Zhang added a comment -

          There is a bug in obd_str2uuid:

           static inline void obd_str2uuid(struct obd_uuid *uuid, const char *tmp)
           {
                  strncpy((char *)uuid->uuid, tmp, sizeof(*uuid));
                  uuid->uuid[sizeof(*uuid) - 1] = '\0';
           }

          It implicitly treats "tmp" as an "obd_uuid" as well, but that is not true in all cases. In "class_add_uuid", for example, "tmp" is "lustre_cfg_string(lcfg, 1)", so obd_str2uuid copies undefined data beyond the end of "tmp" into "uuid". As a result, two identical "uuid" entries in the config could be treated as different.

          The patch against b2_4 is tracked at http://review.whamcloud.com/#/c/10269/

          Hi Rajesh,
          Could you please try the patch at your site?
          Thanks!
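
The failure mode described in the comment above can be illustrated outside Lustre with a small C sketch. It is not Lustre code: `demo_uuid`, `fill_source`, `copy_fixed`, `copy_bounded`, and `same_uuid` are hypothetical names, the 40-byte size is an assumption, and the fixed-length `memcpy` is a deliberate stand-in that models a copy taking bytes from beyond the end of the source string. Under those assumptions, it shows how two sources holding the same string can become buffers that differ under a whole-buffer comparison, and how zero-filling before a bounded copy avoids that.

```c
#include <string.h>

#define UUID_MAX 40 /* assumed buffer size for this sketch */

/* Hypothetical stand-in for a fixed-size UUID holder. */
struct demo_uuid {
    char uuid[UUID_MAX];
};

/* Fill a UUID_MAX-byte buffer with filler bytes, then place the string
 * (and its terminator) at the start. s must be shorter than UUID_MAX. */
static void fill_source(char *buf, const char *s, char filler)
{
    memset(buf, filler, UUID_MAX);
    strcpy(buf, s);
}

/* Fixed-size copy: takes UUID_MAX bytes no matter where the string ends,
 * so whatever follows the terminator in the source is copied too. */
static void copy_fixed(struct demo_uuid *dst, const char *src_buf)
{
    memcpy(dst->uuid, src_buf, UUID_MAX);
    dst->uuid[UUID_MAX - 1] = '\0';
}

/* Bounded copy: zero-fill first, then copy at most UUID_MAX - 1 string
 * bytes, so the bytes after the terminator are always zero. */
static void copy_bounded(struct demo_uuid *dst, const char *src)
{
    memset(dst->uuid, 0, UUID_MAX);
    strncpy(dst->uuid, src, UUID_MAX - 1);
}

/* Whole-buffer comparison: sensitive to bytes after the terminator. */
static int same_uuid(const struct demo_uuid *a, const struct demo_uuid *b)
{
    return memcmp(a, b, sizeof(*a)) == 0;
}
```

For two 40-byte source buffers that hold the same short string followed by different filler bytes, `copy_fixed` yields buffers that match under `strcmp` but not under `same_uuid`, while `copy_bounded` yields buffers that match under both.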

          rganesan@ddn.com Rajeshwaran Ganesan added a comment -

          Hello Hongchao,

          I have uploaded the requested log files to ftp.whamcloud.com:/uploads/LU-4722

          2014-05-08-SR30502_pfs2n17.llog.gz

          Thanks,
          Rajesh

          hongchao.zhang Hongchao Zhang added a comment -

          There is no error in these configs.

          Could you please collect the debug logs (lctl dk > XXX.log) on the problematic node just after mounting the client? Please make sure "ha" is included in "/proc/sys/lnet/debug".
          Thanks very much!
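
The precondition in the request above (that "ha" is present in "/proc/sys/lnet/debug") could be checked before collecting logs with a small helper like the following. This is a sketch only: `has_debug_flag` and `debug_file_has_flag` are hypothetical names, and it assumes the file holds a whitespace-separated list of flag names. It matches whole tokens so that a flag such as "ha" is not confused with a longer name that merely starts with the same letters.

```c
#include <stdio.h>
#include <string.h>

/* Return 1 if `flag` appears as a whole whitespace-separated token in
 * `flags`, 0 otherwise. The input string is not modified. */
static int has_debug_flag(const char *flags, const char *flag)
{
    const char *delims = " \t\n";
    size_t want = strlen(flag);
    const char *p = flags;

    while (*p != '\0') {
        size_t len;

        p += strspn(p, delims);   /* skip separators */
        len = strcspn(p, delims); /* length of the current token */
        if (len > 0 && len == want && strncmp(p, flag, len) == 0)
            return 1;
        p += len;
    }
    return 0;
}

/* Read up to 1023 bytes from `path` and test for `flag`.
 * Returns 1 if present, 0 if absent, -1 if the file cannot be opened. */
static int debug_file_has_flag(const char *path, const char *flag)
{
    char buf[1024];
    size_t n;
    FILE *f = fopen(path, "r");

    if (f == NULL)
        return -1;
    n = fread(buf, 1, sizeof(buf) - 1, f);
    fclose(f);
    buf[n] = '\0';
    return has_debug_flag(buf, flag);
}
```

A caller would pass the path named in the comment, for example `debug_file_has_flag("/proc/sys/lnet/debug", "ha")`, and run the log collection only when the result is 1.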

          People

            hongchao.zhang Hongchao Zhang
            rganesan@ddn.com Rajeshwaran Ganesan
            Votes: 0
            Watchers: 4
