Lustre / LU-8498

configuration from log 'nodemap' failed (-22)

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Critical
    • Fix Version/s: Lustre 2.9.0

    Description

      In current master (2.8.56), a newly created file system fails to mount the OST due to a nodemap log error.

      From the log message:

      00000100:00000001:3.0:1470938259.243549:0:9524:0:(client.c:1052:ptlrpc_set_destroy()) Process leaving
      00000100:00000001:3.0:1470938259.243549:0:9524:0:(client.c:2896:ptlrpc_queue_wait()) Process leaving (rc=0 : 0 : 0)
      10000000:00000001:3.0:1470938259.243551:0:9524:0:(mgc_request.c:1716:mgc_process_recover_nodemap_log()) Process leaving via out (rc=18446744073709551594 : -22 : 0xffffffffffffffea)
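
      The rc in that last line is simply -22 (-EINVAL) printed as a raw unsigned 64-bit value; all three forms are the same bit pattern. A quick stand-alone check (not Lustre code, just illustrating the two's-complement reinterpretation):

      ```c
      #include <stdint.h>
      #include <stdio.h>

      int main(void)
      {
              int rc = -22; /* -EINVAL on Linux */

              /* Reinterpreting the negative errno as an unsigned 64-bit
               * value yields the large decimal and hex forms in the log. */
              uint64_t raw = (uint64_t)(int64_t)rc;

              printf("rc=%llu : %d : 0x%llx\n",
                     (unsigned long long)raw, rc, (unsigned long long)raw);
              return 0;
      }
      ```

      This prints `rc=18446744073709551594 : -22 : 0xffffffffffffffea`, matching the ptlrpc trace line above.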
      

      It looks like the corresponding log has zero size, which triggered this error:

        if (ealen == 0) { /* no logs transferred */
#ifdef HAVE_SERVER_SUPPORT
                /* config changed since first read RPC */
                if (cld_is_nodemap(cld) && config_read_offset == 0) {
                        recent_nodemap = NULL;
                        nodemap_config_dealloc(new_config);
                        new_config = NULL;

                        CDEBUG(D_INFO, "nodemap config changed in transit, retrying\n");

                        /* setting eof to false, we request config again */
                        eof = false;
                        GOTO(out, rc = 0);
                }
#endif
                if (!eof)
                        rc = -EINVAL;
                GOTO(out, rc);
        }
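
      The decision logic of that block can be modeled in isolation (a simplified sketch with hypothetical names — `empty_reply_rc`, `is_nodemap`, `read_offset` stand in for the real `cld_is_nodemap(cld)` and `config_read_offset` — not the actual Lustre code): a zero-length reply is only legal once eof has been reached, with the nodemap first-read case retried instead of failed.

      ```c
      #include <assert.h>
      #include <errno.h>
      #include <stdbool.h>

      /* Hypothetical stand-alone model of the empty-reply check above.
       * Returns 0 for "ok / retry", -EINVAL for the failure in this ticket. */
      static int empty_reply_rc(bool server_support, bool is_nodemap,
                                long read_offset, bool eof)
      {
              if (server_support && is_nodemap && read_offset == 0)
                      return 0;       /* config changed in transit: retry */
              if (!eof)
                      return -EINVAL; /* empty reply before eof: error */
              return 0;               /* empty reply at eof: done */
      }

      int main(void)
      {
              /* nodemap log, first read, server build: retried silently */
              assert(empty_reply_rc(true, true, 0, false) == 0);
              /* the path reported here: empty reply, eof not set -> -22 */
              assert(empty_reply_rc(true, false, 0, false) == -EINVAL);
              /* an empty reply once eof is set is harmless */
              assert(empty_reply_rc(false, false, 0, true) == 0);
              return 0;
      }
      ```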
      

      We have a debug log and will attach it soon.

      Attachments

        1. log_failedmount_vanilla_master
          0.2 kB
        2. log.Jinshan
          0.2 kB
        3. log.mds.LU-8498
          0.2 kB
        4. mds_lctl_log_afterpatch
          0.2 kB
        5. oss_lctl_log.afterpatch
          0.2 kB

        Issue Links

          Activity

            [LU-8498] configuration from log 'nodemap' failed (-22)

            Hi John,

            Thanks for the logs. I took a quick look, but there's nothing obvious. The MDS says it's sending over a 1MB config RPC, so I'm not sure why the MGC thinks it's not getting anything. I'll take a closer look tomorrow.

            Can you confirm you are just running straight master, no patches? FWIW line 1716 doesn't correspond to a GOTO statement on the tip of master for me (hash 6fad3ab).

            You could try changing the return code from -EINVAL to 0 on that eof check as a workaround. It shouldn't cause any problems to receive a 0 length RPC if you aren't using nodemap, but it also shouldn't happen as far as I understand it. Here's the eof check I mean:

                            if (!eof)
                                    rc = 0;
            

            What does your test setup look like? Is there any way to reproduce the failure in maloo?

            When you say "since this feature landed I have been unable to use master" can you clarify which feature you mean? There have been a number of patches related to nodemap config transfer that have landed in the past couple of months. If you could specify the last version (commit hash) that worked, and the first version that didn't, that would be helpful.

            Thanks,
            Kit

            kit.westneat Kit Westneat (Inactive) added a comment -

            This seems like a pretty serious bug. Since this feature landed I have been unable to use master at all, as I cannot get any OST to connect to the MDS. Is there any workaround? CORAL testing has completely stopped because of this.

            jsalians_intel John Salinas (Inactive) added a comment -

            This log covers from: mount -t lustre zfspool/mdt1 /mnt/mdt to when the OST fails to connect.

            jsalians_intel John Salinas (Inactive) added a comment -
            pjones Peter Jones added a comment -

            Kit

            Don't worry - 21891 is just a port of the LU-8460 fix to another branch, so not important

            Peter


            It looks like the MGS hasn't created the nodemap config when the OST is mounting. Can you get the MGS logs as well? Or is there a maloo link for this failure? I'm not able to see patch 21891 for some reason.

            kit.westneat Kit Westneat (Inactive) added a comment -
            jsalians_intel John Salinas (Inactive) added a comment - edited

            Currently this is blocking testing of: LU-8460 and http://review.whamcloud.com/#/c/21891/

            1. lctl set_param debug=-1
              debug=-1
            2. lctl dk clear
              Debug log: 4741 lines, 4741 kept, 0 dropped, 0 bad.
            3. mkfs.lustre --backfstype=zfs --reformat --replace --fsname=lsdraid --ost --index=0 --mgsnode=192.168.1.5@o2ib ost0/ost0

            Permanent disk data:
            Target: lsdraid-OST0000
            Index: 0
            Lustre FS: lsdraid
            Mount type: zfs
            Flags: 0x42
            (OST update)
            Persistent mount opts:
            Parameters: mgsnode=192.168.1.5@o2ib

            mkfs_cmd = zfs create -o canmount=off -o xattr=sa ost0/ost0
            Writing ost0/ost0 properties
            lustre:version=1
            lustre:flags=66
            lustre:index=0
            lustre:fsname=lsdraid
            lustre:svname=lsdraid-OST0000
            lustre:mgsnode=192.168.1.5@o2ib

            4. lctl dk clear
              Debug log: 7281 lines, 7281 kept, 0 dropped, 0 bad.
            5. mount -t lustre ost0/ost0 /mnt/lustre/ost0
              mount.lustre: mount ost0/ost0 at /mnt/lustre/ost0 failed: No such file or directory
              Is the MGS specification correct?
              Is the filesystem name correct?
              If upgrading, is the copied client log valid? (see upgrade docs)
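
            The Flags: 0x42 in the mkfs output (stored as lustre:flags=66) is the OST type bit plus the update bit, per the "(OST update)" annotation. Assuming the usual LDD_F_* values from lustre_disk.h — an assumption worth checking against your tree — a minimal decoder:

            ```c
            #include <stdio.h>

            /* Assumed values from lustre_disk.h (verify against your tree): */
            #define LDD_F_SV_TYPE_MDT 0x0001
            #define LDD_F_SV_TYPE_OST 0x0002
            #define LDD_F_SV_TYPE_MGS 0x0004
            #define LDD_F_UPDATE      0x0040

            /* Print the target-type/update bits the way mkfs.lustre annotates them. */
            static void decode_ldd_flags(unsigned int flags)
            {
                    printf("Flags: 0x%x\n(", flags);
                    if (flags & LDD_F_SV_TYPE_MDT)
                            printf("MDT ");
                    if (flags & LDD_F_SV_TYPE_OST)
                            printf("OST ");
                    if (flags & LDD_F_SV_TYPE_MGS)
                            printf("MGS ");
                    if (flags & LDD_F_UPDATE)
                            printf("update ");
                    printf(")\n");
            }

            int main(void)
            {
                    decode_ldd_flags(0x42); /* value from the transcript (= 66) */
                    return 0;
            }
            ```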

            People

              Assignee: kit.westneat Kit Westneat (Inactive)
              Reporter: jay Jinshan Xiong (Inactive)
              Votes: 0
              Watchers: 6
