
configuration from log 'nodemap' failed (-22)

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Critical
    • Fix Version/s: Lustre 2.9.0

    Description

      On current master (2.8.56), a newly created file system fails to mount an OST due to a nodemap log error.

      From the log message:

      00000100:00000001:3.0:1470938259.243549:0:9524:0:(client.c:1052:ptlrpc_set_destroy()) Process leaving
      00000100:00000001:3.0:1470938259.243549:0:9524:0:(client.c:2896:ptlrpc_queue_wait()) Process leaving (rc=0 : 0 : 0)
      10000000:00000001:3.0:1470938259.243551:0:9524:0:(mgc_request.c:1716:mgc_process_recover_nodemap_log()) Process leaving via out (rc=18446744073709551594 : -22 : 0xffffffffffffffea)
      

      It looks like the corresponding nodemap log has a zero size, which triggers this error in mgc_process_recover_nodemap_log(): with no log data transferred (ealen == 0) and eof still false, rc is set to -EINVAL (-22).

              if (ealen == 0) { /* no logs transferred */
      #ifdef HAVE_SERVER_SUPPORT
                      /* config changed since first read RPC */
                      if (cld_is_nodemap(cld) && config_read_offset == 0) {
                              recent_nodemap = NULL;
                              nodemap_config_dealloc(new_config);
                              new_config = NULL;
      
                              CDEBUG(D_INFO, "nodemap config changed in transit, retrying\n");
      
                              /* setting eof to false, we request config again */
                              eof = false;
                              GOTO(out, rc = 0);
                      }
      #endif
                      if (!eof)
                              rc = -EINVAL;
                      GOTO(out, rc);
              }
      

      We have a debug log and will attach it soon.

      Attachments

        1. log_failedmount_vanilla_master
          0.2 kB
        2. log.Jinshan
          0.2 kB
        3. log.mds.LU-8498
          0.2 kB
        4. mds_lctl_log_afterpatch
          0.2 kB
        5. oss_lctl_log.afterpatch
          0.2 kB


          Activity

            [LU-8498] configuration from log 'nodemap' failed (-22)

            Kit Westneat (kit.westneat@gmail.com) uploaded a new patch: http://review.whamcloud.com/21939
            Subject: LU-8498 nodemap: new zfs index files not properly initialized
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: e5097f35594e265803a363f4e2e6fc6ea62f62fc

            gerrit Gerrit Updater added a comment

            I spent some time digging into it. It looks like there's a slight difference in the way new ldiskfs index files and new ZFS index files work. I'm working on a patch to fix it, but in the meantime I think that workaround I talked about earlier should work if you aren't using nodemap. Just change the return code from -EINVAL to 0.
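
            For reference, here is a minimal sketch of that workaround, applied to the ealen == 0 branch quoted in the description (exact line numbers may differ on your tree):

                if (ealen == 0) { /* no logs transferred */
                        /* ... HAVE_SERVER_SUPPORT nodemap retry block unchanged ... */
                        if (!eof)
                                rc = 0; /* workaround: was -EINVAL; tolerate an
                                         * empty log when nodemap is not in use */
                        GOTO(out, rc);
                }

            The proper fix is tracked in the patch referenced above (http://review.whamcloud.com/21939).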

            kit.westneat Kit Westneat (Inactive) added a comment

            I should say the Lustre messages still look the same to me:
            [ 861.004988] LustreError: 33234:0:(mgc_request.c:257:do_config_log_add()) MGC192.168.1.5@o2ib: failed processing log, type 4: rc = -22
            [ 861.020257] LustreError: 13a-8: Failed to get MGS log lsdraid-OST0000 and no local copy.
            [ 861.030603] LustreError: 15c-8: MGC192.168.1.5@o2ib: The configuration from log 'lsdraid-OST0000' failed (-2). This may be the result of communication errors between this node and the MGS, a bad configuration, or other errors. See the syslog for more information.
            [ 861.060862] LustreError: 33234:0:(obd_mount_server.c:1352:server_start_targets()) failed to start server lsdraid-OST0000: -2
            [ 861.075061] LustreError: 33234:0:(obd_mount_server.c:1844:server_fill_super()) Unable to start targets: -2
            [ 861.087424] LustreError: 33234:0:(obd_mount_server.c:1558:server_put_super()) no obd lsdraid-OST0000
            [ 861.236785] Lustre: server umount lsdraid-OST0000 complete
            [ 861.244448] LustreError: 33234:0:(obd_mount.c:1453:lustre_fill_super()) Unable to mount (-2)

            jsalians_intel John Salinas (Inactive) added a comment
            jsalians_intel John Salinas (Inactive) added a comment - edited

            I tried this on master:

            (no draid – this is not using our prototype branch)
            zpool create -f -o cachefile=none -O recordsize=16MB ost0 raidz1 /dev/mapper/mpathaj /dev/mapper/mpathai /dev/mapper/mpathah /dev/mapper/mpathag /dev/mapper/mpathaq /dev/mapper/mpathap /dev/mapper/mpathak /dev/mapper/mpathz /dev/mapper/mpatham /dev/mapper/mpathal /dev/mapper/mpathao
            mkfs.lustre --backfstype=zfs --reformat --replace --fsname=lsdraid --ost --index=0 --mgsnode=192.168.1.5@o2ib ost0/ost0
            mount -t lustre ost0/ost0 /mnt/lustre/ost0
            mount.lustre: mount ost0/ost0 at /mnt/lustre/ost0 failed: No such file or directory
            Is the MGS specification correct?
            Is the filesystem name correct?
            If upgrading, is the copied client log valid? (see upgrade docs)

            I repeated the mount with lctl debug enabled and attached the log here. This build was from Lustre master:

            Git Build Data

            Revision: 6fad3abf6f962d04989422cb44dfb7aa0835ad07
            refs/remotes/origin/master
            Built Branches

            refs/remotes/origin/master: Build #170 of Revision 6fad3abf6f962d04989422cb44dfb7aa0835ad07 (refs/remotes/origin/master)

            Fetching changes from the remote Git repository
            > git config remote.origin.url ssh://hudson@review.whamcloud.com:29418/fs/lustre-release # timeout=10
            Fetching upstream changes from ssh://hudson@review.whamcloud.com:29418/fs/lustre-release
            > git --version # timeout=10
            > git -c core.askpass=true fetch --tags --progress ssh://hudson@review.whamcloud.com:29418/fs/lustre-release +refs/heads/*:refs/remotes/origin/* --depth=1
            Checking out Revision 6fad3abf6f962d04989422cb44dfb7aa0835ad07 (refs/remotes/origin/master)
            > git config core.sparsecheckout # timeout=10
            > git checkout -f 6fad3abf6f962d04989422cb44dfb7aa0835ad07
            > git rev-list 3ed9f9a0b43bc48cf778539f1281cd60332b99d3 # timeout=10
            > git tag -a -f -m Jenkins Build #170 jenkins-arch=x86_64,build_type=client,distro=el7.2,ib_stack=inkernel-170 # timeout=10
            Checking out Revision 6fad3abf6f962d04989422cb44dfb7aa0835ad07 (refs/remotes/origin/master)
            > git config core.sparsecheckout # timeout=10
            > git checkout -f 6fad3abf6f962d04989422cb44dfb7aa0835ad07
            > git rev-list 3ed9f9a0b43bc48cf778539f1281cd60332b99d3 # timeout=10
            > git tag -a -f -m Jenkins Build #170 jenkins-zfs-lustre-master-vanilla-170 # timeout=10

            Our build engineer noted: build 170 was triggered by LU-7899, which was the last SCM change.


            Good question – I reproduced this issue this morning on the following:
            -b 0 -j zfs-c-p-lustre-patch-vanilla, which is a review build that included my patch 21907.3, or build 116 – however I believe this builds whatever was submitted for review last.

            mount.lustre: mount ost0/ost0 at /mnt/lustre/ost0 failed: No such file or directory
            Is the MGS specification correct?
            Is the filesystem name correct?

            I am installing the tip of master now to see if it has the same issue.

            jsalians_intel John Salinas (Inactive) added a comment

            Hi John,

            Thanks for the logs. I took a quick look, but there's nothing obvious. The MDS says it's sending over a 1MB config RPC, so I'm not sure why the MGC thinks it's not getting anything. I'll take a closer look tomorrow.

            Can you confirm you are just running straight master, no patches? FWIW line 1716 doesn't correspond to a GOTO statement on the tip of master for me (hash 6fad3ab).

            You could try changing the return code from -EINVAL to 0 on that eof check as a workaround. It shouldn't cause any problems to receive a 0 length RPC if you aren't using nodemap, but it also shouldn't happen as far as I understand it. Here's the eof check I mean:

                            if (!eof)
                                    rc = 0;
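                                    /* note: shown with the workaround already applied; stock master sets rc = -EINVAL here */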
            

            What does your test setup look like? Is there any way to reproduce the failure in maloo?

            When you say "since this feature landed I have been unable to use master" can you clarify which feature you mean? There have been a number of patches related to nodemap config transfer that have landed in the past couple of months. If you could specify the last version (commit hash) that worked, and the first version that didn't, that would be helpful.

            Thanks,
            Kit

            kit.westneat Kit Westneat (Inactive) added a comment

            This seems like a pretty serious bug: since this feature landed I have been unable to use master at all, as I cannot get any OST to connect to the MDS. Is there any workaround? CORAL testing has completely stopped because of this.

            jsalians_intel John Salinas (Inactive) added a comment

            This log covers from: mount -t lustre zfspool/mdt1 /mnt/mdt to when the OST fails to connect.

            jsalians_intel John Salinas (Inactive) added a comment
            pjones Peter Jones added a comment -

            Kit

            Don't worry - 21891 is just a port of the LU-8460 fix to another branch, so it is not important

            Peter


            It looks like the MGS hasn't created the nodemap config when the OST is mounting. Can you get the MGS logs as well? Or is there a maloo link for this failure? I'm not able to see patch 21891 for some reason.

            kit.westneat Kit Westneat (Inactive) added a comment
            jsalians_intel John Salinas (Inactive) added a comment - edited

            Currently this is blocking testing of: LU-8460 and http://review.whamcloud.com/#/c/21891/

            1. lctl set_param debug=-1
              debug=-1
            2. lctl dk clear
              Debug log: 4741 lines, 4741 kept, 0 dropped, 0 bad.
            3. mkfs.lustre --backfstype=zfs --reformat --replace --fsname=lsdraid --ost --index=0 --mgsnode=192.168.1.5@o2ib ost0/ost0

            Permanent disk data:
            Target: lsdraid-OST0000
            Index: 0
            Lustre FS: lsdraid
            Mount type: zfs
            Flags: 0x42
            (OST update )
            Persistent mount opts:
            Parameters: mgsnode=192.168.1.5@o2ib

            mkfs_cmd = zfs create -o canmount=off -o xattr=sa ost0/ost0
            Writing ost0/ost0 properties
            lustre:version=1
            lustre:flags=66
            lustre:index=0
            lustre:fsname=lsdraid
            lustre:svname=lsdraid-OST0000
            lustre:mgsnode=192.168.1.5@o2ib

            4. lctl dk clear
              Debug log: 7281 lines, 7281 kept, 0 dropped, 0 bad.
            5. mount -t lustre ost0/ost0 /mnt/lustre/ost0
              mount.lustre: mount ost0/ost0 at /mnt/lustre/ost0 failed: No such file or directory
              Is the MGS specification correct?
              Is the filesystem name correct?
              If upgrading, is the copied client log valid? (see upgrade docs)

            People

              Assignee: kit.westneat Kit Westneat (Inactive)
              Reporter: jay Jinshan Xiong (Inactive)
              Votes: 0
              Watchers: 6
