
configuration from log 'nodemap' failed (-22)

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Critical
    • Fix Version/s: Lustre 2.9.0

    Description

      On current master (2.8.56), a newly created file system fails to mount an OST due to a nodemap log error.

      From the log message:

      00000100:00000001:3.0:1470938259.243549:0:9524:0:(client.c:1052:ptlrpc_set_destroy()) Process leaving
      00000100:00000001:3.0:1470938259.243549:0:9524:0:(client.c:2896:ptlrpc_queue_wait()) Process leaving (rc=0 : 0 : 0)
      10000000:00000001:3.0:1470938259.243551:0:9524:0:(mgc_request.c:1716:mgc_process_recover_nodemap_log()) Process leaving via out (rc=18446744073709551594 : -22 : 0xffffffffffffffea)
      

      It looks like the corresponding nodemap log has a zero size, which triggers this error in mgc_process_recover_nodemap_log(): with no log data transferred (ealen == 0) and eof still false, rc is set to -EINVAL (-22).

              if (ealen == 0) { /* no logs transferred */
      #ifdef HAVE_SERVER_SUPPORT
                      /* config changed since first read RPC */
                      if (cld_is_nodemap(cld) && config_read_offset == 0) {
                              recent_nodemap = NULL;
                              nodemap_config_dealloc(new_config);
                              new_config = NULL;
      
                              CDEBUG(D_INFO, "nodemap config changed in transit, retrying\n");
      
                              /* setting eof to false, we request config again */
                              eof = false;
                              GOTO(out, rc = 0);
                      }
      #endif
                      if (!eof)
                              rc = -EINVAL;
                      GOTO(out, rc);
              }
      

      We have a debug log and will attach it soon.

      Attachments

        1. log_failedmount_vanilla_master
          0.2 kB
        2. log.Jinshan
          0.2 kB
        3. log.mds.LU-8498
          0.2 kB
        4. mds_lctl_log_afterpatch
          0.2 kB
        5. oss_lctl_log.afterpatch
          0.2 kB


          Activity

            [LU-8498] configuration from log 'nodemap' failed (-22)

            Kit Westneat (kit.westneat@gmail.com) uploaded a new patch: http://review.whamcloud.com/21939
            Subject: LU-8498 nodemap: new zfs index files not properly initialized
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: e5097f35594e265803a363f4e2e6fc6ea62f62fc

            gerrit Gerrit Updater added a comment

            I spent some time digging into it. It looks like there's a slight difference in the way new ldiskfs index files and new ZFS index files work. I'm working on a patch to fix it, but in the meantime I think that workaround I talked about earlier should work if you aren't using nodemap. Just change the return code from -EINVAL to 0.
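
            For reference, here is a minimal sketch of that workaround, applied to the ealen == 0 branch quoted in the description (exact line numbers may differ on your tree):

                if (ealen == 0) { /* no logs transferred */
                        /* ... HAVE_SERVER_SUPPORT nodemap retry block unchanged ... */
                        if (!eof)
                                rc = 0; /* workaround: was -EINVAL; tolerate an
                                         * empty log when nodemap is not in use */
                        GOTO(out, rc);
                }

            The proper fix is tracked in the patch referenced above (http://review.whamcloud.com/21939).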

            kit.westneat Kit Westneat (Inactive) added a comment

            I should say the Lustre messages still look the same to me:
            [ 861.004988] LustreError: 33234:0:(mgc_request.c:257:do_config_log_add()) MGC192.168.1.5@o2ib: failed processing log, type 4: rc = -22
            [ 861.020257] LustreError: 13a-8: Failed to get MGS log lsdraid-OST0000 and no local copy.
            [ 861.030603] LustreError: 15c-8: MGC192.168.1.5@o2ib: The configuration from log 'lsdraid-OST0000' failed (-2). This may be the result of communication errors between this node and the MGS, a bad configuration, or other errors. See the syslog for more information.
            [ 861.060862] LustreError: 33234:0:(obd_mount_server.c:1352:server_start_targets()) failed to start server lsdraid-OST0000: -2
            [ 861.075061] LustreError: 33234:0:(obd_mount_server.c:1844:server_fill_super()) Unable to start targets: -2
            [ 861.087424] LustreError: 33234:0:(obd_mount_server.c:1558:server_put_super()) no obd lsdraid-OST0000
            [ 861.236785] Lustre: server umount lsdraid-OST0000 complete
            [ 861.244448] LustreError: 33234:0:(obd_mount.c:1453:lustre_fill_super()) Unable to mount (-2)

            jsalians_intel John Salinas (Inactive) added a comment
            jsalians_intel John Salinas (Inactive) added a comment - edited

            I tried this on master:

            (no draid – this is not using our prototype branch)
            zpool create -f -o cachefile=none -O recordsize=16MB ost0 raidz1 /dev/mapper/mpathaj /dev/mapper/mpathai /dev/mapper/mpathah /dev/mapper/mpathag /dev/mapper/mpathaq /dev/mapper/mpathap /dev/mapper/mpathak /dev/mapper/mpathz /dev/mapper/mpatham /dev/mapper/mpathal /dev/mapper/mpathao
            mkfs.lustre --backfstype=zfs --reformat --replace --fsname=lsdraid --ost --index=0 --mgsnode=192.168.1.5@o2ib ost0/ost0
            mount -t lustre ost0/ost0 /mnt/lustre/ost0
            mount.lustre: mount ost0/ost0 at /mnt/lustre/ost0 failed: No such file or directory
            Is the MGS specification correct?
            Is the filesystem name correct?
            If upgrading, is the copied client log valid? (see upgrade docs)

            I repeated the mount with lctl debug enabled and attached the log here. This build was from Lustre master:

            Git Build Data

            Revision: 6fad3abf6f962d04989422cb44dfb7aa0835ad07
            refs/remotes/origin/master
            Built Branches

            refs/remotes/origin/master: Build #170 of Revision 6fad3abf6f962d04989422cb44dfb7aa0835ad07 (refs/remotes/origin/master)

            Fetching changes from the remote Git repository
            > git config remote.origin.url ssh://hudson@review.whamcloud.com:29418/fs/lustre-release # timeout=10
            Fetching upstream changes from ssh://hudson@review.whamcloud.com:29418/fs/lustre-release
            > git --version # timeout=10
            > git -c core.askpass=true fetch --tags --progress ssh://hudson@review.whamcloud.com:29418/fs/lustre-release +refs/heads/*:refs/remotes/origin/* --depth=1
            Checking out Revision 6fad3abf6f962d04989422cb44dfb7aa0835ad07 (refs/remotes/origin/master)
            > git config core.sparsecheckout # timeout=10
            > git checkout -f 6fad3abf6f962d04989422cb44dfb7aa0835ad07
            > git rev-list 3ed9f9a0b43bc48cf778539f1281cd60332b99d3 # timeout=10
            > git tag -a -f -m Jenkins Build #170 jenkins-arch=x86_64,build_type=client,distro=el7.2,ib_stack=inkernel-170 # timeout=10
            Checking out Revision 6fad3abf6f962d04989422cb44dfb7aa0835ad07 (refs/remotes/origin/master)
            > git config core.sparsecheckout # timeout=10
            > git checkout -f 6fad3abf6f962d04989422cb44dfb7aa0835ad07
            > git rev-list 3ed9f9a0b43bc48cf778539f1281cd60332b99d3 # timeout=10
            > git tag -a -f -m Jenkins Build #170 jenkins-zfs-lustre-master-vanilla-170 # timeout=10

            Our build engineer noted: build 170 was triggered by LU-7899, which was the last SCM change.


            Good question – I reproduced this issue this morning on the following:
            -b 0 -j zfs-c-p-lustre-patch-vanilla, which is a review build that included my patch 21907.3, or build 116 – however I believe this builds whatever was submitted for review last.

            mount.lustre: mount ost0/ost0 at /mnt/lustre/ost0 failed: No such file or directory
            Is the MGS specification correct?
            Is the filesystem name correct?

            I am installing the tip of master now to see if it has the same issue.

            jsalians_intel John Salinas (Inactive) added a comment

            Hi John,

            Thanks for the logs. I took a quick look, but there's nothing obvious. The MDS says it's sending over a 1MB config RPC, so I'm not sure why the MGC thinks it's not getting anything. I'll take a closer look tomorrow.

            Can you confirm you are just running straight master, no patches? FWIW line 1716 doesn't correspond to a GOTO statement on the tip of master for me (hash 6fad3ab).

            You could try changing the return code from -EINVAL to 0 on that eof check as a workaround. It shouldn't cause any problems to receive a 0 length RPC if you aren't using nodemap, but it also shouldn't happen as far as I understand it. Here's the eof check I mean:

                            if (!eof)
                                    rc = 0;
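                                    /* note: shown with the workaround already applied; stock master sets rc = -EINVAL here */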
            

            What does your test setup look like? Is there any way to reproduce the failure in maloo?

            When you say "since this feature landed I have been unable to use master" can you clarify which feature you mean? There have been a number of patches related to nodemap config transfer that have landed in the past couple of months. If you could specify the last version (commit hash) that worked, and the first version that didn't, that would be helpful.

            Thanks,
            Kit

            kit.westneat Kit Westneat (Inactive) added a comment

            This seems like a pretty serious bug: since this feature landed I have been unable to use master at all, as I cannot get any OST to connect to the MDS. Is there any workaround? CORAL testing has completely stopped because of this.

            jsalians_intel John Salinas (Inactive) added a comment

            This log covers from: mount -t lustre zfspool/mdt1 /mnt/mdt to when the OST fails to connect.

            jsalians_intel John Salinas (Inactive) added a comment
            pjones Peter Jones added a comment -

            Kit

            Don't worry - 21891 is just a port of the LU-8460 fix to another branch, so it is not important

            Peter


            It looks like the MGS hasn't created the nodemap config when the OST is mounting. Can you get the MGS logs as well? Or is there a maloo link for this failure? I'm not able to see patch 21891 for some reason.

            kit.westneat Kit Westneat (Inactive) added a comment
            jsalians_intel John Salinas (Inactive) added a comment - edited

            Currently this is blocking testing of: LU-8460 and http://review.whamcloud.com/#/c/21891/

            1. lctl set_param debug=-1
              debug=-1
            2. lctl dk clear
              Debug log: 4741 lines, 4741 kept, 0 dropped, 0 bad.
            3. mkfs.lustre --backfstype=zfs --reformat --replace --fsname=lsdraid --ost --index=0 --mgsnode=192.168.1.5@o2ib ost0/ost0

            Permanent disk data:
            Target: lsdraid-OST0000
            Index: 0
            Lustre FS: lsdraid
            Mount type: zfs
            Flags: 0x42
            (OST update )
            Persistent mount opts:
            Parameters: mgsnode=192.168.1.5@o2ib

            mkfs_cmd = zfs create -o canmount=off -o xattr=sa ost0/ost0
            Writing ost0/ost0 properties
            lustre:version=1
            lustre:flags=66
            lustre:index=0
            lustre:fsname=lsdraid
            lustre:svname=lsdraid-OST0000
            lustre:mgsnode=192.168.1.5@o2ib

            4. lctl dk clear
              Debug log: 7281 lines, 7281 kept, 0 dropped, 0 bad.
            5. mount -t lustre ost0/ost0 /mnt/lustre/ost0
              mount.lustre: mount ost0/ost0 at /mnt/lustre/ost0 failed: No such file or directory
              Is the MGS specification correct?
              Is the filesystem name correct?
              If upgrading, is the copied client log valid? (see upgrade docs)

            People

              Assignee: kit.westneat Kit Westneat (Inactive)
              Reporter: jay Jinshan Xiong (Inactive)
              Votes: 0
              Watchers: 6
