[LU-8498] configuration from log 'nodemap' failed (-22) - Whamcloud Community JIRA

Details

Type: Bug
Resolution: Fixed
Priority: Critical
Fix Version/s: Lustre 2.9.0
Affects Version/s: None
Labels:
- LS_RZ

Severity:
1
Rank (Obsolete):
9223372036854775807

Description

In current master 2.8.56, a newly created file system failed at mounting OST due to nodemap log error.

From the log message:

00000100:00000001:3.0:1470938259.243549:0:9524:0:(client.c:1052:ptlrpc_set_destroy()) Process leaving
00000100:00000001:3.0:1470938259.243549:0:9524:0:(client.c:2896:ptlrpc_queue_wait()) Process leaving (rc=0 : 0 : 0)
10000000:00000001:3.0:1470938259.243551:0:9524:0:(mgc_request.c:1716:mgc_process_recover_nodemap_log()) Process leaving via out (rc=18446744073709551594 : -22 : 0xffffffffffffffea)

it looks like the corresponding log has a zero size that triggered this error.

        if (ealen == 0) { /* no logs transferred */
#ifdef HAVE_SERVER_SUPPORT
                /* config changed since first read RPC */
                if (cld_is_nodemap(cld) && config_read_offset == 0) {
                        recent_nodemap = NULL;
                        nodemap_config_dealloc(new_config);
                        new_config = NULL;

                        CDEBUG(D_INFO, "nodemap config changed in transit, retrying\n");

                        /* setting eof to false, we request config again */
                        eof = false;
                        GOTO(out, rc = 0);
                }
#endif
                if (!eof)
                        rc = -EINVAL;
                GOTO(out, rc);
        }

We have a debug log and will attach it soon.

Attachments

- Sort By Name
- Sort By Date
- Ascending
- Descending
- Thumbnails
- List
- Download All

log_failedmount_vanilla_master
0.2 kB
15/Aug/16 6:53 PM
log.Jinshan
0.2 kB
11/Aug/16 7:08 PM
log.mds.LU-8498
0.2 kB
12/Aug/16 9:51 PM
mds_lctl_log_afterpatch
0.2 kB
17/Aug/16 3:55 PM
oss_lctl_log.afterpatch
0.2 kB
17/Aug/16 3:51 PM

Issue Links

is related to

LU-3291 IU UID/GID Mapping Feature

Resolved

Activity

[LU-8498] configuration from log 'nodemap' failed (-22)

Peter Jones added a comment - 08/Sep/16 4:19 AM

Landed for 2.9

Peter Jones added a comment - 08/Sep/16 4:19 AM Landed for 2.9

Gerrit Updater added a comment - 08/Sep/16 2:05 AM

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/21939/
Subject: ~~LU-8498~~ nodemap: new zfs index files not properly initialized
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: dd69c1da4ab22c66801df6a6e316d210ea71454d

Gerrit Updater added a comment - 08/Sep/16 2:05 AM Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/21939/ Subject: LU-8498 nodemap: new zfs index files not properly initialized Project: fs/lustre-release Branch: master Current Patch Set: Commit: dd69c1da4ab22c66801df6a6e316d210ea71454d

John Salinas (Inactive) added a comment - 26/Aug/16 11:44 PM

It appears sometimes the ost still does connect – I just tried now and of course it connected because I was trying to collect more information. I will lower priority for now.

John Salinas (Inactive) added a comment - 26/Aug/16 11:44 PM It appears sometimes the ost still does connect – I just tried now and of course it connected because I was trying to collect more information. I will lower priority for now.

Kit Westneat (Inactive) added a comment - 20/Aug/16 11:32 PM

Hey John,

I see that there's a nodemap error, which is definitely a problem, but it looks like the OST still connects ok:
Aug 20 19:55:46 wolf-4 kernel: Lustre: lsdraid-OST0001: Connection restored to 192.168.1.5@o2ib (at 192.168.1.5@o2ib)

Was there a problem on the clients as well? Do you happen to have the debug logs for this time period for the OSS and MDS?

BTW there's another nodemap related bug / patch here:
http://review.whamcloud.com/#/c/22004/

I don't think it's related to what you're seeing, but it could be worth adding to the mix.

Kit Westneat (Inactive) added a comment - 20/Aug/16 11:32 PM Hey John, I see that there's a nodemap error, which is definitely a problem, but it looks like the OST still connects ok: Aug 20 19:55:46 wolf-4 kernel: Lustre: lsdraid-OST0001: Connection restored to 192.168.1.5@o2ib (at 192.168.1.5@o2ib) Was there a problem on the clients as well? Do you happen to have the debug logs for this time period for the OSS and MDS? BTW there's another nodemap related bug / patch here: http://review.whamcloud.com/#/c/22004/ I don't think it's related to what you're seeing, but it could be worth adding to the mix.

John Salinas (Inactive) added a comment - 20/Aug/16 8:00 PM

It does indeed appear that I can only connect 1 ost – when the second ost tries to connect I see this:

[root@wolf-4 mdisolation_step2_draid_testing]# zpool create -f -o cachefile=none -O recordsize=16MB ost1 draid1 /dev/mapper/mpathl /dev/mapper/mpathk /dev/mapper/mpathj /dev/mapper/mpathi /dev/mapper/mpathr /dev/mapper/mpathm /dev/mapper/mpathb /dev/mapper/mpatha /dev/mapper/mpatho /dev/mapper/mpathn /dev/mapper/mpathq spare '$draid1-0-s0' metadata mirror /dev/mapper/mpathg /dev/mapper/mpathx
[root@wolf-4 mdisolation_step2_draid_testing]# zpool feature@large_blocks=enabled ost1
[root@wolf-4 mdisolation_step2_draid_testing]# mkfs.lustre --backfstype=zfs --reformat --fsname=lsdraid --ost --index=1 --mgsnode=192.168.1.5@o2ib ost1/ost1

Permanent disk data:
Target: lsdraid:OST0001
Index: 1
Lustre FS: lsdraid
Mount type: zfs
Flags: 0x62
(OST first_time update )
Persistent mount opts:
Parameters: mgsnode=192.168.1.5@o2ib

mkfs_cmd = zfs create -o canmount=off -o xattr=sa ost1/ost1
Writing ost1/ost1 properties
lustre:version=1
lustre:flags=98
lustre:index=1
lustre:fsname=lsdraid
lustre:svname=lsdraid:OST0001
lustre:mgsnode=192.168.1.5@o2ib
[root@wolf-4 mdisolation_step2_draid_testing]# mount -t lustre ost1/ost1 /mnt/lustre/ost1

Aug 20 19:55:13 wolf-4 kernel: ZFS: spa metadata allocation class feature not enabled (ddt,dmu,mos)
Aug 20 19:55:13 wolf-4 kernel: ZFS: vdev_alloc primary 'ddt,dmu,mos'
Aug 20 19:55:14 wolf-4 kernel: ZFS: vdev_construct_zaps assign primary 'ddt,dmu,mos'
Aug 20 19:55:27 wolf-4 ntpd_intres[2643]: ntp_intres.request: permission denied
Aug 20 19:55:27 wolf-4 ntpd_intres[2643]: ntp_intres.request: permission denied
Aug 20 19:55:41 wolf-4 kernel: LustreError: 41114:0:(mgc_request.c:257:do_config_log_add()) MGC192.168.1.5@o2ib: failed processing log, type 4: rc = -22
Aug 20 19:55:41 wolf-4 kernel: Lustre: lsdraid-OST0001: new disk, initializing
Aug 20 19:55:41 wolf-4 kernel: Lustre: srv-lsdraid-OST0001: No data found on store. Initialize space
Aug 20 19:55:46 wolf-4 kernel: Lustre: lsdraid-OST0001: Connection restored to 192.168.1.5@o2ib (at 192.168.1.5@o2ib)
Aug 20 19:55:46 wolf-4 kernel: Lustre: Skipped 2 previous similar messages

[root@wolf-4 mdisolation_step2_draid_testing]# mount |grep lustre
ost1/ost1 on /mnt/lustre/ost1 type lustre (ro)

from mds:
[root@wolf-5 mdisolation_step2_draid_testing]# lctl dl
0 UP osd-zfs lsdraid-MDT0000-osd lsdraid-MDT0000-osd_UUID 10
1 UP mgs MGS MGS 9
2 UP mgc MGC192.168.1.5@o2ib ac271bb5-f2ee-996f-94cf-6c9cd20a078a 5
3 UP mds MDS MDS_uuid 3
4 UP lod lsdraid-MDT0000-mdtlov lsdraid-MDT0000-mdtlov_UUID 4
5 UP mdt lsdraid-MDT0000 lsdraid-MDT0000_UUID 9
6 UP mdd lsdraid-MDD0000 lsdraid-MDD0000_UUID 4
7 UP qmt lsdraid-QMT0000 lsdraid-QMT0000_UUID 4
8 UP lwp lsdraid-MDT0000-lwp-MDT0000 lsdraid-MDT0000-lwp-MDT0000_UUID 5
9 UP osp lsdraid-OST0000-osc-MDT0000 lsdraid-MDT0000-mdtlov_UUID 5
10 UP osp lsdraid-OST0001-osc-MDT0000 lsdraid-MDT0000-mdtlov_UUID 5

from oss1
[root@wolf-3 mdisolation_step2_draid_testing]# lctl dl
0 UP osd-zfs lsdraid-OST0000-osd lsdraid-OST0000-osd_UUID 5
1 UP mgc MGC192.168.1.5@o2ib c2ddb1b6-eda7-06a1-c4c7-7a10c79169f2 5
2 UP ost OSS OSS_uuid 3
3 UP obdfilter lsdraid-OST0000 lsdraid-OST0000_UUID 5
4 UP lwp lsdraid-MDT0000-lwp-OST0000 lsdraid-MDT0000-lwp-OST0000_UUID 5
[root@wolf-3 mdisolation_step2_draid_testing]#

from oss2
[root@wolf-4 mdisolation_step2_draid_testing]# lctl dl
0 UP osd-zfs lsdraid-OST0001-osd lsdraid-OST0001-osd_UUID 5
1 UP mgc MGC192.168.1.5@o2ib 9baa6009-12fa-026e-a230-d8562d7b63c7 5
2 UP ost OSS OSS_uuid 3
3 UP obdfilter lsdraid-OST0001 lsdraid-OST0001_UUID 5
4 UP lwp lsdraid-MDT0000-lwp-OST0001 lsdraid-MDT0000-lwp-OST0001_UUID 5

John Salinas (Inactive) added a comment - 20/Aug/16 8:00 PM It does indeed appear that I can only connect 1 ost – when the second ost tries to connect I see this: [root@wolf-4 mdisolation_step2_draid_testing] # zpool create -f -o cachefile=none -O recordsize=16MB ost1 draid1 /dev/mapper/mpathl /dev/mapper/mpathk /dev/mapper/mpathj /dev/mapper/mpathi /dev/mapper/mpathr /dev/mapper/mpathm /dev/mapper/mpathb /dev/mapper/mpatha /dev/mapper/mpatho /dev/mapper/mpathn /dev/mapper/mpathq spare '$draid1-0-s0' metadata mirror /dev/mapper/mpathg /dev/mapper/mpathx [root@wolf-4 mdisolation_step2_draid_testing] # zpool feature@large_blocks=enabled ost1 [root@wolf-4 mdisolation_step2_draid_testing] # mkfs.lustre --backfstype=zfs --reformat --fsname=lsdraid --ost --index=1 --mgsnode=192.168.1.5@o2ib ost1/ost1 Permanent disk data: Target: lsdraid:OST0001 Index: 1 Lustre FS: lsdraid Mount type: zfs Flags: 0x62 (OST first_time update ) Persistent mount opts: Parameters: mgsnode=192.168.1.5@o2ib mkfs_cmd = zfs create -o canmount=off -o xattr=sa ost1/ost1 Writing ost1/ost1 properties lustre:version=1 lustre:flags=98 lustre:index=1 lustre:fsname=lsdraid lustre:svname=lsdraid:OST0001 lustre:mgsnode=192.168.1.5@o2ib [root@wolf-4 mdisolation_step2_draid_testing] # mount -t lustre ost1/ost1 /mnt/lustre/ost1 Aug 20 19:55:13 wolf-4 kernel: ZFS: spa metadata allocation class feature not enabled (ddt,dmu,mos) Aug 20 19:55:13 wolf-4 kernel: ZFS: vdev_alloc primary 'ddt,dmu,mos' Aug 20 19:55:14 wolf-4 kernel: ZFS: vdev_construct_zaps assign primary 'ddt,dmu,mos' Aug 20 19:55:27 wolf-4 ntpd_intres [2643] : ntp_intres.request: permission denied Aug 20 19:55:27 wolf-4 ntpd_intres [2643] : ntp_intres.request: permission denied Aug 20 19:55:41 wolf-4 kernel: LustreError: 41114:0:(mgc_request.c:257:do_config_log_add()) MGC192.168.1.5@o2ib: failed processing log, type 4: rc = -22 Aug 20 19:55:41 wolf-4 kernel: Lustre: lsdraid-OST0001: new disk, initializing Aug 20 19:55:41 wolf-4 kernel: Lustre: srv-lsdraid-OST0001: No data found on store. Initialize space Aug 20 19:55:46 wolf-4 kernel: Lustre: lsdraid-OST0001: Connection restored to 192.168.1.5@o2ib (at 192.168.1.5@o2ib) Aug 20 19:55:46 wolf-4 kernel: Lustre: Skipped 2 previous similar messages [root@wolf-4 mdisolation_step2_draid_testing] # mount |grep lustre ost1/ost1 on /mnt/lustre/ost1 type lustre (ro) from mds: [root@wolf-5 mdisolation_step2_draid_testing] # lctl dl 0 UP osd-zfs lsdraid-MDT0000-osd lsdraid-MDT0000-osd_UUID 10 1 UP mgs MGS MGS 9 2 UP mgc MGC192.168.1.5@o2ib ac271bb5-f2ee-996f-94cf-6c9cd20a078a 5 3 UP mds MDS MDS_uuid 3 4 UP lod lsdraid-MDT0000-mdtlov lsdraid-MDT0000-mdtlov_UUID 4 5 UP mdt lsdraid-MDT0000 lsdraid-MDT0000_UUID 9 6 UP mdd lsdraid-MDD0000 lsdraid-MDD0000_UUID 4 7 UP qmt lsdraid-QMT0000 lsdraid-QMT0000_UUID 4 8 UP lwp lsdraid-MDT0000-lwp-MDT0000 lsdraid-MDT0000-lwp-MDT0000_UUID 5 9 UP osp lsdraid-OST0000-osc-MDT0000 lsdraid-MDT0000-mdtlov_UUID 5 10 UP osp lsdraid-OST0001-osc-MDT0000 lsdraid-MDT0000-mdtlov_UUID 5 from oss1 [root@wolf-3 mdisolation_step2_draid_testing] # lctl dl 0 UP osd-zfs lsdraid-OST0000-osd lsdraid-OST0000-osd_UUID 5 1 UP mgc MGC192.168.1.5@o2ib c2ddb1b6-eda7-06a1-c4c7-7a10c79169f2 5 2 UP ost OSS OSS_uuid 3 3 UP obdfilter lsdraid-OST0000 lsdraid-OST0000_UUID 5 4 UP lwp lsdraid-MDT0000-lwp-OST0000 lsdraid-MDT0000-lwp-OST0000_UUID 5 [root@wolf-3 mdisolation_step2_draid_testing] # from oss2 [root@wolf-4 mdisolation_step2_draid_testing] # lctl dl 0 UP osd-zfs lsdraid-OST0001-osd lsdraid-OST0001-osd_UUID 5 1 UP mgc MGC192.168.1.5@o2ib 9baa6009-12fa-026e-a230-d8562d7b63c7 5 2 UP ost OSS OSS_uuid 3 3 UP obdfilter lsdraid-OST0001 lsdraid-OST0001_UUID 5 4 UP lwp lsdraid-MDT0000-lwp-OST0001 lsdraid-MDT0000-lwp-OST0001_UUID 5

John Salinas (Inactive) added a comment - 19/Aug/16 10:06 PM - edited

The good news is one OST can now connect to MDS. I saw one odd messages when clients are mounting I see this:

[26059.015447] libcfs: module verification failed: signature and/or required key missing - tainting kernel
[26059.030063] LNet: HW CPU cores: 72, npartitions: 8
[26059.038309] alg: No test for adler32 (adler32-zlib)
[26059.044985] alg: No test for crc32 (crc32-table)
[26068.348811] LNet: HW CPU cores: 72, npartitions: 8
[26068.357409] alg: No test for adler32 (adler32-zlib)
[26068.364132] alg: No test for crc32 (crc32-table)
[26076.388483] Lustre: Lustre: Build Version: 2.8.56
[26076.471053] LNet: Added LNI 192.168.1.6@o2ib [8/512/0/180]
[26076.536301] LustreError: 44972:0:(mgc_request.c:1533:mgc_apply_recover_logs()) mgc: cannot find uuid by nid 192.168.1.4@o2ib
[26076.550097] Lustre: 44972:0:(mgc_request.c:1760:mgc_process_recover_nodemap_log()) MGC192.168.1.5@o2ib: error processing recovery log lsdraid-cliir: rc = -2

.4 is OSS2 and .5 is MDS node

[root@wolf-7 ~]# df -h /mnt/lustre/
Filesystem Size Used Avail Use% Mounted on
192.168.1.5@o2ib:/lsdraid 7.0T 16M 7.0T 1% /mnt/lustre

so we are missing the zpool from OSS2:

[root@wolf-3 ~]# zpool list
NAME SIZE ALLOC FREE EXPANDSZ FRAG CAP DEDUP HEALTH ALTROOT
ost0 9.97T 21.0M 9.97T - 0% 0% 1.00x ONLINE -
[root@wolf-3 ~]# zfs list
NAME USED AVAIL REFER MOUNTPOINT
ost0 17.8M 7.01T 19K /ost0
ost0/ost0 17.6M 7.01T 17.6M /ost0/ost0
[root@wolf-3 ~]# mount |grep lustre
ost0/ost0 on /mnt/lustre/ost0 type lustre (ro)

[root@wolf-4 mdisolation_step2_draid_testing]# zpool list
NAME SIZE ALLOC FREE EXPANDSZ FRAG CAP DEDUP HEALTH ALTROOT
ost1 9.97T 846K 9.97T - 0% 0% 1.00x ONLINE -
[root@wolf-4 mdisolation_step2_draid_testing]# zfs list
NAME USED AVAIL REFER MOUNTPOINT
ost1 1.51M 7.01T 19K /ost1
ost1/ost1 1.37M 7.01T 1.37M /ost1/ost1
[root@wolf-4 mdisolation_step2_draid_testing]# mount |grep lustre
ost1/ost1 on /mnt/lustre/ost1 type lustre (ro)
[25673.466998] ZFS: spa metadata allocation class feature not enabled (ddt,dmu,mos)
[25673.477813] ZFS: vdev_alloc primary 'ddt,dmu,mos'
[25674.379121] ZFS: vdev_construct_zaps assign primary 'ddt,dmu,mos'
[25677.322609] LustreError: 52126:0:(mgc_request.c:257:do_config_log_add()) MGC192.168.1.5@o2ib: failed processing log, type 4: rc = -22
[25677.342892] Lustre: lsdraid-OST0000: new disk, initializing
[25677.351849] Lustre: srv-lsdraid-OST0000: No data found on store. Initialize space
[25677.450718] Lustre: lsdraid-OST0000: Imperative Recovery enabled, recovery window shrunk from 300-900 down to 150-450
[26177.093757] LustreError: 11-0: lsdraid-MDT0000-lwp-OST0000: operation obd_ping to node 192.168.1.5@o2ib failed: rc = -107
[26177.108646] LustreError: Skipped 21 previous similar messages
[26177.117392] Lustre: lsdraid-MDT0000-lwp-OST0000: Connection to lsdraid-MDT0000 (at 192.168.1.5@o2ib) was lost; in progress operations using this service will wait for recovery to complete
[26177.140650] Lustre: Skipped 21 previous similar messages
[26177.149666] LustreError: 167-0: lsdraid-MDT0000-lwp-OST0000: This client was evicted by lsdraid-MDT0000; in progress operations using this service will fail.
[26177.170128] LustreError: Skipped 21 previous similar messages
[26177.180193] Lustre: lsdraid-MDT0000-lwp-OST0000: Connection restored to 192.168.1.5@o2ib (at 192.168.1.5@o2ib)
[26177.194074] Lustre: Skipped 21 previous similar messages

John Salinas (Inactive) added a comment - 19/Aug/16 10:06 PM - edited The good news is one OST can now connect to MDS. I saw one odd messages when clients are mounting I see this: [26059.015447] libcfs: module verification failed: signature and/or required key missing - tainting kernel [26059.030063] LNet: HW CPU cores: 72, npartitions: 8 [26059.038309] alg: No test for adler32 (adler32-zlib) [26059.044985] alg: No test for crc32 (crc32-table) [26068.348811] LNet: HW CPU cores: 72, npartitions: 8 [26068.357409] alg: No test for adler32 (adler32-zlib) [26068.364132] alg: No test for crc32 (crc32-table) [26076.388483] Lustre: Lustre: Build Version: 2.8.56 [26076.471053] LNet: Added LNI 192.168.1.6@o2ib [8/512/0/180] [26076.536301] LustreError: 44972:0:(mgc_request.c:1533:mgc_apply_recover_logs()) mgc: cannot find uuid by nid 192.168.1.4@o2ib [26076.550097] Lustre: 44972:0:(mgc_request.c:1760:mgc_process_recover_nodemap_log()) MGC192.168.1.5@o2ib: error processing recovery log lsdraid-cliir: rc = -2 .4 is OSS2 and .5 is MDS node [root@wolf-7 ~] # df -h /mnt/lustre/ Filesystem Size Used Avail Use% Mounted on 192.168.1.5@o2ib:/lsdraid 7.0T 16M 7.0T 1% /mnt/lustre so we are missing the zpool from OSS2: [root@wolf-3 ~] # zpool list NAME SIZE ALLOC FREE EXPANDSZ FRAG CAP DEDUP HEALTH ALTROOT ost0 9.97T 21.0M 9.97T - 0% 0% 1.00x ONLINE - [root@wolf-3 ~] # zfs list NAME USED AVAIL REFER MOUNTPOINT ost0 17.8M 7.01T 19K /ost0 ost0/ost0 17.6M 7.01T 17.6M /ost0/ost0 [root@wolf-3 ~] # mount |grep lustre ost0/ost0 on /mnt/lustre/ost0 type lustre (ro) [root@wolf-4 mdisolation_step2_draid_testing] # zpool list NAME SIZE ALLOC FREE EXPANDSZ FRAG CAP DEDUP HEALTH ALTROOT ost1 9.97T 846K 9.97T - 0% 0% 1.00x ONLINE - [root@wolf-4 mdisolation_step2_draid_testing] # zfs list NAME USED AVAIL REFER MOUNTPOINT ost1 1.51M 7.01T 19K /ost1 ost1/ost1 1.37M 7.01T 1.37M /ost1/ost1 [root@wolf-4 mdisolation_step2_draid_testing] # mount |grep lustre ost1/ost1 on /mnt/lustre/ost1 type lustre (ro) [25673.466998] ZFS: spa metadata allocation class feature not enabled (ddt,dmu,mos) [25673.477813] ZFS: vdev_alloc primary 'ddt,dmu,mos' [25674.379121] ZFS: vdev_construct_zaps assign primary 'ddt,dmu,mos' [25677.322609] LustreError: 52126:0:(mgc_request.c:257:do_config_log_add()) MGC192.168.1.5@o2ib: failed processing log, type 4: rc = -22 [25677.342892] Lustre: lsdraid-OST0000: new disk, initializing [25677.351849] Lustre: srv-lsdraid-OST0000: No data found on store. Initialize space [25677.450718] Lustre: lsdraid-OST0000: Imperative Recovery enabled, recovery window shrunk from 300-900 down to 150-450 [26177.093757] LustreError: 11-0: lsdraid-MDT0000-lwp-OST0000: operation obd_ping to node 192.168.1.5@o2ib failed: rc = -107 [26177.108646] LustreError: Skipped 21 previous similar messages [26177.117392] Lustre: lsdraid-MDT0000-lwp-OST0000: Connection to lsdraid-MDT0000 (at 192.168.1.5@o2ib) was lost; in progress operations using this service will wait for recovery to complete [26177.140650] Lustre: Skipped 21 previous similar messages [26177.149666] LustreError: 167-0: lsdraid-MDT0000-lwp-OST0000: This client was evicted by lsdraid-MDT0000; in progress operations using this service will fail. [26177.170128] LustreError: Skipped 21 previous similar messages [26177.180193] Lustre: lsdraid-MDT0000-lwp-OST0000: Connection restored to 192.168.1.5@o2ib (at 192.168.1.5@o2ib) [26177.194074] Lustre: Skipped 21 previous similar messages

John Salinas (Inactive) added a comment - 19/Aug/16 6:02 PM

Today I re-built with tip of master + this patch and was able to get an OSS to mount successfully – using the same scripts:

./setup_oss1.sh
Restarting network (via systemctl): Job for network.service failed because the control process exited with error code. See "systemctl status network.service" and "journalctl -xe" for details.
[FAILED]
modprobe: FATAL: Module lnet is in use.
modprobe: FATAL: Module zfs is in use.
pool: ost0
state: ONLINE
scan: none requested
config:

NAME STATE READ WRITE CKSUM
ost0 ONLINE 0 0 0
draid1-0

{any}

ONLINE 0 0 0
mpathaj ONLINE 0 0 0
mpathai ONLINE 0 0 0
mpathah ONLINE 0 0 0
mpathag ONLINE 0 0 0
mpathaq ONLINE 0 0 0
mpathap ONLINE 0 0 0
mpathak ONLINE 0 0 0
mpathz ONLINE 0 0 0
mpatham ONLINE 0 0 0
mpathal ONLINE 0 0 0
mpathao ONLINE 0 0 0
mirror-1

{ddt,dmu,mos}

ONLINE 0 0 0
mpathae ONLINE 0 0 0
mpathaa ONLINE 0 0 0
spares
$draid1-0-s0 AVAIL

errors: No known data errors
ost0 feature@large_blocks enabled local

Permanent disk data:
Target: lsdraid-OST0000
Index: 0
Lustre FS: lsdraid
Mount type: zfs
Flags: 0x42
(OST update )
Persistent mount opts:
Parameters: mgsnode=192.168.1.5@o2ib

mkfs_cmd = zfs create -o canmount=off -o xattr=sa ost0/ost0
Writing ost0/ost0 properties
lustre:version=1
lustre:flags=66
lustre:index=0
lustre:fsname=lsdraid
lustre:svname=lsdraid-OST0000
lustre:mgsnode=192.168.1.5@o2ib

mount |grep ost0
ost0 on /ost0 type zfs (rw,xattr,noacl)
ost0/ost0 on /mnt/lustre/ost0 type lustre (ro)

Will install the rest of the system and if we can get all of the nodes up then will mark this as fixed – but we will need this fix merged.

John Salinas (Inactive) added a comment - 19/Aug/16 6:02 PM Today I re-built with tip of master + this patch and was able to get an OSS to mount successfully – using the same scripts: ./setup_oss1.sh Restarting network (via systemctl): Job for network.service failed because the control process exited with error code. See "systemctl status network.service" and "journalctl -xe" for details. [FAILED] modprobe: FATAL: Module lnet is in use. modprobe: FATAL: Module zfs is in use. pool: ost0 state: ONLINE scan: none requested config: NAME STATE READ WRITE CKSUM ost0 ONLINE 0 0 0 draid1-0 {any} ONLINE 0 0 0 mpathaj ONLINE 0 0 0 mpathai ONLINE 0 0 0 mpathah ONLINE 0 0 0 mpathag ONLINE 0 0 0 mpathaq ONLINE 0 0 0 mpathap ONLINE 0 0 0 mpathak ONLINE 0 0 0 mpathz ONLINE 0 0 0 mpatham ONLINE 0 0 0 mpathal ONLINE 0 0 0 mpathao ONLINE 0 0 0 mirror-1 {ddt,dmu,mos} ONLINE 0 0 0 mpathae ONLINE 0 0 0 mpathaa ONLINE 0 0 0 spares $draid1-0-s0 AVAIL errors: No known data errors ost0 feature@large_blocks enabled local Permanent disk data: Target: lsdraid-OST0000 Index: 0 Lustre FS: lsdraid Mount type: zfs Flags: 0x42 (OST update ) Persistent mount opts: Parameters: mgsnode=192.168.1.5@o2ib mkfs_cmd = zfs create -o canmount=off -o xattr=sa ost0/ost0 Writing ost0/ost0 properties lustre:version=1 lustre:flags=66 lustre:index=0 lustre:fsname=lsdraid lustre:svname=lsdraid-OST0000 lustre:mgsnode=192.168.1.5@o2ib mount |grep ost0 ost0 on /ost0 type zfs (rw,xattr,noacl) ost0/ost0 on /mnt/lustre/ost0 type lustre (ro) Will install the rest of the system and if we can get all of the nodes up then will mark this as fixed – but we will need this fix merged.

John Salinas (Inactive) added a comment - 17/Aug/16 8:03 PM

I would be surprised if there could be a timing issue – I am running scripts by hand first on mds and once it is done oss1 and I try it several times by hand to capture debug lugs. This process has worked for months and months until nodemap came along. However, that doesn't mean there isn't another issue here.

John Salinas (Inactive) added a comment - 17/Aug/16 8:03 PM I would be surprised if there could be a timing issue – I am running scripts by hand first on mds and once it is done oss1 and I try it several times by hand to capture debug lugs. This process has worked for months and months until nodemap came along. However, that doesn't mean there isn't another issue here.

Kit Westneat (Inactive) added a comment - 17/Aug/16 4:03 PM

It looks like the nodemap config loaded properly, but there was an error in registration:
On MDS:

20000000:02020000:21.0:1471449182.711299:0:40275:0:(mgs_handler.c:297:mgs_check_target()) 13b-9: lsdraid-OST0000 claims to have registered, but this MGS does not know about it, preventing registration.

On OSS:

10000000:01000000:37.0:1471448892.078388:0:48119:0:(mgc_request.c:2086:mgc_process_log()) MGC192.168.1.5@o2ib: configuration from log 'nodemap' succeeded (0).

Is it possible there's some kind of timing issue between reformatting the OSSes and the MDS?

Kit Westneat (Inactive) added a comment - 17/Aug/16 4:03 PM It looks like the nodemap config loaded properly, but there was an error in registration: On MDS: 20000000:02020000:21.0:1471449182.711299:0:40275:0:(mgs_handler.c:297:mgs_check_target()) 13b-9: lsdraid-OST0000 claims to have registered, but this MGS does not know about it, preventing registration. On OSS: 10000000:01000000:37.0:1471448892.078388:0:48119:0:(mgc_request.c:2086:mgc_process_log()) MGC192.168.1.5@o2ib: configuration from log 'nodemap' succeeded (0). Is it possible there's some kind of timing issue between reformatting the OSSes and the MDS?

John Salinas (Inactive) added a comment - 17/Aug/16 3:56 PM

Please see:
11192890 Aug 17 10:50 oss_lctl_log.afterpatch
8799197 Aug 17 10:54 mds_lctl_log_afterpatch

let me know if you need anything else

John Salinas (Inactive) added a comment - 17/Aug/16 3:56 PM Please see: 11192890 Aug 17 10:50 oss_lctl_log.afterpatch 8799197 Aug 17 10:54 mds_lctl_log_afterpatch let me know if you need anything else

People

Assignee:: Kit Westneat (Inactive)

Reporter:: Jinshan Xiong (Inactive)

Votes:: 0 Vote for this issue

Watchers:: 6 Start watching this issue

Dates

Created:: 11/Aug/16 6:11 PM

Updated:: 14/Jun/18 9:39 PM

Resolved:: 08/Sep/16 4:19 AM