Details
-
Bug
-
Resolution: Fixed
-
Major
-
Lustre 2.9.0
-
lola
build: master commit 71d2ea0fde17ecde0bf237f486d4bafb5d54fe3f + patches
-
3
-
9223372036854775807
Description
The error happens during soak testing of build '20160427' (see https://wiki.hpdd.intel.com/display/Releases/Soak+Testing+on+Lola#SoakTestingonLola-20160427). DNE is enabled. OSTs had been formatted with zfs, MDT's using ldiskfs as storage backend. There's 1 MDT per MDS and 4 OSTs per OSS. The OSS and MDT nodes are configured in HA active-active failover configuration.
The configuration, especially the mapping of node to role can be found here: https://wiki.hpdd.intel.com/display/Releases/Soak+Testing+on+Lola#SoakTestingonLola-Configuration.
For this tickets MDS nodes lola-[8,9] which form a HA cluster are of interest.
After a hard failover of the primary node (lola-8) had been initiated using pm -c , mounting the MGT/MDT (one device) on the secondary node (lola-9) failed with error message:
Apr 29 14:00:36 lola-9 kernel: LDISKFS-fs warning (device dm-9): ldiskfs_multi_mount_protect: MMP interval 42 higher than expected, pleas e wait. ... ... Apr 29 14:19:06 lola-9 kernel: Lustre: Skipped 2 previous similar messages Apr 29 14:19:06 lola-9 kernel: LustreError: 32539:0:(genops.c:334:class_newdev()) Device MGC192.168.1.109@o2ib10 already exists at 2, won't add Apr 29 14:19:06 lola-9 kernel: LustreError: 32539:0:(obd_config.c:370:class_attach()) Cannot create device MGC192.168.1.109@o2ib10 of type mgc : -17 Apr 29 14:19:06 lola-9 kernel: LustreError: 32539:0:(obd_mount.c:198:lustre_start_simple()) MGC192.168.1.109@o2ib10 attach error -17 Apr 29 14:19:06 lola-9 kernel: LustreError: 32539:0:(obd_mount_server.c:1512:server_put_super()) no obd soaked-MDT0000 Apr 29 14:19:06 lola-9 kernel: LustreError: 32539:0:(obd_mount_server.c:140:server_deregister_mount()) soaked-MDT0000 not registered Apr 29 14:19:06 lola-9 kernel: Lustre: 7204:0:(service.c:2096:ptlrpc_server_handle_request()) Skipped 9 previous similar messages Apr 29 14:19:06 lola-9 kernel: Lustre: server umount soaked-MDT0000 complete Apr 29 14:19:06 lola-9 kernel: LustreError: 32539:0:(obd_mount.c:1450:lustre_fill_super()) Unable to mount (-17)
The exact sequence of events reads as:
- 2016-04-29 13:53:59,419:fsmgmt.fsmgmt:INFO triggering fault mds_failover for lola-8
- 2016-04-29 14:00:25,257:fsmgmt.fsmgmt:INFO lola-8 is up!!!
- 2016-04-29 14:00:36,270:fsmgmt.fsmgmt:INFO Mounting soaked-MDT0000 on lola-9 ...
- 2016-04-29 14:19:06 mount failed with message above
Attached files:
lola-9 messages, console files, debug log files lustre-log.1461964594.5065 and lustre-log.1461964707.7204
The effect can be reproduced manually.
The device could be mounted not before mdt-1 (primary resource of lola-9) were umounted and Lustre modules had been reloaded.
Try manually
[root@lola-9 ~]# mount -t lustre -o rw,user_xattr /dev/disk/by-id/dm-name-360080e50002ff4f00000026952013088p1 /mnt/soaked-mdt0/ mount.lustre: mount /dev/mapper/360080e50002ff4f00000026952013088p1 at /mnt/soaked-mdt0 failed: File exists [root@lola-9 ~]# mount -t lustre -o rw,user_xattr /dev/disk/by-id/dm-name-360080e50002ff4f00000026952013088p1 /mnt/soaked-mdt0/ mount.lustre: mount /dev/mapper/360080e50002ff4f00000026952013088p1 at /mnt/soaked-mdt0 failed: File exists [root@lola-9 ~]# date Mon May 2 00:39:25 PDT 2016 [root@lola-9 ~]# date Mon May 2 00:39:52 PDT 2016 [root@lola-9 ~]# lctl debug_kernel /tmp/lustre-log.2016-05-02-0040 Debug log: 2321 lines, 2321 kept, 0 dropped, 0 bad.
---> attached file lustre-log.2016-05-02-0040 also attached
umount + reload Lustre modules on lola-9
##### Umount and remove lustre mods [root@lola-9 ~]# umount /mnt/soaked-mdt1 [root@lola-9 ~]# lustre_rmmod [root@lola-9 ~]# echo $? 0 [root@lola-9 ~]# lustre_rmmod [root@lola-9 ~]# lctl dl [root@lola-9 ~]# echo $? 2 [root@lola-9 ~]# date Mon May 2 00:41:10 PDT 2016 [root@lola-9 ~]# mount -t lustre -o rw,user_xattr /dev/disk/by-id/dm-name-360080e50002ff4f00000026952013088p1 /mnt/soaked-mdt0 [root@lola-9 ~]# mount -t lustre -o rw,user_xattr /dev/mapper/360080e50002ff4f00000026d52013098p1 /mnt/soaked-mdt1 [root@lola-9 ~]# ^Ctl debug_kernel [root@lola-9 ~]# date Mon May 2 00:43:52 PDT 2016 [root@lola-9 ~]# lctl debug_kernel /tmp/lustre-log.2016-05-02.0044 Debug log: 177060 lines, 177060 kept, 0 dropped, 0 bad. [root@lola-9 ~]# mount /dev/sdk1 on / type ext3 (rw) proc on /proc type proc (rw) sysfs on /sys type sysfs (rw) devpts on /dev/pts type devpts (rw,gid=5,mode=620) tmpfs on /dev/shm type tmpfs (rw) none on /proc/sys/fs/binfmt_misc type binfmt_misc (rw) sunrpc on /var/lib/nfs/rpc_pipefs type rpc_pipefs (rw) 10.4.0.1:/export/scratch on /scratch type nfs (rw,addr=10.4.0.1) 10.4.0.1:/home on /home type nfs (rw,addr=10.4.0.1) nfsd on /proc/fs/nfsd type nfsd (rw) /dev/mapper/360080e50002ff4f00000026952013088p1 on /mnt/soaked-mdt0 type lustre (rw,user_xattr) /dev/mapper/360080e50002ff4f00000026d52013098p1 on /mnt/soaked-mdt1 type lustre (rw,user_xattr) ### one more time [root@lola-9 ~]# umount /mnt/soaked-mdt1 [root@lola-9 ~]# umount /mnt/soaked-mdt0 [root@lola-9 ~]# lustre_rmmod [root@lola-9 ~]# mount -t lustre -o rw,user_xattr /dev/mapper/360080e50002ff4f00000026952013088p1 /mnt/soaked-mdt0 [root@lola-9 ~]# echo $? 0 [root@lola-9 ~]# mount -t lustre -o rw,user_xattr /dev/mapper/360080e50002ff4f00000026d52013098p1 /mnt/soaked-mdt1 [root@lola-9 ~]# echo $? 0 [root@lola-9 ~]# mount /dev/sdk1 on / type ext3 (rw) proc on /proc type proc (rw) sysfs on /sys type sysfs (rw) devpts on /dev/pts type devpts (rw,gid=5,mode=620) tmpfs on /dev/shm type tmpfs (rw) none on /proc/sys/fs/binfmt_misc type binfmt_misc (rw) sunrpc on /var/lib/nfs/rpc_pipefs type rpc_pipefs (rw) 10.4.0.1:/export/scratch on /scratch type nfs (rw,addr=10.4.0.1) 10.4.0.1:/home on /home type nfs (rw,addr=10.4.0.1) nfsd on /proc/fs/nfsd type nfsd (rw) /dev/mapper/360080e50002ff4f00000026952013088p1 on /mnt/soaked-mdt0 type lustre (rw,user_xattr) /dev/mapper/360080e50002ff4f00000026d52013098p1 on /mnt/soaked-mdt1 type lustre (rw,user_xattr) [root@lola-9 ~]# date Mon May 2 00:49:25 PDT 2016 [root@lola-9 ~]# lctl debug_kernel /tmp/lustre-log.2016-05-02-0049 Debug log: 78514 lines, 78514 kept, 0 dropped, 0 bad.
--> attached lustre-log.2016-05-02.0044, lustre-log.2016-05-02-0049
mount mdt-0 on lola-8 and do failover again
# Do failoverback to lola-8 [root@lola-8 ~]# mount -t lustre -o rw,user_xattr /dev/mapper/360080e50002ff4f00000026952013088p1 /mnt/soaked-mdt0 mount.lustre: increased /sys/block/dm-8/queue/max_sectors_kb from 1024 to 16383 mount.lustre: increased /sys/block/dm-4/queue/max_sectors_kb from 1024 to 16383 mount.lustre: increased /sys/block/sdh/queue/max_sectors_kb from 1024 to 16383 mount.lustre: increased /sys/block/sdc/queue/max_sectors_kb from 1024 to 16383 [root@lola-8 ~]# echo $? 0 [root@lola-8 ~]# lctl dl 0 UP osd-ldiskfs soaked-MDT0000-osd soaked-MDT0000-osd_UUID 35 1 UP mgs MGS MGS 7 2 UP mgc MGC192.168.1.108@o2ib10 31b8074d-c86c-18f5-40f6-bbc2e7cf6f72 5 3 UP mds MDS MDS_uuid 3 4 UP lod soaked-MDT0000-mdtlov soaked-MDT0000-mdtlov_UUID 4 5 UP mdt soaked-MDT0000 soaked-MDT0000_UUID 45 6 UP mdd soaked-MDD0000 soaked-MDD0000_UUID 4 7 UP qmt soaked-QMT0000 soaked-QMT0000_UUID 4 8 UP osp soaked-MDT0001-osp-MDT0000 soaked-MDT0000-mdtlov_UUID 5 9 UP osp soaked-MDT0002-osp-MDT0000 soaked-MDT0000-mdtlov_UUID 5 10 UP osp soaked-MDT0003-osp-MDT0000 soaked-MDT0000-mdtlov_UUID 5 11 UP osp soaked-OST0000-osc-MDT0000 soaked-MDT0000-mdtlov_UUID 5 12 UP osp soaked-OST0001-osc-MDT0000 soaked-MDT0000-mdtlov_UUID 5 13 UP osp soaked-OST0002-osc-MDT0000 soaked-MDT0000-mdtlov_UUID 5 14 UP osp soaked-OST0003-osc-MDT0000 soaked-MDT0000-mdtlov_UUID 5 15 UP osp soaked-OST0004-osc-MDT0000 soaked-MDT0000-mdtlov_UUID 5 16 UP osp soaked-OST0005-osc-MDT0000 soaked-MDT0000-mdtlov_UUID 5 17 UP osp soaked-OST0006-osc-MDT0000 soaked-MDT0000-mdtlov_UUID 5 18 UP osp soaked-OST0007-osc-MDT0000 soaked-MDT0000-mdtlov_UUID 5 19 UP osp soaked-OST0008-osc-MDT0000 soaked-MDT0000-mdtlov_UUID 5 20 UP osp soaked-OST0009-osc-MDT0000 soaked-MDT0000-mdtlov_UUID 5 21 UP osp soaked-OST000a-osc-MDT0000 soaked-MDT0000-mdtlov_UUID 5 22 UP osp soaked-OST000b-osc-MDT0000 soaked-MDT0000-mdtlov_UUID 5 23 UP osp soaked-OST000c-osc-MDT0000 soaked-MDT0000-mdtlov_UUID 5 24 UP osp soaked-OST000d-osc-MDT0000 soaked-MDT0000-mdtlov_UUID 5 25 UP osp soaked-OST000e-osc-MDT0000 soaked-MDT0000-mdtlov_UUID 5 26 UP osp soaked-OST000f-osc-MDT0000 soaked-MDT0000-mdtlov_UUID 5 27 UP osp soaked-OST0010-osc-MDT0000 soaked-MDT0000-mdtlov_UUID 5 28 UP osp soaked-OST0011-osc-MDT0000 soaked-MDT0000-mdtlov_UUID 5 29 UP osp soaked-OST0012-osc-MDT0000 soaked-MDT0000-mdtlov_UUID 5 30 UP osp soaked-OST0013-osc-MDT0000 soaked-MDT0000-mdtlov_UUID 5 31 UP osp soaked-OST0014-osc-MDT0000 soaked-MDT0000-mdtlov_UUID 5 32 UP osp soaked-OST0015-osc-MDT0000 soaked-MDT0000-mdtlov_UUID 5 33 UP osp soaked-OST0016-osc-MDT0000 soaked-MDT0000-mdtlov_UUID 5 34 UP osp soaked-OST0017-osc-MDT0000 soaked-MDT0000-mdtlov_UUID 5 35 UP lwp soaked-MDT0000-lwp-MDT0000 soaked-MDT0000-lwp-MDT0000_UUID 5 ---> now do manual hard failover again [root@lola-9 ~]# mount -t lustre -o rw,user_xattr /dev/mapper/360080e50002ff4f00000026952013088p1 /mnt/soaked-mdt0 mount.lustre: mount /dev/mapper/360080e50002ff4f00000026952013088p1 at /mnt/soaked-mdt0 failed: File exists [root@lola-9 ~]# date Mon May 2 01:05:13 PDT 2016 [root@lola-9 ~]# lctl debug_kernel /tmp/lustre-log.2016-05-02-0105 Debug log: 560 lines, 560 kept, 0 dropped, 0 bad.
--> attached file lustre-log.2016-05-02-0105
umount mdt-1 and restart Lustre again
# ---> no restart lustre again on secondary node: [root@lola-9 ~]# umount /mnt/soaked-mdt1 [root@lola-9 ~]# lustre_rmmod [root@lola-9 ~]# mount -t lustre -o rw,user_xattr /dev/mapper/360080e50002ff4f00000026d52013098p1 /mnt/soaked-mdt1 [root@lola-9 ~]# echo $? 0 [root@lola-9 ~]# mount -t lustre -o rw,user_xattr /dev/mapper/360080e50002ff4f00000026952013088p1 /mnt/soaked-mdt0 [root@lola-9 ~]# echo $? 0 [root@lola-9 ~]# mount /dev/sdk1 on / type ext3 (rw) proc on /proc type proc (rw) sysfs on /sys type sysfs (rw) devpts on /dev/pts type devpts (rw,gid=5,mode=620) tmpfs on /dev/shm type tmpfs (rw) none on /proc/sys/fs/binfmt_misc type binfmt_misc (rw) sunrpc on /var/lib/nfs/rpc_pipefs type rpc_pipefs (rw) 10.4.0.1:/export/scratch on /scratch type nfs (rw,addr=10.4.0.1) 10.4.0.1:/home on /home type nfs (rw,addr=10.4.0.1) nfsd on /proc/fs/nfsd type nfsd (rw) /dev/mapper/360080e50002ff4f00000026d52013098p1 on /mnt/soaked-mdt1 type lustre (rw,user_xattr) /dev/mapper/360080e50002ff4f00000026952013088p1 on /mnt/soaked-mdt0 type lustre (rw,user_xattr) [root@lola-9 ~]# date Mon May 2 01:09:20 PDT 2016 [root@lola-9 ~]# lctl debug_kernel /tmp/lustre-log.2016-05-02-0109 Debug log: 119979 lines, 119979 kept, 0 dropped, 0 bad.
--> attached file lustre-log.2016-05-02-0109
Attachments
Issue Links
- is related to
-
LU-4214 Hyperion - OST never recovers on failover node
-
- Resolved
-