[LU-8089] MGT/MDT mount fails on secondary HA node Created: 02/May/16 Updated: 14/Jun/18 Resolved: 25/Oct/16 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.9.0 |
| Fix Version/s: | Lustre 2.9.0 |
| Type: | Bug | Priority: | Major |
| Reporter: | Frank Heckes (Inactive) | Assignee: | Mikhail Pershin |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | soak |
| Environment: | lola |
| Attachments: | |
| Issue Links: | |
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
The error happens during soak testing of build '20160427' (see https://wiki.hpdd.intel.com/display/Releases/Soak+Testing+on+Lola#SoakTestingonLola-20160427). DNE is enabled. The OSTs were formatted with zfs, the MDTs with ldiskfs as storage backend. There is 1 MDT per MDS and 4 OSTs per OSS. The OSS and MDS nodes are configured in an HA active-active failover configuration. The configuration, especially the mapping of node to role, can be found here: https://wiki.hpdd.intel.com/display/Releases/Soak+Testing+on+Lola#SoakTestingonLola-Configuration.

After a hard failover of the primary node (lola-8) had been initiated using pm -c, mounting the MGT/MDT (one device) on the secondary node (lola-9) failed with the following error messages:

Apr 29 14:00:36 lola-9 kernel: LDISKFS-fs warning (device dm-9): ldiskfs_multi_mount_protect: MMP interval 42 higher than expected, please wait.
...
...
Apr 29 14:19:06 lola-9 kernel: Lustre: Skipped 2 previous similar messages
Apr 29 14:19:06 lola-9 kernel: LustreError: 32539:0:(genops.c:334:class_newdev()) Device MGC192.168.1.109@o2ib10 already exists at 2, won't add
Apr 29 14:19:06 lola-9 kernel: LustreError: 32539:0:(obd_config.c:370:class_attach()) Cannot create device MGC192.168.1.109@o2ib10 of type mgc : -17
Apr 29 14:19:06 lola-9 kernel: LustreError: 32539:0:(obd_mount.c:198:lustre_start_simple()) MGC192.168.1.109@o2ib10 attach error -17
Apr 29 14:19:06 lola-9 kernel: LustreError: 32539:0:(obd_mount_server.c:1512:server_put_super()) no obd soaked-MDT0000
Apr 29 14:19:06 lola-9 kernel: LustreError: 32539:0:(obd_mount_server.c:140:server_deregister_mount()) soaked-MDT0000 not registered
Apr 29 14:19:06 lola-9 kernel: Lustre: 7204:0:(service.c:2096:ptlrpc_server_handle_request()) Skipped 9 previous similar messages
Apr 29 14:19:06 lola-9 kernel: Lustre: server umount soaked-MDT0000 complete
Apr 29 14:19:06 lola-9 kernel: LustreError: 32539:0:(obd_mount.c:1450:lustre_fill_super()) Unable to mount (-17)

The exact sequence of events reads as follows:
The effect can be reproduced manually.

Try manually:
[root@lola-9 ~]# mount -t lustre -o rw,user_xattr /dev/disk/by-id/dm-name-360080e50002ff4f00000026952013088p1 /mnt/soaked-mdt0/
mount.lustre: mount /dev/mapper/360080e50002ff4f00000026952013088p1 at /mnt/soaked-mdt0 failed: File exists
[root@lola-9 ~]# mount -t lustre -o rw,user_xattr /dev/disk/by-id/dm-name-360080e50002ff4f00000026952013088p1 /mnt/soaked-mdt0/
mount.lustre: mount /dev/mapper/360080e50002ff4f00000026952013088p1 at /mnt/soaked-mdt0 failed: File exists
[root@lola-9 ~]# date
Mon May 2 00:39:25 PDT 2016
[root@lola-9 ~]# date
Mon May 2 00:39:52 PDT 2016
[root@lola-9 ~]# lctl debug_kernel /tmp/lustre-log.2016-05-02-0040
Debug log: 2321 lines, 2321 kept, 0 dropped, 0 bad.
---> attached file lustre-log.2016-05-02-0040

umount + reload Lustre modules on lola-9 (also attached):
##### Umount and remove lustre mods
[root@lola-9 ~]# umount /mnt/soaked-mdt1
[root@lola-9 ~]# lustre_rmmod
[root@lola-9 ~]# echo $?
0
[root@lola-9 ~]# lustre_rmmod
[root@lola-9 ~]# lctl dl
[root@lola-9 ~]# echo $?
2
[root@lola-9 ~]# date
Mon May 2 00:41:10 PDT 2016
[root@lola-9 ~]# mount -t lustre -o rw,user_xattr /dev/disk/by-id/dm-name-360080e50002ff4f00000026952013088p1 /mnt/soaked-mdt0
[root@lola-9 ~]# mount -t lustre -o rw,user_xattr /dev/mapper/360080e50002ff4f00000026d52013098p1 /mnt/soaked-mdt1
[root@lola-9 ~]# ^Ctl debug_kernel
[root@lola-9 ~]# date
Mon May 2 00:43:52 PDT 2016
[root@lola-9 ~]# lctl debug_kernel /tmp/lustre-log.2016-05-02.0044
Debug log: 177060 lines, 177060 kept, 0 dropped, 0 bad.
[root@lola-9 ~]# mount
/dev/sdk1 on / type ext3 (rw)
proc on /proc type proc (rw)
sysfs on /sys type sysfs (rw)
devpts on /dev/pts type devpts (rw,gid=5,mode=620)
tmpfs on /dev/shm type tmpfs (rw)
none on /proc/sys/fs/binfmt_misc type binfmt_misc (rw)
sunrpc on /var/lib/nfs/rpc_pipefs type rpc_pipefs (rw)
10.4.0.1:/export/scratch on /scratch type nfs (rw,addr=10.4.0.1)
10.4.0.1:/home on /home type nfs (rw,addr=10.4.0.1)
nfsd on /proc/fs/nfsd type nfsd (rw)
/dev/mapper/360080e50002ff4f00000026952013088p1 on /mnt/soaked-mdt0 type lustre (rw,user_xattr)
/dev/mapper/360080e50002ff4f00000026d52013098p1 on /mnt/soaked-mdt1 type lustre (rw,user_xattr)
### one more time
[root@lola-9 ~]# umount /mnt/soaked-mdt1
[root@lola-9 ~]# umount /mnt/soaked-mdt0
[root@lola-9 ~]# lustre_rmmod
[root@lola-9 ~]# mount -t lustre -o rw,user_xattr /dev/mapper/360080e50002ff4f00000026952013088p1 /mnt/soaked-mdt0
[root@lola-9 ~]# echo $?
0
[root@lola-9 ~]# mount -t lustre -o rw,user_xattr /dev/mapper/360080e50002ff4f00000026d52013098p1 /mnt/soaked-mdt1
[root@lola-9 ~]# echo $?
0
[root@lola-9 ~]# mount
/dev/sdk1 on / type ext3 (rw)
proc on /proc type proc (rw)
sysfs on /sys type sysfs (rw)
devpts on /dev/pts type devpts (rw,gid=5,mode=620)
tmpfs on /dev/shm type tmpfs (rw)
none on /proc/sys/fs/binfmt_misc type binfmt_misc (rw)
sunrpc on /var/lib/nfs/rpc_pipefs type rpc_pipefs (rw)
10.4.0.1:/export/scratch on /scratch type nfs (rw,addr=10.4.0.1)
10.4.0.1:/home on /home type nfs (rw,addr=10.4.0.1)
nfsd on /proc/fs/nfsd type nfsd (rw)
/dev/mapper/360080e50002ff4f00000026952013088p1 on /mnt/soaked-mdt0 type lustre (rw,user_xattr)
/dev/mapper/360080e50002ff4f00000026d52013098p1 on /mnt/soaked-mdt1 type lustre (rw,user_xattr)
[root@lola-9 ~]# date
Mon May 2 00:49:25 PDT 2016
[root@lola-9 ~]# lctl debug_kernel /tmp/lustre-log.2016-05-02-0049
Debug log: 78514 lines, 78514 kept, 0 dropped, 0 bad.
--> attached lustre-log.2016-05-02.0044, lustre-log.2016-05-02-0049

mount mdt-0 on lola-8 and do failover again:
# Do failover back to lola-8
[root@lola-8 ~]# mount -t lustre -o rw,user_xattr /dev/mapper/360080e50002ff4f00000026952013088p1 /mnt/soaked-mdt0
mount.lustre: increased /sys/block/dm-8/queue/max_sectors_kb from 1024 to 16383
mount.lustre: increased /sys/block/dm-4/queue/max_sectors_kb from 1024 to 16383
mount.lustre: increased /sys/block/sdh/queue/max_sectors_kb from 1024 to 16383
mount.lustre: increased /sys/block/sdc/queue/max_sectors_kb from 1024 to 16383
[root@lola-8 ~]# echo $?
0
[root@lola-8 ~]# lctl dl
  0 UP osd-ldiskfs soaked-MDT0000-osd soaked-MDT0000-osd_UUID 35
  1 UP mgs MGS MGS 7
  2 UP mgc MGC192.168.1.108@o2ib10 31b8074d-c86c-18f5-40f6-bbc2e7cf6f72 5
  3 UP mds MDS MDS_uuid 3
  4 UP lod soaked-MDT0000-mdtlov soaked-MDT0000-mdtlov_UUID 4
  5 UP mdt soaked-MDT0000 soaked-MDT0000_UUID 45
  6 UP mdd soaked-MDD0000 soaked-MDD0000_UUID 4
  7 UP qmt soaked-QMT0000 soaked-QMT0000_UUID 4
  8 UP osp soaked-MDT0001-osp-MDT0000 soaked-MDT0000-mdtlov_UUID 5
  9 UP osp soaked-MDT0002-osp-MDT0000 soaked-MDT0000-mdtlov_UUID 5
 10 UP osp soaked-MDT0003-osp-MDT0000 soaked-MDT0000-mdtlov_UUID 5
 11 UP osp soaked-OST0000-osc-MDT0000 soaked-MDT0000-mdtlov_UUID 5
 12 UP osp soaked-OST0001-osc-MDT0000 soaked-MDT0000-mdtlov_UUID 5
 13 UP osp soaked-OST0002-osc-MDT0000 soaked-MDT0000-mdtlov_UUID 5
 14 UP osp soaked-OST0003-osc-MDT0000 soaked-MDT0000-mdtlov_UUID 5
 15 UP osp soaked-OST0004-osc-MDT0000 soaked-MDT0000-mdtlov_UUID 5
 16 UP osp soaked-OST0005-osc-MDT0000 soaked-MDT0000-mdtlov_UUID 5
 17 UP osp soaked-OST0006-osc-MDT0000 soaked-MDT0000-mdtlov_UUID 5
 18 UP osp soaked-OST0007-osc-MDT0000 soaked-MDT0000-mdtlov_UUID 5
 19 UP osp soaked-OST0008-osc-MDT0000 soaked-MDT0000-mdtlov_UUID 5
 20 UP osp soaked-OST0009-osc-MDT0000 soaked-MDT0000-mdtlov_UUID 5
 21 UP osp soaked-OST000a-osc-MDT0000 soaked-MDT0000-mdtlov_UUID 5
 22 UP osp soaked-OST000b-osc-MDT0000 soaked-MDT0000-mdtlov_UUID 5
 23 UP osp soaked-OST000c-osc-MDT0000 soaked-MDT0000-mdtlov_UUID 5
 24 UP osp soaked-OST000d-osc-MDT0000 soaked-MDT0000-mdtlov_UUID 5
 25 UP osp soaked-OST000e-osc-MDT0000 soaked-MDT0000-mdtlov_UUID 5
 26 UP osp soaked-OST000f-osc-MDT0000 soaked-MDT0000-mdtlov_UUID 5
 27 UP osp soaked-OST0010-osc-MDT0000 soaked-MDT0000-mdtlov_UUID 5
 28 UP osp soaked-OST0011-osc-MDT0000 soaked-MDT0000-mdtlov_UUID 5
 29 UP osp soaked-OST0012-osc-MDT0000 soaked-MDT0000-mdtlov_UUID 5
 30 UP osp soaked-OST0013-osc-MDT0000 soaked-MDT0000-mdtlov_UUID 5
 31 UP osp soaked-OST0014-osc-MDT0000 soaked-MDT0000-mdtlov_UUID 5
 32 UP osp soaked-OST0015-osc-MDT0000 soaked-MDT0000-mdtlov_UUID 5
 33 UP osp soaked-OST0016-osc-MDT0000 soaked-MDT0000-mdtlov_UUID 5
 34 UP osp soaked-OST0017-osc-MDT0000 soaked-MDT0000-mdtlov_UUID 5
 35 UP lwp soaked-MDT0000-lwp-MDT0000 soaked-MDT0000-lwp-MDT0000_UUID 5

---> now do manual hard failover again
[root@lola-9 ~]# mount -t lustre -o rw,user_xattr /dev/mapper/360080e50002ff4f00000026952013088p1 /mnt/soaked-mdt0
mount.lustre: mount /dev/mapper/360080e50002ff4f00000026952013088p1 at /mnt/soaked-mdt0 failed: File exists
[root@lola-9 ~]# date
Mon May 2 01:05:13 PDT 2016
[root@lola-9 ~]# lctl debug_kernel /tmp/lustre-log.2016-05-02-0105
Debug log: 560 lines, 560 kept, 0 dropped, 0 bad.
--> attached file lustre-log.2016-05-02-0105

umount mdt-1 and restart Lustre again:
# ---> now restart lustre again on the secondary node:
[root@lola-9 ~]# umount /mnt/soaked-mdt1
[root@lola-9 ~]# lustre_rmmod
[root@lola-9 ~]# mount -t lustre -o rw,user_xattr /dev/mapper/360080e50002ff4f00000026d52013098p1 /mnt/soaked-mdt1
[root@lola-9 ~]# echo $?
0
[root@lola-9 ~]# mount -t lustre -o rw,user_xattr /dev/mapper/360080e50002ff4f00000026952013088p1 /mnt/soaked-mdt0
[root@lola-9 ~]# echo $?
0
[root@lola-9 ~]# mount
/dev/sdk1 on / type ext3 (rw)
proc on /proc type proc (rw)
sysfs on /sys type sysfs (rw)
devpts on /dev/pts type devpts (rw,gid=5,mode=620)
tmpfs on /dev/shm type tmpfs (rw)
none on /proc/sys/fs/binfmt_misc type binfmt_misc (rw)
sunrpc on /var/lib/nfs/rpc_pipefs type rpc_pipefs (rw)
10.4.0.1:/export/scratch on /scratch type nfs (rw,addr=10.4.0.1)
10.4.0.1:/home on /home type nfs (rw,addr=10.4.0.1)
nfsd on /proc/fs/nfsd type nfsd (rw)
/dev/mapper/360080e50002ff4f00000026d52013098p1 on /mnt/soaked-mdt1 type lustre (rw,user_xattr)
/dev/mapper/360080e50002ff4f00000026952013088p1 on /mnt/soaked-mdt0 type lustre (rw,user_xattr)
[root@lola-9 ~]# date
Mon May 2 01:09:20 PDT 2016
[root@lola-9 ~]# lctl debug_kernel /tmp/lustre-log.2016-05-02-0109
Debug log: 119979 lines, 119979 kept, 0 dropped, 0 bad.
--> attached file lustre-log.2016-05-02-0109 |
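For reference, the recovery that worked in the transcripts above can be condensed into a minimal shell sketch. The device-mapper path and mount point are the lola-9 values shown above; the preliminary check for a stale MGC via lctl dl is an assumption added for illustration, not part of the original procedure:

#!/bin/bash
# Sketch: retry an MGT/MDT mount on the failover node after a hard failover.
# Paths are the lola-9 values from this ticket and are assumptions, not a recipe.
MDT0_DEV=/dev/mapper/360080e50002ff4f00000026952013088p1
MDT0_MNT=/mnt/soaked-mdt0

# A leftover MGC from an earlier mount shows up in the obd device list
# (column 3 of 'lctl dl' is the obd type).
if lctl dl 2>/dev/null | awk '$3 == "mgc"' | grep -q .; then
    echo "stale MGC device present - unmounting targets and reloading modules"
    umount -a -t lustre     # drop any Lustre targets still mounted on this node
    lustre_rmmod            # remove the Lustre modules, destroying the old MGC
fi

mount -t lustre -o rw,user_xattr "$MDT0_DEV" "$MDT0_MNT"
echo "mount returned $?"
# Capture the kernel debug log for the ticket, as done in the transcripts.
lctl debug_kernel "/tmp/lustre-log.$(date +%Y-%m-%d-%H%M)"

This simply automates the manual workaround shown above (umount, lustre_rmmod, remount); it does not address the underlying -17 (EEXIST) attach failure.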
| Comments |
| Comment by Frank Heckes (Inactive) [ 02/May/16 ] |
|
This seems to affect the MGT only. Failover of the other MDSes (MDTs) runs smoothly. |
| Comment by Di Wang [ 02/May/16 ] |
|
In lustre-log.2016-05-02-0105, restarting the new MGC on lola-9 fails because an MGC with the same name already exists. But the original MGC should be MGC192.168.1.108@o2ib10, because the original MGS is 108. Frank, could you please list the devices on lola-9 before doing the failover on lola-8? Thanks. (Or is it a failover MGC?)

00000020:00000080:16.0:1462176165.502813:0:219876:0:(obd_config.c:362:class_attach()) attach type mgc name: MGC192.168.1.109@o2ib10 uuid: 523ef026-efaa-85c3-2ecd-2c5b9b575258
00000020:00020000:16.0:1462176165.502820:0:219876:0:(genops.c:334:class_newdev()) Device MGC192.168.1.109@o2ib10 already exists at 2, won't add
00000020:00020000:16.0:1462176165.514006:0:219876:0:(obd_config.c:370:class_attach()) Cannot create device MGC192.168.1.109@o2ib10 of type mgc : -17 |
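The device listing Di Wang asks for could be collected on lola-9 before the failover with something like the following sketch (the node and output file name are assumptions based on the transcripts above):

# On lola-9, before initiating the hard failover of lola-8:
# save the full obd device list and print any MGC entries
# ('lctl dl' columns: index, status, type, name, uuid, refcount).
lctl dl | tee /tmp/lola-9.device-list.before-failover | awk '$3 == "mgc"'

If an MGC192.168.1.109@o2ib10 entry is already present before the failover, that would explain the -17 (EEXIST) reported by class_attach() in the log excerpt above.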
| Comment by Joseph Gmitter (Inactive) [ 02/May/16 ] |
|
Hi Mike, We are moving patch http://review.whamcloud.com/#/c/13726/ to this ticket since it was added to Thanks. |
| Comment by Andreas Dilger [ 02/May/16 ] |
|
Moved patch here from the original ticket:

Mike Pershin (mike.pershin@intel.com) uploaded a new patch: http://review.whamcloud.com/13726 |
| Comment by Frank Heckes (Inactive) [ 20/Jul/16 ] |
|
The error didn't occur during soak testing of build https://build.hpdd.intel.com/job/lustre-master/3406 (see https://wiki.hpdd.intel.com/display/Releases/Soak+Testing+on+Lola#SoakTestingonLola-20160713) in a test session that is still ongoing and has already lasted 7 days. |
| Comment by Frank Heckes (Inactive) [ 21/Jul/16 ] |
|
During the night the error (or at least the same symptom) happened again with the build mentioned in the previous comment (lustre-master #3406).
The messages in the debug log files look different from those of the previous event, so this is possibly a new bug (see the comparison sketch below). Attached files:
|
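To compare the two events, the attach-path errors can be pulled out of the attached debug logs with a simple grep; the file names below are placeholders for the actual attachments:

# Hypothetical file names standing in for the attached debug logs of the two events.
OLD_LOG=/tmp/lustre-log.2016-05-02-0105
NEW_LOG=/tmp/lustre-log.2016-07-21
# Show the MGC attach path in both logs to check whether the failure mode matches
# (the earlier -17 was reported by class_newdev()/class_attach()).
grep -nE 'class_newdev|class_attach|lustre_fill_super' "$OLD_LOG" "$NEW_LOG"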
| Comment by Gerrit Updater [ 25/Oct/16 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/13726/ |
| Comment by Peter Jones [ 25/Oct/16 ] |
|
Landed for 2.9 |