[LU-7778] mount of MDT(==MGS) failed after MDS restart Created: 16/Feb/16  Updated: 24/Feb/16  Resolved: 24/Feb/16

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.8.0
Fix Version/s: Lustre 2.8.0, Lustre 2.9.0

Type: Bug Priority: Blocker
Reporter: Frank Heckes (Inactive) Assignee: Di Wang
Resolution: Fixed Votes: 0
Labels: soak
Environment:

lola
build: 2.8.50-6-gf9ca359 ; commit f9ca359284357d145819beb08b316e932f7a3060


Attachments: File console-lola-8.log.bz2     File lustre-log-mount-non-operational-20160216-0044-lola-8.bz2     File messages-lola-8.log.bz2    
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

Error happened during soak testing of build '20160215' (see https://wiki.hpdd.intel.com/display/Releases/Soak+Testing+on+Lola#SoakTestingonLola-20150215). DNE is enabled.
MDT had been formatted using ldiskfs, OSTs using zfs. MDS nodes are configured in active-active HA failover configuration.

Please note that build 20160215 is a vanilla build of the master branch.
This issue might be addressed by the changes included in build '20160210', as we didn't observe it there in a two-day test session.

Sequence of events:

  • 2016-02-15 16:25:21,179:fsmgmt.fsmgmt:INFO triggering fault mds_restart
  • 2016-02-15 16:31:41,282:fsmgmt.fsmgmt:INFO lola-8 is up
  • 2016-02-15 16:36:50,594:fsmgmt.fsmgmt:INFO ... soaked-MDT0001 mounted successfully on lola-8
  • 2016-02-15 16:38:20, mount of MDT0000 (== MGS) fails
    Error message reads as:
    Feb 15 16:38:20 lola-8 kernel: LustreError: 15c-8: MGC192.168.1.108@o2ib10: The configuration from log 'soaked-MDT0000' failed (-5). This may be the result of communication errors between this node and the MGS, a bad configuration, or other errors. See the syslog for more information.
    Feb 15 16:38:20 lola-8 kernel: LustreError: 4538:0:(obd_mount_server.c:1309:server_start_targets()) failed to start server soaked-MDT0000: -5
    Feb 15 16:38:20 lola-8 kernel: LustreError: 4538:0:(obd_mount_server.c:1798:server_fill_super()) Unable to start targets: -5
    Feb 15 16:38:20 lola-8 kernel: LustreError: 4538:0:(obd_mount_server.c:1512:server_put_super()) no obd soaked-MDT0000
    Feb 15 16:38:20 lola-8 kernel: Lustre: server umount soaked-MDT0000 complete
    Feb 15 16:38:20 lola-8 kernel: LustreError: 4538:0:(obd_mount.c:1426:lustre_fill_super()) Unable to mount  (-5)
    
  • I checked the HW and cluster configuration: no problems with the IB HCA, LNet is working, routers are up, and the disk device file of MDT0000 can be read and accessed.
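
To make the timing in the sequence of events easier to read, the gaps between the steps can be computed as in the minimal sketch below. The timestamps are copied from the list above with the milliseconds dropped; the event labels and the helper name `elapsed` are illustrative assumptions, not part of the soak framework.

```python
from datetime import datetime

# Timestamps copied from the sequence of events (milliseconds dropped).
# The labels are shortened paraphrases chosen for this sketch.
EVENTS = [
    ("fault mds_restart triggered", "2016-02-15 16:25:21"),
    ("lola-8 is up", "2016-02-15 16:31:41"),
    ("soaked-MDT0001 mounted on lola-8", "2016-02-15 16:36:50"),
    ("mount of MDT0000 (== MGS) fails", "2016-02-15 16:38:20"),
]


def elapsed(events):
    """Return (label, seconds since first event, seconds since previous event)."""
    fmt = "%Y-%m-%d %H:%M:%S"
    times = [datetime.strptime(stamp, fmt) for _, stamp in events]
    rows = []
    for i, (label, _) in enumerate(events):
        since_start = int((times[i] - times[0]).total_seconds())
        since_prev = int((times[i] - times[i - 1]).total_seconds()) if i else 0
        rows.append((label, since_start, since_prev))
    return rows


for label, since_start, since_prev in elapsed(EVENTS):
    print(f"{since_start:4d}s (+{since_prev:3d}s)  {label}")
```

Under these inputs the failed MDT0000 mount occurs 90 seconds after MDT0001 mounted, and 779 seconds after the fault was triggered.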

Attached are the messages, console, and manually forced debug logs of node lola-8.



 Comments   
Comment by Di Wang [ 16/Feb/16 ]
Feb 15 16:37:47 lola-8 kernel: LDISKFS-fs (dm-2): mounted filesystem with ordered data mode. quota=on. Opts:
Feb 15 16:37:48 lola-8 kernel: LustreError: 11-0: soaked-MDT0006-osp-MDT0001: operation mds_connect to node 192.168.1.111@o2ib10 failed: rc = -16
Feb 15 16:37:48 lola-8 kernel: LustreError: Skipped 3 previous similar messages
Feb 15 16:37:48 lola-8 kernel: Lustre: 4320:0:(client.c:2063:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1455583068/real 1455583068]  req@ffff8804037909c0 x1526289292853684/t0(0) o38->soaked-MDT0000-osp-MDT0001@192.168.1.109@o2ib10:24/4 lens 520/544 e 0 to 1 dl 1455583079 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Feb 15 16:37:48 lola-8 kernel: Lustre: 4320:0:(client.c:2063:ptlrpc_expire_one_request()) Skipped 2 previous similar messages
Feb 15 16:37:54 lola-8 kernel: LustreError: 137-5: soaked-MDT0003_UUID: not available for connect from 192.168.1.104@o2ib10 (no target). If you are running an HA pair check that the target is mounted on the other server.
Feb 15 16:37:54 lola-8 kernel: LustreError: Skipped 58 previous similar messages
Feb 15 16:38:03 lola-8 kernel: Lustre: soaked-MDT0001: Client d26c53bc-3d10-5c53-0c35-f189140fc2e8 (at 192.168.1.131@o2ib100) reconnecting, waiting for 14 clients in recovery for 3:53
Feb 15 16:38:03 lola-8 kernel: Lustre: Skipped 180 previous similar messages
Feb 15 16:38:20 lola-8 kernel: LustreError: 15c-8: MGC192.168.1.108@o2ib10: The configuration from log 'soaked-MDT0000' failed (-5). This may be the result of communication errors between this node and the MGS, a bad configuration, or other errors. See the syslog for more information.
Feb 15 16:38:20 lola-8 kernel: LustreError: 4538:0:(obd_mount_server.c:1309:server_start_targets()) failed to start server soaked-MDT0000: -5

It looks like MDT0 has trouble communicating with the MGS. Unfortunately, there are no logs to indicate what happened. I guess I need to monitor the "run".
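
For reference, the return codes in the excerpt above are negated Linux errno values: -16 is EBUSY and -5 is EIO. The sketch below shows one way to pull them out of such lines and name them. The regular expression is an ad-hoc assumption that covers only the three spellings seen in this ticket ("rc = -N", "(-N)" and ": -N"); it is not a Lustre tool or a general log parser.

```python
import errno
import re

# Ad-hoc pattern (an assumption of this sketch): a negative number that
# directly follows "rc = ", "(" or ": ", as in the log lines of this ticket.
RC_PATTERN = re.compile(r"(?:rc = |\(|: )-(\d+)")


def decode_return_codes(line):
    """Return a list of (return code, errno name) pairs found in a log line."""
    pairs = []
    for digits in RC_PATTERN.findall(line):
        number = int(digits)
        pairs.append((-number, errno.errorcode.get(number, "unknown")))
    return pairs


# Two lines copied from the excerpt above (timestamp and host prefix removed).
CONNECT_LINE = (
    "LustreError: 11-0: soaked-MDT0006-osp-MDT0001: operation mds_connect "
    "to node 192.168.1.111@o2ib10 failed: rc = -16"
)
START_LINE = (
    "LustreError: 4538:0:(obd_mount_server.c:1309:server_start_targets()) "
    "failed to start server soaked-MDT0000: -5"
)

for sample in (CONNECT_LINE, START_LINE):
    print(decode_return_codes(sample))
```

The errno names come from Python's standard `errno` module; they only tell which generic error class was returned, not why the configuration log could not be processed.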

Comment by Gerrit Updater [ 18/Feb/16 ]

wangdi (di.wang@intel.com) uploaded a new patch: http://review.whamcloud.com/18509
Subject: LU-7778 osd: check if the object is destroyed
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 3096d9dbeae6bafccf10104b8221b91fac05a08f

Comment by Gerrit Updater [ 24/Feb/16 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/18509/
Subject: LU-7778 osd: check if the object is destroyed
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: a19c1ea92fb8d9909ec9fb98f22a8a9e4835c572

Comment by Peter Jones [ 24/Feb/16 ]

Landed for 2.8 and 2.9

Generated at Sat Feb 10 02:11:50 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.