[LU-8510] ASSERTION( dt->do_ops->do_invalidate ) failed Created: 17/Aug/16  Updated: 19/Mar/19  Resolved: 08/Sep/16

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.9.0
Fix Version/s: Lustre 2.9.0

Type: Bug Priority: Blocker
Reporter: Giuseppe Di Natale (Inactive) Assignee: Zhenyu Xu
Resolution: Fixed Votes: 0
Labels: soak
Environment:

CentOS Linux 7/x86_64


Attachments: console-lola-8.log.bz2, lustre-log-20160906-020514-mgs-mount-fails.bz2, messages-lola-8.log.bz2, vmcore-dmesg.txt.bz2
Issue Links:
Related
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

The following call stack appeared during autotesting on Maloo for http://review.whamcloud.com/#/c/20546/. My new test stands up 3 MDTs with non-consecutive indices and a couple of OSTs. The method I am using to start the "custom" filesystem seems to be consistent with how other tests start their "custom" filesystems.

Link to the Maloo test session results is https://testing.hpdd.intel.com/test_sessions/4599d8d8-6108-11e6-906c-5254006e85c2.

The LBUG is preventing the filesystem from coming up. Any suggestions?

 LustreError: 21374:0:(dt_object.h:2633:dt_invalidate()) ASSERTION( dt->do_ops->do_invalidate ) failed:
 LustreError: 21374:0:(dt_object.h:2633:dt_invalidate()) LBUG
 Pid: 21374, comm: mdt00_002 

 Call Trace:
  [<ffffffffa05e67d3>] libcfs_debug_dumpstack+0x53/0x80 [libcfs]
  [<ffffffffa05e6d75>] lbug_with_loc+0x45/0xc0 [libcfs]
  [<ffffffffa0ea8fcf>] lod_object_unlock+0x39f/0x440 [lod]
  [<ffffffffa0f11e1b>] mdd_object_unlock+0x3b/0xd0 [mdd]
  [<ffffffffa0ddbb62>] mdt_unlock_slaves+0x1a2/0x3c0 [mdt]
  [<ffffffffa0de3c72>] mdt_md_create+0xb52/0xba0 [mdt]
  [<ffffffffa0de3e2b>] mdt_reint_create+0x16b/0x350 [mdt]
  [<ffffffffa0de5330>] mdt_reint_rec+0x80/0x210 [mdt]
  [<ffffffffa0dc7d62>] mdt_reint_internal+0x5b2/0x9b0 [mdt]
  [<ffffffffa0dd3077>] mdt_reint+0x67/0x140 [mdt]
  [<ffffffffa0a69aa5>] tgt_request_handle+0x915/0x1320 [ptlrpc]
  [<ffffffffa0a15c5b>] ptlrpc_server_handle_request+0x21b/0xa90 [ptlrpc]
  [<ffffffffa0a13818>] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc]
  [<ffffffffa05f1957>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
  [<ffffffffa0a19d10>] ptlrpc_main+0xaa0/0x1de0 [ptlrpc]
  [<ffffffffa0a19270>] ? ptlrpc_main+0x0/0x1de0 [ptlrpc]
  [<ffffffff810a5aef>] kthread+0xcf/0xe0
  [<ffffffff810a5a20>] ? kthread+0x0/0xe0
  [<ffffffff816469d8>] ret_from_fork+0x58/0x90
  [<ffffffff810a5a20>] ? kthread+0x0/0xe0

 Kernel panic - not syncing: LBUG
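
For reference, dt_invalidate() is a thin inline wrapper in lustre/include/dt_object.h that asserts the backing OSD actually implements the do_invalidate method before dispatching to it. A paraphrased sketch of the pattern (not the exact source):

  /* Paraphrased from lustre/include/dt_object.h. LASSERT() triggers an
   * LBUG and panics the node when the expression is false -- this is the
   * assertion seen in the trace above. */
  static inline void dt_invalidate(const struct lu_env *env,
                                   struct dt_object *dt)
  {
          LASSERT(dt->do_ops->do_invalidate);  /* the failing assertion */
          dt->do_ops->do_invalidate(env, dt);
  }

So the crash indicates that lod_object_unlock() reached an object whose dt_object_operations table leaves the do_invalidate slot NULL.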


 Comments   
Comment by Peter Jones [ 18/Aug/16 ]

Bobijam

Could you please assist with this issue?

Thanks

Peter

Comment by Gerrit Updater [ 19/Aug/16 ]

Bobi Jam (bobijam@hotmail.com) uploaded a new patch: http://review.whamcloud.com/22017
Subject: LU-8510 dne: set osd_obj_ea_ops::dt_invalidate
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: a9fae446db68cb1c34f2db949c875f30d5e93980
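
The patch subject describes the fix: fill in the invalidate slot of the ldiskfs OSD's object-operations table so the assertion in dt_invalidate() no longer fires. Schematically (the handler name osd_invalidate is an assumption; the exact names in the patch may differ):

  /* Schematic sketch only: register an invalidate handler in the
   * osd_obj_ea_ops table named in the patch subject. osd_invalidate()
   * is an assumed handler name, not necessarily what the patch uses. */
  static const struct dt_object_operations osd_obj_ea_ops = {
          /* ... existing .do_* methods ... */
          .do_invalidate = osd_invalidate,
  };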

Comment by Frank Heckes (Inactive) [ 06/Sep/16 ]

The same error also happened during soak testing of '20160902' (see https://wiki.hpdd.intel.com/pages/viewpage.action?title=Soak+Testing+on+Lola&spaceKey=Releases#SoakTestingonLola-20160902)
Test cluster configuration:
4 MDS with 1 MDT / MDS, backend FS formatted with ldiskfs, in active-active HA configuration (affected node pair: lola-[8,9])
6 OSS with 4 OST / OSS, backend FS formatted with zfs, in active-active HA configuration

The error message is the same apart from addresses (see attached file vmcore-dmesg.txt).

Sequence of events

  • 2016-09-05 15:48:29 Node lola-8 crashed before the (injected-fault) failover of lola-9's resources (mdt11) to lola-8.
  • 2016-09-05 15:55:04 lola-8 became available before the failover took place.
  • 2016-09-05 16:25:38,987:fsmgmt.fsmgmt:INFO triggering fault mds_failover (lola-9)
  • 2016-09-05 16:35:39,398: mdt-1 successfully mounted on lola-8, but stalled in recovery due to the missing MDT/MGT of lola-8.
  • The MDT/MGT can't be mounted on the primary node (active-active HA configuration) anymore. The error message reads:
    [root@lola-8 ~]# date ; mount -t lustre -o rw,user_xattr /dev/disk/by-id/dm-name-360080e50002ff4f00000026952013088p1 /mnt/soaked-mdt0 ; date
    Tue Sep  6 02:04:13 PDT 2016
    mount.lustre: mount /dev/mapper/360080e50002ff4f00000026952013088p1 at /mnt/soaked-mdt0 failed: Input/output error
    Is the MGS running?
    Tue Sep  6 02:05:14 PDT 2016
    [root@lola-8 ~]# lctl debug_kernel /tmp/lustre-log-20160906-020514-mgs-mount-fails
    

    (Double-checked HW; IB and disk resources are operational and sane.)
    After a manual umount of mdt1 and a reboot of node lola-8, mdt-0 and mdt-1 could be mounted; recovery completed within 2 minutes and the
    resource could be switched back to the primary node lola-9 again.
    This symptom is possibly a different bug that is only triggered by chance due to the node crash.

Attached files: messages, console, and vmcore-dmesg.txt of the affected node lola-8, plus a debug log (mask -1) containing debug information for the time interval during which the mount command above was executed.
A crash dump file exists and has been stored at lhn.hpdd.intel.com:/scratch/crashdumps/lu-8510/lola-8/127.0.0.1-2016-09-05-15:48:29.

Comment by Frank Heckes (Inactive) [ 06/Sep/16 ]

The soak test was executed with the el6.7 build (https://build.hpdd.intel.com/job/lustre-master/3431/, tag 2.8.57).

Comment by Gerrit Updater [ 08/Sep/16 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/22017/
Subject: LU-8510 dne: set osd_obj_ea_ops::dt_invalidate
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 13364590a8c9ef64320f62b9937c01aaa6b6fa85

Comment by Peter Jones [ 08/Sep/16 ]

Landed for 2.9
