Lustre / LU-8510

ASSERTION( dt->do_ops->do_invalidate ) failed

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Blocker
    • Affects Version: Lustre 2.9.0
    • Fix Version: Lustre 2.9.0
    • Environment: CentOS Linux 7/x86_64
    • Severity: 3

    Description

      The following call stack occurred during autotesting on Maloo for http://review.whamcloud.com/#/c/20546/. My new test stands up 3 MDTs with non-consecutive indices and a couple of OSTs. The method I am using to start the "custom" filesystem seems to be consistent with how other tests start their "custom" filesystems.

      Link to the Maloo test session results is https://testing.hpdd.intel.com/test_sessions/4599d8d8-6108-11e6-906c-5254006e85c2.

      The LBUG is preventing the filesystem from coming up. Any suggestions?

       LustreError: 21374:0:(dt_object.h:2633:dt_invalidate()) ASSERTION( dt->do_ops->do_invalidate ) failed:
       LustreError: 21374:0:(dt_object.h:2633:dt_invalidate()) LBUG
       Pid: 21374, comm: mdt00_002 
      
       Call Trace:
        [<ffffffffa05e67d3>] libcfs_debug_dumpstack+0x53/0x80 [libcfs]
        [<ffffffffa05e6d75>] lbug_with_loc+0x45/0xc0 [libcfs]
        [<ffffffffa0ea8fcf>] lod_object_unlock+0x39f/0x440 [lod]
        [<ffffffffa0f11e1b>] mdd_object_unlock+0x3b/0xd0 [mdd]
        [<ffffffffa0ddbb62>] mdt_unlock_slaves+0x1a2/0x3c0 [mdt]
        [<ffffffffa0de3c72>] mdt_md_create+0xb52/0xba0 [mdt]
        [<ffffffffa0de3e2b>] mdt_reint_create+0x16b/0x350 [mdt]
        [<ffffffffa0de5330>] mdt_reint_rec+0x80/0x210 [mdt]
        [<ffffffffa0dc7d62>] mdt_reint_internal+0x5b2/0x9b0 [mdt]
        [<ffffffffa0dd3077>] mdt_reint+0x67/0x140 [mdt]
        [<ffffffffa0a69aa5>] tgt_request_handle+0x915/0x1320 [ptlrpc]
        [<ffffffffa0a15c5b>] ptlrpc_server_handle_request+0x21b/0xa90 [ptlrpc]
        [<ffffffffa0a13818>] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc]
        [<ffffffffa05f1957>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
        [<ffffffffa0a19d10>] ptlrpc_main+0xaa0/0x1de0 [ptlrpc]
        [<ffffffffa0a19270>] ? ptlrpc_main+0x0/0x1de0 [ptlrpc]
        [<ffffffff810a5aef>] kthread+0xcf/0xe0
        [<ffffffff810a5a20>] ? kthread+0x0/0xe0
        [<ffffffff816469d8>] ret_from_fork+0x58/0x90
        [<ffffffff810a5a20>] ? kthread+0x0/0xe0
      
       Kernel panic - not syncing: LBUG
      

      Attachments

        Activity

          pjones Peter Jones added a comment -

          Landed for 2.9


          Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/22017/
          Subject: LU-8510 dne: set osd_obj_ea_ops::dt_invalidate
          Project: fs/lustre-release
          Branch: master
          Current Patch Set:
          Commit: 13364590a8c9ef64320f62b9937c01aaa6b6fa85


          heckes Frank Heckes (Inactive) added a comment -

          The soak test was executed with el6.7 build (https://build.hpdd.intel.com/job/lustre-master/3431/ tag 2.8.57)
          heckes Frank Heckes (Inactive) added a comment - - edited

          The same error also happened during soak testing of '20160902' (see https://wiki.hpdd.intel.com/pages/viewpage.action?title=Soak+Testing+on+Lola&spaceKey=Releases#SoakTestingonLola-20160902)
          Test cluster configuration:
          4 MDS with 1 MDT / MDS, backend FS formatted with ldiskfs, in active-active HA configuration (affected node pair lola-[8,9])
          6 OSS with 4 OST / OSS, backend FS formatted with zfs, in active-active HA configuration

          The error message is the same apart from addresses (see attached file vmcore-dmesg.txt)

          Sequence of events

          • 2016-09-05 15:48:29 Node lola-8 crashed before the (injected-fault) failover of lola-9's resources (mdt11) to lola-8
          • 2016-09-05 15:55:04 lola-8 became available before the failover took place
          • 2016-09-05 16:25:38,987:fsmgmt.fsmgmt:INFO triggering fault mds_failover (lola-9)
          • 2016-09-05 16:35:39,398: mdt-1 successfully mounted on lola-8, but stalled in recovery due to missing MDT/MGT of lola-8
          • The MDT/MGT can't be mounted on the primary node (active-active HA configuration) anymore. The error message reads:
            [root@lola-8 ~]# date ; mount -t lustre -o rw,user_xattr /dev/disk/by-id/dm-name-360080e50002ff4f00000026952013088p1 /mnt/soaked-mdt0 ; date
            Tue Sep  6 02:04:13 PDT 2016
            mount.lustre: mount /dev/mapper/360080e50002ff4f00000026952013088p1 at /mnt/soaked-mdt0 failed: Input/output error
            Is the MGS running?
            Tue Sep  6 02:05:14 PDT 2016
            [root@lola-8 ~]# lctl debug_kernel /tmp/lustre-log-20160906-020514-mgs-mount-fails
            

            (Double-checked HW; IB and disk resources are operational and sane)
            After a manual umount of mdt-1 and a reboot of node lola-8, mdt-0 and mdt-1 could be mounted, recovery completed within 2 minutes, and the
            resource could be switched back to the primary node lola-9 again.
            This symptom is possibly a different bug that happens only by chance due to the node crash.

          Attached files: messages, console log, and vmcore-dmesg.txt of the affected node lola-8, plus a debug log (mask -1) containing debug information for the time interval while executing the mount command above.
          A crash dump file exists and has been stored to lhn.hpdd.intel.com:/scratch/crashdumps/lu-8510/lola-8/127.0.0.1-2016-09-05-15:48:29.


          Bobi Jam (bobijam@hotmail.com) uploaded a new patch: http://review.whamcloud.com/22017
          Subject: LU-8510 dne: set osd_obj_ea_ops::dt_invalidate
          Project: fs/lustre-release
          Branch: master
          Current Patch Set: 1
          Commit: a9fae446db68cb1c34f2db949c875f30d5e93980

          pjones Peter Jones added a comment -

          Bobijam

          Could you please assist with this issue?

          Thanks

          Peter


          People

            Assignee: bobijam Zhenyu Xu
            Reporter: dinatale2 Giuseppe Di Natale (Inactive)
            Votes: 0
            Watchers: 5
