[LU-7737] osd_handler.c:2777:osd_object_destroy()) ASSERTION( !lu_object_is_dying(dt->do_lu.lo_header) Created: 04/Feb/16  Updated: 25/Sep/20  Resolved: 11/Feb/16

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.8.0
Fix Version/s: Lustre 2.8.0

Type: Bug Priority: Blocker
Reporter: Frank Heckes (Inactive) Assignee: Di Wang
Resolution: Fixed Votes: 0
Labels: dne2, soak
Environment:

lola
build: https://build.hpdd.intel.com/job/lustre-reviews/37226/


Attachments: File console-lola-8.log.bz2     File messages-lola-8.log.bz2     File soak.log.bz2     File vmcore-dmesg.txt.bz2    
Issue Links:
Gantt End to Start
has to be done after LU-7680 mdd_migrate_update_name() doesn't put... Resolved
Related
is related to LU-4448 Test failure timeout on sanity-quota ... Open
is related to LU-7680 mdd_migrate_update_name() doesn't put... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

This error might be a duplicate of LU-6699. In any case, as the bug occurred in conjunction with llog errors, it might be related to the latest changes in DNE (change 16838).

The error happens during soak testing of build '20160203' (see: https://wiki.hpdd.intel.com/display/Releases/Soak+Testing+on+Lola#SoakTestingonLola-20160203). DNE is enabled. MDTs had been formatted with ldiskfs, OSTs with zfs. The MDSes are configured in an active-active HA failover configuration.

The configuration for the HA pair in question reads as:

  • lola-8 - mdt-0, 1 (primary resources)
  • lola-9 - mdt-2, 3 (primary resources)

During the umount (failback of resources) of mdt-3 on lola-8, the node crashed with an LBUG:

<0>LustreError: 5861:0:(osd_handler.c:2777:osd_object_destroy()) ASSERTION( !lu_object_is_dying(dt->do_lu.lo_header) ) failed: 
<0>LustreError: 5861:0:(osd_handler.c:2777:osd_object_destroy()) LBUG
<4>Pid: 5861, comm: umount
<4>
<4>Call Trace:
<4> [<ffffffffa0772875>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
<4> [<ffffffffa0772e77>] lbug_with_loc+0x47/0xb0 [libcfs]
<4> [<ffffffffa1060fbb>] osd_object_destroy+0x52b/0x5b0 [osd_ldiskfs]
<4> [<ffffffffa105e42d>] ? osd_object_ref_del+0x22d/0x4e0 [osd_ldiskfs]
<4> [<ffffffffa0851dda>] llog_osd_destroy+0x1ba/0x9e0 [obdclass]
<4> [<ffffffffa08417a6>] llog_destroy+0x2b6/0x470 [obdclass]
<4> [<ffffffffa08438cb>] llog_cat_close+0x17b/0x220 [obdclass]
<4> [<ffffffffa12b04e7>] lod_sub_fini_llog+0xb7/0x380 [lod]
<4> [<ffffffff8109ec20>] ? autoremove_wake_function+0x0/0x40
<4> [<ffffffffa12b35c4>] lod_process_config+0xbc4/0x1830 [lod]
<4> [<ffffffffa111361f>] ? lfsck_stop+0x15f/0x4c0 [lfsck]
<4> [<ffffffff8117523c>] ? __kmalloc+0x21c/0x230
<4> [<ffffffff8109ec20>] ? autoremove_wake_function+0x0/0x40
<4> [<ffffffffa1331474>] mdd_process_config+0x114/0x5d0 [mdd]
<4> [<ffffffffa11db55e>] mdt_device_fini+0x3ee/0xf40 [mdt]
<4> [<ffffffffa0860406>] ? class_disconnect_exports+0x116/0x2f0 [obdclass]
<4> [<ffffffffa087a552>] class_cleanup+0x572/0xd20 [obdclass]
<4> [<ffffffffa085b0c6>] ? class_name2dev+0x56/0xe0 [obdclass]
<4> [<ffffffffa087cbd6>] class_process_config+0x1ed6/0x2830 [obdclass]
<4> [<ffffffffa077dd01>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
<4> [<ffffffff8117523c>] ? __kmalloc+0x21c/0x230
<4> [<ffffffffa087d9ef>] class_manual_cleanup+0x4bf/0x8e0 [obdclass]
<4> [<ffffffffa085b0c6>] ? class_name2dev+0x56/0xe0 [obdclass]
<4> [<ffffffffa08b610c>] server_put_super+0xa0c/0xed0 [obdclass]
<4> [<ffffffff811ac776>] ? invalidate_inodes+0xf6/0x190
<4> [<ffffffff81190b7b>] generic_shutdown_super+0x5b/0xe0
<4> [<ffffffff81190c66>] kill_anon_super+0x16/0x60
<4> [<ffffffffa08808a6>] lustre_kill_super+0x36/0x60 [obdclass]
<4> [<ffffffff81191407>] deactivate_super+0x57/0x80
<4> [<ffffffff811b10df>] mntput_no_expire+0xbf/0x110
<4> [<ffffffff811b1c2b>] sys_umount+0x7b/0x3a0
<4> [<ffffffff8100b0d2>] system_call_fastpath+0x16/0x1b
<4>
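
The assertion that fires is the sanity check at the top of osd_object_destroy(): the OSD refuses to destroy an object whose lu_object header is already marked dying, i.e. an object that has already been destroyed or is concurrently being torn down. A minimal sketch of that check, with the function body simplified for illustration (only the LASSERT condition is taken from the trace above, the rest is hypothetical):

static int osd_object_destroy(const struct lu_env *env,
                              struct dt_object *dt, struct thandle *th)
{
        /* An object being destroyed must still be alive; if the
         * header is already marked dying, this is a second destroy
         * of the same object, and the LASSERT triggers the LBUG. */
        LASSERT(!lu_object_is_dying(dt->do_lu.lo_header));

        /* ... actual inode/OI/quota cleanup elided ... */
        return 0;
}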

Also, immediately the following error was reported on lola-8:

Feb  3 10:51:27 lola-8 kernel: LustreError: 5733:0:(llog.c:588:llog_process_thread()) soaked-MDT0006-osp-MDT0003 retry remote llog process
Feb  3 10:51:27 lola-8 kernel: LustreError: 5733:0:(lod_dev.c:419:lod_sub_recovery_thread()) soaked-MDT0006-osp-MDT0003 getting update log failed: rc = -11
Feb  3 10:51:27 lola-8 kernel: LustreError: 5727:0:(llog.c:595:llog_process_thread()) Local llog found corrupted
Feb  3 10:51:27 lola-8 kernel: LustreError: 5730:0:(osp_object.c:588:osp_attr_get()) soaked-MDT0002-osp-MDT0003:osp_attr_get update error [0x200000009:0x2:0x0]: rc = -5
Feb  3 10:51:27 lola-8 kernel: LustreError: 5730:0:(lod_sub_object.c:959:lod_sub_prep_llog()) soaked-MDT0003-mdtlov: can't get id from catalogs: rc = -5
Feb  3 10:51:28 lola-8 kernel: LustreError: 5727:0:(llog.c:595:llog_process_thread()) Local llog found corrupted
Feb  3 10:51:28 lola-8 kernel: LustreError: 5727:0:(llog.c:595:llog_process_thread()) Skipped 1 previous similar message
Feb  3 10:51:29 lola-8 kernel: LustreError: 5727:0:(llog.c:595:llog_process_thread()) Local llog found corrupted
Feb  3 10:51:29 lola-8 kernel: LustreError: 5727:0:(llog.c:595:llog_process_thread()) Skipped 1 previous similar message
Feb  3 10:51:31 lola-8 kernel: Lustre: soaked-MDT0003: Not available for connect from 0@lo (stopping)
Feb  3 10:51:32 lola-8 kernel: LustreError: 5727:0:(llog.c:595:llog_process_thread()) Local llog found corrupted
Feb  3 10:51:32 lola-8 kernel: LustreError: 5727:0:(llog.c:595:llog_process_thread()) Skipped 2 previous similar messages
Feb  3 10:51:35 lola-8 kernel: LustreError: 5861:0:(osd_handler.c:3291:osd_object_ref_del()) soaked-MDT0003-osd: nlink == 0 on [0x2c00042a3:0x15ec5:0x0], maybe an upgraded file? (LU-3915)

The sequence of events was:

  • 2016-02-03 10:42:39 - failover started for lola-9
  • 2016-02-03 10:42:39 - lola-9 online again
  • 2016-02-03 10:51:26 - Failback of resources (umount of mdt-3)
  • 2016-02-03 10:51:35 - lola-8 hit LBUG

Attached files:
lola-8 messages, console, vmcore-dmesg.txt
soak.log (for injected errors)
Note:
A crash dump has been created. I'll add the info about the storage location as soon as the ticket is created.

Info required for matching: sanity-quota 7c



 Comments   
Comment by Frank Heckes (Inactive) [ 04/Feb/16 ]

crash dump can be found at: lhn.lola.hpdd.intel.com:/scratch/crashdumps/lu-7737/lola-8/127.0.0.1-2016-02-03-10:51:51/

Comment by Joseph Gmitter (Inactive) [ 04/Feb/16 ]

Hi Di,
Can you comment on this one?
Thanks.
Joe

Comment by Gerrit Updater [ 04/Feb/16 ]

wangdi (di.wang@intel.com) uploaded a new patch: http://review.whamcloud.com/18308
Subject: LU-7737 lod: not return -EIO during process updates log
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 56050a1fd3e41b769a9b14ac8adfa128939215a3

Comment by Di Wang [ 04/Feb/16 ]

According to the log, it seems the log records are incorrectly deleted during umount when the target is doing recovery.
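
In other words, the soak log above shows the recovery thread failing to fetch the remote update llog with rc = -11 (-EAGAIN), and that transient failure being escalated into an error path that cancels records still in use, so llog_cat_close() later destroys an object that is already dying. A hedged sketch of the idea behind patch 18308 ("lod: not return -EIO during process updates log"); the function name below is hypothetical and this is not the actual patch:

static int lod_check_update_llog_rc(int rc)
{
        /* Before the fix (sketch): every processing failure was
         * escalated to -EIO, which marked the update llog corrupted
         * and led to its records being cancelled mid-recovery. */

        /* After the fix (sketch): a transient -EAGAIN from a remote
         * MDT that is still recovering is passed through, so the
         * caller retries the remote llog process instead. */
        if (rc != 0 && rc != -EAGAIN)
                rc = -EIO;
        return rc;
}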

Comment by James Nunez (Inactive) [ 05/Feb/16 ]

We just started seeing this on master: replay-single test 102c is failing on unmount of MDS1 in review-dne-part-2.

Here are the logs:
2016-02-04 08:17:35 - https://testing.hpdd.intel.com/test_sets/c7aac802-cb65-11e5-a59a-5254006e85c2
2016-02-04 09:26:36 - https://testing.hpdd.intel.com/test_sets/1bfe48b2-cb74-11e5-be8d-5254006e85c2
2016-02-06 11:49:35 - https://testing.hpdd.intel.com/test_sets/8cca149e-cd0f-11e5-8b0e-5254006e85c2
2016-02-07 21:58:41 - https://testing.hpdd.intel.com/test_sets/107745d8-ce36-11e5-af15-5254006e85c2
2016-02-07 21:59:13 - https://testing.hpdd.intel.com/test_sets/1d39d220-ce33-11e5-876a-5254006e85c2
2016-02-07 22:23:26 - https://testing.hpdd.intel.com/test_sets/3b006fae-ce37-11e5-90aa-5254006e85c2

And in sanity-quota test_7c in review-dne-part-2. Logs at:
2016-02-04 11:52:10 - https://testing.hpdd.intel.com/test_sets/e1d55ec6-cb70-11e5-a59a-5254006e85c2
2016-02-04 14:52:30 - https://testing.hpdd.intel.com/test_sets/a5588724-cb8a-11e5-b49e-5254006e85c2
2016-02-04 19:18:03 - https://testing.hpdd.intel.com/test_sets/dc1de50a-cbae-11e5-be8d-5254006e85c2
2016-02-04 21:16:36 - https://testing.hpdd.intel.com/test_sets/cc8b486c-cbbd-11e5-b2cb-5254006e85c2
2016-02-06 03:40:20 - https://testing.hpdd.intel.com/test_sets/6d50e5cc-cccc-11e5-8b0e-5254006e85c2
2016-02-06 11:22:59 - https://testing.hpdd.intel.com/test_sets/cf38a7e2-ccff-11e5-963e-5254006e85c2
2016-02-06 11:25:21 - https://testing.hpdd.intel.com/test_sets/8b554926-cd00-11e5-963e-5254006e85c2
2016-02-06 20:01:18 - https://testing.hpdd.intel.com/test_sets/6f228e60-cd48-11e5-b1fa-5254006e85c2
2016-02-07 04:05:05 - https://testing.hpdd.intel.com/test_sets/95528d4e-cd87-11e5-8c5d-5254006e85c2
2016-02-07 10:14:30 - https://testing.hpdd.intel.com/test_sets/d1108a5c-cdbe-11e5-9bc0-5254006e85c2

Comment by Bob Glossman (Inactive) [ 06/Feb/16 ]

another on master:
2016-02-06 02:57:54 - https://testing.hpdd.intel.com/test_sets/edc883e8-ccba-11e5-b80c-5254006e85c2

Comment by Richard Henwood (Inactive) [ 08/Feb/16 ]

One more failure on master (from the canary patch) over the weekend:

https://testing.hpdd.intel.com/test_sets/66f80856-cddd-11e5-9bc0-5254006e85c2

Comment by Richard Henwood (Inactive) [ 08/Feb/16 ]

And another over the weekend:

https://testing.hpdd.intel.com/test_sets/8c4963c2-cc3c-11e5-b2cb-5254006e85c2

Comment by Bob Glossman (Inactive) [ 08/Feb/16 ]

another on master:
https://testing.hpdd.intel.com/test_sets/6f228e60-cd48-11e5-b1fa-5254006e85c2

Comment by Alex Zhuravlev [ 09/Feb/16 ]

I think we should also avoid the case where we try to destroy the llog twice: once in llog_process() (due to cancels on an error) and again in llog_cat_close(). The patch is coming.
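
A minimal sketch of such a guard, assuming a flag on struct llog_handle (the field name lgh_destroyed here is illustrative; see http://review.whamcloud.com/18362 for the actual patch): remember that the llog was already destroyed on the llog_process() error path, so the later destroy from llog_cat_close() becomes a no-op instead of a second osd_object_destroy() on a dying object.

static int llog_destroy_once(const struct lu_env *env,
                             struct llog_handle *handle)
{
        /* already destroyed by the error path in llog_process()? */
        if (handle->lgh_destroyed)
                return 0;

        handle->lgh_destroyed = true;
        return llog_destroy(env, handle);
}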

Comment by Gerrit Updater [ 09/Feb/16 ]

Alex Zhuravlev (alexey.zhuravlev@intel.com) uploaded a new patch: http://review.whamcloud.com/18362
Subject: LU-7737 llog: do not destroy llog twice
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 866689144744cae95116b69a992abbfaca806517

Comment by Bob Glossman (Inactive) [ 09/Feb/16 ]

another on master:
https://testing.hpdd.intel.com/test_sets/f9c7e7e6-cee1-11e5-b578-5254006e85c2

Comment by Gerrit Updater [ 09/Feb/16 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/18308/
Subject: LU-7737 lod: not return -EIO during process updates log
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 028e65b03dac9497256978d2266acb8c20b48a99

Comment by Peter Jones [ 10/Feb/16 ]

Landed for 2.8

Comment by Peter Jones [ 10/Feb/16 ]

Oh! There is a second patch - sorry

Comment by Alex Zhuravlev [ 11/Feb/16 ]

http://review.whamcloud.com/18362 got another ticket - LU-7772

Comment by Peter Jones [ 11/Feb/16 ]

Thanks Alex, so I'll re-resolve this.
