[LU-7737] osd_handler.c:2777:osd_object_destroy()) ASSERTION( !lu_object_is_dying(dt->do_lu.lo_header) ) Created: 04/Feb/16 Updated: 25/Sep/20 Resolved: 11/Feb/16 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.8.0 |
| Fix Version/s: | Lustre 2.8.0 |
| Type: | Bug | Priority: | Blocker |
| Reporter: | Frank Heckes (Inactive) | Assignee: | Di Wang |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | dne2, soak |
| Environment: | lola |
| Attachments: | |
| Issue Links: | |
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
This error might be a duplicate of an existing ticket. The error happens during soak testing of build '20160203' (see: https://wiki.hpdd.intel.com/display/Releases/Soak+Testing+on+Lola#SoakTestingonLola-20160203). DNE is enabled. MDTs had been formatted with ldiskfs, OSTs with zfs. The MDSes are configured in an active-active HA failover configuration. The configuration for the HA pair in question reads as:
During the umount (failback of resources) of mdt-3 on lola-8, the following assertion failed:

<0>LustreError: 5861:0:(osd_handler.c:2777:osd_object_destroy()) ASSERTION( !lu_object_is_dying(dt->do_lu.lo_header) ) failed:
<0>LustreError: 5861:0:(osd_handler.c:2777:osd_object_destroy()) LBUG
<4>Pid: 5861, comm: umount
<4>
<4>Call Trace:
<4> [<ffffffffa0772875>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
<4> [<ffffffffa0772e77>] lbug_with_loc+0x47/0xb0 [libcfs]
<4> [<ffffffffa1060fbb>] osd_object_destroy+0x52b/0x5b0 [osd_ldiskfs]
<4> [<ffffffffa105e42d>] ? osd_object_ref_del+0x22d/0x4e0 [osd_ldiskfs]
<4> [<ffffffffa0851dda>] llog_osd_destroy+0x1ba/0x9e0 [obdclass]
<4> [<ffffffffa08417a6>] llog_destroy+0x2b6/0x470 [obdclass]
<4> [<ffffffffa08438cb>] llog_cat_close+0x17b/0x220 [obdclass]
<4> [<ffffffffa12b04e7>] lod_sub_fini_llog+0xb7/0x380 [lod]
<4> [<ffffffff8109ec20>] ? autoremove_wake_function+0x0/0x40
<4> [<ffffffffa12b35c4>] lod_process_config+0xbc4/0x1830 [lod]
<4> [<ffffffffa111361f>] ? lfsck_stop+0x15f/0x4c0 [lfsck]
<4> [<ffffffff8117523c>] ? __kmalloc+0x21c/0x230
<4> [<ffffffff8109ec20>] ? autoremove_wake_function+0x0/0x40
<4> [<ffffffffa1331474>] mdd_process_config+0x114/0x5d0 [mdd]
<4> [<ffffffffa11db55e>] mdt_device_fini+0x3ee/0xf40 [mdt]
<4> [<ffffffffa0860406>] ? class_disconnect_exports+0x116/0x2f0 [obdclass]
<4> [<ffffffffa087a552>] class_cleanup+0x572/0xd20 [obdclass]
<4> [<ffffffffa085b0c6>] ? class_name2dev+0x56/0xe0 [obdclass]
<4> [<ffffffffa087cbd6>] class_process_config+0x1ed6/0x2830 [obdclass]
<4> [<ffffffffa077dd01>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
<4> [<ffffffff8117523c>] ? __kmalloc+0x21c/0x230
<4> [<ffffffffa087d9ef>] class_manual_cleanup+0x4bf/0x8e0 [obdclass]
<4> [<ffffffffa085b0c6>] ? class_name2dev+0x56/0xe0 [obdclass]
<4> [<ffffffffa08b610c>] server_put_super+0xa0c/0xed0 [obdclass]
<4> [<ffffffff811ac776>] ? invalidate_inodes+0xf6/0x190
<4> [<ffffffff81190b7b>] generic_shutdown_super+0x5b/0xe0
<4> [<ffffffff81190c66>] kill_anon_super+0x16/0x60
<4> [<ffffffffa08808a6>] lustre_kill_super+0x36/0x60 [obdclass]
<4> [<ffffffff81191407>] deactivate_super+0x57/0x80
<4> [<ffffffff811b10df>] mntput_no_expire+0xbf/0x110
<4> [<ffffffff811b1c2b>] sys_umount+0x7b/0x3a0
<4> [<ffffffff8100b0d2>] system_call_fastpath+0x16/0x1b

Also, immediately afterwards the following errors were reported on lola-8:

Feb 3 10:51:27 lola-8 kernel: LustreError: 5733:0:(llog.c:588:llog_process_thread()) soaked-MDT0006-osp-MDT0003 retry remote llog process
Feb 3 10:51:27 lola-8 kernel: LustreError: 5733:0:(lod_dev.c:419:lod_sub_recovery_thread()) soaked-MDT0006-osp-MDT0003 getting update log failed: rc = -11
Feb 3 10:51:27 lola-8 kernel: LustreError: 5727:0:(llog.c:595:llog_process_thread()) Local llog found corrupted
Feb 3 10:51:27 lola-8 kernel: LustreError: 5730:0:(osp_object.c:588:osp_attr_get()) soaked-MDT0002-osp-MDT0003:osp_attr_get update error [0x200000009:0x2:0x0]: rc = -5
Feb 3 10:51:27 lola-8 kernel: LustreError: 5730:0:(lod_sub_object.c:959:lod_sub_prep_llog()) soaked-MDT0003-mdtlov: can't get id from catalogs: rc = -5
Feb 3 10:51:28 lola-8 kernel: LustreError: 5727:0:(llog.c:595:llog_process_thread()) Local llog found corrupted
Feb 3 10:51:28 lola-8 kernel: LustreError: 5727:0:(llog.c:595:llog_process_thread()) Skipped 1 previous similar message
Feb 3 10:51:29 lola-8 kernel: LustreError: 5727:0:(llog.c:595:llog_process_thread()) Local llog found corrupted
Feb 3 10:51:29 lola-8 kernel: LustreError: 5727:0:(llog.c:595:llog_process_thread()) Skipped 1 previous similar message
Feb 3 10:51:31 lola-8 kernel: Lustre: soaked-MDT0003: Not available for connect from 0@lo (stopping)
Feb 3 10:51:32 lola-8 kernel: LustreError: 5727:0:(llog.c:595:llog_process_thread()) Local llog found corrupted
Feb 3 10:51:32 lola-8 kernel: LustreError: 5727:0:(llog.c:595:llog_process_thread()) Skipped 2 previous similar messages
Feb 3 10:51:35 lola-8 kernel: LustreError: 5861:0:(osd_handler.c:3291:osd_object_ref_del()) soaked-MDT0003-osd: nlink == 0 on [0x2c00042a3:0x15ec5:0x0], maybe an upgraded file? (LU-3915)

The sequence of events is:
Attached files: Info required for matching: sanity-quota 7c |
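For context, a minimal standalone C sketch of the invariant the failing assertion enforces, using simplified stand-in types rather than the actual Lustre structures:

```c
/* Minimal standalone sketch of the invariant behind the LBUG, with
 * simplified stand-in types -- not the actual Lustre structures. */
#include <assert.h>
#include <stdio.h>

#define LU_OBJECT_DYING 0x1            /* hypothetical flag bit */

struct lu_header { unsigned flags; };  /* stand-in for the lu object header */
struct dt_object { struct lu_header hdr; };

static int lu_object_is_dying(const struct lu_header *h)
{
        return h->flags & LU_OBJECT_DYING;
}

static void osd_object_destroy(struct dt_object *dt)
{
        /* The invariant: an object already marked as dying (on its way
         * out of the cache) must never be destroyed again.  Violating
         * it is what triggers the LBUG in this ticket. */
        assert(!lu_object_is_dying(&dt->hdr));
        printf("destroyed\n");
}

int main(void)
{
        struct dt_object o = { .hdr = { .flags = 0 } };

        osd_object_destroy(&o);        /* fine: first destroy */
        o.hdr.flags |= LU_OBJECT_DYING;
        /* osd_object_destroy(&o);        would assert, as in this ticket */
        return 0;
}
```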
| Comments |
| Comment by Frank Heckes (Inactive) [ 04/Feb/16 ] |
|
crash dump can be found at: lhn.lola.hpdd.intel.com:/scratch/crashdumps/lu-7737/lola-8/127.0.0.1-2016-02-03-10:51:51/ |
| Comment by Joseph Gmitter (Inactive) [ 04/Feb/16 ] |
|
Hi Di, |
| Comment by Gerrit Updater [ 04/Feb/16 ] |
|
wangdi (di.wang@intel.com) uploaded a new patch: http://review.whamcloud.com/18308 |
| Comment by Di Wang [ 04/Feb/16 ] |
|
According to the log, it seems the llog records are incorrectly deleted during umount while the target is still doing recovery.
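A hedged sketch of the race as described, with hypothetical names and simplified types (not Lustre code): if umount tears down llog records while the recovery thread is still replaying them, the replay side sees the log vanish mid-stream, which matches the rc = -11 and "Local llog found corrupted" messages in the description.

```c
/* Sketch of the umount-vs-recovery race, with hypothetical names and a
 * simplified stand-in type -- not actual Lustre code. */
#include <stdbool.h>
#include <stdio.h>

struct target {                        /* stand-in, not a Lustre struct */
        bool recovery_in_progress;
};

static int llog_records_destroy(struct target *t)
{
        if (t->recovery_in_progress) {
                /* Defer cleanup rather than deleting records the
                 * recovery thread may still process; -11 (-EAGAIN)
                 * plays the role of the rc = -11 seen in the logs. */
                return -11;
        }
        printf("llog records destroyed\n");
        return 0;
}

int main(void)
{
        struct target t = { .recovery_in_progress = true };

        if (llog_records_destroy(&t) < 0)
                printf("destroy deferred: recovery still running\n");
        t.recovery_in_progress = false;
        llog_records_destroy(&t);      /* safe once recovery finished */
        return 0;
}
```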
| Comment by James Nunez (Inactive) [ 05/Feb/16 ] |
|
We just started seeing this on master. replay-single test 102c is failing on unmount of MDS1 in review-dne-part-2. Here are the logs: It also appears in sanity-quota test_7c in review-dne-part-2. Logs at: |
| Comment by Bob Glossman (Inactive) [ 06/Feb/16 ] |
|
another on master: |
| Comment by Richard Henwood (Inactive) [ 08/Feb/16 ] |
|
One more failure on master (from the canary patch) over the weekend: https://testing.hpdd.intel.com/test_sets/66f80856-cddd-11e5-9bc0-5254006e85c2 |
| Comment by Richard Henwood (Inactive) [ 08/Feb/16 ] |
|
And another over the weekend: https://testing.hpdd.intel.com/test_sets/8c4963c2-cc3c-11e5-b2cb-5254006e85c2 |
| Comment by Bob Glossman (Inactive) [ 08/Feb/16 ] |
|
another on master: |
| Comment by Alex Zhuravlev [ 09/Feb/16 ] |
|
I think we should also avoid the case where we try to destroy the llog twice: in llog_process() (due to cancels on an error) and in llog_cat_close(). The patch is coming.
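A minimal standalone C sketch of the double-destroy hazard and an idempotent-destroy guard; the 'destroyed' flag and function name here are hypothetical, and the actual fix in 18362 may differ:

```c
/* Sketch of guarding against destroying the same llog twice, using a
 * hypothetical 'destroyed' flag and a stand-in type -- not the real fix. */
#include <stdbool.h>
#include <stdio.h>

struct llog_handle {                   /* stand-in, not the Lustre type */
        bool destroyed;
};

static int llog_destroy_once(struct llog_handle *h)
{
        if (h->destroyed)              /* second caller becomes a no-op */
                return 0;
        h->destroyed = true;
        printf("llog destroyed\n");
        return 0;
}

int main(void)
{
        struct llog_handle h = { .destroyed = false };

        llog_destroy_once(&h);         /* e.g. from llog_process() on error */
        llog_destroy_once(&h);         /* e.g. again from llog_cat_close() */
        return 0;
}
```

Making destroy idempotent means whichever of the two paths runs second becomes a no-op, instead of reaching osd_object_destroy() for an object that is already dying.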
| Comment by Gerrit Updater [ 09/Feb/16 ] |
|
Alex Zhuravlev (alexey.zhuravlev@intel.com) uploaded a new patch: http://review.whamcloud.com/18362 |
| Comment by Bob Glossman (Inactive) [ 09/Feb/16 ] |
|
another on master: |
| Comment by Gerrit Updater [ 09/Feb/16 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/18308/ |
| Comment by Peter Jones [ 10/Feb/16 ] |
|
Landed for 2.8 |
| Comment by Peter Jones [ 10/Feb/16 ] |
|
Oh! There is a second patch - sorry |
| Comment by Alex Zhuravlev [ 11/Feb/16 ] |
|
http://review.whamcloud.com/18362 got another ticket - |
| Comment by Peter Jones [ 11/Feb/16 ] |
|
Thanks Alex, so I'll re-resolve this