Details
Type: Bug
Resolution: Fixed
Priority: Blocker
Lustre 2.8.0
Severity: 3
Description
This error might be a duplicate of LU-6699. However, as the bug occurred in conjunction with llog errors, it might be related to the latest changes in DNE (change 16838).
The error happened during soak testing of build '20160203' (see: https://wiki.hpdd.intel.com/display/Releases/Soak+Testing+on+Lola#SoakTestingonLola-20160203). DNE is enabled. MDTs were formatted with ldiskfs, OSTs with ZFS. The MDSes are configured in an active-active HA failover configuration.
The HA pair in question is configured as follows:
- lola-8 - mdt-0, 1 (primary resources)
- lola-9 - mdt-2, 3 (primary resources)
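To make the resource movements concrete, the following Python sketch models the active-active pair described above. The node and MDT names come from this ticket; the `HAPair` class and its `failover()`/`failback()` methods are hypothetical and are not part of Lustre or the soak test framework.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class HAPair:
    """Toy model of an active-active MDS failover pair (hypothetical)."""

    # Primary MDTs per node, as listed in the configuration above.
    primary: dict[str, list[str]]
    # Node currently serving each MDT; filled in from `primary` at start.
    serving: dict[str, str] = field(default_factory=dict)

    def __post_init__(self) -> None:
        for node, mdts in self.primary.items():
            for mdt in mdts:
                self.serving[mdt] = node

    def partner(self, node: str) -> str:
        # An HA pair has exactly two nodes; return the other one.
        (other,) = [n for n in self.primary if n != node]
        return other

    def failover(self, failed_node: str) -> None:
        # The surviving partner mounts the failed node's primary MDTs.
        for mdt in self.primary[failed_node]:
            self.serving[mdt] = self.partner(failed_node)

    def failback(self, node: str) -> list[str]:
        # The partner unmounts the MDTs it took over; they return home.
        moved = [m for m in self.primary[node] if self.serving[m] != node]
        for mdt in moved:
            self.serving[mdt] = node
        return moved
```

In this model, the crash reported here occurred during the step that `failback("lola-9")` represents, i.e. while lola-8 was unmounting mdt-3.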
During the umount of mdt-3 on lola-8 (failback of resources), the node crashed with an LBUG:
<0>LustreError: 5861:0:(osd_handler.c:2777:osd_object_destroy()) ASSERTION( !lu_object_is_dying(dt->do_lu.lo_header) ) failed:
<0>LustreError: 5861:0:(osd_handler.c:2777:osd_object_destroy()) LBUG
<4>Pid: 5861, comm: umount
<4>
<4>Call Trace:
<4> [<ffffffffa0772875>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
<4> [<ffffffffa0772e77>] lbug_with_loc+0x47/0xb0 [libcfs]
<4> [<ffffffffa1060fbb>] osd_object_destroy+0x52b/0x5b0 [osd_ldiskfs]
<4> [<ffffffffa105e42d>] ? osd_object_ref_del+0x22d/0x4e0 [osd_ldiskfs]
<4> [<ffffffffa0851dda>] llog_osd_destroy+0x1ba/0x9e0 [obdclass]
<4> [<ffffffffa08417a6>] llog_destroy+0x2b6/0x470 [obdclass]
<4> [<ffffffffa08438cb>] llog_cat_close+0x17b/0x220 [obdclass]
<4> [<ffffffffa12b04e7>] lod_sub_fini_llog+0xb7/0x380 [lod]
<4> [<ffffffff8109ec20>] ? autoremove_wake_function+0x0/0x40
<4> [<ffffffffa12b35c4>] lod_process_config+0xbc4/0x1830 [lod]
<4> [<ffffffffa111361f>] ? lfsck_stop+0x15f/0x4c0 [lfsck]
<4> [<ffffffff8117523c>] ? __kmalloc+0x21c/0x230
<4> [<ffffffff8109ec20>] ? autoremove_wake_function+0x0/0x40
<4> [<ffffffffa1331474>] mdd_process_config+0x114/0x5d0 [mdd]
<4> [<ffffffffa11db55e>] mdt_device_fini+0x3ee/0xf40 [mdt]
<4> [<ffffffffa0860406>] ? class_disconnect_exports+0x116/0x2f0 [obdclass]
<4> [<ffffffffa087a552>] class_cleanup+0x572/0xd20 [obdclass]
<4> [<ffffffffa085b0c6>] ? class_name2dev+0x56/0xe0 [obdclass]
<4> [<ffffffffa087cbd6>] class_process_config+0x1ed6/0x2830 [obdclass]
<4> [<ffffffffa077dd01>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
<4> [<ffffffff8117523c>] ? __kmalloc+0x21c/0x230
<4> [<ffffffffa087d9ef>] class_manual_cleanup+0x4bf/0x8e0 [obdclass]
<4> [<ffffffffa085b0c6>] ? class_name2dev+0x56/0xe0 [obdclass]
<4> [<ffffffffa08b610c>] server_put_super+0xa0c/0xed0 [obdclass]
<4> [<ffffffff811ac776>] ? invalidate_inodes+0xf6/0x190
<4> [<ffffffff81190b7b>] generic_shutdown_super+0x5b/0xe0
<4> [<ffffffff81190c66>] kill_anon_super+0x16/0x60
<4> [<ffffffffa08808a6>] lustre_kill_super+0x36/0x60 [obdclass]
<4> [<ffffffff81191407>] deactivate_super+0x57/0x80
<4> [<ffffffff811b10df>] mntput_no_expire+0xbf/0x110
<4> [<ffffffff811b1c2b>] sys_umount+0x7b/0x3a0
<4> [<ffffffff8100b0d2>] system_call_fastpath+0x16/0x1b
<4>
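The failed assertion in `osd_object_destroy()` checks that the object being destroyed is not already marked as dying. The Python sketch below is a toy model of that invariant only, assuming for illustration that a successful destroy marks the object as dying, so that a second destroy of the same in-memory object trips the check. `LBUG`, `ToyObjectHeader`, `toy_object_destroy()` and `toy_llog_destroy()` are hypothetical names; this is not the Lustre implementation.

```python
class LBUG(AssertionError):
    """Stands in for the kernel LBUG raised when an LASSERT fails."""


class ToyObjectHeader:
    """Hypothetical stand-in for an in-memory object header."""

    def __init__(self, name: str) -> None:
        self.name = name
        # Models the state that lu_object_is_dying() reports.
        self.dying = False


def toy_object_destroy(hdr: ToyObjectHeader) -> None:
    # Same shape as the failed check: never destroy an object that is
    # already marked as dying.
    if hdr.dying:
        raise LBUG(f"ASSERTION( !lu_object_is_dying ) failed: {hdr.name}")
    # Assumption for this model: a successful destroy marks the object dying.
    hdr.dying = True


def toy_llog_destroy(hdr: ToyObjectHeader) -> None:
    # Collapses llog_destroy() -> llog_osd_destroy() -> osd_object_destroy()
    # from the trace above into a single call.
    toy_object_destroy(hdr)
```

Calling `toy_llog_destroy()` twice on the same header raises `LBUG` on the second call. The model illustrates the invariant; it does not show why the object was already dying in this crash.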
The following errors were also reported on lola-8 in the seconds leading up to the LBUG:
Feb 3 10:51:27 lola-8 kernel: LustreError: 5733:0:(llog.c:588:llog_process_thread()) soaked-MDT0006-osp-MDT0003 retry remote llog process
Feb 3 10:51:27 lola-8 kernel: LustreError: 5733:0:(lod_dev.c:419:lod_sub_recovery_thread()) soaked-MDT0006-osp-MDT0003 getting update log failed: rc = -11
Feb 3 10:51:27 lola-8 kernel: LustreError: 5727:0:(llog.c:595:llog_process_thread()) Local llog found corrupted
Feb 3 10:51:27 lola-8 kernel: LustreError: 5730:0:(osp_object.c:588:osp_attr_get()) soaked-MDT0002-osp-MDT0003:osp_attr_get update error [0x200000009:0x2:0x0]: rc = -5
Feb 3 10:51:27 lola-8 kernel: LustreError: 5730:0:(lod_sub_object.c:959:lod_sub_prep_llog()) soaked-MDT0003-mdtlov: can't get id from catalogs: rc = -5
Feb 3 10:51:28 lola-8 kernel: LustreError: 5727:0:(llog.c:595:llog_process_thread()) Local llog found corrupted
Feb 3 10:51:28 lola-8 kernel: LustreError: 5727:0:(llog.c:595:llog_process_thread()) Skipped 1 previous similar message
Feb 3 10:51:29 lola-8 kernel: LustreError: 5727:0:(llog.c:595:llog_process_thread()) Local llog found corrupted
Feb 3 10:51:29 lola-8 kernel: LustreError: 5727:0:(llog.c:595:llog_process_thread()) Skipped 1 previous similar message
Feb 3 10:51:31 lola-8 kernel: Lustre: soaked-MDT0003: Not available for connect from 0@lo (stopping)
Feb 3 10:51:32 lola-8 kernel: LustreError: 5727:0:(llog.c:595:llog_process_thread()) Local llog found corrupted
Feb 3 10:51:32 lola-8 kernel: LustreError: 5727:0:(llog.c:595:llog_process_thread()) Skipped 2 previous similar messages
Feb 3 10:51:35 lola-8 kernel: LustreError: 5861:0:(osd_handler.c:3291:osd_object_ref_del()) soaked-MDT0003-osd: nlink == 0 on [0x2c00042a3:0x15ec5:0x0], maybe an upgraded file? (LU-3915)
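With many similar messages interleaved, a small script can help triage a log like this. The sketch below assumes the `LustreError: <pid>:<n>:(<file>:<line>:<function>()) <text>` layout seen above; it counts messages per source function and decodes `rc` values, which Lustre reports as negated errno values (on Linux, `rc = -11` is `-EAGAIN` and `rc = -5` is `-EIO`). The `summarize()` helper is hypothetical, not a Lustre tool.

```python
import errno
import re
from collections import Counter

# Matches Lustre console messages of the form
#   ... LustreError: <pid>:<n>:(<file>:<line>:<function>()) <text>
LUSTRE_MSG = re.compile(
    r"LustreError: (?P<pid>\d+):\d+:\((?P<file>[\w.]+):(?P<line>\d+):"
    r"(?P<func>\w+)\(\)\) (?P<text>.*)"
)
RC = re.compile(r"rc = (-\d+)")


def summarize(lines):
    """Count LustreError messages per function and decode any rc values."""
    per_func = Counter()
    codes = []
    for line in lines:
        m = LUSTRE_MSG.search(line)
        if not m:
            continue  # e.g. informational "Lustre:" messages
        per_func[m["func"]] += 1
        rc = RC.search(m["text"])
        if rc:
            value = int(rc.group(1))
            # Map the negated errno back to its symbolic name.
            codes.append((m["func"], value, errno.errorcode.get(-value, "?")))
    return per_func, codes
```

Applied to the log above, it would show that the only decoded errors are one `-EAGAIN` from `lod_sub_recovery_thread()` and two `-EIO` (from `osp_attr_get()` and `lod_sub_prep_llog()`), alongside repeated `Local llog found corrupted` reports from `llog_process_thread()`.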
The sequence of events was:
- 2016-02-03 10:42:39 - failover started for lola-9
- 2016-02-03 10:42:39 - lola-9 online again
- 2016-02-03 10:51:26 - Failback of resources (umount mdt-3)
- 2016-02-03 10:51:35 - lola-8 hit LBUG
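The gaps between the listed events can be computed directly from the timestamps above. The sketch below shows that failback started 527 seconds (8 min 47 s) after the failover and that the LBUG hit 9 seconds after the umount of mdt-3 began. The `EVENTS` list and `intervals()` helper are illustrative only.

```python
from datetime import datetime

# Timestamps copied from the sequence of events in this ticket.
EVENTS = [
    ("failover started for lola-9", "2016-02-03 10:42:39"),
    ("lola-9 online again", "2016-02-03 10:42:39"),
    ("failback of resources (umount mdt-3)", "2016-02-03 10:51:26"),
    ("lola-8 hit LBUG", "2016-02-03 10:51:35"),
]


def intervals(events):
    """Return (label, seconds since the previous event) for each later event."""
    parsed = [
        (label, datetime.strptime(ts, "%Y-%m-%d %H:%M:%S")) for label, ts in events
    ]
    return [
        (label, int((t - prev).total_seconds()))
        for (_, prev), (label, t) in zip(parsed, parsed[1:])
    ]
```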
Attached files:
lola-8 messages, console, vmcore-dmesg.txt
soak.log (for injected errors)
Note:
A crash dump has been created. I'll add its storage location as soon as the ticket is created.
Info required for matching: sanity-quota 7c