[LU-9334] LBUG lu_device_fini()) ASSERTION( atomic_read(&d->ld_ref) == 0 ) failed: Refcount is 1 Created: 13/Apr/17 Updated: 28/Apr/17 Resolved: 28/Apr/17 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.10.0 |
| Fix Version/s: | Lustre 2.10.0 |
| Type: | Bug | Priority: | Blocker |
| Reporter: | Cliff White (Inactive) | Assignee: | nasf (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Environment: |
Soak stress cluster - Lustre-master build 3554 version=2.9.55_35_gaa32cc5 |
||
| Attachments: |
|
||||||||||||
| Issue Links: |
|
||||||||||||
| Severity: | 3 | ||||||||||||
| Rank (Obsolete): | 9223372036854775807 | ||||||||||||
| Description |
|
After installing build 3554, attempting to re-mount the filesystem. On MGS/MDS node: Apr 13 14:14:42 soak-8 kernel: LDISKFS-fs (dm-5): recovery complete Apr 13 14:14:42 soak-8 kernel: LDISKFS-fs (dm-5): mounted filesystem with ordered data mode. Opts: user_xattr,user_xattr,errors=remount-ro,no_mbcache,nodelalloc Apr 13 14:14:44 soak-8 kernel: Lustre: MGS: Connection restored to 7703d69c-e9bc-c7f1-f9af-daf27057ff03 (at 0@lo) Apr 13 14:14:44 soak-8 kernel: Lustre: soaked-MDT0000: Imperative Recovery not enabled, recovery window 300-900 Apr 13 14:14:44 soak-8 kernel: LustreError: 5015:0:(lfsck_layout.c:6780:lfsck_layout_setup()) soaked-MDT0000-osd: fail to init layout LFSCK component: rc = -20 Apr 13 14:14:44 soak-8 kernel: LustreError: 5015:0:(mdd_device.c:1084:mdd_prepare()) soaked-MDD0000: failed to initialize lfsck: rc = -20 Apr 13 14:14:44 soak-8 kernel: LustreError: 5015:0:(obd_mount_server.c:1840:server_fill_super()) Unable to start targets: -20 Apr 13 14:14:44 soak-8 kernel: Lustre: Failing over soaked-MDT0000 Apr 13 14:14:44 soak-8 kernel: LustreError: 5553:0:(osp_object.c:527:osp_attr_get()) soaked-MDT0001-osp-MDT0000:osp_attr_get update error [0x200000009:0x1:0x0]: rc = -5 Apr 13 14:14:44 soak-8 kernel: LustreError: 5554:0:(lod_sub_object.c:959:lod_sub_prep_llog()) soaked-MDT0000-mdtlov: can't get id from catalogs: rc = -5 Apr 13 14:14:44 soak-8 kernel: LustreError: 5554:0:(lod_dev.c:419:lod_sub_recovery_thread()) soaked-MDT0002-osp-MDT0000 getting update log failed: rc = -5 Apr 13 14:14:44 soak-8 kernel: LustreError: 5553:0:(osp_object.c:527:osp_attr_get()) Skipped 2 previous similar messages Apr 13 14:14:44 soak-8 kernel: LustreError: 5015:0:(lu_object.c:1224:lu_device_fini()) ASSERTION( atomic_read(&d->ld_ref) == 0 ) failed: Refcount is 1 Apr 13 14:14:44 soak-8 kernel: LustreError: 5015:0:(lu_object.c:1224:lu_device_fini()) LBUG Apr 13 14:14:44 soak-8 kernel: Pid: 5015, comm: mount.lustre Apr 13 14:14:44 soak-8 kernel: #012Call Trace: Apr 13 14:14:44 soak-8 kernel: [<ffffffffa0bcb7f3>] libcfs_debug_dumpstack+0x53/0x80 [libcfs] Apr 13 14:14:44 soak-8 kernel: [<ffffffffa0bcb861>] lbug_with_loc+0x41/0xb0 [libcfs] Apr 13 14:14:44 soak-8 kernel: [<ffffffffa0d0fe48>] lu_device_fini+0xb8/0xc0 [obdclass] Apr 13 14:14:44 soak-8 kernel: [<ffffffffa0cf4b12>] ls_device_put+0x82/0x2a0 [obdclass] Apr 13 14:14:44 soak-8 kernel: [<ffffffffa0cf4e0d>] local_oid_storage_fini+0xdd/0x210 [obdclass] Apr 13 14:14:44 soak-8 kernel: [<ffffffffa129a308>] mgs_fs_cleanup+0x88/0xa0 [mgs] Apr 13 14:14:44 soak-8 kernel: [<ffffffffa1292b96>] mgs_device_fini+0x196/0x5b0 [mgs] Apr 13 14:14:44 soak-8 kernel: [<ffffffffa0cff724>] class_cleanup+0x7f4/0xd80 [obdclass] Apr 13 14:14:44 soak-8 kernel: [<ffffffffa0d02364>] class_process_config+0x1f84/0x2c30 [obdclass] Apr 13 14:14:44 soak-8 kernel: [<ffffffffa0cf1659>] ? lprocfs_counter_add+0xf9/0x160 [obdclass] Apr 13 14:14:44 soak-8 kernel: [<ffffffffa0d030ff>] class_manual_cleanup+0xef/0x810 [obdclass] Apr 13 14:14:44 soak-8 kernel: [<ffffffffa0d332e0>] server_put_super+0xb20/0xcd0 [obdclass] Apr 13 14:14:44 soak-8 kernel: [<ffffffffa0d3708b>] server_fill_super+0xcdb/0x184c [obdclass] Apr 13 14:14:44 soak-8 kernel: [<ffffffffa0d0f1d8>] lustre_fill_super+0x328/0x950 [obdclass] Apr 13 14:14:44 soak-8 kernel: [<ffffffffa0d0eeb0>] ? lustre_fill_super+0x0/0x950 [obdclass] Apr 13 14:14:44 soak-8 kernel: [<ffffffff81201b1d>] mount_nodev+0x4d/0xb0 Apr 13 14:14:44 soak-8 kernel: [<ffffffffa0d06c78>] lustre_mount+0x38/0x60 [obdclass] Apr 13 14:14:44 soak-8 kernel: [<ffffffff812024c9>] mount_fs+0x39/0x1b0 Apr 13 14:14:44 soak-8 kernel: [<ffffffff8121e24f>] vfs_kern_mount+0x5f/0xf0 Apr 13 14:14:44 soak-8 kernel: [<ffffffff812207ae>] do_mount+0x24e/0xaa0 Apr 13 14:14:44 soak-8 kernel: [<ffffffff81185a7e>] ? __get_free_pages+0xe/0x50 Apr 13 14:14:44 soak-8 kernel: [<ffffffff81221096>] SyS_mount+0x96/0xf0 Apr 13 14:14:44 soak-8 kernel: [<ffffffff81696c49>] system_call_fastpath+0x16/0x1b Apr 13 14:14:45 soak-8 kernel: Apr 13 14:14:45 soak-8 kernel: Kernel panic - not syncing: LBUG vmcore-dmesg is attached, crash dump is available on soak-8 |
| Comments |
| Comment by Joseph Gmitter (Inactive) [ 13/Apr/17 ] |
|
Hi Fan Yong, Can you please look into this one? Thanks. |
| Comment by nasf (Inactive) [ 14/Apr/17 ] |
[34132.961391] LustreError: 5015:0:(lfsck_layout.c:6780:lfsck_layout_setup()) soaked-MDT0000-osd: fail to init layout LFSCK component: rc = -20 That means the object claimed as non-dir, but we expect it as dir. It looks strange, because the target object is the local root, it must be dir. That is why the mount failed. As for the LBUG() itself, it is suspected that we forget to release some object when handle the mount failure. But I have not found the point with the pure code analysis. Cliff, is it reproducible issue? If yes, would you please collect -1 level kernel debug logs on the failed MDT? Thanks! |
| Comment by Cliff White (Inactive) [ 17/Apr/17 ] |
|
It is easily reproducible. When I have time I will re-load thet build and try again - moved testing back to 3551 at this time |
| Comment by Cliff White (Inactive) [ 17/Apr/17 ] |
|
Reproduced the error, ran 'lctl dk' on the MDT every 10 seconds durning the mount to make sure we caught it all. Logs attached. |
| Comment by Gerrit Updater [ 17/Apr/17 ] |
|
none. |
| Comment by nasf (Inactive) [ 18/Apr/17 ] |
|
Cliff, Thanks for collecting the logs. Unfortunately, the logs do not contain any information about the failure in the bug description. So I have to try to reproduce it by myself. I have simulated the cases of lfsck_layout_setup() failure and osp_attr_get() failure. But still cannot reproduce it. So would you please to describe your test configuration, including the MDTs count and how to format the MDT devices? And also the steps (commands) to reproduce the issue? Thanks! |
| Comment by nasf (Inactive) [ 18/Apr/17 ] |
|
On the other hand, is it only reproduced on the build 3554? or on other builds also? |
| Comment by Gerrit Updater [ 18/Apr/17 ] |
|
Fan Yong (fan.yong@intel.com) uploaded a new patch: https://review.whamcloud.com/26703 |
| Comment by Cliff White (Inactive) [ 18/Apr/17 ] |
|
This is on the soak performance cluster - 4 MDS, 4 MDT in a DNE configuration. mkfs.lustre --mgsnode=192.168.1.108@o2ib0:192.168.1.109@o2ib0 --fsname=soaked --mdt --index=1 --servicenode=192.168.1.109@o2ib0:192.168.1.108@o2ib0 --reformat --mkfsoptions='-i 4096' --backfstype=ldiskfs --device-size=2147483648 /dev/disk/by-id/dm-name-360080e50002ff4f00000026d52013098p1 mkfs.lustre --mgsnode=192.168.1.108@o2ib0:192.168.1.109@o2ib0 --fsname=soaked --mdt --index=3 --servicenode=192.168.1.111@o2ib0:192.168.1.110@o2ib0 --reformat --mkfsoptions='-i 4096' --backfstype=ldiskfs --device-size=2147483648 /dev/disk/by-id/dm-name-360080e50001fedb80000015752012949 mkfs.lustre --mgsnode=192.168.1.108@o2ib0:192.168.1.109@o2ib0 --fsname=soaked --mdt --index=2 --servicenode=192.168.1.110@o2ib0:192.168.1.111@o2ib0 --reformat --mkfsoptions='-i 4096' --backfstype=ldiskfs --device-size=2147483648 /dev/disk/by-id/dm-name-360080e50001fedb80000015552012932 mkfs.lustre --mgs --fsname=soaked --mdt --index=0 --servicenode=192.168.1.108@o2ib0:192.168.1.109@o2ib0 --reformat --mkfsoptions='-i 4096' --backfstype=ldiskfs --device-size=2147483648 /dev/disk/by-id/dm-name-360080e50002ff4f00000026952013088p1 To produce the failure, I just mount the MDT. Mount script halts at that point. |
| Comment by Cliff White (Inactive) [ 18/Apr/17 ] |
|
Tested latest patch Apr 18 16:26:55 soak-8 kernel: Lustre: MGS: Connection restored to c060ba9e-ea2b-a109-ec49-a8f45fcb1eaf (at 0@lo) Apr 18 16:26:56 soak-8 kernel: Lustre: soaked-MDT0000: Imperative Recovery not enabled, recovery window 300-900 Apr 18 16:26:56 soak-8 kernel: LustreError: 12003:0:(lfsck_layout.c:6780:lfsck_layout_setup()) soaked-MDT0000-osd: fail to init layout LFSCK component: rc = -20 Apr 18 16:26:56 soak-8 kernel: LustreError: 12003:0:(mdd_device.c:1084:mdd_prepare()) soaked-MDD0000: failed to initialize lfsck: rc = -20 Apr 18 16:26:56 soak-8 kernel: LustreError: 12003:0:(obd_mount_server.c:1840:server_fill_super()) Unable to start targets: -20 Apr 18 16:26:56 soak-8 kernel: Lustre: Failing over soaked-MDT0000 Apr 18 16:26:56 soak-8 kernel: LustreError: 12458:0:(osp_object.c:527:osp_attr_get()) soaked-MDT0001-osp-MDT0000:osp_attr_get update error [0x200000009:0x1:0x0]: rc = -5 Apr 18 16:26:56 soak-8 kernel: LustreError: 12459:0:(lod_sub_object.c:959:lod_sub_prep_llog()) soaked-MDT0000-mdtlov: can't get id from catalogs: rc = -5 Apr 18 16:26:56 soak-8 kernel: LustreError: 12459:0:(lod_dev.c:419:lod_sub_recovery_thread()) soaked-MDT0002-osp-MDT0000 getting update log failed: rc = -5 Apr 18 16:26:56 soak-8 kernel: LustreError: 12458:0:(osp_object.c:527:osp_attr_get()) Skipped 2 previous similar messages Apr 18 16:27:02 soak-8 kernel: Lustre: 12003:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1492532816/real 1492532816] req@ffff8803fc650000 x1565032298054672/t0(0) o251->MGC192.168.1.108@o2ib10@0@lo:26/25 lens 224/224 e 0 to 1 dl 1492532822 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1 Apr 18 16:27:02 soak-8 kernel: Lustre: server umount soaked-MDT0000 complete Apr 18 16:27:02 soak-8 kernel: LustreError: 12003:0:(obd_mount.c:1502:lustre_fill_super()) Unable to mount (-20) Apr 18 16:27:02 soak-8 sshd[11981]: Received disconnect from 192.168.1.135: 11: disconnected by user Apr 18 16:27:02 soak-8 sshd[11981]: pam_unix(sshd:session): session closed for user root fails |
| Comment by Gerrit Updater [ 18/Apr/17 ] |
|
Jinshan Xiong (jinshan.xiong@intel.com) uploaded a new patch: https://review.whamcloud.com/26715 |
| Comment by Gerrit Updater [ 18/Apr/17 ] |
|
Jinshan Xiong (jinshan.xiong@intel.com) merged in patch https://review.whamcloud.com/26715/ |
| Comment by Gerrit Updater [ 18/Apr/17 ] |
|
Fan Yong (fan.yong@intel.com) uploaded a new patch: https://review.whamcloud.com/26716 |
| Comment by Gerrit Updater [ 18/Apr/17 ] |
|
Andreas Dilger (andreas.dilger@intel.com) uploaded a new patch: https://review.whamcloud.com/26718 |
| Comment by Gerrit Updater [ 18/Apr/17 ] |
|
Jinshan Xiong (jinshan.xiong@intel.com) merged in patch https://review.whamcloud.com/26718/ |
| Comment by Gerrit Updater [ 26/Apr/17 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/26703/ |
| Comment by nasf (Inactive) [ 26/Apr/17 ] |
|
Still need the patch: |
| Comment by Gerrit Updater [ 28/Apr/17 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/26716/ |
| Comment by Peter Jones [ 28/Apr/17 ] |
|
Landed for 2.10 |