[LU-9334] LBUG lu_device_fini()) ASSERTION( atomic_read(&d->ld_ref) == 0 ) failed: Refcount is 1 Created: 13/Apr/17  Updated: 28/Apr/17  Resolved: 28/Apr/17

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.10.0
Fix Version/s: Lustre 2.10.0

Type: Bug Priority: Blocker
Reporter: Cliff White (Inactive) Assignee: nasf (Inactive)
Resolution: Fixed Votes: 0
Labels: None
Environment:

Soak stress cluster - Lustre-master build 3554 version=2.9.55_35_gaa32cc5


Attachments: File 191916.dump.txt.gz     File 191952.dump.txt.gz     File 192002.dump.txt.gz     File 192012.dump.txt.gz     File 192022.dump.txt.gz     File 192032.dump.txt.gz     File 192042.dump.txt.gz     File 192052.dump.txt.gz     File 192102.dump.txt.gz     File 192112.dump.txt.gz     Text File vmcore-dmesg.txt    
Issue Links:
Related
is related to LU-8998 Progressive File Layout (PFL) Resolved
is related to LU-9415 lfsck_load_one_trace_file() is too noisy Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

After installing build 3554, attempting to re-mount the filesystem. On MGS/MDS node:

Apr 13 14:14:42 soak-8 kernel: LDISKFS-fs (dm-5): recovery complete
Apr 13 14:14:42 soak-8 kernel: LDISKFS-fs (dm-5): mounted filesystem with ordered data mode. Opts: user_xattr,user_xattr,errors=remount-ro,no_mbcache,nodelalloc
Apr 13 14:14:44 soak-8 kernel: Lustre: MGS: Connection restored to 7703d69c-e9bc-c7f1-f9af-daf27057ff03 (at 0@lo)
Apr 13 14:14:44 soak-8 kernel: Lustre: soaked-MDT0000: Imperative Recovery not enabled, recovery window 300-900
Apr 13 14:14:44 soak-8 kernel: LustreError: 5015:0:(lfsck_layout.c:6780:lfsck_layout_setup()) soaked-MDT0000-osd: fail to init layout LFSCK component: rc = -20
Apr 13 14:14:44 soak-8 kernel: LustreError: 5015:0:(mdd_device.c:1084:mdd_prepare()) soaked-MDD0000: failed to initialize lfsck: rc = -20
Apr 13 14:14:44 soak-8 kernel: LustreError: 5015:0:(obd_mount_server.c:1840:server_fill_super()) Unable to start targets: -20
Apr 13 14:14:44 soak-8 kernel: Lustre: Failing over soaked-MDT0000
Apr 13 14:14:44 soak-8 kernel: LustreError: 5553:0:(osp_object.c:527:osp_attr_get()) soaked-MDT0001-osp-MDT0000:osp_attr_get update error [0x200000009:0x1:0x0]: rc = -5
Apr 13 14:14:44 soak-8 kernel: LustreError: 5554:0:(lod_sub_object.c:959:lod_sub_prep_llog()) soaked-MDT0000-mdtlov: can't get id from catalogs: rc = -5
Apr 13 14:14:44 soak-8 kernel: LustreError: 5554:0:(lod_dev.c:419:lod_sub_recovery_thread()) soaked-MDT0002-osp-MDT0000 getting update log failed: rc = -5
Apr 13 14:14:44 soak-8 kernel: LustreError: 5553:0:(osp_object.c:527:osp_attr_get()) Skipped 2 previous similar messages
Apr 13 14:14:44 soak-8 kernel: LustreError: 5015:0:(lu_object.c:1224:lu_device_fini()) ASSERTION( atomic_read(&d->ld_ref) == 0 ) failed: Refcount is 1
Apr 13 14:14:44 soak-8 kernel: LustreError: 5015:0:(lu_object.c:1224:lu_device_fini()) LBUG
Apr 13 14:14:44 soak-8 kernel: Pid: 5015, comm: mount.lustre
Apr 13 14:14:44 soak-8 kernel: #012Call Trace:
Apr 13 14:14:44 soak-8 kernel: [<ffffffffa0bcb7f3>] libcfs_debug_dumpstack+0x53/0x80 [libcfs]
Apr 13 14:14:44 soak-8 kernel: [<ffffffffa0bcb861>] lbug_with_loc+0x41/0xb0 [libcfs]
Apr 13 14:14:44 soak-8 kernel: [<ffffffffa0d0fe48>] lu_device_fini+0xb8/0xc0 [obdclass]
Apr 13 14:14:44 soak-8 kernel: [<ffffffffa0cf4b12>] ls_device_put+0x82/0x2a0 [obdclass]
Apr 13 14:14:44 soak-8 kernel: [<ffffffffa0cf4e0d>] local_oid_storage_fini+0xdd/0x210 [obdclass]
Apr 13 14:14:44 soak-8 kernel: [<ffffffffa129a308>] mgs_fs_cleanup+0x88/0xa0 [mgs]
Apr 13 14:14:44 soak-8 kernel: [<ffffffffa1292b96>] mgs_device_fini+0x196/0x5b0 [mgs]
Apr 13 14:14:44 soak-8 kernel: [<ffffffffa0cff724>] class_cleanup+0x7f4/0xd80 [obdclass]
Apr 13 14:14:44 soak-8 kernel: [<ffffffffa0d02364>] class_process_config+0x1f84/0x2c30 [obdclass]
Apr 13 14:14:44 soak-8 kernel: [<ffffffffa0cf1659>] ? lprocfs_counter_add+0xf9/0x160 [obdclass]
Apr 13 14:14:44 soak-8 kernel: [<ffffffffa0d030ff>] class_manual_cleanup+0xef/0x810 [obdclass]
Apr 13 14:14:44 soak-8 kernel: [<ffffffffa0d332e0>] server_put_super+0xb20/0xcd0 [obdclass]
Apr 13 14:14:44 soak-8 kernel: [<ffffffffa0d3708b>] server_fill_super+0xcdb/0x184c [obdclass]
Apr 13 14:14:44 soak-8 kernel: [<ffffffffa0d0f1d8>] lustre_fill_super+0x328/0x950 [obdclass]
Apr 13 14:14:44 soak-8 kernel: [<ffffffffa0d0eeb0>] ? lustre_fill_super+0x0/0x950 [obdclass]
Apr 13 14:14:44 soak-8 kernel: [<ffffffff81201b1d>] mount_nodev+0x4d/0xb0
Apr 13 14:14:44 soak-8 kernel: [<ffffffffa0d06c78>] lustre_mount+0x38/0x60 [obdclass]
Apr 13 14:14:44 soak-8 kernel: [<ffffffff812024c9>] mount_fs+0x39/0x1b0
Apr 13 14:14:44 soak-8 kernel: [<ffffffff8121e24f>] vfs_kern_mount+0x5f/0xf0
Apr 13 14:14:44 soak-8 kernel: [<ffffffff812207ae>] do_mount+0x24e/0xaa0
Apr 13 14:14:44 soak-8 kernel: [<ffffffff81185a7e>] ? __get_free_pages+0xe/0x50
Apr 13 14:14:44 soak-8 kernel: [<ffffffff81221096>] SyS_mount+0x96/0xf0
Apr 13 14:14:44 soak-8 kernel: [<ffffffff81696c49>] system_call_fastpath+0x16/0x1b
Apr 13 14:14:45 soak-8 kernel:
Apr 13 14:14:45 soak-8 kernel: Kernel panic - not syncing: LBUG

vmcore-dmesg is attached, crash dump is available on soak-8



 Comments   
Comment by Joseph Gmitter (Inactive) [ 13/Apr/17 ]

Hi Fan Yong,

Can you please look into this one?

Thanks.
Joe

Comment by nasf (Inactive) [ 14/Apr/17 ]
[34132.961391] LustreError: 5015:0:(lfsck_layout.c:6780:lfsck_layout_setup()) soaked-MDT0000-osd: fail to init layout LFSCK component: rc = -20

That means the object claimed as non-dir, but we expect it as dir. It looks strange, because the target object is the local root, it must be dir. That is why the mount failed. As for the LBUG() itself, it is suspected that we forget to release some object when handle the mount failure. But I have not found the point with the pure code analysis.

Cliff, is it reproducible issue? If yes, would you please collect -1 level kernel debug logs on the failed MDT? Thanks!

Comment by Cliff White (Inactive) [ 17/Apr/17 ]

It is easily reproducible. When I have time I will re-load thet build and try again - moved testing back to 3551 at this time

Comment by Cliff White (Inactive) [ 17/Apr/17 ]

Reproduced the error, ran 'lctl dk' on the MDT every 10 seconds durning the mount to make sure we caught it all. Logs attached.

Comment by Gerrit Updater [ 17/Apr/17 ]

none.

Comment by nasf (Inactive) [ 18/Apr/17 ]

Cliff,

Thanks for collecting the logs. Unfortunately, the logs do not contain any information about the failure in the bug description. So I have to try to reproduce it by myself. I have simulated the cases of lfsck_layout_setup() failure and osp_attr_get() failure. But still cannot reproduce it. So would you please to describe your test configuration, including the MDTs count and how to format the MDT devices? And also the steps (commands) to reproduce the issue?

Thanks!

Comment by nasf (Inactive) [ 18/Apr/17 ]

On the other hand, is it only reproduced on the build 3554? or on other builds also?

Comment by Gerrit Updater [ 18/Apr/17 ]

Fan Yong (fan.yong@intel.com) uploaded a new patch: https://review.whamcloud.com/26703
Subject: LU-9334 lfsck: object leak in lfsck_load_one_trace_file
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 6886869d700a7e84f7cc3dc5765096277c84c81e

Comment by Cliff White (Inactive) [ 18/Apr/17 ]

This is on the soak performance cluster - 4 MDS, 4 MDT in a DNE configuration.
Formatted quite some time ago, actually.
MDTs created with older 2.9 build

mkfs.lustre --mgsnode=192.168.1.108@o2ib0:192.168.1.109@o2ib0 --fsname=soaked --mdt --index=1 --servicenode=192.168.1.109@o2ib0:192.168.1.108@o2ib0 --reformat --mkfsoptions='-i 4096' --backfstype=ldiskfs --device-size=2147483648 /dev/disk/by-id/dm-name-360080e50002ff4f00000026d52013098p1
mkfs.lustre --mgsnode=192.168.1.108@o2ib0:192.168.1.109@o2ib0 --fsname=soaked --mdt --index=3 --servicenode=192.168.1.111@o2ib0:192.168.1.110@o2ib0 --reformat --mkfsoptions='-i 4096' --backfstype=ldiskfs --device-size=2147483648 /dev/disk/by-id/dm-name-360080e50001fedb80000015752012949
mkfs.lustre --mgsnode=192.168.1.108@o2ib0:192.168.1.109@o2ib0 --fsname=soaked --mdt --index=2 --servicenode=192.168.1.110@o2ib0:192.168.1.111@o2ib0 --reformat --mkfsoptions='-i 4096' --backfstype=ldiskfs --device-size=2147483648 /dev/disk/by-id/dm-name-360080e50001fedb80000015552012932
mkfs.lustre --mgs --fsname=soaked --mdt --index=0 --servicenode=192.168.1.108@o2ib0:192.168.1.109@o2ib0 --reformat --mkfsoptions='-i 4096' --backfstype=ldiskfs --device-size=2147483648 /dev/disk/by-id/dm-name-360080e50002ff4f00000026952013088p1

To produce the failure, I just mount the MDT. Mount script halts at that point.
Does PFL not work on an existing filesystem? Does the filesystem have to be re-created with PFL code?

Comment by Cliff White (Inactive) [ 18/Apr/17 ]

Tested latest patch

Apr 18 16:26:55 soak-8 kernel: Lustre: MGS: Connection restored to c060ba9e-ea2b-a109-ec49-a8f45fcb1eaf (at 0@lo)
Apr 18 16:26:56 soak-8 kernel: Lustre: soaked-MDT0000: Imperative Recovery not enabled, recovery window 300-900
Apr 18 16:26:56 soak-8 kernel: LustreError: 12003:0:(lfsck_layout.c:6780:lfsck_layout_setup()) soaked-MDT0000-osd: fail to init layout LFSCK component: rc = -20
Apr 18 16:26:56 soak-8 kernel: LustreError: 12003:0:(mdd_device.c:1084:mdd_prepare()) soaked-MDD0000: failed to initialize lfsck: rc = -20
Apr 18 16:26:56 soak-8 kernel: LustreError: 12003:0:(obd_mount_server.c:1840:server_fill_super()) Unable to start targets: -20
Apr 18 16:26:56 soak-8 kernel: Lustre: Failing over soaked-MDT0000
Apr 18 16:26:56 soak-8 kernel: LustreError: 12458:0:(osp_object.c:527:osp_attr_get()) soaked-MDT0001-osp-MDT0000:osp_attr_get update error [0x200000009:0x1:0x0]: rc = -5
Apr 18 16:26:56 soak-8 kernel: LustreError: 12459:0:(lod_sub_object.c:959:lod_sub_prep_llog()) soaked-MDT0000-mdtlov: can't get id from catalogs: rc = -5
Apr 18 16:26:56 soak-8 kernel: LustreError: 12459:0:(lod_dev.c:419:lod_sub_recovery_thread()) soaked-MDT0002-osp-MDT0000 getting update log failed: rc = -5
Apr 18 16:26:56 soak-8 kernel: LustreError: 12458:0:(osp_object.c:527:osp_attr_get()) Skipped 2 previous similar messages
Apr 18 16:27:02 soak-8 kernel: Lustre: 12003:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1492532816/real 1492532816]  req@ffff8803fc650000 x1565032298054672/t0(0) o251->MGC192.168.1.108@o2ib10@0@lo:26/25 lens 224/224 e 0 to 1 dl 1492532822 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
Apr 18 16:27:02 soak-8 kernel: Lustre: server umount soaked-MDT0000 complete
Apr 18 16:27:02 soak-8 kernel: LustreError: 12003:0:(obd_mount.c:1502:lustre_fill_super()) Unable to mount  (-20)
Apr 18 16:27:02 soak-8 sshd[11981]: Received disconnect from 192.168.1.135: 11: disconnected by user
Apr 18 16:27:02 soak-8 sshd[11981]: pam_unix(sshd:session): session closed for user root

fails

Comment by Gerrit Updater [ 18/Apr/17 ]

Jinshan Xiong (jinshan.xiong@intel.com) uploaded a new patch: https://review.whamcloud.com/26715
Subject: LU-9334 lfsck: object leak in lfsck_load_one_trace_file
Project: fs/lustre-release
Branch: pfl
Current Patch Set: 1
Commit: 806549e6ba5e96059cb41d006290bfc70ce58709

Comment by Gerrit Updater [ 18/Apr/17 ]

Jinshan Xiong (jinshan.xiong@intel.com) merged in patch https://review.whamcloud.com/26715/
Subject: LU-9334 lfsck: object leak in lfsck_load_one_trace_file
Project: fs/lustre-release
Branch: pfl
Current Patch Set:
Commit: e62a57f86027582c8c9b56a7d3d348fb38510f4b

Comment by Gerrit Updater [ 18/Apr/17 ]

Fan Yong (fan.yong@intel.com) uploaded a new patch: https://review.whamcloud.com/26716
Subject: LU-9334 lfsck: reset trace file for upgrade case
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 48672c55bf83bbadec89bbd3406489251d701db0

Comment by Gerrit Updater [ 18/Apr/17 ]

Andreas Dilger (andreas.dilger@intel.com) uploaded a new patch: https://review.whamcloud.com/26718
Subject: LU-9334 lfsck: reset trace file for upgrade case
Project: fs/lustre-release
Branch: pfl
Current Patch Set: 1
Commit: 740e28ad8fad95080da4b0275562378ef0b51cd3

Comment by Gerrit Updater [ 18/Apr/17 ]

Jinshan Xiong (jinshan.xiong@intel.com) merged in patch https://review.whamcloud.com/26718/
Subject: LU-9334 lfsck: reset trace file for upgrade case
Project: fs/lustre-release
Branch: pfl
Current Patch Set:
Commit: ba4a83aa9f72d7855ff77a37897619155313b198

Comment by Gerrit Updater [ 26/Apr/17 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/26703/
Subject: LU-9334 lfsck: object leak in lfsck_load_one_trace_file
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 1b136613440bd81e284a12df97618f92c9729d71

Comment by nasf (Inactive) [ 26/Apr/17 ]

Still need the patch:
https://review.whamcloud.com/26716

Comment by Gerrit Updater [ 28/Apr/17 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/26716/
Subject: LU-9334 lfsck: reset trace file for upgrade case
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: b3e30f49d3947c63b56274657a5c55af9ba85a2d

Comment by Peter Jones [ 28/Apr/17 ]

Landed for 2.10

Generated at Sat Feb 10 02:25:16 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.