[LU-3723] LBUG on MDS when unmounting file system after lfsck -t namespace on 2.4 Created: 07/Aug/13  Updated: 28/Aug/13  Resolved: 21/Aug/13

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.0
Fix Version/s: None

Type: Bug Priority: Critical
Reporter: Patrick Farrell (Inactive) Assignee: nasf (Inactive)
Resolution: Duplicate Votes: 0
Labels: None

Severity: 3
Rank (Obsolete): 9585

 Description   

After running lfsck -t namespace (specifically lctl lfsck_start -M [fsname]-MDT0000 -t namespace) on a 2.4-formatted file system, the MDS LBUGs when unmounting the file system.

This was observed with master on CentOS 6 and with 2.4 on SLES11SP1. The dump I'll be making available is on SLES11SP1 with 2.4.

This issue has been observed both during an upgrade from 1.8.6 to 2.4, and also on a fresh 2.4 install.

Here's the stack trace:
2013-08-07T10:30:11.921271-05:00 c0-0c1s5n0 LustreError: 20626:0:(lu_object.c:1141:lu_device_fini()) ASSERTION( cfs_atomic_read(&d->ld_ref) == 0 ) failed: Refcount is 1
2013-08-07T10:30:11.921310-05:00 c0-0c1s5n0 LustreError: 20626:0:(lu_object.c:1141:lu_device_fini()) LBUG
2013-08-07T10:30:11.921321-05:00 c0-0c1s5n0 Pid: 20626, comm: umount
2013-08-07T10:30:11.921329-05:00 c0-0c1s5n0 Call Trace:
2013-08-07T10:30:11.921338-05:00 c0-0c1s5n0 [<ffffffff81007e59>] try_stack_unwind+0x1a9/0x200
2013-08-07T10:30:11.921347-05:00 c0-0c1s5n0 [<ffffffff81006625>] dump_trace+0x95/0x300
2013-08-07T10:30:11.921356-05:00 c0-0c1s5n0 [<ffffffffa044c8d7>] libcfs_debug_dumpstack+0x57/0x80 [libcfs]
2013-08-07T10:30:11.921365-05:00 c0-0c1s5n0 [<ffffffffa044ce27>] lbug_with_loc+0x47/0xb0 [libcfs]
2013-08-07T10:30:11.921373-05:00 c0-0c1s5n0 [<ffffffffa05590c7>] lu_device_fini+0x87/0xc0 [obdclass]
2013-08-07T10:30:11.921382-05:00 c0-0c1s5n0 [<ffffffffa053e4e9>] ls_device_put+0xa9/0x200 [obdclass]
2013-08-07T10:30:11.921390-05:00 c0-0c1s5n0 [<ffffffffa053e74b>] local_oid_storage_fini+0x10b/0x210 [obdclass]
2013-08-07T10:30:11.921398-05:00 c0-0c1s5n0 [<ffffffffa0251944>] mdd_process_config+0x274/0x610 [mdd]
2013-08-07T10:30:11.921407-05:00 c0-0c1s5n0 [<ffffffffa0b7ed6b>] mdt_stack_fini+0x17b/0xbc0 [mdt]
2013-08-07T10:30:11.921416-05:00 c0-0c1s5n0 [<ffffffffa0b7fe39>] mdt_device_fini+0x689/0xdd0 [mdt]
2013-08-07T10:30:11.921424-05:00 c0-0c1s5n0 [<ffffffffa054b00f>] class_cleanup+0x65f/0xdb0 [obdclass]
2013-08-07T10:30:11.921432-05:00 c0-0c1s5n0 [<ffffffffa054c874>] class_process_config+0x1114/0x1cb0 [obdclass]
2013-08-07T10:30:11.921440-05:00 c0-0c1s5n0 [<ffffffffa054d587>] class_manual_cleanup+0x177/0x6f0 [obdclass]
2013-08-07T10:30:11.921448-05:00 c0-0c1s5n0 [<ffffffffa05845ba>] server_put_super+0x5ba/0xf00 [obdclass]
2013-08-07T10:30:11.921456-05:00 c0-0c1s5n0 [<ffffffff811159bd>] generic_shutdown_super+0x5d/0x110
2013-08-07T10:30:11.921465-05:00 c0-0c1s5n0 [<ffffffff81115ad6>] kill_anon_super+0x16/0x60
2013-08-07T10:30:11.921473-05:00 c0-0c1s5n0 [<ffffffffa054f2c6>] lustre_kill_super+0x36/0x50 [obdclass]
2013-08-07T10:30:11.921481-05:00 c0-0c1s5n0 [<ffffffff81115f73>] deactivate_super+0x73/0x90
2013-08-07T10:30:11.921490-05:00 c0-0c1s5n0 [<ffffffff8112e082>] mntput_no_expire+0xc2/0xf0
2013-08-07T10:30:11.921498-05:00 c0-0c1s5n0 [<ffffffff8112e43c>] sys_umount+0x7c/0x360
2013-08-07T10:30:11.921506-05:00 c0-0c1s5n0 [<ffffffff8100305b>] system_call_fastpath+0x16/0x1b
2013-08-07T10:30:11.921514-05:00 c0-0c1s5n0 [<00007fa6f1b37d07>] 0x7fa6f1b37d07
2013-08-07T10:30:11.921523-05:00 c0-0c1s5n0 Kernel panic - not syncing: LBUG
2013-08-07T10:30:11.921531-05:00 c0-0c1s5n0 Pid: 20626, comm: umount Tainted: P 2.6.32.59-0.7.1_1.0000.7461-cray_gem_s #1
2013-08-07T10:30:11.921539-05:00 c0-0c1s5n0 Call Trace:
2013-08-07T10:30:11.921547-05:00 c0-0c1s5n0 [<ffffffff81007e59>] try_stack_unwind+0x1a9/0x200
2013-08-07T10:30:11.921555-05:00 c0-0c1s5n0 [<ffffffff81006625>] dump_trace+0x95/0x300
2013-08-07T10:30:11.921563-05:00 c0-0c1s5n0 [<ffffffff8100786c>] show_trace_log_lvl+0x5c/0x80
2013-08-07T10:30:11.921572-05:00 c0-0c1s5n0 [<ffffffff810078a5>] show_trace+0x15/0x20
2013-08-07T10:30:11.921580-05:00 c0-0c1s5n0 [<ffffffff814283c5>] dump_stack+0x77/0x82
2013-08-07T10:30:11.921588-05:00 c0-0c1s5n0 [<ffffffff8142844a>] panic+0x7a/0x165
2013-08-07T10:30:11.921597-05:00 c0-0c1s5n0 [<ffffffffa044ce7b>] lbug_with_loc+0x9b/0xb0 [libcfs]
2013-08-07T10:30:11.921605-05:00 c0-0c1s5n0 [<ffffffffa05590c7>] lu_device_fini+0x87/0xc0 [obdclass]
2013-08-07T10:30:11.921613-05:00 c0-0c1s5n0 [<ffffffffa053e4e9>] ls_device_put+0xa9/0x200 [obdclass]
2013-08-07T10:30:11.921621-05:00 c0-0c1s5n0 [<ffffffffa053e74b>] local_oid_storage_fini+0x10b/0x210 [obdclass]
2013-08-07T10:30:11.921630-05:00 c0-0c1s5n0 [<ffffffffa0251944>] mdd_process_config+0x274/0x610 [mdd]
2013-08-07T10:30:11.921638-05:00 c0-0c1s5n0 [<ffffffffa0b7ed6b>] mdt_stack_fini+0x17b/0xbc0 [mdt]
2013-08-07T10:30:11.921646-05:00 c0-0c1s5n0 [<ffffffffa0b7fe39>] mdt_device_fini+0x689/0xdd0 [mdt]
2013-08-07T10:30:11.921654-05:00 c0-0c1s5n0 [<ffffffffa054b00f>] class_cleanup+0x65f/0xdb0 [obdclass]
2013-08-07T10:30:11.921663-05:00 c0-0c1s5n0 [<ffffffffa054c874>] class_process_config+0x1114/0x1cb0 [obdclass]
2013-08-07T10:30:11.921672-05:00 c0-0c1s5n0 [<ffffffffa054d587>] class_manual_cleanup+0x177/0x6f0 [obdclass]
2013-08-07T10:30:11.921680-05:00 c0-0c1s5n0 [<ffffffffa05845ba>] server_put_super+0x5ba/0xf00 [obdclass]
2013-08-07T10:30:11.921688-05:00 c0-0c1s5n0 [<ffffffff811159bd>] generic_shutdown_super+0x5d/0x110
2013-08-07T10:30:11.921697-05:00 c0-0c1s5n0 [<ffffffff81115ad6>] kill_anon_super+0x16/0x60
2013-08-07T10:30:11.921705-05:00 c0-0c1s5n0 [<ffffffffa054f2c6>] lustre_kill_super+0x36/0x50 [obdclass]
2013-08-07T10:30:11.921713-05:00 c0-0c1s5n0 [<ffffffff81115f73>] deactivate_super+0x73/0x90
2013-08-07T10:30:11.921725-05:00 c0-0c1s5n0 [<ffffffff8112e082>] mntput_no_expire+0xc2/0xf0
2013-08-07T10:30:11.921751-05:00 c0-0c1s5n0 [<ffffffff8112e43c>] sys_umount+0x7c/0x360
2013-08-07T10:30:11.921760-05:00 c0-0c1s5n0 [<ffffffff8100305b>] system_call_fastpath+0x16/0x1b
2013-08-07T10:30:11.921768-05:00 c0-0c1s5n0 [<00007fa6f1b37d07>] 0x7fa6f1b37d07



 Comments   
Comment by Patrick Farrell (Inactive) [ 07/Aug/13 ]

The dump (with associated ko files and lustre-ext.so) is available over FTP here:
ftp.cray.com
anonymous/anonymous

cd outbound
get 801166_dump_and_logs.tar.gz

The full dk log (debug=-1) from the MDS, from before lfsck was started until the LBUG on unmount, is in the file called mds_log.sort

Comment by Patrick Farrell (Inactive) [ 20/Aug/13 ]

With patch http://review.whamcloud.com/#/c/7190/ from https://jira.hpdd.intel.com/browse/LU-3649, this no longer occurs. I've tested both with the release branch with that patch applied on CentOS 6.4, and with Cray's local 2.4 with that patch applied on SLES.

I'd say this bug can be closed with a reference to that one. Thanks!

Comment by nasf (Inactive) [ 21/Aug/13 ]

This is a duplicate of LU-3649.

Comment by Patrick Farrell (Inactive) [ 21/Aug/13 ]

nasf,

Two things occurred to me:

First of all, we need a fix to b2_4 as well as to master for this, right? So LU-3649, or at least http://review.whamcloud.com/#/c/7190/, needs to be landed to b2_4.

Secondly, any idea how this was missed in the release? It fails reliably when the file system is taken down after LFSCK, so it's hard to see how it wouldn't have been hit even without any testing aimed at hitting it.

Thanks,
Patrick

Comment by nasf (Inactive) [ 22/Aug/13 ]

1) This bug was introduced by LFSCK 1.5, so it needs to be fixed on both b2_4 and b2_5.

2) The issue can be triggered when the LFSCK scanning position is reset, for example by specifying "lctl lfsck_start -r" or by running LFSCK repeatedly (which is closer to your failure case). We had not tested such cases before, so we did not catch the issue in time.

Comment by Cory Spitz [ 22/Aug/13 ]

nasf, I think that we were able to replicate it from a simple `lctl lfsck_start -M [fsname]-MDT0000 -t namespace`. Any idea why that failed hard for us, but passed your testing?

Comment by nasf (Inactive) [ 22/Aug/13 ]

The first run of `lctl lfsck_start -M [fsname]-MDT0000 -t namespace` on a newly formatted MDT will pass, but the next run of the same command will cause your failure. Is that your case?

Comment by Patrick Farrell (Inactive) [ 28/Aug/13 ]

nasf,

It appears you're right about that: it only happens after the second and subsequent runs on a newly formatted file system.

- Patrick
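
For reference, the sequence confirmed in the comments above (two namespace scans on a newly formatted MDT, followed by an unmount) can be sketched as a small shell script. This is only an illustrative sketch: FSNAME, MDT_MOUNT, DRY_RUN and the helper functions are hypothetical names that do not come from this ticket, and the script defaults to printing the commands rather than running them. Only the `lctl lfsck_start` invocation itself is taken from the ticket.

```shell
#!/bin/sh
# Reproduction sketch for the sequence described in this ticket.
# FSNAME, MDT_MOUNT and DRY_RUN are hypothetical placeholders (assumptions).
FSNAME="${FSNAME:-testfs}"
MDT_MOUNT="${MDT_MOUNT:-/mnt/mdt}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands, anything else = run them

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

repro() {
    # First namespace scan: reported to pass on a newly formatted MDT.
    run lctl lfsck_start -M "${FSNAME}-MDT0000" -t namespace
    # Let the first scan finish before starting the next one (not shown here).
    # Second scan: reported to be the run after which unmount fails.
    run lctl lfsck_start -M "${FSNAME}-MDT0000" -t namespace
    # Unmounting the MDT is where the lu_device_fini() assertion fired.
    run umount "$MDT_MOUNT"
}

repro
```

With the default DRY_RUN=1 the script only lists the three commands. On an unpatched 2.4 MDS, running them for real is expected to panic the node on unmount, so it should only be tried on a disposable test system.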