Details
- Type: Bug
- Resolution: Fixed
- Priority: Critical
- Affects Version: Lustre 1.8.6
- Fix Version: None
- Environment: RHEL5 on all affected machines, Lustre exported via NFS
- Severity: 3
- Bugzilla ID: 17764
- Rank: 6577
Description
We hit this LBUG frequently on one of our production file systems, and we have now managed to reproduce it reliably on our test file system by exporting the Lustre file system via NFS from one Lustre client and running a version of racer on an NFS client against the exported file system. After a few minutes the LBUG hits on the MDS. We initially saw this on Lustre 1.6.7.2, then on 1.8.3-ddn3.3, and can now reproduce it on the test file system after upgrading the MDS to 1.8.6-wc1, leaving the OSSes and clients at 1.8.3-ddn3.3 for now.
Jul 25 16:13:50 cs04r-sc-mds02-03 kernel: LustreError: 6854:0:(mds_open.c:1323:mds_open()) ASSERTION(!mds_inode_is_orphan(dchild->d_inode)) failed: dchild 1d2764:0e4d3640 (ffff810429fc2b70) inode ffff81042aabfc30/1910628/239941184
Jul 25 16:13:50 cs04r-sc-mds02-03 kernel: LustreError: 6854:0:(mds_open.c:1323:mds_open()) LBUG
Jul 25 16:13:50 cs04r-sc-mds02-03 kernel: Pid: 6854, comm: ll_mdt_03
Jul 25 16:13:50 cs04r-sc-mds02-03 kernel:
Jul 25 16:13:50 cs04r-sc-mds02-03 kernel: Call Trace:
Jul 25 16:13:50 cs04r-sc-mds02-03 kernel: [<ffffffff887aa6a1>] libcfs_debug_dumpstack+0x51/0x60 [libcfs]
Jul 25 16:13:50 cs04r-sc-mds02-03 kernel: [<ffffffff887aabda>] lbug_with_loc+0x7a/0xd0 [libcfs]
Jul 25 16:13:50 cs04r-sc-mds02-03 kernel: [<ffffffff88c1d33d>] mds_open+0x26ad/0x38eb [mds]
Jul 25 16:13:50 cs04r-sc-mds02-03 kernel: [<ffffffff889a3461>] ksocknal_launch_packet+0x2b1/0x3a0 [ksocklnd]
Jul 25 16:13:50 cs04r-sc-mds02-03 kernel: [<ffffffff889a4f65>] ksocknal_alloc_tx+0x1f5/0x2a0 [ksocklnd]
Jul 25 16:13:50 cs04r-sc-mds02-03 kernel: [<ffffffff88917491>] lustre_swab_buf+0x81/0x170 [ptlrpc]
Jul 25 16:13:50 cs04r-sc-mds02-03 kernel: [<ffffffff8000d567>] dput+0x2c/0x113
Jul 25 16:13:50 cs04r-sc-mds02-03 kernel: [<ffffffff88bf40b5>] mds_reint_rec+0x365/0x550 [mds]
Jul 25 16:13:50 cs04r-sc-mds02-03 kernel: [<ffffffff88c1eb3e>] mds_update_unpack+0x1fe/0x280 [mds]
Jul 25 16:13:50 cs04r-sc-mds02-03 kernel: [<ffffffff88be6eca>] mds_reint+0x35a/0x420 [mds]
Jul 25 16:13:50 cs04r-sc-mds02-03 kernel: [<ffffffff88be5dda>] fixup_handle_for_resent_req+0x5a/0x2c0 [mds]
Jul 25 16:13:50 cs04r-sc-mds02-03 kernel: [<ffffffff88bf0bfc>] mds_intent_policy+0x4ac/0xc20 [mds]
Jul 25 16:13:50 cs04r-sc-mds02-03 kernel: [<ffffffff888d8270>] ldlm_resource_putref_internal+0x230/0x460 [ptlrpc]
Jul 25 16:13:50 cs04r-sc-mds02-03 kernel: [<ffffffff888d5eb6>] ldlm_lock_enqueue+0x186/0xb20 [ptlrpc]
Jul 25 16:13:50 cs04r-sc-mds02-03 kernel: [<ffffffff888d27fd>] ldlm_lock_create+0x9bd/0x9f0 [ptlrpc]
Jul 25 16:13:50 cs04r-sc-mds02-03 kernel: [<ffffffff888fa870>] ldlm_server_blocking_ast+0x0/0x83d [ptlrpc]
Jul 25 16:13:50 cs04r-sc-mds02-03 kernel: [<ffffffff888f7b29>] ldlm_handle_enqueue+0xbf9/0x1210 [ptlrpc]
Jul 25 16:13:50 cs04r-sc-mds02-03 kernel: [<ffffffff88befb20>] mds_handle+0x40e0/0x4d10 [mds]
Jul 25 16:13:50 cs04r-sc-mds02-03 kernel: [<ffffffff8008ddcd>] enqueue_task+0x41/0x56
Jul 25 16:13:50 cs04r-sc-mds02-03 kernel: [<ffffffff8008de38>] __activate_task+0x56/0x6d
Jul 25 16:13:50 cs04r-sc-mds02-03 kernel: [<ffffffff8891bd55>] lustre_msg_get_conn_cnt+0x35/0xf0 [ptlrpc]
Jul 25 16:13:50 cs04r-sc-mds02-03 kernel: [<ffffffff889256d9>] ptlrpc_server_handle_request+0x989/0xe00 [ptlrpc]
Jul 25 16:13:50 cs04r-sc-mds02-03 kernel: [<ffffffff88925e35>] ptlrpc_wait_event+0x2e5/0x310 [ptlrpc]
Jul 25 16:13:50 cs04r-sc-mds02-03 kernel: [<ffffffff8008c85d>] __wake_up_common+0x3e/0x68
Jul 25 16:13:50 cs04r-sc-mds02-03 kernel: [<ffffffff88926dc6>] ptlrpc_main+0xf66/0x1120 [ptlrpc]
Jul 25 16:13:50 cs04r-sc-mds02-03 kernel: [<ffffffff8005dfb1>] child_rip+0xa/0x11
Jul 25 16:13:50 cs04r-sc-mds02-03 kernel: [<ffffffff88925e60>] ptlrpc_main+0x0/0x1120 [ptlrpc]
Jul 25 16:13:50 cs04r-sc-mds02-03 kernel: [<ffffffff8005dfa7>] child_rip+0x0/0x11
Jul 25 16:13:50 cs04r-sc-mds02-03 kernel:
Jul 25 16:13:50 cs04r-sc-mds02-03 kernel: LustreError: dumping log to /tmp/lustre-log.1311606830.6854
I'll attach the racer scripts and lustre-log.
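For reference, the reproduction described above can be sketched roughly as follows. This is an illustrative outline, not an exact copy of our setup: the MGS NID (`mgs@tcp0`), file system name (`testfs`), hostnames, and mount points are placeholders, and `racer.sh` refers to the racer script attached to this ticket.

```shell
# On the Lustre client that will act as the NFS server:
# mount the Lustre file system, then export it via NFS.
mount -t lustre mgs@tcp0:/testfs /mnt/lustre
echo '/mnt/lustre *(rw,no_root_squash)' >> /etc/exports
exportfs -ra

# On a separate NFS client: mount the exported file system.
mount -t nfs lustre-client:/mnt/lustre /mnt/lustre-nfs

# Run racer against the NFS mount; after a few minutes the
# ASSERTION(!mds_inode_is_orphan(dchild->d_inode)) LBUG fires on the MDS.
sh racer.sh /mnt/lustre-nfs/racer
```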
I'm not sure, but at least the earlier traces looked like they might have been this bug; I'm reporting it here since I can still reproduce it with 1.8.6-wc1: https://bugzilla.lustre.org/show_bug.cgi?id=17764
[MDS]# cat /proc/fs/lustre/version
lustre: 1.8.6
kernel: patchless_client
build: jenkins-wc1--PRISTINE-2.6.18-238.12.1.el5_lustre.g266a955
Attachments
Issue Links
- Trackbacks
- Lustre 1.8.x known issues tracker
- Changelog 1.8