Lustre / LU-534

(mds_open.c:1323:mds_open()) ASSERTION(!mds_inode_is_orphan(dchild->d_inode)) failed: -> LBUG


Details

    • Bug
    • Resolution: Fixed
    • Critical
    • Lustre 1.8.8
    • Lustre 1.8.6
    • None
    • RHEL5 on all affected machines, Lustre exported via NFS
    • 3
    • 17,764
    • 6577

    Description

      We hit this LBUG frequently on one of our production file systems and have now managed to reproduce it reliably on our test file system. To do so, we export the Lustre file system via NFS from one Lustre client and run a version of racer on an NFS client inside the exported file system. After a few minutes, the LBUG is triggered on the MDS. We initially saw this on Lustre 1.6.7.2, then on 1.8.3-ddn3.3. We have now reproduced it on the test file system after upgrading the MDS to 1.8.6-wc1, leaving the OSSes and clients at 1.8.3-ddn3.3 for now.

      Jul 25 16:13:50 cs04r-sc-mds02-03 kernel: LustreError: 6854:0:(mds_open.c:1323:mds_open()) ASSERTION(!mds_inode_is_orphan(dchild->d_inode)) failed: dchild 1d2764:0e4d3640 (ffff810429fc2b70) inode ffff81042aabfc30/1910628/239941184
      Jul 25 16:13:50 cs04r-sc-mds02-03 kernel: LustreError: 6854:0:(mds_open.c:1323:mds_open()) LBUG
      Jul 25 16:13:50 cs04r-sc-mds02-03 kernel: Pid: 6854, comm: ll_mdt_03
      Jul 25 16:13:50 cs04r-sc-mds02-03 kernel:
      Jul 25 16:13:50 cs04r-sc-mds02-03 kernel: Call Trace:
      Jul 25 16:13:50 cs04r-sc-mds02-03 kernel: [<ffffffff887aa6a1>] libcfs_debug_dumpstack+0x51/0x60 [libcfs]
      Jul 25 16:13:50 cs04r-sc-mds02-03 kernel: [<ffffffff887aabda>] lbug_with_loc+0x7a/0xd0 [libcfs]
      Jul 25 16:13:50 cs04r-sc-mds02-03 kernel: [<ffffffff88c1d33d>] mds_open+0x26ad/0x38eb [mds]
      Jul 25 16:13:50 cs04r-sc-mds02-03 kernel: [<ffffffff889a3461>] ksocknal_launch_packet+0x2b1/0x3a0 [ksocklnd]
      Jul 25 16:13:50 cs04r-sc-mds02-03 kernel: [<ffffffff889a4f65>] ksocknal_alloc_tx+0x1f5/0x2a0 [ksocklnd]
      Jul 25 16:13:50 cs04r-sc-mds02-03 kernel: [<ffffffff88917491>] lustre_swab_buf+0x81/0x170 [ptlrpc]
      Jul 25 16:13:50 cs04r-sc-mds02-03 kernel: [<ffffffff8000d567>] dput+0x2c/0x113
      Jul 25 16:13:50 cs04r-sc-mds02-03 kernel: [<ffffffff88bf40b5>] mds_reint_rec+0x365/0x550 [mds]
      Jul 25 16:13:50 cs04r-sc-mds02-03 kernel: [<ffffffff88c1eb3e>] mds_update_unpack+0x1fe/0x280 [mds]
      Jul 25 16:13:50 cs04r-sc-mds02-03 kernel: [<ffffffff88be6eca>] mds_reint+0x35a/0x420 [mds]
      Jul 25 16:13:50 cs04r-sc-mds02-03 kernel: [<ffffffff88be5dda>] fixup_handle_for_resent_req+0x5a/0x2c0 [mds]
      Jul 25 16:13:50 cs04r-sc-mds02-03 kernel: [<ffffffff88bf0bfc>] mds_intent_policy+0x4ac/0xc20 [mds]
      Jul 25 16:13:50 cs04r-sc-mds02-03 kernel: [<ffffffff888d8270>] ldlm_resource_putref_internal+0x230/0x460 [ptlrpc]
      Jul 25 16:13:50 cs04r-sc-mds02-03 kernel: [<ffffffff888d5eb6>] ldlm_lock_enqueue+0x186/0xb20 [ptlrpc]
      Jul 25 16:13:50 cs04r-sc-mds02-03 kernel: [<ffffffff888d27fd>] ldlm_lock_create+0x9bd/0x9f0 [ptlrpc]
      Jul 25 16:13:50 cs04r-sc-mds02-03 kernel: [<ffffffff888fa870>] ldlm_server_blocking_ast+0x0/0x83d [ptlrpc]
      Jul 25 16:13:50 cs04r-sc-mds02-03 kernel: [<ffffffff888f7b29>] ldlm_handle_enqueue+0xbf9/0x1210 [ptlrpc]
      Jul 25 16:13:50 cs04r-sc-mds02-03 kernel: [<ffffffff88befb20>] mds_handle+0x40e0/0x4d10 [mds]
      Jul 25 16:13:50 cs04r-sc-mds02-03 kernel: [<ffffffff8008ddcd>] enqueue_task+0x41/0x56
      Jul 25 16:13:50 cs04r-sc-mds02-03 kernel: [<ffffffff8008de38>] __activate_task+0x56/0x6d
      Jul 25 16:13:50 cs04r-sc-mds02-03 kernel: [<ffffffff8891bd55>] lustre_msg_get_conn_cnt+0x35/0xf0 [ptlrpc]
      Jul 25 16:13:50 cs04r-sc-mds02-03 kernel: [<ffffffff889256d9>] ptlrpc_server_handle_request+0x989/0xe00 [ptlrpc]
      Jul 25 16:13:50 cs04r-sc-mds02-03 kernel: [<ffffffff88925e35>] ptlrpc_wait_event+0x2e5/0x310 [ptlrpc]
      Jul 25 16:13:50 cs04r-sc-mds02-03 kernel: [<ffffffff8008c85d>] __wake_up_common+0x3e/0x68
      Jul 25 16:13:50 cs04r-sc-mds02-03 kernel: [<ffffffff88926dc6>] ptlrpc_main+0xf66/0x1120 [ptlrpc]
      Jul 25 16:13:50 cs04r-sc-mds02-03 kernel: [<ffffffff8005dfb1>] child_rip+0xa/0x11
      Jul 25 16:13:50 cs04r-sc-mds02-03 kernel: [<ffffffff88925e60>] ptlrpc_main+0x0/0x1120 [ptlrpc]
      Jul 25 16:13:50 cs04r-sc-mds02-03 kernel: [<ffffffff8005dfa7>] child_rip+0x0/0x11
      Jul 25 16:13:50 cs04r-sc-mds02-03 kernel:
      Jul 25 16:13:50 cs04r-sc-mds02-03 kernel: LustreError: dumping log to /tmp/lustre-log.1311606830.6854

      I'll attach the racer scripts and the lustre-log.
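The attached racer-dls scripts are not reproduced here. As a rough illustration of the kind of workload described above, the following is a minimal racer-style sketch, assuming Python 3. Several threads apply random create, hard-link, rename, open and unlink operations to a small shared set of names in one directory, so operations on the same names keep racing. The helper names (`worker`, `run_racer`, `NAMES`) are hypothetical, the operation set is reduced compared with a real racer run, and the temporary directory stands in for a directory on the NFS client's mount of the re-exported Lustre file system.

```python
import os
import random
import tempfile
import threading
import time

# Hypothetical racer-style workload (NOT the attached racer-dls scripts).
# A small, shared name space makes operations on the same names collide often.
NAMES = [f"f{i}" for i in range(8)]
OPS = ("create", "link", "rename", "open", "unlink")


def worker(root: str, deadline: float, counts: list, idx: int) -> None:
    """Apply random namespace operations under root until the deadline."""
    rng = random.Random(idx)  # per-thread RNG, seeded for repeatability
    while time.monotonic() < deadline:
        a = os.path.join(root, rng.choice(NAMES))
        b = os.path.join(root, rng.choice(NAMES))
        op = rng.choice(OPS)
        try:
            if op == "create":
                with open(a, "w") as f:
                    f.write("x")
            elif op == "link":
                os.link(a, b)  # hard link: a second name for the same inode
            elif op == "rename":
                os.rename(a, b)  # may atomically replace b
            elif op == "open":
                # Open an existing file that other threads may unlink meanwhile.
                with open(a) as f:
                    f.read()
            else:
                os.unlink(a)
        except OSError:
            # ENOENT, EEXIST and similar errors are the expected outcome of races.
            pass
        counts[idx] += 1


def run_racer(root: str, threads: int = 4, seconds: float = 1.0) -> int:
    """Run `threads` workers against root for `seconds`; return total ops."""
    deadline = time.monotonic() + seconds
    counts = [0] * threads  # one slot per thread, so no shared counter
    pool = [
        threading.Thread(target=worker, args=(root, deadline, counts, i))
        for i in range(threads)
    ]
    for t in pool:
        t.start()
    for t in pool:
        t.join()
    return sum(counts)


if __name__ == "__main__":
    # Placeholder directory: on the real setup this would be a directory on
    # the NFS client's mount of the re-exported Lustre file system.
    with tempfile.TemporaryDirectory() as d:
        print(run_racer(d, threads=4, seconds=1.0), "operations")
```

On a local directory this only produces the expected ENOENT/EEXIST errors. The interesting behaviour, per the report above, shows up on the MDS when the same kind of contention reaches Lustre through the NFS re-export.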

      I'm not sure, but at least the earlier traces looked like they might be this bug. I'm reporting it here because I can still reproduce it with 1.8.6-wc1: https://bugzilla.lustre.org/show_bug.cgi?id=17764

      [MDS:]cat /proc/fs/lustre/version
      lustre: 1.8.6
      kernel: patchless_client
      build: jenkins-wc1--PRISTINE-2.6.18-238.12.1.el5_lustre.g266a955

      Attachments

        1. lustre-log.1311606830.6854.txt
          1.22 MB
          Frederik Ferner
        2. lustre-log.1313079467.7339.txt.bz2
          1.96 MB
          Frederik Ferner
        3. racer-dls.tar.gz
          3 kB
          Frederik Ferner

            People

              Assignee: Zhenyu Xu (bobijam)
              Reporter: Frederik Ferner (ferner, inactive)