Lustre / LU-3356

LBUG LustreError: 3202:0:(mds_open.c:1494:mds_mfd_close()) ASSERTION(pending_child->d_inode != NULL) failed

Details

    • Bug
    • Resolution: Won't Fix
    • Critical
    • None
    • Lustre 1.8.7
    • None
    • 3
    • 8309

    Description

      We have now had the same LBUG twice in one month on the MDS for one of our Lustre file systems.

      The error in syslog on the MDS is this:

      May 18 20:48:56 cs04r-sc-mds03-02 kernel: LustreError: 3202:0:(mds_open.c:1483:mds_mfd_close()) found "orphan" file 1621419:9595d9c8 with link count 0
      May 18 20:48:56 cs04r-sc-mds03-02 kernel: LustreError: 3202:0:(mds_open.c:1494:mds_mfd_close()) ASSERTION(pending_child->d_inode != NULL) failed
      May 18 20:48:56 cs04r-sc-mds03-02 kernel: LustreError: 3202:0:(mds_open.c:1494:mds_mfd_close()) LBUG
      May 18 20:48:56 cs04r-sc-mds03-02 kernel: Pid: 3202, comm: ll_mdt_rdpg_35
      May 18 20:48:56 cs04r-sc-mds03-02 kernel:
      May 18 20:48:56 cs04r-sc-mds03-02 kernel: Call Trace:
      May 18 20:48:56 cs04r-sc-mds03-02 kernel: [<ffffffff889946a1>] libcfs_debug_dumpstack+0x51/0x60 [libcfs]
      May 18 20:48:56 cs04r-sc-mds03-02 kernel: [<ffffffff88994bda>] lbug_with_loc+0x7a/0xd0 [libcfs]
      May 18 20:48:56 cs04r-sc-mds03-02 kernel: [<ffffffff8899cfc0>] tracefile_init+0x0/0x110 [libcfs]
      May 18 20:48:56 cs04r-sc-mds03-02 kernel: [<ffffffff88e4cd06>] mds_mfd_close+0x796/0x1680 [mds]
      May 18 20:48:56 cs04r-sc-mds03-02 kernel: [<ffffffff889e7121>] LNetMDBind+0x301/0x450 [lnet]
      May 18 20:48:56 cs04r-sc-mds03-02 kernel: [<ffffffff88e549f0>] mds_close+0x6e0/0x8d0 [mds]
      May 18 20:48:56 cs04r-sc-mds03-02 kernel: [<ffffffff88e27fab>] mds_handle+0x254b/0x4d10 [mds]
      May 18 20:48:56 cs04r-sc-mds03-02 kernel: [<ffffffff8008e1a4>] enqueue_task+0x41/0x56
      May 18 20:48:56 cs04r-sc-mds03-02 kernel: [<ffffffff8008e20f>] __activate_task+0x56/0x6d
      May 18 20:48:56 cs04r-sc-mds03-02 kernel: [<ffffffff88b05d55>] lustre_msg_get_conn_cnt+0x35/0xf0 [ptlrpc]
      May 18 20:48:56 cs04r-sc-mds03-02 kernel: [<ffffffff88b0f6d9>] ptlrpc_server_handle_request+0x989/0xe00 [ptlrpc]
      May 18 20:48:56 cs04r-sc-mds03-02 kernel: [<ffffffff88b0fe35>] ptlrpc_wait_event+0x2e5/0x310 [ptlrpc]
      May 18 20:48:56 cs04r-sc-mds03-02 kernel: [<ffffffff8008cc1e>] __wake_up_common+0x3e/0x68
      May 18 20:48:56 cs04r-sc-mds03-02 kernel: [<ffffffff88b10dc6>] ptlrpc_main+0xf66/0x1120 [ptlrpc]
      May 18 20:48:56 cs04r-sc-mds03-02 kernel: [<ffffffff8005dfb1>] child_rip+0xa/0x11
      May 18 20:48:56 cs04r-sc-mds03-02 kernel: [<ffffffff88b0fe60>] ptlrpc_main+0x0/0x1120 [ptlrpc]
      May 18 20:48:56 cs04r-sc-mds03-02 kernel: [<ffffffff8005dfa7>] child_rip+0x0/0x11
      May 18 20:48:56 cs04r-sc-mds03-02 kernel:
      May 18 20:48:56 cs04r-sc-mds03-02 kernel: LustreError: dumping log to /tmp/lustre-log.1368906536.3202
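When triaging a log like the one above, it can help to pull the source location out of each LustreError line and to decode the debug dump file name. The following is a minimal sketch, not a Lustre tool: the function names `parse_lustre_error` and `parse_dump_name` are hypothetical, the regular expressions are inferred only from the lines quoted in this report, and reading the dump file name as `lustre-log.<epoch seconds>.<pid>` is an assumption (it is consistent with PID 3202 here and with the syslog time if the host clock is one hour ahead of UTC).

```python
import re
from datetime import datetime, timezone

# Matches "LustreError: <pid>:<n>:(<file>:<line>:<func>()) <message>".
# The second numeric field is not interpreted here.
LUSTRE_ERROR = re.compile(
    r"LustreError: (?P<pid>\d+):\d+:"
    r"\((?P<file>[\w.]+):(?P<line>\d+):(?P<func>\w+)\(\)\) (?P<msg>.*)"
)

# Assumed layout of the dump file name: lustre-log.<epoch seconds>.<pid>
DUMP_NAME = re.compile(r"lustre-log\.(?P<epoch>\d+)\.(?P<pid>\d+)$")


def parse_lustre_error(line):
    """Return pid, file, line, func and msg from a LustreError line, or None."""
    match = LUSTRE_ERROR.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["pid"] = int(fields["pid"])
    fields["line"] = int(fields["line"])
    return fields


def parse_dump_name(path):
    """Return (UTC datetime, pid) decoded from a dump file path, or None."""
    match = DUMP_NAME.search(path)
    if match is None:
        return None
    when = datetime.fromtimestamp(int(match.group("epoch")), tz=timezone.utc)
    return when, int(match.group("pid"))


if __name__ == "__main__":
    sample = ("May 18 20:48:56 cs04r-sc-mds03-02 kernel: LustreError: "
              "3202:0:(mds_open.c:1494:mds_mfd_close()) LBUG")
    print(parse_lustre_error(sample))
    print(parse_dump_name("/tmp/lustre-log.1368906536.3202"))
```

Grouping the parsed records by `file`, `line` and `func` makes it easy to confirm that two LBUG occurrences hit the same assertion, and the decoded dump name ties a dump file back to the thread that tripped it.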

      [bnh65367@cs04r-sc-mds03-02 ~]$ cat /proc/fs/lustre/version
      lustre: 1.8.7.80
      kernel: patchless_client
      build: jenkins-gfa6b90d-PRISTINE-2.6.18-274.3.1.el5_lustre.gb18a13c

      This version has been running on these MDS servers without any problems for quite some time. Without checking, I am not entirely sure why we are running this particular version, but I believe it contains a fix for one issue we have seen frequently.

      Unfortunately, we have so far not been able to identify a reproducer. After the LBUG, and until the fail-over today, at least 4 clients were hanging on every access to the file system; other clients were fine.

      The logs are still available and we can upload them if it helps.

          People

            bobijam Zhenyu Xu
            ferner Frederik Ferner (Inactive)