Details
Type: Bug
Resolution: Won't Fix
Priority: Critical
Fix Version/s: None
Affects Version/s: Lustre 1.8.7
Labels: None
Severity: 3
Rank (Obsolete): 8309
Description
We have now had the same LBUG twice in one month on the MDS for one of our Lustre file systems.
The error in syslog on the MDS is this:
May 18 20:48:56 cs04r-sc-mds03-02 kernel: LustreError: 3202:0:(mds_open.c:1483:mds_mfd_close()) found "orphan" file 1621419:9595d9c8 with link count 0
May 18 20:48:56 cs04r-sc-mds03-02 kernel: LustreError: 3202:0:(mds_open.c:1494:mds_mfd_close()) ASSERTION(pending_child->d_inode != NULL) failed
May 18 20:48:56 cs04r-sc-mds03-02 kernel: LustreError: 3202:0:(mds_open.c:1494:mds_mfd_close()) LBUG
May 18 20:48:56 cs04r-sc-mds03-02 kernel: Pid: 3202, comm: ll_mdt_rdpg_35
May 18 20:48:56 cs04r-sc-mds03-02 kernel:
May 18 20:48:56 cs04r-sc-mds03-02 kernel: Call Trace:
May 18 20:48:56 cs04r-sc-mds03-02 kernel: [<ffffffff889946a1>] libcfs_debug_dumpstack+0x51/0x60 [libcfs]
May 18 20:48:56 cs04r-sc-mds03-02 kernel: [<ffffffff88994bda>] lbug_with_loc+0x7a/0xd0 [libcfs]
May 18 20:48:56 cs04r-sc-mds03-02 kernel: [<ffffffff8899cfc0>] tracefile_init+0x0/0x110 [libcfs]
May 18 20:48:56 cs04r-sc-mds03-02 kernel: [<ffffffff88e4cd06>] mds_mfd_close+0x796/0x1680 [mds]
May 18 20:48:56 cs04r-sc-mds03-02 kernel: [<ffffffff889e7121>] LNetMDBind+0x301/0x450 [lnet]
May 18 20:48:56 cs04r-sc-mds03-02 kernel: [<ffffffff88e549f0>] mds_close+0x6e0/0x8d0 [mds]
May 18 20:48:56 cs04r-sc-mds03-02 kernel: [<ffffffff88e27fab>] mds_handle+0x254b/0x4d10 [mds]
May 18 20:48:56 cs04r-sc-mds03-02 kernel: [<ffffffff8008e1a4>] enqueue_task+0x41/0x56
May 18 20:48:56 cs04r-sc-mds03-02 kernel: [<ffffffff8008e20f>] __activate_task+0x56/0x6d
May 18 20:48:56 cs04r-sc-mds03-02 kernel: [<ffffffff88b05d55>] lustre_msg_get_conn_cnt+0x35/0xf0 [ptlrpc]
May 18 20:48:56 cs04r-sc-mds03-02 kernel: [<ffffffff88b0f6d9>] ptlrpc_server_handle_request+0x989/0xe00 [ptlrpc]
May 18 20:48:56 cs04r-sc-mds03-02 kernel: [<ffffffff88b0fe35>] ptlrpc_wait_event+0x2e5/0x310 [ptlrpc]
May 18 20:48:56 cs04r-sc-mds03-02 kernel: [<ffffffff8008cc1e>] __wake_up_common+0x3e/0x68
May 18 20:48:56 cs04r-sc-mds03-02 kernel: [<ffffffff88b10dc6>] ptlrpc_main+0xf66/0x1120 [ptlrpc]
May 18 20:48:56 cs04r-sc-mds03-02 kernel: [<ffffffff8005dfb1>] child_rip+0xa/0x11
May 18 20:48:56 cs04r-sc-mds03-02 kernel: [<ffffffff88b0fe60>] ptlrpc_main+0x0/0x1120 [ptlrpc]
May 18 20:48:56 cs04r-sc-mds03-02 kernel: [<ffffffff8005dfa7>] child_rip+0x0/0x11
May 18 20:48:56 cs04r-sc-mds03-02 kernel:
May 18 20:48:56 cs04r-sc-mds03-02 kernel: LustreError: dumping log to /tmp/lustre-log.1368906536.3202
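For anyone triaging similar reports, the trace above can be picked apart mechanically. The following is a minimal sketch, not a Lustre tool: it assumes the `LustreError:` prefix always has the shape `pid:number:(file:line:function())` seen in this log, and that the first number in the dump file name is Unix epoch seconds and the second is the PID (the second matches PID 3202 here). The function names and the `second_id` field name are made up for illustration, and the meaning of that second number is not interpreted.

```python
import re
from datetime import datetime, timezone

# Assumed shape of the prefix, taken only from the lines in this report:
#   LustreError: <pid>:<number>:(<file>:<line>:<function>()) <message>
LUSTRE_ERROR = re.compile(
    r"LustreError: (?P<pid>\d+):(?P<second_id>\d+):"
    r"\((?P<file>[\w.]+):(?P<line>\d+):(?P<func>\w+)\(\)\) (?P<msg>.*)"
)

# Assumed shape of the dump file name: lustre-log.<epoch seconds>.<pid>
DUMP_NAME = re.compile(r"lustre-log\.(?P<epoch>\d+)\.(?P<pid>\d+)")


def parse_lustre_error(line):
    """Return the fields of a LustreError line, or None if it has no source location."""
    match = LUSTRE_ERROR.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["pid"] = int(fields["pid"])
    fields["line"] = int(fields["line"])
    return fields


def parse_dump_name(path):
    """Return (UTC datetime, pid) decoded from a lustre-log dump path, or None."""
    match = DUMP_NAME.search(path)
    if match is None:
        return None
    when = datetime.fromtimestamp(int(match.group("epoch")), tz=timezone.utc)
    return when, int(match.group("pid"))


if __name__ == "__main__":
    assertion = (
        "May 18 20:48:56 cs04r-sc-mds03-02 kernel: LustreError: "
        "3202:0:(mds_open.c:1494:mds_mfd_close()) "
        "ASSERTION(pending_child->d_inode != NULL) failed"
    )
    print(parse_lustre_error(assertion))
    print(parse_dump_name("/tmp/lustre-log.1368906536.3202"))
```

Under the epoch assumption, the dump name decodes to 19:48:56 UTC on 18 May 2013, one hour behind the syslog timestamp, which would be consistent with the MDS clock running at UTC+1.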
[bnh65367@cs04r-sc-mds03-02 ~]$ cat /proc/fs/lustre/version
lustre: 1.8.7.80
kernel: patchless_client
build: jenkins-gfa6b90d-PRISTINE-2.6.18-274.3.1.el5_lustre.gb18a13c
This version has been running on these MDS nodes without any problems for quite some time. Without checking, I am not entirely sure why we are running this particular version, but I believe it contains a fix for an issue we have seen frequently.
Unfortunately, we have not yet been able to identify a reproducer. However, from the LBUG until today's fail-over, at least 4 clients hung on every access to the file system, while other clients were fine.
The logs are still available and we can upload them if it helps.