[LU-3356] LBUG LustreError: 3202:0:(mds_open.c:1494:mds_mfd_close()) ASSERTION(pending_child->d_inode != NULL) failed Created: 18/May/13  Updated: 25/Nov/14  Resolved: 25/Nov/14

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 1.8.7
Fix Version/s: None

Type: Bug Priority: Critical
Reporter: Frederik Ferner (Inactive) Assignee: Zhenyu Xu
Resolution: Won't Fix Votes: 0
Labels: None

Attachments: File lustre-logs.tar.gz     File mds03-02-messages    
Severity: 3
Rank (Obsolete): 8309

 Description   

We have now had the same LBUG twice in one month on the MDS for one of our Lustre file systems.

The error in syslog on the MDS is this:

May 18 20:48:56 cs04r-sc-mds03-02 kernel: LustreError: 3202:0:(mds_open.c:1483:mds_mfd_close()) found "orphan" file 1621419:9595d9c8 with link count 0
May 18 20:48:56 cs04r-sc-mds03-02 kernel: LustreError: 3202:0:(mds_open.c:1494:mds_mfd_close()) ASSERTION(pending_child->d_inode != NULL) failed
May 18 20:48:56 cs04r-sc-mds03-02 kernel: LustreError: 3202:0:(mds_open.c:1494:mds_mfd_close()) LBUG
May 18 20:48:56 cs04r-sc-mds03-02 kernel: Pid: 3202, comm: ll_mdt_rdpg_35
May 18 20:48:56 cs04r-sc-mds03-02 kernel:
May 18 20:48:56 cs04r-sc-mds03-02 kernel: Call Trace:
May 18 20:48:56 cs04r-sc-mds03-02 kernel: [<ffffffff889946a1>] libcfs_debug_dumpstack+0x51/0x60 [libcfs]
May 18 20:48:56 cs04r-sc-mds03-02 kernel: [<ffffffff88994bda>] lbug_with_loc+0x7a/0xd0 [libcfs]
May 18 20:48:56 cs04r-sc-mds03-02 kernel: [<ffffffff8899cfc0>] tracefile_init+0x0/0x110 [libcfs]
May 18 20:48:56 cs04r-sc-mds03-02 kernel: [<ffffffff88e4cd06>] mds_mfd_close+0x796/0x1680 [mds]
May 18 20:48:56 cs04r-sc-mds03-02 kernel: [<ffffffff889e7121>] LNetMDBind+0x301/0x450 [lnet]
May 18 20:48:56 cs04r-sc-mds03-02 kernel: [<ffffffff88e549f0>] mds_close+0x6e0/0x8d0 [mds]
May 18 20:48:56 cs04r-sc-mds03-02 kernel: [<ffffffff88e27fab>] mds_handle+0x254b/0x4d10 [mds]
May 18 20:48:56 cs04r-sc-mds03-02 kernel: [<ffffffff8008e1a4>] enqueue_task+0x41/0x56
May 18 20:48:56 cs04r-sc-mds03-02 kernel: [<ffffffff8008e20f>] __activate_task+0x56/0x6d
May 18 20:48:56 cs04r-sc-mds03-02 kernel: [<ffffffff88b05d55>] lustre_msg_get_conn_cnt+0x35/0xf0 [ptlrpc]
May 18 20:48:56 cs04r-sc-mds03-02 kernel: [<ffffffff88b0f6d9>] ptlrpc_server_handle_request+0x989/0xe00 [ptlrpc]
May 18 20:48:56 cs04r-sc-mds03-02 kernel: [<ffffffff88b0fe35>] ptlrpc_wait_event+0x2e5/0x310 [ptlrpc]
May 18 20:48:56 cs04r-sc-mds03-02 kernel: [<ffffffff8008cc1e>] __wake_up_common+0x3e/0x68
May 18 20:48:56 cs04r-sc-mds03-02 kernel: [<ffffffff88b10dc6>] ptlrpc_main+0xf66/0x1120 [ptlrpc]
May 18 20:48:56 cs04r-sc-mds03-02 kernel: [<ffffffff8005dfb1>] child_rip+0xa/0x11
May 18 20:48:56 cs04r-sc-mds03-02 kernel: [<ffffffff88b0fe60>] ptlrpc_main+0x0/0x1120 [ptlrpc]
May 18 20:48:56 cs04r-sc-mds03-02 kernel: [<ffffffff8005dfa7>] child_rip+0x0/0x11
May 18 20:48:56 cs04r-sc-mds03-02 kernel:
May 18 20:48:56 cs04r-sc-mds03-02 kernel: LustreError: dumping log to /tmp/lustre-log.1368906536.3202
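
For triage, the structured LustreError lines in a syslog excerpt like the one above can be pulled out with a small script. The following is only a minimal sketch, assuming the line layout matches the excerpt (timestamp, host, "kernel:", then "LustreError: pid:n:(file:line:function())"). The names LustreErrorEvent, parse_lustre_errors, and find_lbugs are hypothetical helpers for illustration and are not part of Lustre.

```python
import re
from typing import Iterable, Iterator, List, NamedTuple


class LustreErrorEvent(NamedTuple):
    """One parsed LustreError line (hypothetical helper type)."""
    timestamp: str
    host: str
    pid: int
    source_file: str
    line: int
    function: str
    message: str


# Assumed layout, taken from the excerpt in this ticket:
#   May 18 20:48:56 <host> kernel: LustreError: 3202:0:(mds_open.c:1494:mds_mfd_close()) LBUG
LUSTRE_ERROR_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s+[\d:]{8})\s+(?P<host>\S+)\s+kernel:\s+"
    r"LustreError:\s+(?P<pid>\d+):\d+:"
    r"\((?P<file>[^:()]+):(?P<line>\d+):(?P<func>\w+)\(\)\)\s+(?P<msg>.*)$"
)


def parse_lustre_errors(lines: Iterable[str]) -> Iterator[LustreErrorEvent]:
    """Yield an event for every line that matches the assumed layout."""
    for raw in lines:
        m = LUSTRE_ERROR_RE.match(raw.strip())
        if m is None:
            continue  # call-trace lines and other kernel messages are skipped
        yield LustreErrorEvent(
            timestamp=m.group("ts"),
            host=m.group("host"),
            pid=int(m.group("pid")),
            source_file=m.group("file"),
            line=int(m.group("line")),
            function=m.group("func"),
            message=m.group("msg"),
        )


def find_lbugs(lines: Iterable[str]) -> List[LustreErrorEvent]:
    """Return only the events that report a failed assertion or an LBUG."""
    return [
        e for e in parse_lustre_errors(lines)
        if e.message == "LBUG" or e.message.startswith("ASSERTION(")
    ]


# Sample input copied from the syslog excerpt in this ticket.
SAMPLE = [
    'May 18 20:48:56 cs04r-sc-mds03-02 kernel: LustreError: 3202:0:(mds_open.c:1483:mds_mfd_close()) found "orphan" file 1621419:9595d9c8 with link count 0',
    "May 18 20:48:56 cs04r-sc-mds03-02 kernel: LustreError: 3202:0:(mds_open.c:1494:mds_mfd_close()) ASSERTION(pending_child->d_inode != NULL) failed",
    "May 18 20:48:56 cs04r-sc-mds03-02 kernel: LustreError: 3202:0:(mds_open.c:1494:mds_mfd_close()) LBUG",
    "May 18 20:48:56 cs04r-sc-mds03-02 kernel: Pid: 3202, comm: ll_mdt_rdpg_35",
]
```

Calling find_lbugs(SAMPLE) returns the two events raised at mds_open.c:1494, which makes it easier to compare occurrences across several log files.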

[bnh65367@cs04r-sc-mds03-02 ~]$ cat /proc/fs/lustre/version
lustre: 1.8.7.80
kernel: patchless_client
build: jenkins-gfa6b90d-PRISTINE-2.6.18-274.3.1.el5_lustre.gb18a13c

This version has been running on these MDS nodes without any problems for quite some time. Without checking, I am not entirely sure why we are running this particular version, but I believe it contains a fix for an issue we have seen frequently.

Unfortunately, we have not yet been able to identify a reproducer. However, from the time of the LBUG until today's fail-over, at least 4 clients were hanging on every access to the file system; other clients were fine.

The logs are still available and we can upload them if it helps.



 Comments   
Comment by Peter Jones [ 19/May/13 ]

Bobijam

Could you please advise on this one?

Thanks

Peter

Comment by Zhenyu Xu [ 20/May/13 ]

Please upload the logs.

Comment by Dave Bond (Inactive) [ 21/May/13 ]

/var/log/messages from server cs04r-sc-mds03-02

Comment by Dave Bond (Inactive) [ 21/May/13 ]

Lustre log files for cs04r-sc-mds03-02

Comment by Zhenyu Xu [ 22/May/13 ]

Patch tracking at http://review.whamcloud.com/6412

Comment by Frederik Ferner (Inactive) [ 30/May/13 ]

I noticed that the patch fails very early in testing (in lustre-initialization-1) and that the last update was a while ago. We have a maintenance window coming up next week. If there is a patch we should start testing, at least on our test file systems and maybe on the affected file systems, it would be good to have it by then.

Thanks,
Frederik

Comment by Zhenyu Xu [ 30/May/13 ]

The test failure is due to the TT-1072 issue; I think you can test with this patch.

Comment by Peter Jones [ 30/May/13 ]

Frederik

The TT project is not open because it tracks configuration issues in our test lab. So the failure means that verification testing has not yet taken place, rather than that there is a problem with the patch.

Peter

Comment by Peter Jones [ 25/Nov/14 ]

Frederik

I think that this issue is no longer relevant since your upgrade to 2.5.x.

Peter

Generated at Sat Feb 10 01:33:13 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.