[LU-5185] NFS export of DNE: (service.c:193:ptlrpc_save_lock()) ASSERTION( rs->rs_nlocks < 8 ) failed:
Created: 12/Jun/14  Updated: 23/Oct/14  Resolved: 23/Oct/14

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.6.0
Fix Version/s: None

Type: Bug
Priority: Critical
Reporter: Patrick Farrell (Inactive)
Assignee: Lai Siyao
Resolution: Cannot Reproduce
Votes: 0
Labels: None
Environment:

NFS export of DNE Phase 2


Severity: 3
Rank (Obsolete): 14387

 Description   

During Cray testing of NFS export from a SLES11SP3 client, with patch set 20 (from http://review.whamcloud.com/#/c/7476/) applied on both client and server, I hit the assertion below on MDS0, which hosts MDTs 0 and 1. Striped directories are in use on the file system. I ran a modified racer against the NFS export and an unmodified racer on a separate Lustre 2.5.58 client.

Dump of MDS0 is up at:
ftp.whamcloud.com/uploads/LU-3544/LU-3544_140609.tar.gz

<6>Lustre: Skipped 1 previous similar message
<3>LustreError: 7816:0:(mdt_reint.c:1519:mdt_reint_migrate_internal()) centssm2-MDT0000: parent [0x400000400:0x1:0x0] is still on the same MDT, which should be migrated first: rc = -1
<3>LustreError: 7816:0:(mdt_reint.c:1519:mdt_reint_migrate_internal()) Skipped 3 previous similar messages
<3>LustreError: 7298:0:(mdd_dir.c:3957:mdd_migrate()) centssm2-MDD0000: [0x400000401:0x330f:0x0]8 is already opened count 1: rc = -16
<0>LustreError: 7816:0:(service.c:193:ptlrpc_save_lock()) ASSERTION( rs->rs_nlocks < 8 ) failed:
<0>LustreError: 7816:0:(service.c:193:ptlrpc_save_lock()) LBUG
<4>Pid: 7816, comm: mdt00_007
<4>
<4>Call Trace:
<4> [<ffffffffa0b27895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
<4> [<ffffffffa0b27e97>] lbug_with_loc+0x47/0xb0 [libcfs]
<4> [<ffffffffa0ebd656>] ptlrpc_save_lock+0xb6/0xf0 [ptlrpc]
<4> [<ffffffffa15b074b>] mdt_save_lock+0x22b/0x320 [mdt]
<4> [<ffffffffa15b089c>] mdt_object_unlock+0x5c/0x160 [mdt]
<4> [<ffffffffa15b2187>] mdt_object_unlock_put+0x17/0x110 [mdt]
<4> [<ffffffffa15cf18d>] mdt_unlock_list+0x5d/0x1e0 [mdt]
<4> [<ffffffffa15d1e7c>] mdt_reint_migrate_internal+0x109c/0x1b50 [mdt]
<4> [<ffffffffa15d6113>] mdt_reint_rename_or_migrate+0x2a3/0x660 [mdt]
<4> [<ffffffffa0e8abc0>] ? ldlm_blocking_ast+0x0/0x180 [ptlrpc]
<4> [<ffffffffa0e8c230>] ? ldlm_completion_ast+0x0/0x930 [ptlrpc]
<4> [<ffffffffa15d64e3>] mdt_reint_migrate+0x13/0x20 [mdt]
<4> [<ffffffffa15cea81>] mdt_reint_rec+0x41/0xe0 [mdt]
<4> [<ffffffffa15b3e93>] mdt_reint_internal+0x4c3/0x7c0 [mdt]
<4> [<ffffffffa15b471b>] mdt_reint+0x6b/0x120 [mdt]
<4> [<ffffffffa0f182ac>] tgt_request_handle+0x23c/0xac0 [ptlrpc]
<4> [<ffffffffa0ec7d1a>] ptlrpc_main+0xd1a/0x1980 [ptlrpc]
<4> [<ffffffff810096f0>] ? __switch_to+0xd0/0x320
<4> [<ffffffff81528090>] ? thread_return+0x4e/0x76e
<4> [<ffffffffa0ec7000>] ? ptlrpc_main+0x0/0x1980 [ptlrpc]
<4> [<ffffffff8109aee6>] kthread+0x96/0xa0
<4> [<ffffffff8100c20a>] child_rip+0xa/0x20
<4> [<ffffffff8109ae50>] ? kthread+0x0/0xa0
<4> [<ffffffff8100c200>] ? child_rip+0x0/0x20
<4>
<0>Kernel panic - not syncing: LBUG
<4>Pid: 7816, comm: mdt00_007 Not tainted 2.6.32.431.5.1.el6_lustre #1
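
For readers unfamiliar with the assertion, here is a minimal user-space sketch of the pattern the message implies: a per-reply table with a fixed number of lock slots, filled one entry at a time while a list of objects is unlocked. It is not the Lustre source. The limit of 8 is taken from the assertion text, and every name below (model_reply_state, model_save_lock, model_unlock_list) is hypothetical. Whether the migrate path really saved more than eight locks here was never confirmed in this ticket, so treat this as one plausible reading of the trace.

```c
#include <stdbool.h>
#include <stddef.h>

/* Assumption: the slot count mirrors the "rs->rs_nlocks < 8" assertion text. */
#define MODEL_MAX_LOCKS 8

/* Hypothetical stand-in for the reply state that holds saved locks. */
struct model_reply_state {
	int nlocks;                                /* slots currently in use */
	unsigned long long locks[MODEL_MAX_LOCKS]; /* saved lock cookies */
};

/*
 * Save one lock cookie in the reply state.
 * Returns false at the point where the real code asserts and LBUGs,
 * i.e. when all slots are already taken.
 */
static bool model_save_lock(struct model_reply_state *rs,
			    unsigned long long cookie)
{
	if (rs->nlocks >= MODEL_MAX_LOCKS)
		return false;
	rs->locks[rs->nlocks++] = cookie;
	return true;
}

/*
 * Release a list of locks, saving each one in the reply state,
 * loosely following the mdt_unlock_list -> ptlrpc_save_lock path
 * in the call trace. Returns how many locks did not fit.
 */
static int model_unlock_list(struct model_reply_state *rs,
			     const unsigned long long *cookies, size_t n)
{
	int overflow = 0;

	for (size_t i = 0; i < n; i++) {
		if (!model_save_lock(rs, cookies[i]))
			overflow++;
	}
	return overflow;
}
```

In this model, a list of eight locks fits exactly, and any further lock is rejected. The kernel code has no such soft failure: the ninth save trips the assertion and panics the MDS, as seen above.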



 Comments   
Comment by Patrick Farrell (Inactive) [ 12/Jun/14 ]

Note there are 4 MDTs in this file system, on two MDSes.

Comment by Peter Jones [ 12/Jun/14 ]

Lai is looking into this one

Comment by Andreas Dilger [ 18/Jun/14 ]

Patrick, have you ever been able to hit this problem again? The difficulty with bugs hit during racer is that they may be extremely rare combinations of events, and are often not problems seen by real users. If this is something that has only been hit once we may not want to block the 2.6 release waiting for a fix.

Comment by Patrick Farrell (Inactive) [ 18/Jun/14 ]

Andreas - I haven't tried again, and I certainly understand the concern. NFS export testing is not something we do often enough to have a good handle on; it's always a special testing session. I'm especially sensitive to the weird conditions I created here - I wanted to turn up the heat on it and see if I found anything interesting. It's not necessarily relevant to the real world.

I'll take another look at NFS export of DNE, especially since the LU-3544 patch seems to be winding down. I'll try to do that in the next day or so.

Comment by Jodi Levi (Inactive) [ 19/Jun/14 ]

We are reducing priority until we hear back from Patrick.

Comment by Patrick Farrell (Inactive) [ 14/Oct/14 ]

We are not currently planning to test this again, as DNE II was not released as a feature in 2.6. We may test again when DNE II is once again a feature slated for actual release.

For now, I'd suggest closing this bug. If it's seen again in the future, we can open a new one or re-open this.

Comment by Andreas Dilger [ 23/Oct/14 ]

Closing this per discussion with Patrick. It can be reopened as needed.
