[LU-5285] mdt_reconstruct_setattr() calls mdt_attr_get_complex() without checking that object exists Created: 02/Jul/14  Updated: 12/Jan/15  Resolved: 11/Jul/14

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.6.0
Fix Version/s: Lustre 2.6.0, Lustre 2.5.3

Type: Bug Priority: Critical
Reporter: John Hammond Assignee: Alex Zhuravlev
Resolution: Fixed Votes: 0
Labels: mdt, mn4

Issue Links:
Related
is related to LU-5125 lu_object_attr() should return 0 for ... Closed
is related to LU-6085 racer stuck on mutex_lock in ll_setat... Resolved
is related to LU-5388 Interop 2.5.2<->2.6 failure on test s... Resolved
is related to LU-6088 racer test_1: dir_create.sh mutex dea... Resolved
is related to LU-5340 oops in cl_object_top() after reconst... Resolved
Severity: 3
Rank (Obsolete): 14748

 Description   

Running racer on 2 nodes with MDSCOUNT=1 and http://review.whamcloud.com/#/c/5936/ (add more operations to racer) I see this frequently.

[  803.128945] Lustre: lustre-MDT0000: Client 045a894d-d2a7-e744-df1f-a695740583d8 (at 19\
2.168.122.108@tcp) reconnecting
[  803.134578] LustreError: 5113:0:(lu_object.h:852:lu_object_attr()) ASSERTION( ((o)->lo\
_header->loh_attr & LOHA_EXISTS) != 0 ) failed:
[  803.136583] LustreError: 5113:0:(lu_object.h:852:lu_object_attr()) LBUG
[  803.137789] Pid: 5113, comm: mdt00_005
[  803.138439]
[  803.138440] Call Trace:
[  803.139129]  [<ffffffffa02be8c5>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
[  803.140296]  [<ffffffffa02beec7>] lbug_with_loc+0x47/0xb0 [libcfs]
[  803.141337]  [<ffffffffa0c034ac>] mdt_attr_get_complex+0xaec/0xb80 [mdt]
[  803.142457]  [<ffffffffa0c20425>] mdt_reconstruct_setattr+0xd5/0x410 [mdt]
[  803.143631]  [<ffffffffa04737e0>] ? lu_ucred+0x20/0x30 [obdclass]
[  803.144677]  [<ffffffffa0c1fde5>] mdt_reconstruct+0x45/0x120 [mdt]
[  803.145722]  [<ffffffffa0bfdbfa>] mdt_reint_internal+0x6fa/0x7c0 [mdt]
[  803.146820]  [<ffffffffa0bfe24b>] mdt_reint+0x6b/0x120 [mdt]
[  803.147818]  [<ffffffffa06e9c45>] tgt_request_handle+0x245/0xad0 [ptlrpc]
[  803.148977]  [<ffffffffa069ce31>] ptlrpc_main+0xcf1/0x1880 [ptlrpc]
[  803.150053]  [<ffffffffa069c140>] ? ptlrpc_main+0x0/0x1880 [ptlrpc]
[  803.151093]  [<ffffffff8109eab6>] kthread+0x96/0xa0
[  803.151886]  [<ffffffff8100c30a>] child_rip+0xa/0x20
[  803.152714]  [<ffffffff81554710>] ? _spin_unlock_irq+0x30/0x40
[  803.153677]  [<ffffffff8100bb10>] ? restore_args+0x0/0x30
[  803.154572]  [<ffffffff8109ea20>] ? kthread+0x0/0xa0
[  803.155403]  [<ffffffff8100c300>] ? child_rip+0x0/0x20
[  803.156261]

When I hit this I see the following sequence of events:

  1. The MDT crashes.
  2. The client node stays up.
  3. I restart the MDT.
  4. The client resends the setattr.
  5. GOTO 1.


 Comments   
Comment by Andreas Dilger [ 02/Jul/14 ]

Looks like this was added in http://review.whamcloud.com/3782

Comment by John Hammond [ 02/Jul/14 ]

From the source mdt_reconstruct_create() has exactly the same issue. I haven't gotten a crash there but it does pass an object to mdt_attr_get_complex() without checking that it exists. mdt_reconstruct_open() looks OK. Probably all uses of mdt_attr_get_complex() should be audited...

Comment by Alex Zhuravlev [ 03/Jul/14 ]

Andreas, 2.1 doesn't check that either.

why do we need to return the attributes in the setattr path? we're in the reconstruct path and given the original transaction was committed, rep-ack lock was cancelled and another client was able to destroy the object. so we can't return valid attribues. with no LDLM lock the attributes can't be used either?

Comment by Andreas Dilger [ 04/Jul/14 ]

I'm OK with not returning the attributes for setattr if the object is missing. The setattr itself must have been processed, so it shouldn't return an error to the caller (i.e. no -ENOENT), but it is OK if the attributes on the client inode are not updated, and it could even be purged from cache if there is a mechanism to do that (e.g. return only mbo_nlink=0 and mbo_flags=OBD_MD_FLNLINK to the client).

Comment by Alex Zhuravlev [ 09/Jul/14 ]

http://review.whamcloud.com/11025

Comment by Peter Jones [ 11/Jul/14 ]

Landed for 2.6

Generated at Sat Feb 10 01:50:09 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.