[LU-1626] GPF in osc_create Created: 12/Jul/12  Updated: 02/Aug/12  Resolved: 02/Aug/12

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.1.1
Fix Version/s: Lustre 2.3.0, Lustre 2.1.3

Type: Bug Priority: Major
Reporter: Sebastien Piechurski Assignee: Zhenyu Xu
Resolution: Fixed Votes: 0
Labels: None
Environment:

Bull lustre distribution 213 including patch from ORNL-22 and LU-1144


Severity: 2
Rank (Obsolete): 4510

 Description   

An MDS crashed with a general protection fault in osc_create.
This is probably caused by a race: the struct obdo *oa parameter of osc_create is taken from a poisoned lov_request, which suggests the request had already been freed when it was used.

crash> bt
PID: 7769   TASK: ffff881808256790  CPU: 0   COMMAND: "mdt_00"
 #0 [ffff8817a456af50] machine_kexec at ffffffff81027a4b
 #1 [ffff8817a456afb0] crash_kexec at ffffffff810a2db2
 #2 [ffff8817a456b080] oops_end at ffffffff81481730
 #3 [ffff8817a456b0b0] die at ffffffff810071cb
 #4 [ffff8817a456b0e0] do_general_protection at ffffffff814812c2
 #5 [ffff8817a456b110] general_protection at ffffffff81480a95
    [exception RIP: osc_create+101]
    RIP: ffffffffa08d69b5  RSP: ffff8817a456b1c0  RFLAGS: 00010282
    RAX: ffffffffa08d6950  RBX: ffff881792313178  RCX: ffff8817f70e8b00
    RDX: ffff880b622b2d80  RSI: 5a5a5a5a5a5a5a5a  RDI: ffff8817921c8000
    RBP: ffff8817a456b290   R8: ffff8817f70e8b00   R9: 00000000ffffffff
    R10: ffff881792b92000  R11: 00000000ffffff95  R12: ffff8817923124b8
    R13: 5a5a5a5a5a5a5a5a  R14: ffff8817f70e8b00  R15: ffff880b622b2d80
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
 #6 [ffff8817a456b298] lov_check_and_create_object at ffffffffa0929482 [lov]
 #7 [ffff8817a456b308] qos_remedy_create at ffffffffa0929c55 [lov]
 #8 [ffff8817a456b398] lov_fini_create_set at ffffffffa092660e [lov]
 #9 [ffff8817a456b468] lov_create at ffffffffa090f2ed [lov]
#10 [ffff8817a456b5a8] mdd_lov_create at ffffffffa0a897cd [mdd]
#11 [ffff8817a456b688] mdd_create_data at ffffffffa0a93f6e [mdd]
#12 [ffff8817a456b728] cml_create_data at ffffffffa0b41066 [cmm]
#13 [ffff8817a456b7a8] mdt_finish_open at ffffffffa0af6335 [mdt]
#14 [ffff8817a456b878] mdt_reint_open at ffffffffa0af81d2 [mdt]
#15 [ffff8817a456b998] mdt_reint_rec at ffffffffa0adfadf [mdt]
#16 [ffff8817a456b9e8] mdt_reint_internal at ffffffffa0ad7f74 [mdt]
#17 [ffff8817a456ba78] mdt_intent_reint at ffffffffa0ad85f5 [mdt]
#18 [ffff8817a456baf8] mdt_intent_policy at ffffffffa0ad0550 [mdt]
#19 [ffff8817a456bb68] ldlm_lock_enqueue at ffffffffa06eab8a [ptlrpc]
#20 [ffff8817a456bc08] ldlm_handle_enqueue0 at ffffffffa0711777 [ptlrpc]
#21 [ffff8817a456bca8] mdt_enqueue at ffffffffa0ad00ca [mdt]
#22 [ffff8817a456bcd8] mdt_handle_common at ffffffffa0aca865 [mdt]
#23 [ffff8817a456bd58] mdt_regular_handle at ffffffffa0acb875 [mdt]
#24 [ffff8817a456bd68] ptlrpc_main at ffffffffa07409e9 [ptlrpc]
#25 [ffff8817a456bf48] kernel_thread at ffffffff810041aa

0xffffffffa08d69ac <osc_create+92>:     test   %r15,%r15
0xffffffffa08d69af <osc_create+95>:     je     0xffffffffa08d7705
0xffffffffa08d69b5 <osc_create+101>:    mov    0x0(%r13),%rax           <=== R13: 5a5a5a5a5a5a5a5a
0xffffffffa08d69b9 <osc_create+105>:    test   $0x1000000,%eax


0xffffffffa08d6974 <osc_create+36>:     mov    0xe0(%rdi),%r12
0xffffffffa08d697b <osc_create+43>:     mov    %rsi,%r13  <==== r13 value comes from rsi which is second parameter of osc_create
0xffffffffa08d697e <osc_create+46>:     mov    %rdx,%r15


int osc_create(struct obd_export *exp, struct obdo *oa,
                                       ^^^^^^^^^^^^^^^ == 0x5a5a5a5a5a5a5a5a
               struct lov_stripe_md **ea, struct obd_trans_info *oti)
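
The faulting instruction dereferences R13, and the exception is a general protection fault rather than a page fault. A likely reason is that 0x5a5a5a5a5a5a5a5a is not a canonical x86-64 address. The minimal sketch below checks this, assuming 48-bit virtual addresses (4-level paging); the helper name is_canonical48 is hypothetical and not part of Lustre or the kernel.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Assumption: 48-bit virtual addresses (4-level paging).
 * An address is canonical when bits 63..47 are all zero or all one.
 * Dereferencing a non-canonical address raises #GP instead of #PF.
 */
static bool is_canonical48(uint64_t addr)
{
    uint64_t top = addr >> 47;      /* the 17 bits 63..47 */

    return top == 0 || top == 0x1ffff;
}
```

With these assumptions the poisoned R13 value is rejected, while a normal kernel pointer from the same dump, such as RDI (ffff8817921c8000), is accepted.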

The obdo *oa argument is passed by lov_check_and_create_object from its lov_request req, which is entirely poisoned except for the OST index (rq_idx):

crash> lov_request ffff880b622b2d40
struct lov_request {
  rq_oi = {
    oi_policy = {
      l_extent = {
        start = 6510615555426900570, 
        end = 6510615555426900570, 
        gid = 6510615555426900570
      }, 
      l_flock = {
        start = 6510615555426900570, 
        end = 6510615555426900570, 
        owner = 6510615555426900570, 
        blocking_owner = 6510615555426900570, 
        blocking_export = 0x5a5a5a5a5a5a5a5a, 
        pid = 1515870810
      }, 
      l_inodebits = {
        bits = 6510615555426900570
      }
    }, 
    oi_flags = 1515870810, 
    oi_lockh = 0x5a5a5a5a5a5a5a5a, 
    oi_md = 0x5a5a5a5a5a5a5a5a, 
    oi_oa = 0x5a5a5a5a5a5a5a5a, 
    oi_osfs = 0x5a5a5a5a5a5a5a5a, 
    oi_cb_up = 0x5a5a5a5a5a5a5a5a, 
    oi_capa = 0x5a5a5a5a5a5a5a5a
  }, 
  rq_rqset = 0x5a5a5a5a5a5a5a5a, 
  rq_link = {
    next = 0x5a5a5a5a5a5a5a5a, 
    prev = 0x5a5a5a5a5a5a5a5a
  }, 
  rq_idx = 39,                 <== scratch-OST0027
  rq_stripe = 1515870810, 
  rq_complete = 1515870810, 
  rq_rc = 1515870810, 
  rq_buflen = 1515870810, 
  rq_oabufs = 1515870810, 
  rq_pgaidx = 1515870810
}
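
The decimal values in the dump are the same 0x5a byte pattern printed at different widths: 6510615555426900570 is 0x5a5a5a5a5a5a5a5a and 1515870810 is 0x5a5a5a5a. The sketch below reproduces them, assuming that freed memory is filled with the byte 0x5a as the dump suggests. The struct fake_request and both helpers are hypothetical stand-ins, not the real lov_request layout or Lustre code.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Assumption: freed memory is filled with this byte. */
#define POISON_BYTE 0x5a

/* Hypothetical stand-in for a few fields; not the real lov_request. */
struct fake_request {
    uint64_t start;   /* 64-bit field, like l_extent.start */
    uint32_t rq_rc;   /* 32-bit field, like rq_rc */
};

/* Fill the whole structure with the poison byte. */
static void poison_request(struct fake_request *req)
{
    memset(req, POISON_BYTE, sizeof(*req));
}

/* Return 1 if a 64-bit value consists only of poison bytes. */
static int is_poisoned64(uint64_t v)
{
    return v == UINT64_C(0x5a5a5a5a5a5a5a5a);
}
```

A field that still holds a plausible value after such a fill, like rq_idx = 39 here, must have been written after the memory was poisoned.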



 Comments   
Comment by Peter Jones [ 13/Jul/12 ]

Bobijam

Could you please comment on this one?

Thanks

Peter

Comment by Zhenyu Xu [ 16/Jul/12 ]

Patch tracking at http://review.whamcloud.com/3401

lov: fix lov request set finish check race

When several lov_request callbacks are called and one of them is
the last lov_request in the set, the lov_finished_set() checks in
all of them return true, while the following action is supposed
to be called only once for the set. In this case the assumption is
broken and the lov request set's refcount is wrong.

This patch also fixes another glitch: in qos_remedy_create(), when an
OST pool is used, the ost_idx value is not initialized correctly.
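
The race described in the commit message can be illustrated with a minimal sketch. It is not the code from the patch: the struct req_set, its fields, and the function names are hypothetical, and C11 atomics stand in for the kernel's atomic primitives. The racy check returns true for every caller once the last request has completed, while the guarded check lets only the first caller proceed.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical request set; not the real lov_request_set. */
struct req_set {
    int count;                  /* number of requests in the set */
    atomic_int completed;       /* requests completed so far */
    atomic_int finish_checked;  /* how many callers saw "finished" */
};

static void req_set_init(struct req_set *set, int count)
{
    set->count = count;
    atomic_init(&set->completed, 0);
    atomic_init(&set->finish_checked, 0);
}

/* Racy: every caller that runs after the last completion gets true. */
static bool set_finished_racy(struct req_set *set)
{
    return atomic_load(&set->completed) == set->count;
}

/* Guarded: only the first caller to observe completion gets true. */
static bool set_finished_once(struct req_set *set)
{
    if (atomic_load(&set->completed) != set->count)
        return false;
    return atomic_fetch_add(&set->finish_checked, 1) == 0;
}
```

If the follow-up action drops a reference on the set, the racy variant would drop it once per callback, which matches the wrong refcount and the use of a freed lov_request seen in the crash.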

Comment by Zhenyu Xu [ 16/Jul/12 ]

b2_1 patch tracking at http://review.whamcloud.com/3402

Comment by Zhenyu Xu [ 02/Aug/12 ]

patch landed for 2.1.3 and 2.3.0
