[LU-6973] Null pointer access during umount of MDT server if orph_cleanup_sc has not finished Created: 10/Aug/15  Updated: 28/Aug/15  Resolved: 10/Aug/15

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.5.3
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Antoine Percher Assignee: Bruno Faccini (Inactive)
Resolution: Duplicate Votes: 0
Labels: None
Environment:

MDT Server crash


Attachments: Text File trace_debug_lascaux111_fld_server_lookup_NULL_pointer.txt    
Issue Links:
Related
is related to LU-5249 conf-sanity test_32a: NULL pointer in... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

Null pointer access during umount of MDT server if orph_cleanup_sc has not finished.

crash> bt
PID: 31563  TASK: ffff880879176ab0  CPU: 1   COMMAND: "orph_cleanup_sc"
 #0 [ffff880547d9b810] machine_kexec at ffffffff8103b71b
 #1 [ffff880547d9b870] crash_kexec at ffffffff810c9942
 #2 [ffff880547d9b940] oops_end at ffffffff8152f070
 #3 [ffff880547d9b970] no_context at ffffffff8104c80b
 #4 [ffff880547d9b9c0] __bad_area_nosemaphore at ffffffff8104ca95
 #5 [ffff880547d9ba10] bad_area_nosemaphore at ffffffff8104cb63
 #6 [ffff880547d9ba20] __do_page_fault at ffffffff8104d25c
 #7 [ffff880547d9bb40] do_page_fault at ffffffff81530fbe
 #8 [ffff880547d9bb70] page_fault at ffffffff8152e375
    [exception RIP: fld_server_lookup+97]
    RIP: ffffffffa0a50b31  RSP: ffff880547d9bc20  RFLAGS: 00010286
    RAX: ffff8810515df4c0  RBX: 00000002122fc000  RCX: ffff8810426c5078
    RDX: ffff880e6896b400  RSI: ffffffffa0a56b00  RDI: ffff881040aff840
    RBP: ffff880547d9bc70   R8: 0000000015fbb1dc   R9: 0000000000000000
    R10: 092c8d41cd51a9c5  R11: 0000000000000041  R12: 0000000000000000
    R13: ffff8810515df4c0  R14: ffff8810426c5078  R15: ffff8810426c4000
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
 #9 [ffff880547d9bc78] osd_fld_lookup at ffffffffa0d53b8a [osd_ldiskfs]
#10 [ffff880547d9bca8] osd_remote_fid at ffffffffa0d54320 [osd_ldiskfs]
#11 [ffff880547d9bcf8] osd_it_ea_rec at ffffffffa0d6539e [osd_ldiskfs]
#12 [ffff880547d9be38] lod_it_rec at ffffffffa0eec331 [lod]
#13 [ffff880547d9be48] __mdd_orphan_cleanup at ffffffffa0f55050 [mdd]
#14 [ffff880547d9bee8] kthread at ffffffff8109e71e
#15 [ffff880547d9bf48] kernel_thread at ffffffff8100c20a
crash>

This crash appears because ss_server_fld is NULL:

crash> p *(*((struct osd_device *)0xffff8801f827a000).od_dt_dev.dd_lu_dev.ld_site).ld_seq_site
$9 = {
  ss_lu = 0xffff8801f827a150,
  ss_node_id = 0,
  ss_server_fld = 0x0,     <---------- HERE
  ss_client_fld = 0x0,
  ss_server_seq = 0x0,
  ss_control_seq = 0x0,
  ss_control_exp = 0x0,
  ss_client_seq = 0x0
}
From lustre/osd-ldiskfs/osd_handler.c:

int osd_fld_lookup(const struct lu_env *env, struct osd_device *osd,
                   obd_seq seq, struct lu_seq_range *range)
{
....
        LASSERT(ss != NULL);
        fld_range_set_any(range);
        rc = fld_server_lookup(env, ss->ss_server_fld, seq, range);    <--- under some conditions ss->ss_server_fld can be NULL
        if (rc != 0) {
                CERROR("%s: cannot find FLD range for "LPX64": rc = %d\n",
                       osd_name(osd), seq, rc);

Is it possible to stop the orph_cleanup_sc process at the beginning of the MDT umount process to prevent this issue?
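One generic way to get that ordering in kernel code is to stop the worker thread before the structures it dereferences are freed; a minimal sketch using the standard kthread API (names here are hypothetical and this is not the Lustre implementation):

        #include <linux/kthread.h>

        /* Illustrative shutdown ordering for an orphan-cleanup worker. */
        static struct task_struct *orph_task;   /* hypothetical handle */

        static int orph_cleanup_fn(void *data)
        {
                while (!kthread_should_stop()) {
                        /* ... scan and clean up orphan objects ... */
                        schedule_timeout_interruptible(HZ);
                }
                return 0;
        }

        static void mdt_umount_prepare(void)
        {
                /* kthread_stop() blocks until the worker has exited,
                 * so the seq_server_site (including ss_server_fld)
                 * can be torn down safely afterwards. */
                if (orph_task)
                        kthread_stop(orph_task);
        }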



 Comments   
Comment by Bruno Faccini (Inactive) [ 10/Aug/15 ]

Hello Antoine,
My first reading of your bug report makes me think that this could be a dup of LU-5249.
Thus, even though the associated b2_5 patch (http://review.whamcloud.com/#/c/13579) has not yet landed, you may want to give it a try.

Comment by Antoine Percher [ 10/Aug/15 ]

Hello Bruno,
Thanks for your answer; the patch looks good. Would it be possible to have this patch landed ASAP?
I will also ask to have this fix included in the next Lustre T100 release.
You can tag this issue as a dup of LU-5249.

Comment by Peter Jones [ 10/Aug/15 ]

Antoine

The LU-5249 fix has landed for a 2.5.x maintenance release, so you can either rebase on a more current release or apply the patch to your current baseline.

Peter
