[LU-6983] LBUG on osc_extent_find() ASSERTION( (max_end - cur->oe_start) < max_pages ) failed: [35840 -> 511/511] Created: 11/Aug/15  Updated: 08/Feb/18  Resolved: 08/Feb/18

Status: Closed
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.7.0
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Antoine Percher Assignee: Jinshan Xiong (Inactive)
Resolution: Cannot Reproduce Votes: 0
Labels: None
Environment:

RHEL7 lustre client with 2.5.3 lustre server


Attachments: trace_debug_neel1062_osc_extent_find_new.txt
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

LBUG on osc_extent_find() ASSERTION( (max_end - cur->oe_start) < max_pages ) failed: [35840 -> 511/511]

As in LU-6271, after some OST evictions and reconnections under a heavy I/O load,
the client hits an LBUG like this:

[794894.288763] Lustre: store0-OST0045-osc-ffff88201fcb5800: Connection restored to store0-OST0045 (at QQ.P.BBO.FB@o2ib2)
[794896.511870] Lustre: store0-OST01f3-osc-ffff88201fcb5800: Connection restored to store0-OST01f3 (at QQ.P.BBO.II@o2ib2)
...
[794898.170269] LustreError: 40201:0:(osc_cache.c:662:osc_extent_find()) ASSERTION( (max_end - cur->oe_start) < max_pages ) failed: [35840 -> 511/511]
[794898.170280] LustreError: 40201:0:(osc_cache.c:662:osc_extent_find()) LBUG
[794898.170287] Pid: 40201, comm: testsApiC++-gcc
[794898.170287]

The stack of the thread that hit the LBUG was:

crash>  bt
PID: 40201  TASK: ffff880e6f474440  CPU: 6   COMMAND: "testsApiC++-gcc"
 #0 [ffff880eeff93638] machine_kexec at ffffffff8104c4cb
 #1 [ffff880eeff93698] crash_kexec at ffffffff810e1fe2
 #2 [ffff880eeff93768] panic at ffffffff815fd7e1
 #3 [ffff880eeff937e8] lbug_with_loc at ffffffffa0473e5b [libcfs]
 #4 [ffff880eeff93808] osc_extent_find at ffffffffa0becdf2 [osc]
 #5 [ffff880eeff93990] osc_queue_async_io at ffffffffa0be4bf0 [osc]
 #6 [ffff880eeff93ad8] osc_page_cache_add at ffffffffa0bd2463 [osc]
 #7 [ffff880eeff93b00] osc_io_commit_async at ffffffffa0bd9162 [osc]
 #8 [ffff880eeff93b60] cl_io_commit_async at ffffffffa06f4007 [obdclass]
 #9 [ffff880eeff93ba8] lov_io_commit_async at ffffffffa09ecbea [lov]
#10 [ffff880eeff93c08] cl_io_commit_async at ffffffffa06f4007 [obdclass]
#11 [ffff880eeff93c50] vvp_io_write_commit at ffffffffa0b0007a [lustre]
#12 [ffff880eeff93cb0] vvp_io_write_start at ffffffffa0b00aa6 [lustre]
#13 [ffff880eeff93d00] cl_io_start at ffffffffa06f3875 [obdclass]
#14 [ffff880eeff93d28] cl_io_loop at ffffffffa06f6c95 [obdclass]
#15 [ffff880eeff93d58] ll_file_io_generic at ffffffffa0a9f85c [lustre]
#16 [ffff880eeff93e60] ll_file_aio_write at ffffffffa0aa00ce [lustre]
#17 [ffff880eeff93ea8] ll_file_write at ffffffffa0aa02b2 [lustre]
#18 [ffff880eeff93ef8] vfs_write at ffffffff811c65dd
#19 [ffff880eeff93f38] sys_write at ffffffff811c7028
#20 [ffff880eeff93f80] system_call_fastpath at ffffffff81613da9
    RIP: 00007f8d6bbc39fd  RSP: 00007fff791cd238  RFLAGS: 00010216
    RAX: 0000000000000001  RBX: ffffffff81613da9  RCX: 000000000000003f
    RDX: 0000000005c00000  RSI: 00007f8bce395038  RDI: 0000000000000020
    RBP: 00007f8bce395038   R8: 00000000003ffffe   R9: 00000000003ffff4
    R10: 00000000003ffff5  R11: 0000000000000293  R12: 0000000005c00000
    R13: 0000000005c00000  R14: 0000000006f656c0  R15: 0000000005c00000
    ORIG_RAX: 0000000000000001  CS: 0033  SS: 002b

In this case, many of the application's user threads hit the same LBUG at the same time.

Question: could the LU-6271 fix (http://review.whamcloud.com/#/c/14915/) help with this issue?



 Comments   
Comment by Joseph Gmitter (Inactive) [ 11/Aug/15 ]

Jinshan,
Can you have a quick look here to see if you have any guidance on this one?
Thanks.
Joe

Comment by Jinshan Xiong (Inactive) [ 11/Aug/15 ]

If you still have the vmcore available, please dump the client_obd structure in question.

Comment by Antoine Percher [ 12/Aug/15 ]

Hi Jinshan,
I have a vmcore, but I am not on site this week; I can do it next Monday. Sorry for the delay.
Antoine

Comment by Antoine Percher [ 21/Sep/15 ]

Hi Jinshan,
Sorry for the delay ...
You can find the client_obd structure data from my crash dump (thanks to BrunoF for his help).
We also saw that the root cause of the LBUG could be in struct osc_session.os_io.oi_write_osclock.ols_cl.cls_lock.cll_descr:

  crash> p ((struct osc_lock *)0xffff880036883648).ols_cl.cls_lock.cll_descr
  $11 = {
  cld_obj = 0xffff880ec1f66798,
  cld_start = 0x0,
  cld_end = 0x1ff,
  cld_gid = 0x0,
  cld_mode = CLM_WRITE,
  cld_enq_flags = 0x0
}

These values do not match the I/Os in progress, and they explain the 511 (0x1ff) in the LBUG message:
ASSERTION( (max_end - cur->oe_start) < max_pages ) failed: [35840 -> 511/511]
Sorry again for the delay
Antoine
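
To illustrate why a stale lock range would trip this check, here is a minimal sketch of the arithmetic. It assumes the bracketed values in the message read as [oe_start -> oe_end/oe_max_end], that max_end is taken from cld_end (0x1ff = 511), and that the indices are unsigned page offsets like the kernel's pgoff_t. The names page_index_t, extent_span() and extent_span_ok() are hypothetical, and the max_pages value of 256 is only an example; neither comes from the Lustre source or from this ticket.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for an unsigned page index (assumed to behave
 * like the kernel's pgoff_t, i.e. an unsigned long). */
typedef unsigned long page_index_t;

/* Distance between the extent start and the lock's last page, computed
 * in unsigned arithmetic. If max_end is smaller than oe_start, the
 * subtraction wraps around to a very large value instead of going
 * negative. */
unsigned long extent_span(page_index_t max_end, page_index_t oe_start)
{
    return max_end - oe_start;
}

/* Hypothetical helper with the same shape as the failed assertion:
 * (max_end - oe_start) < max_pages. */
bool extent_span_ok(page_index_t max_end, page_index_t oe_start,
                    unsigned long max_pages)
{
    assert(max_pages > 0);
    return extent_span(max_end, oe_start) < max_pages;
}
```

With the values from the ticket, extent_span_ok(511, 35840, 256) returns false: the lock descriptor ends at page 511 while the extent starts at page 35840, so the unsigned difference wraps around and can never be below any realistic max_pages. This is consistent with the observation that the lock descriptor does not match the I/O in progress, but it does not by itself show how the descriptor got into that state.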

Comment by Antoine Percher [ 21/Sep/15 ]

Added the attachment file.
