
[LU-16638] LustreError: 18531:0:(osc_object.c:410:osc_req_attr_set()) LBUG

Details

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Major
    • Affects Version/s: Lustre 2.15.2
    • Environment: RHEL 9.0 client running 2.15.2 with tcp networking.
    • Severity: 3

    Description

      We're seeing a regular crash on one of our clients that re-exports a Lustre volume via NFS to other clients. For a while I thought it was related to atime updates, since we were seeing those in the stack trace, but it's still crashing in the same spot in osc_object after disabling atime.

      [238496.543455] LustreError: 18531:0:(osc_object.c:396:osc_req_attr_set()) page@00000000124db7f5[4 000000001cc24e6a 4 1 0000000000000000]
      [238496.543488] LustreError: 18531:0:(osc_object.c:396:osc_req_attr_set()) vvp-page@00000000644ae261(0:0) vm@0000000013d555b5 17ffffc0002001 4:0 ffff8b33354a0500 540 lru
      [238496.543514] LustreError: 18531:0:(osc_object.c:396:osc_req_attr_set()) lov-page@00000000fac88b5b
      [238496.543532] LustreError: 18531:0:(osc_object.c:396:osc_req_attr_set()) osc-page@00000000c5423838 540: 1< 0x845fed 1 + + > 2< 2211840 0 4096 0x7 0x9 | 0000000000000000 0000000032e9a87e 00000000b31fd886 > 3< 1 0 0 > 4< 0 0 8 156499967 - | - - - + > 5< - - - + | 0 - | 0 - ->
      [238496.543569] LustreError: 18531:0:(osc_object.c:396:osc_req_attr_set()) end page@00000000124db7f5
      [238496.543585] LustreError: 18531:0:(osc_object.c:396:osc_req_attr_set()) uncovered page!
      [238496.543598] LustreError: 18531:0:(ldlm_resource.c:1783:ldlm_resource_dump()) --- Resource: [0xd3409f:0x0:0x0].0x0 (000000004660d5d9) refcount = 3
      [238496.543618] LustreError: 18531:0:(ldlm_resource.c:1787:ldlm_resource_dump()) Granted locks (in reverse order):
      [238496.543635] LustreError: 18531:0:(ldlm_resource.c:1790:ldlm_resource_dump()) ### ### ns: work-OST0003-osc-ffff8b3d1067e800 lock: 00000000552d990c/0x2904edfb430539b2 lrc: 3/1,0 mode: PR/PR res: [0xd3409f:0x0:0x0].0x0 rrc: 4 type: EXT [0->2211839] (req 2146304->2211839) gid 0 flags: 0x800420400020000 nid: local remote: 0x27d356efda730f51 expref: -99 pid: 18893 timeout: 0 lvb_type: 1
      [238496.543687] LustreError: 18531:0:(ldlm_resource.c:1802:ldlm_resource_dump()) Waiting locks:
      [238496.543701] LustreError: 18531:0:(ldlm_resource.c:1804:ldlm_resource_dump()) ### ### ns: work-OST0003-osc-ffff8b3d1067e800 lock: 0000000049878f3e/0x2904edfb430539b9 lrc: 4/1,0 mode: --/PR res: [0xd3409f:0x0:0x0].0x0 rrc: 4 type: EXT [2211840->2277375] (req 2211840->2277375) gid 0 flags: 0x20000 nid: local remote: 0x27d356efda730f5f expref: -99 pid: 18894 timeout: 0 lvb_type: 1
      [238496.543746] Pid: 18531, comm: ptlrpcd_03_06 5.14.0-70.36.1.el9_0.x86_64 #1 SMP PREEMPT Thu Nov 24 11:28:21 EST 2022
      [238496.543762] Call Trace TBD:
      [238496.543767] LustreError: 18531:0:(osc_object.c:410:osc_req_attr_set()) LBUG
      [238496.543779] Pid: 18531, comm: ptlrpcd_03_06 5.14.0-70.36.1.el9_0.x86_64 #1 SMP PREEMPT Thu Nov 24 11:28:21 EST 2022
      [238496.543794] Call Trace TBD:
      [238496.543799] Kernel panic - not syncing: LBUG
      [238496.543807] CPU: 46 PID: 18531 Comm: ptlrpcd_03_06 Kdump: loaded Tainted: P           OE    --------- ---  5.14.0-70.36.1.el9_0.x86_64 #1
      [238496.543827] Hardware name: Supermicro AS -1114S-WN10RT/H12SSW-NTR, BIOS 2.3 12/03/2021
      [238496.543840] Call Trace:
      [238496.543848]  dump_stack_lvl+0x34/0x48
      [238496.543860]  panic+0x102/0x2d4
      [238496.543869]  lbug_with_loc.cold+0x18/0x18 [libcfs]
      [238496.543887]  osc_req_attr_set+0x32a/0x540 [osc]
      [238496.543905]  cl_req_attr_set+0x5e/0x160 [obdclass]
      [238496.543939]  osc_build_rpc+0x4a7/0x11f0 [osc]
      [238496.544421]  osc_send_read_rpc+0x6de/0x810 [osc]
      [238496.545787]  osc_check_rpcs+0x335/0x3c0 [osc]
      [238496.546230]  osc_io_unplug0+0x75/0x90 [osc]
      [238496.546662]  brw_queue_work+0x2f/0xd0 [osc]
      [238496.547086]  work_interpreter+0x32/0x170 [ptlrpc]
      [238496.547527]  ptlrpc_check_set+0x415/0x1ea0 [ptlrpc]
      [238496.547966]  ptlrpcd_check+0x3d0/0x5c0 [ptlrpc]
      [238496.548787]  ptlrpcd+0x20d/0x4a0 [ptlrpc]
      [238496.550000]  kthread+0x149/0x170
      [238496.550732]  ret_from_fork+0x22/0x30
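
      The arithmetic behind the "uncovered page!" assertion can be read straight out of the dump: the osc-page line reports page index 540 (byte offset 2211840), the only granted PR extent lock covers bytes [0, 2211839], and the lock that would cover the page, [2211840, 2277375], is still on the waiting list, so an RPC is being built for a page that no granted lock covers. Below is a minimal userspace sketch of that coverage check, using the values from the dump; it is illustrative only and is not the Lustre source.

      /* Illustrative userspace sketch (not Lustre code): checks whether the
       * page from the dump above falls inside either extent lock. */
      #include <stdio.h>
      #include <stdbool.h>
      #include <stdint.h>

      struct extent { uint64_t start, end; };  /* inclusive byte range */

      static bool covers(const struct extent *l, uint64_t off, uint64_t len)
      {
              return l->start <= off && off + len - 1 <= l->end;
      }

      int main(void)
      {
              const uint64_t page_size  = 4096;
              const uint64_t page_index = 540;  /* from the vvp/osc page dump */
              /* granted PR lock: EXT [0->2211839] in the resource dump */
              const struct extent granted = { 0, 2211839 };
              /* waiting PR lock: EXT [2211840->2277375] */
              const struct extent waiting = { 2211840, 2277375 };

              uint64_t off = page_index * page_size;  /* 540 * 4096 = 2211840 */

              printf("page byte range [%llu, %llu]\n",
                     (unsigned long long)off,
                     (unsigned long long)(off + page_size - 1));
              printf("covered by granted lock: %s\n",
                     covers(&granted, off, page_size) ? "yes" : "no");
              printf("covered by waiting lock: %s\n",
                     covers(&waiting, off, page_size) ? "yes" : "no");
              return 0;
      }

      Run, it reports the page as outside the granted lock but inside the still-waiting lock's range, which is exactly the inconsistency the LBUG asserts against: the page is one byte past the end of the only granted lock.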
      

      This crash is relatively new for us; we started noticing it after we switched from o2ib to tcp to address stability issues in our environment that we believe (we're still investigating) are related to RDMA on RHEL 9 with Omni-Path.

      We have a couple of vmcores from the crash kernel available if desired; however, I'd rather not attach them here.


      Issue Links

        duplicates LU-16412

          Activity


            snehring Shane Nehring added a comment - The kernel patch that corrects this has been incorporated into RHEL 9.2 in RHSA-2023:3723.

            snehring Shane Nehring added a comment - Thank you, Andreas. I've reached out to Red Hat to ask about incorporating that patch into EL9.

            adilger Andreas Dilger added a comment -

            Shane, I was in a meeting while writing my last comment, so I didn't see your previous exchange with Peter until after I submitted my comment.

            You likely missed that this is a duplicate because I only recently edited LU-16412 to include enough information to find it; that information was previously only in a customer ticket. Luckily, you won't have to wait the long time it took us to debug this issue before we found it was a kernel bug.

            adilger Andreas Dilger added a comment -

            This looks like a duplicate of LU-16412, which exposed a bug in the kernel. Patches landed in the mainline kernel recently and were backported to the stable kernels, but they likely still need to be added to vendor kernels. You could potentially speed that process up by filing a ticket with your OS vendor and referencing the lore.kernel.org links in LU-16412 to request that the patch be added to their kernel (it also improves performance in some workloads, so it is a win-win).

            A workaround has also been added to Lustre:
            https://review.whamcloud.com/50277

            and will likely appear in 2.15.3.
            pjones Peter Jones added a comment - Yes - likely

            snehring Shane Nehring added a comment - Sorry, I don't know how I missed that this is a duplicate. I'm assuming this'll end up in b2_15 in time for 2.15.3?

            People

              Assignee: wc-triage WC Triage
              Reporter: snehring Shane Nehring
              Votes: 0
              Watchers: 3
