Lustre / LU-16245

__osd_init_iobuf()) ASSERTION( iobuf->dr_elapsed_valid == 0 )

Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Critical
    • None
    • Affects Version/s: Lustre 2.15.1, Lustre 2.15.4
    • 3

    Description

      [ 8753.247529] LustreError: 37772:0:(osd_io.c:79:__osd_init_iobuf()) ASSERTION( iobuf->dr_elapsed_valid == 0 ) failed: iobuf 000000006eba9531, reqs 0, rw 1, line 1633
      [ 8753.262771] LustreError: 37772:0:(osd_io.c:79:__osd_init_iobuf()) LBUG
      [ 8753.269970] Pid: 37772, comm: mdt_io05_022 5.10.0-60.18.0.50.aarch64 #1 SMP Wed Oct 5 10:58:08 CST 2022
      [ 8753.280021] Call Trace TBD:
      [ 8753.283505] Kernel panic - not syncing: LBUG
      [ 8753.288454] CPU: 59 PID: 37772 Comm: mdt_io05_022 Kdump: loaded Tainted: P           OE     5.10.0-60.18.0.50.aarch64 #1
      [ 8753.299963] Hardware name: Huawei TaiShan 200 (Model 2280)/BC82AMDDA, BIOS 1.38 07/04/2020
      [ 8753.308881] Call trace:
      [ 8753.312014]  dump_backtrace+0x0/0x1e0
      [ 8753.316352]  show_stack+0x20/0x30
      [ 8753.320347]  dump_stack+0xe0/0x148
      [ 8753.324426]  panic+0x170/0x398
      [ 8753.328188]  param_set_delay_minmax.isra.1+0x0/0xd0 [libcfs]
      [ 8753.334552]  __osd_init_iobuf+0x2e8/0x408 [osd_ldiskfs]
      [ 8753.340454]  osd_write_prep+0xec/0x330 [osd_ldiskfs]
      [ 8753.346149]  mdt_obd_preprw+0xaa0/0xc38 [mdt]
      [ 8753.351294]  tgt_brw_write+0x1208/0x2f30 [ptlrpc]
      [ 8753.351367]  tgt_handle_request0+0xd4/0x9b0 [ptlrpc]
      [ 8753.362369]  tgt_request_handle+0x7cc/0x1a30 [ptlrpc]
      [ 8753.368148]  ptlrpc_server_handle_request+0x3bc/0x1218 [ptlrpc]
      [ 8753.374791]  ptlrpc_main+0xdfc/0x16c8 [ptlrpc]
      [ 8753.379910]  kthread+0x130/0x138
      [ 8753.383818]  ret_from_fork+0x10/0x18
      [ 8753.388121] SMP: stopping secondary CPUs
      [ 8753.395179] Starting crashdump kernel...
      [ 8753.399781] Bye!
      

          Activity

            Xinliang Liu added a comment (edited):

            +1 on v2.15.4: the crash occurred on server5 while running the IO500 mdtest-hard-write test. See the attached kernel log vmcore-dmesg.txt.

            Testbed:

            server_num (Arm64): 6
            client_num (x86_64): 5
            cores_per_node: 8
            np: 40
            MDTs: 24, OSTs: 96 (4 MDTs and 16 OSTs per server)

            OS: openEuler 22.03 SP3, kernel 5.10.0-188.0.0.101.oe2203sp3.aarch64.

            IO500 version: io500-sc23_v1

             

            Xinliang Liu added a comment:

            I suspect this issue is similar to LU-12362: nested sleeping primitives might lead to an infinite wait, so that osd_fini_iobuf() is never called, which causes this crash.

            For background on the problem of nested sleeping primitives, see https://lwn.net/Articles/628628/. We might need to fix this issue the same way as LU-12362.

            Gerrit Updater added a comment:

            "fengchunsong <fengchunsong@huawei.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/48905
            Subject: LU-16245 osd-ldiskfs: prevent dr_elapsed_valid assertion
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 5bc624a9c930f5dfd38b62eb661b706c418682e0

            Jason Feng added a comment:

            Do not modify dr_elapsed_valid if osd_fini_iobuf has been invoked.

            The initial value of dr_elapsed_valid is 0. When the I/O completes, dio_complete_routine sets dr_elapsed_valid to 1, and osd_fini_iobuf finally clears it. In the write path, however, wait_event is not called, so nothing guarantees that osd_fini_iobuf runs after dio_complete_routine. If osd_fini_iobuf runs first, the late dio_complete_routine sets dr_elapsed_valid back to 1 after it has been cleared, and the assertion fires when the iobuf is used again.
            With the patch, dr_elapsed_valid still starts at 0, and osd_fini_iobuf sets it to 2. dio_complete_routine changes it to 1 only if it is still 0, so the flag can no longer be modified after osd_fini_iobuf has finished.


            People

              fengchunsong Jason Feng
              fengchunsong Jason Feng
              Votes: 0
              Watchers: 3
