Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-10160

LBUG ldlm_lock.c:2259:__ldlm_reprocess_all()) ASSERTION( list_empty(&rpc_list) ) failed:

Details

    • Bug
    • Resolution: Unresolved
    • Major
    • None
    • Lustre 2.11.0
    • Soak cluster version=2.10.54_13_g84f690e
      lustre-master-ib build 6 with DNE configured
    • 3
    • 9223372036854775807

    Description

      After OSS failover and recovery:

      Oct 25 08:41:02 soak-2 kernel: Lustre: soaked-OST0012: Recovery over after 0:35, of 33 clients 33 recovered and 0 were evicted.
      Oct 25 08:41:03 soak-2 kernel: LustreError: 26237:0:(ofd_io.c:616:ofd_preprw_write()) soaked-OST0012: BRW to missing obj 0x0:1943200
      Oct 25 08:41:03 soak-2 kernel: LustreError: 26369:0:(lustre_dlm.h:1371:ldlm_res_lvbo_update()) delayed lvb init failed (rc -2)
      Oct 25 08:41:03 soak-2 kernel: format at lustre_dlm.h:1097:ldlm_lvbo_fill doesn't end in newline
      Oct 25 08:41:03 soak-2 kernel: LustreError: 26237:0:(ofd_io.c:616:ofd_preprw_write()) Skipped 3 previous similar messages
      Oct 25 08:41:03 soak-2 kernel: format at lustre_dlm.h:1097:ldlm_lvbo_fill doesn't end in newline
      Oct 25 08:41:03 soak-2 kernel: LustreError: 12692:0:(ofd_io.c:616:ofd_preprw_write()) soaked-OST0012: BRW to missing obj 0x780000400:835459
      Oct 25 08:41:03 soak-2 kernel: LustreError: 12692:0:(ofd_io.c:616:ofd_preprw_write()) Skipped 1 previous similar message
      Oct 25 08:41:03 soak-2 kernel: Lustre: soaked-OST0006: Connection restored to e5f19f2d-79a3-0744-c0c5-e9d37cbaf536 (at 192.168.1.118@o2ib)
      Oct 25 08:41:03 soak-2 kernel: Lustre: soaked-OST0000: Connection restored to e5f19f2d-79a3-0744-c0c5-e9d37cbaf536 (at 192.168.1.118@o2ib)
      Oct 25 08:41:03 soak-2 kernel: Lustre: Skipped 43 previous similar messages
      Oct 25 08:41:03 soak-2 kernel: Lustre: Skipped 43 previous similar messages
      Oct 25 08:41:03 soak-2 kernel: LustreError: 26369:0:(lustre_dlm.h:1097:ldlm_lvbo_fill()) lock ffff8800a4621c00: delayed lvb init failed (rc -2)
      Oct 25 08:41:03 soak-2 kernel: LustreError: 26469:0:(lustre_dlm.h:1097:ldlm_lvbo_fill()) lock ffff8800afe75200: delayed lvb init failed (rc -2)
      Oct 25 08:41:03 soak-2 kernel: LustreError: 26469:0:(ldlm_lock.c:2259:__ldlm_reprocess_all()) ASSERTION( list_empty(&rpc_list) ) failed:
      Oct 25 08:41:04 soak-2 kernel: LustreError: 26469:0:(ldlm_lock.c:2259:__ldlm_reprocess_all()) LBUG
      Oct 25 08:41:04 soak-2 kernel: Pid: 26469, comm: ldlm_cn00_003
      Oct 25 08:41:04 soak-2 kernel: #012Call Trace:
      Oct 25 08:41:04 soak-2 kernel: [<ffffffffc0d3e7ae>] libcfs_call_trace+0x4e/0x60 [libcfs]
      Oct 25 08:41:04 soak-2 kernel: [<ffffffffc0d3e83c>] lbug_with_loc+0x4c/0xb0 [libcfs]
      Oct 25 08:41:04 soak-2 kernel: [<ffffffffc112a88b>] __ldlm_reprocess_all+0x32b/0x3a0 [ptlrpc]
      Oct 25 08:41:04 soak-2 kernel: [<ffffffffc112a910>] ldlm_reprocess_all+0x10/0x20 [ptlrpc]
      Oct 25 08:41:04 soak-2 kernel: [<ffffffffc114dc8d>] ldlm_request_cancel+0x46d/0x720 [ptlrpc]
      Oct 25 08:41:04 soak-2 kernel: [<ffffffffc115417a>] ldlm_handle_cancel+0xba/0x250 [ptlrpc]
      Oct 25 08:41:04 soak-2 kernel: [<ffffffffc1154451>] ldlm_cancel_handler+0x141/0x490 [ptlrpc]
      Oct 25 08:41:05 soak-2 kernel: [<ffffffffc118537e>] ptlrpc_server_handle_request+0x24e/0xab0 [ptlrpc]
      Oct 25 08:41:05 soak-2 kernel: [<ffffffffc1182248>] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc]
      Oct 25 08:41:05 soak-2 kernel: [<ffffffff810c4822>] ? default_wake_function+0x12/0x20
      Oct 25 08:41:05 soak-2 kernel: [<ffffffff810ba588>] ? __wake_up_common+0x58/0x90
      Oct 25 08:41:05 soak-2 kernel: [<ffffffffc1188b22>] ptlrpc_main+0xa92/0x1e40 [ptlrpc]
      Oct 25 08:41:05 soak-2 kernel: [<ffffffff81029557>] ? __switch_to+0xd7/0x510
      Oct 25 08:41:05 soak-2 kernel: [<ffffffff816a8f01>] ? __schedule+0x311/0x8b0
      Oct 25 08:41:05 soak-2 kernel: [<ffffffffc1188090>] ? ptlrpc_main+0x0/0x1e40 [ptlrpc]
      Oct 25 08:41:05 soak-2 kernel: [<ffffffff810b098f>] kthread+0xcf/0xe0
      Oct 25 08:41:05 soak-2 kernel: [<ffffffff810b08c0>] ? kthread+0x0/0xe0
      Oct 25 08:41:05 soak-2 kernel: [<ffffffff816b4f18>] ret_from_fork+0x58/0x90
      Oct 25 08:41:05 soak-2 kernel: [<ffffffff810b08c0>] ? kthread+0x0/0xe0
      

      Lustre-log dump attached

      Attachments

        Issue Links

          Activity

            [LU-10160] LBUG ldlm_lock.c:2259:__ldlm_reprocess_all()) ASSERTION( list_empty(&rpc_list) ) failed:

            Li Xi (lixi@ddn.com) uploaded a new patch: https://review.whamcloud.com/36885
            Subject: LU-10160 ldlm: fix message of ldlm_lvbo_fill
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: d7eef551256205643ee70fe8cd102498470cae1d

            gerrit Gerrit Updater added a comment - Li Xi (lixi@ddn.com) uploaded a new patch: https://review.whamcloud.com/36885 Subject: LU-10160 ldlm: fix message of ldlm_lvbo_fill Project: fs/lustre-release Branch: master Current Patch Set: 1 Commit: d7eef551256205643ee70fe8cd102498470cae1d

            Crashed again after full weekend, multiple successful failovers. Crash dump available on spirit.

            [37601.153508] Lustre: soaked-OST000c: Imperative Recovery enabled, recovery window shrunk from 300-900 down to 150-900
            [37601.282466] Lustre: soaked-OST000c: in recovery but waiting for the first client to connect
            [37601.515428] Lustre: soaked-OST000c: Will be in recovery for at least 2:30, or until 33 clients reconnect
            [37610.255005] Lustre: soaked-OST0012: Recovery over after 0:52, of 33 clients 33 recovered and 0 were evicted.
            [37610.255159] LustreError: 5409:0:(ofd_io.c:671:ofd_preprw_write()) soaked-OST0012: BRW to missing obj 0x740000402:491210
            [37610.256286] Lustre: soaked-OST0012: deleting orphan objects from 0x0:830355 to 0x0:831537
            [37610.258346] LustreError: 5387:0:(lustre_dlm.h:1376:ldlm_lvbo_update()) delayed lvb init failed (rc -2)
            [37610.509648] format at lustre_dlm.h:1103:ldlm_lvbo_fill doesn't end in newline
            [37610.509688] LustreError: 157856:0:(lustre_dlm.h:1103:ldlm_lvbo_fill()) lock ffff880820f860c0: delayed lvb init failed (rc -2)
            [37610.509688] LustreError: 157856:0:(ldlm_lock.c:2244:__ldlm_reprocess_all()) ASSERTION( list_empty(&rpc_list) ) failed:
            [37610.509691] LustreError: 157856:0:(ldlm_lock.c:2244:__ldlm_reprocess_all()) LBUG
            [37610.509693] Pid: 157856, comm: ldlm_cn00_011
            [37610.509693]
            Call Trace:
            [37610.509715]  [<ffffffffc0a477ae>] libcfs_call_trace+0x4e/0x60 [libcfs]
            [37610.509726]  [<ffffffffc0a4783c>] lbug_with_loc+0x4c/0xb0 [libcfs]
            [37610.509795]  [<ffffffffc0ddd60b>] __ldlm_reprocess_all+0x30b/0x380 [ptlrpc]
            [37610.509846]  [<ffffffffc0ddd690>] ldlm_reprocess_all+0x10/0x20 [ptlrpc]
            [37610.509904]  [<ffffffffc0e002ad>] ldlm_request_cancel+0x41d/0x710 [ptlrpc]
            [37610.509962]  [<ffffffffc0e0534a>] ldlm_handle_cancel+0xba/0x250 [ptlrpc]
            [37610.510018]  [<ffffffffc0e05621>] ldlm_cancel_handler+0x141/0x490 [ptlrpc]
            [37610.510104]  [<ffffffffc0e3658e>] ptlrpc_server_handle_request+0x24e/0xab0 [ptlrpc]
            [37610.510168]  [<ffffffffc0e33448>] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc]
            [37610.510174]  [<ffffffff810c6452>] ? default_wake_function+0x12/0x20
            [37610.510179]  [<ffffffff810bc0f8>] ? __wake_up_common+0x58/0x90
            [37610.510242]  [<ffffffffc0e39d42>] ptlrpc_main+0xa92/0x1e40 [ptlrpc]
            [37610.510305]  [<ffffffffc0e392b0>] ? ptlrpc_main+0x0/0x1e40 [ptlrpc]
            [37610.510310]  [<ffffffff810b252f>] kthread+0xcf/0xe0
            [37610.510315]  [<ffffffff810c0d34>] ? finish_task_switch+0x54/0x160
            [37610.510318]  [<ffffffff810b2460>] ? kthread+0x0/0xe0
            [37610.510323]  [<ffffffff816b8798>] ret_from_fork+0x58/0x90
            [37610.510326]  [<ffffffff810b2460>] ? kthread+0x0/0xe0
            [37610.510327]
            [37610.510328] Kernel panic - not syncing: LBUG
            
            cliffw Cliff White (Inactive) added a comment - Crashed again after full weekend, multiple successful failovers. Crash dump available on spirit. [37601.153508] Lustre: soaked-OST000c: Imperative Recovery enabled, recovery window shrunk from 300-900 down to 150-900 [37601.282466] Lustre: soaked-OST000c: in recovery but waiting for the first client to connect [37601.515428] Lustre: soaked-OST000c: Will be in recovery for at least 2:30, or until 33 clients reconnect [37610.255005] Lustre: soaked-OST0012: Recovery over after 0:52, of 33 clients 33 recovered and 0 were evicted. [37610.255159] LustreError: 5409:0:(ofd_io.c:671:ofd_preprw_write()) soaked-OST0012: BRW to missing obj 0x740000402:491210 [37610.256286] Lustre: soaked-OST0012: deleting orphan objects from 0x0:830355 to 0x0:831537 [37610.258346] LustreError: 5387:0:(lustre_dlm.h:1376:ldlm_lvbo_update()) delayed lvb init failed (rc -2) [37610.509648] format at lustre_dlm.h:1103:ldlm_lvbo_fill doesn't end in newline [37610.509688] LustreError: 157856:0:(lustre_dlm.h:1103:ldlm_lvbo_fill()) lock ffff880820f860c0: delayed lvb init failed (rc -2) [37610.509688] LustreError: 157856:0:(ldlm_lock.c:2244:__ldlm_reprocess_all()) ASSERTION( list_empty(&rpc_list) ) failed: [37610.509691] LustreError: 157856:0:(ldlm_lock.c:2244:__ldlm_reprocess_all()) LBUG [37610.509693] Pid: 157856, comm: ldlm_cn00_011 [37610.509693] Call Trace: [37610.509715] [<ffffffffc0a477ae>] libcfs_call_trace+0x4e/0x60 [libcfs] [37610.509726] [<ffffffffc0a4783c>] lbug_with_loc+0x4c/0xb0 [libcfs] [37610.509795] [<ffffffffc0ddd60b>] __ldlm_reprocess_all+0x30b/0x380 [ptlrpc] [37610.509846] [<ffffffffc0ddd690>] ldlm_reprocess_all+0x10/0x20 [ptlrpc] [37610.509904] [<ffffffffc0e002ad>] ldlm_request_cancel+0x41d/0x710 [ptlrpc] [37610.509962] [<ffffffffc0e0534a>] ldlm_handle_cancel+0xba/0x250 [ptlrpc] [37610.510018] [<ffffffffc0e05621>] ldlm_cancel_handler+0x141/0x490 [ptlrpc] [37610.510104] [<ffffffffc0e3658e>] ptlrpc_server_handle_request+0x24e/0xab0 [ptlrpc] [37610.510168] [<ffffffffc0e33448>] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc] [37610.510174] [<ffffffff810c6452>] ? default_wake_function+0x12/0x20 [37610.510179] [<ffffffff810bc0f8>] ? __wake_up_common+0x58/0x90 [37610.510242] [<ffffffffc0e39d42>] ptlrpc_main+0xa92/0x1e40 [ptlrpc] [37610.510305] [<ffffffffc0e392b0>] ? ptlrpc_main+0x0/0x1e40 [ptlrpc] [37610.510310] [<ffffffff810b252f>] kthread+0xcf/0xe0 [37610.510315] [<ffffffff810c0d34>] ? finish_task_switch+0x54/0x160 [37610.510318] [<ffffffff810b2460>] ? kthread+0x0/0xe0 [37610.510323] [<ffffffff816b8798>] ret_from_fork+0x58/0x90 [37610.510326] [<ffffffff810b2460>] ? kthread+0x0/0xe0 [37610.510327] [37610.510328] Kernel panic - not syncing: LBUG

            Crashed again with 2.10.55, vmcore-dmesg attached, crash dump available on soak.

            cliffw Cliff White (Inactive) added a comment - Crashed again with 2.10.55, vmcore-dmesg attached, crash dump available on soak.
            green Oleg Drokin added a comment -

            Hm, another unrelated stuff from the logs, seems the obd to target name is not working especially well:

            00000100:00080000:1.0:1508920863.284572:0:26256:0:(import.c:104:ptlrpc_import_enter_resend()) ffff880179734000 0<E4><CF><EE>^G<88><FF><FF>0<E4><CF><EE>^G<88><FF><FF>: changing import state from NEW to RECOVER
            00000100:00080000:1.0:1508920863.284578:0:26256:0:(import.c:1541:ptlrpc_import_recovery_state_machine()) ffff880179734000 0<E4><CF><EE>^G<88><FF><FF>0<E4><CF><EE>^G<88><FF><FF>: changing import state from RECOVER to FULL
            
            green Oleg Drokin added a comment - Hm, another unrelated stuff from the logs, seems the obd to target name is not working especially well: 00000100:00080000:1.0:1508920863.284572:0:26256:0:(import.c:104:ptlrpc_import_enter_resend()) ffff880179734000 0<E4><CF><EE>^G<88><FF><FF>0<E4><CF><EE>^G<88><FF><FF>: changing import state from NEW to RECOVER 00000100:00080000:1.0:1508920863.284578:0:26256:0:(import.c:1541:ptlrpc_import_recovery_state_machine()) ffff880179734000 0<E4><CF><EE>^G<88><FF><FF>0<E4><CF><EE>^G<88><FF><FF>: changing import state from RECOVER to FULL

            We do have multiple earlier hits of the 'BRW to missing obj'
            Example:

            ct 12 08:29:22 soak-3 kernel: Lustre: soaked-OST000d: Will be in recovery for at least 2:30, or until 34 clients reconnect
            Oct 12 08:29:29 soak-3 kernel: Lustre: soaked-OST000d: Connection restored to 19d86900-89d1-0961-b2cc-97d38ab6ed62 (at 192.168.1.137@o2ib)
            Oct 12 08:29:29 soak-3 kernel: Lustre: Skipped 68 previous similar messages
            Oct 12 08:29:40 soak-3 kernel: Lustre: soaked-OST0013: Recovery over after 0:32, of 34 clients 34 recovered and 0 were evicted.
            Oct 12 08:29:41 soak-3 kernel: LustreError: 9714:0:(ofd_io.c:616:ofd_preprw_write()) soaked-OST0007: BRW to missing obj 0x0:51358755
            Oct 12 08:29:52 soak-3 kernel: LustreError: 43590:0:(osd_index.c:223:__osd_xattr_load_by_oid()) soaked-OST000d: can't get bonus, rc = -17
            Oct 12 08:29:53 soak-3 kernel: Lustre: 9643:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1507796987/real 1507796987]  req@ffff880816ff1200 x1581039691499472/t0(0) o38->soaked-MDT0000-lwp-OST000d@192.168.1.109@o2ib:12/10 lens 520/544 e 0 to 1 dl 1507796993 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
            ........
            Oct 13 13:46:50 soak-4 kernel: Lustre: soaked-OST000f: Will be in recovery for at least 2:30, or until 34 clients reconnect
            Oct 13 13:46:57 soak-4 kernel: Lustre: soaked-OST0009: Recovery over after 1:02, of 34 clients 34 recovered and 0 were evicted.
            Oct 13 13:46:57 soak-4 kernel: LustreError: 9585:0:(ofd_io.c:616:ofd_preprw_write()) soaked-OST0009: BRW to missing obj 0x0:67095807
            Oct 13 13:46:57 soak-4 kernel: LustreError: 9585:0:(ofd_io.c:616:ofd_preprw_write()) Skipped 10 previous similar messages
            
            cliffw Cliff White (Inactive) added a comment - We do have multiple earlier hits of the 'BRW to missing obj' Example: ct 12 08:29:22 soak-3 kernel: Lustre: soaked-OST000d: Will be in recovery for at least 2:30, or until 34 clients reconnect Oct 12 08:29:29 soak-3 kernel: Lustre: soaked-OST000d: Connection restored to 19d86900-89d1-0961-b2cc-97d38ab6ed62 (at 192.168.1.137@o2ib) Oct 12 08:29:29 soak-3 kernel: Lustre: Skipped 68 previous similar messages Oct 12 08:29:40 soak-3 kernel: Lustre: soaked-OST0013: Recovery over after 0:32, of 34 clients 34 recovered and 0 were evicted. Oct 12 08:29:41 soak-3 kernel: LustreError: 9714:0:(ofd_io.c:616:ofd_preprw_write()) soaked-OST0007: BRW to missing obj 0x0:51358755 Oct 12 08:29:52 soak-3 kernel: LustreError: 43590:0:(osd_index.c:223:__osd_xattr_load_by_oid()) soaked-OST000d: can't get bonus, rc = -17 Oct 12 08:29:53 soak-3 kernel: Lustre: 9643:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1507796987/real 1507796987] req@ffff880816ff1200 x1581039691499472/t0(0) o38->soaked-MDT0000-lwp-OST000d@192.168.1.109@o2ib:12/10 lens 520/544 e 0 to 1 dl 1507796993 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1 ........ Oct 13 13:46:50 soak-4 kernel: Lustre: soaked-OST000f: Will be in recovery for at least 2:30, or until 34 clients reconnect Oct 13 13:46:57 soak-4 kernel: Lustre: soaked-OST0009: Recovery over after 1:02, of 34 clients 34 recovered and 0 were evicted. Oct 13 13:46:57 soak-4 kernel: LustreError: 9585:0:(ofd_io.c:616:ofd_preprw_write()) soaked-OST0009: BRW to missing obj 0x0:67095807 Oct 13 13:46:57 soak-4 kernel: LustreError: 9585:0:(ofd_io.c:616:ofd_preprw_write()) Skipped 10 previous similar messages

            Niu - Would you please review this error and comment on possible causes?

            Thanks

            jamesanunez James Nunez (Inactive) added a comment - Niu - Would you please review this error and comment on possible causes? Thanks

            People

              green Oleg Drokin
              cliffw Cliff White (Inactive)
              Votes:
              0 Vote for this issue
              Watchers:
              5 Start watching this issue

              Dates

                Created:
                Updated: