Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-18406

client hit LBUG: (osc_cache.c:1098:osc_extent_make_ready()) ASSERTION( last_oap_count > 0 ) failed: last_oap_count -61

    XMLWordPrintable

Details

    • Bug
    • Resolution: Duplicate
    • Minor
    • None
    • Lustre 2.16.0
    • None
    • 3
    • 9223372036854775807

    Description

      1 client hit LBUG in 2.16.0-rc4 soak testing when soak has been running for about 4 days

      [351555.810525] Lustre: 24940:0:(llite_lib.c:4120:ll_dirty_page_discard_warn()) sfa18k03: dirty page discard: 172.25.80.50@tcp:172.25.80.51@tcp:172.25.80.52@tcp:1
      72.25.80.53@tcp:/sfa18k03/fid: [0x28003b736:0x8cbb:0x0]// may get corrupted (rc -108)
      [351555.810535] Lustre: 24940:0:(llite_lib.c:4120:ll_dirty_page_discard_warn()) sfa18k03: dirty page discard: 172.25.80.50@tcp:172.25.80.51@tcp:172.25.80.52@tcp:1
      72.25.80.53@tcp:/sfa18k03/fid: [0x28003b736:0x6fd1:0x0]// may get corrupted (rc -108)
      [351555.946431] LustreError: 207604:0:(vvp_io.c:1921:vvp_io_init()) sfa18k03: refresh file layout [0x28003b719:0x11158:0x0] error -108.
      [351556.013419] Lustre: sfa18k03-MDT0001-mdc-ffff8c70b6dd8800: Connection restored to 172.25.80.52@tcp (at 172.25.80.52@tcp)
      [351556.031730] LustreError: 207604:0:(vvp_io.c:1921:vvp_io_init()) Skipped 33 previous similar messages
      [351615.967171] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
      [351649.110938] rcu:    39-...0: (1 GPs behind) idle=daa/1/0x4000000000000002 softirq=12024628/12024629 fqs=6631 
      [351649.120913]         (detected by 39, t=93156 jiffies, g=52770953, q=137201)
      [351649.127464] NMI backtrace for cpu 39
      [351649.131225] CPU: 39 PID: 0 Comm: swapper/39 Kdump: loaded Tainted: G           OE    --------- -  - 4.18.0-513.24.1.el8_9.x86_64 #1
      [351649.143279] Hardware name: Bull SAS H252-Z10-00/MZ12-HD0-00, BIOS M14a 03/10/2023
      [351649.150986] Call Trace:
      [351649.153621]  <IRQ>
      [351649.155824]  dump_stack+0x41/0x60
      [351649.159331]  nmi_cpu_backtrace.cold.8+0x13/0x4f
      [351649.164055]  ? lapic_can_unplug_cpu.cold.30+0x43/0x43
      [351649.169300]  nmi_trigger_cpumask_backtrace+0xe9/0xee
      [351649.174454]  rcu_dump_cpu_stacks+0xc8/0xfc
      [351649.178743]  rcu_sched_clock_irq.cold.101+0x33/0x215
      [351649.183903]  ? tick_sched_do_timer+0x50/0x50
      [351649.188365]  update_process_times+0x55/0x80
      [351649.192740]  tick_sched_handle+0x22/0x60
      [351649.196848]  tick_sched_timer+0x37/0x80
      [351649.200871]  __hrtimer_run_queues+0x101/0x280
      [351649.205420]  hrtimer_interrupt+0x100/0x220
      [351649.209709]  ? srso_alias_return_thunk+0x5/0xfcdfd
      [351649.214694]  smp_apic_timer_interrupt+0x6a/0x130
      [351649.219503]  apic_timer_interrupt+0xf/0x20
      [351649.223795]  </IRQ>
      [351649.226082] RIP: 0010:cpuidle_enter_state+0xda/0x470
      [351649.231238] Code: e8 bb 1c 98 ff 80 7c 24 0f 00 74 17 9c 58 0f 1f 44 00 00 f6 c4 02 0f 85 4b 03 00 00 31 ff e8 3d 38 9f ff fb 66 0f 1f 44 00 00 <45> 85 f6 0f 
      88 24 01 00 00 49 63 d6 48 8b 4c 24 10 48 2b 0c 24 48
      [351649.250278] RSP: 0018:ffffa1b50c8d3e58 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff13
      [351649.258071] RAX: ffff8caead9f3180 RBX: ffffffff854d1798 RCX: 000000000000001f
      [351649.265426] RDX: 00013fcaf8d20a6c RSI: 000000002db6db6d RDI: 0000000000000000
      [351649.272785] RBP: ffff8c70b1d96c00 R08: 0000000000000002 R09: 0000000000032940
      [351649.280143] R10: 00037fe4cad7b104 R11: ffff8caead9f1dc4 R12: 0000000000000002
      [351649.287501] R13: ffffffff854d16a0 R14: 0000000000000002 R15: 0000000000000002
      [351649.294862]  ? cpuidle_enter_state+0xb5/0x470
      [351649.299412]  cpuidle_enter+0x2c/0x40
      [351649.303179]  do_idle+0x26f/0x2d0
      [351649.306596]  cpu_startup_entry+0x6f/0x80
      [351649.310712]  start_secondary+0x187/0x1d0
      [351649.314829]  secondary_startup_64_no_verify+0xd2/0xdb
      [351877.620015] LustreError: sfa18k03-MDT0003-mdc-ffff8c70b6dd8800: operation ldlm_enqueue to node 172.25.80.53@tcp failed: rc = -107
      [351877.620321] Lustre: sfa18k03-MDT0003-mdc-ffff8c70b6dd8800: Connection to sfa18k03-MDT0003 (at 172.25.80.53@tcp) was lost; in progress operations using this service will wait for recovery to complete
      [351877.631905] LustreError: Skipped 1 previous similar message
      [351877.650117] LustreError: sfa18k03-MDT0003-mdc-ffff8c70b6dd8800: This client was evicted by sfa18k03-MDT0003; in progress operations using this service will fail.
      [351877.670534] LustreError: 217484:0:(vvp_io.c:1921:vvp_io_init()) sfa18k03: refresh file layout [0x2c003de64:0x158ce:0x0] error -108.
      [351877.676425] Lustre: sfa18k03-MDT0003-mdc-ffff8c70b6dd8800: Connection restored to 172.25.80.53@tcp (at 172.25.80.53@tcp)
      [354508.337347] LustreError: sfa18k03-MDT0001-mdc-ffff8c70b6dd8800: operation ost_read to node 172.25.80.52@tcp failed: rc = -116
      [365488.929944] LustreError: sfa18k03-MDT0001-mdc-ffff8c70b6dd8800: operation ldlm_enqueue to node 172.25.80.52@tcp failed: rc = -107
      [365488.941844] Lustre: sfa18k03-MDT0001-mdc-ffff8c70b6dd8800: Connection to sfa18k03-MDT0001 (at 172.25.80.52@tcp) was lost; in progress operations using this service will wait for recovery to complete
      [365488.960111] LustreError: sfa18k03-MDT0001-mdc-ffff8c70b6dd8800: This client was evicted by sfa18k03-MDT0001; in progress operations using this service will fail.
      [365488.975053] LustreError: 237455:0:(llite_lib.c:2018:ll_md_setattr()) md_setattr fails: rc = -108
      [365488.975096] LustreError: 237462:0:(vvp_io.c:1921:vvp_io_init()) sfa18k03: refresh file layout [0x28003b753:0x1e7b9:0x0] error -108.
      [365488.976372] LustreError: 235100:0:(file.c:262:ll_close_inode_openhandle()) sfa18k03-clilmv-ffff8c70b6dd8800: inode [0x28003b753:0x1e7f3:0x0] mdc close failed: rc = -108
      [365488.978006] LustreError: 237557:0:(ldlm_resource.c:1149:ldlm_resource_complain()) sfa18k03-MDT0001-mdc-ffff8c70b6dd8800: namespace resource [0x28003b753:0x1e7b9:0x0].0x0 (00000000adbbab83) refcount nonzero (1) after lock cleanup; forcing cleanup.
      [365488.982357] Lustre: sfa18k03-MDT0001-mdc-ffff8c70b6dd8800: Connection restored to 172.25.80.52@tcp (at 172.25.80.52@tcp)
      [365489.044666] LustreError: 237462:0:(vvp_io.c:1921:vvp_io_init()) Skipped 2 previous similar messages
      [366333.556010] LustreError: sfa18k03-MDT0000-mdc-ffff8c70b6dd8800: operation ost_read to node 172.25.80.50@tcp failed: rc = -116
      [366520.381859] LustreError: sfa18k03-MDT0003-mdc-ffff8c70b6dd8800: operation ldlm_enqueue to node 172.25.80.53@tcp failed: rc = -107
      [366520.381878] Lustre: sfa18k03-MDT0003-mdc-ffff8c70b6dd8800: Connection to sfa18k03-MDT0003 (at 172.25.80.53@tcp) was lost; in progress operations using this service will wait for recovery to complete
      [366520.393781] LustreError: Skipped 6 previous similar messages
      [366520.412207] LustreError: sfa18k03-MDT0003-mdc-ffff8c70b6dd8800: This client was evicted by sfa18k03-MDT0003; in progress operations using this service will fail.
      [366520.432818] LustreError: 246109:0:(vvp_io.c:1921:vvp_io_init()) sfa18k03: refresh file layout [0x2c003deab:0x986f:0x0] error -108.
      [366520.432966] LustreError: 244654:0:(llite_lib.c:2018:ll_md_setattr()) md_setattr fails: rc = -108
      [366520.434530] LustreError: 246176:0:(file.c:262:ll_close_inode_openhandle()) sfa18k03-clilmv-ffff8c70b6dd8800: inode [0x2c003deab:0x9703:0x0] mdc close failed: rc = -108
      [366520.436661] LustreError: 246176:0:(ldlm_resource.c:1149:ldlm_resource_complain()) sfa18k03-MDT0003-mdc-ffff8c70b6dd8800: namespace resource [0x2c003deab:0x9735:0x0].0x0 (0000000076151be1) refcount nonzero (1) after lock cleanup; forcing cleanup.
      [366520.436668] LustreError: 246176:0:(ldlm_resource.c:1149:ldlm_resource_complain()) Skipped 1 previous similar message
      [366520.444818] LustreError: 246109:0:(vvp_io.c:1921:vvp_io_init()) Skipped 49 previous similar messages
      [366520.445680] Lustre: sfa18k03-MDT0003-mdc-ffff8c70b6dd8800: Connection restored to 172.25.80.53@tcp (at 172.25.80.53@tcp)
      [366520.453850] LustreError: 244654:0:(llite_lib.c:2018:ll_md_setattr()) Skipped 1 previous similar message
      [366522.165525] LustreError: 24960:0:(osc_cache.c:1098:osc_extent_make_ready()) ASSERTION( last_oap_count > 0 ) failed: last_oap_count -61
      [366522.177881] LustreError: 24960:0:(osc_cache.c:1098:osc_extent_make_ready()) LBUG
      [366522.185536] CPU: 11 PID: 24960 Comm: ptlrpcd_05_00 Kdump: loaded Tainted: G           OE    --------- -  - 4.18.0-513.24.1.el8_9.x86_64 #1
      [366522.198214] Hardware name: Bull SAS H252-Z10-00/MZ12-HD0-00, BIOS M14a 03/10/2023
      [366522.205928] Call Trace:
      [366522.208576]  dump_stack+0x41/0x60
      [366522.212091]  lbug_with_loc.cold.8+0x5/0x58 [libcfs]
      [366522.217189]  osc_extent_make_ready+0xb8c/0xd00 [osc]
      [366522.222383]  osc_check_rpcs+0x1487/0x1980 [osc]
      [366522.227139]  ? srso_alias_return_thunk+0x5/0xfcdfd
      [366522.232128]  ? set_next_entity+0xb5/0x1e0
      [366522.236342]  ? srso_alias_return_thunk+0x5/0xfcdfd
      [366522.241331]  ? __switch_to+0x10c/0x430
      [366522.245277]  ? srso_alias_return_thunk+0x5/0xfcdfd
      [366522.250266]  ? __schedule+0x2d9/0x870
      [366522.254125]  osc_io_unplug0+0xc0/0x110 [osc]
      [366522.258614]  brw_queue_work+0x2e/0xd0 [osc]
      [366522.263015]  work_interpreter+0x32/0x110 [ptlrpc]
      [366522.267981]  ptlrpc_check_set+0x494/0x1ed0 [ptlrpc]
      [366522.273126]  ptlrpcd+0x85b/0xa70 [ptlrpc]
      [366522.277405]  ? do_wait_intr_irq+0xa0/0xa0
      [366522.281618]  ? ptlrpcd_add_req+0x2f0/0x2f0 [ptlrpc]
      [366522.286763]  kthread+0x134/0x150
      [366522.290184]  ? set_kthread_struct+0x50/0x50
      [366522.294571]  ret_from_fork+0x35/0x40
      [366528.300246] Kernel panic - not syncing: LBUG
      [366528.304722] CPU: 42 PID: 24960 Comm: ptlrpcd_05_00 Kdump: loaded Tainted: G           OE    --------- -  - 4.18.0-513.24.1.el8_9.x86_64 #1
      [366528.326846] Hardware name: Bull SAS H252-Z10-00/MZ12-HD0-00, BIOS M14a 03/10/2023
      [366528.334559] Call Trace:
      [366528.337201]  dump_stack+0x41/0x60
      [366528.340724]  panic+0xe7/0x2ac
      [366528.343890]  lbug_with_loc.cold.8+0x2f/0x58 [libcfs]
      [366528.349069]  osc_extent_make_ready+0xb8c/0xd00 [osc]
      [366528.354266]  osc_check_rpcs+0x1487/0x1980 [osc]
      [366528.359020]  ? srso_alias_return_thunk+0x5/0xfcdfd
      [366528.364012]  ? set_next_entity+0xb5/0x1e0
      [366528.368224]  ? srso_alias_return_thunk+0x5/0xfcdfd
      [366528.373214]  ? __switch_to+0x10c/0x430
      [366528.377167]  ? srso_alias_return_thunk+0x5/0xfcdfd
      [366528.382158]  ? __schedule+0x2d9/0x870
      [366528.386016]  osc_io_unplug0+0xc0/0x110 [osc]
      [366528.390506]  brw_queue_work+0x2e/0xd0 [osc]
      [366528.394908]  work_interpreter+0x32/0x110 [ptlrpc]
      [366528.399873]  ptlrpc_check_set+0x494/0x1ed0 [ptlrpc]
      [366528.405017]  ptlrpcd+0x85b/0xa70 [ptlrpc]
      [366528.409294]  ? do_wait_intr_irq+0xa0/0xa0
      [366528.413508]  ? ptlrpcd_add_req+0x2f0/0x2f0 [ptlrpc]
      [366528.418646]  kthread+0x134/0x150
      [366528.422078]  ? set_kthread_struct+0x50/0x50
      [366528.426461]  ret_from_fork+0x35/0x40
      

      server and client version: 2.16.0-RC4
      coredump can be found on

      [root@wr-es-33 127.0.0.1-2024-10-28-01:37:41]# pwd
      /var/crash/127.0.0.1-2024-10-28-01:37:41
      

      Attachments

        Activity

          People

            wc-triage WC Triage
            sarah Sarah Liu
            Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: