Lustre / LU-18385

Multiple client nodes evicted during 48-hour automated failover/failback (FOFB) operations on Lustre nodes


Details

    • Bug
    • Resolution: Duplicate
    • Critical
    • None
    • Lustre 2.16.0
    • None
    • 3

    Description

      Eight clients were evicted during a 48-hour Lustre FOFB (failover/failback) run on kjcf05/jupiter p1 (2.16.0_RC2_3_gc9482c7). I see slow-reply timeout messages in the kernel log:

      Oct 16 13:05:03 kjcf05n03 kernel: Lustre: 55191:0:(client.c:2363:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1729101817/real 1729101817]  req@00000000a759aadd x1813091528783872/t0(0) o104->kjcf05-MDT0001@47@gni:15/16 lens 328/224 e 0 to 1 dl 1729101903 ref 1 fl Rpc:XQr/0/ffffffff rc 0/-1 job:'' uid:4294967295 gid:4294967295
      Oct 16 13:05:24 kjcf05n03 kernel: Lustre: 123952:0:(mdt_recovery.c:148:mdt_req_from_lrd()) @@@ restoring transno  req@00000000f6d443e1 x1813009165892992/t6425513152718(0) o101->562625ed-9419-4878-8fa6-4aa09f12b386@172@gni:154/0 lens 376/48792 e 0 to 0 dl 1729101974 ref 1 fl Interpret:H/202/0 rc 0/0 job:'' uid:1356 gid:11121
      Oct 16 13:05:24 kjcf05n03 kernel: Lustre: 123952:0:(mdt_recovery.c:148:mdt_req_from_lrd()) Skipped 5 previous similar messages 

      From the kernel log:

      Oct 16 13:05:45 kjcf05n02 kernel: Lustre: MGS: haven't heard from client 4f4dfa42-4fac-4371-8224-b24ea4f77a5b (at 47@gni) in 151 seconds. I think it's dead, and I am evicting it. exp 00000000ae73a149, cur 1729101945 expire 1729101795 last 1729101794
      Oct 16 13:05:45 kjcf05n02 kernel: Lustre: Skipped 3 previous similar messages
      Oct 16 13:05:47 kjcf05n07 kernel: Lustre: kjcf05-OST0003: haven't heard from client 3167877f-3288-467f-b56e-834cf9ac3b3b (at 47@gni) in 151 seconds. I think it's dead, and I am evicting it. exp 0000000056be5bda, cur 1729101947 expire 1729101797 last 1729101796 
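
The eviction messages above carry three epoch timestamps (`cur`, `expire`, `last`). A minimal sketch, assuming `cur` is the time of eviction and `last` is the time the server last heard from the client, shows how to check that the reported idle time equals `cur - last` and how to convert `cur` to UTC for comparison with other logs. The function name `parse_eviction` and the regular expression are illustrative and not part of Lustre.

```python
import re
from datetime import datetime, timezone

# Illustrative pattern for the "haven't heard from client" eviction message.
EVICT_RE = re.compile(
    r"(?P<target>\S+): haven't heard from client (?P<uuid>[0-9a-f-]+) "
    r"\(at (?P<nid>[^)]+)\) in (?P<idle>\d+) seconds\..*?"
    r"cur (?P<cur>\d+) expire (?P<expire>\d+) last (?P<last>\d+)"
)


def parse_eviction(line):
    """Return the fields of one eviction message, or None if it does not match."""
    m = EVICT_RE.search(line)
    if m is None:
        return None
    cur, expire, last = (int(m[k]) for k in ("cur", "expire", "last"))
    return {
        "target": m["target"],
        "client_uuid": m["uuid"],
        "nid": m["nid"],
        "idle_reported": int(m["idle"]),
        # Assumption: idle time is the gap between eviction and last contact.
        "idle_computed": cur - last,
        "cur": cur,
        "expire": expire,
        "last": last,
        "evicted_at_utc": datetime.fromtimestamp(cur, tz=timezone.utc),
    }


if __name__ == "__main__":
    line = (
        "Oct 16 13:05:45 kjcf05n02 kernel: Lustre: MGS: haven't heard from "
        "client 4f4dfa42-4fac-4371-8224-b24ea4f77a5b (at 47@gni) in 151 seconds. "
        "I think it's dead, and I am evicting it. exp 00000000ae73a149, "
        "cur 1729101945 expire 1729101795 last 1729101794"
    )
    info = parse_eviction(line)
    print(info["target"], info["idle_reported"], info["idle_computed"])
    print(info["evicted_at_utc"].isoformat())
```

For the MGS line, both the reported and computed idle times are 151 seconds, and `cur` converts to 18:05:45 UTC, which agrees with the 13:05:45 syslog timestamp if the server clock is at the same -05:00 offset as the client console log.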

      From the client console log:

      2024-10-16T13:03:32.276608-05:00 c0-0c0s11n3 LustreError: kjcf05-OST0002-osc-ffff888d62c6b800: This client was evicted by kjcf05-OST0002; in progress operations using this service will fail.
      2024-10-16T13:03:32.276661-05:00 c0-0c0s11n3 LustreError: 12805:0:(import.c:1633:ptlrpc_import_recovery_state_machine()) ASSERTION( !obd_lbug_on_eviction ) failed: LBUG upon eviction
      2024-10-16T13:03:32.276699-05:00 c0-0c0s11n3 LustreError: 12805:0:(import.c:1633:ptlrpc_import_recovery_state_machine()) LBUG
      2024-10-16T13:03:32.276719-05:00 c0-0c0s11n3 CPU: 31 PID: 12805 Comm: ptlrpcd_rcv Tainted: P           O       5.3.18-24.46_6.0.24-cray_ari_c #1 SLE15-SP2 (unreleased)
      2024-10-16T13:03:32.276734-05:00 c0-0c0s11n3 Hardware name: Cray Inc. Cascade/Cascade, BIOS 4.6.5 09/05/2019
      2024-10-16T13:03:32.276753-05:00 c0-0c0s11n3 Call Trace:
      2024-10-16T13:03:32.276798-05:00 c0-0c0s11n3 dump_stack+0x7a/0xa5
      2024-10-16T13:03:32.276817-05:00 c0-0c0s11n3 lbug_with_loc+0x42/0xa0 [libcfs]
      2024-10-16T13:03:32.276831-05:00 c0-0c0s11n3 ptlrpc_import_recovery_state_machine+0x53e/0xa20 [ptlrpc]
      2024-10-16T13:03:32.276849-05:00 c0-0c0s11n3 ? import_set_state_nolock+0x13c/0x180 [ptlrpc]
      2024-10-16T13:03:32.276866-05:00 c0-0c0s11n3 ptlrpc_connect_interpret+0x1053/0x2810 [ptlrpc]
      2024-10-16T13:03:32.276913-05:00 c0-0c0s11n3 ptlrpc_check_set+0x22c/0x2120 [ptlrpc]
      2024-10-16T13:03:32.276929-05:00 c0-0c0s11n3 ? __next_timer_interrupt+0xe0/0xe0
      2024-10-16T13:03:32.276941-05:00 c0-0c0s11n3 ptlrpcd+0x94f/0xa20 [ptlrpc]
      2024-10-16T13:03:32.276955-05:00 c0-0c0s11n3 ? trace_hardirqs_on+0x38/0xe0
      2024-10-16T13:03:32.276968-05:00 c0-0c0s11n3 ? do_wait_intr_irq+0x90/0x90
      2024-10-16T13:03:32.276982-05:00 c0-0c0s11n3 kthread+0x120/0x140
      2024-10-16T13:03:32.277034-05:00 c0-0c0s11n3 ? ptlrpcd_ctl_init+0x180/0x180 [ptlrpc]
      2024-10-16T13:03:32.277051-05:00 c0-0c0s11n3 ? kthread_create_worker_on_cpu+0x70/0x70
      2024-10-16T13:03:32.277067-05:00 c0-0c0s11n3 ret_from_fork+0x3a/0x50
      2024-10-16T13:03:32.277080-05:00 c0-0c0s11n3 Kernel panic - not syncing: LBUG
      2024-10-16T13:03:32.277101-05:00 c0-0c0s11n3 CPU: 31 PID: 12805 Comm: ptlrpcd_rcv Tainted: P           O       5.3.18-24.46_6.0.24-cray_ari_c #1 SLE15-SP2 (unreleased)
      2024-10-16T13:03:32.277115-05:00 c0-0c0s11n3 Hardware name: Cray Inc. Cascade/Cascade, BIOS 4.6.5 09/05/2019
      2024-10-16T13:03:32.277131-05:00 c0-0c0s11n3 Call Trace:
      2024-10-16T13:03:32.277145-05:00 c0-0c0s11n3 dump_stack+0x7a/0xa5
      2024-10-16T13:03:32.277158-05:00 c0-0c0s11n3 panic+0xfd/0x2c9
      2024-10-16T13:03:32.277172-05:00 c0-0c0s11n3 ? __next_timer_interrupt+0xe0/0xe0
      2024-10-16T13:03:32.277190-05:00 c0-0c0s11n3 ? try_to_del_timer_sync+0x53/0x80
      2024-10-16T13:03:32.277206-05:00 c0-0c0s11n3 lbug_with_loc+0x9b/0xa0 [libcfs]
      2024-10-16T13:03:32.277224-05:00 c0-0c0s11n3 ptlrpc_import_recovery_state_machine+0x53e/0xa20 [ptlrpc]
      2024-10-16T13:03:32.277240-05:00 c0-0c0s11n3 ? import_set_state_nolock+0x13c/0x180 [ptlrpc]
      2024-10-16T13:03:32.277256-05:00 c0-0c0s11n3 ptlrpc_connect_interpret+0x1053/0x2810 [ptlrpc]
      2024-10-16T13:03:32.277274-05:00 c0-0c0s11n3 ptlrpc_check_set+0x22c/0x2120 [ptlrpc]
      2024-10-16T13:03:32.277290-05:00 c0-0c0s11n3 ? __next_timer_interrupt+0xe0/0xe0
      2024-10-16T13:03:32.277309-05:00 c0-0c0s11n3 ptlrpcd+0x94f/0xa20 [ptlrpc]
      2024-10-16T13:03:32.277325-05:00 c0-0c0s11n3 ? trace_hardirqs_on+0x38/0xe0
      2024-10-16T13:03:32.277341-05:00 c0-0c0s11n3 ? do_wait_intr_irq+0x90/0x90
      2024-10-16T13:03:32.277356-05:00 c0-0c0s11n3 kthread+0x120/0x140
      2024-10-16T13:03:32.277372-05:00 c0-0c0s11n3 ? ptlrpcd_ctl_init+0x180/0x180 [ptlrpc]
      2024-10-16T13:03:32.277389-05:00 c0-0c0s11n3 ? kthread_create_worker_on_cpu+0x70/0x70
      2024-10-16T13:03:32.277406-05:00 c0-0c0s11n3 ret_from_fork+0x3a/0x50
      2024-10-16T13:03:32.277422-05:00 c0-0c0s11n3 Shutting down cpus with NMI
      2024-10-16T13:03:32.277439-05:00 c0-0c0s11n3 Kernel Offset: disabled
      2024-10-16T13:03:32.277456-05:00 c0-0c0s11n3 ---[ end Kernel panic - not syncing: LBUG ]---

      FOFB operations around the time of the first eviction:

      2024-10-16t12:47:02 Test 6  -- operation panic failover of kjcf05n06
      2024-10-16t13:13:19 Test 6  -- operation panic failback of kjcf05n06
      2024-10-16t13:29:45 Test 7  -- operation cscli failover of kjcf05n05
      2024-10-16t13:48:23 Test 7  -- operation cscli failback of kjcf05n05 
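
To see which FOFB operation was in effect when an event was logged, the schedule above can be searched for the most recent operation at or before the event time. The following is a minimal sketch, assuming the schedule and the log timestamps are in the same local time zone (the schedule lines carry no offset). The names `SCHEDULE`, `parse_ts`, and `last_operation_before` are illustrative.

```python
from bisect import bisect_right
from datetime import datetime

# FOFB schedule copied from the ticket; it must stay sorted by time.
SCHEDULE = [
    ("2024-10-16t12:47:02", "Test 6 panic failover of kjcf05n06"),
    ("2024-10-16t13:13:19", "Test 6 panic failback of kjcf05n06"),
    ("2024-10-16t13:29:45", "Test 7 cscli failover of kjcf05n05"),
    ("2024-10-16t13:48:23", "Test 7 cscli failback of kjcf05n05"),
]


def parse_ts(text):
    """Parse the schedule's timestamp format (lowercase 't' separator)."""
    return datetime.strptime(text.upper(), "%Y-%m-%dT%H:%M:%S")


def last_operation_before(event_time, schedule=SCHEDULE):
    """Return the label of the latest operation at or before event_time."""
    times = [parse_ts(ts) for ts, _ in schedule]
    idx = bisect_right(times, event_time)
    return None if idx == 0 else schedule[idx - 1][1]


if __name__ == "__main__":
    # Client LBUG on eviction, from the console log (local time, offset dropped).
    lbug = datetime(2024, 10, 16, 13, 3, 32)
    print(last_operation_before(lbug))
```

Under that assumption, the client LBUG at 13:03:32 falls after the panic failover of kjcf05n06 and before its failback.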

            People

              wc-triage WC Triage
              prasannakumar Prasannakumar Nagasubramani