[LU-18385] Multiple client nodes evicted during 48 hours of automated Fail Over/Fail Back operations on Lustre nodes

Details

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Critical
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.16.0
    • Labels: None
    • Severity: 3
    • Rank (Obsolete): 9223372036854775807

    Description

      Eight clients were evicted during a 48-hour Lustre FOFB (Fail Over/Fail Back) run on kjcf05/jupiter p1 (2.16.0_RC2_3_gc9482c7). The kernel log shows slow-reply timeouts:

      Oct 16 13:05:03 kjcf05n03 kernel: Lustre: 55191:0:(client.c:2363:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1729101817/real 1729101817]  req@00000000a759aadd x1813091528783872/t0(0) o104->kjcf05-MDT0001@47@gni:15/16 lens 328/224 e 0 to 1 dl 1729101903 ref 1 fl Rpc:XQr/0/ffffffff rc 0/-1 job:'' uid:4294967295 gid:4294967295
      Oct 16 13:05:24 kjcf05n03 kernel: Lustre: 123952:0:(mdt_recovery.c:148:mdt_req_from_lrd()) @@@ restoring transno  req@00000000f6d443e1 x1813009165892992/t6425513152718(0) o101->562625ed-9419-4878-8fa6-4aa09f12b386@172@gni:154/0 lens 376/48792 e 0 to 0 dl 1729101974 ref 1 fl Interpret:H/202/0 rc 0/0 job:'' uid:1356 gid:11121
      Oct 16 13:05:24 kjcf05n03 kernel: Lustre: 123952:0:(mdt_recovery.c:148:mdt_req_from_lrd()) Skipped 5 previous similar messages
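
The epoch fields in the slow-reply message can be cross-checked against the syslog timestamp on the same line. The following is a minimal sketch, not part of the original report. It assumes that `sent` and `dl` are Unix epoch seconds and that the server logs use the UTC-05:00 offset shown in the client console log. The function name `slow_reply_times` is hypothetical, and the log line is abbreviated to the fields used.

```python
import re
from datetime import datetime, timedelta, timezone

# Assumption: server syslog timestamps are in UTC-05:00, the same offset
# that the client console log prints explicitly.
LOCAL_TZ = timezone(timedelta(hours=-5))

# The slow-reply line from the kernel log, abbreviated to the fields used here.
LINE = ("Request sent has timed out for slow reply: "
        "[sent 1729101817/real 1729101817] ... dl 1729101903 ref 1")


def slow_reply_times(line):
    """Return (sent time, deadline time, seconds between them) for one line."""
    sent = int(re.search(r"\[sent (\d+)/", line).group(1))
    deadline = int(re.search(r"\bdl (\d+)\b", line).group(1))
    sent_local = datetime.fromtimestamp(sent, LOCAL_TZ)
    deadline_local = datetime.fromtimestamp(deadline, LOCAL_TZ)
    return sent_local, deadline_local, deadline - sent


sent, deadline, allowed = slow_reply_times(LINE)
print(sent.strftime("%H:%M:%S"), deadline.strftime("%H:%M:%S"), allowed)
# prints: 13:03:37 13:05:03 86
```

Under those assumptions the request was sent at 13:03:37 and its deadline passed 86 seconds later at 13:05:03, which matches the timestamp on the log line.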

      From the kernel log:

      Oct 16 13:05:45 kjcf05n02 kernel: Lustre: MGS: haven't heard from client 4f4dfa42-4fac-4371-8224-b24ea4f77a5b (at 47@gni) in 151 seconds. I think it's dead, and I am evicting it. exp 00000000ae73a149, cur 1729101945 expire 1729101795 last 1729101794
      Oct 16 13:05:45 kjcf05n02 kernel: Lustre: Skipped 3 previous similar messages
      Oct 16 13:05:47 kjcf05n07 kernel: Lustre: kjcf05-OST0003: haven't heard from client 3167877f-3288-467f-b56e-834cf9ac3b3b (at 47@gni) in 151 seconds. I think it's dead, and I am evicting it. exp 0000000056be5bda, cur 1729101947 expire 1729101797 last 1729101796 
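
The eviction messages carry enough fields to check the reported idle time directly. The sketch below is illustrative only. The regular expression and the helper name `parse_eviction` are assumptions written against the two lines quoted above, not a general parser for Lustre logs.

```python
import re

# The two eviction messages quoted above, without the syslog prefix.
LINES = [
    "Lustre: MGS: haven't heard from client 4f4dfa42-4fac-4371-8224-b24ea4f77a5b "
    "(at 47@gni) in 151 seconds. I think it's dead, and I am evicting it. "
    "exp 00000000ae73a149, cur 1729101945 expire 1729101795 last 1729101794",
    "Lustre: kjcf05-OST0003: haven't heard from client "
    "3167877f-3288-467f-b56e-834cf9ac3b3b "
    "(at 47@gni) in 151 seconds. I think it's dead, and I am evicting it. "
    "exp 0000000056be5bda, cur 1729101947 expire 1729101797 last 1729101796",
]

# Hypothetical pattern matching only the message format shown in this report.
EVICT_RE = re.compile(
    r"Lustre: (?P<target>\S+): haven't heard from client (?P<uuid>[0-9a-f-]+) "
    r"\(at (?P<nid>\S+)\) in (?P<silent>\d+) seconds\..*"
    r"cur (?P<cur>\d+) expire (?P<expire>\d+) last (?P<last>\d+)"
)


def parse_eviction(line):
    """Extract target, client UUID, NID and the epoch fields from one message."""
    match = EVICT_RE.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    for key in ("silent", "cur", "expire", "last"):
        fields[key] = int(fields[key])
    return fields


for line in LINES:
    ev = parse_eviction(line)
    # cur - last is the time since the server last heard from the client.
    print(ev["target"], ev["nid"], ev["cur"] - ev["last"], ev["silent"])
# prints: MGS 47@gni 151 151
# prints: kjcf05-OST0003 47@gni 151 151
```

In both messages `cur - last` equals the 151 seconds reported in the text, and both evictions name the same NID, 47@gni.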

      From the client console log:

      2024-10-16T13:03:32.276608-05:00 c0-0c0s11n3 LustreError: kjcf05-OST0002-osc-ffff888d62c6b800: This client was evicted by kjcf05-OST0002; in progress operations using this service will fail.
      2024-10-16T13:03:32.276661-05:00 c0-0c0s11n3 LustreError: 12805:0:(import.c:1633:ptlrpc_import_recovery_state_machine()) ASSERTION( !obd_lbug_on_eviction ) failed: LBUG upon eviction
      2024-10-16T13:03:32.276699-05:00 c0-0c0s11n3 LustreError: 12805:0:(import.c:1633:ptlrpc_import_recovery_state_machine()) LBUG
      2024-10-16T13:03:32.276719-05:00 c0-0c0s11n3 CPU: 31 PID: 12805 Comm: ptlrpcd_rcv Tainted: P           O       5.3.18-24.46_6.0.24-cray_ari_c #1 SLE15-SP2 (unreleased)
      2024-10-16T13:03:32.276734-05:00 c0-0c0s11n3 Hardware name: Cray Inc. Cascade/Cascade, BIOS 4.6.5 09/05/2019
      2024-10-16T13:03:32.276753-05:00 c0-0c0s11n3 Call Trace:
      2024-10-16T13:03:32.276798-05:00 c0-0c0s11n3 dump_stack+0x7a/0xa5
      2024-10-16T13:03:32.276817-05:00 c0-0c0s11n3 lbug_with_loc+0x42/0xa0 [libcfs]
      2024-10-16T13:03:32.276831-05:00 c0-0c0s11n3 ptlrpc_import_recovery_state_machine+0x53e/0xa20 [ptlrpc]
      2024-10-16T13:03:32.276849-05:00 c0-0c0s11n3 ? import_set_state_nolock+0x13c/0x180 [ptlrpc]
      2024-10-16T13:03:32.276866-05:00 c0-0c0s11n3 ptlrpc_connect_interpret+0x1053/0x2810 [ptlrpc]
      2024-10-16T13:03:32.276913-05:00 c0-0c0s11n3 ptlrpc_check_set+0x22c/0x2120 [ptlrpc]
      2024-10-16T13:03:32.276929-05:00 c0-0c0s11n3 ? __next_timer_interrupt+0xe0/0xe0
      2024-10-16T13:03:32.276941-05:00 c0-0c0s11n3 ptlrpcd+0x94f/0xa20 [ptlrpc]
      2024-10-16T13:03:32.276955-05:00 c0-0c0s11n3 ? trace_hardirqs_on+0x38/0xe0
      2024-10-16T13:03:32.276968-05:00 c0-0c0s11n3 ? do_wait_intr_irq+0x90/0x90
      2024-10-16T13:03:32.276982-05:00 c0-0c0s11n3 kthread+0x120/0x140
      2024-10-16T13:03:32.277034-05:00 c0-0c0s11n3 ? ptlrpcd_ctl_init+0x180/0x180 [ptlrpc]
      2024-10-16T13:03:32.277051-05:00 c0-0c0s11n3 ? kthread_create_worker_on_cpu+0x70/0x70
      2024-10-16T13:03:32.277067-05:00 c0-0c0s11n3 ret_from_fork+0x3a/0x50
      2024-10-16T13:03:32.277080-05:00 c0-0c0s11n3 Kernel panic - not syncing: LBUG
      2024-10-16T13:03:32.277101-05:00 c0-0c0s11n3 CPU: 31 PID: 12805 Comm: ptlrpcd_rcv Tainted: P           O       5.3.18-24.46_6.0.24-cray_ari_c #1 SLE15-SP2 (unreleased)
      2024-10-16T13:03:32.277115-05:00 c0-0c0s11n3 Hardware name: Cray Inc. Cascade/Cascade, BIOS 4.6.5 09/05/2019
      2024-10-16T13:03:32.277131-05:00 c0-0c0s11n3 Call Trace:
      2024-10-16T13:03:32.277145-05:00 c0-0c0s11n3 dump_stack+0x7a/0xa5
      2024-10-16T13:03:32.277158-05:00 c0-0c0s11n3 panic+0xfd/0x2c9
      2024-10-16T13:03:32.277172-05:00 c0-0c0s11n3 ? __next_timer_interrupt+0xe0/0xe0
      2024-10-16T13:03:32.277190-05:00 c0-0c0s11n3 ? try_to_del_timer_sync+0x53/0x80
      2024-10-16T13:03:32.277206-05:00 c0-0c0s11n3 lbug_with_loc+0x9b/0xa0 [libcfs]
      2024-10-16T13:03:32.277224-05:00 c0-0c0s11n3 ptlrpc_import_recovery_state_machine+0x53e/0xa20 [ptlrpc]
      2024-10-16T13:03:32.277240-05:00 c0-0c0s11n3 ? import_set_state_nolock+0x13c/0x180 [ptlrpc]
      2024-10-16T13:03:32.277256-05:00 c0-0c0s11n3 ptlrpc_connect_interpret+0x1053/0x2810 [ptlrpc]
      2024-10-16T13:03:32.277274-05:00 c0-0c0s11n3 ptlrpc_check_set+0x22c/0x2120 [ptlrpc]
      2024-10-16T13:03:32.277290-05:00 c0-0c0s11n3 ? __next_timer_interrupt+0xe0/0xe0
      2024-10-16T13:03:32.277309-05:00 c0-0c0s11n3 ptlrpcd+0x94f/0xa20 [ptlrpc]
      2024-10-16T13:03:32.277325-05:00 c0-0c0s11n3 ? trace_hardirqs_on+0x38/0xe0
      2024-10-16T13:03:32.277341-05:00 c0-0c0s11n3 ? do_wait_intr_irq+0x90/0x90
      2024-10-16T13:03:32.277356-05:00 c0-0c0s11n3 kthread+0x120/0x140
      2024-10-16T13:03:32.277372-05:00 c0-0c0s11n3 ? ptlrpcd_ctl_init+0x180/0x180 [ptlrpc]
      2024-10-16T13:03:32.277389-05:00 c0-0c0s11n3 ? kthread_create_worker_on_cpu+0x70/0x70
      2024-10-16T13:03:32.277406-05:00 c0-0c0s11n3 ret_from_fork+0x3a/0x50
      2024-10-16T13:03:32.277422-05:00 c0-0c0s11n3 Shutting down cpus with NMI
      2024-10-16T13:03:32.277439-05:00 c0-0c0s11n3 Kernel Offset: disabled
      2024-10-16T13:03:32.277456-05:00 c0-0c0s11n3 ---[ end Kernel panic - not syncing: LBUG ]---

      FOFB operations around the time of the first eviction:

      2024-10-16t12:47:02 Test 6  -- operation panic failover of kjcf05n06
      2024-10-16t13:13:19 Test 6  -- operation panic failback of kjcf05n06
      2024-10-16t13:29:45 Test 7  -- operation cscli failover of kjcf05n05
      2024-10-16t13:48:23 Test 7  -- operation cscli failback of kjcf05n05 
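
To place an eviction within this schedule, its timestamp can be compared with the operation start times. The sketch below is only an illustration. It assumes the FOFB test log and the client console log use the same local time zone, and the helper name `operation_in_progress` is hypothetical.

```python
from datetime import datetime

# FOFB operations from the test log above: (start time, test, operation).
OPERATIONS = [
    ("2024-10-16T12:47:02", "Test 6", "panic failover of kjcf05n06"),
    ("2024-10-16T13:13:19", "Test 6", "panic failback of kjcf05n06"),
    ("2024-10-16T13:29:45", "Test 7", "cscli failover of kjcf05n05"),
    ("2024-10-16T13:48:23", "Test 7", "cscli failback of kjcf05n05"),
]


def operation_in_progress(event_time, operations=OPERATIONS):
    """Return the latest operation started at or before event_time, or None."""
    event = datetime.fromisoformat(event_time)
    latest = None
    for start, test, op in operations:
        if datetime.fromisoformat(start) <= event:
            latest = (start, test, op)
    return latest


# Client eviction time from the console log, truncated to whole seconds.
# Assumption: both logs are in the same local time zone.
EVICTION = "2024-10-16T13:03:32"
start, test, op = operation_in_progress(EVICTION)
elapsed = datetime.fromisoformat(EVICTION) - datetime.fromisoformat(start)
print(test, op, int(elapsed.total_seconds()))
# prints: Test 6 panic failover of kjcf05n06 990
```

Under that assumption, the client eviction at 13:03:32 falls 990 seconds (16.5 minutes) after the Test 6 panic failover of kjcf05n06 began and before its failback at 13:13:19.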

          Activity

            pjones Peter Jones added a comment -

            Note that RC4 now exists with the suspected patch reverted

            pjones Peter Jones added a comment -

            Hi there

            We believe that this is a regression introduced by the LU-17906 patch, which was merged between RC1 and RC2. We are experimenting with options to either fix it in place or revert it. You could verify this once RC4 is tagged or, in the meantime, test with RC1, which does not contain that patch.

            Peter


            People

              Assignee: wc-triage WC Triage
              Reporter: prasannakumar Prasannakumar Nagasubramani