Lustre / LU-16511

recovery-mds-scale test_failover_ost: client was evicted by lustre-MDT0000: lock callback timer expired

Details

    • Bug
    • Resolution: Unresolved
    • Minor

    Description

      This issue was created by maloo for Elena <elena.gryaznova@hpe.com>

      This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/fb180efc-20f3-44e8-9ce3-5eaae34faed0

      test_failover_ost failed with the following error:

      test_failover_ost returned 1
      

      Test session details:
      clients: https://build.whamcloud.com/job/lustre-reviews/91846 - 4.18.0-348.7.1.el8_5.x86_64
      servers: https://build.whamcloud.com/job/lustre-reviews/91846 - 4.18.0-348.23.1.el8_lustre.x86_64

      recovery-mds-scale.test_failover_ost.dmesg.trevis-101vm5.1674753520.log:

      [  573.886865] LNet: Added LNI 10.240.44.244@tcp [8/256/0/180]
      [  573.888136] LNet: Accept all, port 7988
      [  574.964677] Lustre: Mounted lustre-client
      

      recovery-mds-scale.test_failover_ost.dmesg.trevis-101vm5.1674753520.log:

      [58558.816712] Lustre: DEBUG MARKER: ==== Check clients loads AFTER failover -- failure NOT OK
      [58562.233744] Lustre: DEBUG MARKER: ps auxwww | grep -v grep | grep -q run_IOR.sh
      [58563.354293] Lustre: DEBUG MARKER: /usr/sbin/lctl mark ost7 failed over 8 times, and counting...
      [58563.979740] Lustre: DEBUG MARKER: ost7 failed over 8 times, and counting...
      [59372.896018] LustreError: 11-0: lustre-MDT0000-mdc-ffff9b0de7ee6800: operation mds_close to node 10.240.44.248@tcp failed: rc = -107
      [59372.897999] Lustre: lustre-MDT0000-mdc-ffff9b0de7ee6800: Connection to lustre-MDT0000 (at 10.240.44.248@tcp) was lost; in progress operations using this service will wait for recovery to complete
      [59372.900740] Lustre: Skipped 2 previous similar messages
      [59372.901933] LustreError: 167-0: lustre-MDT0000-mdc-ffff9b0de7ee6800: This client was evicted by lustre-MDT0000; in progress operations using this service will fail.
      [59372.904876] LustreError: 17673:0:(file.c:242:ll_close_inode_openhandle()) lustre-clilmv-ffff9b0de7ee6800: inode [0x200000bd3:0x4d8d:0x0] mdc close failed: rc = -5
      [59372.910935] LustreError: 2562353:0:(file.c:5188:ll_inode_revalidate_fini()) lustre: revalidate FID [0x200000007:0x1:0x0] error: rc = -108
      [59372.911729] Lustre: lustre-MDT0000-mdc-ffff9b0de7ee6800: Connection restored to 10.240.44.248@tcp (at 10.240.44.248@tcp)
      [59372.914684] Lustre: Skipped 2 previous similar messages
      

      console.trevis-101vm9.log:

      [58245.625662] Lustre: DEBUG MARKER: ==== Check clients loads AFTER failover -- failure NOT OK
      [58250.145970] Lustre: DEBUG MARKER: /usr/sbin/lctl mark ost7 failed over 8 times, and counting...
      [58250.950319] Lustre: DEBUG MARKER: ost7 failed over 8 times, and counting...
      [59018.942424] Lustre: mdt00_008: service thread pid 9989 was inactive for 62.213 seconds. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
      [59018.945437] Pid: 9989, comm: mdt00_008 4.18.0-348.23.1.el8_lustre.x86_64 #1 SMP Wed Jan 4 16:53:58 UTC 2023
      [59018.947001] Call Trace TBD:
      [59018.947614] [<0>] ldlm_completion_ast+0x7ac/0x900 [ptlrpc]
      [59018.948578] [<0>] ldlm_cli_enqueue_local+0x307/0x860 [ptlrpc]
      [59018.949573] [<0>] mdt_object_local_lock+0x506/0xb30 [mdt]
      [59018.950486] [<0>] mdt_object_lock_internal+0x18d/0x4a0 [mdt]
      [59018.951439] [<0>] mdt_reint_object_lock+0x27/0x60 [mdt]
      [59018.952329] [<0>] mdt_reint_striped_lock+0x67/0x490 [mdt]
      [59018.953237] [<0>] mdt_reint_unlink+0xac0/0x1580 [mdt]
      [59018.954097] [<0>] mdt_reint_rec+0x117/0x270 [mdt]
      [59018.954911] [<0>] mdt_reint_internal+0x4bc/0x7d0 [mdt]
      [59018.955784] [<0>] mdt_reint+0x5d/0x110 [mdt]
      [59018.956553] [<0>] tgt_request_handle+0xc8c/0x19c0 [ptlrpc]
      [59018.957506] [<0>] ptlrpc_server_handle_request+0x31d/0xbc0 [ptlrpc]
      [59018.958575] [<0>] ptlrpc_main+0xc48/0x1540 [ptlrpc]
      [59018.959401] [<0>] kthread+0x116/0x130
      [59018.960033] [<0>] ret_from_fork+0x35/0x40
      [59059.902193] LustreError: 8396:0:(ldlm_lockd.c:261:expired_lock_main()) ### lock callback timer expired after 103s: evicting client at 10.240.44.244@tcp  ns: mdt-lustre-MDT0000_UUID lock: 000000003bcad40d/0x760476a02b02eeb5 lrc: 3/0,0 mode: PR/PR res: [0x200000bd3:0x4d8d:0x0].0x0 bits 0x12/0x0 rrc: 4 type: IBT gid 0 flags: 0x60200400000020 nid: 10.240.44.244@tcp remote: 0x56d2fb307219d267 expref: 12 pid: 10042 timeout: 59058 lvb_type: 0
      [59311.538342] Lustre: DEBUG MARKER: /usr/sbin/lctl mark Duration:               82800
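
      The two excerpts can be tied together by FID: the resource of the lock whose callback timer expired on the MDS is the same inode whose close failed on the evicted client. The following is only a minimal sketch of that cross-check, not part of the test framework. The names `FID_RE` and `fids_in` are hypothetical helpers, and the two sample strings are shortened copies of the log lines above.

```python
import re

# FID as printed in the logs above: [seq:oid:ver], each field in hex.
# (Hypothetical helper pattern; not taken from any Lustre tool.)
FID_RE = re.compile(r"\[(0x[0-9a-fA-F]+):(0x[0-9a-fA-F]+):(0x[0-9a-fA-F]+)\]")


def fids_in(line):
    """Return every FID found in a log line as a (seq, oid, ver) int tuple."""
    return [tuple(int(part, 16) for part in m.groups())
            for m in FID_RE.finditer(line)]


# Shortened copy of the client dmesg line (trevis-101vm5).
client_close = (
    "[59372.904876] LustreError: 17673:0:(file.c:242:ll_close_inode_openhandle()) "
    "lustre-clilmv-ffff9b0de7ee6800: inode [0x200000bd3:0x4d8d:0x0] "
    "mdc close failed: rc = -5"
)

# Shortened copy of the MDS console line (trevis-101vm9).
mds_evict = (
    "[59059.902193] LustreError: 8396:0:(ldlm_lockd.c:261:expired_lock_main()) "
    "### lock callback timer expired after 103s: evicting client at "
    "10.240.44.244@tcp ns: mdt-lustre-MDT0000_UUID mode: PR/PR "
    "res: [0x200000bd3:0x4d8d:0x0].0x0 bits 0x12/0x0"
)

# FIDs that appear on both sides of the eviction.
shared = set(fids_in(client_close)) & set(fids_in(mds_evict))
for seq, oid, ver in sorted(shared):
    print(f"FID on both client and MDS: [{seq:#x}:{oid:#x}:{ver:#x}]")
```

      The timestamps are deliberately not compared, because the two nodes report their own uptimes and the clocks are not aligned between the logs.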
      
      

      VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV
      recovery-mds-scale test_failover_ost - test_failover_ost returned 1

          People

            wc-triage WC Triage
            maloo Maloo