LU-12175: sanity test 208 fails with 'lease broken over recovery'


Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.13.0, Lustre 2.14.0, Lustre 2.12.4, Lustre 2.12.5, Lustre 2.12.6
    • Components: None
    • Labels: DNE
    • Severity: 3

    Description

      sanity test_208 fails in subtest 4 ('lease can sustain over recovery') with the error 'lease broken over recovery'. These failures started on 8 April 2019. So far, this test is only failing in DNE testing.
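
      Since the failures are DNE-specific, a hedged way to try to reproduce this outside of autotest is to rerun only test_208 against a filesystem formatted with multiple MDTs. This is a sketch only; the paths and the MDSCOUNT value below are illustrative assumptions, not values taken from the failed sessions.

      # Rerun only sanity test_208; MDSCOUNT must match the number of MDTs
      # the test filesystem was actually formatted with (2 is an example).
      cd /usr/lib64/lustre/tests
      MDSCOUNT=2 ONLY=208 sh sanity.sh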

      Looking at the test_suite log for a recent failure, https://testing.whamcloud.com/test_sets/03bcb85a-5ace-11e9-9720-52540065bddc, we see

      == sanity test 208: Exclusive open =================================================================== 12:01:04 (1554811264)
      ==== test 1: verify get lease work
      read lease(1) has applied.
      ==== test 2: verify lease can be broken by upcoming open
      no lease applied.
      ==== test 3: verify lease can't be granted if an open already exists
      multiop: cannot get READ lease, ext 0: Device or resource busy (16)
      multiop: apply/unlock lease error: Device or resource busy
      ==== test 4: lease can sustain over recovery
      Failing mds1 on trevis-40vm9
      …
      trevis-40vm6: trevis-40vm6.trevis.whamcloud.com: executing wait_import_state_mount (FULL|IDLE) mdc.lustre-MDT0000-mdc-*.mds_server_uuid
      trevis-40vm7: trevis-40vm7.trevis.whamcloud.com: executing wait_import_state_mount (FULL|IDLE) mdc.lustre-MDT0000-mdc-*.mds_server_uuid
      trevis-40vm7: CMD: trevis-40vm7.trevis.whamcloud.com lctl get_param -n at_max
      trevis-40vm7: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
      trevis-40vm6: CMD: trevis-40vm6.trevis.whamcloud.com lctl get_param -n at_max
      trevis-40vm6: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
      multiop: expect lease exists
      no lease applied.
       sanity test_208: @@@@@@ FAIL: lease broken over recovery 
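
      Subtest 4 takes a read lease on a file, fails over mds1, and then checks that the lease is still held. A rough sketch of that flow is below; it is an approximation rather than the verbatim sanity.sh code, with the multiop command string inferred from the messages above ('eR' apply a read lease, '_' wait for a signal, 'E+' expect the lease to still exist, 'eU' release it, 'c' close).

      # Approximate reconstruction of subtest 4 (paths are illustrative)
      multiop /mnt/lustre/f208 oO_CREAT:O_RDWR:eR_E+eUc &
      pid=$!
      sleep 2
      fail mds1   # test-framework.sh helper: fail over and restart MDT0000
      # On SIGUSR1 multiop checks the lease; a non-zero exit means it was lost
      kill -USR1 $pid && wait $pid || echo "lease broken over recovery"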
      

      Looking at the dmesg log for MDS1 and MDS3 (vm9), we see some errors:

      [ 7777.539199] Lustre: DEBUG MARKER: == sanity test 208: Exclusive open =================================================================== 12:01:04 (1554811264)
      [ 7780.799167] Lustre: DEBUG MARKER: grep -c /mnt/lustre-mds1' ' /proc/mounts || true
      [ 7781.106991] Lustre: DEBUG MARKER: umount -d /mnt/lustre-mds1
      [ 7781.266880] Lustre: Failing over lustre-MDT0000
      [ 7781.267747] Lustre: Skipped 1 previous similar message
      [ 7781.457907] Lustre: lustre-MDT0000: Not available for connect from 10.9.3.150@tcp (stopping)
      [ 7781.459369] Lustre: Skipped 19 previous similar messages
      [ 7781.460162] LustreError: 11-0: lustre-MDT0000-osp-MDT0002: operation mds_statfs to node 0@lo failed: rc = -107
      [ 7781.460167] Lustre: lustre-MDT0000-osp-MDT0002: Connection to lustre-MDT0000 (at 0@lo) was lost; in progress operations using this service will wait for recovery to complete
      [ 7781.460168] Lustre: Skipped 1 previous similar message
      [ 7781.612729] LustreError: 137-5: lustre-MDT0000_UUID: not available for connect from 10.9.5.234@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
      [ 7781.615597] LustreError: Skipped 5 previous similar messages
      [ 7781.631586] Lustre: server umount lustre-MDT0000 complete
      [ 7782.702609] Lustre: DEBUG MARKER: lsmod | grep lnet > /dev/null &&
      lctl dl | grep ' ST ' || true
      [ 7783.012452] Lustre: DEBUG MARKER: modprobe dm-flakey;
      			 dmsetup targets | grep -q flakey
      [ 7793.042085] Lustre: 17623:0:(client.c:2134:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1554811273/real 1554811273]  req@ffff8bf39b3e0900 x1630329757439344/t0(0) o400->MGC10.9.3.149@tcp@0@lo:26/25 lens 224/224 e 0 to 1 dl 1554811280 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
      [ 7793.046530] LustreError: 166-1: MGC10.9.3.149@tcp: Connection to MGS (at 0@lo) was lost; in progress operations using this service will fail
      [ 7793.384485] Lustre: DEBUG MARKER: hostname
      [ 7793.752701] Lustre: DEBUG MARKER: modprobe dm-flakey;
      			 dmsetup targets | grep -q flakey
      [ 7794.077024] Lustre: DEBUG MARKER: dmsetup status /dev/mapper/mds1_flakey >/dev/null 2>&1
      [ 7794.380875] Lustre: DEBUG MARKER: dmsetup status /dev/mapper/mds1_flakey 2>&1
      [ 7794.685614] Lustre: DEBUG MARKER: test -b /dev/mapper/mds1_flakey
      [ 7794.986331] Lustre: DEBUG MARKER: e2label /dev/mapper/mds1_flakey
      [ 7795.394940] Lustre: DEBUG MARKER: mkdir -p /mnt/lustre-mds1; mount -t lustre   /dev/mapper/mds1_flakey /mnt/lustre-mds1
      [ 7795.730148] LDISKFS-fs (dm-6): mounted filesystem with ordered data mode. Opts: user_xattr,errors=remount-ro,no_mbcache,nodelalloc
      [ 7795.756692] Lustre: osd-ldiskfs create tunables for lustre-MDT0000
      [ 7798.648231] Lustre: MGS: Connection restored to bfee9086-166d-21e0-ae35-a479c8b23d83 (at 10.9.5.234@tcp)
      [ 7798.650043] Lustre: Skipped 66 previous similar messages
      [ 7799.066744] Lustre: Evicted from MGS (at 10.9.3.149@tcp) after server handle changed from 0x8eaf7145dd9f3b79 to 0x8eaf7145ddb23d36
      [ 7799.411459] Lustre: lustre-MDT0000: Imperative Recovery not enabled, recovery window 60-180
      [ 7799.801070] Lustre: 25291:0:(llog.c:606:llog_process_thread()) lustre-MDT0002-osp-MDT0000: invalid length 0 in llog [0x1:0x80000402:0x2]record for index 0/6
      [ 7799.804286] LustreError: 25291:0:(lod_dev.c:434:lod_sub_recovery_thread()) lustre-MDT0002-osp-MDT0000 get update log failed: rc = -22
      [ 7799.808197] Lustre: lustre-MDT0000: in recovery but waiting for the first client to connect
      [ 7799.810304] Lustre: 25293:0:(ldlm_lib.c:2068:target_recovery_overseer()) recovery is aborted, evict exports in recovery
      [ 7799.812067] Lustre: 25293:0:(ldlm_lib.c:2068:target_recovery_overseer()) Skipped 1 previous similar message
      [ 7800.264478] Lustre: lustre-MDT0000: disconnecting 5 stale clients
      [ 7800.289476] Lustre: lustre-MDT0000: Denying connection for new client lustre-MDT0001-mdtlov_UUID (at 10.9.3.150@tcp), waiting for 5 known clients (0 recovered, 0 in progress, and 5 evicted) already passed deadline 130:00
      [ 7800.596879] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n health_check
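
      The console log shows MDT0000 failing to read the update llog from MDT0002 (rc = -22), aborting recovery, and evicting all five known clients, which would explain why the client's open lease did not survive the failover. A hedged sketch of how to confirm that state on the restarted MDS (the target name is taken from the log above):

      # Check whether recovery completed or was aborted on the restarted MDT
      lctl get_param mdt.lustre-MDT0000.recovery_status
      # Overall server health, as the test framework itself checks above
      lctl get_param -n health_check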
      

      Although LU-8466 is for the same test failure with the same error message, the failures reported here are more recent.

      Here are links to logs for recent failures:
      https://testing.whamcloud.com/test_sets/19290212-5b04-11e9-8e92-52540065bddc
      https://testing.whamcloud.com/test_sets/d9ae92b0-5a76-11e9-8e92-52540065bddc
      https://testing.whamcloud.com/test_sets/d84fe1d8-5a3c-11e9-8e92-52540065bddc

    People

      Assignee: wc-triage (WC Triage)
      Reporter: jamesanunez (James Nunez, Inactive)