Lustre / LU-15339

sanity test 208 fails with 'lease not broken over recovery'


Details

    • Type: Improvement
    • Resolution: Fixed
    • Priority: Minor
    • Fix Version/s: Lustre 2.15.0

    Description

      During testing on some internal branches, we've observed sanity test 208 failing.  It turns out this is a timing issue.

      Looking at the test, we see:

      $MULTIOP $DIR/$tfile oO_RDWR:eR_E-eUc &
      PID=$!
      sleep 1
      
      # open file to break lease and then recovery
      $MULTIOP $DIR/$tfile oO_RDWR:c || error "open file error"
      fail mds1

      This is one multiop requesting the lease, then a 1-second wait, then the other attempting an open, which breaks the lease.

      In the test logs, we see:

      ==== test 5: lease broken can't be regained by replay
      multiop: cannot get READ lease, ext 0: Device or resource busy (16)
      multiop: apply/unlock lease error: Device or resource busy
      Failing mds1 on onyx-50vm2 

      So this is the first multiop getting EBUSY, which it would get if the file were already open.
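
      As a hedged sketch (not part of the test), the same EBUSY can be provoked by letting the two multiop invocations race with no delay at all; the conflicting open then reaches the MDS while the lease request is still being processed:

      # Sketch only: same multiop commands as the test, but with the sleep removed
      $MULTIOP $DIR/$tfile oO_RDWR:eR_E-eUc &   # lease requester
      PID=$!
      $MULTIOP $DIR/$tfile oO_RDWR:c            # conflicting open, issued immediately
      kill $PID 2>/dev/null                     # best-effort cleanup for this sketch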

      Looking at the debug logs, we see two processes working on the file, 3066 and 3068.  Obviously 3066 was started earlier.
      We see 3066 opening the file, getting an intent lock, etc., then requesting the lease.  Sometime during that earlier intent lock request, messages from 3068 start appearing, meaning it has started.  This ioctl is the lease request:

      00000080:00200000:0.0:1638330788.087774:0:3066:0:(file.c:3518:ll_file_ioctl()) VFS Op:inode=[0x200005222:0x27a6:0x0](ffff9fddbb55f098), cmd=c00c66f3
      00000002:00010000:0.0:1638330788.087789:0:3066:0:(mdc_locks.c:1228:mdc_intent_lock()) (name: ,[0x200005222:0x27a6:0x0]) in obj [0x200005222:0x27a6:0x0], intent: open flags 01004040000001
      [...]
      00000100:00100000:0.0:1638330788.087812:0:3066:0:(client.c:1685:ptlrpc_send_new_req()) Sending RPC pname:cluuid:pid:xid:nid:opc multiop:e3dc4b61-17a3-a1d0-026f-41b421fc661f:3066:1717912348006336:10.240.24.89@tcp:101
      00000100:00100000:0.0:1638330788.087830:0:3066:0:(client.c:2409:ptlrpc_set_wait()) set ffff9fdd603fc880 going to sleep for 13 seconds

      But before that can complete, we see the open from 3068 complete:

      00000100:00100000:0.0:1638330788.087874:0:3068:0:(client.c:2115:ptlrpc_check_set()) Completed RPC pname:cluuid:pid:xid:nid:opc multiop:e3dc4b61-17a3-a1d0-026f-41b421fc661f:3068:1717912348006272:10.240.24.89@tcp:101
      [...]
      00000002:00100000:0.0:1638330788.087906:0:3068:0:(mdc_request.c:801:mdc_set_open_replay_data()) @@@ Set up open replay data  req@ffff9fdd55dc0000 x1717912348006272/t51539607555(51539607555) o101->lustre-MDT0000-mdc-ffff9fdd58dea800@10.240.24.89@tcp:12/10 lens 752/616 e 0 to 0 dl 1638330803 ref 2 fl Complete:RP/4/0 rc 0/0  

      The first message logged by 3066 is at 1638330787.082310, and as we can see above, the intent lock request does not start until 1638330788.087774 - just over a second later.
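
      For reference, the gap between those two timestamps can be computed directly; anything over the test's 1-second sleep is enough to lose the race:

      $ echo "1638330788.087774 - 1638330787.082310" | bc
      1.005464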

      So this is just a pure timing issue.  The solution should be to increase the sleeps in this test from 1 second to 2 seconds.
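
      A sketch of that change against the snippet quoted above (the actual patch may differ in its details):

      $MULTIOP $DIR/$tfile oO_RDWR:eR_E-eUc &
      PID=$!
      sleep 2                                   # was: sleep 1

      # open file to break lease and then recovery
      $MULTIOP $DIR/$tfile oO_RDWR:c || error "open file error"
      fail mds1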


          People

            Assignee: paf0186 Patrick Farrell (Inactive)
            Reporter: paf0186 Patrick Farrell (Inactive)
