LU-12769: replay-dual test 0b hangs in client mount

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Affects Version/s: Lustre 2.13.0, Lustre 2.12.4
    • Fix Version/s: Lustre 2.13.0, Lustre 2.12.4
    • Labels: None
    • Severity: 3

    Description

      replay-dual test_0b hangs when the client tries to mount the file system. Looking at the logs at https://testing.whamcloud.com/test_sets/2165cd40-d70d-11e9-9fc9-52540065bddc, the last lines seen in the suite_log are:

      Started lustre-MDT0000
      Starting client: trevis-16vm11.trevis.whamcloud.com:  -o user_xattr,flock trevis-20vm9@tcp:/lustre /mnt/lustre
      CMD: trevis-16vm11.trevis.whamcloud.com mkdir -p /mnt/lustre
      CMD: trevis-16vm11.trevis.whamcloud.com mount -t lustre -o user_xattr,flock trevis-20vm9@tcp:/lustre /mnt/lustre
      

      On the MDS (vm9), we see the following in the console log:

      [ 1298.607167] Lustre: DEBUG MARKER: e2label /dev/mapper/mds1_flakey 2>/dev/null
      [ 1309.346832] Lustre: lustre-MDT0000: Denying connection for new client 26b5e8fc-98f4-4 (at 10.9.4.195@tcp), waiting for 4 known clients (0 recovered, 3 in progress, and 0 evicted) to recover in 0:43
      [ 1314.381684] Lustre: lustre-MDT0000: Denying connection for new client 26b5e8fc-98f4-4 (at 10.9.4.195@tcp), waiting for 4 known clients (0 recovered, 3 in progress, and 0 evicted) to recover in 0:38
      [ 1314.383960] Lustre: Skipped 1 previous similar message
      [ 1319.501601] Lustre: lustre-MDT0000: Denying connection for new client 26b5e8fc-98f4-4 (at 10.9.4.195@tcp), waiting for 4 known clients (0 recovered, 3 in progress, and 0 evicted) to recover in 0:33
      [ 1324.621716] Lustre: lustre-MDT0000: Denying connection for new client 26b5e8fc-98f4-4 (at 10.9.4.195@tcp), waiting for 4 known clients (0 recovered, 3 in progress, and 0 evicted) to recover in 0:28
      [ 1329.741784] Lustre: lustre-MDT0000: Denying connection for new client 26b5e8fc-98f4-4 (at 10.9.4.195@tcp), waiting for 4 known clients (0 recovered, 3 in progress, and 0 evicted) to recover in 0:23
      [ 1339.982308] Lustre: lustre-MDT0000: Denying connection for new client 26b5e8fc-98f4-4 (at 10.9.4.195@tcp), waiting for 4 known clients (0 recovered, 3 in progress, and 0 evicted) to recover in 0:12
      [ 1339.984388] Lustre: Skipped 1 previous similar message
      [ 1352.877227] Lustre: lustre-MDT0000: recovery is timed out, evict stale exports
      [ 1352.878085] Lustre: lustre-MDT0000: disconnecting 1 stale clients
      [ 1360.462321] Lustre: lustre-MDT0000: Denying connection for new client 26b5e8fc-98f4-4 (at 10.9.4.195@tcp), waiting for 4 known clients (0 recovered, 3 in progress, and 1 evicted) to recover in 0:27
      [ 1360.464241] Lustre: Skipped 3 previous similar messages
      [ 1396.303505] Lustre: lustre-MDT0000: Denying connection for new client 26b5e8fc-98f4-4 (at 10.9.4.195@tcp), waiting for 4 known clients (0 recovered, 3 in progress, and 1 evicted) already passed deadline 0:07
      [ 1396.305383] Lustre: Skipped 6 previous similar messages
      [ 1462.863612] Lustre: lustre-MDT0000: Denying connection for new client 26b5e8fc-98f4-4 (at 10.9.4.195@tcp), waiting for 4 known clients (0 recovered, 3 in progress, and 1 evicted) already passed deadline 1:14
      [ 1462.865659] Lustre: Skipped 12 previous similar messages
      [ 1590.865592] Lustre: lustre-MDT0000: Denying connection for new client 26b5e8fc-98f4-4 (at 10.9.4.195@tcp), waiting for 4 known clients (0 recovered, 3 in progress, and 1 evicted) already passed deadline 3:22
      [ 1590.867668] Lustre: Skipped 24 previous similar messages
      [ 1846.869597] Lustre: lustre-MDT0000: Denying connection for new client 26b5e8fc-98f4-4 (at 10.9.4.195@tcp), waiting for 4 known clients (0 recovered, 3 in progress, and 1 evicted) already passed deadline 7:38
      [ 1846.871619] Lustre: Skipped 49 previous similar messages
      [ 2358.877590] Lustre: lustre-MDT0000: Denying connection for new client 26b5e8fc-98f4-4 (at 10.9.4.195@tcp), waiting for 4 known clients (0 recovered, 3 in progress, and 1 evicted) already passed deadline 16:10
      [ 2358.879608] Lustre: Skipped 99 previous similar messages
      [ 2963.046705] Lustre: lustre-MDT0000: Denying connection for new client 26b5e8fc-98f4-4 (at 10.9.4.195@tcp), waiting for 4 known clients (0 recovered, 3 in progress, and 1 evicted) already passed deadline 26:14
      [ 2963.048899] Lustre: Skipped 117 previous similar messages
      [ 3567.216150] Lustre: lustre-MDT0000: Denying connection for new client 26b5e8fc-98f4-4 (at 10.9.4.195@tcp), waiting for 4 known clients (0 recovered, 3 in progress, and 1 evicted) already passed deadline 36:18
      [ 3567.218114] Lustre: Skipped 117 previous similar messages
      [ 4171.385464] Lustre: lustre-MDT0000: Denying connection for new client 26b5e8fc-98f4-4 (at 10.9.4.195@tcp), waiting for 4 known clients (0 recovered, 3 in progress, and 1 evicted) already passed deadline 46:23
      [ 4171.387400] Lustre: Skipped 117 previous similar messages
      [ 4775.554195] Lustre: lustre-MDT0000: Denying connection for new client 26b5e8fc-98f4-4 (at 10.9.4.195@tcp), waiting for 4 known clients (0 recovered, 3 in progress, and 1 evicted) already passed deadline 56:27
      [ 4775.556078] Lustre: Skipped 117 previous similar messages
      [ 5379.723493] Lustre: lustre-MDT0000: Denying connection for new client 26b5e8fc-98f4-4 (at 10.9.4.195@tcp), waiting for 4 known clients (0 recovered, 3 in progress, and 1 evicted) already passed deadline 66:31
      [ 5379.725448] Lustre: Skipped 117 previous similar messages
      [ 5983.892872] Lustre: lustre-MDT0000: Denying connection for new client 26b5e8fc-98f4-4 (at 10.9.4.195@tcp), waiting for 4 known clients (0 recovered, 3 in progress, and 1 evicted) already passed deadline 76:35
      [ 5983.894832] Lustre: Skipped 117 previous similar messages
      [ 6588.062279] Lustre: lustre-MDT0000: Denying connection for new client 26b5e8fc-98f4-4 (at 10.9.4.195@tcp), waiting for 4 known clients (0 recovered, 3 in progress, and 1 evicted) already passed deadline 86:39
      [ 6588.064310] Lustre: Skipped 117 previous similar messages
      [ 7192.231245] Lustre: lustre-MDT0000: Denying connection for new client 26b5e8fc-98f4-4 (at 10.9.4.195@tcp), waiting for 4 known clients (0 recovered, 3 in progress, and 1 evicted) already passed deadline 96:43
      [ 7192.233232] Lustre: Skipped 117 previous similar messages
      [ 7796.400758] Lustre: lustre-MDT0000: Denying connection for new client 26b5e8fc-98f4-4 (at 10.9.4.195@tcp), waiting for 4 known clients (0 recovered, 3 in progress, and 1 evicted) already passed deadline 106:48
      [ 7796.403030] Lustre: Skipped 117 previous similar messages
      [ 8400.570060] Lustre: lustre-MDT0000: Denying connection for new client 26b5e8fc-98f4-4 (at 10.9.4.195@tcp), waiting for 4 known clients (0 recovered, 3 in progress, and 1 evicted) already passed deadline 116:52
      [ 8400.572101] Lustre: Skipped 117 previous similar messages
      

      On the OSS (vm4) console log, we see:

      [16003.344264] Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-20vm9.trevis.whamcloud.com: executing set_default_debug -1 all 4
      [16003.724818] Lustre: DEBUG MARKER: trevis-20vm9.trevis.whamcloud.com: executing set_default_debug -1 all 4
      [16009.720340] Lustre: Evicted from MGS (at 10.9.4.245@tcp) after server handle changed from 0xb142ad53c4c0bc64 to 0xb142ad53c4c0d363
      [16014.732310] LustreError: 10096:0:(ldlm_resource.c:1147:ldlm_resource_complain()) lustre-MDT0000-lwp-OST0001: namespace resource [0x200000006:0x20000:0x0].0x0 (ffff8f95b862c0c0) refcount nonzero (1) after lock cleanup; forcing cleanup.
      [16014.735727] LustreError: 10096:0:(ldlm_resource.c:1147:ldlm_resource_complain()) Skipped 1 previous similar message
      

      This may be related to LU-11038 and/or LU-12340.
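
      In the MDS log above, recovery times out once ("recovery is timed out, evict stale exports") and evicts one stale client, but with three clients still "in progress" recovery never completes, and the MDT keeps denying the new client long after the deadline has passed. As a hypothetical, much-simplified model of the admission check that produces these messages (illustrative only, not the actual Lustre code):

      #include <errno.h>
      #include <stdbool.h>

      /* Hypothetical model of a recovering target; not the actual
       * Lustre structures. */
      struct target {
              bool recovering;     /* set at restart; cleared only when
                                    * recovery completes or is aborted */
              int  known_clients;  /* clients connected before restart,
                                    * "waiting for 4 known clients" */
              int  recovered;      /* known clients done with replay */
      };

      /* Known clients are admitted so they can replay their state;
       * brand-new clients are turned away until recovery ends. */
      static int admit_client(const struct target *t, bool is_known)
      {
              if (t->recovering && !is_known)
                      return -EBUSY;  /* "Denying connection for new client" */
              return 0;
      }

      If nothing ever clears the recovering flag, for example because the recovery timer is not re-armed after its first expiry, a new client's mount retries this path indefinitely, which is the hang seen in test_0b.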

Activity

Gerrit Updater added a comment -

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/36937/
Subject: LU-12769 recovery: use monotonic timer
Project: fs/lustre-release
Branch: b2_12
Current Patch Set:
Commit: cf4cdefec8145b2b7a2eb4c2de4e1e08ebc862b3

Gerrit Updater added a comment -

Minh Diep (mdiep@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/36937
Subject: LU-12769 recovery: use monotonic timer
Project: fs/lustre-release
Branch: b2_12
Current Patch Set: 1
Commit: e46386159b7afe349bd90e845d69adafb1ab62fc

Peter Jones added a comment -

Landed for 2.13

Gerrit Updater added a comment -

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/36274/
Subject: LU-12769 recovery: use monotonic timer
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 06408a4ef381121fa58783026a0cf0a6b0fa479c

James A Simmons added a comment -

Talking with Alex, this appears to be a problem of clock drift, which would explain why older tickets similar to this one exist: jiffies experiences clock drift as well. With the hrtimer, the drift appears to happen only when using the high-resolution wall clock on VMs; using the high-resolution monotonic clock appears to avoid this problem.
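
A minimal sketch of the clock choice being discussed, using the standard Linux hrtimer API; the timer, callback, and function names below are hypothetical, not the actual Lustre patch:

    #include <linux/hrtimer.h>
    #include <linux/ktime.h>

    static struct hrtimer recovery_timer;   /* hypothetical timer */

    static enum hrtimer_restart recovery_expired(struct hrtimer *t)
    {
            /* deadline reached: recovery would be aborted here */
            return HRTIMER_NORESTART;
    }

    static void arm_recovery_timer(unsigned long timeout_sec)
    {
            /* CLOCK_REALTIME tracks the wall clock, which NTP and VM
             * clock drift can adjust underneath the timer, so the
             * expiry can land later than intended.  CLOCK_MONOTONIC
             * is never stepped, so a relative timeout reliably fires
             * after the requested interval. */
            hrtimer_init(&recovery_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
            recovery_timer.function = recovery_expired;
            hrtimer_start(&recovery_timer, ktime_set(timeout_sec, 0),
                          HRTIMER_MODE_REL);
    }

With a wall-clock (CLOCK_REALTIME) timer on a drifting VM clock, the expiry can effectively keep receding, which would match the ever-growing "already passed deadline" values in the description.
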
James A Simmons added a comment (edited) -

Actually this was broken before LU-11771. This is a duplicate of https://jira.whamcloud.com/browse/LU-11762. The landing of LU-11771 reduced how often this bug showed up, to the point that I couldn't reproduce it to fix it. It might be good to land https://review.whamcloud.com/#/c/35627 as well. We will also need this for 2.12.

Andreas Dilger added a comment -

The broken code fixed by Alex's patch was introduced by commit v2_12_53-53-g9334f1d512 (patch https://review.whamcloud.com/34710 "LU-11771 ldlm: use hrtimer for recovery to fix timeout messages"), so this needs to be fixed for 2.13.

Gerrit Updater added a comment -

Alex Zhuravlev (bzzz@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/36274
Subject: LU-12769 recovery: restart recovery timer
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: dd856535f4b98f4edee42a504c62a2766c1557c6

Alex Zhuravlev added a comment -

In my local testing (single VM), replay-dual 0a times out very frequently.

People

    Assignee: Alex Zhuravlev (bzzz)
    Reporter: James Nunez (jamesanunez, Inactive)
    Votes: 0
    Watchers: 6
