Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Critical
    • Fix Version/s: Lustre 2.11.0, Lustre 2.10.4
    • Affects Version/s: Lustre 2.11.0
    • Labels: None
    • Environment: onyx, full
      servers: el7, zfs, branch master, v2.10.53.1, b3642
      clients: el7, branch master, v2.10.53.1, b3642
    • Severity: 3

    Description

      https://testing.hpdd.intel.com/test_sessions/ecf02eea-e0bb-48ce-a644-fed2036c2478

      From suite_log:

      Starting ost1:   lustre-ost1/ost1 /mnt/lustre-ost1
      CMD: onyx-35vm12 mkdir -p /mnt/lustre-ost1; mount -t lustre lustre-ost1/ost1 /mnt/lustre-ost1
      onyx-35vm12: mount.lustre: mount lustre-ost1/ost1 at /mnt/lustre-ost1 failed: Cannot send after transport endpoint shutdown
      

      The subsequent three test sets (sanityn, sanity-hsm, and sanity-lsnapshot) also ran no subtests, and their suite logs end with this message:

      according to /etc/mtab lustre-mdt1/mdt1 is already mounted on /mnt/lustre-mds1
      

          Activity

            [LU-10045] sanity-lfsck no sub tests failed

            gerrit Gerrit Updater added a comment -

            John L. Hammond (john.hammond@intel.com) merged in patch https://review.whamcloud.com/31301/
            Subject: LU-10045 obdclass: multiple try when register target
            Project: fs/lustre-release
            Branch: b2_10
            Current Patch Set:
            Commit: aa99a7bb77cce480ff5753238d857a0eb797e5fe

            gerrit Gerrit Updater added a comment -

            Minh Diep (minh.diep@intel.com) uploaded a new patch: https://review.whamcloud.com/31301
            Subject: LU-10045 obdclass: multiple try when register target
            Project: fs/lustre-release
            Branch: b2_10
            Current Patch Set: 1
            Commit: 44bb3ee8961d13618e6d670d6e3005c2729a723e

            mdiep Minh Diep added a comment -

            Landed for 2.11


            gerrit Gerrit Updater added a comment -

            Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/30761/
            Subject: LU-10045 obdclass: multiple try when register target
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 79bfc74869e3f7b052874f4585399c5ba7f599e9

            gerrit Gerrit Updater added a comment -

            Fan Yong (fan.yong@intel.com) uploaded a new patch: https://review.whamcloud.com/30761
            Subject: LU-10045 mgc: multiple try when register target
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: d62d4132162c15cacc260e5d27abc2522f59d72d

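            The patch subject above describes the fix as retrying target registration rather than failing the mount on the first transient MGS error. Below is a minimal sketch of that retry idea in C; the helper name and retry count are hypothetical, and the real change lives in the obdclass/mgc code of the patches linked above.

            #include <errno.h>
            #include <stdio.h>
            #include <unistd.h>

            #define REGISTER_RETRY_MAX 3   /* illustrative value, not taken from the patch */

            /* Hypothetical stand-in for the target-registration RPC to the MGS.
             * It fails twice with -ESHUTDOWN ("Cannot send after transport
             * endpoint shutdown") and then succeeds, to exercise the retry path. */
            static int hypothetical_register_target(void)
            {
                    static int calls;

                    return ++calls < 3 ? -ESHUTDOWN : 0;
            }

            /* The "multiple try" idea: treat a dropped MGS connection as transient
             * and retry registration a few times before giving up on the mount. */
            static int register_target_with_retry(void)
            {
                    int rc = 0;
                    int i;

                    for (i = 0; i < REGISTER_RETRY_MAX; i++) {
                            rc = hypothetical_register_target();
                            if (rc == 0 || (rc != -ESHUTDOWN && rc != -EAGAIN))
                                    break;  /* success, or a non-transient error */
                            sleep(1);       /* back off briefly before retrying */
                    }
                    return rc;
            }

            int main(void)
            {
                    printf("register_target_with_retry() = %d\n", register_target_with_retry());
                    return 0;
            }
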
            yong.fan nasf (Inactive) added a comment -

            The reason is described in this comment: https://jira.hpdd.intel.com/browse/LU-10406?focusedCommentId=217655&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-217655

            yong.fan nasf (Inactive) added a comment -

            The OST hit trouble during umount in the former tests:
            https://testing.hpdd.intel.com/test_logs/26247ed8-d4fa-11e7-a066-52540065bddc/show_text

            [11519.992843] Lustre: DEBUG MARKER: umount -d -f /mnt/lustre-ost1
            [11521.139399] LustreError: 22266:0:(ldlm_resource.c:1094:ldlm_resource_complain()) lustre-MDT0000-lwp-OST0000: namespace resource [0x200000006:0x20000:0x0].0x0 (ffff88005598f6c0) refcount nonzero (1) after lock cleanup; forcing cleanup.
            [11521.146796] LustreError: 22266:0:(ldlm_resource.c:1676:ldlm_resource_dump()) --- Resource: [0x200000006:0x20000:0x0].0x0 (ffff88005598f6c0) refcount = 2
            [11521.153245] LustreError: 22266:0:(ldlm_resource.c:1679:ldlm_resource_dump()) Granted locks (in reverse order):
            [11521.157006] LustreError: 22266:0:(ldlm_resource.c:1682:ldlm_resource_dump()) ### ### ns: lustre-MDT0000-lwp-OST0000 lock: ffff880056b8ca00/0xd8984bc587bddb59 lrc: 2/1,0 mode: CR/CR res: [0x200000006:0x20000:0x0].0x0 rrc: 3 type: PLN flags: 0x1106400000000 nid: local remote: 0x2a5bf4985ef188db expref: -99 pid: 21127 timeout: 0 lvb_type: 2
            [11546.656082] Lustre: 15216:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1511948075/real 1511948075]  req@ffff880048344000 x1585380626416000/t0(0) o38->lustre-MDT0000-lwp-OST0001@10.9.4.127@tcp:12/10 lens 520/544 e 0 to 1 dl 1511948100 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
            [11546.663986] Lustre: 15216:0:(client.c:2113:ptlrpc_expire_one_request()) Skipped 2 previous similar messages
            [11560.656077] LustreError: 166-1: MGC10.9.4.127@tcp: Connection to MGS (at 10.9.4.127@tcp) was lost; in progress operations using this service will fail
            [11591.445188] LustreError: 22271:0:(client.c:1166:ptlrpc_import_delay_req()) @@@ IMP_CLOSED   req@ffff880056642a00 x1585380626416192/t0(0) o101->lustre-MDT0000-lwp-OST0000@10.9.4.127@tcp:23/10 lens 456/496 e 0 to 0 dl 0 ref 2 fl Rpc:/0/ffffffff rc 0/-1
            [11591.451032] LustreError: 22271:0:(qsd_reint.c:56:qsd_reint_completion()) lustre-OST0000: failed to enqueue global quota lock, glb fid:[0x200000006:0x1020000:0x0], rc:-5
            [11591.661039] Lustre: 15216:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1511948129/real 1511948129]  req@ffff880056640600 x1585380626416160/t0(0) o250->MGC10.9.4.127@tcp@10.9.4.127@tcp:26/25 lens 520/544 e 0 to 1 dl 1511948145 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
            [11591.669399] Lustre: 15216:0:(client.c:2113:ptlrpc_expire_one_request()) Skipped 3 previous similar messages
            [11665.661060] Lustre: 15216:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1511948194/real 1511948194]  req@ffff880056642700 x1585380626416256/t0(0) o38->lustre-MDT0000-lwp-OST0001@10.9.4.127@tcp:12/10 lens 520/544 e 0 to 1 dl 1511948219 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
            [11665.669739] Lustre: 15216:0:(client.c:2113:ptlrpc_expire_one_request()) Skipped 4 previous similar messages
            [11711.461177] LustreError: 22280:0:(client.c:1166:ptlrpc_import_delay_req()) @@@ IMP_CLOSED   req@ffff880056642a00 x1585380626416352/t0(0) o101->lustre-MDT0000-lwp-OST0000@10.9.4.127@tcp:23/10 lens 456/496 e 0 to 0 dl 0 ref 2 fl Rpc:/0/ffffffff rc 0/-1
            ...
            
            yong.fan nasf (Inactive) added a comment -

            +1 on master: https://testing.hpdd.intel.com/test_sets/25e467c6-d4fa-11e7-a066-52540065bddc

            yong.fan nasf (Inactive) added a comment -

            All the available information is shown in the bug description; there are no more detailed logs. Generally, at the beginning of sanity-lfsck the test scripts reformat and remount the whole system to clean up the test environment, but according to the logs some trouble happened during that process. One possible case is that after the MDT was reformatted, the scripts tried to mount it, but /etc/mtab wrongly recorded that the MDT was already mounted. So the MDT (and MGS) was not really mounted, and the subsequent mount failures happened on the OSTs. As for why /etc/mtab recorded wrong information, it is difficult to know; it may be a side effect of some earlier test cases (in sanity or earlier).

            So unless we can reproduce the trouble with more detailed logs, it is difficult to locate the root cause.
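            As an illustration of the stale /etc/mtab theory, the sketch below uses glibc's getmntent() to show how a leftover entry would make a naive "already mounted" check report the MDT as mounted and skip the real mount, matching the suite-log message above. This is not the Lustre test framework's actual check; the helper is a generic assumption-laden example.

            #include <mntent.h>
            #include <stdio.h>
            #include <string.h>

            /* Return 1 if `device` is listed as a mounted filesystem source in /etc/mtab. */
            static int mtab_says_mounted(const char *device)
            {
                    struct mntent *m;
                    FILE *fp = setmntent("/etc/mtab", "r");
                    int found = 0;

                    if (fp == NULL)
                            return 0;
                    while ((m = getmntent(fp)) != NULL) {
                            if (strcmp(m->mnt_fsname, device) == 0) {
                                    found = 1;
                                    break;
                            }
                    }
                    endmntent(fp);
                    return found;
            }

            int main(void)
            {
                    /* If a previous umount never completed (or mtab was left stale),
                     * this reports "already mounted" even though nothing is mounted,
                     * so a setup step that trusts it would skip the real mount. */
                    if (mtab_says_mounted("lustre-mdt1/mdt1"))
                            printf("according to /etc/mtab lustre-mdt1/mdt1 is already mounted\n");
                    else
                            printf("not mounted; proceeding with mount\n");
                    return 0;
            }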

            pjones Peter Jones added a comment -

            Fan Yong

            Could you please advise on this one?

            Thanks

            Peter


            People

              Assignee: yong.fan nasf (Inactive)
              Reporter: jcasper James Casper (Inactive)
              Votes: 0
              Watchers: 5
