[LU-2008] After hardware reboot (using pm) the node cannot be accessed

Details


    Description

      This issue was created by maloo for bobijam <bobijam@whamcloud.com>

      This issue relates to the following test suite run: https://maloo.whamcloud.com/test_sets/02a8d976-05c6-11e2-b6a7-52540035b04c.

      The sub-test test_0b failed with the following error:

      test failed to respond and timed out

      Info required for matching: replay-ost-single 0b

      11:42:23:== replay-ost-single test 0b: empty replay =========================================================== 11:42:21 (1348425741)
      11:42:34:Failing ost1 on node client-25vm4
      11:42:34:CMD: client-25vm4 lctl dl
      11:42:34:CMD: client-25vm4 lctl dl
      11:42:34:CMD: client-25vm4 lctl dl
      11:42:34:CMD: client-25vm4 lctl dl
      11:42:34:CMD: client-25vm4 lctl dl
      11:42:34:CMD: client-25vm4 lctl dl
      11:42:34:CMD: client-25vm4 lctl dl
      11:42:34:+ pm -h powerman --off client-25vm4
      11:42:34:pm: warning: server version (2.3.5) != client (2.3.12)
      11:42:34:Command completed successfully
      11:42:34:affected facets:
      11:42:34:+ pm -h powerman --on client-25vm4
      11:42:34:pm: warning: server version (2.3.5) != client (2.3.12)
      11:42:46:Command completed successfully
      11:42:46:CMD: hostname
      11:42:46:pdsh@client-25vm1: gethostbyname("hostname") failed
      11:42:46:CMD: hostname
      11:42:46:pdsh@client-25vm1: gethostbyname("hostname") failed
      11:42:57:CMD: hostname
      11:42:57:pdsh@client-25vm1: gethostbyname("hostname") failed
      11:42:57:CMD: hostname
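
      The repeated gethostbyname("hostname") failures above suggest that pdsh was invoked with an empty node list, so the command word "hostname" itself ended up in the host position. A rough sketch of that symptom (the variable name is hypothetical, not taken from test-framework.sh):

      # Hypothetical illustration of the symptom above: with an empty, unquoted
      # node variable, the remote command shifts into the host argument.
      node=""                    # empty, e.g. because no affected facet was found
      pdsh -w $node hostname     # pdsh receives "hostname" as its -w host list
                                 # and fails trying to resolve it as a hostname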

          Activity

            jlevi Jodi Levi (Inactive) made changes -
            Fix Version/s New: Lustre 2.5.0 [ 10295 ]
            pjones Peter Jones made changes -
            Fix Version/s New: Lustre 2.4.1 [ 10294 ]
            yujian Jian Yu made changes -
            Link New: This issue is duplicated by LU-2415 [ LU-2415 ]
            pjones Peter Jones made changes -
            Fix Version/s New: Lustre 2.4.0 [ 10154 ]
            Resolution New: Fixed [ 1 ]
            Status Original: In Progress [ 3 ] New: Resolved [ 5 ]
            pjones Peter Jones added a comment -

            Landed for 2.4

            yujian Jian Yu added a comment -

            The issue was introduced by http://review.whamcloud.com/3611.

            Patch for master branch is in http://review.whamcloud.com/5867.

            yujian Jian Yu added a comment - - edited

            This is a Lustre issue on the master branch. Mounting an ldiskfs server target with the MMP feature enabled fails in ldiskfs_label_lustre(), which uses e2label:

            [root@fat-amd-2 ~]# mkfs.lustre --mgsnode=client-1@tcp:client-3@tcp --fsname=lustre --ost --index=0 --failnode=fat-amd-3@tcp --param=sys.timeout=20 --backfstype=ldiskfs --device-size=16000000 --quiet --reformat /dev/disk/by-id/scsi-1IET_00020001
            
               Permanent disk data:
            Target:     lustre:OST0000
            Index:      0
            Lustre FS:  lustre
            Mount type: ldiskfs
            Flags:      0x62
                          (OST first_time update )
            Persistent mount opts: errors=remount-ro
            Parameters: mgsnode=10.10.4.1@tcp:10.10.4.3@tcp failover.node=10.10.4.134@tcp sys.timeout=20
            
            [root@fat-amd-2 ~]# e2label /dev/disk/by-id/scsi-1IET_00020001
            lustre:OST0000
            
            [root@fat-amd-2 ~]# mkdir -p /mnt/ost1; mount -t lustre /dev/disk/by-id/scsi-1IET_00020001 /mnt/ost1
               e2label: MMP: device currently active while trying to open /dev/sdf
               MMP error info: last update: Wed Mar 27 07:29:05 2013
                node: fat-amd-2 device: sdf
            
            [root@fat-amd-2 ~]# e2label /dev/disk/by-id/scsi-1IET_00020001
            lustre:OST0000
            
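            For context, e2label fails here because of the MMP (multi-mount protection) check in e2fsprogs: while the MMP sequence marks the device as in use, e2label refuses to open it. A hedged illustration (not the fix that landed) of reading the label in that window with a tool that inspects the superblock directly:

            # Illustration only: while MMP marks the device busy, e2label fails,
            # but blkid can still report the label because it reads the
            # superblock directly rather than opening the filesystem.
            dev=/dev/disk/by-id/scsi-1IET_00020001   # device from the log above
            e2label $dev                    # -> "MMP: device currently active ..."
            blkid -o value -s LABEL $dev    # -> lustre:OST0000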
            pjones Peter Jones made changes -
            Assignee Original: Chris Gearing [ chris ] New: Jian Yu [ yujian ]
            pjones Peter Jones made changes -
            Assignee Original: Jian Yu [ yujian ] New: Chris Gearing [ chris ]
            yujian Jian Yu added a comment -

            There is a common issue in the lustre-initialization-1 reports from the above test sessions. After formatting all of the server targets, mounting them hit the following errors:

            11:37:12:Setup mgs, mdt, osts
            11:37:12:CMD: client-25vm3 mkdir -p /mnt/mds1
            11:37:12:CMD: client-25vm3 test -b /dev/lvm-MDS/P1
            11:37:12:Starting mds1: -o user_xattr,acl  /dev/lvm-MDS/P1 /mnt/mds1
            11:37:12:CMD: client-25vm3 mkdir -p /mnt/mds1; mount -t lustre -o user_xattr,acl  		                   /dev/lvm-MDS/P1 /mnt/mds1
            11:37:43:   e2label: MMP: device currently active while trying to open /dev/dm-0
            11:37:43:   MMP error info: last update: Sun Sep 23 11:37:37 2012
            11:37:43:    node: client-25vm3.lab.whamcloud.com device: dm-0
            11:37:43:CMD: client-25vm3 PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/usr/lib64/lustre/utils/gss:/usr/lib64/lustre/utils:/usr/lib64/openmpi/1.4-gcc/bin:/usr/kerberos/sbin:/usr/kerberos/bin:/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin: NAME=autotest_config sh rpc.sh set_default_debug \"0x33f0404\" \" 0xffb7e3ff\" 32 
            11:37:43:CMD: client-25vm3 e2label /dev/lvm-MDS/P1 2>/dev/null
            11:37:43:Started lustre:MDT0000
            11:37:43:CMD: client-25vm4 mkdir -p /mnt/ost1
            11:37:43:CMD: client-25vm4 test -b /dev/lvm-OSS/P1
            11:37:43:Starting ost1:   /dev/lvm-OSS/P1 /mnt/ost1
            11:37:43:CMD: client-25vm4 mkdir -p /mnt/ost1; mount -t lustre   		                   /dev/lvm-OSS/P1 /mnt/ost1
            11:38:14:   e2label: MMP: device currently active while trying to open /dev/dm-0
            11:38:14:   MMP error info: last update: Sun Sep 23 11:38:09 2012
            11:38:15:    node: client-25vm4.lab.whamcloud.com device: dm-0
            11:38:15:CMD: client-25vm4 PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/usr/lib64/lustre/utils/gss:/usr/lib64/lustre/utils:/usr/lib64/openmpi/1.4-gcc/bin:/usr/kerberos/sbin:/usr/kerberos/bin:/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin: NAME=autotest_config sh rpc.sh set_default_debug \"0x33f0404\" \" 0xffb7e3ff\" 32 
            11:38:15:CMD: client-25vm4 e2label /dev/lvm-OSS/P1 2>/dev/null
            11:38:15:Started lustre:OST0000
            

            The labels of the server target devices were "lustre:MDT0000", "lustre:OST0000", etc., instead of "lustre-MDT0000", "lustre-OST0000". This caused facet_up() to always return false, and consequently affected_facets() to always return an empty list under HARD failure mode.
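
            A minimal sketch of the check described above (assumed logic, not a copy of test-framework.sh):

            # Assumed sketch: the framework expects the label "<fsname>-<target>",
            # so a device labelled with ':' as the separator never matches and the
            # facet is treated as down.
            expected_label="lustre-OST0000"                       # what facet_up() looks for
            actual_label=$(e2label /dev/lvm-OSS/P1 2>/dev/null)   # returns "lustre:OST0000"
            if [ "$actual_label" = "$expected_label" ]; then
                echo "facet up"
            else
                echo "facet down -> affected_facets() returns an empty list"
            fi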

            LU-2415 also has the same issue.


            People

              Assignee: Jian Yu (yujian)
              Reporter: Maloo (maloo)