Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-615

sanity test 180a: awk: (FILENAME=/proc/fs/lustre/osc/lustre-OST0000-osc-ffff88032273d800/import FNR=32) fatal error: internal error

Details

    • Bug
    • Resolution: Duplicate
    • Blocker
    • None
    • Lustre 2.1.0, Lustre 2.2.0
    • None

    • Lustre Tag: v2_1_0_0_RC0
      Lustre Build: http://newbuild.whamcloud.com/job/lustre-master/267/
      Distro/Arch: RHEL6/x86_64 (kernel version: 2.6.32-131.6.1.el6)
    • 3
    • 5475

    Description

      sanity test 180a failed as follows:

      == sanity test 180a: test obdecho on osc == 02:30:13 (1314005413)
      awk: (FILENAME=/proc/fs/lustre/osc/lustre-OST0000-osc-ffff88032273d800/import FNR=32) fatal error: internal error
      

      Maloo report: https://maloo.whamcloud.com/test_sets/cdf213b2-ccac-11e0-8d02-52540025f9af

      Attachments

        Issue Links

          Activity

            [LU-615] sanity test 180a: awk: (FILENAME=/proc/fs/lustre/osc/lustre-OST0000-osc-ffff88032273d800/import FNR=32) fatal error: internal error
            pjones Peter Jones added a comment - - edited

            duplicate of LU-760

            pjones Peter Jones added a comment - - edited duplicate of LU-760
            sarah Sarah Liu added a comment -

            hit the same issue when testing interop between 2.1-wc1 client and 2.1.55 server

            Lustre: DEBUG MARKER: == sanity test 180a: test obdecho on osc == 19:30:56 (1328067056)
            Lustre: Echo OBD driver; http://www.lustre.org/

                • glibc detected *** awk: malloc(): memory corruption: 0x0000000000f3a1b0 ***
            sarah Sarah Liu added a comment - hit the same issue when testing interop between 2.1-wc1 client and 2.1.55 server Lustre: DEBUG MARKER: == sanity test 180a: test obdecho on osc == 19:30:56 (1328067056) Lustre: Echo OBD driver; http://www.lustre.org/ glibc detected *** awk: malloc(): memory corruption: 0x0000000000f3a1b0 ***

            Redo 'import' proc file via seq_file API:

            http://review.whamcloud.com/#change,1544

            I've made local tests to make sure it is able to read big amount of data, to be sure I've added several line of just garbage so output become bigger than 1024 and run awk with strace:

            open("/proc/fs/lustre/osc/lustre-OST0001-osc-d1a26e00/import", O_RDONLY|O_LARGEFILE) = 3
            fstat64(3, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
            ioctl(3, SNDCTL_TMR_TIMEBASE or TCGETS, 0xbfe6bec8) = -1 ENOTTY (Inappropriate ioctl for device)
            fstat64(3, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
            fcntl64(3, F_SETFD, FD_CLOEXEC)         = 0
            read(3, "import:\n    name: lustre-OST0001"..., 1024) = 1024
            read(3, "hat is why I put this line to ex"..., 1024) = 341
            fstat64(1, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 0), ...}) = 0
            mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb765f000
            write(1, "747\n", 4747
            )                    = 4
            read(3, "", 1024)                       = 0
            close(3)                                = 0
            exit_group(0)                           = ?
            

            the command was:
            strace awk '/usec_per_rpc:/

            {print $2}

            ' /proc/fs/lustre/osc/lustre-OST0001-osc-d1a26e00/import

            it searched usec_per_rpc string which is at end of file and we can see that awk did two reads - 1024 bytes and 341 bytes, string was found in second part.

            Previous implementation lprocfs_rd_import() wasn't able to return data from offset at all and just cut data at 1024 causing awk to behave unpredictable

            tappro Mikhail Pershin added a comment - Redo 'import' proc file via seq_file API: http://review.whamcloud.com/#change,1544 I've made local tests to make sure it is able to read big amount of data, to be sure I've added several line of just garbage so output become bigger than 1024 and run awk with strace: open( "/proc/fs/lustre/osc/lustre-OST0001-osc-d1a26e00/ import " , O_RDONLY|O_LARGEFILE) = 3 fstat64(3, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0 ioctl(3, SNDCTL_TMR_TIMEBASE or TCGETS, 0xbfe6bec8) = -1 ENOTTY (Inappropriate ioctl for device) fstat64(3, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0 fcntl64(3, F_SETFD, FD_CLOEXEC) = 0 read(3, " import :\n name: lustre-OST0001" ..., 1024) = 1024 read(3, "hat is why I put this line to ex" ..., 1024) = 341 fstat64(1, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 0), ...}) = 0 mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb765f000 write(1, "747\n" , 4747 ) = 4 read(3, "", 1024) = 0 close(3) = 0 exit_group(0) = ? the command was: strace awk '/usec_per_rpc:/ {print $2} ' /proc/fs/lustre/osc/lustre-OST0001-osc-d1a26e00/import it searched usec_per_rpc string which is at end of file and we can see that awk did two reads - 1024 bytes and 341 bytes, string was found in second part. Previous implementation lprocfs_rd_import() wasn't able to return data from offset at all and just cut data at 1024 causing awk to behave unpredictable

            It seems this is related to our implementation of proc read method there. The amount of data in 'import' is quite big, more than 1024 bytes. Strace shows the following while reading proc file:

            open("/proc/fs/lustre/osc/lustre-OST0001-osc-d9f1ea00/import", O_RDONLY|O_LARGEFILE) = 3
            fstat64(3, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
            ioctl(3, SNDCTL_TMR_TIMEBASE or TCGETS, 0xbfdb9348) = -1 ENOTTY (Inappropriate ioctl for device)
            fstat64(3, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
            fcntl64(3, F_SETFD, FD_CLOEXEC)         = 0
            read(3, "import:\n    name: lustre-OST0001"..., 1024) = 1013
            

            1013 bytes is on my local system, on Toro it is more than 1024, e.g. my local address is just 0@lo while on Toro it is full IP address.
            The issue here is fstat64 - it can't get file size for proc file, so following read is using 1024 for read which is less than proc data, probably that is the reason of issue. The lprocfs_rd_import() uses snprintf to output data but it will truncate data so last line may have no final end-of-line symbol, can this cause awk to fail?

            If we will use just tmp file with output of proc data that works fine because awk gets file size by fstat64, so this solution will work.
            But better to fix our proc read method for 'import' data, using seq_file.

            tappro Mikhail Pershin added a comment - It seems this is related to our implementation of proc read method there. The amount of data in 'import' is quite big, more than 1024 bytes. Strace shows the following while reading proc file: open( "/proc/fs/lustre/osc/lustre-OST0001-osc-d9f1ea00/ import " , O_RDONLY|O_LARGEFILE) = 3 fstat64(3, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0 ioctl(3, SNDCTL_TMR_TIMEBASE or TCGETS, 0xbfdb9348) = -1 ENOTTY (Inappropriate ioctl for device) fstat64(3, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0 fcntl64(3, F_SETFD, FD_CLOEXEC) = 0 read(3, " import :\n name: lustre-OST0001" ..., 1024) = 1013 1013 bytes is on my local system, on Toro it is more than 1024, e.g. my local address is just 0@lo while on Toro it is full IP address. The issue here is fstat64 - it can't get file size for proc file, so following read is using 1024 for read which is less than proc data, probably that is the reason of issue. The lprocfs_rd_import() uses snprintf to output data but it will truncate data so last line may have no final end-of-line symbol, can this cause awk to fail? If we will use just tmp file with output of proc data that works fine because awk gets file size by fstat64, so this solution will work. But better to fix our proc read method for 'import' data, using seq_file.

            we are seeing that bug very often in Orion over rhel6. I have an idea that awk has troubles parsing proc files. The patch copy proc file to the regular one and use it, there is fix:

            http://review.whamcloud.com/1467

            the Maloo run with it shows test 180a pass, but this is only single run.

            tappro Mikhail Pershin added a comment - we are seeing that bug very often in Orion over rhel6. I have an idea that awk has troubles parsing proc files. The patch copy proc file to the regular one and use it, there is fix: http://review.whamcloud.com/1467 the Maloo run with it shows test 180a pass, but this is only single run.

            Probably related to LU-566.

            adilger Andreas Dilger added a comment - Probably related to LU-566 .
            yujian Jian Yu added a comment -

            Lustre Build: http://newbuild.whamcloud.com/job/lustre-master/275/
            Distro/Arch: RHEL5/x86_64(server, OFED 1.5.3.1), SLES11/x86_64(client, in-kernel OFED)

            sanity test 180a failed with the same issue: https://maloo.whamcloud.com/test_sets/e4d30bfa-d561-11e0-8d02-52540025f9af

            yujian Jian Yu added a comment - Lustre Build: http://newbuild.whamcloud.com/job/lustre-master/275/ Distro/Arch: RHEL5/x86_64(server, OFED 1.5.3.1), SLES11/x86_64(client, in-kernel OFED) sanity test 180a failed with the same issue: https://maloo.whamcloud.com/test_sets/e4d30bfa-d561-11e0-8d02-52540025f9af

            People

              wc-triage WC Triage
              yujian Jian Yu
              Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: