[LU-615] sanity test 180a: awk: (FILENAME=/proc/fs/lustre/osc/lustre-OST0000-osc-ffff88032273d800/import FNR=32) fatal error: internal error Created: 22/Aug/11  Updated: 15/Jul/13  Resolved: 06/Feb/12

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.1.0, Lustre 2.2.0
Fix Version/s: None

Type: Bug Priority: Blocker
Reporter: Jian Yu Assignee: WC Triage
Resolution: Duplicate Votes: 0
Labels: None
Environment:

Lustre Tag: v2_1_0_0_RC0
Lustre Build: http://newbuild.whamcloud.com/job/lustre-master/267/
Distro/Arch: RHEL6/x86_64 (kernel version: 2.6.32-131.6.1.el6)


Issue Links:
Related
is related to LU-760 Test failure on test suite sanity, su... Resolved
Severity: 3
Rank (Obsolete): 5475

 Description   

sanity test 180a failed as follows:

== sanity test 180a: test obdecho on osc == 02:30:13 (1314005413)
awk: (FILENAME=/proc/fs/lustre/osc/lustre-OST0000-osc-ffff88032273d800/import FNR=32) fatal error: internal error

Maloo report: https://maloo.whamcloud.com/test_sets/cdf213b2-ccac-11e0-8d02-52540025f9af



 Comments   
Comment by Jian Yu [ 02/Sep/11 ]

Lustre Build: http://newbuild.whamcloud.com/job/lustre-master/275/
Distro/Arch: RHEL5/x86_64(server, OFED 1.5.3.1), SLES11/x86_64(client, in-kernel OFED)

sanity test 180a failed with the same issue: https://maloo.whamcloud.com/test_sets/e4d30bfa-d561-11e0-8d02-52540025f9af

Comment by Andreas Dilger [ 13/Sep/11 ]

Probably related to LU-566.

Comment by Mikhail Pershin [ 05/Oct/11 ]

we are seeing that bug very often in Orion over rhel6. I have an idea that awk has troubles parsing proc files. The patch copy proc file to the regular one and use it, there is fix:

http://review.whamcloud.com/1467

the Maloo run with it shows test 180a pass, but this is only single run.

Comment by Mikhail Pershin [ 18/Oct/11 ]

It seems this is related to our implementation of proc read method there. The amount of data in 'import' is quite big, more than 1024 bytes. Strace shows the following while reading proc file:

open("/proc/fs/lustre/osc/lustre-OST0001-osc-d9f1ea00/import", O_RDONLY|O_LARGEFILE) = 3
fstat64(3, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
ioctl(3, SNDCTL_TMR_TIMEBASE or TCGETS, 0xbfdb9348) = -1 ENOTTY (Inappropriate ioctl for device)
fstat64(3, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
fcntl64(3, F_SETFD, FD_CLOEXEC)         = 0
read(3, "import:\n    name: lustre-OST0001"..., 1024) = 1013

1013 bytes is on my local system, on Toro it is more than 1024, e.g. my local address is just 0@lo while on Toro it is full IP address.
The issue here is fstat64 - it can't get file size for proc file, so following read is using 1024 for read which is less than proc data, probably that is the reason of issue. The lprocfs_rd_import() uses snprintf to output data but it will truncate data so last line may have no final end-of-line symbol, can this cause awk to fail?

If we will use just tmp file with output of proc data that works fine because awk gets file size by fstat64, so this solution will work.
But better to fix our proc read method for 'import' data, using seq_file.

Comment by Mikhail Pershin [ 19/Oct/11 ]

Redo 'import' proc file via seq_file API:

http://review.whamcloud.com/#change,1544

I've made local tests to make sure it is able to read big amount of data, to be sure I've added several line of just garbage so output become bigger than 1024 and run awk with strace:

open("/proc/fs/lustre/osc/lustre-OST0001-osc-d1a26e00/import", O_RDONLY|O_LARGEFILE) = 3
fstat64(3, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
ioctl(3, SNDCTL_TMR_TIMEBASE or TCGETS, 0xbfe6bec8) = -1 ENOTTY (Inappropriate ioctl for device)
fstat64(3, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
fcntl64(3, F_SETFD, FD_CLOEXEC)         = 0
read(3, "import:\n    name: lustre-OST0001"..., 1024) = 1024
read(3, "hat is why I put this line to ex"..., 1024) = 341
fstat64(1, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 0), ...}) = 0
mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb765f000
write(1, "747\n", 4747
)                    = 4
read(3, "", 1024)                       = 0
close(3)                                = 0
exit_group(0)                           = ?

the command was:
strace awk '/usec_per_rpc:/

{print $2}

' /proc/fs/lustre/osc/lustre-OST0001-osc-d1a26e00/import

it searched usec_per_rpc string which is at end of file and we can see that awk did two reads - 1024 bytes and 341 bytes, string was found in second part.

Previous implementation lprocfs_rd_import() wasn't able to return data from offset at all and just cut data at 1024 causing awk to behave unpredictable

Comment by Sarah Liu [ 31/Jan/12 ]

hit the same issue when testing interop between 2.1-wc1 client and 2.1.55 server

Lustre: DEBUG MARKER: == sanity test 180a: test obdecho on osc == 19:30:56 (1328067056)
Lustre: Echo OBD driver; http://www.lustre.org/

      • glibc detected *** awk: malloc(): memory corruption: 0x0000000000f3a1b0 ***
Comment by Peter Jones [ 06/Feb/12 ]

duplicate of LU-760

Generated at Sat Feb 10 01:08:46 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.