[LU-6347] Random 'forrtl: severe (39): error during read' Errors Created: 06/Mar/15  Updated: 30/Apr/15  Resolved: 30/Apr/15

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.3
Fix Version/s: None

Type: Bug
Priority: Major
Reporter: Mahmoud Hanafi
Assignee: Hongchao Zhang
Resolution: Incomplete
Votes: 0
Labels: None

Severity: 3
Rank (Obsolete): 17768

 Description   

We have a user who gets this Fortran error message when reading some restart files. The error moves between files and does not reproduce consistently.
I was able to capture Lustre debug logs during one of these failures.

This was the specific error and the FID of the file:

 Reading file unit:        1232
forrtl: severe (39): error during read, unit 1232, file /nobackupp9/pbalakum/TURBULENCE/3D_TURBULENCE/TURB1_10595_DNS_COMPACT_512_512_512/fort.1232
Image              PC                Routine            Line        Source             
read_file          000000000047C351  Unknown               Unknown  Unknown
read_file          000000000047B325  Unknown               Unknown  Unknown
read_file          000000000043687A  Unknown               Unknown  Unknown
read_file          0000000000408872  Unknown               Unknown  Unknown
read_file          00000000004080A1  Unknown               Unknown  Unknown
read_file          000000000041949F  Unknown               Unknown  Unknown
read_file          0000000000402F5D  Unknown               Unknown  Unknown
read_file          0000000000402BFC  Unknown               Unknown  Unknown
libc.so.6          00007FFFED0F5C36  Unknown               Unknown  Unknown
read_file          0000000000402AF9  Unknown               Unknown  Unknown
r401i2n10 /nobackupp9/pbalakum/TURBULENCE/3D_TURBULENCE/TURB1_10595_DNS_COMPACT_512_512_512 # lfs path2fid  /nobackupp9/pbalakum/TURBULENCE/3D_TURBULENCE/TURB1_10595_DNS_COMPACT_512_512_512/fort.1232
[0x20009c845:0x14ddf:0x0]
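
When searching the client, OSS, and MDS debug logs for this file, it can help to split the FID into its three fields (sequence, object ID, version) and re-emit it in one canonical form. The following is only a minimal sketch; the names `LustreFid`, `parse_fid`, and `format_fid` are hypothetical helpers, not part of any Lustre tool.

```python
import re
from typing import NamedTuple


class LustreFid(NamedTuple):
    """Hypothetical container for the three FID fields."""
    seq: int  # sequence number
    oid: int  # object ID within the sequence
    ver: int  # version


# Accepts the bracketed form printed by `lfs path2fid`, with or without brackets.
_FID_RE = re.compile(
    r"^\[?(0x[0-9a-fA-F]+):(0x[0-9a-fA-F]+):(0x[0-9a-fA-F]+)\]?$"
)


def parse_fid(text: str) -> LustreFid:
    """Parse a FID string such as '[0x20009c845:0x14ddf:0x0]'."""
    match = _FID_RE.match(text.strip())
    if match is None:
        raise ValueError(f"not a FID string: {text!r}")
    return LustreFid(*(int(group, 16) for group in match.groups()))


def format_fid(fid: LustreFid) -> str:
    """Render the FID back in bracketed lowercase-hex form."""
    return "[{:#x}:{:#x}:{:#x}]".format(fid.seq, fid.oid, fid.ver)
```

For example, `format_fid(parse_fid("0x20009C845:0x14DDF:0x0"))` yields the same string as the `lfs path2fid` output above, which makes it easier to search the logs with a single pattern.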

I will upload the debug logs to the FTP site and post the file name.



 Comments   
Comment by Mahmoud Hanafi [ 06/Mar/15 ]

I uploaded the debug logs to ftp.whamcloud.com:uploads/LU6347/lu6347.tar.

service161-service176 are the OSSes.

Contents of lu6347.tar:

clientdebug.error.gz
filerror.out.gz.service161
filerror.out.gz.service162
filerror.out.gz.service163
filerror.out.gz.service164
filerror.out.gz.service165
filerror.out.gz.service166
filerror.out.gz.service167
filerror.out.gz.service168
filerror.out.gz.service169
filerror.out.gz.service170
filerror.out.gz.service171
filerror.out.gz.service172
filerror.out.gz.service173
filerror.out.gz.service174
filerror.out.gz.service175
filerror.out.gz.service176
mdsdebug.out.bz2
Comment by Peter Jones [ 08/Mar/15 ]

Hongchao

Could you please advise on this issue?

Thanks

Peter

Comment by Hongchao Zhang [ 09/Mar/15 ]

Hi Mahmoud,

Thanks for the detailed logs for this ticket! I have checked them, but I can't find an error that could cause
the read failure; the pages of the file [0x20009c845:0x14ddf:0x0] are all read successfully.

Can the file offset be deduced from the log "Reading file unit: 1232 forrtl: severe (39): error during read, unit 1232"
according to your Fortran application? Also, are there any useful logs in the syslog or console? Thanks!
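
One possible way to narrow down an offset, if the application itself cannot report one, is to walk the record framing of the file. The sketch below rests on an assumption the ticket does not confirm: that the restart files are Fortran unformatted sequential files in which each record is wrapped in a leading and trailing 4-byte little-endian length marker. The function name `scan_unformatted_records` is hypothetical, and records split into subrecords (negative markers) are simply reported as the stopping point.

```python
import struct


def scan_unformatted_records(path, marker_fmt="<i"):
    """Walk length-framed records and report where the framing first breaks.

    Returns (records_ok, bad_offset). bad_offset is the byte offset of the
    first record whose framing is inconsistent, or None if the whole file
    parses cleanly. ASSUMPTION: 4-byte little-endian record markers.
    """
    marker_size = struct.calcsize(marker_fmt)
    records_ok = 0
    with open(path, "rb") as f:
        while True:
            start = f.tell()
            head = f.read(marker_size)
            if not head:
                return records_ok, None  # clean end of file
            if len(head) < marker_size:
                return records_ok, start  # truncated leading marker
            (length,) = struct.unpack(marker_fmt, head)
            if length < 0:
                return records_ok, start  # subrecords are not handled here
            # Skip the payload and compare the trailing marker.
            f.seek(start + marker_size + length)
            tail = f.read(marker_size)
            if len(tail) < marker_size:
                return records_ok, start  # truncated payload or marker
            if struct.unpack(marker_fmt, tail)[0] != length:
                return records_ok, start  # leading/trailing mismatch
            records_ok += 1
```

Running this against the same file on a failing and a non-failing attempt would show whether the framing seen by the client differs between reads, and at which offset.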

Comment by Mahmoud Hanafi [ 11/Mar/15 ]

We are not able to get any additional info from the user code or the Fortran library. But we have an Intel compiler ticket regarding this issue, number 6000089383. We would like you to engage your compiler developers to help provide additional info about the specifics of the error.

Comment by Hongchao Zhang [ 12/Mar/15 ]

Thanks, I will contact them about it.

Could you please copy the affected file [0x20009c845:0x14ddf:0x0] out of Lustre and run the application against the copy to see
whether the problem disappears? Thanks!
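
As a rough complement to rerunning the application, the copy can also be compared against repeated reads of the original. This is only a sketch under assumptions: `file_digest` and `compare_repeated_reads` are hypothetical helper names, and repeated reads on one client may be served from its page cache, so the comparison is more meaningful when run from several clients or after the cache has been dropped.

```python
import hashlib


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def compare_repeated_reads(lustre_path, copy_path, attempts=5):
    """Read lustre_path several times and compare each read to the copy.

    Returns the list of attempt indices whose digest differed from the
    digest of copy_path; an empty list means every read matched.
    """
    reference = file_digest(copy_path)
    mismatches = []
    for attempt in range(attempts):
        if file_digest(lustre_path) != reference:
            mismatches.append(attempt)
    return mismatches
```

An empty result across many attempts would suggest the data returned by plain reads is stable, which would point the investigation toward the Fortran runtime's I/O path rather than the file contents.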

Comment by Mahmoud Hanafi [ 12/Mar/15 ]

The key point here is that it is not always the same file, and the error doesn't happen every time on the same file. But I will try to see if I can reproduce it on a different filesystem.

Comment by Hongchao Zhang [ 13/Mar/15 ]

Hi Mahmoud,

Could you please use the "GETLASTERROR" call mentioned in 6000089383 to get the actual error returned by the OS? Thanks!

Comment by John Fuchs-Chesney (Inactive) [ 24/Apr/15 ]

Hello Mahmoud,

Do you need any more Lustre related work done on this ticket?

If not then I would like to close it.

Thanks,
~ jfc.

Comment by Mahmoud Hanafi [ 30/Apr/15 ]

This can be closed, as we have opened LU-6545.

Comment by John Fuchs-Chesney (Inactive) [ 30/Apr/15 ]

Thanks Mahmoud.
~ jfc.

Generated at Sat Feb 10 01:59:25 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.