[LU-1353] mdt_reint_open() @@@ OPEN & CREAT not in open replay Created: 27/Apr/12 Updated: 29/May/17 Resolved: 29/May/17 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.4.0, Lustre 2.1.1 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | Ned Bass | Assignee: | Lai Siyao |
| Resolution: | Cannot Reproduce | Votes: | 0 |
| Labels: | llnl | ||
| Environment: | |||
| Attachments: |
|
||||||||||||||||
| Issue Links: |
|
||||||||||||||||
| Sub-Tasks: |
|
||||||||||||||||
| Severity: | 3 | ||||||||||||||||
| Rank (Obsolete): | 9748 | ||||||||||||||||
| Description |
|
We occasionally see the message in the summary show up in the MDS console log during server recovery. What might cause this? |
| Comments |
| Comment by Peter Jones [ 29/Apr/12 ] |
|
Lai Could you please comment on this one? Thanks Peter |
| Comment by Lai Siyao [ 02/May/12 ] |
|
In MDS recovery, all opened files (on client) needs to be opened again, this is called open replay. But in your case, the replayed open failed with -ENOENT, and open is not called with O_CREAT (it means this open should not create a new file), this is abnormal and this error message is printed. Did you see anything wrong after this? If you met this again, could you get /var/log/messages and dump debug log of both client and server? |
| Comment by Ned Bass [ 04/May/12 ] |
|
Attaching Lustre syslog messages from MDS. |
| Comment by Ned Bass [ 04/May/12 ] |
|
I wasn't able to get the debug log files but the attachment has syslog messages from the MDS. The OPEN & CREAT messages appear starting at May 3 12:16:33. We don't see any problems obviously connected with this error, although we are running into other recovery-related bugs in 2.1, namely |
| Comment by Lai Siyao [ 07/May/12 ] |
|
The syslog messages doesn't provide much information. IMHO a possible cause may be that a file was created, and then opened, but MDS failed and unfortunately this inode was not synced to disk yet, then during MDS recovery, client tried to replay open, but MDS couldn't find this file, and printed this error message. If it's all that has happened, current handling looks reasonable. Will you do such recovery test again? If so, I can provide a debug patch to print more information. |
| Comment by Ned Bass [ 07/May/12 ] |
|
We see this on production servers, not during recovery testing. We can include a debug patch in our tree and wait for it to happen again. If we learn that this is a valid case under Lustre's consistency protocol then perhaps the message should not go to the console. |
| Comment by Christopher Morrone [ 07/May/12 ] |
|
If a patch is needed to debug further, then please do work on one for b2_1. |
| Comment by Lai Siyao [ 08/May/12 ] |
|
The debug patch for b2_1 is at http://review.whamcloud.com/#change,2679. |
| Comment by Brian Behlendorf [ 10/May/12 ] |
|
I observed this same issue during recovery on the Orion branch last night, I'll pull the debug patch is to this branch as well. |
| Comment by Christopher Morrone [ 14/May/12 ] |
|
I've pulled it into the LLNL 2.1 branch too. It will first appear int 2.1.1-12chaos. |
| Comment by Peter Jones [ 04/Jun/12 ] |
|
Chris When do you expect that version of chaos to be deployed into production? Peter |
| Comment by Christopher Morrone [ 04/Jun/12 ] |
|
2.1.1-13chaos (or newer) will likely start rolling out into production about 2 weeks from now. |
| Comment by Prakash Surya (Inactive) [ 27/Nov/12 ] |
|
I got a hit of this with the debug patch applied yesterday. Here's the relevant console output: Lustre: lstest-MDT0000: Will be in recovery for at least 5:00, or until 265 clients reconnect. LustreError: 33073:0:(mdt_open.c:1328:mdt_reint_open()) @@@ [0x2000182dc:0x156a9:0x0]/simul_open.0->[0x2000182dc:0x1f2d1:0x0] cr_flags=03 mode=0100000 msg_flag=0x4 not found in open replay. req@ffff881fd78c5050 x1419179046425207/t0(201876066869) o101->bfcbc1b8-9a26-425f-5198-348ff50beb40@172.20.3.120@o2ib500:0/0 lens 568/1136 e 0 to 0 dl 1353975967 ref 1 fl Complete:/4/0 rc 0/0 2012-11-26 16:25:06 LustreError: 33073:0:(mdt_open.c:1328:mdt_reint_open()) @@@ [0x2000182dc:0x156a9:0x0]/simul_close.0->[0x2000182dc:0x1f2d2:0x0] cr_flags=03 mode=0100000 msg_flag=0x4 not found in open replay. req@ffff881fd7fe1c50 x1419179046425214/t0(201876066900) o101->bfcbc1b8-9a26-425f-5198-348ff50beb40@172.20.3.120@o2ib500:0/0 lens 568/1136 e 0 to 0 dl 1353975967 ref 1 fl Complete:/4/0 rc 0/0 2012-11-26 16:25:07 LustreError: 33073:0:(mdt_open.c:1328:mdt_reint_open()) @@@ [0x2000182dc:0x156a9:0x0]/simul_open.0->[0x2000182dc:0x1f380:0x0] cr_flags=03 mode=0100000 msg_flag=0x4 not found in open replay. req@ffff881fd80c9450 x1419179046426510/t0(201876068937) o101->bfcbc1b8-9a26-425f-5198-348ff50beb40@172.20.3.120@o2ib500:0/0 lens 568/1136 e 0 to 0 dl 1353975968 ref 1 fl Complete:/4/0 rc 0/0 2012-11-26 16:25:07 LustreError: 33073:0:(mdt_open.c:1328:mdt_reint_open()) Skipped 43 previous similar messages 2012-11-26 16:25:08 LustreError: 33073:0:(mdt_open.c:1328:mdt_reint_open()) @@@ [0x2000182dc:0x156a9:0x0]/simul_lseek.0->[0x2000182dc:0x1f45a:0x0] cr_flags=03 mode=0100000 msg_flag=0x4 not found in open replay. req@ffff881fd675f850 x1419179046428033/t0(201876070976) o101->bfcbc1b8-9a26-425f-5198-348ff50beb40@172.20.3.120@o2ib500:0/0 lens 568/1136 e 0 to 0 dl 1353975969 ref 1 fl Complete:/4/0 rc 0/0 2012-11-26 16:25:08 LustreError: 33073:0:(mdt_open.c:1328:mdt_reint_open()) Skipped 56 previous similar messages 2012-11-26 16:25:09 Lustre: lstest-MDT0000: Recovery over after 1:15, of 265 clients 256 recovered and 9 were evicted. |
| Comment by Lai Siyao [ 29/Nov/12 ] |
|
Could you help verify the fid not found really doesn't exist on MDS? Firstly you can use `stat <mountpoint>/.lustre/fid/[0x2000182dc:0x1f2d1:0x0]` to verify whether fid exists. If this fid exists, then you can use `lfs fid2path <fsname> [0x2000182dc:0x1f2d1:0x0]` to check the path of this fid. Else you can mount MDS filesystem as ldiskfs, and check <mountpoint>/PENDING, is it empty? BTW, do you know how simul runs in during this MDS recovery? Is there any hint on whether these not found files been unlinked? |
| Comment by Prakash Surya (Inactive) [ 29/Nov/12 ] |
|
Here you go: # grove513 /root > stat /p/lstest/.lustre/fid/[0x2000182dc:0x1f2d1:0x0] stat: cannot stat `/p/lstest/.lustre/fid/[0x2000182dc:0x1f2d1:0x0]': No such file or directory # grove-mds2 /mnt/snap > ls -Rlan PENDING/ PENDING/: total 29 drwxr-xr-x 2 0 0 2 Dec 31 1969 . drwxr-xr-x 2 0 0 2 Dec 31 1969 . drwxr-xr-x 4 0 0 2 Sep 28 15:39 .. drwx------ 2 37693 37693 2 Nov 28 18:21 0000000200018ee8:0000000f:00000000: 0 PENDING/0000000200018ee8:0000000f:00000000: 0: total 16 drwx------ 2 37693 37693 2 Nov 28 18:21 . drwx------ 2 37693 37693 2 Nov 28 18:21 . drwxr-xr-x 2 0 0 2 Dec 31 1969 .. Would it still be around a few days later?
I'm not sure off the top of my head.. |
| Comment by Lai Siyao [ 10/Dec/12 ] |
|
The code looks correct in all open, unlink handling, we need to find a simple way to reproduce this to collect more debug logs with 'vfstrace,inode' enabled on both MDS and clients. |
| Comment by Andreas Dilger [ 29/May/17 ] |
|
Close old ticket. |