[LU-10055] mdt_fill_lvbo() message spew on MDS console Created: 02/Oct/17  Updated: 12/Jul/19  Resolved: 25/Jan/18

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.10.0, Lustre 2.10.1, Lustre 2.11.0
Fix Version/s: Lustre 2.11.0, Lustre 2.10.7

Type: Bug Priority: Major
Reporter: Andreas Dilger Assignee: Mikhail Pershin
Resolution: Fixed Votes: 0
Labels: None

Attachments: Text File secghost.txt    
Issue Links:
Related
is related to LU-9825 Multiple errors on OST/MDS Open
is related to LU-11252 MDS kernel panic when try to umount Open
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

Running the (almost) latest version of b2_10 (see LU-9983 for details), seeing quite a few of these on the MDS console:

/scratch/logs/syslog/soak-8.log:Oct  1 22:37:25 soak-8 kernel: LustreError: 8097:0:(mdt_lvb.c:163:mdt_lvbo_fill()) soaked-MDT0000: expected 944 actual 416.
/scratch/logs/syslog/soak-9.log:Oct  1 22:42:25 soak-9 kernel: LustreError: 2165:0:(mdt_lvb.c:163:mdt_lvbo_fill()) Skipped 6 previous similar messages
/scratch/logs/syslog/soak-9.log:Oct  1 22:42:25 soak-9 kernel: LustreError: 2165:0:(mdt_lvb.c:163:mdt_lvbo_fill()) soaked-MDT0001: expected 872 actual 416.
/scratch/logs/syslog/soak-10.log:Oct  1 22:42:26 soak-10 kernel: LustreError: 2401:0:(mdt_lvb.c:163:mdt_lvbo_fill()) Skipped 10 previous similar messages
/scratch/logs/syslog/soak-10.log:Oct  1 22:42:26 soak-10 kernel: LustreError: 2401:0:(mdt_lvb.c:163:mdt_lvbo_fill()) soaked-MDT0002: expected 872 actual 416.
/scratch/logs/syslog/soak-10.log:Oct  1 22:42:26 soak-10 kernel: LustreError: 4181:0:(mdt_lvb.c:163:mdt_lvbo_fill()) soaked-MDT0002: expected 872 actual 416.
/scratch/logs/syslog/soak-10.log:Oct  1 22:44:04 soak-10 kernel: LustreError: 2351:0:(mdt_lvb.c:163:mdt_lvbo_fill()) soaked-MDT0002: expected 872 actual 416.
/scratch/logs/syslog/soak-9.log:Oct  1 22:44:04 soak-9 kernel: LustreError: 2351:0:(mdt_lvb.c:163:mdt_lvbo_fill()) soaked-MDT0001: expected 848 actual 416.
/scratch/logs/syslog/soak-10.log:Oct  1 22:57:27 soak-10 kernel: LustreError: 4296:0:(mdt_lvb.c:163:mdt_lvbo_fill()) Skipped 8 previous similar messages
/scratch/logs/syslog/soak-10.log:Oct  1 22:57:27 soak-10 kernel: LustreError: 4296:0:(mdt_lvb.c:163:mdt_lvbo_fill()) soaked-MDT0002: expected 872 actual 416.
/scratch/logs/syslog/soak-9.log:Oct  1 22:57:27 soak-9 kernel: LustreError: 2329:0:(mdt_lvb.c:163:mdt_lvbo_fill()) Skipped 9 previous similar messages
/scratch/logs/syslog/soak-9.log:Oct  1 22:57:27 soak-9 kernel: LustreError: 2329:0:(mdt_lvb.c:163:mdt_lvbo_fill()) soaked-MDT0001: expected 800 actual 416.
/scratch/logs/syslog/soak-9.log:Oct  1 22:59:06 soak-9 kernel: LustreError: 2357:0:(mdt_lvb.c:163:mdt_lvbo_fill()) soaked-MDT0001: expected 776 actual 416.


 Comments   
Comment by Andreas Dilger [ 02/Oct/17 ]

Initially I thought this was a harmless message caused by the layout xattr being smaller than expected, but in fact it is the reverse. The LVB buffer is not large enough for the xattr being read from the file.

I suspect that this is caused by PFL and doenn't , as it has also been seen on previous testing (LU-9825).

Comment by Jinshan Xiong (Inactive) [ 02/Oct/17 ]

It should be introduced by PFL where the size of layout becomes larger than it's reserved due to component instantiation. This problem can be solved by extending the reply buffer as we discussed before.

Comment by Andreas Dilger [ 02/Oct/17 ]

It would make sense for clients to just assume enough space for a PFL file to begin with, maybe 3-4 component headers in addition to the stripes in the file. That would quiet the errors on the MDS.

Jinshan, any idea what code path this is affecting? Layout return in LVB for lock enquirer? It doesn't appear to be causing visible errors, but I'm not sure what application that Cliff is running that generates this, or whether it is checking for correctness.

Comment by Jinshan Xiong (Inactive) [ 03/Oct/17 ]

I think the major change would be in ldlm_handle_enqueue0(), where it should expand the reply buffer if found too small.

Actually I tend to think it has nothing to do with client. If the reply buffer is turned out too small on client, the reply will be truncated and client should be able to resend the RPC with bigger reply buffer.

Comment by Mikhail Pershin [ 10/Nov/17 ]

It is fixed in that patch:

https://review.whamcloud.com/#/c/30004/

I will change ticket number if patch will be refreshed

 

Comment by Mikhail Pershin [ 11/Nov/17 ]

I don't think we need to change ldlm_handle_enqueue() for this. This problem occurs in two cases:

1) mdt_max_mdsize is smaller than the layout size and client pack request with not enough size, in that case there will be resend with bigger buffer. This is how that code in mdt_lvbo_fill() is intended to work originally. I think this case don't need to be fixed, it causes such messages quite rare if mdt_max_mdsize is not synced on server and client.

2) mdt_max_mdsize is already big enough and client knows it. But  mdt_intent_layout() pack reply buffer with smaller size. It is not about max_mdsize on client and server at all, it is just wrong size packed because it uses current EA size of file which will be updated to the new EA, so this packed size is wrong from the beginning in most cases. And exactly this case produced a lot of messages in log, because it happens each time with bigger EA size than packed.

Patch solves case 2) by setting reply size to max_mdsize if layout is going to be updated and shrinking it later. This is better than intercepting that in ldlm_handle_enqueue0() and expanding buffer because expanding is more expensive operation then shrinking, the shrinking is part of every reply processing now while expanding is an exception for rare cases. 

Comment by Gerrit Updater [ 25/Jan/18 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/30004/
Subject: LU-10055 mdt: use max_mdsize in reply for layout intent
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 4f27911cadf10d0b2fd6451569e688233eaf50d1

Comment by Peter Jones [ 25/Jan/18 ]

Landed for 2.11

Comment by Mahmoud Hanafi [ 05/Sep/18 ]

Can we get a backport to 2.10.5. We are see this error on our 2.10.5 servers.

Comment by Gerrit Updater [ 10/Sep/18 ]

Minh Diep (mdiep@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/33133
Subject: LU-10055 mdt: use max_mdsize in reply for layout intent
Project: fs/lustre-release
Branch: b2_10
Current Patch Set: 1
Commit: 6e89fbc66336b20821c219fbe38de643ff925afa

Comment by Gerrit Updater [ 10/Sep/18 ]

deleted unused patch

Comment by Sarah Liu [ 07/Dec/18 ]

Hit the problem on 2.10.6-RC3 EL7.6 on soak. IOR-fpp and IOR-ssf failed most of the cases(with only run these 2 kinds of tests), only about 10% passed

see these messages on 4 MDS

[16142.979907] Lustre: Skipped 1 previous similar message
[16161.065259] LustreError: 11-0: soaked-MDT0000-osp-MDT0002: operation out_update to node 192.168.1.109@o2ib failed: rc = -107
[16161.080956] Lustre: soaked-MDT0000-osp-MDT0002: Connection to soaked-MDT0000 (at 192.168.1.109@o2ib) was lost; in progress operations using this service will wait for re
covery to complete
[16167.921747] LustreError: 11-0: soaked-MDT0000-lwp-MDT0002: operation obd_ping to node 192.168.1.109@o2ib failed: rc = -107
[16167.937337] Lustre: soaked-MDT0000-lwp-MDT0002: Connection to soaked-MDT0000 (at 192.168.1.109@o2ib) was lost; in progress operations using this service will wait for re
covery to complete
[16199.921160] Lustre: 12223:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1544058876/real 1544058876]  req@ffff8ac2615
02a00 x1619047196845600/t0(0) o400->MGC192.168.1.108@o2ib@192.168.1.109@o2ib:26/25 lens 224/224 e 0 to 1 dl 1544058883 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
[16199.962451] Lustre: 12223:0:(client.c:2114:ptlrpc_expire_one_request()) Skipped 7 previous similar messages
[16199.976625] LustreError: 166-1: MGC192.168.1.108@o2ib: Connection to MGS (at 192.168.1.109@o2ib) was lost; in progress operations using this service will fail
[16199.999063] LustreError: 167-0: soaked-MDT0000-lwp-MDT0002: This client was evicted by soaked-MDT0000; in progress operations using this service will fail.
[16200.024960] Lustre: soaked-MDT0000-lwp-MDT0002: Connection restored to 192.168.1.108@o2ib (at 192.168.1.108@o2ib)
[16200.039830] Lustre: Skipped 1 previous similar message
[16224.998656] Lustre: Evicted from MGS (at MGC192.168.1.108@o2ib_0) after server handle changed from 0xbd1dde8c58853e94 to 0xec588ff078b4018
[16250.512699] LustreError: 12282:0:(mdt_lvb.c:163:mdt_lvbo_fill()) soaked-MDT0002: expected 416 actual 344.
[16250.526220] LustreError: 12282:0:(mdt_lvb.c:163:mdt_lvbo_fill()) Skipped 33 previous similar messages
[17087.376275] LustreError: 12385:0:(mdt_lvb.c:163:mdt_lvbo_fill()) soaked-MDT0002: expected 416 actual 344.
[91933.141411] LustreError: 17081:0:(mdt_lvb.c:163:mdt_lvbo_fill()) soaked-MDT0002: expected 416 actual 344.
[91933.155290] LustreError: 17081:0:(mdt_lvb.c:163:mdt_lvbo_fill()) Skipped 31 previous similar messages
[92285.346959] LustreError: 12401:0:(mdt_lvb.c:163:mdt_lvbo_fill()) soaked-MDT0002: expected 416 actual 344.
[92285.360864] LustreError: 12401:0:(mdt_lvb.c:163:mdt_lvbo_fill()) Skipped 2 previous similar messages
[93937.039221] LustreError: 17076:0:(mdt_lvb.c:163:mdt_lvbo_fill()) soaked-MDT0002: expected 416 actual 344.
[93937.053175] LustreError: 17076:0:(mdt_lvb.c:163:mdt_lvbo_fill()) Skipped 6 previous similar messages
[94742.219984] LustreError: 12282:0:(mdt_lvb.c:163:mdt_lvbo_fill()) soaked-MDT0002: expected 416 actual 344.
[94742.233998] LustreError: 12282:0:(mdt_lvb.c:163:mdt_lvbo_fill()) Skipped 1 previous similar message
[96280.081566] LustreError: 12286:0:(mdt_lvb.c:163:mdt_lvbo_fill()) soaked-MDT0002: expected 416 actual 344.
[96280.095067] LustreError: 12286:0:(mdt_lvb.c:163:mdt_lvbo_fill()) Skipped 1 previous similar message
[96281.231050] LustreError: 12407:0:(mdt_lvb.c:163:mdt_lvbo_fill()) soaked-MDT0002: expected 416 actual 344.
[96281.244965] LustreError: 12407:0:(mdt_lvb.c:163:mdt_lvbo_fill()) Skipped 11 previous similar messages
[96284.246271] LustreError: 12286:0:(mdt_lvb.c:163:mdt_lvbo_fill()) soaked-MDT0002: expected 776 actual 416.
Comment by Gerrit Updater [ 16/Jan/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/33133/
Subject: LU-10055 mdt: use max_mdsize in reply for layout intent
Project: fs/lustre-release
Branch: b2_10
Current Patch Set:
Commit: 9c57139f261292bbd20e5f304beba4e209a54a40

Generated at Sat Feb 10 02:31:39 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.