[LU-11873] sanity test_801a: FAIL: (2) unexpected barrier status 'expired' Created: 18/Jan/19  Updated: 12/Sep/19  Resolved: 21/Aug/19

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.13.0, Lustre 2.12.1
Fix Version/s: Lustre 2.13.0, Lustre 2.12.3

Type: Bug Priority: Minor
Reporter: Jian Yu Assignee: Patrick Farrell (Inactive)
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Related
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

sanity test 801a failed in review-ldiskfs test session on master branch as follows:

trevis-34vm4: Fail to freeze barrier for lustre: Timer expired
CMD: trevis-34vm4 /usr/sbin/lctl get_param -n version 2>/dev/null ||
				/usr/sbin/lctl lustre_build_version 2>/dev/null ||
				/usr/sbin/lctl --version 2>/dev/null | cut -d' ' -f2
CMD: trevis-34vm4 /usr/sbin/lctl barrier_stat -s lustre
 sanity test_801a: @@@@@@ FAIL: (2) unexpected barrier status 'expired' 

Maloo reports:
https://testing.whamcloud.com/test_sets/f49fde42-1b26-11e9-8388-52540065bddc
https://testing.whamcloud.com/test_sets/900f4ab4-1ac0-11e9-8388-52540065bddc
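
For context, the barrier in this test is driven from the MGS with lctl. A minimal sketch of the sequence (run on the MGS node, fsname "lustre" as in the commands above; the 10-second freeze window matches the analysis further down, and barrier_freeze/barrier_thaw are the standard lctl barrier subcommands):

# Freeze the write barrier; the second argument is the expiration timeout in seconds
lctl barrier_freeze lustre 10

# Query the barrier state for this filesystem; the test expects "frozen" here,
# but in the failures above it reports "expired" instead
lctl barrier_stat -s lustre

# Release the barrier once done
lctl barrier_thaw lustre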



 Comments   
Comment by Minh Diep [ 20/Mar/19 ]

+1 on b2_12 https://testing.whamcloud.com/test_sets/e298fc1e-4ad4-11e9-92fe-52540065bddc

Comment by Alex Zhuravlev [ 23/Apr/19 ]

I see this issue with master on a local setup very frequently.

Comment by Chris Horn [ 25/Jun/19 ]

+1 on master https://testing.whamcloud.com/test_sets/9d49e0f0-9756-11e9-8262-52540065bddc

Comment by Patrick Farrell (Inactive) [ 28/Jun/19 ]

Logs from https://testing.whamcloud.com/test_sets/c432303a-9988-11e9-af8b-52540065bddc
(https://review.whamcloud.com/#/c/35352/)

00000004:00080000:0.0:1561700209.521557:0:11001:0:(osp_sync.c:1615:osp_sync_add_commit_cb()) lustre-OST0000-osc-MDT0001: add commit cb at 12164268321642ns, next at 11075831108412ns, rc = 0

12164268321642-11075831108412

On the node where the barrier command is being done, we start setting the barrier:
20000000:00000001:0.0:1561700207.130890:0:28927:0:(mgs_barrier.c:306:mgs_barrier_freeze()) Process entered
20000000:00000001:0.0:1561700209.536789:0:28927:0:(mgs_barrier.c:128:mgs_barrier_glimpse_lock()) Process entered

And it doesn't complete until almost 10 seconds later:
20000000:00000001:1.0:1561700218.029393:0:28927:0:(mgs_barrier.c:205:mgs_barrier_glimpse_lock()) Process leaving via out (rc=0 : 0 : 0x0)
By which time our barrier has expired.
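
Working from the timestamps above (seconds since the epoch), against the ~10-second barrier window discussed below:

1561700218.029393 - 1561700207.130890 ≈ 10.9 s from entering mgs_barrier_freeze() to the glimpse completing
1561700218.029393 - 1561700209.536789 ≈ 8.5 s of that spent waiting on the glimpse reply

So the freeze as a whole overruns the 10-second barrier window.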

Looking at the glimpses, this one:
00000100:00000040:0.0:1561700209.537058:0:28927:0:(lustre_net.h:2491:ptlrpc_rqphase_move()) @@@ move req "New" -> "Rpc" req@ffff93ba50f0cd80 x1637548785242880/t0(0) o106->MGS@10.2.8.58@tcp:15/16 lens 368/224 e 0 to 0 dl 0 ref 1 fl New:QU/0/ffffffff rc 0/-1
is sent normally, nice and quick...

But it doesn't finish until:
00000100:00000200:1.0:1561700218.029267:0:28927:0:(events.c:93:reply_in_callback()) @@@ type 6, status 0 req@ffff93ba50f0cd80 x1637548785242880/t0(0) o106->MGS@10.2.8.58@tcp:15/16 lens 368/224 e 1 to 0 dl 1561700241 ref 1 fl Rpc:RQ/0/ffffffff rc 0/-1

Much later, after the barrier has expired.

Looking at the node where this glimpse was sent, we can see it arriving, and then generating a sync operation as part of turning on the barrier:
00000100:00000200:1.0:1561700209.521162:0:24526:0:(service.c:2234:ptlrpc_server_handle_request()) got req 1637548785242880
[...]
00000004:00080000:1.0:1561700209.521565:0:24526:0:(osp_sync.c:1615:osp_sync_add_commit_cb()) lustre-OST0004-osc-MDT0003: add commit cb at 12164268329849ns, next at 12152647346967ns, rc = 0
12164268329849 ns - 12152647346967 ns --> ~11.6 seconds

And we see:
00080000:00000020:1.0:1561700209.521610:0:24526:0:(osd_handler.c:658:osd_sync()) syncing OSD osd-zfs

And then, after a nice long wait (about 8.5 seconds later):
00080000:00000020:0.0:1561700218.012522:0:24526:0:(osd_handler.c:660:osd_sync()) synced OSD osd-zfs
00000004:00000001:0.0:1561700218.012525:0:24526:0:(lod_dev.c:1583:lod_sync()) Process leaving (rc=0 : 0 : 0)
00200000:40000000:0.0:1561700218.012528:0:24526:0:(barrier.c:221:barrier_freeze()) lustre-MDT0003-osd: barrier freezing phase1 done.
00200000:00000001:0.0:1561700218.012530:0:24526:0:(barrier.c:226:barrier_freeze()) Process leaving (rc=0 : 0 : 0)
00200000:00000001:0.0:1561700218.012531:0:24526:0:(barrier.c:344:barrier_handler()) Process leaving via fini (rc=0 : 0 : 0x0)

And then (finally) we reply.

So the issue is that the ZFS sync interval is around 10 seconds, which is the same as the length of our barrier, so if we get unlucky the sync overruns the barrier before the freeze completes.

It's probably enough to change the barrier length to 15 seconds.
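
A sketch of what that change looks like on the test side (illustrative only; do_facet, $LCTL and $FSNAME are the usual test-framework conventions, and the actual patch below is authoritative):

# Give the freeze a 15-second window instead of 10 so a single ZFS txg
# sync (~10 s here) cannot outlive the barrier while it is being set.
do_facet mgs $LCTL barrier_freeze $FSNAME 15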

Comment by Gerrit Updater [ 28/Jun/19 ]

Patrick Farrell (pfarrell@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/35361
Subject: LU-11873 tests: Increase barrier freeze time
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: f34ecbc3b15c6897be545f8241bf0f626997f183

Comment by Gerrit Updater [ 21/Aug/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/35361/
Subject: LU-11873 tests: Increase barrier freeze time
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 96771280b330af07781326ff8811facd1ca39deb

Comment by Peter Jones [ 21/Aug/19 ]

Landed for 2.13

Comment by Gerrit Updater [ 28/Aug/19 ]

Minh Diep (mdiep@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/35952
Subject: LU-11873 tests: Increase barrier freeze time
Project: fs/lustre-release
Branch: b2_12
Current Patch Set: 1
Commit: b35866abc21a854e75cde15a6562ee2004c64b20

Comment by Gerrit Updater [ 12/Sep/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/35952/
Subject: LU-11873 tests: Increase barrier freeze time
Project: fs/lustre-release
Branch: b2_12
Current Patch Set:
Commit: 3a164a8d81395092add1426549821159db2e33a7
