[LU-11873] sanity test_801a: FAIL: (2) unexpected barrier status 'expired' Created: 18/Jan/19 Updated: 12/Sep/19 Resolved: 21/Aug/19 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.13.0, Lustre 2.12.1 |
| Fix Version/s: | Lustre 2.13.0, Lustre 2.12.3 |
| Type: | Bug | Priority: | Minor |
| Reporter: | Jian Yu | Assignee: | Patrick Farrell (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Severity: | 3 | ||||
| Rank (Obsolete): | 9223372036854775807 | ||||
| Description |
|
sanity test 801a failed in review-ldiskfs test session on master branch as follows:

trevis-34vm4: Fail to freeze barrier for lustre: Timer expired
CMD: trevis-34vm4 /usr/sbin/lctl get_param -n version 2>/dev/null || /usr/sbin/lctl lustre_build_version 2>/dev/null || /usr/sbin/lctl --version 2>/dev/null | cut -d' ' -f2
CMD: trevis-34vm4 /usr/sbin/lctl barrier_stat -s lustre
sanity test_801a: @@@@@@ FAIL: (2) unexpected barrier status 'expired'

Maloo reports: |
| Comments |
| Comment by Minh Diep [ 20/Mar/19 ] |
|
+1 on b2_12 https://testing.whamcloud.com/test_sets/e298fc1e-4ad4-11e9-92fe-52540065bddc |
| Comment by Alex Zhuravlev [ 23/Apr/19 ] |
|
I see this issue with master on a local setup very frequently. |
| Comment by Chris Horn [ 25/Jun/19 ] |
|
+1 on master https://testing.whamcloud.com/test_sets/9d49e0f0-9756-11e9-8262-52540065bddc |
| Comment by Patrick Farrell (Inactive) [ 28/Jun/19 ] |
|
Logs from https://testing.whamcloud.com/test_sets/c432303a-9988-11e9-af8b-52540065bddc

00000004:00080000:0.0:1561700209.521557:0:11001:0:(osp_sync.c:1615:osp_sync_add_commit_cb()) lustre-OST0000-osc-MDT0001: add commit cb at 12164268321642ns, next at 11075831108412ns, rc = 0

12164268321642-11075831108412

On the node where the barrier command is being done, we start setting the barrier:

And it doesn't complete until almost 10 seconds later:

Looking at the glimpses, this one:

But it doesn't finish until:

Much later, after the barrier has expired.

Looking at the node where this glimpse was sent, we can see it arriving, and then generating a sync operation as part of turning on the barrier:

And we see:

And then, after a nice long wait:

And then (finally) we reply.

So the issue is the ZFS sync interval being around 10 seconds, which is the same as the length of our barrier. So if we get unlucky, we'll overrun it. It's probably enough to change the barrier length to 15 seconds. |
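A rough sketch of the timing argument above, not taken from the Lustre code. It assumes the freeze has to wait for the next backend sync, and that in the worst case this wait is a full sync interval. The function name `barrier_can_expire` and the optional `overhead_s` parameter are hypothetical, and the 10 and 15 second values are simply the numbers quoted in this comment.

```python
def barrier_can_expire(barrier_timeout_s, sync_interval_s, overhead_s=0.0):
    """Return True if a worst-case wait for the next sync could outlast the barrier.

    Assumption (illustrative model only): the freeze completes once the next
    sync finishes, and the worst-case wait is one full sync interval plus any
    extra overhead.
    """
    worst_case_wait_s = sync_interval_s + overhead_s
    return worst_case_wait_s >= barrier_timeout_s


# Barrier length equal to the sync interval: an unlucky run can overrun it.
print(barrier_can_expire(barrier_timeout_s=10, sync_interval_s=10))

# Barrier length of 15 seconds leaves margin over a ~10 second sync interval.
print(barrier_can_expire(barrier_timeout_s=15, sync_interval_s=10))
```

Under this model the margin is only as good as the assumed sync interval; a slower sync or a nonzero `overhead_s` shrinks it.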
| Comment by Gerrit Updater [ 28/Jun/19 ] |
|
Patrick Farrell (pfarrell@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/35361 |
| Comment by Gerrit Updater [ 21/Aug/19 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/35361/ |
| Comment by Peter Jones [ 21/Aug/19 ] |
|
Landed for 2.13 |
| Comment by Gerrit Updater [ 28/Aug/19 ] |
|
Minh Diep (mdiep@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/35952 |
| Comment by Gerrit Updater [ 12/Sep/19 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/35952/ |