[LU-10576] sanity test_803: before (425) - 8 < after (425) Created: 29/Jan/18 Updated: 04/Dec/18 Resolved: 14/Feb/18 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.11.0 |
| Fix Version/s: | Lustre 2.11.0, Lustre 2.12.0 |
| Type: | Bug | Priority: | Minor |
| Reporter: | Maloo | Assignee: | nasf (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | dne, ldiskfs | ||
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
sanity test_803 - before (425) - 8 < after (425)

This issue was created by maloo for Nathaniel Clark <nathaniel.l.clark@intel.com>

This issue relates to the following test suite run:

test_803 failed with the following error: before (425) - 8 < after (425)

sanity test_803 fails in review-dne-part-1. This seems to be an ldiskfs-only bug. |
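For readers unfamiliar with the check: the test records the IUsed column of `lfs df -i` for MDT0000 while the test directories still exist, removes them, reads the count again, and fails if roughly 8 or more inodes were not freed. Below is a minimal sketch of that kind of comparison, assuming a /mnt/lustre mount point and a hypothetical test directory name; it is not the actual sanity.sh test_803 code.

```bash
#!/bin/bash
# Hedged sketch of the before/after inode-usage check; MNT, the directory
# name, and the margin of 8 are assumptions, not taken from sanity.sh.
MNT=/mnt/lustre

mdt0_iused() {
    # IUsed column for MDT0000 from "lfs df -i"
    lfs df -i "$MNT" | awk '/MDT:0]/ { print $3 }'
}

before=$(mdt0_iused)          # usage with the test directories in place
rm -rf "$MNT"/test_803_dirs   # hypothetical directories created earlier
after=$(mdt0_iused)           # usage after the removal

# The ticket's message corresponds to a failure condition like this:
if (( before - 8 < after )); then
    echo "FAIL: before ($before) - 8 < after ($after)"
fi
```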
| Comments |
| Comment by Andreas Dilger [ 30/Jan/18 ] |
|
This test has failed 15 times in the past week. |
| Comment by nasf (Inactive) [ 31/Jan/18 ] |
|
https://testing.hpdd.intel.com/test_logs/5059bc9e-050c-11e8-a7cd-52540065bddc/show_text

Before create:

UUID                 Inodes  IUsed   IFree  IUse%  Mounted on
lustre-MDT0000_UUID  838864    415  838449     0%  /mnt/lustre[MDT:0]
lustre-MDT0001_UUID  838864    316  838548     0%  /mnt/lustre[MDT:1]

After create:

UUID                 Inodes  IUsed   IFree  IUse%  Mounted on
lustre-MDT0000_UUID  838864    425  838439     0%  /mnt/lustre[MDT:0]
lustre-MDT0001_UUID  838864    326  838538     0%  /mnt/lustre[MDT:1]

That means the 10 cross-MDT directories were created on MDT0001, and 10 agent objects were created on MDT0000. That is expected. But after the unlink, the object usage was the same as before the unlink:

UUID                 Inodes  IUsed   IFree  IUse%  Mounted on
lustre-MDT0000_UUID  838864    425  838439     0%  /mnt/lustre[MDT:0]
lustre-MDT0001_UUID  838864    326  838538     0%  /mnt/lustre[MDT:1]

That means that although the "rm -rf" returned successfully, the target directories were not unlinked. That is why the test failed. I will investigate further why the unlink did not work. |
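For context on the numbers above: a remote directory created under a parent on MDT0000 but placed on MDT0001 consumes one inode on MDT0001 for the directory itself plus one agent (stub) inode on MDT0000, so creating 10 of them raises IUsed by 10 on each MDT. A hedged way to reproduce that accounting by hand (the mount point and directory names are assumptions, not from the test):

```bash
#!/bin/bash
# Sketch only: observe agent-object accounting for cross-MDT directories.
MNT=/mnt/lustre          # assumed client mount point
mkdir -p "$MNT/parent"   # parent created under the root, which lives on MDT0000

lfs df -i "$MNT"         # note IUsed for MDT0000 and MDT0001

# Create 10 remote directories on MDT0001; each should also add an
# agent inode on MDT0000 (the parent's MDT).
for i in $(seq 1 10); do
    lfs mkdir -i 1 "$MNT/parent/remote_$i"
done

lfs df -i "$MNT"         # IUsed should rise by ~10 on both MDTs

rm -rf "$MNT/parent"     # on a healthy system both counts drop back
lfs df -i "$MNT"
```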
| Comment by Gerrit Updater [ 31/Jan/18 ] |
|
Fan Yong (fan.yong@intel.com) uploaded a new patch: https://review.whamcloud.com/31102 |
| Comment by James Nunez (Inactive) [ 05/Feb/18 ] |
|
As you know, this test only fails for review-dne-part-1 with ldiskfs. Looking at the failures for this test over the past 4 days, sanity test 803 ran 34 times for review-dne-part-1 and failed 10 times, so I would expect your debug patch to have failed by now. After running your debug patch a few more times, we have two failures at: |
| Comment by nasf (Inactive) [ 06/Feb/18 ] |
|
https://testing.hpdd.intel.com/test_sets/36b3614c-0ac9-11e8-bd00-52540065bddc

98950:00000004:00000001:1.0:1517870947.709841:0:15350:0:(mdt_handler.c:424:mdt_statfs()) Process entered
98955:00000004:00000024:1.0:1517870947.709853:0:15350:0:(mdt_handler.c:458:mdt_statfs()) blocks cached 0 granted 4243456 pending 0 free 1168535552 avail 1062862848
98973:00000004:00000020:1.0:1517870947.709881:0:15350:0:(mdt_handler.c:468:mdt_statfs()) 291475 blocks: 285287 free, 259488 avail; 838864 objects: 838438 free; state 0
98976:00000004:00000001:1.0:1517870947.709891:0:15350:0:(mdt_handler.c:486:mdt_statfs()) Process leaving (rc=0 : 0 : 0)

After the unlink:

126056:00000004:00000001:0.0:1517870949.625148:0:14574:0:(mdt_handler.c:424:mdt_statfs()) Process entered
126061:00000004:00000024:0.0:1517870949.625154:0:14574:0:(mdt_handler.c:458:mdt_statfs()) blocks cached 0 granted 4243456 pending 0 free 1168535552 avail 1062862848
126079:00000004:00000020:0.0:1517870949.625171:0:14574:0:(mdt_handler.c:468:mdt_statfs()) 291475 blocks: 285287 free, 259488 avail; 838864 objects: 838438 free; state 0
126082:00000004:00000001:0.0:1517870949.625178:0:14574:0:(mdt_handler.c:486:mdt_statfs()) Process leaving (rc=0 : 0 : 0)

But the destroy of the agent object really did happen on MDT0:

00080000:00000001:0.0:1517870947.894254:0:14579:0:(osd_handler.c:1883:osd_trans_stop()) Process entered
00000004:00000040:0.0:1517870947.894256:0:14579:0:(mdt_recovery.c:87:mdt_txn_stop_cb()) More than one transaction
00000001:00000001:0.0:1517870947.894257:0:14579:0:(tgt_lastrcvd.c:1204:tgt_last_rcvd_update()) Process entered
00000001:00000002:0.0:1517870947.894258:0:14579:0:(tgt_lastrcvd.c:1245:tgt_last_rcvd_update()) transno = 55834730919, last_committed = 55834730918
00000001:00000010:0.0:1517870947.894259:0:14579:0:(tgt_lastrcvd.c:909:tgt_last_commit_cb_add()) kmalloced 'ccb': 96 at ffff8800576b39c0.
00000001:00000040:0.0:1517870947.894260:0:14579:0:(tgt_lastrcvd.c:914:tgt_last_commit_cb_add()) callback GETting export ffff880057dd3400 : new cb_count 1
00000020:00000040:0.0:1517870947.894260:0:14579:0:(genops.c:996:class_export_get()) GETting export ffff880057dd3400 : new refcount 17
00000001:00000001:0.0:1517870947.894265:0:14579:0:(tgt_lastrcvd.c:1387:tgt_last_rcvd_update()) Process leaving (rc=0 : 0 : 0)
00040000:00000001:0.0:1517870947.894267:0:14579:0:(qsd_handler.c:1080:qsd_op_end()) Process entered
00040000:00000001:0.0:1517870947.894267:0:14579:0:(qsd_handler.c:1111:qsd_op_end()) Process leaving
00080000:00000010:0.0:1517870947.894268:0:14579:0:(osd_handler.c:3595:osd_process_scheduled_agent_removals()) kfreed 'oor': 32 at ffff8800517297c0.
00080000:00000001:0.0:1517870947.894275:0:14579:0:(osd_handler.c:4705:osd_remote_fid()) Process entered
00080000:00000001:0.0:1517870947.894275:0:14579:0:(osd_handler.c:1838:osd_seq_exists()) Process entered
80000000:00000001:0.0:1517870947.894275:0:14579:0:(fld_handler.c:207:fld_local_lookup()) Process entered
80000000:00000001:0.0:1517870947.894276:0:14579:0:(fld_cache.c:530:fld_cache_lookup()) Process entered
80000000:00000001:0.0:1517870947.894276:0:14579:0:(fld_cache.c:549:fld_cache_lookup()) Process leaving (rc=0 : 0 : 0)
80000000:00000001:0.0:1517870947.894277:0:14579:0:(fld_handler.c:224:fld_local_lookup()) Process leaving (rc=0 : 0 : 0)
00080000:00000001:0.0:1517870947.894277:0:14579:0:(osd_handler.c:1851:osd_seq_exists()) Process leaving (rc=0 : 0 : 0)
00080000:00000001:0.0:1517870947.894278:0:14579:0:(osd_handler.c:4721:osd_remote_fid()) Process leaving (rc=1 : 1 : 1)
00080000:00000001:0.0:1517870947.894310:0:14579:0:(osd_handler.c:1954:osd_trans_stop()) Process leaving (rc=0 : 0 : 0)

So the possible reason is that mdt_statfs() returned the cached status. |
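The trace quoted above is the standard Lustre debug log. For anyone reproducing the analysis, a log like this can be captured on the MDS around the unlink with the stock lctl debug commands; the sketch below uses an arbitrary output path and an aggressive debug mask.

```bash
#!/bin/bash
# Run on the MDS node. Sketch of capturing a debug trace like the one above.
lctl set_param debug=-1      # enable all debug subsystems/masks
lctl clear                   # drop whatever is already in the debug buffer

# ... trigger the operation under test from a client (e.g. the rm -rf) ...

lctl dk /tmp/mds-debug.log   # dump the kernel debug buffer to a file
lctl set_param debug=0       # 0 disables debug; restore your usual mask afterwards
```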
| Comment by Bob Glossman (Inactive) [ 09/Feb/18 ] |
|
I'm wondering if this might be related to |
| Comment by James A Simmons [ 09/Feb/18 ] |
|
Bob, can you try https://review.whamcloud.com/#/c/31158 to see if it fixes it? |
| Comment by Bob Glossman (Inactive) [ 09/Feb/18 ] |
|
James, Bob |
| Comment by nasf (Inactive) [ 10/Feb/18 ] |
|
What I learned from the log is that the life-cycle of the cached statfs seems inaccurate. To some degree, that means the test scripts are not robust enough. So even if https://review.whamcloud.com/#/c/31158 helps with this ticket, I still suggest landing the current patch to make the test scripts more robust. |
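In that spirit, the test-side hardening essentially means not trusting a single statfs snapshot: sync, then poll `lfs df -i` until the freed inodes become visible or a timeout expires. The sketch below illustrates such a wait loop under assumed names and timeout values; it is not the code in patch 31102.

```bash
#!/bin/bash
# Sketch: wait for MDT0000 inode usage to drop to an expected value,
# tolerating server-side statfs caching. Not the landed test code.
MNT=/mnt/lustre
expected=$1              # e.g. the "before" usage minus the number of removed dirs

wait_mdt0_iused_below() {
    local limit=$1 timeout=${2:-90} iused
    sync                 # flush client-side state; server commits may still lag
    for ((i = 0; i < timeout; i++)); do
        iused=$(lfs df -i "$MNT" | awk '/MDT:0]/ { print $3 }')
        (( iused <= limit )) && return 0
        sleep 1
    done
    echo "timed out: IUsed=$iused, wanted <= $limit" >&2
    return 1
}

wait_mdt0_iused_below "$expected" || exit 1
```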
| Comment by Bob Glossman (Inactive) [ 10/Feb/18 ] |
|
I only see a debug patch shown: https://review.whamcloud.com/#/c/31102 |
| Comment by nasf (Inactive) [ 10/Feb/18 ] |
|
Originally, the patch https://review.whamcloud.com/#/c/31102 was for debugging purposes, but the latest version (patch set 6) contains the fix. |
| Comment by Gerrit Updater [ 14/Feb/18 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/31102/ |
| Comment by Peter Jones [ 14/Feb/18 ] |
|
Landed for 2.11 |
| Comment by Gerrit Updater [ 16/Nov/18 ] |
|
Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/33677 |
| Comment by Patrick Farrell (Inactive) [ 22/Nov/18 ] |
|
Hit this, which is presumably why there's another patch. |
| Comment by Artem Blagodarenko (Inactive) [ 03/Dec/18 ] |
|
Looks like the same issue: https://testing.whamcloud.com/test_sets/ee2851ce-f4df-11e8-815b-52540065bddc |
| Comment by Gerrit Updater [ 04/Dec/18 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/33677/ |