[LU-10576] sanity test_803: before (425) - 8 < after (425) Created: 29/Jan/18  Updated: 04/Dec/18  Resolved: 14/Feb/18

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.11.0
Fix Version/s: Lustre 2.11.0, Lustre 2.12.0

Type: Bug Priority: Minor
Reporter: Maloo Assignee: nasf (Inactive)
Resolution: Fixed Votes: 0
Labels: dne, ldiskfs

Issue Links:
Duplicate
is duplicated by LU-10605 sanity test_803: <<add short descript... Closed
Related
is related to LU-10570 sanity test_27y: Error: 'Of 2 OSTs, ... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

sanity test_803 - before (425) - 8 < after (425)
^^^^^^^^^^^^^ DO NOT REMOVE LINE ABOVE ^^^^^^^^^^^^^

This issue was created by maloo for Nathaniel Clark <nathaniel.l.clark@intel.com>

This issue relates to the following test suite runs:
https://testing.hpdd.intel.com/test_sets/4e6def2c-050c-11e8-a7cd-52540065bddc
https://testing.hpdd.intel.com/test_sets/aee6ba76-0306-11e8-a10a-52540065bddc
https://testing.hpdd.intel.com/test_sets/d1635240-0300-11e8-a6ad-52540065bddc
https://testing.hpdd.intel.com/test_sets/251bec82-0363-11e8-a10a-52540065bddc

test_803 failed with the following error:

before (425) - 8 < after (425)

Sanity/803 fails in review-dne-part-1

This seems to be an ldiskfs-only bug.



 Comments   
Comment by Andreas Dilger [ 30/Jan/18 ]

This test has failed 15 times in the past week.

Comment by nasf (Inactive) [ 31/Jan/18 ]

https://testing.hpdd.intel.com/test_logs/5059bc9e-050c-11e8-a7cd-52540065bddc/show_text
As shown in the test logs, before the create, the inode usage was:

UUID                      Inodes       IUsed       IFree IUse% Mounted on
lustre-MDT0000_UUID       838864         415      838449   0% /mnt/lustre[MDT:0]
lustre-MDT0001_UUID       838864         316      838548   0% /mnt/lustre[MDT:1]

After the create:

UUID                      Inodes       IUsed       IFree IUse% Mounted on
lustre-MDT0000_UUID       838864         425      838439   0% /mnt/lustre[MDT:0]
lustre-MDT0001_UUID       838864         326      838538   0% /mnt/lustre[MDT:1]

That means the 10 cross-MDT directories were created on MDT0001, and 10 agent objects were created on MDT0000, as expected. But after the unlink, the inode usage was the same as before the unlink:

UUID                      Inodes       IUsed       IFree IUse% Mounted on
lustre-MDT0000_UUID       838864         425      838439   0% /mnt/lustre[MDT:0]
lustre-MDT0001_UUID       838864         326      838538   0% /mnt/lustre[MDT:1]

That means that although the "rm -rf" returned successfully, the target directories were not unlinked, which is why the test failed. I will investigate further why the unlink did not work.
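The check the test performs can be sketched as follows. This is a minimal illustration, not the actual sanity.sh test_803 code: it parses the IUsed column for one MDT out of "lfs df -i"-style output (the sample text below is copied from the logs above) and compares the counts, the way the test compares inode usage around the create/unlink of the remote directories.

```shell
# mdt_iused: extract the IUsed column for the named MDT from
# "lfs df -i"-style output read on stdin (hypothetical helper).
mdt_iused() {
    awk -v tgt="$1" '$1 ~ tgt { print $3 }'
}

sample_before='UUID                      Inodes       IUsed       IFree IUse% Mounted on
lustre-MDT0000_UUID       838864         415      838449   0% /mnt/lustre[MDT:0]
lustre-MDT0001_UUID       838864         316      838548   0% /mnt/lustre[MDT:1]'

sample_after='UUID                      Inodes       IUsed       IFree IUse% Mounted on
lustre-MDT0000_UUID       838864         425      838439   0% /mnt/lustre[MDT:0]
lustre-MDT0001_UUID       838864         326      838538   0% /mnt/lustre[MDT:1]'

before=$(echo "$sample_before" | mdt_iused MDT0000)
after=$(echo "$sample_after" | mdt_iused MDT0000)

# Here the delta of 10 matches the 10 agent objects created on MDT0000;
# after the unlink the test expects IUsed to drop back toward "before".
echo "MDT0000 IUsed: before=$before after=$after delta=$((after - before))"
```

When the post-unlink "lfs df -i" still reports the pre-unlink numbers (as in the logs above), the "before - 8 < after" comparison fails even though the unlink itself succeeded.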

Comment by Gerrit Updater [ 31/Jan/18 ]

Fan Yong (fan.yong@intel.com) uploaded a new patch: https://review.whamcloud.com/31102
Subject: LU-10576 tests: debug patch for sanity test_803
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: d9cb938d91ad1cea476198826ca24037f944f26c

Comment by James Nunez (Inactive) [ 05/Feb/18 ]

As you know, this test only fails for review-dne-part-1 with ldiskfs. Looking at the failures for this test over the past 4 days, sanity test 803 ran 34 times for review-dne-part-1 and failed 10 times. I would expect that your debug patch should have failed by now.

After running your debug patch a few more times, we have two failures at:
https://testing.hpdd.intel.com/test_sets/36b3614c-0ac9-11e8-bd00-52540065bddc
https://testing.hpdd.intel.com/test_sets/ed40d792-0ac8-11e8-a10a-52540065bddc

Comment by nasf (Inactive) [ 06/Feb/18 ]

https://testing.hpdd.intel.com/test_sets/36b3614c-0ac9-11e8-bd00-52540065bddc
The logs on MDT0 show the following.
Before the unlink:

98950:00000004:00000001:1.0:1517870947.709841:0:15350:0:(mdt_handler.c:424:mdt_statfs()) Process entered
98955:00000004:00000024:1.0:1517870947.709853:0:15350:0:(mdt_handler.c:458:mdt_statfs()) blocks cached 0 granted 4243456 pending 0 free 1168535552 avail 1062862848
98973:00000004:00000020:1.0:1517870947.709881:0:15350:0:(mdt_handler.c:468:mdt_statfs()) 291475 blocks: 285287 free, 259488 avail; 838864 objects: 838438 free; state 0
98976:00000004:00000001:1.0:1517870947.709891:0:15350:0:(mdt_handler.c:486:mdt_statfs()) Process leaving (rc=0 : 0 : 0)

After the unlink:

126056:00000004:00000001:0.0:1517870949.625148:0:14574:0:(mdt_handler.c:424:mdt_statfs()) Process entered
126061:00000004:00000024:0.0:1517870949.625154:0:14574:0:(mdt_handler.c:458:mdt_statfs()) blocks cached 0 granted 4243456 pending 0 free 1168535552 avail 1062862848
126079:00000004:00000020:0.0:1517870949.625171:0:14574:0:(mdt_handler.c:468:mdt_statfs()) 291475 blocks: 285287 free, 259488 avail; 838864 objects: 838438 free; state 0
126082:00000004:00000001:0.0:1517870949.625178:0:14574:0:(mdt_handler.c:486:mdt_statfs()) Process leaving (rc=0 : 0 : 0)

But the destroy of the agent object really did happen on MDT0:

00080000:00000001:0.0:1517870947.894254:0:14579:0:(osd_handler.c:1883:osd_trans_stop()) Process entered
00000004:00000040:0.0:1517870947.894256:0:14579:0:(mdt_recovery.c:87:mdt_txn_stop_cb()) More than one transaction
00000001:00000001:0.0:1517870947.894257:0:14579:0:(tgt_lastrcvd.c:1204:tgt_last_rcvd_update()) Process entered
00000001:00000002:0.0:1517870947.894258:0:14579:0:(tgt_lastrcvd.c:1245:tgt_last_rcvd_update()) transno = 55834730919, last_committed = 55834730918
00000001:00000010:0.0:1517870947.894259:0:14579:0:(tgt_lastrcvd.c:909:tgt_last_commit_cb_add()) kmalloced 'ccb': 96 at ffff8800576b39c0.
00000001:00000040:0.0:1517870947.894260:0:14579:0:(tgt_lastrcvd.c:914:tgt_last_commit_cb_add()) callback GETting export ffff880057dd3400 : new cb_count 1
00000020:00000040:0.0:1517870947.894260:0:14579:0:(genops.c:996:class_export_get()) GETting export ffff880057dd3400 : new refcount 17
00000001:00000001:0.0:1517870947.894265:0:14579:0:(tgt_lastrcvd.c:1387:tgt_last_rcvd_update()) Process leaving (rc=0 : 0 : 0)
00040000:00000001:0.0:1517870947.894267:0:14579:0:(qsd_handler.c:1080:qsd_op_end()) Process entered
00040000:00000001:0.0:1517870947.894267:0:14579:0:(qsd_handler.c:1111:qsd_op_end()) Process leaving
00080000:00000010:0.0:1517870947.894268:0:14579:0:(osd_handler.c:3595:osd_process_scheduled_agent_removals()) kfreed 'oor': 32 at ffff8800517297c0.
00080000:00000001:0.0:1517870947.894275:0:14579:0:(osd_handler.c:4705:osd_remote_fid()) Process entered
00080000:00000001:0.0:1517870947.894275:0:14579:0:(osd_handler.c:1838:osd_seq_exists()) Process entered
80000000:00000001:0.0:1517870947.894275:0:14579:0:(fld_handler.c:207:fld_local_lookup()) Process entered
80000000:00000001:0.0:1517870947.894276:0:14579:0:(fld_cache.c:530:fld_cache_lookup()) Process entered
80000000:00000001:0.0:1517870947.894276:0:14579:0:(fld_cache.c:549:fld_cache_lookup()) Process leaving (rc=0 : 0 : 0)
80000000:00000001:0.0:1517870947.894277:0:14579:0:(fld_handler.c:224:fld_local_lookup()) Process leaving (rc=0 : 0 : 0)
00080000:00000001:0.0:1517870947.894277:0:14579:0:(osd_handler.c:1851:osd_seq_exists()) Process leaving (rc=0 : 0 : 0)
00080000:00000001:0.0:1517870947.894278:0:14579:0:(osd_handler.c:4721:osd_remote_fid()) Process leaving (rc=1 : 1 : 1)
00080000:00000001:0.0:1517870947.894310:0:14579:0:(osd_handler.c:1954:osd_trans_stop()) Process leaving (rc=0 : 0 : 0)

So the probable reason is that mdt_statfs() returned cached status.
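The idea behind the eventual fix can be sketched as a poll-and-retry around the statfs query rather than trusting a single snapshot (the landed patch simply sleeps before re-checking; Lustre's test framework also has a wait_update helper for this pattern). Everything below is a hedged stand-in simulation, not real Lustre code: fake_statfs and its state file are hypothetical, standing in for a cached "lfs df -i" query that refreshes after a delay.

```shell
# poll_until_changed OLD CMD [MAX]: run CMD once per second until its
# output differs from OLD or MAX attempts elapse; print the last value.
poll_until_changed() {
    local old="$1" cmd="$2" max=${3:-10} i cur
    for ((i = 0; i < max; i++)); do
        cur=$($cmd)
        if [ "$cur" != "$old" ]; then
            echo "$cur"
            return 0
        fi
        sleep 1
    done
    echo "$cur"
    return 1
}

# Simulated cached statfs: the first two calls return the stale IUsed
# value (425), later calls return the refreshed one (417). The call
# counter is kept in a temp file so it survives command substitution.
state=$(mktemp)
echo 0 > "$state"
fake_statfs() {
    local n
    n=$(cat "$state")
    echo $((n + 1)) > "$state"
    if [ "$n" -lt 2 ]; then echo 425; else echo 417; fi
}

new=$(poll_until_changed 425 fake_statfs 5)
echo "refreshed IUsed: $new"
rm -f "$state"
```

A fixed sleep (as in the landed patch) is the simplest variant of this; polling with a timeout just bounds the wait instead of hard-coding it.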

Comment by Bob Glossman (Inactive) [ 09/Feb/18 ]

I'm wondering if this might be related to LU-10570. That bug also has statfs issues as an underlying cause. Might the fix in flight for that ticket also help here?

Comment by James A Simmons [ 09/Feb/18 ]

Bob, can you try https://review.whamcloud.com/#/c/31158 to see if it fixes it?

Comment by Bob Glossman (Inactive) [ 09/Feb/18 ]

James,
I don't have a good reproducer for this bug so I can't tell if the patch fixes anything.

Bob

Comment by nasf (Inactive) [ 10/Feb/18 ]

What I learned from the log is that the life-cycle of the cached statfs data seems inaccurate. To some degree, that means the test scripts are not robust enough. So even if https://review.whamcloud.com/#/c/31158 helps with this ticket, I still suggest landing the current patch to make the test scripts more robust.

Comment by Bob Glossman (Inactive) [ 10/Feb/18 ]

I only see a debug patch: https://review.whamcloud.com/#/c/31102
Is there an actual fix patch in flight?

Comment by nasf (Inactive) [ 10/Feb/18 ]

Originally, the patch https://review.whamcloud.com/#/c/31102 was for debugging purposes. But the latest version (patch set 6) offers the solution.

Comment by Gerrit Updater [ 14/Feb/18 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/31102/
Subject: LU-10576 tests: sleep seconds to avoid using cached statfs
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: b11cf51be71fb4be431f1c343e4f2df426226d1a

Comment by Peter Jones [ 14/Feb/18 ]

Landed for 2.11

Comment by Gerrit Updater [ 16/Nov/18 ]

Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/33677
Subject: LU-10576 tests: allow log files to be created/removed
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 52df628d2b795f759d0094bed8d3c96ec2b74ced

Comment by Patrick Farrell (Inactive) [ 22/Nov/18 ]

Hit this:
https://testing.whamcloud.com/test_sets/b5b80bfa-ee44-11e8-86c0-52540065bddc

Which is presumably why there's another patch.

Comment by Artem Blagodarenko (Inactive) [ 03/Dec/18 ]

Looks like the same issue: https://testing.whamcloud.com/test_sets/ee2851ce-f4df-11e8-815b-52540065bddc

Comment by Gerrit Updater [ 04/Dec/18 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/33677/
Subject: LU-10576 tests: allow log files to be created/removed
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 0da3d74c1117a6c859da4dec34e739988ed6c5fd
