[LU-9820] sanity-scrub test_9 fails with "(9) Expected 'scanning' on mds*" Created: 02/Aug/17  Updated: 26/Feb/21  Resolved: 26/Feb/21

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.11.0, Lustre 2.10.3, Lustre 2.10.6, Lustre 2.10.7
Fix Version/s: Lustre 2.15.0

Type: Bug Priority: Minor
Reporter: James Nunez (Inactive) Assignee: Lai Siyao
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Related
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

sanity-scrub test 9 fails with

sanity-scrub test_9: @@@@@@ FAIL: (9) Expected 'scanning' on mds3 

From the test_log, the last thing we see is:

CMD: trevis-8vm4 /usr/sbin/lctl lfsck_start -M lustre-MDT0000 -t scrub 8 -s 100 -r
Started LFSCK on the device lustre-MDT0000: scrub
CMD: trevis-8vm8 /usr/sbin/lctl lfsck_start -M lustre-MDT0001 -t scrub 8 -s 100 -r
Started LFSCK on the device lustre-MDT0001: scrub
CMD: trevis-8vm4 /usr/sbin/lctl lfsck_start -M lustre-MDT0002 -t scrub 8 -s 100 -r
Started LFSCK on the device lustre-MDT0002: scrub
CMD: trevis-8vm8 /usr/sbin/lctl lfsck_start -M lustre-MDT0003 -t scrub 8 -s 100 -r
Started LFSCK on the device lustre-MDT0003: scrub
CMD: trevis-8vm4 /usr/sbin/lctl get_param -n 			osd-ldiskfs.lustre-MDT0000.oi_scrub |
			awk '/^status/ { print \$2 }'
CMD: trevis-8vm4 /usr/sbin/lctl get_param -n 			osd-ldiskfs.lustre-MDT0000.oi_scrub |
			awk '/^status/ { print \$2 }'
CMD: trevis-8vm8 /usr/sbin/lctl get_param -n 			osd-ldiskfs.lustre-MDT0001.oi_scrub |
			awk '/^status/ { print \$2 }'
CMD: trevis-8vm8 /usr/sbin/lctl get_param -n 			osd-ldiskfs.lustre-MDT0001.oi_scrub |
			awk '/^status/ { print \$2 }'
CMD: trevis-8vm4 /usr/sbin/lctl get_param -n 			osd-ldiskfs.lustre-MDT0002.oi_scrub |
			awk '/^status/ { print \$2 }'
CMD: trevis-8vm4 /usr/sbin/lctl get_param -n 			osd-ldiskfs.lustre-MDT0002.oi_scrub |
			awk '/^status/ { print \$2 }'
Waiting 6 secs for update
CMD: trevis-8vm4 /usr/sbin/lctl get_param -n 			osd-ldiskfs.lustre-MDT0002.oi_scrub |
			awk '/^status/ { print \$2 }'
CMD: trevis-8vm4 /usr/sbin/lctl get_param -n 			osd-ldiskfs.lustre-MDT0002.oi_scrub |
			awk '/^status/ { print \$2 }'
CMD: trevis-8vm4 /usr/sbin/lctl get_param -n 			osd-ldiskfs.lustre-MDT0002.oi_scrub |
			awk '/^status/ { print \$2 }'
CMD: trevis-8vm4 /usr/sbin/lctl get_param -n 			osd-ldiskfs.lustre-MDT0002.oi_scrub |
			awk '/^status/ { print \$2 }'
CMD: trevis-8vm4 /usr/sbin/lctl get_param -n 			osd-ldiskfs.lustre-MDT0002.oi_scrub |
			awk '/^status/ { print \$2 }'
CMD: trevis-8vm4 /usr/sbin/lctl get_param -n 			osd-ldiskfs.lustre-MDT0002.oi_scrub |
			awk '/^status/ { print \$2 }'
Update not seen after 6s: wanted 'scanning' got 'completed'
 sanity-scrub test_9: @@@@@@ FAIL: (9) Expected 'scanning' on mds3 

There are no obvious errors in the console and debug logs.

So far, this looks like it only impacts DNE Lustre configurations (review-dne-part-*).
This test started to fail on June 2, 2017.
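
In outline, the test starts a rate-limited OI scrub on every MDT and then verifies that each is still in the 'scanning' state. A minimal bash sketch of that sequence (the lctl arguments are taken from the log above; the loop and the do_facet helper follow the usual test-framework style and are assumptions, not the literal test_9 code):

	# Start OI scrub on all four MDTs, throttled to 100 objects/sec
	# (-s 100), then expect to observe each one still 'scanning'.
	for n in 0 1 2 3; do
		do_facet mds$((n + 1)) \
			"lctl lfsck_start -M lustre-MDT000$n -t scrub -s 100 -r"
	done
	scrub_check_status 9 scanning	# fails here: mds3 already 'completed'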

Logs for these test failures are at:
https://testing.hpdd.intel.com/test_sets/f934f42a-772a-11e7-8e04-5254006e85c2
https://testing.hpdd.intel.com/test_sets/6112f68a-76f1-11e7-8db2-5254006e85c2
https://testing.hpdd.intel.com/test_sets/57e14740-7626-11e7-9a53-5254006e85c2
https://testing.hpdd.intel.com/test_sets/f429dfd0-733e-11e7-8d7d-5254006e85c2
https://testing.hpdd.intel.com/test_sets/cd0041d8-6eac-11e7-a055-5254006e85c2
https://testing.hpdd.intel.com/test_sets/200eed24-6d32-11e7-a052-5254006e85c2
https://testing.hpdd.intel.com/test_sets/76ef4bb6-6841-11e7-a74b-5254006e85c2



 Comments   
Comment by Bob Glossman (Inactive) [ 08/Sep/17 ]

more on master:
https://testing.hpdd.intel.com/test_sets/b09ea26e-8cbb-11e7-b94a-5254006e85c2
https://testing.hpdd.intel.com/test_sets/c519fe96-951c-11e7-b89e-5254006e85c2

Comment by Bruno Faccini (Inactive) [ 04/Oct/17 ]

One more on master:
https://testing.hpdd.intel.com/test_sessions/b7e37542-ad38-4029-8e4b-bc63b0d31e9d

Comment by Bob Glossman (Inactive) [ 08/Oct/17 ]

another on master:
https://testing.hpdd.intel.com/test_sets/d0440872-abcc-11e7-b78a-5254006e85c2

Comment by Sebastien Buisson (Inactive) [ 20/Nov/17 ]

+1 on master:
https://testing.hpdd.intel.com/test_sets/e1b33714-cc0f-11e7-8027-52540065bddc

Comment by Jinshan Xiong (Inactive) [ 25/Nov/17 ]

https://testing.hpdd.intel.com/test_sessions/2b869baa-72d0-4873-9cb9-4dc7e80ae8bd

Comment by Mikhail Pershin [ 13/Dec/17 ]

https://testing.hpdd.intel.com/test_sets/65acf810-df85-11e7-9840-52540065bddc

Comment by Jian Yu [ 27/Dec/17 ]

+2 on master branch:
https://testing.hpdd.intel.com/test_sets/7d9da346-e658-11e7-9840-52540065bddc
https://testing.hpdd.intel.com/test_sets/d848a0aa-ed35-11e7-a169-52540065bddc

Comment by Nathaniel Clark [ 08/Jan/18 ]

This looks like a race: with DNE, each MDT is waited on serially in sanity-scrub.sh::scrub_check_status(), so if one of the earlier MDTs takes a long time (the test log shows "Waiting 6 secs for update"), the later MDTs may complete their scrub before the test gets a chance to observe them as "scanning". See the sketch below.
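
For reference, a paraphrased sketch of the serial wait described above (assumption: this approximates sanity-scrub.sh::scrub_check_status() rather than quoting it; wait_update_facet polls the given command until it returns the expected value or the timeout expires):

	# Each MDT is polled in turn; while earlier MDTs are being started
	# and polled, a later MDT's scrub can run to completion, so by the
	# time it is polled its status is already 'completed' (mds3 in the
	# log above).
	scrub_check_status() {
		local error_id=$1 expected=$2 n

		for n in $(seq $MDSCOUNT); do
			wait_update_facet mds$n "lctl get_param -n \
				osd-ldiskfs.lustre-MDT000$((n - 1)).oi_scrub |
				awk '/^status/ { print \\\$2 }'" "$expected" 6 ||
				error "($error_id) Expected '$expected' on mds$n"
		done
	}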

Comment by Andreas Dilger [ 27/Feb/18 ]

+1 on master:
https://testing.hpdd.intel.com/test_sets/98f2733e-18ac-11e8-bd00-52540065bddc

Comment by Saurabh Tandan (Inactive) [ 11/Apr/18 ]

+1 on 2.10.3 for

RHEL 7.4 Server/DNE/ldiskfs
RHEL 7.4 Client

build 96

https://testing.hpdd.intel.com/test_sets/6bba7824-3d84-11e8-960d-52540065bddc

Comment by Gerrit Updater [ 10/Nov/20 ]

Lai Siyao (lai.siyao@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/40591
Subject: LU-9820 osd-ldiskfs: OI scrub speed limit fix
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 94ce8731c1ff9e95366b57c9ee919bd621af96a5
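
Given the symptom above ('completed' despite -s 100), the fix presumably makes the requested speed limit actually throttle the scan, so a scrub of a small namespace stays in 'scanning' long enough for the serial checks to observe it. A conceptual shell illustration of such per-second throttling (the actual patch is C code in osd-ldiskfs; this sketch only shows the idea, and scan_one_object is a hypothetical stand-in for the per-object work):

	# After every 'limit' objects scanned, sleep out whatever remains
	# of the current second, capping throughput at 'limit' objects/sec.
	limit=100	# matches "lfsck_start ... -s 100" above
	count=0
	marker=$SECONDS
	scan_one_object() { :; }	# hypothetical per-object work
	for obj in obj{1..1000}; do
		scan_one_object "$obj"
		if ((++count >= limit)); then
			((SECONDS == marker)) && sleep 1
			count=0
			marker=$SECONDS
		fi
	done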

Comment by Gerrit Updater [ 26/Feb/21 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/40591/
Subject: LU-9820 osd-ldiskfs: OI scrub speed limit fix
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: fc8f1381698369565752f32815c4b38f9c1e8cdb

Comment by Peter Jones [ 26/Feb/21 ]

Landed for 2.15
