[LU-9157] replay-single test_80c: rmdir failed Created: 24/Feb/17  Updated: 26/Feb/19  Resolved: 12/Sep/18

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.10.0, Lustre 2.10.1, Lustre 2.11.0, Lustre 2.10.2, Lustre 2.12.0, Lustre 2.10.3, Lustre 2.10.4, Lustre 2.10.5, Lustre 2.10.7
Fix Version/s: None

Type: Bug Priority: Major
Reporter: James Casper Assignee: Lai Siyao
Resolution: Duplicate Votes: 0
Labels: None
Environment:

trevis-46vm1/2/3/7/8, Full Group test w/DNE,
RHEL7.3/zfs, branch master, v2.9.53.19.gdad9ce4, b3534


Issue Links:
Duplicate
duplicates LU-10740 replay-single test_2d: FAIL: checksta... Resolved
duplicates LU-10143 LBUG dt_object.h:2166:dt_declare_reco... Resolved
Related
is related to LU-11330 replay-single test_70d: Directory not... Resolved
is related to LU-11366 replay-single timeout test 80f: rm: c... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

https://testing.hpdd.intel.com/test_sessions/13a7c94c-f92a-11e6-aa39-5254006e85c2

From test_log:

rm: cannot remove '/mnt/lustre/d80c.replay-single/remote_dir': Input/output error
 replay-single test_80c: @@@@@@ FAIL: rmdir failed 
  Trace dump:
  = /usr/lib64/lustre/tests/test-framework.sh:4841:error()
  = /usr/lib64/lustre/tests/replay-single.sh:2637:test_80c()
  = /usr/lib64/lustre/tests/test-framework.sh:5117:run_one()
  = /usr/lib64/lustre/tests/test-framework.sh:5156:run_one_logged()
  = /usr/lib64/lustre/tests/test-framework.sh:5003:run_test()
  = /usr/lib64/lustre/tests/replay-single.sh:2641:main()


 Comments   
Comment by Sarah Liu [ 20/Jun/17 ]

MDS console

https://testing.hpdd.intel.com/test_sets/93814ab4-53f0-11e7-bacf-5254006e85c2

10:16:00:[59442.222216] Lustre: DEBUG MARKER: mkdir -p /mnt/lustre-mds2; mount -t lustre   		                   lustre-mdt2/mdt2 /mnt/lustre-mds2
10:16:00:[59442.618812] Lustre: lustre-MDT0001: Imperative Recovery not enabled, recovery window 60-180
10:16:00:[59442.622180] LustreError: 3386:0:(osd_oi.c:503:osd_oid()) lustre-MDT0001-osd: unsupported quota oid: 0x16
10:16:00:[59442.829897] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n health_check
10:16:00:[59443.177304] Lustre: DEBUG MARKER: PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/usr/lib64/lustre/u
10:16:00:[59443.788151] Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-55vm3.trevis.hpdd.intel.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
10:16:00:[59443.792192] Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-55vm3.trevis.hpdd.intel.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
10:16:00:[59444.131692] Lustre: DEBUG MARKER: trevis-55vm3.trevis.hpdd.intel.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
10:19:58:[59444.136692] Lustre: DEBUG MARKER: trevis-55vm3.trevis.hpdd.intel.com: executing set_default_debug vfstrace rpctrace dlmtrace neterror ha config ioctl super lfsck all 4
10:19:58:[59444.397031] Lustre: DEBUG MARKER: lctl set_param -n mdt.lustre*.enable_remote_dir=1
10:19:58:[59444.739127] Lustre: DEBUG MARKER: zfs get -H -o value 				lustre:svname lustre-mdt2/mdt2 2>/dev/null | 				grep -E ':[a-zA-Z]{3}[0-9]{4}'
10:19:58:[59445.082988] Lustre: DEBUG MARKER: zfs get -H -o value 				lustre:svname lustre-mdt2/mdt2 2>/dev/null | 				grep -E ':[a-zA-Z]{3}[0-9]{4}'
10:19:58:[59445.432320] Lustre: DEBUG MARKER: zfs get -H -o value lustre:svname 		                           lustre-mdt2/mdt2 2>/dev/null
10:19:58:[59446.305312] Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-55vm2.trevis.hpdd.intel.com: executing wait_import_state_mount FULL mdc.lustre-MDT0001-mdc-*.mds_server_uuid
10:19:58:[59446.305685] Lustre: DEBUG MARKER: /usr/sbin/lctl mark trevis-55vm1.trevis.hpdd.intel.com: executing wait_import_state_mount FULL mdc.lustre-MDT0001-mdc-*.mds_server_uuid
10:19:58:[59446.626587] Lustre: DEBUG MARKER: trevis-55vm1.trevis.hpdd.intel.com: executing wait_import_state_mount FULL mdc.lustre-MDT0001-mdc-*.mds_server_uuid
10:19:58:[59446.628706] Lustre: DEBUG MARKER: trevis-55vm2.trevis.hpdd.intel.com: executing wait_import_state_mount FULL mdc.lustre-MDT0001-mdc-*.mds_server_uuid
10:19:58:[59447.057549] Lustre: lustre-MDT0001: Will be in recovery for at least 1:00, or until 5 clients reconnect
10:19:58:[59447.570240] Lustre: lustre-MDT0001: Recovery over after 0:01, of 5 clients 5 recovered and 0 were evicted.
10:19:58:[59448.124116] Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 1 sec
10:19:58:[59448.124291] Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 1 sec
10:19:58:[59448.512706] Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 1 sec
10:19:58:[59448.527843] Lustre: DEBUG MARKER: mdc.lustre-MDT0001-mdc-*.mds_server_uuid in FULL state after 1 sec
10:19:58:[59450.364950] Lustre: DEBUG MARKER: /usr/sbin/lctl mark  replay-single test_80c: @@@@@@ FAIL: rmdir failed 
10:19:58:[59450.685468] Lustre: DEBUG MARKER: replay-single test_80c: @@@@@@ FAIL: rmdir failed
10:19:58:[59451.049620] Lustre: DEBUG MARKER: /usr/sbin/lctl dk > /test_logs/2017-06-16/lustre-master-el7-x86_64--full--1_19_1__3603__-70308503758260-171737/replay-single.test_80c.debug_log.$(hostname -s).1497694559.log;
10:19:58:[59451.049620]          dmesg > /test_logs/2017-06-16/lustre-master-el7-x86_64--full--1_19_1__3
10:19:58:[59453.691066] Lustre: DEBUG MARKER: lctl set_param -n fail_loc=0 	    fail_val=0 2>/dev/null
10:19:58:[59455.834157] Lustre: DEBUG MARKER: for i in $(find /proc/spl -type f); do
10:19:58:[59455.834157] 				echo ====${i}=======================;
10:19:58:[59455.834157] 				cat ${i};
10:19:58:[59455.834157] 				done
10:19:58:[59456.934168] Lustre: DEBUG MARKER: echo ====zpool events -v=======================;
10:19:58:[59456.934168] 				zpool events -v
10:19:58:[59457.623449] LustreError: 3434:0:(llog_cat.c:763:llog_cat_cancel_records()) lustre-MDT0000-osp-MDT0001: fail to cancel 0 of 1 llog-records: rc = -116
10:19:58:[59457.752332] Lustre: DEBUG MARKER: /usr/sbin/lctl mark == replay-single test 80d: DNE: create remote dir, drop update rep from MDT1, fail 2 MDTs ============ 10:16:06 \(1497694566\)
Comment by James Nunez (Inactive) [ 08/Aug/18 ]

Some recent failures:
https://testing.whamcloud.com/test_sets/fb3a0a6c-99e8-11e8-b0aa-52540065bddc
https://testing.whamcloud.com/test_sets/fc2b404a-9a78-11e8-a9f7-52540065bddc

It fails 100% of the time on master with review-dne-zfs-part-4.

Comment by Peter Jones [ 08/Aug/18 ]

Lai

Could you please advise?

Thanks

Peter

Comment by Lai Siyao [ 04/Sep/18 ]

The failure should be the same as LU-10740, the logs shows the same FID was allocated twice, which cause llog_osd_create() failed with -EEXIST, and mdt_reint_unlink() failed with -EIO in the end:

40000000:00000040:0.0:1535086843.183909:0:24271:0:(fid_request.c:400:seq_client_alloc_fid()) cli-cli-lustre-MDT0000-osp-MDT0001: Allocated FID [0x20000bb98:0x1:0x0]
00000004:00000040:0.0:1535086843.183917:0:24271:0:(mdt_handler.c:5324:mdt_object_init()) object init, fid = [0x20000bb98:0x1:0x0]
00000004:00000040:1.0:1535086843.184867:0:23089:0:(osp_trans.c:265:object_update_request_dump()) i = 0 fid = [0x20000bb98:0x1:0x0] op = create params = 1 batchid = 0 size = 256 repsize 0
00000040:00001000:0.0:1535086843.213847:0:24271:0:(llog_osd.c:399:llog_osd_write_rec()) new record 106a0000 to [0x20000bb98:0x1:0x0]
00000004:00000040:0.0:1535086843.213858:0:24271:0:(osp_trans.c:265:object_update_request_dump()) i = 0 fid = [0x20000bb98:0x1:0x0] op = attr_get params = 0 batchid = 0 size = 40 repsize 208
00000004:00000040:0.0:1535086843.214723:0:24271:0:(osp_md_object.c:1163:osp_md_write()) write [0x20000bb98:0x1:0x0] offset = 0 length = 32768
00000004:00000040:0.0:1535086843.214752:0:24271:0:(osp_md_object.c:1163:osp_md_write()) write [0x20000bb98:0x1:0x0] offset = 32768 length = 504
00000040:00080000:0.0:1535086843.214755:0:24271:0:(llog_osd.c:698:llog_osd_write_rec()) added record [0x20000bb98:0x1:0x0].1, 504 off33272
00000004:00000040:1.0:1535086843.214788:0:23089:0:(osp_trans.c:265:object_update_request_dump()) i = 0 fid = [0x20000bb98:0x1:0x0] op = write params = 2 batchid = 0 size = 32832 repsize 0
00000004:00000040:1.0:1535086843.214789:0:23089:0:(osp_trans.c:265:object_update_request_dump()) i = 1 fid = [0x20000bb98:0x1:0x0] op = write params = 2 batchid = 0 size = 568 repsize 0
00000040:00001000:1.0:1535086846.270681:0:24271:0:(llog_osd.c:399:llog_osd_write_rec()) new record 106a0000 to [0x20000bb98:0x1:0x0]
00000004:00000040:1.0:1535086846.270684:0:24271:0:(osp_md_object.c:1163:osp_md_write()) write [0x20000bb98:0x1:0x0] offset = 0 length = 88
00000004:00000040:1.0:1535086846.270692:0:24271:0:(osp_md_object.c:1163:osp_md_write()) write [0x20000bb98:0x1:0x0] offset = 88 length = 4
00000004:00000040:1.0:1535086846.270695:0:24271:0:(osp_md_object.c:1163:osp_md_write()) write [0x20000bb98:0x1:0x0] offset = 32760 length = 8
00000004:00000040:1.0:1535086846.270699:0:24271:0:(osp_md_object.c:1163:osp_md_write()) write [0x20000bb98:0x1:0x0] offset = 33272 length = 1160
00000040:00080000:1.0:1535086846.270702:0:24271:0:(llog_osd.c:698:llog_osd_write_rec()) added record [0x20000bb98:0x1:0x0].2, 1160 off34432
00000004:00000040:1.0:1535086872.182169:0:23089:0:(osp_trans.c:265:object_update_request_dump()) i = 3 fid = [0x20000bb98:0x1:0x0] op = write params = 2 batchid = 0 size = 152 repsize 0
00000004:00000040:1.0:1535086872.182172:0:23089:0:(osp_trans.c:265:object_update_request_dump()) i = 4 fid = [0x20000bb98:0x1:0x0] op = write params = 2 batchid = 0 size = 72 repsize 0
00000004:00000040:1.0:1535086872.182174:0:23089:0:(osp_trans.c:265:object_update_request_dump()) i = 5 fid = [0x20000bb98:0x1:0x0] op = write params = 2 batchid = 0 size = 72 repsize 0
00000004:00000040:1.0:1535086872.182176:0:23089:0:(osp_trans.c:265:object_update_request_dump()) i = 6 fid = [0x20000bb98:0x1:0x0] op = write params = 2 batchid = 0 size = 1224 repsize 0
00000004:00000040:0.0:1535086877.998347:0:26196:0:(mdt_handler.c:5344:mdt_object_free()) object free, fid = [0x20000bb98:0x1:0x0]
00000004:00000040:1.0:1535086890.715650:0:27095:0:(mdt_handler.c:5324:mdt_object_init()) object init, fid = [0x20000bb98:0x1:0x0]
00000004:00000040:1.0:1535086890.715674:0:27095:0:(osp_trans.c:265:object_update_request_dump()) i = 0 fid = [0x20000bb98:0x1:0x0] op = attr_get params = 0 batchid = 0 size = 40 repsize 208
00000004:00000040:1.0:1535086890.716560:0:27095:0:(osp_md_object.c:1238:osp_md_read()) lustre-MDT0000-osp-MDT0001 [0x20000bb98:0x1:0x0] read offset 0 size 32768
00000004:00000040:1.0:1535086890.716562:0:27095:0:(osp_trans.c:265:object_update_request_dump()) i = 0 fid = [0x20000bb98:0x1:0x0] op = read params = 2 batchid = 0 size = 72 repsize 32768
00000004:00000040:1.0:1535086890.721227:0:27095:0:(osp_md_object.c:1238:osp_md_read()) lustre-MDT0000-osp-MDT0001 [0x20000bb98:0x1:0x0] read offset 32768 size 32768
00000004:00000040:1.0:1535086890.721229:0:27095:0:(osp_trans.c:265:object_update_request_dump()) i = 0 fid = [0x20000bb98:0x1:0x0] op = read params = 2 batchid = 0 size = 72 repsize 32768
40000000:00000040:0.0:1535086899.333519:0:26923:0:(fid_request.c:400:seq_client_alloc_fid()) cli-cli-lustre-MDT0000-osp-MDT0001: Allocated FID [0x20000bb98:0x1:0x0]
00000004:00000040:0.0:1535086899.333537:0:26923:0:(mdt_handler.c:5324:mdt_object_init()) object init, fid = [0x20000bb98:0x1:0x0]
00000004:00000040:1.0:1535086899.334459:0:26954:0:(osp_trans.c:265:object_update_request_dump()) i = 0 fid = [0x20000bb98:0x1:0x0] op = create params = 1 batchid = 0 size = 256 repsize 0
00000040:00001000:0.0:1535086899.371627:0:26923:0:(llog_osd.c:399:llog_osd_write_rec()) new record 106a0000 to [0x20000bb98:0x1:0x0]
00000004:00000040:0.0:1535086899.371630:0:26923:0:(osp_md_object.c:1163:osp_md_write()) write [0x20000bb98:0x1:0x0] offset = 0 length = 32768
00000004:00000040:0.0:1535086899.371679:0:26923:0:(osp_md_object.c:1163:osp_md_write()) write [0x20000bb98:0x1:0x0] offset = 32768 length = 504
00000040:00080000:0.0:1535086899.371684:0:26923:0:(llog_osd.c:698:llog_osd_write_rec()) added record [0x20000bb98:0x1:0x0].1, 504 off33272
00000004:00080000:1.0:1535086899.394315:0:26954:0:(osp_object.c:1304:osp_invalidate()) Invalidate osp_object [0x20000bb98:0x1:0x0]
Comment by Peter Jones [ 12/Sep/18 ]

duplicate of LU-10740

Generated at Sat Feb 10 02:23:44 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.