[LU-639] replay-dual test_0b: @@@@@@ FAIL: mount1 fails Created: 25/Aug/11  Updated: 13/Dec/12  Resolved: 04/Jan/12

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.2.0

Type: Bug Priority: Minor
Reporter: James A Simmons Assignee: Zhenyu Xu
Resolution: Fixed Votes: 0
Labels: None

Severity: 3
Rank (Obsolete): 4779

 Description   

== replay-dual test 0b: lost client during waiting for next transno ================================== 11:11:19 (1314285079)
Filesystem 1K-blocks Used Available Use% Mounted on
10.37.248.61@o2ib1:/lustre
22047088 922544 20002752 5% /lustre/barry
Failing mds1 on node barry-mds1
Stopping /tmp/mds1 (opts on barry-mds1
affected facets: mds1
Failover mds1 to barry-mds1
11:11:34 (1314285094) waiting for barry-mds1 network 900 secs ...
11:11:34 (1314285094) network interface is UP
Starting mds1: -o user_xattr,acl /dev/md5 /tmp/mds1
Started lustre-MDT0000
Starting client: spoon01: -o user_xattr,acl,flock 10.37.248.61@o2ib1:/lustre /lustre/barry
mount.lustre: mount 10.37.248.61@o2ib1:/lustre at /lustre/barry failed: File exists
replay-dual test_0b: @@@@@@ FAIL: mount1 fais

Client dmesg
Lustre: DEBUG MARKER: == replay-dual test 0b: lost client during waiting for next transno ================================== 11:11:19 (1314285079)
Lustre: DEBUG MARKER: local REPLAY BARRIER on lustre-MDT0000
LustreError: 31491:0:(ldlm_request.c:1172:ldlm_cli_cancel_req()) Got rc -108 from cancel RPC: canceling anyway
LustreError: 31491:0:(ldlm_request.c:1799:ldlm_cli_cancel_list()) ldlm_cli_cancel_list: -108
Lustre: client ffff810173cb2400 umount complete
Lustre: setting import lustre-MDT0000_UUID INACTIVE by administrator request
Lustre: setting import lustre-OST0000_UUID INACTIVE by administrator request
LustreError: 31613:0:(ldlm_request.c:1172:ldlm_cli_cancel_req()) Got rc -108 from cancel RPC: canceling anyway
LustreError: 31613:0:(ldlm_request.c:1799:ldlm_cli_cancel_list()) ldlm_cli_cancel_list: -108
Lustre: client ffff81017c9d6800 umount complete
LustreError: 31623:0:(genops.c:304:class_newdev()) Device MGC10.37.248.61@o2ib1 already exists, won't add
LustreError: 31623:0:(obd_config.c:327:class_attach()) Cannot create device MGC10.37.248.61@o2ib1 of type mgc : -17
LustreError: 31623:0:(obd_mount.c:512:lustre_start_simple()) MGC10.37.248.61@o2ib1 attach error -17
LustreError: 31623:0:(obd_mount.c:2160:lustre_fill_super()) Unable to mount (-17)
Lustre: DEBUG MARKER: replay-dual test_0b: @@@@@@ FAIL: mount1 fais

MDS dmesg

Lustre: DEBUG MARKER: == replay-dual test 0b: lost client during waiting for next transno ================================== 11:11:19 (1314285079)
LustreError: 10361:0:(osd_handler.c:938:osd_ro()) *** setting device osd-ldiskfs read-only ***
Turning device md5 (0x900005) read-only
Lustre: DEBUG MARKER: mds1 REPLAY BARRIER on lustre-MDT0000
Lustre: Failing over lustre-MDT0000
Lustre: 10460:0:(quota_master.c:793:close_quota_files()) quota[0] is off already
Lustre: 10460:0:(quota_master.c:793:close_quota_files()) Skipped 1 previous similar message
Lustre: Failing over mdd_obd-lustre-MDT0000
Lustre: mdd_obd-lustre-MDT0000: shutting down for failover; client state will be preserved.
Removing read-only on unknown block (0x900005)
Lustre: server umount lustre-MDT0000 complete
LDISKFS-fs (md5): recovery complete
LDISKFS-fs (md5): mounted filesystem with ordered data mode
JBD: barrier-based sync failed on md5-8 - disabling barriers
LDISKFS-fs (md5): mounted filesystem with ordered data mode
Lustre: Enabling ACL
Lustre: Enabling user_xattr
Lustre: lustre-MDT0000: used disk, loading
Lustre: 10592:0:(ldlm_lib.c:1903:target_recovery_init()) RECOVERY: service lustre-MDT0000, 66 recoverable clients, last_transno 4294967297
LustreError: 10599:0:(ldlm_lib.c:1740:target_recovery_thread()) lustre-MDT0000: started recovery thread pid 10599
LustreError: 10601:0:(mdt_handler.c:2785:mdt_recovery()) operation 400 on unconnected MDS from 12345-10.37.248.45@o2ib1
LustreError: 10601:0:(ldlm_lib.c:2128:target_send_reply_msg()) @@@ processing error (107) req@ffff81040ca1c400 x1378037410570255/t0(0) o-1><?>@<?>:0/0 lens 192/0 e 0 to 0 dl 1314285137 ref 1 fl Interpret:H/ffffffff/ffffffff rc -107/-1
LustreError: 10601:0:(ldlm_lib.c:2128:target_send_reply_msg()) Skipped 1 previous similar message
LustreError: 137-5: UUID 'lustre-MDT0000_UUID' is not available for connect (not set up)
Lustre: 10592:0:(mdt_lproc.c:257:lprocfs_wr_identity_upcall()) lustre-MDT0000: identity upcall set to /usr/sbin/l_getidentity
Lustre: 10592:0:(mds_lov.c:1004:mds_notify()) MDS mdd_obd-lustre-MDT0000: add target lustre-OST0000_UUID
Lustre: 10592:0:(mds_lov.c:1004:mds_notify()) Skipped 4 previous similar messages
JBD: barrier-based sync failed on md5-8 - disabling barriers
Lustre: 5799:0:(mds_lov.c:1024:mds_notify()) MDS mdd_obd-lustre-MDT0000: in recovery, not resetting orphans on lustre-OST0000_UUID
Lustre: 5799:0:(mds_lov.c:1024:mds_notify()) MDS mdd_obd-lustre-MDT0000: in recovery, not resetting orphans on lustre-OST0001_UUID
LustreError: 10601:0:(mdt_handler.c:2785:mdt_recovery()) operation 400 on unconnected MDS from 12345-10.37.248.44@o2ib1
Lustre: lustre-MDT0000: temporarily refusing client connection from 10.37.248.44@o2ib1
Lustre: Skipped 1 previous similar message
LNet: 10801:0:(debug.c:326:libcfs_debug_str2mask()) You are trying to use a numerical value for the mask - this will be deprecated in a future release.
Lustre: DEBUG MARKER: replay-dual test_0b: @@@@@@ FAIL: mount1 fais
LustreError: 10601:0:(mdt_handler.c:2785:mdt_recovery()) operation 400 on unconnected MDS from 12345-10.37.248.4@o2ib1
Lustre: 10601:0:(ldlm_lib.c:2029:target_queue_recovery_request()) Next recovery transno: 4294967298, current: 4294967306, replaying
Lustre: 10601:0:(ldlm_lib.c:2029:target_queue_recovery_request()) Next recovery transno: 4294967298, current: 4294967303, replaying
LustreError: 10606:0:(mdt_handler.c:2785:mdt_recovery()) operation 400 on unconnected MDS from 12345-10.37.248.16@o2ib1
LustreError: 10606:0:(mdt_handler.c:2785:mdt_recovery()) Skipped 58 previous similar messages


Info required for matching: replay-dual test_0b 0b



 Comments   
Comment by Peter Jones [ 13/Dec/11 ]

I believe that this is being worked under LU-661

Comment by James A Simmons [ 13/Dec/11 ]

I can confirm that the patch from LU-661 fixes the issue. You can mark this bug as a duplicate of LU-661.

Comment by Peter Jones [ 13/Dec/11 ]

Thanks James

Comment by James A Simmons [ 15/Dec/11 ]

Sorry, I was mistaken. The test fails even with the fix from LU-661. The error I'm getting is:

Lustre: DEBUG MARKER: == replay-dual test 0b: lost client during waiting for next transno ================================== 09:33:44 (1323959624)
Lustre: DEBUG MARKER: local REPLAY BARRIER on lustre-MDT0000
LustreError: 24437:0:(ldlm_request.c:1173:ldlm_cli_cancel_req()) Got rc -108 from cancel RPC: canceling anyway
LustreError: 24437:0:(ldlm_request.c:1800:ldlm_cli_cancel_list()) ldlm_cli_cancel_list: -108
Lustre: client ffff810161506c00 umount complete
Lustre: setting import lustre-MDT0000_UUID INACTIVE by administrator request
Lustre: Skipped 25 previous similar messages
LustreError: 24560:0:(ldlm_request.c:1173:ldlm_cli_cancel_req()) Got rc -108 from cancel RPC: canceling anyway
LustreError: 24560:0:(ldlm_request.c:1800:ldlm_cli_cancel_list()) ldlm_cli_cancel_list: -108
Lustre: client ffff8101ad6f0000 umount complete
LustreError: 24570:0:(genops.c:311:class_newdev()) Device MGC10.37.248.56@o2ib1 already exists at 0, won't add
LustreError: 24570:0:(obd_config.c:327:class_attach()) Cannot create device MGC10.37.248.56@o2ib1 of type mgc : -17
LustreError: 24570:0:(obd_mount.c:512:lustre_start_simple()) MGC10.37.248.56@o2ib1 attach error -17
LustreError: 24570:0:(obd_mount.c:2306:lustre_fill_super()) Unable to mount (-17)
Lustre: DEBUG MARKER: replay-dual test_0b: @@@@@@ FAIL: mount1 fais

Comment by Peter Jones [ 15/Dec/11 ]

Reopening

Comment by Peter Jones [ 15/Dec/11 ]

Bobijam

Could you please look at this test failure from ORNL?

Thanks

Peter

Comment by Zhenyu Xu [ 15/Dec/11 ]

James,

Could you please upload the debug logs of the nodes, especially of the client?

Comment by James A Simmons [ 19/Dec/11 ]

Turned on full debug info. The logs are at /uploads/LU-639/replay-dual-1324327350.tar.bz2

Comment by Zhenyu Xu [ 20/Dec/11 ]

Found the failure cause: the obd_zombie_exports thread finishes the last MGC export cleanup later than the next test's mount starts, so the new mount still finds the old MGC obd device registered.

Client log (node spoon01) excerpt, in time order:

00000020:01000000:3.0:1324327346.171837:0:30269:0:(obd_config.c:1517:class_manual_cleanup()) Manual cleanup of MGC10.37.248.56@o2ib1 (flags='') ====> start to cleanup mgc obd
...
00000020:00000001:3.0:1324327346.172024:0:30269:0:(obd_config.c:531:class_detach()) Process entered
00000020:00000080:3.0:1324327346.172025:0:30269:0:(obd_config.c:548:class_detach()) detach on obd MGC10.37.248.56@o2ib1 (uuid 7ee6c547-9a41-e9ad-d0f7-addf6fefe81a)
00000020:00000040:3.0:1324327346.172027:0:30269:0:(obd_config.c:670:class_decref()) Decref MGC10.37.248.56@o2ib1 (ffff810170f32038) now 1
====> will unlink self export and add to zombie export list
...
00000020:00000001:5.0:1324327346.172051:0:22196:0:(obd_class.h:1173:obd_destroy_export()) Process leaving (rc=0 : 0 : 0)
00000020:00000040:5.0:1324327346.172053:0:22196:0:(obd_config.c:670:class_decref()) Decref MGC10.37.248.56@o2ib1 (ffff810170f32038) now 0 ====> zombie start to destroy mgc obd
...
00000020:01000000:5.0:1324327346.172055:0:22196:0:(obd_config.c:687:class_decref()) finishing cleanup of obd MGC10.37.248.56@o2ib1 (7ee6c547-9a41-e9ad-d0f7-addf6fefe81a)
...
10000000:00000001:5.0:1324327346.172059:0:22196:0:(mgc_request.c:737:mgc_cleanup()) Process entered ====> part of destroy (obd_cleanup)
...
00000020:01000004:7.0:1324327346.188201:0:30279:0:(obd_mount.c:508:lustre_start_simple()) Starting obd MGC10.37.248.56@o2ib1 (typ=mgc) ====> next test mount start
...
00000020:00000080:7.0:1324327346.188231:0:30279:0:(obd_config.c:319:class_attach()) attach type mgc name: MGC10.37.248.56@o2ib1 uuid: 3a7a7701-cac0-3abf-0fde-bf7368296a73
00000020:00000001:7.0:1324327346.188233:0:30279:0:(genops.c:284:class_newdev()) Process entered ===> try to create mgc obd device
...
00000020:00020000:7.0:1324327346.188242:0:30279:0:(genops.c:311:class_newdev()) Device MGC10.37.248.56@o2ib1 already exists at 0, won't add ===> old mgc obd device still there
...
00000020:00020000:7.0:1324327346.188257:0:30279:0:(obd_config.c:327:class_attach()) Cannot create device MGC10.37.248.56@o2ib1 of type mgc : -17 ===> error EEXIST
...
10000000:00000001:5.0:1324327346.190745:0:22196:0:(mgc_request.c:751:mgc_cleanup()) Process leaving (rc=0 : 0 : 0)
00000020:00000001:5.0:1324327346.190747:0:22196:0:(obd_class.h:641:obd_cleanup()) Process leaving (rc=0 : 0 : 0)
00000020:00000040:5.0:1324327346.190750:0:22196:0:(genops.c:365:class_release_dev()) Release obd device MGC10.37.248.56@o2ib1 at 0 obd_type name =mgc ===> finished cleanup, remove mgc obd device

Comment by Zhenyu Xu [ 20/Dec/11 ]

patch tracking at http://review.whamcloud.com/1896

Comment by James A Simmons [ 20/Dec/11 ]

It passed the test. Thank you.

Comment by Build Master (Inactive) [ 04/Jan/12 ]

Integrated in lustre-master » x86_64,server,el5,ofa #404
LU-639 obdclass: wait obd cleanup before mount (Revision 11d258632de7255ae8282b0615a19bd5c9cf707a)

Result = SUCCESS
Oleg Drokin : 11d258632de7255ae8282b0615a19bd5c9cf707a
Files :

  • lustre/obdclass/obd_mount.c
Comment by Peter Jones [ 04/Jan/12 ]

Landed for 2.2

Comment by Alexey Lyashkov [ 13/Dec/12 ]

In fact, that fix is invalid. It just makes mount wait until the export kill has finished, but it does not fix the case where obdfilter-survey fails to start due to the same race (class setup vs. class release).
That might be solved by adding a zombie barrier to the export cleanup phase, but it looks like we need a more generic fix instead.
