Lustre / LU-8295

conf-sanity test_41c: test failed to respond and timed out

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Fix Version/s: Lustre 2.10.0
    • Severity: 3

    Description

      This issue was created by maloo for nasf <fan.yong@intel.com>

      Please provide additional information about the failure here.

      This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/dc7b4098-3372-11e6-bbf5-5254006e85c2.

      The logs on the OST show that two mount threads were blocked:

      [12811.430006] mount           S ffff88007b4e72f0     0 25441  25436 0x00000080
      [12811.430006]  ffff8800465b7e70 0000000000000086 ffff88007b4e7300 ffff8800465b7fd8
      [12811.430006]  ffff8800465b7fd8 ffff8800465b7fd8 ffff88007b4e7300 ffff8800465b7ef8
      [12811.430006]  ffff88007b4e77f0 ffff88007b4e7300 ffff88007b4e7300 ffff88007b4e72f0
      [12811.430006] Call Trace:
      [12811.430006]  [<ffffffff8163b349>] schedule+0x29/0x70
      [12811.430006]  [<ffffffff81080f83>] do_wait+0x203/0x260
      [12811.430006]  [<ffffffff810820a0>] SyS_wait4+0x80/0x110
      [12811.430006]  [<ffffffff8107fb20>] ? task_stopped_code+0x60/0x60
      [12811.430006]  [<ffffffff816463c9>] system_call_fastpath+0x16/0x1b
      [12811.430006] sh              S ffff88007b4e2df0     0 25442  25440 0x00000080
      [12811.430006]  ffff88007a95fe70 0000000000000086 ffff88007b4e2e00 ffff88007a95ffd8
      [12811.430006]  ffff88007a95ffd8 ffff88007a95ffd8 ffff88007b4e2e00 ffff88007a95fef8
      [12811.430006]  ffff88007b4e32f0 ffff88007b4e2e00 ffff88007b4e2e00 ffff88007b4e2df0
      [12811.430006] Call Trace:
      [12811.430006]  [<ffffffff8163b349>] schedule+0x29/0x70
      [12811.430006]  [<ffffffff81080f83>] do_wait+0x203/0x260
      [12811.430006]  [<ffffffff810820a0>] SyS_wait4+0x80/0x110
      [12811.430006]  [<ffffffff8107fb20>] ? task_stopped_code+0x60/0x60
      [12811.430006]  [<ffffffff816463c9>] system_call_fastpath+0x16/0x1b
      [12811.430006] mount.lustre    S 0000000000000000     0 25443  25441 0x00000080
      [12811.430006]  ffff8800438d7ce0 0000000000000082 ffff8800438c5c00 ffff8800438d7fd8
      [12811.430006]  ffff8800438d7fd8 ffff8800438d7fd8 ffff8800438c5c00 ffff880047f4c800
      [12811.430006]  ffff8800438c5c00 ffff8800438d7e08 ffff880043836800 0000000000000000
      [12811.430006] Call Trace:
      [12811.430006]  [<ffffffff8163b349>] schedule+0x29/0x70
      [12811.430006]  [<ffffffffa0abaed6>] server_fill_super+0x1656/0x184c [obdclass]
      [12811.430006]  [<ffffffff810a6ac0>] ? wake_up_atomic_t+0x30/0x30
      [12811.430006]  [<ffffffffa0a8f9d8>] lustre_fill_super+0x328/0x950 [obdclass]
      [12811.430006]  [<ffffffffa0a8f6b0>] ? lustre_common_put_super+0x270/0x270 [obdclass]
      [12811.430006]  [<ffffffff811e1f2d>] mount_nodev+0x4d/0xb0
      [12811.430006]  [<ffffffffa0a87908>] lustre_mount+0x38/0x60 [obdclass]
      [12811.430006]  [<ffffffff811e28d9>] mount_fs+0x39/0x1b0
      [12811.430006]  [<ffffffff811fe1af>] vfs_kern_mount+0x5f/0xf0
      [12811.430006]  [<ffffffff812006fe>] do_mount+0x24e/0xa40
      [12811.430006]  [<ffffffff8116e15e>] ? __get_free_pages+0xe/0x50
      [12811.430006]  [<ffffffff81200f86>] SyS_mount+0x96/0xf0
      [12811.430006]  [<ffffffff816463c9>] system_call_fastpath+0x16/0x1b
      [12811.430006] mount           S ffff88003dd65070     0 25446  25442 0x00000080
      [12811.430006]  ffff8800434d3e70 0000000000000086 ffff88003dd65080 ffff8800434d3fd8
      [12811.430006]  ffff8800434d3fd8 ffff8800434d3fd8 ffff88003dd65080 ffff8800434d3ef8
      [12811.430006]  ffff88003dd65570 ffff88003dd65080 ffff88003dd65080 ffff88003dd65070
      [12811.430006] Call Trace:
      [12811.430006]  [<ffffffff8163b349>] schedule+0x29/0x70
      [12811.430006]  [<ffffffff81080f83>] do_wait+0x203/0x260
      [12811.430006]  [<ffffffff810820a0>] SyS_wait4+0x80/0x110
      [12811.430006]  [<ffffffff8107fb20>] ? task_stopped_code+0x60/0x60
      [12811.430006]  [<ffffffff816463c9>] system_call_fastpath+0x16/0x1b
      [12811.430006] mount.lustre    S 0000000000000000     0 25449  25446 0x00000080
      [12811.430006]  ffff88003db0bce0 0000000000000082 ffff88007c141700 ffff88003db0bfd8
      [12811.430006]  ffff88003db0bfd8 ffff88003db0bfd8 ffff88007c141700 ffff88005146c000
      [12811.430006]  ffff88007c141700 ffff88003db0be08 ffff88007a89c000 0000000000000000
      [12811.430006] Call Trace:
      [12811.430006]  [<ffffffff8163b349>] schedule+0x29/0x70
      [12811.430006]  [<ffffffffa0abaed6>] server_fill_super+0x1656/0x184c [obdclass]
      [12811.430006]  [<ffffffff810a6ac0>] ? wake_up_atomic_t+0x30/0x30
      [12811.430006]  [<ffffffffa0a8f9d8>] lustre_fill_super+0x328/0x950 [obdclass]
      [12811.430006]  [<ffffffffa0a8f6b0>] ? lustre_common_put_super+0x270/0x270 [obdclass]
      [12811.430006]  [<ffffffff811e1f2d>] mount_nodev+0x4d/0xb0
      [12811.430006]  [<ffffffffa0a87908>] lustre_mount+0x38/0x60 [obdclass]
      [12811.430006]  [<ffffffff811e28d9>] mount_fs+0x39/0x1b0
      [12811.430006]  [<ffffffff811fe1af>] vfs_kern_mount+0x5f/0xf0
      [12811.430006]  [<ffffffff812006fe>] do_mount+0x24e/0xa40
      [12811.430006]  [<ffffffff8116e15e>] ? __get_free_pages+0xe/0x50
      [12811.430006]  [<ffffffff81200f86>] SyS_mount+0x96/0xf0
      [12811.430006]  [<ffffffff816463c9>] system_call_fastpath+0x16/0x1b
      

      Attachments

        Activity

          pjones Peter Jones added a comment -

          Landed for 2.10


          gerrit Gerrit Updater added a comment -

          Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/21117/
          Subject: LU-8295 tests: comply with OBD_RACE() behavior
          Project: fs/lustre-release
          Branch: master
          Current Patch Set:
          Commit: a3becc1092524366e13b3c72d61c6b0ea82709f7


          gerrit Gerrit Updater added a comment -

          Faccini Bruno (bruno.faccini@intel.com) uploaded a new patch: http://review.whamcloud.com/21117
          Subject: LU-8295 tests: comply with OBD_RACE() behavior
          Project: fs/lustre-release
          Branch: master
          Current Patch Set: 1
          Commit: e3050d4414667a941c501728de2bfb0b78e8b097

          jhammond John Hammond added a comment -

          > Do you want me to push a patch to fix these 2 holes (4 in fact, since fail_loc is wrongly set/cleared for the MDS and OSS cases) in the test?

          Yes, please.

          I'm still not confident that I understand how this is working, but I can see several potential issues with this code: cfs_race_state should be declared volatile; there is a race between the two threads setting cfs_race_state, and between one sleeping and the other waking; and proc_fail_loc() calls wake_up(&cfs_race_waitq) but does not set cfs_race_state to 1, so it seems the sleeping task would just go back to sleep.

          My best guess is that usually "do_facet ost1 $LCTL set_param fail_loc=0x0" runs before either mount command, so the CFS_RACE() is never triggered.
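
          For reference, the handshake being discussed looks roughly like this (condensed from the cfs_race() inline in libcfs_fail.h of this era; exact code may differ between versions):

          /* The first thread whose fail check fires goes to sleep; the next
           * thread whose precheck matches is expected to wake it. Note that
           * cfs_race_state is a plain int (not volatile, no barriers), which
           * is one of the issues raised above. */
          static inline void cfs_race(__u32 id)
          {
                  if (CFS_FAIL_PRECHECK(id)) {
                          if (unlikely(__cfs_fail_check_set(id, 0, CFS_FAIL_LOC_NOSET))) {
                                  int rc;

                                  cfs_race_state = 0;
                                  CERROR("cfs_race id %x sleeping\n", id);
                                  rc = wait_event_interruptible(cfs_race_waitq,
                                                                cfs_race_state != 0);
                                  CERROR("cfs_fail_race id %x awake, rc=%d\n", id, rc);
                          } else {
                                  CERROR("cfs_fail_race id %x waking\n", id);
                                  cfs_race_state = 1;
                                  wake_up(&cfs_race_waitq);
                          }
                  }
          }
          #define CFS_RACE(id) cfs_race(id)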


          bfaccini Bruno Faccini (Inactive) added a comment -

          John, since I am the one responsible for this test and its associated code, I am quite interested in this problem and ticket.

          My understanding was that OBD_RACE()/CFS_RACE()/cfs_race() only work safely with 2 threads: the first stops and waits for the second to wake it.

          And like you, I think that for this to work as expected, fail_loc must not be cleared in between. There is a window for that to happen here, because the second mount thread is run asynchronously while the clearing of fail_loc is launched at the same time.

          On the other hand, I previously thought that CFS_FAIL_ONCE was unnecessary for OBD_RACE() to work, but after a closer look at the __cfs_fail_check_set() core routine, its absence may also, as you point out, be the reason both threads get stuck: if they race closely and enter the routine at about the same time, particularly while the CFS_FAILED bit has still not been set, both can take the sleeping path.

          Do you want me to push a patch to fix these 2 holes (4 in fact, since fail_loc is wrongly set/cleared for the MDS and OSS cases) in the test?
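
          A minimal sketch of that window, modeled on the CFS_FAIL_ONCE/CFS_FAILED handling in __cfs_fail_check_set() (an illustration only, not the verbatim routine):

          extern unsigned long cfs_fail_loc;      /* set via lctl set_param fail_loc= */

          /* Without CFS_FAIL_ONCE nothing ever marks the fail_loc as consumed,
           * so both racing mount threads can see the check succeed and both
           * take the sleeping branch of cfs_race(). Even with CFS_FAIL_ONCE,
           * two threads entering at nearly the same time can both pass the
           * test below before either sets the bit. */
          int fail_check(__u32 id)                /* models __cfs_fail_check_set() */
          {
                  if ((cfs_fail_loc & CFS_FAIL_ONCE) &&
                      test_bit(CFS_FAILED_BIT, &cfs_fail_loc))
                          return 0;               /* already fired: caller wakes the sleeper */

                  /* window: both racers can reach this point before either one
                   * sets CFS_FAILED_BIT below */
                  if (cfs_fail_loc & CFS_FAIL_ONCE)
                          set_bit(CFS_FAILED_BIT, &cfs_fail_loc);

                  return 1;                       /* caller takes the sleeping branch */
          }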

          jhammond John Hammond added a comment -

          And we need CFS_FAIL_ONCE along with OBD_FAIL_TGT_MOUNT_RACE when setting fail_loc.
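
          In fail_loc terms that means OR-ing in the once-only modifier bit (CFS_FAIL_ONCE is the standard 0x80000000 modifier in fail_loc values):

          #define CFS_FAIL_ONCE           0x80000000      /* fire the matching fail_loc only once */
          #define OBD_FAIL_TGT_MOUNT_RACE 0x716

          /* i.e. the test should set fail_loc=0x80000716 rather than plain 0x716 */
          __u32 fail_loc = OBD_FAIL_TGT_MOUNT_RACE | CFS_FAIL_ONCE;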

          jhammond John Hammond added a comment -

          There's a bug in the test:

                  #define OBD_FAIL_TGT_MOUNT_RACE 0x716                                             
                  do_facet ost1 "$LCTL set_param fail_loc=0x716"
                  start ost1 $(ostdevname 1) $OST_MOUNT_OPTS &
                  pid=$!
                  start ost1 $(ostdevname 1) $OST_MOUNT_OPTS &
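        # BUG: this clear races with the two mounts above; if it lands
        # between the two OBD_RACE() calls, the sleeper is never woken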
                  do_facet ost1 "$LCTL set_param fail_loc=0x0"
        pid2=$!
          

          Since both mount commands run asynchronously, we may clear fail_loc between the two invocations of OBD_RACE(). In this case both threads will hang.

          > Also there's only one print in the logs about the OBD_RACE actually being hit

          Shouldn't we expect that CDEBUG rate limiting will prevent the second message from being printed?

          green Oleg Drokin added a comment -

          So OBD_RACE() is supposed to stop the first caller and then let it continue when somebody else reaches the same point.

          Yet we see both threads stuck in the same place in OBD_RACE? Something strange is going on here. Also, there is only one message in the logs about the OBD_RACE actually being hit.


          People

            Assignee: jhammond John Hammond
            Reporter: maloo Maloo
            Votes: 0
            Watchers: 6