[LU-8399] MDT hung at lu_object_find_at during umount Created: 14/Jul/16 Updated: 06/Aug/16 Resolved: 06/Aug/16 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.9.0 |
| Fix Version/s: | Lustre 2.9.0 |
| Type: | Bug | Priority: | Blocker |
| Reporter: | James Nunez (Inactive) | Assignee: | nasf (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Environment: |
autotest review-dne |
||
| Issue Links: |
|
||||||||||||
| Severity: | 3 | ||||||||||||
| Rank (Obsolete): | 9223372036854775807 | ||||||||||||
| Description |
|
sanity-scrub test_1c and test_4a are timing out with a hang in unmount of a MDT. All failures so far are in review-dne. From the test_log, we see the test is umounting MDTs: … stop mds1 CMD: onyx-33vm7 grep -c /mnt/lustre-mds1' ' /proc/mounts CMD: onyx-33vm7 umount -d /mnt/lustre-mds1 CMD: onyx-33vm7 lsmod | grep lnet > /dev/null && lctl dl | grep ' ST ' stop mds2 CMD: onyx-33vm3 grep -c /mnt/lustre-mds2' ' /proc/mounts CMD: onyx-33vm3 umount -d /mnt/lustre-mds2 CMD: onyx-33vm3 lsmod | grep lnet > /dev/null && lctl dl | grep ' ST ' stop mds3 CMD: onyx-33vm7 grep -c /mnt/lustre-mds3' ' /proc/mounts CMD: onyx-33vm7 umount -d /mnt/lustre-mds3 Looking at the MDS that is in the process of unmounting an MDT, in the console logs, we see errors when trying to unmount and the stack trace: 06:49:24:[12121.531608] Lustre: DEBUG MARKER: umount -d /mnt/lustre-mds3 06:49:24:[12141.673117] LustreError: 23989:0:(import.c:338:ptlrpc_invalidate_import()) lustre-MDT0000_UUID: rc = -110 waiting for callback (1 != 0) 06:49:24:[12141.681956] LustreError: 23989:0:(import.c:372:ptlrpc_invalidate_import()) @@@ still on delayed list req@ffff880054be8300 x1539734990549200/t0(0) o1000->lustre-MDT0000-osp-MDT0002@0@lo:24/4 lens 488/192 e 0 to 0 dl 1468417620 ref 2 fl Interpret:ES/0/0 rc -5/-107 06:49:24:[12141.688142] LustreError: 23989:0:(import.c:378:ptlrpc_invalidate_import()) lustre-MDT0000_UUID: Unregistering RPCs found (0). Network is sluggish? Waiting them to error out. 06:49:24:[12161.692141] LustreError: 23989:0:(import.c:338:ptlrpc_invalidate_import()) lustre-MDT0000_UUID: rc = -110 waiting for callback (1 != 0) 06:49:24:[12161.701736] LustreError: 23989:0:(import.c:372:ptlrpc_invalidate_import()) @@@ still on delayed list req@ffff880054be8300 x1539734990549200/t0(0) o1000->lustre-MDT0000-osp-MDT0002@0@lo:24/4 lens 488/192 e 0 to 0 dl 1468417620 ref 2 fl Interpret:ES/0/0 rc -5/-107 06:49:24:[12161.708145] LustreError: 23989:0:(import.c:378:ptlrpc_invalidate_import()) lustre-MDT0000_UUID: Unregistering RPCs found (0). Network is sluggish? Waiting them to error out. 06:49:24:[12181.712143] LustreError: 23989:0:(import.c:338:ptlrpc_invalidate_import()) lustre-MDT0000_UUID: rc = -110 waiting for callback (1 != 0) 06:49:24:[12181.722112] LustreError: 23989:0:(import.c:372:ptlrpc_invalidate_import()) @@@ still on delayed list req@ffff880054be8300 x1539734990549200/t0(0) o1000->lustre-MDT0000-osp-MDT0002@0@lo:24/4 lens 488/192 e 0 to 0 dl 1468417620 ref 2 fl Interpret:ES/0/0 rc -5/-107 06:49:24:[12181.729016] LustreError: 23989:0:(import.c:378:ptlrpc_invalidate_import()) lustre-MDT0000_UUID: Unregistering RPCs found (0). Network is sluggish? Waiting them to error out. 06:49:24:[12201.733130] LustreError: 23989:0:(import.c:338:ptlrpc_invalidate_import()) lustre-MDT0000_UUID: rc = -110 waiting for callback (1 != 0) 06:49:24:[12201.742520] LustreError: 23989:0:(import.c:372:ptlrpc_invalidate_import()) @@@ still on delayed list req@ffff880054be8300 x1539734990549200/t0(0) o1000->lustre-MDT0000-osp-MDT0002@0@lo:24/4 lens 488/192 e 0 to 0 dl 1468417620 ref 2 fl Interpret:ES/0/0 rc -5/-107 06:49:24:[12201.749824] LustreError: 23989:0:(import.c:378:ptlrpc_invalidate_import()) lustre-MDT0000_UUID: Unregistering RPCs found (0). Network is sluggish? Waiting them to error out. 06:49:24:[12221.754157] LustreError: 23989:0:(import.c:338:ptlrpc_invalidate_import()) lustre-MDT0000_UUID: rc = -110 waiting for callback (1 != 0) 06:49:24:[12221.763268] LustreError: 23989:0:(import.c:372:ptlrpc_invalidate_import()) @@@ still on delayed list req@ffff880054be8300 x1539734990549200/t0(0) o1000->lustre-MDT0000-osp-MDT0002@0@lo:24/4 lens 488/192 e 0 to 0 dl 1468417620 ref 2 fl Interpret:ES/0/0 rc -5/-107 06:49:24:[12221.770918] LustreError: 23989:0:(import.c:378:ptlrpc_invalidate_import()) lustre-MDT0000_UUID: Unregistering RPCs found (0). Network is sluggish? Waiting them to error out. 06:49:24:[12241.776150] LustreError: 23989:0:(import.c:338:ptlrpc_invalidate_import()) lustre-MDT0000_UUID: rc = -110 waiting for callback (1 != 0) 06:49:24:[12241.784285] LustreError: 23989:0:(import.c:372:ptlrpc_invalidate_import()) @@@ still on delayed list req@ffff880054be8300 x1539734990549200/t0(0) o1000->lustre-MDT0000-osp-MDT0002@0@lo:24/4 lens 488/192 e 0 to 0 dl 1468417620 ref 2 fl Interpret:ES/0/0 rc -5/-107 06:49:24:[12241.792494] LustreError: 23989:0:(import.c:378:ptlrpc_invalidate_import()) lustre-MDT0000_UUID: Unregistering RPCs found (0). Network is sluggish? Waiting them to error out. 06:49:24:[12261.798128] LustreError: 23989:0:(import.c:338:ptlrpc_invalidate_import()) lustre-MDT0000_UUID: rc = -110 waiting for callback (1 != 0) 06:54:55:[12261.805202] LustreError: 23989:0:(import.c:372:ptlrpc_invalidate_import()) @@@ still on delayed list req@ffff880054be8300 x1539734990549200/t0(0) o1000->lustre-MDT0000-osp-MDT0002@0@lo:24/4 lens 488/192 e 0 to 0 dl 1468417620 ref 2 fl Interpret:ES/0/0 rc -5/-107 06:54:55:[12261.813777] LustreError: 23989:0:(import.c:378:ptlrpc_invalidate_import()) lustre-MDT0000_UUID: Unregistering RPCs found (0). Network is sluggish? Waiting them to error out. 06:54:55:[12301.819116] LustreError: 23989:0:(import.c:338:ptlrpc_invalidate_import()) lustre-MDT0000_UUID: rc = -110 waiting for callback (1 != 0) 06:54:55:[12301.825881] LustreError: 23989:0:(import.c:338:ptlrpc_invalidate_import()) Skipped 1 previous similar message 06:54:55:[12301.831998] LustreError: 23989:0:(import.c:372:ptlrpc_invalidate_import()) @@@ still on delayed list req@ffff880054be8300 x1539734990549200/t0(0) o1000->lustre-MDT0000-osp-MDT0002@0@lo:24/4 lens 488/192 e 0 to 0 dl 1468417620 ref 2 fl Interpret:ES/0/0 rc -5/-107 06:54:55:[12301.840643] LustreError: 23989:0:(import.c:372:ptlrpc_invalidate_import()) Skipped 1 previous similar message 06:54:55:[12301.843920] LustreError: 23989:0:(import.c:378:ptlrpc_invalidate_import()) lustre-MDT0000_UUID: Unregistering RPCs found (0). Network is sluggish? Waiting them to error out. 06:54:55:[12301.849581] LustreError: 23989:0:(import.c:378:ptlrpc_invalidate_import()) Skipped 1 previous similar message 06:54:55:[12360.587118] INFO: task osp_up0-2:22450 blocked for more than 120 seconds. 06:54:55:[12360.592257] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. 06:54:55:[12360.595199] osp_up0-2 D ffffc90004019000 0 22450 2 0x00000080 06:54:55:[12360.597925] ffff88007a277a98 0000000000000046 ffff8800441b8000 ffff88007a277fd8 06:54:55:[12360.600858] ffff88007a277fd8 ffff88007a277fd8 ffff8800441b8000 ffff88004435c240 06:54:55:[12360.603645] ffff8800435be000 ffff880042fd6020 0000000000000000 ffffc90004019000 06:54:55:[12360.606403] Call Trace: 06:54:55:[12360.608738] [<ffffffff8163ba29>] schedule+0x29/0x70 06:54:55:[12360.611237] [<ffffffffa07f797d>] lu_object_find_at+0x4d/0xe0 [obdclass] 06:54:55:[12360.613840] [<ffffffff810b88c0>] ? wake_up_state+0x20/0x20 06:54:55:[12360.616305] [<ffffffffa07f7e3f>] lu_object_find_slice+0x1f/0x90 [obdclass] 06:54:55:[12360.618842] [<ffffffffa0f7c5f4>] osp_trans_stop_cb+0x1b4/0x2c0 [osp] 06:54:55:[12360.621315] [<ffffffffa0f7eeeb>] osp_update_interpret+0x21b/0x4b0 [osp] 06:54:55:[12360.623814] [<ffffffffa0a0d725>] ptlrpc_check_set.part.23+0x425/0x1dd0 [ptlrpc] 06:54:55:[12360.626273] [<ffffffff8108bf50>] ? internal_add_timer+0x70/0x70 06:54:55:[12360.628649] [<ffffffffa0a0f12b>] ptlrpc_check_set+0x5b/0xe0 [ptlrpc] 06:54:55:[12360.631038] [<ffffffffa0a0f6a1>] ptlrpc_set_wait+0x4f1/0x900 [ptlrpc] 06:54:55:[12360.633299] [<ffffffff810b88c0>] ? wake_up_state+0x20/0x20 06:54:55:[12360.635543] [<ffffffffa0a0fb2d>] ptlrpc_queue_wait+0x7d/0x220 [ptlrpc] 06:54:55:[12360.637787] [<ffffffffa0f7f6b2>] osp_send_update_req+0x1c2/0x830 [osp] 06:54:55:[12360.640042] [<ffffffffa0f80573>] osp_send_update_thread+0x233/0x5e0 [osp] 06:54:55:[12360.642227] [<ffffffff810b88c0>] ? wake_up_state+0x20/0x20 06:54:55:[12360.644346] [<ffffffffa0f80340>] ? osp_invalidate_request+0x370/0x370 [osp] 06:54:55:[12360.646486] [<ffffffff810a5aef>] kthread+0xcf/0xe0 06:54:55:[12360.648492] [<ffffffff810a5a20>] ? kthread_create_on_node+0x140/0x140 06:54:55:[12360.650536] [<ffffffff816469d8>] ret_from_fork+0x58/0x90 06:54:55:[12360.652551] [<ffffffff810a5a20>] ? kthread_create_on_node+0x140/0x140 06:54:55:[12372.704108] Lustre: 6959:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1468417841/real 1468417841] req@ffff880054be9500 x1539734990553872/t0(0) o250->MGC10.2.4.126@tcp@0@lo:26/25 lens 520/544 e 0 to 1 dl 1468417866 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1 06:54:55:[12372.714842] Lustre: 6959:0:(client.c:2113:ptlrpc_expire_one_request()) Skipped 46 previous similar messages 06:54:55:[12381.852134] LustreError: 23989:0:(import.c:338:ptlrpc_invalidate_import()) lustre-MDT0000_UUID: rc = -110 waiting for callback (1 != 0) 06:54:55:[12381.861581] LustreError: 23989:0:(import.c:338:ptlrpc_invalidate_import()) Skipped 3 previous similar messages 06:54:55:[12381.864156] LustreError: 23989:0:(import.c:372:ptlrpc_invalidate_import()) @@@ still on delayed list req@ffff880054be8300 x1539734990549200/t0(0) o1000->lustre-MDT0000-osp-MDT0002@0@lo:24/4 lens 488/192 e 0 to 0 dl 1468417620 ref 2 fl Interpret:ES/0/0 rc -5/-107 06:54:55:[12381.870152] LustreError: 23989:0:(import.c:372:ptlrpc_invalidate_import()) Skipped 3 previous similar messages 06:54:55:[12381.872522] LustreError: 23989:0:(import.c:378:ptlrpc_invalidate_import()) lustre-MDT0000_UUID: Unregistering RPCs found (0). Network is sluggish? Waiting them to error out. … sanity-scrub test_1c started failing with this error on July 13, 2016. Logs for some of these failures are at: sanity-scrub test_4a started failing with this error on July 13, 2016. Logs for some of these failures are at: |
| Comments |
| Comment by Sebastien Buisson (Inactive) [ 15/Jul/16 ] |
|
Another occurence of failure on sanity-scrub test 1c on master: |
| Comment by Jian Yu [ 15/Jul/16 ] |
|
This is blocking patch review testing on master branch: |
| Comment by nasf (Inactive) [ 15/Jul/16 ] |
|
According to the logs, it is not the OI scrub logic blocked the lu_object_find(), instead, it happened before the OI scrub. To test OI scrub logic, we need to generate some files and directories. For DNE case, we will: The hung happened at the second step. Because MDT1 umount before the MDT2 (or MDT3) flushing the async update OUT RPC, as to the MDT2 (or MDT3) get failure, then osp_trans_stop_cb() will try to purge related OSP object attribute cache, it needs to locate related object firstly. But we are umounting MDT2 (or MDT4) at that time, related object are marked as to be purged from RAM. On the other hand, someone may be referencing the object and waiting for the transaction callback. But the transaction callback is blocked, then deadlock. In fact, if the object is purging out of RAM, the osp_invalidate() needs to do nothing. So we can use non-blocked method to locate related object for osp_invalidate() to avoid deadlock. |
| Comment by Gerrit Updater [ 15/Jul/16 ] |
|
Fan Yong (fan.yong@intel.com) uploaded a new patch: http://review.whamcloud.com/21330 |
| Comment by Andreas Dilger [ 20/Jul/16 ] |
|
It appears that this problem is related to the patch http://review.whamcloud.com/18801 " |
| Comment by Oleg Drokin [ 20/Jul/16 ] |
|
I reopened |
| Comment by Peter Jones [ 20/Jul/16 ] |
|
makes sense to me |
| Comment by nasf (Inactive) [ 21/Jul/16 ] |
|
We hit again after the reversion of It seems that |
| Comment by Jian Yu [ 22/Jul/16 ] |
|
One more instance on master branch: https://testing.hpdd.intel.com/test_sets/55571e7a-4fa6-11e6-bf87-5254006e85c2 |
| Comment by nasf (Inactive) [ 24/Jul/16 ] |
|
We still need the patch: |
| Comment by Bob Glossman (Inactive) [ 26/Jul/16 ] |
|
another on master, in replay-single test_100b: |
| Comment by Jian Yu [ 03/Aug/16 ] |
|
+1 on master, in replay-single test_100b: |
| Comment by Gerrit Updater [ 06/Aug/16 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/21330/ |
| Comment by nasf (Inactive) [ 06/Aug/16 ] |
|
The patch has been landed to master. |