[LU-4727] Lhsmtool_posix process stuck in ll_layout_refresh() when restoring Created: 07/Mar/14  Updated: 25/Jan/22  Resolved: 23/Apr/15

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.6.0, Lustre 2.5.1
Fix Version/s: Lustre 2.8.0

Type: Bug Priority: Major
Reporter: Li Xi (Inactive) Assignee: Jinshan Xiong (Inactive)
Resolution: Fixed Votes: 0
Labels: HSM, cea

Issue Links:
Duplicate
is duplicated by LU-5196 HSM: client task stuck waiting for mu... Resolved
Related
is related to LUDOC-252 Copytool Recommendations - Add/Clarify Open
is related to LU-4728 NULL pointer dereference in ldlm_cli_... Resolved
is related to LU-6460 LLIF_FILE_RESTORING is not cleared at... Resolved
is related to LU-4002 HSM restore vs unlink deadlock Resolved
Severity: 3
Rank (Obsolete): 13001

 Description   

This is easy to reproduce. I hit this problem every time I try to run the following commands.

rm /mnt/lustre/XXXX -f;
echo XXX > /mnt/lustre/XXXX;
cat /mnt/lustre/XXXX;
lfs hsm_archive --archive=5 /mnt/lustre/XXXX;
cat /mnt/lustre/XXXX;
lfs hsm_release /mnt/lustre/XXXX;
cat /mnt/lustre/XXXX; # This will restore automatically
lfs hsm_release /mnt/lustre/XXXX;
lfs hsm_restore /mnt/lustre/XXXX; # lhsmtool_posix actually hangs here
cat /mnt/lustre/XXXX; # this will get stuck

And after some time, the following messages showed up.

INFO: task flush-lustre-1:4106 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
flush-lustre- D 0000000000000005 0 4106 2 0x00000080
ffff8808165b7830 0000000000000046 0000000000000000 0000000000000000
0000000000013180 0000000000000000 ffff880851fc10f8 ffff88082d4e0c00
ffff88082cb7fab8 ffff8808165b7fd8 000000000000fb88 ffff88082cb7fab8
Call Trace:
[<ffffffff814fc9fe>] __mutex_lock_slowpath+0x13e/0x180
[<ffffffff814fc89b>] mutex_lock+0x2b/0x50
[<ffffffffa0c2814c>] ll_layout_refresh+0x26c/0x1080 [lustre]
[<ffffffff813104bb>] ? mix_pool_bytes_extract+0x16b/0x180
[<ffffffff81135cf9>] ? zone_statistics+0x99/0xc0
[<ffffffffa059e007>] ? cfs_hash_bd_lookup_intent+0x37/0x130 [libcfs]
[<ffffffffa0c51230>] ? ll_md_blocking_ast+0x0/0x7f0 [lustre]
[<ffffffffa08b7450>] ? ldlm_completion_ast+0x0/0x930 [ptlrpc]
[<ffffffffa06dbba1>] ? cl_io_slice_add+0xc1/0x190 [obdclass]
[<ffffffffa0c78410>] vvp_io_init+0x340/0x490 [lustre]
[<ffffffffa05a11aa>] ? cfs_hash_find_or_add+0x9a/0x190 [libcfs]
[<ffffffffa06daff8>] cl_io_init0+0x98/0x160 [obdclass]
[<ffffffffa06ddc14>] cl_io_init+0x64/0xe0 [obdclass]
[<ffffffffa0c1894d>] cl_sync_file_range+0x12d/0x500 [lustre]
[<ffffffffa0c46cac>] ll_writepages+0x9c/0x220 [lustre]
[<ffffffff81128d81>] do_writepages+0x21/0x40
[<ffffffff811a43bd>] writeback_single_inode+0xdd/0x290
[<ffffffff811a47ce>] writeback_sb_inodes+0xce/0x180
[<ffffffff811a492b>] writeback_inodes_wb+0xab/0x1b0
[<ffffffff811a4ccb>] wb_writeback+0x29b/0x3f0
[<ffffffff814fb3a0>] ? thread_return+0x4e/0x76e
[<ffffffff8107eb42>] ? del_timer_sync+0x22/0x30
[<ffffffff811a4fb9>] wb_do_writeback+0x199/0x240
[<ffffffff811a50c3>] bdi_writeback_task+0x63/0x1b0
[<ffffffff81091f97>] ? bit_waitqueue+0x17/0xd0
[<ffffffff811379e0>] ? bdi_start_fn+0x0/0x100
[<ffffffff81137a66>] bdi_start_fn+0x86/0x100
[<ffffffff811379e0>] ? bdi_start_fn+0x0/0x100
[<ffffffff81091d66>] kthread+0x96/0xa0
[<ffffffff8100c14a>] child_rip+0xa/0x20
[<ffffffff81091cd0>] ? kthread+0x0/0xa0
[<ffffffff8100c140>] ? child_rip+0x0/0x20

It seems the copy tool is waiting in md_enqueue(MDS_INODELOCK_LAYOUT). Other processes that try to lock lli->lli_layout_mutex will be stuck. This problem won't recover until the lock enqueue times out and the client reconnects.



 Comments   
Comment by Robert Read (Inactive) [ 07/Mar/14 ]

I'm seeing a similar hang here. My copytool stack:

Call Trace:
[<ffffffff81058d53>] ? __wake_up+0x53/0x70
[<ffffffffa0d2ec00>] ? ldlm_expired_completion_wait+0x0/0x360 [ptlrpc]
[<ffffffffa0d334a5>] ldlm_completion_ast+0x545/0x920 [ptlrpc]
[<ffffffff81065df0>] ? default_wake_function+0x0/0x20
[<ffffffffa0d2d9f6>] ldlm_cli_enqueue_fini+0x936/0xe70 [ptlrpc]
[<ffffffffa0d2e2d5>] ldlm_cli_enqueue+0x3a5/0x790 [ptlrpc]
[<ffffffffa0d32f60>] ? ldlm_completion_ast+0x0/0x920 [ptlrpc]
[<ffffffffa0b30240>] ? ll_md_blocking_ast+0x0/0x7d0 [lustre]
[<ffffffffa03bcd3e>] mdc_enqueue+0x2be/0x1a10 [mdc]
[<ffffffffa037da84>] lmv_enqueue+0x2f4/0xfc0 [lmv]
[<ffffffffa0b0b145>] ll_layout_refresh+0x515/0xfe0 [lustre]
[<ffffffffa0b30240>] ? ll_md_blocking_ast+0x0/0x7d0 [lustre]
[<ffffffffa0d32f60>] ? ldlm_completion_ast+0x0/0x920 [ptlrpc]
[<ffffffffa0b57113>] vvp_io_fini+0x173/0x260 [lustre]
[<ffffffffa0c46cc7>] cl_io_fini+0x77/0x280 [obdclass]
[<ffffffffa0afad37>] ll_file_io_generic+0xe7/0x610 [lustre]
[<ffffffffa0afbeb2>] ll_file_aio_write+0x142/0x2c0 [lustre]
[<ffffffffa0afc19c>] ll_file_write+0x16c/0x2a0 [lustre]
[<ffffffff81188f68>] vfs_write+0xb8/0x1a0
[<ffffffff81189861>] sys_write+0x51/0x90
[<ffffffff810e1e4e>] ? __audit_syscall_exit+0x25e/0x290
[<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
Comment by Jinshan Xiong (Inactive) [ 07/Mar/14 ]

I believe this is broken due to a recent optimization on the MD stack. Will create a workaround patch tomorrow.

Comment by John Hammond [ 07/Mar/14 ]

Hi Li Xi, I can only reproduce this if I run the copytool on the same mount point as used for the accesses. Can you confirm this? I strongly recommend using a dedicated mount point for the copytool. This should be somewhere in the HSM documentation.

Comment by Li Xi (Inactive) [ 07/Mar/14 ]

Hi John,

Yeah, you are correct! I was running the copytool on the same mount point as used for the accesses. Sorry for misusing it.

However, I don't understand why we have to use a dedicated mount point for the copytool. Is there any special reason for it?

Comment by Robert Read (Inactive) [ 07/Mar/14 ]

I've seen this when running my copytool in a dedicated mount point.

Comment by Jinshan Xiong (Inactive) [ 07/Mar/14 ]

I suspect that running the copy tool on a dedicated mount point cannot eliminate the problem, but it surely reduces the chance of it occurring.

The root cause of this problem is as follows:

In mdt_getattr_internal() the MDT checks whether the file is being restored. If so, it marks the reply with MS_RESTORE so that the client won't do a glimpse.

        /* if file is released, check if a restore is running */
        if ((ma->ma_valid & MA_HSM) && (ma->ma_hsm.mh_flags & HS_RELEASED) &&
            mdt_hsm_restore_is_running(info, mdt_object_fid(o))) {
                repbody->t_state = MS_RESTORE;
                repbody->valid |= OBD_MD_TSTATE;
        }

However, when the restore request reaches the MDT, it doesn't flush the UPDATE lock, so if the client has the UPDATE lock cached, it won't see this flag at all.

The fix would be to adjust the sequence of adding the RESTORE request to the action list. Right now it does the following:

                        mdt_lock_reg_init(&crh->crh_lh, LCK_EX);
                        obj = mdt_object_find_lock(mti, &crh->crh_fid,
                                                   &crh->crh_lh,
                                                   MDS_INODELOCK_LAYOUT);
                        if (IS_ERR(obj)) {
                                rc = PTR_ERR(obj);
                                CERROR("%s: cannot take layout lock for "
                                       DFID": rc = %d\n", mdt_obd_name(mdt),
                                       PFID(&crh->crh_fid), rc);
                                OBD_SLAB_FREE_PTR(crh, mdt_hsm_cdt_kmem);
                                GOTO(out, rc);
                        }

                        /* we choose to not keep a reference
                         * on the object during the restore time which can be
                         * very long */
                        mdt_object_put(mti->mti_env, obj);

                        mutex_lock(&cdt->cdt_restore_lock);
                        list_add_tail(&crh->crh_list, &cdt->cdt_restore_hdl);
                        mutex_unlock(&cdt->cdt_restore_lock);

So it requests the LAYOUT lock and then adds the request to a global list; we should change it to:
1. add to global list
2. flush UPDATE lock
3. request LAYOUT lock

In this way, we can make sure that all clients will see the MS_RESTORE flag.
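
As a rough sketch only (not the landed patch, and with the UPDATE-lock flush left as a comment because the exact MDT helper differs between releases), the reordered sequence would look something like:

                        /* 1. publish the restore handle first, so that
                         * mdt_getattr_internal() reports MS_RESTORE as soon
                         * as clients refetch attributes */
                        mutex_lock(&cdt->cdt_restore_lock);
                        list_add_tail(&crh->crh_list, &cdt->cdt_restore_hdl);
                        mutex_unlock(&cdt->cdt_restore_lock);

                        /* 2. flush the UPDATE lock: take MDS_INODELOCK_UPDATE
                         * and drop it immediately, so clients cancel any
                         * cached UPDATE lock and see MS_RESTORE on their next
                         * getattr (helper call elided here) */

                        /* 3. only then take the EX LAYOUT lock, which is held
                         * for the duration of the restore */
                        mdt_lock_reg_init(&crh->crh_lh, LCK_EX);
                        obj = mdt_object_find_lock(mti, &crh->crh_fid,
                                                   &crh->crh_lh,
                                                   MDS_INODELOCK_LAYOUT);
                        if (IS_ERR(obj))
                                GOTO(out, rc = PTR_ERR(obj));
                        /* as before, do not pin the object across the whole
                         * restore */
                        mdt_object_put(mti->mti_env, obj);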

Another way to fix the problem is to not do a getattr in the copy tool process. In that case, we can't call ct_stat_by_fid() in llapi_hsm_action_begin(). The stat fetched by ct_stat_by_fid() is used for setting the UID/GID and ATIME/MTIME on the volatile file. I think we can do this in other ways.

Robert, as a workaround, I think you can disable the call to ct_stat_by_fid() in llapi_hsm_action_begin() and do some corresponding cleanup, so that you can continue your work while we're fixing the problem.

John, do you remember why we should set UID/GID to the volatile file?

Comment by John Hammond [ 07/Mar/14 ]

We do it to maintain quota. Specifically, layout swap will fail if the ownership of the two files is not equal.

Comment by John Hammond [ 07/Mar/14 ]

Yes, it would be nice if the MDT sent what was needed to the copytool. I found that it's hard to fit that extra information into the current protocol. Otherwise a backdoor stat of some kind would be helpful.

Comment by Jinshan Xiong (Inactive) [ 07/Mar/14 ]

Another way is to save the attributes in the archive and restore them at restore time (the copy tool may already support this, IIRC). The only problem is the UID/GID setting, because it may change while the file is in the released state; mtime/atime won't be a problem at all.

Or we can use a dedicated ioctl() to get the attributes from the kernel without trying to glimpse the file size.
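
For illustration, here is a minimal sketch of such a no-glimpse attribute fetch from the copytool side, roughly the direction patch 13750 later took. It assumes the ct->open_by_fid_fd directory fd (the <mnt>/.lustre/fid fd, as used by ct_stat_by_fid()); the helper name ct_md_getattr_sketch() is hypothetical and the buffer layout expected by IOC_MDC_GETFILEINFO should be verified against lustre_user.h for the target release.

/* Sketch only: ask the MDT for owner/times via IOC_MDC_GETFILEINFO, which
 * returns the MDT-side attributes without glimpsing the file size from the
 * OSTs. */
static int ct_md_getattr_sketch(const struct hsm_copytool_private *ct,
                                const struct lu_fid *fid,
                                lstat_t *st)
{
        struct lov_user_mds_data *lmd;
        size_t lmd_size;
        int rc;

        /* room for the stat part plus the largest possible layout */
        lmd_size = sizeof(lmd->lmd_st) +
                   lov_user_md_size(LOV_MAX_STRIPE_COUNT, LOV_USER_MAGIC_V3);
        lmd = malloc(lmd_size);
        if (lmd == NULL)
                return -ENOMEM;

        /* the ioctl expects the entry name (here, the FID) at the start of
         * the buffer */
        snprintf((char *)lmd, lmd_size, DFID_NOBRACE, PFID(fid));

        rc = ioctl(ct->open_by_fid_fd, IOC_MDC_GETFILEINFO, lmd);
        if (rc != 0) {
                rc = -errno;
                goto out;
        }

        *st = lmd->lmd_st;
out:
        free(lmd);
        return rc;
}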

Comment by John Hammond [ 07/Mar/14 ]

If we assume that the copytool has a dedicated mount point (and we're feeling devious) then it may be enough to do:

static int ct_stat_by_fid(const struct hsm_copytool_private *ct,
                          const struct lu_fid *fid,
                          struct stat *buf)
{
        char fid_name[FID_NOBRACE_LEN + 1];

        snprintf(fid_name, sizeof(fid_name), DFID_NOBRACE, PFID(fid));
+       unlinkat(ct->open_by_fid_fd, fid_name, 0);
        return fstatat(ct->open_by_fid_fd, fid_name, buf, 0);
}

But that's not something to be proud of.

Comment by Robert Read (Inactive) [ 07/Mar/14 ]

It turns out my issue was self-inflicted. (In my version of the copytool I had neglected to call flush() before calling llapi_hsm_action_end(), and this left data in the file pointer buffers that wasn't flushed until after the end call when I closed my file handle. So either you need to flush or just close the volatile file handle before calling end. But that is unrelated to this bug.)
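
(A minimal sketch of that ordering, assuming the volatile file fd returned by llapi_hsm_action_get_fd() was wrapped in a stdio stream; finish_restore() is a made-up name:)

/* Sketch only: flush (or close) the volatile file stream before reporting
 * completion, so no buffered data lands in the file after
 * llapi_hsm_action_end() has already been called. */
static int finish_restore(struct hsm_copyaction_private *hcp,
                          FILE *volatile_fp, /* wraps the volatile file fd */
                          const struct hsm_extent *he)
{
        if (fflush(volatile_fp) != 0)
                return -errno;

        /* only now tell the coordinator the action is done */
        return llapi_hsm_action_end(&hcp, he, 0, 0);
}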

Comment by Andreas Dilger [ 11/Mar/14 ]

Robert, are there any restrictions on using this HSM API (e.g. capability needed, only on dedicated agent nodes set up by the sysadmin)? Otherwise, it seems like a potential problem for bad users to be able to lock up the system. Also, what is the extent of the problem? Is it only this one process that is hung (an acceptable loss for a self-inflicted problem) or does it affect the whole client, or even the MDS?

Comment by Robert Read (Inactive) [ 11/Mar/14 ]

I believe the copytool must be run as root, but it can be run on any client. In my case only the copytool process was hung and unkillable. It also prevents the file in question from being restored, at least until the coordinator times out the action request and sends it to another copytool.

Comment by Li Xi (Inactive) [ 19/May/14 ]

Hi all,

Is there any progress on this issue? It is really annoying when I am testing HSM. Is there any easy way to work around it at least? Using a dedicated mount point for the copytool is not helping...

Thanks!

Comment by Patrick Farrell (Inactive) [ 12/Aug/14 ]

Jinshan - Looking at your description of a possible solution...
"So it requests LAYOUT lock and then add the request into a global list, we should change it to:
1. add to global list
2. flush UPDATE lock
3. request LAYOUT lock"

When you say "flush UPDATE lock", how are you suggesting this be done? Take an update lock on the object, then take the layout lock? If so, when do we release the update lock? Before taking the layout lock, after getting the layout lock, or some other time?

Also, this comment at the top of the function is confusing me:
" * in case of restore, caller must hold layout lock"
Is this comment in error? If not, what layout lock is it referring to/what sort of lock on layout? I ask because if it's a restore request, we take a layout lock, which seems to imply the caller did not have a layout lock already.

Comment by Jinshan Xiong (Inactive) [ 18/Dec/14 ]

I'm thinking about a solution for this problem.

When you say "flush UPDATE lock", how are you suggesting this be done? Take an update lock on the object, then take the layout lock? If so, when do we release the update lock? Before taking the layout lock, after getting the layout lock, or some other time?

By "flush UPDATE lock", I meant to acquire the UPDATE lock and release it immediately.

Is this comment in error? If not, what layout lock is it referring to/what sort of lock on layout? I ask because if it's a restore request, we take a layout lock, which seems to imply the caller did not have a layout lock already.

That refers to the layout lock taken inside the function itself, i.e., this code:

                        mdt_lock_reg_init(&crh->crh_lh, LCK_EX);
                        obj = mdt_object_find_lock(mti, &crh->crh_fid,
                                                   &crh->crh_lh,
                                                   MDS_INODELOCK_LAYOUT);
Comment by Gerrit Updater [ 19/Dec/14 ]

Jinshan Xiong (jinshan.xiong@intel.com) uploaded a new patch: http://review.whamcloud.com/13138
Subject: LU-4727 hsm: flush UPDATE lock for restore
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: bf2e4b958f60cb7eda9303ad0c079fd23ff2d16b

Comment by Jinshan Xiong (Inactive) [ 19/Dec/14 ]

Please try patch 13138 and check if it can fix the problem.

Comment by Vinayak Hariharmath (Inactive) [ 19/Dec/14 ]

http://review.whamcloud.com/13138 solves the problem on single node setup on local vm. Thanks for the patch Jinshan

Comment by Jinshan Xiong (Inactive) [ 05/Feb/15 ]

The patch has been in Gerrit for a long time. Please let me know what I can do to move this forward, sigh.

Comment by Gerrit Updater [ 12/Feb/15 ]

John L. Hammond (john.hammond@intel.com) uploaded a new patch: http://review.whamcloud.com/13750
Subject: LU-4727 hsm: use IOC_MDC_GETFILEINFO in restore
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 8ec6354ded37f3e1f39d6e0336c9e17b1a97785b

Comment by Gerrit Updater [ 08/Mar/15 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/13750/
Subject: LU-4727 hsm: use IOC_MDC_GETFILEINFO in restore
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 96dbac2eaef7a5d1090807bedc9951279c06d037

Comment by Frank Zago (Inactive) [ 23/Mar/15 ]

Jinshan,

I got this trace when using stat on a file that I was restoring (or that had been restored). I haven't been able to reproduce it so far. Is it the error you mention in your patch? John's patch is applied on that tree and works well otherwise.

Mar 20 15:00:55 tasclient01 kernel: INFO: task stat:951 blocked for more than 120 seconds.
Mar 20 15:00:55 tasclient01 kernel:      Tainted: P           ---------------    2.6.32-431.17.1.el6.x86_64 #1
Mar 20 15:00:55 tasclient01 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables thi
s message.
Mar 20 15:00:55 tasclient01 kernel: stat          D 0000000000000000     0   951  29769 0x00000084
Mar 20 15:00:55 tasclient01 kernel: ffff8801d0471a58 0000000000000082 0000000000000000 000000000000000d
Mar 20 15:00:55 tasclient01 kernel: 0000000000000004 ffff880237fee800 ffff880116db5610 0000000000000630
Mar 20 15:00:55 tasclient01 kernel: ffff8802143f1098 ffff8801d0471fd8 000000000000fbc8 ffff8802143f1098
Mar 20 15:00:55 tasclient01 kernel: Call Trace:
Mar 20 15:00:55 tasclient01 kernel: [<ffffffff8152935e>] __mutex_lock_slowpath+0x13e/0x180
Mar 20 15:00:55 tasclient01 kernel: [<ffffffff815291fb>] mutex_lock+0x2b/0x50
Mar 20 15:00:55 tasclient01 kernel: [<ffffffffa09e009a>] ll_layout_refresh+0x1da/0xc60 [lustre]
Mar 20 15:00:55 tasclient01 kernel: [<ffffffff810f0e46>] ? ftrace_test_stop_func+0x16/0x20
Mar 20 15:00:55 tasclient01 kernel: [<ffffffff8100ad96>] ? ftrace_call+0x5/0x2b
Mar 20 15:00:55 tasclient01 kernel: [<ffffffff810f0e46>] ? ftrace_test_stop_func+0x16/0x20
Mar 20 15:00:55 tasclient01 kernel: [<ffffffffa0a04ab0>] ? ll_md_blocking_ast+0x0/0x7d0 [lustre]
Mar 20 15:00:55 tasclient01 kernel: [<ffffffffa06996a0>] ? ldlm_completion_ast+0x0/0x920 [ptlrpc]
Mar 20 15:00:55 tasclient01 kernel: [<ffffffffa04dfe31>] ? cl_io_slice_add+0xc1/0x190 [obdclass]
Mar 20 15:00:55 tasclient01 kernel: [<ffffffffa0a2d8f0>] vvp_io_init+0x340/0x490 [lustre]
Mar 20 15:00:55 tasclient01 kernel: [<ffffffff810f0e46>] ? ftrace_test_stop_func+0x16/0x20
Mar 20 15:00:55 tasclient01 kernel: [<ffffffff8100ad96>] ? ftrace_call+0x5/0x2b
Mar 20 15:00:55 tasclient01 kernel: [<ffffffffa04df278>] cl_io_init0+0x98/0x160 [obdclass]
Mar 20 15:00:55 tasclient01 kernel: [<ffffffffa04e1ea4>] cl_io_init+0x64/0xe0 [obdclass]
Mar 20 15:00:55 tasclient01 kernel: [<ffffffffa0a23161>] cl_glimpse_size0+0x91/0x1d0 [lustre]
Mar 20 15:00:55 tasclient01 kernel: [<ffffffffa09d4c25>] ll_inode_revalidate_it+0x1a5/0x1d0 [lustre]
Mar 20 15:00:55 tasclient01 kernel: [<ffffffffa09d4c99>] ll_getattr_it+0x49/0x170 [lustre]
Mar 20 15:00:55 tasclient01 kernel: [<ffffffffa09d4df7>] ll_getattr+0x37/0x40 [lustre]
Mar 20 15:00:55 tasclient01 kernel: [<ffffffff81227163>] ? security_inode_getattr+0x23/0x30
Mar 20 15:00:55 tasclient01 kernel: [<ffffffff8118e631>] vfs_getattr+0x51/0x80
Mar 20 15:00:55 tasclient01 kernel: [<ffffffff8118e6c4>] vfs_fstatat+0x64/0xa0
Mar 20 15:00:55 tasclient01 kernel: [<ffffffff810f0e46>] ? ftrace_test_stop_func+0x16/0x20
Mar 20 15:00:55 tasclient01 kernel: [<ffffffff8118e76e>] vfs_lstat+0x1e/0x20
Mar 20 15:00:55 tasclient01 kernel: [<ffffffff8118e794>] sys_newlstat+0x24/0x50
Mar 20 15:00:55 tasclient01 kernel: [<ffffffff810e1cc7>] ? audit_syscall_entry+0x1d7/0x200
Mar 20 15:00:55 tasclient01 kernel: [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
Comment by Jinshan Xiong (Inactive) [ 23/Mar/15 ]

What's this process? From the name I guess it's not the copy tool. John's patch only fixes the copy tool case.

You will also need patch 13138 to address this case if the restore takes longer than 120 seconds.

Comment by Frank Zago (Inactive) [ 23/Mar/15 ]

The process is stat, which does a stat() on a file.
Is this the problem your patch should fix, or does it fix something else?

Comment by Jinshan Xiong (Inactive) [ 24/Mar/15 ]

Yes, the patch can fix this problem.

Comment by Peter Jones [ 23/Apr/15 ]

This patch landed quite a while back. Please speak up (or open a new ticket) if there are still residual issues to address.

Comment by Li Xi (Inactive) [ 06/May/15 ]

Hmm, I still got the following error on a recent master branch with the 13750 patch.

May 6 19:13:44 server1 kernel: INFO: task test.py:32411 blocked for more than 120 seconds.
May 6 19:13:44 server1 kernel: Not tainted 2.6.32-431.29.2.el6_lustre.2.5.29.ddnpf4.x86_64 #1
May 6 19:13:44 server1 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
May 6 19:13:44 server1 kernel: test.py D 0000000000000000 0 32411 23005 0x00000080
May 6 19:13:44 server1 kernel: ffff88055238fb08 0000000000000082 0000000000000000 ffff88055351c5c0
May 6 19:13:44 server1 kernel: ffff88055238fab8 ffffffffa0e0685e ffff88086d4a49d8 ffff880597c9b980
May 6 19:13:44 server1 kernel: ffff880553ee85f8 ffff88055238ffd8 000000000000fbc8 ffff880553ee85f8
May 6 19:13:44 server1 kernel: Call Trace:
May 6 19:13:44 server1 kernel: [<ffffffffa0e0685e>] ? mdc_set_open_replay_data+0x29e/0x500 [mdc]
May 6 19:13:44 server1 kernel: [<ffffffff8152b9de>] __mutex_lock_slowpath+0x13e/0x180
May 6 19:13:44 server1 kernel: [<ffffffff8152b87b>] mutex_lock+0x2b/0x50
May 6 19:13:44 server1 kernel: [<ffffffffa0ec7ea1>] ll_layout_refresh+0x191/0x300 [lustre]
May 6 19:13:44 server1 kernel: [<ffffffffa0ed78ac>] ll_setattr_raw+0x5cc/0xbe0 [lustre]
May 6 19:13:44 server1 kernel: [<ffffffffa0ed7f25>] ll_setattr+0x65/0xd0 [lustre]
May 6 19:13:44 server1 kernel: [<ffffffff811a8158>] notify_change+0x168/0x340
May 6 19:13:44 server1 kernel: [<ffffffff81187374>] do_truncate+0x64/0xa0
May 6 19:13:44 server1 kernel: [<ffffffff8119c4b1>] do_filp_open+0x861/0xd20
May 6 19:13:44 server1 kernel: [<ffffffff81290c5a>] ? strncpy_from_user+0x4a/0x90
May 6 19:13:44 server1 kernel: [<ffffffff811a9032>] ? alloc_fd+0x92/0x160
May 6 19:13:44 server1 kernel: [<ffffffff81186039>] do_sys_open+0x69/0x140
May 6 19:13:44 server1 kernel: [<ffffffff81186150>] sys_open+0x20/0x30
May 6 19:13:44 server1 kernel: [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b

Comment by John Hammond [ 06/May/15 ]

> Hmm, I still got the following error on a recent master branch with the 13750 patch.
> ...

Please post your reproducer.

Comment by Li Xi (Inactive) [ 06/May/15 ]

Hi John,

Sorry, the problem I hit today is a different issue, and I am not sure whether it is a bug or not. I can confirm that LU-4727 is fixed for me too.

I hit the new problem when the HSM tool (a private tool, not the standard lhsmtool_posix) exits right after 'lfs hsm_restore' completes, while the action status in "lctl get_param mdt.server1-MDT0000.hsm.actions" is still STARTED, not SUCCEED. As long as the tool keeps itself alive until the action status becomes SUCCEED, the problem is gone.

However, this might be an issue too, because the HSM tool can exit at any time, and even if it restarts, the action will never continue. I don't know any way to recover the HSM status except rebooting the machine. Do you have any advice? Thanks!
