Using IO Timeouts and Interrupts on NT
======================================

This technical memo is a cautionary note on using the IO timeout and
interrupt features of the Netscape Portable Runtime (NSPR) on Windows
NT 3.51 and 4.0. Due to a limitation of the present implementation of
NSPR IO on NT, programs must follow this guideline:

If a thread calls an NSPR IO function on a file descriptor and the IO
function fails with ``PR_IO_TIMEOUT_ERROR`` or
``PR_PENDING_INTERRUPT_ERROR``, the file descriptor must be closed
before the thread exits.
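
The guideline can be sketched as a small wrapper around an IO call. This
is a minimal illustration, not code from NSPR itself. The helper name
``recv_or_close`` and the ``USE_REAL_NSPR`` build flag are hypothetical.
The stand-in declarations in the ``#else`` branch exist only so the
sketch compiles without NSPR installed. They mimic the few declarations
used here, and the stub ``PR_Recv()`` always times out.

```c
#include <assert.h>
#include <stddef.h>

#ifdef USE_REAL_NSPR   /* hypothetical build flag for this sketch */
#include <nspr.h>
#else
/* Stand-ins so the sketch compiles without NSPR installed. They mimic only
 * the few declarations used below; the stub PR_Recv() always times out. */
typedef int PRInt32;
typedef int PRIntn;
typedef unsigned int PRIntervalTime;
typedef PRInt32 PRErrorCode;
typedef enum { PR_FAILURE = -1, PR_SUCCESS = 0 } PRStatus;
typedef struct PRFileDesc { int is_open; } PRFileDesc;
#define PR_IO_TIMEOUT_ERROR        (-5990)
#define PR_PENDING_INTERRUPT_ERROR (-5993)

static PRErrorCode stub_last_error = 0;

static PRInt32 PR_Recv(PRFileDesc *fd, void *buf, PRInt32 amount,
                       PRIntn flags, PRIntervalTime timeout)
{
    (void)fd; (void)buf; (void)amount; (void)flags; (void)timeout;
    stub_last_error = PR_IO_TIMEOUT_ERROR;
    return -1;
}

static PRErrorCode PR_GetError(void) { return stub_last_error; }

static PRStatus PR_Close(PRFileDesc *fd)
{
    fd->is_open = 0;
    return PR_SUCCESS;
}
#endif

/* Receive into buf. If the call fails with a timeout or interrupt error,
 * close the descriptor right away and set *fd_ptr to NULL so the caller
 * cannot reuse it or forget to close it before the thread exits. */
static PRInt32 recv_or_close(PRFileDesc **fd_ptr, void *buf, PRInt32 len,
                             PRIntervalTime timeout)
{
    PRInt32 n;

    assert(fd_ptr != NULL && *fd_ptr != NULL);
    n = PR_Recv(*fd_ptr, buf, len, 0, timeout);
    if (n < 0) {
        /* Read the error code before PR_Close() can overwrite it. */
        PRErrorCode err = PR_GetError();
        if (err == PR_IO_TIMEOUT_ERROR || err == PR_PENDING_INTERRUPT_ERROR) {
            PR_Close(*fd_ptr);
            *fd_ptr = NULL;
        }
    }
    return n;
}
```

Closing immediately is one design choice among several. The guideline
only requires that the close happen before the thread exits, so a
program could instead record the failure and close the descriptor later
in the same thread.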

In this memo we explain the problem this guideline works around and
discuss the guideline's limitations.

.. _NSPR_IO_on_NT:

NSPR IO on NT
-------------

The IO model of NSPR 2.0 is synchronous and blocking. A thread calling
an IO function is blocked until the IO operation finishes, either due to
a successful IO completion or an error. If the IO operation cannot
complete before the specified timeout, the IO function returns with
``PR_IO_TIMEOUT_ERROR``. If the thread gets interrupted by another
thread's ``PR_Interrupt()`` call, the IO function returns with
``PR_PENDING_INTERRUPT_ERROR``.

On Windows NT, NSPR IO is implemented using NT's *overlapped* (also
called *asynchronous*) *IO*. When a thread calls an IO function, the
thread issues an overlapped IO request using the overlapped buffer in
its ``PRThread`` structure. Then the thread is put to sleep. In the
meantime, dedicated internal threads (called the *idle threads*) monitor
the IO completion port for completed IO requests. When a completed IO
request appears at the IO completion port, an idle thread fetches it and
wakes up the thread that issued the request. This is the normal way the
thread is awakened.

.. _IO_Timeout_and_Interrupt:

IO Timeout and Interrupt
------------------------

However, NSPR may wake up the thread in two other situations:

-  The overlapped IO request is not completed before the specified
   timeout. (Overlapped IO requests do not accept a timeout, so all
   timeouts are handled at the NSPR level.) In this case, the error is
   ``PR_IO_TIMEOUT_ERROR``.

-  The thread gets interrupted by another thread's ``PR_Interrupt()``
   call. In this case, the error is ``PR_PENDING_INTERRUPT_ERROR``.

These two errors are generated by the NSPR layer, so the OS is unaware
of them and the overlapped IO request is still in progress. The OS still
holds a pointer to the overlapped buffer in the thread's ``PRThread``
structure. If the thread subsequently exits and its ``PRThread``
structure gets deleted, that pointer refers to freed memory, and the OS
may write into it when the request eventually completes.
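
The hazard can be made concrete with a toy model. This is only a
single-threaded sketch of the bookkeeping described above, not NSPR's or
NT's actual implementation. Every type and function name in it is
hypothetical, and closing the descriptor is modeled simply as the OS
dropping its pointer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-ins; none of these are NSPR or Win32 types. */
typedef struct overlapped_buf { int bytes_done; } overlapped_buf;
typedef struct thread_rec { overlapped_buf ov; } thread_rec;   /* plays the role of PRThread */
typedef struct os_state { overlapped_buf *pending; } os_state; /* what the OS remembers */

typedef enum { WOKE_ON_COMPLETION, WOKE_ON_TIMEOUT } wake_reason;

/* The thread hands the OS a pointer into its own thread record. */
static void issue_io(os_state *os, thread_rec *t)
{
    assert(os->pending == NULL);
    t->ov.bytes_done = 0;
    os->pending = &t->ov;
}

/* The runtime gives up waiting. Note that nothing here informs the OS. */
static wake_reason wait_times_out(void)
{
    return WOKE_ON_TIMEOUT;
}

/* Later, the OS finishes the request and writes through its pointer. */
static void os_completes_io(os_state *os, int nbytes)
{
    if (os->pending != NULL) {
        os->pending->bytes_done = nbytes;
        os->pending = NULL;
    }
}

/* Closing the descriptor, modeled here as the OS dropping the request. */
static void close_descriptor(os_state *os)
{
    os->pending = NULL;
}

/* Freeing the thread record is safe only if the OS holds no pointer into it. */
static bool safe_to_free(const os_state *os, const thread_rec *t)
{
    return os->pending != &t->ov;
}
```

In this model, calling ``issue_io()`` and then ``wait_times_out()``
leaves ``safe_to_free()`` returning false. It returns true only after
``close_descriptor()`` or ``os_completes_io()`` has run, which mirrors
why the thread must not exit while the request is outstanding.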

.. _Canceling_Overlapped_IO_by_Closing_the_File_Descriptor:

Canceling Overlapped IO by Closing the File Descriptor
------------------------------------------------------

Therefore, we need to cancel the outstanding overlapped IO request
before the thread exits. NT's ``CancelIo()`` function would be ideal for
this purpose. Unfortunately, ``CancelIo()`` is not available on NT 3.51,
so we can't go this route as long as we are supporting NT 3.51. The only
reliable way to cancel an outstanding overlapped IO request that works
on both NT 3.51 and 4.0 is to close the file descriptor, hence the rule
of thumb stated at the beginning of this memo.

.. _Limitations:

Limitations
-----------

This seemingly harsh way of forcing outstanding overlapped IO requests
to complete has the following limitations:

-  It is difficult for threads to share a file descriptor. For example,
   suppose thread A and thread B call ``PR_Accept()`` on the same
   socket, and they time out at the same time. Following the rule of
   thumb, both threads would close the socket. The first ``PR_Close()``
   would succeed, but the second ``PR_Close()`` would free memory that
   has already been freed. A solution that may work is to use a lock to
   ensure that only one thread uses that socket at any time.

-  Once there is a timeout or interrupt error, the file descriptor is no
   longer usable. If the file descriptor is intended to be used for the
   lifetime of the process (a log file, for example), this is not
   acceptable. A possible solution is to add a
   ``PR_DisableInterrupt()`` function to turn off interrupts when
   accessing such file descriptors.

..

   *A related known bug is that timeout and interrupt don't work for*
   ``PR_Connect()`` *on NT. This bug is due to a different limitation in
   our NT implementation.*
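
The lock suggested under the first limitation can be sketched as
follows. This is a minimal illustration under stated assumptions, not an
NSPR facility. A POSIX mutex stands in for an NSPR lock. The
``shared_fd`` type, the ``shared_io`` helper, and the result codes are
all hypothetical. The ``close_calls`` counter marks the place where a
real program would call ``PR_Close()``.

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical wrapper for a descriptor shared between threads. */
typedef struct shared_fd {
    pthread_mutex_t lock;   /* stands in for an NSPR lock */
    int open;               /* 1 while the descriptor is still usable */
    int close_calls;        /* how many times the descriptor was closed */
} shared_fd;

/* Hypothetical result codes for this sketch only. */
enum { IO_TIMED_OUT = -1, IO_ALREADY_CLOSED = -2 };

/* An IO operation; returns IO_TIMED_OUT on timeout or interrupt. */
typedef int (*io_op)(void);

/* Run op while holding the lock. Because only one thread can be inside
 * op at a time, only one thread can observe the timeout, and the
 * descriptor is closed exactly once. */
static int shared_io(shared_fd *s, io_op op)
{
    int rc;

    assert(s != NULL && op != NULL);
    pthread_mutex_lock(&s->lock);
    if (!s->open) {
        rc = IO_ALREADY_CLOSED;      /* another thread closed it earlier */
    } else {
        rc = op();
        if (rc == IO_TIMED_OUT) {
            s->open = 0;
            s->close_calls++;        /* a real program would call PR_Close() here */
        }
    }
    pthread_mutex_unlock(&s->lock);
    return rc;
}

/* Demonstration operation that always reports a timeout. */
static int always_times_out(void)
{
    return IO_TIMED_OUT;
}
```

Holding the lock across a blocking call serializes all users of the
descriptor, so this trades concurrency for safety. A thread that
receives ``IO_ALREADY_CLOSED`` knows it must not touch or close the
descriptor again.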

.. _Conclusions:

Conclusions
-----------

As long as we need to support NT 3.51, programs must follow the
guideline that after an IO timeout or interrupt error, the thread makes
sure the file descriptor is closed before it exits. Programs should also
take care when sharing file descriptors and when using IO timeout or
interrupt on files that need to stay open for the lifetime of the
process.

When we stop supporting NT 3.51, we can look into using NT 4's
``CancelIo()`` function to cancel outstanding overlapped IO requests
when we get IO timeout or interrupt errors. If ``CancelIo()`` really
works as advertised, that should fundamentally solve this problem.

If these limitations of IO timeout and interrupt are not acceptable for
your programs, you can consider using the Win95 version of NSPR. The
Win95 version runs without trouble on NT, but you would lose the better
performance provided by NT fibers and asynchronous IO.

.. _Original_Document_Information:

Original Document Information
-----------------------------

-  Author: larryh@netscape.com

-  Last Updated Date: December 1, 2004