LWIO Asynchronouse I/O And Rundown Speclet
------------------------------------------
1. Introduction
---------------
This speclet describes how asynchronous I/O support is being added to LWIO.
It also addresses improvements in how handle rundown happens in LWIO.
1.1 Goals
---------
- Asynchronous I/O model with cancellation, including:
+ ioapi.h support for asynchronous I/O.
- IoXxxFile() APIs.
- Cancellation.
+ Driver support for processing IRPs asynchronously.
+ Driver support for cancelling IRPs.
- Ensure that handle rundown works in desired scenarios.
1.2 Non-Goals
-------------
- NtXxxFile()/NtCtxXxxFile (ntfileapi.h) support for asynchronous I/O.
2. Async IO Model
-----------------
2.1 IoXxxFile() APIs
--------------------
IoXxxFile() APIs have an optional parameter:
IN OUT OPTIONAL PIO_ASYNC_CONTROL_BLOCK AsyncControlBlock,
Currently, this parameter is unused and must always be NULL. The async I/O
support will use this parameter to control whether the IoXxxFile() API should
behave asychronously.
Here is how the ACB (asynchronous control block) works:
typedef struct _IO_ASYNC_CONTROL_BLOCK {
IN PIO_ASYNC_CALLBACK Callback;
IN OPTIONAL PVOID CallbackContext;
OUT PIO_ASYNC_CONTEXT AsyncContext;
} IO_ASYNC_CONROL_BLOCK, *PIO_ASYNC_CONTROL_BLOCK;
On input, the ACB must contain a callback function. On output, the ACB will
return an AsyncContext that can be used to cancel the I/O. It is up to the
caller to dereference the returned context when the caller is done with it by
calling IoDereferenceAsyncContext().
VOID
IoDereferenceAsyncContext(
IN OUT PIO_ASYNC_CONTEXT* AsyncContext
);
The async callback is defined as:
typedef VOID (*PIO_ASYNC_CALLBACK)(
IN PVOID CallbackContext,
IN PIO_ASYNC_CONTEXT AsyncContext,
IN PIO_STATUS_BLOCK IoStatusBlock
);
The callback does not have to worry about the AsyncContext passed in here.
(Note, however, that the caller can use this value to call
IoDereferenceAsyncContext() if they never dereferenced the context returned in
the original async I/O call.)
ISSUE: Do we want to pass in AsyncContext here? It is just being passed in as
a convenience to the caller in case they do not want to have a CallbackContext
but still want to track additional context w/the operation.
The IoStatusBlock will contain the result of the operation.
2.2 Cancellation API
--------------------
To cancel an async I/O, the caller calls IoCancelAsyncContext(). Note that
the caller must still dereference the context.
BOOLEAN
IoCancelAsyncContext(
IN PIO_ASYNC_CONTEXT AsyncContext
);
IoCancelAsyncContext() will set the cancel bit on the call and call the
cancellation routine on the operation (if any). If the operation is in a
cancellable state (i.e., a cancel routine is set), IoCancelAsyncContext()
returns TRUE. If the operation is not cancellable (i.e., a cancel routine is
not currently set), IoCancelAsyncContext() returns FALSE.
Should IoCancelAsyncContext() return a BOOLEAN or just be VOID? Is there any
reason why we would actually care?
Once an async I/O is cancelled, the operation will complete with either
STATUS_CANCELLED or the normal result of the I/O. That means that the
async callback will be called with the appropriate result.
2.3 Driver Support
-------------------
To support asynchronous I/O, drivers must be able to behave asychronously with
respect to the I/O operation. If a driver behaves synchronously, the I/O will
be synchronous even if the caller (via IoXxxFile() call) asked for it to be
asynchronous.
There is no way for a driver to tell whether the caller asked for synchronous
or asynchronous behavior.
When a driver processes an IRP asychronously, it must return STATUS_PENDING to
tell the IO manager to take its hands off the IRP until the driver completes
the IRP.
When the IO manager gets STATUS_PENDING from the driver, it will do one of two
things depending on whether the caller of the IoXxxFile() API asked for
synchronous or asynchronous behavior. In the synchronous case, the IO manager
just waits for the driver to complete the I/O before returning the final
result to the caller. In the asynchronous case, the IO manager just returns
STATUS_PENDING to the caller (along with an AsyncContext).
2.3.1 IoIrpXxx() APIs
---------------------
To support asynchronous processing of IRPs, the IO manager provies the
following IoIrpXxx() APIs (via iodriver.h).
typedef NTSTATUS (*PIO_IRP_CALLBACK)(
IN PVOID CallbackContext,
IN PIRP Irp
);
VOID
IoIrpSetCancelCallback(
IN PIRP Irp,
IN PIO_IRP_CALLBACK Callback,
IN OPTIONAL PVOID CallbackContext
);
// Cannot touch IRP after this:
VOID
IoIrpComplete(
IN OUT PIRP Irp
);
2.3.2 Cancellation
------------------
For an I/O to be cancellable, the driver must call IoIrpSetCancelCallback().
Ideally, the driver should do this before returning STATUS_PENDING to the
driver. Note that a driver can set a cancel callback even when an opration is
synchronous.
Before a driver completes an IRP (synchronously or asynchronously), the driver
must clear any cancellation callback that it set on the IRP. This is done by
setting a NULL cancellation callback using IoIrpSetCancelCallback().
2.3.3 Completion
----------------
If an IRP is handled asynchronously, the IRP is completed by calling
IoIrpComplete(). Before calling this function, however, the driver must make
sure to clear any cancellation callback (see above) and it must set the
IoStatusBlock in the IRP with the proper Status and Information.
3. Rundown
----------
There are two basic rundown scenarios to consider here. One is the lwiod
shutdown case and the other is an RPC named pipe (NP) server scenario. The
former can be handled with the standard lwmsg and iomgr asynchronous support.
The latter, however, needs additional support.
There are three basic rundown scenarios to consider here:
1) lwiod shutdown
2) RPC named pipe (NP) server shutdown
3) SRV operation vs. close race
The first can be handled with the standard lwmsg and iomgr asynchronous
support. The other two, however, need additional support.
3.1 lwiod Shutdown Scenario
---------------------------
With the lwmsg async support, all lwmsg-based i/o would be async on the
server-side. If the IPC connection goes away, lwmsg can cancel all
asynchronous operations and cleanup its file object handles. Thus, if lwiod
shuts down lwmsg before shutting down the iomgr, the iomgr does not need any
additional rundown support aside from asynchronous i/o cancellation support.
In other words, because of lwmsg, iomgr does not need additional rundown
support for rundown. Rather, lwiod just need to use lwmsg in the following
manner:
- all lwmsg-based i/o would be async on server-side.
- lwmsg server will do async cancel and then close.
- would need to ensure that lwiod shutdown first shuts down lwmsg before
shutting down iomgr.
3.2 RPC NP Server Shutdown Scenario
-----------------------------------
The important rundown scenario for us is RPC NP server shutdown:
a) thread 1 - "accept" thread (waiting for client connection)
- NtFsControlFile(h)
b) thread 2 - shutdown server
- NtCloseFile(h)
- wait for "accept" thread to complete
This scenario is not handled by lwmsg handle rundown.
The current lwmsg server-side usage (IopIpcCloseFile()) handles NtCloseFile()
by simply unregistering the file object handle from lwmsg (using
lwmsg_assoc_unregister_handle()). This does not issue an IoCloseFile(). The
IoCloseFile() is only issues when IopIpcCleanupFileHandle() is called by lwmsg
when the last handle reference goes away. The last reference only goes away
when all operations for the handle dispatched by lwmsg have been finished.
ADD INFO TO SPEC: Why does lwmsg_assoc_unregister_handle() not cancel
all operations using the handle? Is it because lwmsg does not track that
explicitly?
Additional support is needed in iomgr to handle this scenario.
3.3 SRV Operation vs. Close Race Scenario
-----------------------------------------
There is a scenario in SRV where a client sends an operation on a FID and a
close on a FID over the wire thus creating a race between the two operations.
(This is not a chaining case, but rather two "independent" oprations.) While
this sequence would be invalid from a correctness point of view, the server
needs to behave sensisbly and not crash.
If SRV were to just call IoCloseFile() and IoReadFile() in separate threads, a
race coould happen where the IoCloseFile() would clean up the file object and
cause the IoReadFile() to be called with invalid memory.
3.4 Rundown Solution
--------------------
The solution to the unhandled rundown scenarios above is to separate the
notion of "closing the file" from the lifetime of the file object. This can
be achieved with the notion of "shutdown" vs. "close" (as per BSD socket
notions). We can define a "shutdown" that does the equivalent of "closing the
file" without de-allocating the handle/file object. The "close" then becomes
a "shutdown" + file object cleanup.
In places where XxxCloseFile() would be called, we can now call
XxxShutdownFile() instead. This would set the file to a "closed" state,
cancel any outstanding I/O, and prevent any new I/O from being issued. When
the caller determines that it no longer needs the file handle (i.e., all I/O
is complete or even the last reference to state embedding the file handle is
gone), the caller can call XxxCloseFile() to free up the file object state.
This is the breakdown of how the relevant APIs would work:
IoShutdownFile() [TODO]
- if !FILE_OBJECT_FLAG_CLOSED
+ set FILE_OBJECT_FLAG_CLOSED
+ issue cancellation for all outstanding I/O
+ MAY wait for outstanding I/O to complete (not required)
IoCloseFile()
- call IoShutdownFile() [TODO]
- wait for outstanding I/O to complete [TODO]
- send IRP_TYPE_CLOSE to driver
NtCtxShutdownFile() [TODO]
- send SHUTDOWN_FILE message to lwiod
NtCtxCloseFile()
- send CLOSE_FILE message
- remove from handle table
+ call lwmsg_assoc_unregister_handle() (via NtIpcUnregisterFileHandle()).
IopIpcShutdownFile() [TODO]
- call IoShutdownFile()
IopIpcCloseFile()
- call IoShutdownFile() [TODO]
- remove from handle table
+ call lwmsg_assoc_unregister_handle() (via NtIpcUnregisterFileHandle()).
IopIpcCleanupFileHandle()
- NOTE: called on last deref from lwmsg
- call IoCloseFile()
Note that IoCloseFile() and IoShutdownFile() could block if there are
in-flight ops. So IoCloseFile() and IoShutdownFile() need to support ACB to
signal completion but not to cancel.
3.4.1 Rundown Solution Template
-------------------------------
This is a general algorithm/template for rundown protection. It need to be
modified depending on how state is managed. Note that LAST below could be 1
or 0 depending on how how the context is managed and whether a reference count
or in-flight operation count is used.
a) thread 1 - doing some operation
- InterlockedIncrement(&context->RefCount) // (potentially from table lookup)
- Lock(context->Mutex)
- isClosed = context->IsClosed
- Unlock(context->Mutex)
- if (!isClosed)
+ XxxOperationFile(context->FileHandle)
+ Lock(context->Mutex)
+ isClosed = context->IsClosed
+ Unlock(context->Mutex)
- count = InterlockedDecrement(&context->RefCount)
- if (isCosed && (count == LAST))
+ Signal(context->RundownCondition)
b) thread 2 - rundown/"close"
- InterlockedIncrement(&context->RefCount) // (potentially from table lookup)
- Lock(context->Mutex)
- context->IsClosed = true
- (remove context from appropriate table(s))
- Unlock(context->Mutex)
- XxxShutdownFile(context->FileHandle, WAIT)
- Lock(context->mutex)
- count = InterlockedGet(&context->RefCount)
- if (count > 1)
+ Wait(context->RundownCondition, context->Mutex)
- Unlock(context->Mutex)
- XxxCloseFile(context->FileHandle)
- context->FileHandle = NULL
- InterlockedDecrement(&context->RefCount)
3.4.2 RPC NP Server Shutdown Scenario Solution
----------------------------------------------
a) thread 1 - "accept" thread (waiting for client connection)
- NtFsControlFile(h)
b) thread 2 - shutdown server
- NtShutdownFile(h)
- wait for "accept" thread to complete (join)
- NtCloseFile(h)
3.4.3 SRV Close Race Solution
-----------------------------
The implementation is like the template section above (3.4.1).
4. Async I/O IO Manager Implementation
--------------------------------------
[CURRENTLY WORKING ON THIS]
In the IO manager, we need to add the following:
- From IoXxxFile() API perspective, PIO_ASYNC_CONTEXT is just an opaque
pointer. Internally, it will point to either an IRP or a field inside an
IRP.
- Need reference counting for IRP.
- Need IRP_FLAGS to track state of IRP:
+ IRP_FLAG_COMPLETE
+ IRP_FLAG_CANCEL
- Need cancellation callback information in IRP.
- Need completion callback information in IRP.
- Need to track "close" state in file object
+ FILE_OBJECT_FLAG_CLOSED
4.1 Locking
-----------
Need locks to protect:
- Device object's file object list
- File object's IRP list and flags
- IRP's cancellation information (flag and callback).
Functions affected by locks:
- IopFileObjectAllocate()/IopFileObjectFree().
- IoIrpSetCancelCallback()/IoCancelAsyncContext()/IoIrpComplete().
[TODO: Specify lock locations]
Current assumptions:
- drivers only loaded synchronously on startup
- devices only created synchronously on startup
Implications:
- no locks needed to protect driver and device lists
4.2 IoShutdownFile()/IoCloseFile() Details
------------------------------------------
NTSTATUS
IoShutdownFile(
IN OUT IO_FILE_HANDLE FileHandle,
IN OPTIONAL PIO_ASYNC_CONTROL_BLOCK AsyncControlBlock,
IN BOOLEAN WaitForInFlightOperation
);
IoShutdownFile() needs to support async:
- if (!FILE_OBJECT_FLAG_CLOSED)
+ set FILE_OBJECT_FLAG_CLOSED
+ issue cancellation for all outstanding IRPs
- if WaitForInFlightOperations
+ if in-flight ops, return STATUS_PENDING (or block if sync)
+ wait until in-flight ops is 0
[NOTE: common completion processing woould check for FILE_OBJECT_FLAG_CLOSED
and resume any waiting file shutdown operation.]
NTSTATUS
IoCloseFile(
IN OUT IO_FILE_HANDLE FileHandle,
IN OPTIONAL PIO_ASYNC_CONTROL_BLOCK AsyncControlBlock
);
- IoShutdownFile(..., TRUE)
- send IRP_TYPE_CLOSE to driver
4.3 Cancellation Details
------------------------
IoIrpSetCancelCallback()
- acquire LOCK
- set cancel callback
- release LOCK
IoCancelAsyncContext()
- acquire LOCK
- if IRP_FLAG_CANCEL and IRP_FLAG_COMPLETE not set
+ set IPR_FLAG_CANCEL
+ call cancel callback, if any
- release LOCK
[NOTE: CANCEL CALLBACK MUST BE QUICK.]
4.4 Completion Details
----------------------
The details may change, but the basic idea is:
struct _IO_IRP_COMPLETION_CONTEXT {
union {
struct {
IN OPTIONAL PIO_ASYNC_CALLBACK Callback;
IN OPTIONAL PVOID CallbackContext;
} Async;
struct {
IN OPTIONAL Condition;
} Sync;
};
BOOLEAN IsAsync;
} IO_IRP_COMPLETION_CONTEXT, *PIO_IRP_COMPLETION_CONTEXT;
IoXxxFile()
- irp = IopIrpCreate()
- if AsyncControlBlock
+ irp->IsAsync = TRUE
+ irp->Async.Callback = AsyncControlBlock->Callback;
+ irp->Async.CallbackContext = AsyncControlBlock->CallbackContext;
+ Reference(irp)
- else
+ irp->IsAsync = FALSE
+ irp->Sync.Condition = pointer to stack condition variable
- IopDeviceCallDriver(..., irp)
- if sync
+ if STATUS_PENDING
- wait on context->Sync.Condition
+ *IoStatusBlock = Irp.IoStatusBlock
+ Dereference(irp)
- if async
+ if STATUS_PENDING
- set AsyncControlBlock->AsynContext = irp
+ else
- NOTE: *** FOR NOW, SYNC COMPLETION MEANS NO CALL TO IoIrpComplete() ***
- NOTE: *** Callback not called since returning complete to caller ***
- *IoStatusBlock = Irp.IoStatusBlock
- set AsyncControlBlock->AsyncContext = NULL
- Dereference(irp)
IoIrpComplete()
- acquire LOCK
- set IRP_FLAG_COMPLETE
- release LOCK
- if Irp->IsAsync
+ call Irp->Async.Callback(Irp->Async.CallbackContext, Irp, &Irp->IoStatusBlock)
+ Dereferece(Irp)
- else
+ signal Irp->Sync.Condition
====================
IRP NOTES wrt async calls into IoXxxFile() APIs:
All arguments are copied, except:
IoCreateFile():
- SecurityContext - ref-counted
- EA - TBD
- SD - TBD
- QOS - TBD
- ECP - caller must not free until complete
IoCloseFile():
- N/A
Io{Read,Write}File():
- Buffer - caller must not free until complete
Io{DeviceIo,Fs}ControlFile():
- InputBuffer - caller must not free until complete
- OutputBuffer - caller must not free until complete
IoFlushBuffersFile():
- N/A
Io{Query,Set}InformationFile Params:
- FileInformation - caller must not free until complete
IoQueryDirectoryFile():
- FileInformation - caller must not free until complete
Io{Query,Set}VolumeInformationFile():
- FsInformation - caller must not free until complete
Io{Lock,Unlock}File():
- N/A
Io{Query,Set}QuotaInformationFile():
- TBD
Io{Query,Set}SecurityFile():
- SecurityDescriptor - caller must not free until complete
NOTE: In general, IoXxxFile() should not copy buffer. In SRV case, this will
cause us to be faster. In lwmsg case, lwmsg will already preserve the
buffers for the duration of the call.
===========================
IopIrpCreate(FileObject) => irp
- irp->RefCount = 1
- irp->FileObject = FileObject
- REF(FileObject)
- return irp
IoCreateFile()
- irp = IopIrpCreate(file)
IopIrpDispatch(Irp, AsyncControlBlock)
- if AsyncControlBlock
+ irp->IsAsync = TRUE
+ irp->Async.Callback = AsyncControlBlock->Callback;
+ irp->Async.CallbackContext = AsyncControlBlock->CallbackContext;
+ REF(irp)
- else
+ irp->IsAsync = FALSE
+ irp->Sync.Condition = pointer to stack condition variable
- IopDeviceCallDriver(..., irp)
- if sync
+ if STATUS_PENDING
- wait on context->Sync.Condition
+ *IoStatusBlock = Irp.IoStatusBlock
+ Dereference(irp)
- if async
+ if STATUS_PENDING
- set AsyncControlBlock->AsynContext = irp
+ else
- NOTE: *** FOR NOW, SYNC COMPLETION MEANS NO CALL TO IoIrpComplete() ***
- NOTE: *** Callback not called since returning complete to caller ***
- *IoStatusBlock = Irp.IoStatusBlock
- set AsyncControlBlock->AsyncContext = NULL
- Dereference(irp)
IoCreateFile(AsyncControlBlock) => fileOject, ioSb, AsyncControlBlock->AsyncContext
- irp = IopIrpCreate(fileObject)
- fileObject = NULL
- (set various irp parameters)
- status = IopIrpDispatch(irp, AsyncControlBlock)
- if !STATUS_PENDING
+ ioSb = irp->IoSb
+ fileObject = irp->FileObject
- else do not touch ioSb or fileObject
- return fileObjbect, ioSb, AsyncControlBlock.AsyncContext
IoXxxFile(FileAsyncControlBlock)
- irp = IopIrpCreate(fileObject)
- fileObject = NULL
- (set various irp parameters)
- status = IopIrpDispatch(irp, AsyncControlBlock)
- if !STATUS_PENDING
+ ioSb = irp->IoSb
+ fileObject = irp->FileObject
- return fileOejbect, ioSb, status