Why can’t we have technology that would allow TS/VDI sessions to suspend/resume on a different host?

On April 1st, 2008, Brian Madden posted an April Fool’s Day joke in the form of a blog entry describing how a new Australian firm had developed technology that would allow you to suspend a session running on a Terminal Server and take it offline in the form of VDI on a laptop.  While some people initially commented that they wanted more info about this, it became clear in the comments that people eventually realized that this technology doesn’t exist.

Brian later confirmed that it was truly a joke, but then questioned why some of this is not possible.  What I’m going to do in this blog entry is discuss the reasons why this technology isn’t possible.

Suspend TS session and resume on a desktop VM (VDI):

This one is so ridiculous I’m barely going to comment.  To begin with, you’ve got a 2003 Server operating system trying to suspend something and resume it on a Windows XP VM.  This really isn’t possible no matter what you do.  There are just too many differences between the operating systems to even think something like this would ever be possible.  Now that we’ve stated that this isn’t possible, let’s talk about some other scenarios.

Suspend TS session and resume on another TS server:

This is the utopia for a Citrix Server admin.  While things like VMware allow you to move a running Terminal Server from one ESX host to another, all that allows you to do is perform server hardware maintenance.  It doesn’t allow you to do any software maintenance on the Terminal Server, since the same OS instance is still running, just on different hardware.  Well, what if it was possible to suspend all the processes that everyone is using on a Terminal Server and resume those processes on another Terminal Server?  Is that possible?  No.  Here’s why:

What’s being described here, taking a running operating system with applications, suspending it, and then resuming it elsewhere (be it on a VM or on a Terminal Server), is simply not possible because of the tenets of several different protocols such as TCP/IP, SMB/CIFS, database protocols, etc.  You see, the issue doesn’t stem from whether or not you can suspend an operating system and resume it elsewhere, because that’s doable.  The issue is that you can’t suspend stateful applications on one device and resume them on a completely different device without losing the state of those applications (when I refer to state, I’m talking specifically about network operations).

Well, some of you are probably saying “You’re an idiot Shawn, VMware does this now”.  Well, no, they don’t.  What VMware does is suspend a guest VM, ship the memory delta to the other ESX host, then resume the EXACT same guest VM on different host hardware.  To the other systems it has active sockets open to, the guest operating system appears to have the exact same hostname, IP and MAC address.  Compare this to suspending running applications on a Citrix server named Citrix01, and then trying to move those applications and their network connections to a system named Citrix02.  Any stateful application would break, because the socket would not exist on the new system Citrix02 and the remote end would still be expecting traffic from Citrix01.

Now, if the applications that you’re using are stateless (like Internet Explorer browsing over plain HTTP), then when you resume the IE process on the second host, it would simply retry the HTTP request and you’d be good.  Now let’s talk about someone who has an MS Word document open off a network share.  When Word is suspended and resumed on another host, the Windows file server hosting the open file would have absolutely no idea why Citrix02 was trying to access a file that was previously opened by Citrix01, and your file operation would fail.  No retry is possible.
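To make the Citrix01 versus Citrix02 point concrete, here is a minimal toy model in Python.  It is not how SMB or any real file server is implemented; the `ToyFileServer` class, the IP addresses, the port number and the share path are all made up for illustration.  The only idea it borrows from the real world is that the server ties open-file state to the identity of the connection that opened the file.

```python
# Toy model: a file server that keys open-file state by the client's
# connection identity (IP and port), roughly the way a TCP-based protocol
# ties state to one specific socket. Every name here is hypothetical.

class ToyFileServer:
    def __init__(self):
        # (client_ip, client_port) -> path of the file that connection has open
        self.open_files = {}

    def open(self, client_ip, client_port, path):
        self.open_files[(client_ip, client_port)] = path

    def write(self, client_ip, client_port, data):
        key = (client_ip, client_port)
        if key not in self.open_files:
            # The server has never seen this connection, so it cannot
            # honor a file handle that was issued to a different one.
            raise ConnectionError(f"no open file for {client_ip}:{client_port}")
        return f"wrote {len(data)} bytes to {self.open_files[key]}"


server = ToyFileServer()

# Word on Citrix01 (assumed to be 10.0.0.1) opens a document on a share.
server.open("10.0.0.1", 49152, r"\\fileserver\docs\report.doc")
before_move = server.write("10.0.0.1", 49152, b"hello")

# VMotion-style move: same guest, same IP, same socket, so state survives.
after_vmotion = server.write("10.0.0.1", 49152, b"hello again")

# Process-level move to Citrix02 (assumed 10.0.0.2): the server sees a stranger.
try:
    server.write("10.0.0.2", 49152, b"hello from Citrix02")
    after_process_move = "ok"
except ConnectionError as exc:
    after_process_move = f"failed: {exc}"

print(before_move)
print(after_vmotion)
print(after_process_move)
```

The first two writes succeed because the server still recognizes the connection; the third fails because nothing about Citrix02 matches the state the server is holding.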

So is there any way around this networking issue?

If we ever want to see this type of session suspend/resume operation, what we really need is to have external I/O operations to databases, file servers, etc. be proxied through a relay device (not unlike what SSL Relay or CSG does).  In this manner, you could create a service similar to Citrix’s eXtended Transformation Engine (or XTE) to tunnel the user’s I/O operations to the remote file servers and databases, and then maintain session state with the Terminal Server as the requesting client.  When you want to suspend/resume the TS session, you’d need to notify the app proxy that you’re disconnecting your session, but that it should keep your sockets open.  Then when you resume the application’s memory state on the new Citrix server, you’d simply reattach to the existing app proxy stream and continue your file/database operations unaffected.  This is still immensely complicated and would also introduce a single point of failure, so you’d need a load balancing device with sticky sessions to direct the connection back to the same proxy device if the connection was severed.  Even then, a failure of one of the app proxy devices would mean failure of all of the user’s running applications, not unlike a Terminal Server BSOD’ing.  The result is the same: the user would lose all of their running applications and files.
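Here is a rough sketch of that detach/reattach idea in Python, just to show the shape of it.  Nothing like this `ToyAppProxy` exists as far as I know; the class, the session token and the operation strings are all hypothetical, and a real relay would be holding actual sockets to the file server rather than a list.

```python
import uuid


class ToyAppProxy:
    """Hypothetical relay that owns the backend connection on behalf of a session."""

    def __init__(self):
        self.backend_state = {}  # token -> operations already sent to the backend
        self.attached = {}       # token -> Terminal Server currently attached

    def attach(self, server_name, token=None):
        if token is None:
            # New session: the proxy opens its own connection to the backend.
            token = uuid.uuid4().hex
            self.backend_state[token] = []
        elif token not in self.backend_state:
            raise KeyError("unknown session token")
        self.attached[token] = server_name
        return token

    def detach(self, token):
        # Only the Terminal Server side goes away; backend_state is kept,
        # standing in for the proxy keeping its backend sockets open.
        self.attached.pop(token, None)

    def send(self, server_name, token, operation):
        if self.attached.get(token) != server_name:
            raise PermissionError(f"{server_name} is not attached to this session")
        self.backend_state[token].append(operation)
        return len(self.backend_state[token])


proxy = ToyAppProxy()
token = proxy.attach("Citrix01")
proxy.send("Citrix01", token, "open report.doc")

proxy.detach(token)                # suspend the session on Citrix01
proxy.attach("Citrix02", token)    # resume it on Citrix02 with the same token
count = proxy.send("Citrix02", token, "write report.doc")

print(count, proxy.backend_state[token])
```

From the backend’s point of view there is one continuous stream of operations, because the proxy never changed identity.  It also shows the downside: everything lives in `backend_state`, so if the proxy dies, every session it was holding dies with it.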

Aside from the network socket issues, there’s one more looming issue that would prevent process suspend/resume:

One word: memory.

You see, suspending a process on one system, transporting those bits of memory to another system and resuming the process isn’t a big deal.  However, expecting the application to work is a big deal.  The reason for this is that no two Windows systems (whether they are 2003 Terminal Servers or WinXP desktops) will have the exact same memory map.  Sure, there may be lots of system DLLs that are statically bound to a specific base memory address.  However, there are many DLLs that simply make a call to the operating system to allocate a block of memory and get loaded into it.  So what if you’ve got references inside your running application that point at other DLLs in memory, and those DLLs aren’t at the same base addresses on the other system?  Well, blamo is what happens.  Best case scenario, your application crashes and you’ll need to relaunch it.  Of course, that wasn’t our intent in suspending and resuming the process.  We could have easily just closed it and reopened it on another system.  The worst case scenario is some type of buffer overflow, or your transported application writing over some other application’s memory space.  Now you’ve got big problems that may even result in a kernel trap or BSOD.  This just isn’t going to work.  Going back to how VMware VMotion does this, you can clearly see there isn’t a problem.  VMotion duplicates the contents of system RAM, so when the system is restored everything within the memory space is in the exact same location.  No problems there.
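The base address problem is easy to show with a toy model.  The Python below is only an illustration: the module names, base addresses, offsets and the `vendor.dll` callback are invented, and real Windows loaders are far more involved.  It just treats a saved pointer as "module base plus offset" and checks whether that sum still lands in the right place on the destination host.

```python
# Toy model of a suspended process: it holds absolute addresses computed
# from where each DLL happened to be loaded on the original host.
# Module names, base addresses and offsets are made up for illustration.

def capture_pointers(module_bases, wanted):
    """Resolve (module, offset) pairs to absolute addresses, as a running process would hold them."""
    return {
        name: module_bases[module] + offset
        for name, (module, offset) in wanted.items()
    }


def stale_pointers(pointers, module_bases, wanted):
    """Return the names of saved pointers that no longer land on their target on this host."""
    return sorted(
        name
        for name, (module, offset) in wanted.items()
        if pointers[name] != module_bases[module] + offset
    )


# Where each DLL ended up on the two hosts (hypothetical values).
host_a = {"kernel32.dll": 0x7C800000, "vendor.dll": 0x10000000}
host_b = {"kernel32.dll": 0x7C800000, "vendor.dll": 0x02340000}

# Things the running application holds pointers to.
wanted = {
    "CreateFileW": ("kernel32.dll", 0x1A2B),
    "vendor_callback": ("vendor.dll", 0x0400),
}

saved = capture_pointers(host_a, wanted)   # taken at suspend time on host A

stale_on_a = stale_pointers(saved, host_a, wanted)
stale_on_b = stale_pointers(saved, host_b, wanted)

print(stale_on_a)
print(stale_on_b)
```

On host A every saved pointer is still valid.  On host B the pointer into the system DLL survives because that DLL sits at the same base, but the pointer into `vendor.dll` now points at whatever happens to live at the old address, which is exactly the crash (or worse) described above.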

Suspend VDI session and resume on a desktop VM (VDI):

If this approach was strictly the same approach that VMware uses for VMotion (in other words, the user would still be running on a VM with the exact same name, IP, and MAC as the other VM), then this is certainly possible.  If, on the other hand, you’re implying that you’d suspend processes and resume them on a different guest system, then this suffers from the exact same problems that the TS-to-TS suspend/resume has and wouldn’t be possible.  VMware announced at VMworld Europe 2008 that they are working on an offline VDI scenario.  My guess is they are planning to use a system exactly like VMotion and/or the equivalent of suspending a VM in the VMware Workstation products and shipping the deltas to the client workstation where the VM would be resumed.  Of course, the question that begs to be asked is “What applications were you running in the VDI session and what will happen to them when you resume the VM?”  If the applications were stateful, they’ll likely crash when you resume some number of hours later, even if you have a VPN tunnel established remotely.  The reason why things like cluster technology, VMotion, etc. work well is that the disruption is momentary.  The reason why disconnected sessions on Citrix work over long interruptions is that the application is left running on the TS while the user is disconnected from the TS session (this is exactly like the app proxy I talked about earlier).
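The difference between a momentary disruption and a resume hours later can be sketched with a simulated clock.  This is a toy in Python under my own assumptions: the `ToyBackend` class, the 300 second idle timeout and the session name are all hypothetical, and real servers each have their own timeout rules.

```python
class ToyBackend:
    """Hypothetical server that drops sessions idle for longer than idle_timeout seconds."""

    def __init__(self, idle_timeout):
        self.idle_timeout = idle_timeout
        self.last_seen = {}  # session_id -> time of the last request

    def request(self, session_id, now):
        last = self.last_seen.get(session_id)
        if last is not None and now - last > self.idle_timeout:
            # The server gave up on this session while the client was away.
            del self.last_seen[session_id]
            raise TimeoutError("session expired")
        self.last_seen[session_id] = now
        return "ok"


backend = ToyBackend(idle_timeout=300)  # assumed 5 minute idle timeout

backend.request("user1", now=0)

# VMotion-style move: the guest is unreachable for a couple of seconds.
short_pause = backend.request("user1", now=2)

# Offline VDI: the VM is suspended, carried home and resumed 4 hours later.
try:
    long_pause = backend.request("user1", now=2 + 4 * 3600)
except TimeoutError as exc:
    long_pause = str(exc)

print(short_pause)
print(long_pause)
```

The VM itself resumes perfectly in both cases.  What differs is whether the other end of the conversation is still willing to talk to it.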

That’s all the random thoughts I have on this topic now.  Feel free to post some comments and we’ll continue the discussion.  Or maybe I’ll write a part 2 later 

Agree? Disagree? Let me know with a comment...