If you follow me on Twitter, you’re probably aware that I’ve been working on a bug in the Microsoft App-V 4.5 Client for the last few weeks. This particular bug has been occurring randomly at a client site of mine. Users who have a particular App-V application will sporadically receive the following error message when trying to start the application:
The text of the error message is “This application has failed to start because the application configuration is incorrect. Reinstalling the application may fix the problem. Error code: 4505CD-1F702639-000036B1”
At the same time as this launch failure, the following entry is logged in the System Event Log:
The details of this event log entry are as follows:
Background on the error:
This issue would randomly occur with the Office 2007 SP1 App-V package, but only very rarely. However, we had one sequenced application (BMC Control-M) where it would occur in around 1 in 5 launches. At first we suspected some kind of software conflict. When you’re in an environment with 2000+ applications across 20k desktops, it’s not unheard of for some broken package to be overwriting key DLLs, etc. This suspicion was raised because the launch failures were not occurring for all users of the application. More on why later. Anyway, we began with the typical things like re-installing the .NET Framework and re-installing the VC++ 2005 SP1 runtime, and while we had limited success after doing so, the problem was still there. After messing around re-installing a few applications, we decided to strip our desktop build down to the absolute minimum and try to repro the issue. Even with the build at the very basic OS components, we could still reproduce it. I decided to try an OS build straight from media to avoid any kind of customer OS modifications. To my delight, the problem did not recur on my fresh OS build from media. We later discovered that this had more to do with the system being a VM than with it being a fresh OS install.
On to the problem discovery:
One of the guys I work with at this client site (we’ll call him Bob) had an ancient laptop that was already lifecycled off the books, but he still had possession of it. When Bob ran the Control-M package on his ancient laptop, he couldn’t reproduce the issue even once. When Bob informed me of this, we both started thinking “Is it because this machine is slower and therefore the client is taking longer seeking the hard drive and preventing the problem from occurring? Or is it because this system has a single CPU whereas everything else is running at least two CPUs due to Hyperthreading or Dual Core?”
Let’s test the multiple CPU theory:
The first test for the multiple CPU condition was an easy one: simply add a second CPU to my VM that was consistently working and see what happens. I did just that and, voila, the problem began occurring on my VM (though not as frequently as on the physical desktop hardware, so system speed appears to have something to do with it too).
The second test was to take one of our dual processor systems (in this case a hyperthreading machine, not a true dual core) and alter the boot.ini to include the /onecpu switch, which forces Windows to ignore the second logical processor. To our excitement, this system began working 100% of the time despite having failed regularly before.
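When running a test like this, it is worth confirming that Windows really is exposing only one logical processor after the reboot. A minimal sketch of such a check follows, assuming Python is available on the test machine. The function names are my own for illustration; NUMBER_OF_PROCESSORS is the standard Windows environment variable, with os.cpu_count() as a fallback.

```python
import os


def logical_cpu_count(environ=None):
    """Return the number of logical processors the OS reports.

    Prefers the Windows NUMBER_OF_PROCESSORS environment variable and
    falls back to os.cpu_count() when it is missing or malformed.
    """
    environ = os.environ if environ is None else environ
    value = environ.get("NUMBER_OF_PROCESSORS", "")
    if value.isdigit() and int(value) > 0:
        return int(value)
    # os.cpu_count() may return None if the count cannot be determined.
    return os.cpu_count() or 1


def is_single_cpu(environ=None):
    """True when the OS reports exactly one logical processor."""
    return logical_cpu_count(environ) == 1


if __name__ == "__main__":
    print("Logical processors:", logical_cpu_count())
    print("Single CPU mode:", is_single_cpu())
```

Passing a dictionary instead of the real environment makes the check easy to try without rebooting anything.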
Now we’ve proven it, now what?
Before giving up and calling Microsoft, I wanted to ensure this wasn’t already fixed in CU1 or any post-CU1 hotfix rollup; otherwise that would be a wasted Premier Support incident. I downloaded and installed CU1 and the July hotfix rollup. No difference in error frequency.
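With an intermittent failure, "no difference in error frequency" only means something if you launch enough times. As a rough sketch of the arithmetic, assuming each launch fails independently at the roughly 1 in 5 rate seen with the Control-M package (the independence is my assumption, and the function names are made up for illustration):

```python
import math


def chance_of_no_failures(failure_rate, launches):
    """Probability of seeing zero failures in `launches` attempts
    if the underlying failure rate has NOT changed."""
    return (1.0 - failure_rate) ** launches


def launches_needed(failure_rate, max_chance):
    """Smallest number of clean launches after which a run of pure luck
    is less likely than `max_chance`."""
    return math.ceil(math.log(max_chance) / math.log(1.0 - failure_rate))


if __name__ == "__main__":
    rate = 0.2  # roughly 1 in 5 launches
    print(chance_of_no_failures(rate, 10))
    print(launches_needed(rate, 0.01))
```

Under those assumptions, ten clean launches in a row would still happen by luck about one time in ten, so a few dozen launches per configuration is a more convincing test.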
On to Microsoft support:
Having confirmed this wasn’t something already fixed, we opened an incident with MS Premier Support. We provided all the details on how to reproduce the issue and even sent our problem package off for testing at Microsoft. They were able to repro the issue in their labs. After about a week of back and forth and the issue going up through escalation, Microsoft confirmed the existence of race condition bugs in three different places in the App-V File System and said they would be working on a hotfix.
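For anyone unfamiliar with the term, a race condition is a bug whose outcome depends on how two threads happen to interleave. The sketch below is a generic check-then-act race and has nothing to do with the actual App-V code, which Microsoft did not share. The class and function names are invented, and a barrier is used to force the bad interleaving so the result is repeatable.

```python
import threading
import time


class LazyResource:
    """Loads something on first use; unsafe unless use_lock is True."""

    def __init__(self, use_lock, pause):
        self.load_log = []       # one entry per load performed
        self.loaded = False
        self._use_lock = use_lock
        self._pause = pause      # hook called between the check and the act
        self._lock = threading.Lock()

    def _check_then_load(self):
        if not self.loaded:                 # check
            self._pause()                   # window where another thread can slip in
            self.load_log.append("loaded")  # act
            self.loaded = True

    def ensure_loaded(self):
        if self._use_lock:
            with self._lock:                # check and act happen as one unit
                self._check_then_load()
        else:
            self._check_then_load()


def run_two_threads(resource):
    """Call ensure_loaded() from two threads; return how many loads ran."""
    threads = [threading.Thread(target=resource.ensure_loaded) for _ in range(2)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return len(resource.load_log)


def demo():
    # The barrier holds each thread after its check until both have checked.
    barrier = threading.Barrier(2)
    racy = LazyResource(use_lock=False, pause=barrier.wait)
    safe = LazyResource(use_lock=True, pause=lambda: time.sleep(0.01))
    return run_two_threads(racy), run_two_threads(safe)


if __name__ == "__main__":
    print(demo())
```

In the unlocked version both threads pass the check before either one acts, so the load runs twice; with the lock it runs once. On real systems the window is usually tiny, which would fit with failures that come and go depending on CPU count and machine speed.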
And the fix….
Microsoft created a fix for the three race condition bugs, and they will be including it in the September 2009 Hotfix Rollup Pack, which is currently assigned KB974278. This KB is not currently public, but I would expect it to go public in a few weeks. If you desperately need this fix before then, you should contact Microsoft Support to obtain it, as I will not hand out any non-public hotfixes.