I was dropped into an SCCM environment that wasn’t going anywhere. All SCCM services were stopped, the SCCM database file was 100+GB in size and contained about 1.3 million files (of which about 90% were corrupted, invalid or otherwise crippled) and the tempdb was 20GB in size. The table sizes can be checked as follows :
- Open SQL Management Studio
- Right-click the database and open “Reports”
- Open “Standard Reports”
- Open “Disk usage by top tables”
All package files had been moved from the SCCM-known location to another disk to gain disk space, and even so the disk filled up again. This once was a glorious tiered hierarchy with one central site server and two primary site servers. However, now it’s dead. One of the primary site servers got disconnected during the initial site sync and came back online after three days. Two days after that, the virtual server host (ESX in this case) broke down and the hosted SCCM primary site server broke even further, resulting in what I described above.
Starting the SCCM services while the package files (package source folder) were moved from their location would generate a massive amount of errors. However, the files couldn’t be moved back to their original location, because of the lack of disk space. The DB files were on the same disks as the package files.
A short summary of malfunctions :
- No disk space on source folder / db host disks
- Packages moved out of their SCCM-known locations, hence :
- All SCCM services stopped
- SCCM DB file 100GB+ (and countless records)
- Temp DB file 20GB+
- Site hierarchy broken
- 103,000+ replication files queued on the central site (destined for the primary site)
This seems pretty bad and well, it is! However, let’s prioritize and generalize things to get a clearer view :
- File structure
- Disk space
- Servicing
- Queue file processing
- Unlinking hierarchy
- Restoring hierarchy
File structure
The package files should be moved back, otherwise we’re just creating errors rather than solving them. To do that we first have to get a clear view of what’s consuming the disk space. Like I said, the SCCM DB and temp DB are the showstoppers in this scenario. Both DBs had already been shrunk in SQL, but that freed a mere 1.7GB, so that’s no good. Moving the SCCM DB is not an option, since we just don’t have the disk capacity to do so in a sensible way. Moving the temp DB, however, is a good option that gains us ~20GB. So how do you move a temp DB? It’s pretty easy :
- Open SQL Management Studio
- Open a new query window
- Paste the following code :
USE master;
GO
ALTER DATABASE tempdb
MODIFY FILE (NAME = tempdev, FILENAME = '{new file location}\tempdb.mdf');
GO
ALTER DATABASE tempdb
MODIFY FILE (NAME = templog, FILENAME = '{new file location}\templog.ldf');
GO
Replace {new file location} with the full path of the new temp DB location.
- Execute the query; you shouldn’t get any errors. If everything went fine, close the query window (leaving it open resulted in an error for me during the next steps)
- Open services.msc and restart “SQL Server (MSSQLSERVER)”
- Return to the SQL Management Studio and open a new query window, this time paste the following code :
SELECT name, physical_name
FROM sys.master_files
WHERE database_id = DB_ID('tempdb');
- If everything went fine, you should see two rows with the new location for the temp DB files
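The placeholder substitution in the query above is easy to get wrong by hand, so here is a minimal Python sketch that fills it in. The function name and the example path `E:\SQLData` are made up for illustration; the logical file names `tempdev` and `templog` are the ones used in the query above. It only builds the text, you still paste and run it in SQL Management Studio yourself.

```python
import ntpath


def tempdb_move_statements(new_dir):
    """Build the ALTER DATABASE statements that point tempdb at new_dir."""
    files = [("tempdev", "tempdb.mdf"), ("templog", "templog.ldf")]
    lines = ["USE master;", "GO"]
    for logical_name, file_name in files:
        # ntpath applies Windows path rules no matter where the script runs.
        full_path = ntpath.join(new_dir, file_name)
        # Double single quotes so the path stays a valid T-SQL string literal.
        escaped = full_path.replace("'", "''")
        lines.append("ALTER DATABASE tempdb")
        lines.append(
            f"MODIFY FILE (NAME = {logical_name}, FILENAME = '{escaped}');"
        )
        lines.append("GO")
    return "\n".join(lines)
```

Calling `print(tempdb_move_statements("E:\\SQLData"))` prints the full script with both file paths filled in.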
At this point we had gained 20GB, so the package files that had been moved away to save disk space (~15GB) could go back to their original location. To complete the entire procedure (which involves running a tool that removes site-specific sync files) we also have to restore all queue files. These files go back into replmgr.box on the primary server. In our case an interrupted replication and multiple server crashes spoiled the fun, so we had 103,000+ queue files to put back, which of course took ages! If you don’t move them back, the cleanup tool will leave the matching DB rows untouched.
Disk space
Next, you will want to reclaim disk space. This can be done either by shrinking the DB files in SQL Management Studio or, if it’s a virtual disk, by increasing the disk size. This step is optional, but strongly recommended. Undersized disks are always a problem, and prevention is better than cure.
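Before moving anything back, it helps to confirm that the target disk really has the room. The following is a small sketch in Python using only the standard library; the function name and the `reserve_bytes` safety margin are my own additions, not part of any SCCM tooling.

```python
import shutil


def has_room(path, needed_bytes, reserve_bytes=0):
    """Return (fits, free_bytes) for the volume that holds `path`.

    `fits` is True when the free space minus the reserve covers
    `needed_bytes`.
    """
    usage = shutil.disk_usage(path)
    fits = usage.free - reserve_bytes >= needed_bytes
    return fits, usage.free
```

For example, `has_room("D:\\", 15 * 1024**3, reserve_bytes=5 * 1024**3)` would tell you whether ~15GB of package files fit on a (hypothetical) D: drive while keeping 5GB spare.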
Servicing
After the package files and queue files have been restored to their original locations, it’s time to start the SCCM services again. If you don’t know the exact server names, this is the point to quickly launch SCCM and check. By starting the services, I mean the following services :
- SMS_EXECUTIVE
- SMS_SITE_COMPONENT_MANAGER
Quickly check the Windows event log if you feel you need to, but even in a situation like mine it’s unlikely that these services won’t start.
Queue file processing
Next up, open a command prompt on both the central and the primary site server that aren’t communicating properly and run the following command :
preinst.exe /deljob <targetsite>
Replace <targetsite> with the site that the site you’re currently on should send files to. From the primary, the target site is the central site; from the central, it is the primary site.
In my case there were no jobs targeting either site, so the files had to be disposed of (or moved out of the cycle). I decided to move the files to another folder (way outside the inboxes folder) and leave them there for the time being (knowing that moving them back would never be an option… but still, I don’t know, I just felt safer for some reason).
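With tens of thousands of small files, a plain drag-and-drop gives you no idea how far along you are. Below is a minimal Python sketch that moves the files in batches and reports a running total. It assumes a flat folder (subfolders are skipped), and the function name and batch size are illustrative choices of mine.

```python
import os
import shutil


def move_in_batches(src_dir, dst_dir, batch_size=1000):
    """Move regular files from src_dir to dst_dir in batches.

    Yields the running total of moved files after each batch.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    os.makedirs(dst_dir, exist_ok=True)
    # Take a snapshot of the names first, so the folder is not
    # modified while it is still being listed.
    with os.scandir(src_dir) as entries:
        names = sorted(
            e.name for e in entries if e.is_file(follow_symlinks=False)
        )
    moved = 0
    for start in range(0, len(names), batch_size):
        for name in names[start:start + batch_size]:
            shutil.move(os.path.join(src_dir, name),
                        os.path.join(dst_dir, name))
            moved += 1
        yield moved
```

Because it is a generator, you can loop over it and print each total as it arrives, or stop halfway without leaving a batch half-done.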
Unlinking hierarchy (and cleaning the child)
Once the queue files were safe (yes, this took over 2 hours… 850MB transferring at 110kb/sec), we broke the SCCM hierarchy: we removed the central site as the parent site of the primary and watched the magic happen…
We left it this way from Friday 3PM till Monday 8AM because SCCM’s brain isn’t very quick to realise that something has changed. In this specific case we wanted to make absolutely sure SCCM did realise it, in each and every cell of its brain, so it needed time, and a lot of it too!
Keep in mind that unlinking an SCCM server from its hierarchy triggers a database cleanup in which all references to its former parent are removed. Unfortunately, that is only the theory of how it should work. As of today, we have decided it is best to reinstall the site server. Why? Well, there was no cleanup process, and orphaned references didn’t get removed at all. We could manually remove some of the elements through the admin console, but trying to delete many of the software update elements generated errors such as “Trying to retrieve object CI_ID xxxxx”, followed by “CI_ID xxxxx not found in database”. Again, note that we didn’t change or delete anything in the database at all.
Another problem we faced is a SQL-related symptom. I call it a symptom because I know neither its cause nor its effect. Trying to open the CI_SDMPackages table in SQL Management Studio showed a row count of 250,000, and one might wonder why this table is so big. But the real question is why another table, CI_ConfigurationItems, which contains over 1.2 million rows, loads within 15 seconds, while CI_SDMPackages fails to load and returns only 3,400 rows. There’s a way of checking the database for consistency errors and other kinds of corruption :
- Open SQL Management Studio
- Open the SCCM database
- Open a new query window
- Type : DBCC CHECKDB
- Wait about 15 mins (in our case) and read the full report generated in the output window
In our case DBCC did return a full report, but unfortunately it showed 0 errors, 0 inconsistencies and 0 orphaned records.
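The full report is long, and the line that matters is the summary at the end. If you save the output to a text file, a few lines of Python can pull out the totals. This is a sketch that assumes the summary line reads “CHECKDB found N allocation errors and M consistency errors in database 'name'”; check the wording against your own SQL Server version. The function name and the database name in the example are made up.

```python
import re

# Assumed wording of the DBCC CHECKDB summary line; verify it against
# the output of your own SQL Server version.
SUMMARY = re.compile(
    r"CHECKDB found (\d+) allocation errors and (\d+) consistency errors "
    r"in database '([^']+)'"
)


def checkdb_summary(report_text):
    """Return the totals from a DBCC CHECKDB report, or None if absent."""
    match = SUMMARY.search(report_text)
    if match is None:
        return None
    return {
        "database": match.group(3),
        "allocation_errors": int(match.group(1)),
        "consistency_errors": int(match.group(2)),
    }
```

A result of `None` means the summary line was not found, which is worth treating as “report incomplete” rather than “database healthy”.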
Restoring hierarchy
Observing the SCCM site status and component status views shows that the servers are recovering and generating fewer and fewer random error messages.
Time passed, I had a wonderful weekend, and when Monday morning came around again I had high hopes of restoring the site hierarchy. I actually felt I was going to do something good that day, making the local admin people happy again!
After the hierarchy had been restored within SCCM, it still hadn’t gotten any faster at realizing what had just happened, so we again gave SCCM time to recover from its illness. This time we decided not to force a full site sync (which is done by placing a <sitecode>.SHA file in the central site’s inboxes\objmgr.box folder) and to let nature take its course. The day we restored the hierarchy, things were replicating slowly, but at least they were replicating. The next day the server was unresponsive, and looking at the processes (remotely) showed sqlservr.exe consuming CPU on the child site. The server was either synchronizing or performing SCCM jobs. My admin senses tingled, so I navigated to configmgr\inboxes\replmgr.box\incoming, which yet again held 23,600 files. However, these disappeared on their own, which means SCCM was processing them. The more files disappeared from the “incoming” folder, the more elements appeared on the child site, until ultimately the site was fully synced.
At this stage we decided to update the child site’s distribution points. Package elements had been replicated, but the site’s distribution points hadn’t been updated. To prevent flooding the server, we chose to copy small batches at a time. This can be achieved with the following short process :
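Instead of refreshing the “incoming” folder by hand, you can let a small script watch the backlog shrink. The following is a minimal Python sketch; the function names, the polling interval and the poll limit are my own choices, and it only counts files, it never touches them.

```python
import os
import time


def count_files(folder):
    """Count the regular files directly inside `folder`."""
    with os.scandir(folder) as entries:
        return sum(1 for e in entries if e.is_file(follow_symlinks=False))


def watch_backlog(folder, interval_s=60, max_polls=10, sleep=time.sleep):
    """Poll `folder` and return the list of observed file counts.

    Stops early as soon as the folder is empty.
    """
    counts = []
    for poll in range(max_polls):
        counts.append(count_files(folder))
        print(f"{time.strftime('%H:%M:%S')} {counts[-1]} files waiting")
        if counts[-1] == 0:
            break
        if poll < max_polls - 1:
            sleep(interval_s)
    return counts
```

A steadily falling count means SCCM is working through the queue; a count that stays flat for several polls is the signal to go and look at the component status.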
- Navigate to packages on the site from which you want to copy the packages
- Right-click “Packages”, then click “Copy Packages”; a wizard will start
- Select the appropriate distribution point share that will receive the packages
- Place a checkmark in front of the container / package(s) that should be copied
- Finish the wizard
Now, to monitor this process, go to “Microsoft Configuration Manager\inboxes\despoolr.box\receive” and check the incoming files. This is the download folder into which new packages are received. Once file downloads are completed, they’re moved to SMSPKG and the distribution point share.
Last but not least, it’s time to check reporting from the child site to the central site. An easy and obvious way of testing this is to create a software update deployment on the central site that also targets child site clients. Enforce these updates (on a small scale, of course) and run a report from the central site. You should be able to see child site clients reporting to the central site. If they do, you know that both the downstream and the upstream connections are alive and kicking.
It’s been a long and strange trip with many ups and downs. I didn’t want to reinstall the child site server at first, but at some point you just have to reconsider your options and choose the best one. I wrote this article to record my progress in trying to revive an SCCM child server and to give an insight into the process and tasks involved. I’m not saying it’s the one and only way to revive it, but it’s how I’ve done it. Any comments are more than welcome, and I hope this article will help anyone who’s facing a similar problem!