Wednesday, November 24, 2010

The poor stepchild

Unfortunately, in the world of SharePoint Foundation 2010 not all file types are created equal. Here is an example.

 

I have saved two identical documents to my SharePoint Foundation 2010 site. One is a Word 2007 document (i.e. docx format); the other is an Adobe Acrobat document (i.e. PDF). Both are pretty standard, right? Well, one is certainly a second-class citizen as far as SharePoint is concerned. Can you guess which one?

 

image_2_60D24BCB

 

Here are the two documents in my document library. You can clearly see which one is the Word document because of the little file icon but, alas, the Acrobat document has no file icon, so you really aren’t sure what type of file it is. This can be rectified with some configuration, but it is not that way out of the box.

 

If I click on the Word document (and I have Word on my system) the document opens. If I click on the Acrobat document (and I have Acrobat on my system) I see:

 

image_4_60D24BCB

 

Ahhh... where’s the option to open the file? Guess what? That’s also not enabled by default.

 

image_6_60D24BCB

 

To enable this you need to go into SharePoint Central Administration. Under Manage Web Applications, select the Web Application and then General Settings (easy, eh?).

 

image_8_60D24BCB

 

Halfway down the list, change Browser File Handling from Strict to Permissive.

 

image_10_60D24BCB

 

Now when you click on a PDF in SharePoint Foundation 2010 it will open in a browser window so you can view it.
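For the curious, here is a rough sketch of what that setting changes, as far as I understand it. In Strict mode the server sends extra response headers with any file whose MIME type is not on an allowed list, and those headers tell the browser to offer Save only. The Python below is a toy model, not SharePoint code: the function names and the allow-list contents are made up for illustration, and only the two HTTP header names are real.

```python
# Toy model only: this is not SharePoint code. The two header names are real
# HTTP response headers; the allow-list is a made-up stand-in for the web
# application's list of MIME types that are allowed to open inline.
ASSUMED_INLINE_MIME_TYPES = {"text/plain", "image/png", "image/jpeg"}


def download_headers(mime_type, handling):
    """Return the response headers this toy model sends for a file download."""
    if handling not in ("Strict", "Permissive"):
        raise ValueError("handling must be 'Strict' or 'Permissive'")
    headers = {"Content-Type": mime_type}
    if handling == "Strict" and mime_type not in ASSUMED_INLINE_MIME_TYPES:
        # Ask the browser to save the file and to hide its Open button.
        headers["Content-Disposition"] = "attachment"
        headers["X-Download-Options"] = "noopen"
    return headers


def can_open_in_browser(mime_type, handling):
    """True when the toy response carries no header blocking Open."""
    return "X-Download-Options" not in download_headers(mime_type, handling)


if __name__ == "__main__":
    for mode in ("Strict", "Permissive"):
        print(mode, can_open_in_browser("application/pdf", mode))
```

The trade-off is that Permissive removes this protection for every file type in the web application, not just PDF, so it is worth knowing what you are switching off.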

 

Note that this document (and the identical Word one) contains the term ‘collaboration’. How about we perform a search for this in SharePoint?

 

image_12_60D24BCB

 

After ensuring that indexing is running and that we have indexed all the SharePoint content, we run a search for the term ‘collaboration’, which we know appears in both the Word and Acrobat documents.

 

We only get one result returned, as shown above, that being the Word document, even though we know the term also appears in the Acrobat document.

 

image_14_60D24BCB

 

Apparently, the ‘solution’ to PDF indexing on SharePoint Foundation 2010 is to install Search Server Express 2010. As you can see from the above, I get exactly the same result: only one document match, and again it is the Word document, not the PDF.

 

image_16_60D24BCB

 

Ah ha, you say, but Search Server Express 2010 doesn’t come configured by default to index PDF documents. You are right, as you can see above from the list of file types that Search Server Express 2010 does index by default.

 

image_18_60D24BCB

 

Luckily, I can configure Search Server Express 2010 to index any file type. So I add PDF as shown above, initiate a full crawl of the data and try to search again.

 

image_14_60D24BCB

 

Again, same result: no PDF matches are returned.

 

Ah ha, you say again. You need to install the 64-bit iFilter for Acrobat to allow indexing for SharePoint Foundation 2010. Spot on once again, Holmes. So I download and install that.
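Why are the file type list and the iFilter two separate requirements? A toy index makes the point. This is a simplified Python sketch, not how the SharePoint crawler is actually written: the class, the file names and the `plain_text` stand-in for an iFilter are all hypothetical. A file only becomes searchable if its extension is on the crawl list AND the search service knows a filter that can pull the text out of it. Note that having a filter installed on the server is not necessarily the same thing as the search service using it; the sketch only models the latter.

```python
import os


def plain_text(raw):
    """Hypothetical stand-in for an iFilter: returns the document's text."""
    return raw


class ToyIndex:
    """Minimal inverted index with a crawl list and a filter registry."""

    def __init__(self, crawled_extensions, filters):
        self.crawled_extensions = set(crawled_extensions)
        self.filters = dict(filters)  # extension -> text extraction function
        self.postings = {}            # word -> set of file names

    def crawl(self, documents):
        """Full crawl: rebuild the index from a {file name: content} dict."""
        self.postings = {}
        for name, raw in documents.items():
            ext = os.path.splitext(name)[1].lower().lstrip(".")
            if ext not in self.crawled_extensions:
                continue  # file type is not on the crawl list
            extract = self.filters.get(ext)
            if extract is None:
                continue  # no known filter, so the contents are never read
            for word in extract(raw).lower().split():
                self.postings.setdefault(word, set()).add(name)

    def search(self, term):
        return sorted(self.postings.get(term.lower(), set()))


if __name__ == "__main__":
    # Hypothetical file names and contents, for illustration only.
    docs = {
        "Document.docx": "notes on team collaboration",
        "Document.pdf": "notes on team collaboration",
    }
    index = ToyIndex({"docx", "pdf"}, {"docx": plain_text})
    index.crawl(docs)
    print(index.search("collaboration"))  # pdf is crawled but unreadable

    index.filters["pdf"] = plain_text     # the service now knows a PDF filter
    index.crawl(docs)
    print(index.search("collaboration"))
```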

 

With that done and another full manual crawl run, the result is once again:

 

image_14_60D24BCB

 

Do an IISRESET followed by a full manual crawl – same result.

Reboot server followed by a full manual crawl - same result.

Etc, etc.

 

But with the wave of my magic wand, hey presto:

 

image_22_46B6EA01

 

I now have a ‘duplicates’ hyperlink which, when I open it, shows:

 

image_20_788EA493

 

It now works! I end up with duplicates since it is effectively the same file, and when I expand these duplicates (since I am using Search Server Express 2010 here, after going to all the hassle of installing it) I see not one but TWO matches, finally. Amazing what you can achieve if you have my SharePoint Operation Guide, eh? How much more valuable does this make SharePoint Foundation 2010 now?

 

The real question for me is why this is not enabled by default, out of the box. Isn’t PDF a common enough format? Doesn’t having the ability to index PDF documents greatly add to the value of SharePoint? How many people are going to know to go in and ‘tweak’ SharePoint Foundation 2010 (and WSS v3 for that matter) to allow PDF indexing? Most are going to call the product ‘crap’ and move on to something else because it lacks what should be common functionality (at least in my opinion).

 

Worst of all? The configuration to do PDF indexing really isn’t that difficult to enable.