This article contains a brief discussion of the issues and a C# example that you can use to convert Word documents to HTML or PDF on the fly in your applications.
Converting Word documents to PDF and HTML is a vital function for many applications. Opening a Word document requires the user to have Word installed, or at least Word Viewer, and since IE8 there has been no easy way of showing a document in situ within a web page (you can change the IE settings to get the old behaviour back, but that's too much to ask of users for most web apps). The best solution is to convert the document to HTML or PDF, and to store the more portable file alongside the .doc on your server.
The main problem faced by third-party converters is that the document needs to be laid out perfectly. The output must be 100% consistent with what the author sees in Word. Any anomalies are unacceptable.
The solution I recommend every time is Word itself. You can automate Word using what are known as interop libraries. These are installed alongside Microsoft Office, so for this to work you must install Office on the server.
Some people will think that installing Office on a production server is unacceptable, or maybe your app runs on a Unix server. There is an answer: create a web service that hands out the binary data of the next document to be converted and accepts the binary data of the converted file. Now you can get a Windows Server, install Office 2007 and deploy a .NET service that constantly polls your web service and performs the conversions as needed. Because Word itself does the rendering, the output matches what the author sees.
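The polling side of that arrangement can be sketched roughly as follows. This is only an outline under assumptions: `PendingDocument`, `ConversionWorker` and `DrainQueue` are hypothetical names, and the three delegates stand in for the calls to your web service and for the Word conversion itself, none of which are defined in this article.

```csharp
using System;

// Hypothetical shape of one queued document as returned by the web service.
public class PendingDocument
{
    public string Id;
    public byte[] Content;
}

// Hypothetical worker that would run on the Windows machine with Office installed.
public class ConversionWorker
{
    private readonly Func<PendingDocument> _fetchNext;   // asks the web service for the next document, null if none
    private readonly Func<byte[], byte[]> _convert;      // performs the Word-based conversion
    private readonly Action<string, byte[]> _submit;     // sends the converted bytes back

    public ConversionWorker(Func<PendingDocument> fetchNext,
                            Func<byte[], byte[]> convert,
                            Action<string, byte[]> submit)
    {
        _fetchNext = fetchNext;
        _convert = convert;
        _submit = submit;
    }

    // Converts everything currently waiting and returns how many documents were handled.
    public int DrainQueue()
    {
        int count = 0;
        PendingDocument doc;
        while ((doc = _fetchNext()) != null)
        {
            _submit(doc.Id, _convert(doc.Content));
            count++;
        }
        return count;
    }
}
```

A Windows service would then call `DrainQueue` on a timer. Passing delegates rather than a concrete web service client keeps the loop easy to try out against an in-memory queue before Office is involved.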
Here's sample code that you can use to get up and running straight away. You'll have to make a few tweaks here and there (like the file name), and if you don't already have it you'll need to go to the Microsoft Office web site and download the PDF exporter (luckily it's free).
    using Microsoft.Office.Interop.Word;
    ...
    object _read_only = false; // passed as AddToRecentFiles below, despite its name
    object _visible = true;
    object _false = false;
    object _true = true;
    object _dynamic = 2;
    object _missing = System.Reflection.Missing.Value;
    object _htmlFormat = 8;  // wdFormatHTML
    object _pdfFormat = 17;  // wdFormatPDF
    object _xpsFormat = 18;  // wdFormatXPS
    object fileName = "C:\\Test.docx";

    ApplicationClass ac = new ApplicationClass();
    //ac.Visible = true; // Uncomment to see Word as it opens and converts the document
    //ac.Activate();

    // Open the document read-only (third argument)
    Document d = ac.Documents.Open(ref fileName, ref _missing, ref _true, ref _read_only,
        ref _missing, ref _missing, ref _missing, ref _missing, ref _missing, ref _missing,
        ref _missing, ref _visible, ref _missing, ref _missing, ref _missing, ref _missing);

    // Same path as the source, with the extension replaced by .pdf
    object newFileName = ((string)fileName).Substring(0, ((string)fileName).LastIndexOf(".")) + ".pdf";
    d.SaveAs(ref newFileName, ref _pdfFormat, ref _missing, ref _missing, ref _missing,
        ref _missing, ref _missing, ref _missing, ref _missing, ref _missing, ref _missing,
        ref _missing, ref _missing, ref _missing, ref _missing, ref _missing);

    // Close the document and quit Word without saving changes
    d.Close(ref _false, ref _missing, ref _missing);
    ac.Quit(ref _false, ref _missing, ref _missing);
    ac = null;
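The sample above hard-codes the PDF format and the .pdf extension. If you also want HTML or XPS output, a small helper along these lines could pick the output path to go with each format value. This is a sketch only: `ConversionTarget`, `ExtensionFor` and `OutputPath` are hypothetical names, and the numeric values are simply the ones from the sample above.

```csharp
using System;
using System.IO;

// Hypothetical helper; not part of the Word interop libraries.
public static class ConversionTarget
{
    // Format values as used in the sample above.
    public const int Html = 8;
    public const int Pdf = 17;
    public const int Xps = 18;

    // Returns the file extension that goes with a given format value.
    public static string ExtensionFor(int format)
    {
        switch (format)
        {
            case Html: return ".html";
            case Pdf: return ".pdf";
            case Xps: return ".xps";
            default: throw new ArgumentOutOfRangeException("format");
        }
    }

    // Builds the output path next to the source document.
    public static string OutputPath(string sourcePath, int format)
    {
        return Path.ChangeExtension(sourcePath, ExtensionFor(format));
    }
}
```

For example, `ConversionTarget.OutputPath("C:\\Test.docx", ConversionTarget.Html)` gives `C:\Test.html`. You would box that path and the format value as `object` and pass them as the first two arguments to `SaveAs`. `Path.ChangeExtension` is used in place of the `Substring`/`LastIndexOf` approach because it only replaces a dot in the file name itself, not one in a folder name.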
I hope you find this useful. I won't attempt to explain those huge method signatures with all of the _missing parameters, because they aren't important in the context of this example. If you want to know more, I suggest you take a look at the documentation on MSDN. It's worth reading anyway if you're going to be building Word Interop into your app.