Reading from a Word Document with COM in PHP
I love PHP. I love MySQL. They are powerful. They are easy to use. They are well documented.
I have no particular aversion to Microsoft Word. As a word processor, and more, it has served me well over the years. It has produced for me innumerable essays, reports, resumes, Engineering Department notices, and letters to Santa. Until now, though, I had never had the pleasure of working with Word as a programmer.
A client wished to perform full-text searches on documents uploaded to her website. As you might expect, the Microsoft Word file format prevents one from simply reading in the text. Still, "No problem," we said. "I think we've heard of some COM platform that will let PHP talk to Word. We can definitely do this." You will notice that this is the moment at which Brooke and I took our first step into Hell.
You see, COM (the Component Object Model) allows any programming language with COM support to interact directly with a Microsoft application, such as Internet Explorer, the Windows shell, or Excel. In PHP, we should be able to run Word, open a document, and read from that document.
So, we started poking around online, looking for COM documentation and examples of similar implementations. The examples were there, albeit sparsely, but the documentation was mostly lacking. When I instantiate a COM handle to Word, what methods are at my disposal? No one would tell me. Furthermore, no one presented examples of opening a document, reading the entire contents, and closing it. Seems simple, seems universally useful but it isn't there. Go look: I dare you to try.
I could write a new document or modify an existing one. I could read the first character or special 'bookmarked' characters. I could not just read the entire file. Just give me the text!
And then, I found this PHP class and I experienced an epiphany, a ray of sweet, warm sunlight shining on my cold, bare ass. I could open the Word document with COM in PHP, and then, without reading it, save it as a text file. AND THEN I COULD READ THE TEXT FILE.
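// Word resolves relative paths against its own working directory, so use an absolute path
$filename = realpath($filename) or die("File not found");
$word = new COM("word.application") or die("Unable to instantiate Word");
$doc = $word->Documents->Open($filename);
// assumes a three-letter extension such as .doc
$new_filename = substr($filename, 0, -4) . ".txt";
// the '2' parameter (wdFormatText) specifies saving in txt format
$doc->SaveAs($new_filename, 2);
$doc->Close(false);
$word->Quit();
// $word->Release(); // raises a fatal error on some PHP versions; the next lines free the object anyway
$doc = NULL;
$word = NULL;
$fh = fopen($new_filename, 'r');
// this is where we exit Hell
$contents = fread($fh, filesize($new_filename));
fclose($fh);
unlink($new_filename);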
This method works! It actually works! I can actually have the contents of the Word document! Huzzah.
I have posted this here, with attribution to the aforementioned PHP class for inspiration and for the format parameter to the SaveAs function, in the hope that some other hapless fool, attempting to complete the same task, will find solace in these lines. Feel free to contact me with any questions: I am more than happy to help you defeat the COM demon.
As a closing note, the second half of the task, intelligent full-text search, was rendered trivial, laughably easy, by the MySQL built-in full-text search functions. Thank you, Open Source. You win again.
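For the curious, here is a minimal sketch of what that second half might look like. It is not the code from the project: the `documents` table, its columns, and both function names are hypothetical, and it assumes a PDO connection to MySQL and a FULLTEXT index on the column holding the extracted text.

```php
<?php
// Hypothetical schema (not from the original project):
//   CREATE TABLE documents (
//     id INT AUTO_INCREMENT PRIMARY KEY,
//     filename VARCHAR(255) NOT NULL,
//     contents MEDIUMTEXT NOT NULL,
//     FULLTEXT (contents)
//   );

// Returns the SQL for a natural-language full-text search, best matches first.
// Two placeholders are used because a named placeholder cannot always be reused.
function fulltext_search_sql(): string
{
    return 'SELECT id, filename, MATCH(contents) AGAINST (:q1) AS score '
         . 'FROM documents WHERE MATCH(contents) AGAINST (:q2) '
         . 'ORDER BY score DESC';
}

// Runs the search on an existing PDO connection and returns the matching rows.
function search_documents(PDO $pdo, string $query): array
{
    $stmt = $pdo->prepare(fulltext_search_sql());
    $stmt->execute([':q1' => $query, ':q2' => $query]);
    return $stmt->fetchAll(PDO::FETCH_ASSOC);
}
```

The `$contents` string read from the temporary text file would be inserted into the `contents` column at upload time, and `search_documents($pdo, 'engineering notices')` would then return the best-matching uploads.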
From Brooke R
Commented January 25th, 2007 11:52 pm
Ha, so true. I think really, there was just one "example" that everyone re-posted.
I still can't believe that 2 (two) hours of trolling through technet and msdn articles couldn't turn up a single list of available COM commands.
once again, good work piecing together what was available to make a workable solution.
And let me second the "thank you" to OSS! MySQL FTW!
From jayson
Commented June 29th, 2007 10:54 pm
That is great code you have there!
I've already used this code, but there's a fatal error when the line "$word->Release();" is included. Any ideas? Thanks. By the way, I want to learn how to use COM in PHP. Please recommend some books for reference. Thank you so much. I've been looking for this kind of code for almost a year.
From drew
Commented July 6th, 2007 9:52 am
From an e-mail to Jayson:
Hello Jayson,
I am happy to hear that my code was helpful to you. I hope I can be of further service. I have done very little work with COM in PHP---this small project was in fact the first and last time. So, I cannot suggest any reference books to you on the subject. It is particularly difficult to find a book on the subject of accessing a Microsoft service (COM) with an open source programming language (PHP)---the majority of books available are for .NET or the like.
Regarding your fatal error on $word->Release(), I would suggest commenting out that line and seeing if the code still works. That command is one of many that ensures the COM object is released and deleted and will not live on in memory. The command $word = NULL should accomplish this goal even without the Release() statement.
Let me know if you have further questions or problems.
Peace,
Drew/Carlos d'Avis
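The advice in this reply can be taken one step further. The following is a minimal sketch, not the code from the post: the function name `word_doc_to_text` is made up for illustration, and it assumes PHP 7 or later on Windows with the COM extension enabled and Word installed. It omits `Release()` entirely and uses `finally` so that Word is told to quit even when opening or saving the document fails.

```php
<?php
// Converts a Word document to plain text through Word itself.
// Both paths should be absolute. Windows only.
function word_doc_to_text(string $doc_path, string $txt_path): string
{
    $word = new COM('word.application'); // throws com_exception on failure
    try {
        $doc = $word->Documents->Open($doc_path);
        $doc->SaveAs($txt_path, 2); // 2 = wdFormatText
        $doc->Close(false);
    } finally {
        $word->Quit();
        // Dropping the references lets PHP free the COM objects.
        $doc = null;
        $word = null;
    }

    $contents = file_get_contents($txt_path);
    if ($contents === false) {
        throw new RuntimeException("Could not read $txt_path");
    }
    unlink($txt_path);
    return $contents;
}
```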
From Jakub Mroz
Commented September 22nd, 2007 1:04 pm
Wow... wondering if it's possible to exec other Windows applications by COM?
From Ali
Commented September 27th, 2007 9:06 pm
Hi,
I tried the above code but I am getting an error at the following line
$word->Documents->Open("myfile.doc");
The error is:
Warning: (null)(): Invoke() failed: Exception occurred. Source: Microsoft Word Description: The document name or path is not valid. Try one or more of the following: * Check the path to make sure it was typed correctly. * On the File menu, click Open. Search for the file using this dialog box. (myfile.doc)
Although the file does exist.
I am using PHP 4 on Windows XP with Apache.
From Simon Huntley
Commented October 9th, 2007 8:05 am
I might try this code on the company intranet. Thank you for publishing this. I'll let you know how it goes.
-Simon.
From james clavel
Commented October 23rd, 2007 7:51 pm
Yeah, I have the same problem as Ali... hope you can help. Thanks.
From drew
Commented October 30th, 2007 11:22 am
I will be revisiting this topic very soon with new blog posts and pages detailing the capabilities of PHP in dealing with Microsoft Office documents.
From rick
Commented November 9th, 2007 4:22 am
I read this COM object only works on a windows webserver..
I'm using a Linux webserver with php5/apache and I really NEED this functionality
any hope?
thnx
From Sam
Commented January 11th, 2008 3:50 am
Hey -- Is there any way to specify the desired encoding when saving to txt? I believe the default is ISO-*, but I need UTF-8 (not UTF-16).
If you know off the top of your head let me know.
Thanks
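One possible answer, offered as a sketch under assumptions rather than a tested recipe: save with format 7 (`wdFormatUnicodeText`) instead of 2, then convert the result to UTF-8 in PHP. The sketch assumes Word writes that format as UTF-16LE with a byte order mark, and the function name is hypothetical.

```php
<?php
// Converts text that Word saved as "Unicode text" to UTF-8.
// Assumption: the input is UTF-16LE, optionally starting with a BOM.
function word_unicode_text_to_utf8(string $bytes): string
{
    // Strip the UTF-16LE byte order mark (FF FE) if present.
    if (strncmp($bytes, "\xFF\xFE", 2) === 0) {
        $bytes = (string) substr($bytes, 2);
    }
    $utf8 = iconv('UTF-16LE', 'UTF-8', $bytes);
    if ($utf8 === false) {
        throw new RuntimeException('Input is not valid UTF-16LE');
    }
    return $utf8;
}
```

In the code from the post this would mean passing 7 instead of 2 to `SaveAs()` and then running the contents read from the text file through `word_unicode_text_to_utf8()`.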
From Rajapriya
Commented February 13th, 2008 4:38 am
Thank you very much. I had this problem in my code. Unbelievable! It works nicely.
From Jacka
Commented February 18th, 2008 5:17 pm
Thanks a lot for this example!
You are so right... I was looking for this too, but I only found how to create or alter a document (though a spell check (!) is possible...).
Curious.. ;o)
From Draicone
Commented March 15th, 2008 5:22 am
This is still a bit of a hack though - you shouldn't need to save it as a new text document. The ActiveDocument property of a word.application instance is an instance of Document which has a Content property serving the same purpose. Try this blog post:
http://www.developertutorials.com/blog/php/extracting-text-from-word-documents-via-php-and-com-81/
Working with MS Word documents and COM makes much more sense if you've used VB in the past.
Importing the MS Word library into a VB project gives you access to the object model via the object browser, and from there on it's smooth sailing given the level of detail in MSDN.
... of course, none of that will make sense unless you've developed for Windows before. Long story short, Word exposes itself like the DOM exposes itself in JS, and it makes sense to VB/VC# developers. Ask on a VB board about Word COM if you need help with a COM question; translating VB code into PHP COM code is really easy.
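The approach described in this comment can be sketched roughly as follows. This is an illustration, not the code from the linked post: the function name is hypothetical, it uses the Document object returned by `Open()` rather than `ActiveDocument`, and it assumes PHP 7 or later on Windows with the COM extension and Word installed. `Content` is the Range covering the whole document and `Text` is its plain text.

```php
<?php
// Reads the text of a Word document without writing a temporary file.
// $doc_path should be an absolute path. Windows only.
function word_doc_content(string $doc_path): string
{
    // CP_UTF8 asks PHP to exchange strings with Word as UTF-8.
    $word = new COM('word.application', null, CP_UTF8);
    try {
        $doc = $word->Documents->Open($doc_path);
        $text = (string) $doc->Content->Text;
        $doc->Close(false); // false: do not save changes
    } finally {
        $word->Quit();
        $doc = null;
        $word = null;
    }
    return $text;
}
```

As far as I know, Word separates paragraphs in `Text` with a carriage return ("\r"), so the result may need its line endings normalized before display.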
From osama
Commented March 16th, 2008 2:56 am
Hi all...
I have a problem when I try to use this code. The problem is this error:
Fatal error: Uncaught exception 'com_exception' with message 'Source: Microsoft WordDescription: This file could not be found. Try one or more of the following: * Check the spelling of the name of the document. * Try a different file name. (document.doc)' in C:\Program Files\Apache Software Foundation\Apache2.2\htdocs\test\osama\word.php:133 Stack trace: #0 C:\Program Files\Apache Software Foundation\Apache2.2\htdocs\test\osama\word.php(133): variant->Open('document.doc') #1 {main} thrown in C:\Program Files\Apache Software Foundation\Apache2.2\htdocs\test\osama\word.php on line 133
I need to read document files like MS Word, and when I read one the way I read a text file, the result comes out garbled, as if it were encrypted.
Please, can anyone help me?
From Esfandiar
Commented March 19th, 2008 12:11 am
I would love to know how to open an MS Word file safely, that is, making sure that viruses and worms don't infect your computer. Turning off macros and running virus protection is one way. Anything else?
Also how about opening the file, with a MS Word 2007 and parsing the XML?
Any help would be appreciated.
Thanks, contact: e.bandari@gmail.com.
From Achmad
Commented April 17th, 2008 11:19 pm
Hi, I have the same problem as osama. How do I fix that?
From Lupus
Commented April 20th, 2008 10:03 am
Any way to save a doc file to html? I would like to keep the tables images and anything else from the doc file.
Thx
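The second argument to `SaveAs()` is a value from Word's `WdSaveFormat` enumeration, so the same call that writes plain text can in principle write HTML. A minimal sketch follows; the constant and function names are my own, only the numeric values come from Word. PDF output (17) is only available in Word 2007 and later, and early 2007 builds need Microsoft's "Save as PDF" add-in.

```php
<?php
// A few values from Word's WdSaveFormat enumeration.
const WD_FORMAT_TEXT          = 2;  // plain text
const WD_FORMAT_UNICODE_TEXT  = 7;  // Unicode plain text
const WD_FORMAT_HTML          = 8;  // HTML, with Office-specific markup
const WD_FORMAT_FILTERED_HTML = 10; // HTML with the Office markup removed
const WD_FORMAT_PDF           = 17; // Word 2007 and later only

// Saves an already opened Word document in the given format.
// $doc is the COM object returned by $word->Documents->Open().
function save_word_document_as($doc, string $dest_path, int $format): void
{
    $doc->SaveAs($dest_path, $format);
}
```

With the code from the post, `save_word_document_as($word->Documents[1], 'C:\\docs\\report.html', WD_FORMAT_FILTERED_HTML)` should write the page along with a companion folder holding its images.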
From drew
Commented April 22nd, 2008 9:32 am
This post has proven to be my most popular without doubt. I will very soon be revisiting this programming hurdle and attempting to answer some of your questions.
Thank you for your patience and interest!
From Esfandiar
Commented April 29th, 2008 2:18 pm
Question to all:
Is there a way to display an MS Word 2007 Open XML document in a browser? Thanks in advance, Esfandiar
From eric
Commented May 6th, 2008 5:30 am
how do i download and install the COM component with php
From swathi
Commented May 29th, 2008 12:06 am
i have same problem as ali
From eSolutions
Commented June 7th, 2008 10:33 am
Is there any way to read a Word (.doc) file and create a PDF file online?
If it is possible, then please tell me.
I would be very thankful to you.
Please....
From anjali
Commented June 16th, 2008 12:05 am
Thanks,
It's very helpful for reading Microsoft Word files in PHP.
But I am trying to write a script which can read all emails from Outlook using COM.
Do you have any idea?
Can you please help me
From Meenu
Commented July 8th, 2008 12:07 am
Where is the COM class file
From kashif
Commented July 8th, 2008 3:52 am
I want to read tables, images, etc. from the Word file. For simple text this is the best script, but I am searching for a script that can read the images.
If anybody knows, kindly help me with this.
Thanks
Kashif
Kashifyh@gmail.com
From sunitha
Commented July 28th, 2008 4:31 am
Hi, I have a problem on my site, which I am building in PHP.
I am not able to read the content of .doc files. If I open a .doc file, it displays encoded data. Please help me, it's very urgent.
From sunitha
Commented July 28th, 2008 4:35 am
how do i download and install the COM component with php
From Arivusudar
Commented July 30th, 2008 5:04 am
How can i solve this:
Fatal error: Uncaught exception 'com_exception' with message 'Failed to create COM object `word.application': Server execution failed ' in C:\wamp\www\check\index.php:3 Stack trace: #0 C:\wamp\www\check\index.php(3): com->com('word.applicatio...') #1 {main} thrown in C:\wamp\www\check\index.php on line 3
From Arivusudar
Commented July 30th, 2008 9:29 pm
Your code is very nice. It works on localhost but not on my web server.
It shows an error. Can you tell me what I can do to clear the error?
From UTKARSH DIXIT
Commented August 4th, 2008 9:57 pm
I have written the following code
$word->Documents->Open($filename);
$newfilename = substr($filename,0,-4) . ".txt";
// the '2' parameter specifies saving in txt format
$word->Documents[1]->SaveAs($newfilename,2);
$word->Documents[1]->Close(false);
$word->Quit();
$word->Release();
$word = NULL;
unset($word);
$fh = fopen($newfilename, 'r');
// this is where we exit Hell
$contents = fread($fh, filesize($newfilename));
fclose($fh); unlink($new_filename)
?>
but i was getting the following error
Fatal error: Uncaught exception 'com_exception' with message 'Source: Microsoft WordDescription: This file could not be found. Try one or more of the following: * Check the spelling of the name of the document. * Try a different file name. (demo.doc)' in C:\Program Files\EasyPHP 2.0b1\www\docfile.php:12 Stack trace: #0 C:\Program Files\EasyPHP 2.0b1\www\docfile.php(12): variant->Open('demo.doc') #1 {main} thrown in C:\Program Files\EasyPHP 2.0b1\www\docfile.php on line 12
help me to remove this problem
From Kevin
Commented August 12th, 2008 8:42 pm
For all the people who are having the "file could not be found" problem, try using the full path, eg "c:\my folder\my subfolder\myworddoc.doc".
My problem is, that Word 2000 won't accept the parameter 2. I'm assuming that the above COM functions work in Word 2003 and 2007?
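The full-path fix makes sense because Word runs as a separate process with its own working directory, so a relative name like `demo.doc` is most likely resolved against Word's directory rather than the script's. A small sketch of resolving the path in PHP before handing it to Word; the function name is hypothetical.

```php
<?php
// Turns a possibly relative path into the absolute path Word needs.
// Throws if the file does not exist from PHP's point of view.
function absolute_document_path(string $path): string
{
    $absolute = realpath($path);
    if ($absolute === false) {
        throw new InvalidArgumentException("File not found: $path");
    }
    return $absolute;
}
```

The result can be passed straight to `$word->Documents->Open()`. If it throws, the file is missing for PHP too, which rules Word out as the culprit.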
From Marcos
Commented August 29th, 2008 10:54 am
For those of you needing to read Word documents on a Linux box, there is antiword.
Take a look here to see how it works:
http://www.linux.com/articles/52385
Maybe it's ok for those needing to parse a word file
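Calling antiword from PHP could look something like the sketch below. It assumes the `antiword` binary is installed and on the PATH, and that its `-m UTF-8.txt` mapping file is available to produce UTF-8 output. The function names are hypothetical.

```php
<?php
// Builds the antiword command line, quoting the path for the shell.
function antiword_command(string $doc_path): string
{
    return 'antiword -m UTF-8.txt ' . escapeshellarg($doc_path);
}

// Runs antiword and returns the extracted text.
function antiword_extract(string $doc_path): string
{
    $output = shell_exec(antiword_command($doc_path));
    if (!is_string($output)) {
        throw new RuntimeException("antiword produced no output for $doc_path");
    }
    return $output;
}
```

`escapeshellarg()` matters here because the file names come from user uploads and may contain spaces or shell metacharacters.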
From kazey
Commented November 16th, 2008 10:53 am
Does this method support .docx file conversion?
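Whether the COM method handles .docx depends on the installed version of Word: Word 2007 and later open the format natively. Independently of Word, a .docx file is a ZIP archive whose main text lives in `word/document.xml`, so the text can be pulled out with PHP's zip extension alone. A rough sketch follows, with a hypothetical function name. It ignores headers, footnotes, tabs, and manual line breaks.

```php
<?php
// Extracts plain text from a .docx file without Word (requires ext-zip).
function docx_to_text(string $docx_path): string
{
    $zip = new ZipArchive();
    if ($zip->open($docx_path) !== true) {
        throw new RuntimeException("Cannot open $docx_path as a ZIP archive");
    }
    $xml = $zip->getFromName('word/document.xml');
    $zip->close();
    if ($xml === false) {
        throw new RuntimeException('word/document.xml not found in archive');
    }

    // End every paragraph with a newline, then drop all remaining markup.
    $xml = str_replace('</w:p>', "</w:p>\n", $xml);
    return html_entity_decode(strip_tags($xml), ENT_QUOTES | ENT_XML1, 'UTF-8');
}
```

This also works on a Linux server, since nothing in it depends on COM.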