Reading from a Word Document with COM in PHP
I love PHP. I love MySQL. They are powerful. They are easy to use. They are well documented.
I have no particular aversion to Microsoft Word. As a word processor, and more, it has served me well over the years. It has produced for me innumerable essays, reports, resumes, Engineering Department notices, and letters to Santa. But I had never before had the pleasure of working with Word as a programmer.
A client wished to perform full-text searches on documents uploaded to her website. As you might expect, the Microsoft Word file format prevents one from simply reading in the text. Still, "No problem," we said. "I think we've heard of some COM platform that will let PHP talk to Word. We can definitely do this." You will notice that this is the moment at which Brooke and I took our first step into Hell.
You see, COM (Component Object Model) lets any COM-aware programming language drive a Microsoft application directly, such as Internet Explorer, the Windows shell, or Excel. In PHP, we should be able to run Word, open a document, and read from that document.
So, we started poking around online, looking for COM documentation and examples of similar implementations. The examples were there, albeit sparsely, but the documentation was mostly lacking. When I instantiate a COM handle to Word, what methods are at my disposal? No one would tell me. Furthermore, no one presented examples of opening a document, reading the entire contents, and closing it. Seems simple, seems universally useful but it isn't there. Go look: I dare you to try.
I could write a new document or modify an existing one. I could read the first character or special 'bookmarked' characters. I could not just read the entire file. Just give me the text!
And then, I found this PHP class and I experienced an epiphany, a ray of sweet, warm sunlight shining on my cold, bare ass. I could open the Word document with COM in PHP, and then, without reading it, save it as a text file. AND THEN I COULD READ THE TEXT FILE.
// $filename must be the full path to the document, e.g. C:\docs\report.doc
$word = new COM("word.application") or die("Unable to instantiate Word");
$word->Documents->Open($filename);
$new_filename = substr($filename,0,-4) . ".txt";
// the '2' parameter specifies saving in txt format
$word->Documents[1]->SaveAs($new_filename,2);
$word->Documents[1]->Close(false);
$word->Quit();
// dropping the reference releases the COM object;
// an explicit $word->Release() call fails with "Member not found" on PHP 5
$word = NULL;
unset($word);
$fh = fopen($new_filename, 'r');
// this is where we exit Hell
$contents = fread($fh, filesize($new_filename));
fclose($fh);
unlink($new_filename);
This method works! It actually works! I can actually have the contents of the Word document! Huzzah.
I posted this here, with attribution to the aforementioned PHP class for inspiration and for the format parameter to the SaveAs method, in the hope that some other hapless fool, attempting to complete the same task, will find solace in these lines. Feel free to contact me with any questions: I am more than happy to help you defeat the COM demon.
As a closing note, the second half of the task, intelligent full-text search, was rendered trivial, laughably easy, by the MySQL built-in full-text search functions. Thank you, Open Source. You win again.
From Brooke R
Commented January 25th, 2007 11:52 pm
Ha, so true. I think really, there was just one "example" that everyone re-posted.
I still can't believe that 2 (two) hours of trawling through TechNet and MSDN articles couldn't turn up a single list of available COM commands.
Once again, good work piecing together what was available to make a workable solution.
And let me second the "thank you" to OSS! MySQL FTW!
From jayson
Commented June 29th, 2007 10:54 pm
It is a great code u have there!
I've already used this code, but there's a fatal error on the line "$word->Release();". Any ideas? Thanks. And by the way, I want to learn how to use COM in PHP. Please recommend some reference books. Thank you so much. I've been looking for this kind of code for almost a year.
From drew
Commented July 6th, 2007 9:52 am
From an e-mail to Jayson:
Hello Jayson,
I am happy to hear that my code was helpful to you. I hope I can be of further service. I have done very little work with COM in PHP---this small project was in fact the first and last time. So, I cannot suggest any reference books to you on the subject. It is particularly difficult to find a book on the subject of accessing a Microsoft service (COM) with an open source programming language (PHP)---the majority of books available are for .NET or the like.
Regarding your fatal error on $word->Release(), I would suggest commenting out that line and seeing if the code still works. That command is one of several that ensure the COM object is released and deleted and does not live on in memory. The command $word = NULL should accomplish this goal even without the Release() statement.
Let me know if you have further questions or problems.
Peace,
Drew/Carlos d'Avis
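Editor's sketch of the advice in this reply: drop Release() and rely on Quit() plus clearing the variable. The helper name word_to_text_file() is hypothetical, and the try/finally form assumes PHP 5.5 or later on a Windows machine with Word and PHP's COM extension available. Treat it as one possible arrangement, not the author's code.

```php
<?php
// Sketch only: needs Windows, an installed copy of Word, and PHP's COM extension.
// word_to_text_file() is a hypothetical helper name, not part of any library.
function word_to_text_file($docPath, $txtPath)
{
    $word = new COM("word.application");
    try {
        $doc = $word->Documents->Open($docPath);
        $doc->SaveAs($txtPath, 2);   // 2 = wdFormatText
        $doc->Close(false);          // false = do not save changes to the .doc
    } finally {
        // Quit even when Open or SaveAs throws, so no WINWORD.EXE is left running.
        $word->Quit();
        $word = null;                // dropping the reference frees the COM object
    }
}
```

Calling Quit() in the finally block is a design choice aimed at the "script hangs on Open" reports further down: a Word process left over from a failed run can keep the document locked.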
From Jakub Mroz
Commented September 22nd, 2007 1:04 pm
Wow... wondering if it's possible to exec other Windows applications by COM?
From Ali
Commented September 27th, 2007 9:06 pm
Hi,
I tried the above code but I am getting error at following line
$word->Documents->Open("myfile.doc");
The error is:
Warning: (null)(): Invoke() failed: Exception occurred. Source: Microsoft Word Description: The document name or path is not valid. Try one or more of the following: * Check the path to make sure it was typed correctly. * On the File menu, click Open. Search for the file using this dialog box. (myfile.doc)
The file does exist, though.
I am using PHP 4 on Windows XP with Apache.
From Simon Huntley
Commented October 9th, 2007 8:05 am
I might try this code on the company intranet. Thank you for publishing this. I'll let you know how it goes.
-Simon.
From james clavel
Commented October 23rd, 2007 7:51 pm
yah i have the same problem with ali... hope you could help. thanks
From drew
Commented October 30th, 2007 11:22 am
I will be revisiting this topic very soon with new blog posts and pages detailing the capabilities of PHP in dealing with Microsoft Office documents.
From rick
Commented November 9th, 2007 4:22 am
I read this COM object only works on a windows webserver..
I'm using a Linux webserver with php5/apache and I really NEED this functionality
any hope?
thnx
From Sam
Commented January 11th, 2008 3:50 am
Hey -- Is there any way to specify the desired encoding when saving to txt? I believe the default is ISO-*, but I need UTF-8 (not UTF-16).
If you know off the top of your head let me know.
Thanks
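Editor's sketch for the encoding question: rather than changing how Word saves, one option is to convert the text after reading it back. This assumes the .txt file was written in Windows-1252, the usual "ANSI" code page on Western-language Windows installs; that assumption is not confirmed anywhere in this thread, and the helper name is hypothetical.

```php
<?php
// Hypothetical helper. ASSUMPTION: the saved text is Windows-1252 ("ANSI").
// Pass a different source encoding if your server uses another code page.
function ansi_text_to_utf8($text, $sourceEncoding = 'Windows-1252')
{
    return iconv($sourceEncoding, 'UTF-8', $text);
}

// Usage with the post's variable: $contents = ansi_text_to_utf8($contents);
```

PHP's COM constructor also takes a code page as its third argument, for example new COM("word.application", null, CP_UTF8), which affects strings passed to and from Word rather than the file Word writes to disk.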
From Rajapriya
Commented February 13th, 2008 4:38 am
Thank you very much.i have this problem in my coding.unbeleivable! its work nice.
From Jacka
Commented February 18th, 2008 5:17 pm
Thanks a lot for this example!
You are so right... I was looking for this too, but I only found how to create or alter a document (though a spell check (!) is possible..).
Curious.. ;o)
From Draicone
Commented March 15th, 2008 5:22 am
This is still a bit of a hack though - you shouldn't need to save it as a new text document. The ActiveDocument property of a word.application instance is an instance of Document which has a Content property serving the same purpose. Try this blog post:
http://www.developertutorials.com/blog/php/extracting-text-from-word-documents-via-php-and-com-81/
Working with MS Word documents and COM makes much more sense if you've used VB in the past.
Importing the MS Word library into a VB project gives you access to the object model via the object browser, and from there on it's smooth sailing given the level of detail in MSDN.
... of course, none of that will make sense unless you've developed for Windows before. Long story short, Word exposes itself like the DOM exposes itself in JS, and it makes sense to VB/VC# developers. Ask on a VB board about Word COM if you need help with a COM question; translating VB code into PHP COM code is really easy.
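Editor's sketch of the approach this comment describes: read the text straight from the document's Content range instead of saving a .txt copy. The helper name is hypothetical and the code has the same requirements as the post (Windows, Word installed, PHP's COM extension).

```php
<?php
// Hypothetical helper: returns the document body as a string, with no temp file.
function word_document_text($docPath)
{
    $word = new COM("word.application");
    $word->Visible = false;                    // keep Word's window hidden
    $doc  = $word->Documents->Open($docPath);  // $docPath should be a full path
    $text = (string) $doc->Content->Text;      // Content is a Range; Text is its string
    $doc->Close(false);                        // false = do not save changes
    $word->Quit();
    $word = null;
    return $text;
}
```

Word separates paragraphs in Range text with carriage returns, so the result may need "\r" replaced with "\n" before it is displayed or indexed.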
From osama
Commented March 16th, 2008 2:56 am
hi all...
I have a problem when I try to use this code. The problem is this error:
Fatal error: Uncaught exception 'com_exception' with message 'Source: Microsoft WordDescription: This file could not be found. Try one or more of the following: * Check the spelling of the name of the document. * Try a different file name. (document.doc)' in C:\Program Files\Apache Software Foundation\Apache2.2\htdocs\test\osama\word.php:133 Stack trace: #0 C:\Program Files\Apache Software Foundation\Apache2.2\htdocs\test\osama\word.php(133): variant->Open('document.doc') #1 {main} thrown in C:\Program Files\Apache Software Foundation\Apache2.2\htdocs\test\osama\word.php on line 133
I need to read document files like MS Word, and when I read one the way I read a text file, the result looks like a bad, encrypted file..
.. please, can anyone help me?
From Esfandiar
Commented March 19th, 2008 12:11 am
I would love to know, how to open a MS Word file safely. That is making sure that viruses and worms don't infect your computer. Turning off the macro-s and running a virus protection is one way. Anything else?
Also how about opening the file, with a MS Word 2007 and parsing the XML?
Any help would be appreciated.
Thanks, contact: e.bandari@gmail.com.
From Achmad
Commented April 17th, 2008 11:19 pm
hi i have problem same as osama.. how do i fix that??
From Lupus
Commented April 20th, 2008 10:03 am
Any way to save a doc file to html? I would like to keep the tables images and anything else from the doc file.
Thx
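Editor's sketch for the HTML question: the second argument to SaveAs is a value from Word's WdSaveFormat enumeration, so the same call that writes text with 2 can write HTML with 8 or 10. The constant and function names below are hypothetical; only the numeric values come from Word.

```php
<?php
// Values from Word's WdSaveFormat enumeration (the PHP constant names are made up here).
define('WD_FORMAT_TEXT', 2);            // plain text, as used in the post
define('WD_FORMAT_HTML', 8);            // HTML with Word-specific markup
define('WD_FORMAT_FILTERED_HTML', 10);  // HTML with most Word-specific markup removed

// Hypothetical helper: $word is an already created COM("word.application") object.
function save_word_doc_as_html($word, $docPath, $htmlPath)
{
    $doc = $word->Documents->Open($docPath);
    $doc->SaveAs($htmlPath, WD_FORMAT_FILTERED_HTML);
    $doc->Close(false);  // false = do not save changes to the original
}
```

Tables should come through as HTML tables. Images are usually written to a companion folder next to the .html file, so that folder has to be served as well. Word versions that can export PDF use the value 17 (wdFormatPDF) in the same position.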
From drew
Commented April 22nd, 2008 9:32 am
This post has proven to be my most popular without doubt. I will very soon be revisiting this programming hurdle and attempting to answer some of your questions.
Thank you for your patience and interest!
From Esfandiar
Commented April 29th, 2008 2:18 pm
Question to all:
Is there a way to display an MS Word 2007 OpenXML document in a browser? Thanks in advance, Esfandiar
--
Esfandiar Bandari, PhD, MBA
e.bandari@cantab.net, e.bandari@gmail.com
skype: ebbandari & gtalk: e.bandari
H. (650) 964-4154 Cell: (650) 862-8351
http://www.linkedin.com/in/ebandari
From eric
Commented May 6th, 2008 5:30 am
how do i download and install the COM component with php
From swathi
Commented May 29th, 2008 12:06 am
i have same problem as ali
From eSolutions
Commented June 7th, 2008 10:33 am
is their any way to read word(.doc) file and crate PDF file online ?
if it possible then please tell me
i m very very thankful to u
please....
From anjali
Commented June 16th, 2008 12:05 am
Thanks,
its bery helpful for read microsoft word file in PHP.
But i am trying to make script which can read all emails from outlook using COM.
Do you have any idea?
Can you please help me
From Meenu
Commented July 8th, 2008 12:07 am
Where is the COM class file
From kashif
Commented July 8th, 2008 3:52 am
I want to read table, imges etc from the word file, for the simple text this is best script but me searching for the script that can read the images.
if any body knows kindly help me for this.
Thanks
Kashif
Kashifyh@gmail.com
From sunitha
Commented July 28th, 2008 4:31 am
hi i have some problem in my site,i am doing in php.
i am not able to open the content in doc files.if i open the doc file it display encoded data.Please help me its very urgent.
From sunitha
Commented July 28th, 2008 4:35 am
how do i download and install the COM component with php
From Arivusudar
Commented July 30th, 2008 5:04 am
How can i solve this:
Fatal error: Uncaught exception 'com_exception' with message 'Failed to create COM object `word.application': Server execution failed ' in C:\wamp\www\check\index.php:3 Stack trace: #0 C:\wamp\www\check\index.php(3): com->com('word.applicatio...') #1 {main} thrown in C:\wamp\www\check\index.php on line 3
From Arivusudar
Commented July 30th, 2008 9:29 pm
Your coding is very nice. it is working in localhost but it is not working in my web server..
showing error, can you tell me what can i do for clear error..
From UTKARSH DIXIT
Commented August 4th, 2008 9:57 pm
I have written the following code
Documents->Open($filename);
$newfilename = substr($filename,0,-4) . ".txt";
// the '2' parameter specifies saving in txt format
$word->Documents[1]->SaveAs($newfilename,2);
$word->Documents[1]->Close(false);
$word->Quit();
$word->Release();
$word = NULL;
unset($word);
$fh = fopen($newfilename, 'r');
// this is where we exit Hell
$contents = fread($fh, filesize($newfilename));
fclose($fh); unlink($new_filename)
?>
but i was getting the following error
Fatal error: Uncaught exception 'com_exception' with message 'Source: Microsoft WordDescription: This file could not be found. Try one or more of the following: * Check the spelling of the name of the document. * Try a different file name. (demo.doc)' in C:\Program Files\EasyPHP 2.0b1\www\docfile.php:12 Stack trace: #0 C:\Program Files\EasyPHP 2.0b1\www\docfile.php(12): variant->Open('demo.doc') #1 {main} thrown in C:\Program Files\EasyPHP 2.0b1\www\docfile.php on line 12
help me to remove this problem
From Kevin
Commented August 12th, 2008 8:42 pm
For all the people who are having the "file could not be found" problem, try using the full path, eg "c:\my folder\my subfolder\myworddoc.doc".
My problem is, that Word 2000 won't accept the parameter 2. I'm assuming that the above COM functions work in Word 2003 and 2007?
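Editor's sketch of the full-path tip: Word runs as a separate process with its own working directory, so a relative name such as "myfile.doc" is resolved somewhere other than the PHP script's folder. Resolving the path in PHP first sidesteps that. The helper name is hypothetical.

```php
<?php
// Hypothetical helper: turn a relative path into an absolute one, or fail loudly.
function absolute_doc_path($path)
{
    $full = realpath($path);   // returns false when the file does not exist
    if ($full === false) {
        throw new InvalidArgumentException("File not found: $path");
    }
    return $full;
}

// Usage with the post's code:
// $word->Documents->Open(absolute_doc_path("myfile.doc"));
```

Failing in PHP before Word is involved also gives a clearer message than the COM exception quoted in the comments above.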
From Marcos
Commented August 29th, 2008 10:54 am
For those of you needing to read word documents on a linux box . There is antiword.
Take a look here of how it works:
http://www.linux.com/articles/52385
Maybe it's ok for those needing to parse a word file
From kazey
Commented November 16th, 2008 10:53 am
Does this method support .docx file conversion?
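Editor's note on .docx: a copy of Word that can open .docx files should accept them through the same Open call, but the post's substr($filename,0,-4) step assumes a three-letter extension and leaves a stray dot for ".docx". A sketch of an extension-agnostic replacement, with a hypothetical helper name:

```php
<?php
// Hypothetical helper: swap whatever extension the file has for a new one.
function converted_filename($docPath, $extension)
{
    $info = pathinfo($docPath);
    return $info['dirname'] . DIRECTORY_SEPARATOR . $info['filename'] . '.' . $extension;
}

// Usage with the post's code:
// $new_filename = converted_filename($filename, 'txt');
```

pathinfo() splits on the last dot, so "report.doc" and "report.docx" both become "report.txt".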
From LongEric
Commented December 29th, 2008 5:52 pm
Great!!! It took some reading through all the useful comments here, but now I got the script working (see below) and it outputs the Word file in the browser (HTML).
Below you see how to get the full path if the script gives an error not able to open the word file.
<?php
// basename() keeps a crafted filename from pointing outside the script's directory
if( isset($_REQUEST['filename'] ) )
$filename= $_SERVER['DOCUMENT_ROOT'].'/'.dirname($_SERVER['PHP_SELF']).'/'.basename($_REQUEST['filename']);
else
die( "use: word2txt?filename=path");
if( file_exists( $filename))
echo "Opening document: $filename";
else
die( "File not found: '$filename'");
$word = new COM("word.application") or die("Unable to instantiate Word");
$word->Documents->Open($filename);
$new_filename = substr($filename,0,-4) . ".txt";
// the '2' parameter specifies saving in txt format
$word->Documents[1]->SaveAs($new_filename,2);
$word->Documents[1]->Close(false);
$word->Quit();
//$word->Release();
$word = NULL;
unset($word);
$array= file( $new_filename);
foreach( $array as $line)
echo $line.'';
?>
From Mike
Commented January 15th, 2009 10:59 am
How is the identification of the file that will be counted fit in the script? in another words where do you make the instance for the file?
Would be cool to Upload the .doc file and count it
From Bharanikumar
Commented February 12th, 2009 12:23 am
hi all
First i tell you the lot of your comments for COM PHP,
More persons problem is ,They know the code for read the
word document, but they got the Error always or sometime
This is the Error
loaded , word version12.0
Fatal error: Uncaught exception 'com_exception' with message 'Error [0x80020003] Member not found. ' in E:\WorkingProjects\ameexImapAjax\AmeexImapTested\source\createworddocumet.php:15 Stack trace: #0 E:\WorkingProjects\ameexImapAjax\AmeexImapTested\source\createworddocumet.php(15): com->Release() #1 {main} thrown in E:\WorkingProjects\ameexImapAjax\AmeexImapTested\source\createworddocumet.php on line 15
And
Fatal error: Uncaught exception 'com_exception' with message 'Unable to lookup `Content': Call was rejected by callee. ' in E:\phpprojects\source\insertWord.php:5 Stack trace: #0 E:\phpprojects\source\insertWord.php(5): unknown() #1 {main}
These for sory of some Error,
Yes i know COM object trying to read the word document but due to some reason , its goes to fails,
We guys want that reason, how to fix this sort of exceptions,
thanks
expecting reply from all
Thanks
From Aira Pratama
Commented February 17th, 2009 9:53 pm
Hi all, I'm a beginner in PHP programming. Would you help me to show how to implement that PHP script code wordconvert.php.
I would like to build internal documentation for my office to open theirs history word document. I'm using Joomla as front-end for user interaction.
So when they search some keyword in word document, the result should show all related keyword in all word documents file.
Would you help me. Because I'm really confuse to solve this problem.
From Baz
Commented August 25th, 2009 5:19 am
hi there guys
I'm currently in the 3rd year of a computer science degree and therefore writing my dissertation, and in it I have to deal with Word files and invoices. Regarding some of the above comments, Programming PHP by O'Reilly seems to cover ALL Microsoft Office packages. How up to date it is, I do not know;
however, it may be a good start for some people here stating that they are new to PHP and wish to have a reference book.
From Brian
Commented October 19th, 2009 11:16 am
My script just hangs on
$word->Documents->Open($filename);
Has anyone else experience this and have a solution? This isn't just this script. Once I try to OPEN the document on any script they just hang.
Thanks!
From joy
Commented November 1st, 2009 10:01 am
Hi.
Your topic is very helpful.
One more question.If i want to preserve the formatting what needs to be done.say for example,somebody upload a cv in word format and i want to convert it into html file with same formatting.
please advice.
From sajad aziz
Commented December 23rd, 2009 1:43 am
thanks for your support. i have used this peice of code
$filename= "c:/fo.doc";
//$content = shell_exec('C:/antiword/Docs/work2.doc');
//print $content."who";
$word = new COM("word.application") or die("Unable to instantiate Word");
$word->Documents->Open($filename);
$new_filename = substr($filename,0,-4) . ".txt";
// the '2' parameter specifies saving in txt format
//$word->Documents[1]->SaveAs($new_filename,2);
$word->Documents[1]->Close(false);
$word->Quit();
$word->Release();
$word = NULL;
unset($word);
$fh = fopen($new_filename, 'r');
// this is where we exit Hell
$contents = fread($fh, filesize($new_filename));
fclose($fh);
unlink($new_filename);
this code hangs the browser and shows nothing..
please help me in this regard
From kapil
Commented January 9th, 2010 1:00 am
i have a doc file stored in my mysql database .now i want to read that file from where .I tried the above code given by you but it does not give work.
the problem is that is is giving an error like file name is uncorrect and get hanged.
plz tell me the solutions?
From ROopesh
Commented January 19th, 2010 2:34 am
Hi,
while using your code or any code which got fromforums, to read a document file.
$word->Documents->Open($filename);
Stucks in this line, page gets loading... loading... no result
I'm testing in local system in Windows XP office 2007 installed.
Can you please guide me to over come this?
thanks
roopesh
From Jason
Commented January 29th, 2010 10:22 am
Uncaught exception 'com_exception' with message 'Failed to create COM object `word.application': The server process could not be started because the configured identity is incorrect. Check the username and password. ' in path\InsertWordFields.php:3 Stack trace: #0 path\InsertWordFields.php(3): com->com('word.applicatio...') #1 {main} thrown in path\InsertWordFields.php on line 3
Line 3 is
$word = new COM("word.application") or die("Unable to instantiate Word");
I'm having trouble setting up permissions for COM on IIS7 for word. Any suggestions?
From Ian
Commented March 11th, 2010 5:39 pm
For all those whose script gets stuck at
$word->Documents->Open($filename);
It's because the word app on your server has a copy open already, you need to quit the winword.exe with the task manager and delete the temp file word uses to lock the file.
I'm at the next hurdle, reading the bookmarks, getting the following error
'Unable to lookup `Bookmarks'
Any answers!
Cheers
Ian
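Editor's sketch for the hanging Open call. Besides the locked-file cause Ian describes, another possible cause (an assumption here, not something this thread confirms) is Word waiting on a dialog box that nobody can see or click on a server. Suppressing alerts and opening the file read-only may help. The helper name is hypothetical.

```php
<?php
// Hypothetical helper: $word is an already created COM("word.application") object.
function open_word_readonly($word, $docPath)
{
    $word->Visible = false;
    $word->DisplayAlerts = 0;   // 0 = wdAlertsNone: do not show alert dialogs
    // Documents.Open(FileName, ConfirmConversions, ReadOnly)
    return $word->Documents->Open($docPath, false, true);
}
```

Opening read-only means Word does not need an exclusive lock on the document, which is relevant when an earlier, crashed run still holds one.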
From Mathew Anderson
Commented March 14th, 2010 10:37 pm
Can this be done using php on linux host, without having to use COM ?
From vishnu
Commented March 29th, 2010 3:21 am
hi i need to read the exact format of doc or docx files using php can any body help me
thank u
From Adrian
Commented June 24th, 2010 4:44 am
> When I instantiate a COM handle to Word,
> what methods are at my disposal?
http://msdn.microsoft.com/en-us/library/microsoft.office.interop.word.application_members.aspx
Also look at;
http://msdn.microsoft.com/en-us/library/kw65a0we%28VS.80%29.aspx
http://msdn.microsoft.com/en-us/library/78whx7s6%28v=VS.80%29.aspx
Also, if you use proper development tools that understand COM then the tools will tell you what methods and properties are available on a COM object without you having to consult the extensive documentation. However I won't stop a lack of proper dev environment and lack of ability to use google from letting a bad workman blame his tools.
For the chap with Outlook requirements;
http://msdn.microsoft.com/en-us/library/ms268893%28VS.80%29.aspx
http://msdn.microsoft.com/en-us/library/ms268731%28v=VS.80%29.aspx
Thankfully Outlook automation is far easier than Word given the fixed nature of the data you're dealing with and there are loads of very good websites that deal with this stuff if you have a google. The automation principles are the same as those with Word. Which brings me onto….
To everyone else having problems, you have to ensure that the Word application is installed on the server for this code to work. However above all else I'd actually tell you to just abandon this method; it's not supported and not recommended in a web environment. There are a few different products that let you read Word docs that are written to work in a web environment, so you should really be using those.
http://support.microsoft.com/kb/257757
The thing to remember above all is that it is MS's fault that your dev tools don't use COM properly, it is MS's fault that you can't find the ample documentation on the subject, and it is MS's fault that a desktop application shouldn't be used in code running on a web server as every other desktop application can be automated and interacted with perfectly well from web server code, and it's probably MS's fault that you can't just write a simple, professional, factual blog to help people. Still, I suppose that's what "open source" is all about.
From Adrian
Commented June 24th, 2010 4:49 am
> Can this be done using php on linux host, without having to use COM ?
Not using the above technique, no. Word can only be automated on a Windows platform (or I suppose any other platform that supports COM and Windows binaries) via a programming language that can interact with COM objects.
You'll have to use one of the other products mentioned like antiword or something that does the same job and is suitable for execution within a web environment on your chosen platform.
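Editor's sketch of one COM-free route for the Linux questions, limited to .docx: a .docx file is a zip archive whose body text lives in word/document.xml, so PHP can read it without Word. This does not help with the older binary .doc format, for which tools like antiword remain the option mentioned in the thread. The function names are hypothetical, and docx_to_text() assumes PHP's zip extension is installed.

```php
<?php
// Namespace of the WordprocessingML elements in word/document.xml.
define('WORDML_NS', 'http://schemas.openxmlformats.org/wordprocessingml/2006/main');

// Hypothetical helper: turn the contents of word/document.xml into plain text.
function document_xml_to_text($xml)
{
    $dom = new DOMDocument();
    if (!$dom->loadXML($xml)) {
        throw new RuntimeException('document.xml is not well-formed XML');
    }
    $lines = array();
    // Each <w:p> element is one paragraph; textContent joins the text of its runs.
    foreach ($dom->getElementsByTagNameNS(WORDML_NS, 'p') as $paragraph) {
        $lines[] = $paragraph->textContent;
    }
    return implode("\n", $lines);
}

// Hypothetical helper: needs PHP's zip extension (ZipArchive).
function docx_to_text($docxPath)
{
    $zip = new ZipArchive();
    if ($zip->open($docxPath) !== true) {
        throw new RuntimeException("Cannot open $docxPath as a zip archive");
    }
    $xml = $zip->getFromName('word/document.xml');
    $zip->close();
    if ($xml === false) {
        throw new RuntimeException("No word/document.xml inside $docxPath");
    }
    return document_xml_to_text($xml);
}
```

This returns body text only. Headers, footers, and footnotes are stored in other parts of the archive, and text boxes nested inside a paragraph may appear twice.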
From Jaffar
Commented July 4th, 2010 9:24 pm
I want to read doc file which is located in my Xampp directory ,while trying this I got exception of Com...
It work fine when I doc file is located on DESKTOP.....
Please Figure Out..
From mediavince
Commented October 21st, 2010 1:12 pm
try and use openoffice with odt files...
(to go from doc to odt if needed on linux cli: unoconv or jodconverter)
http://www.phpclasses.org/package/2586-PHP-Convert-OpenOffice-Writer-documents-to-HTML.html
From Anton
Commented May 25th, 2011 9:05 pm
Hi All, just sharing
I have problem open document in a folder like
"C:\sample\test file to be read by php.doc" as filename.
Then I changed the backslashes ( \ ) into forward slashes ( / ), and it works fine. So I changed the filename to
"C:/sample/test file to be read by php.doc".
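Editor's sketch of why Anton's change from backslashes to forward slashes helps: inside a double-quoted PHP string, "\t" is an escape sequence for a tab, so the folder name "\test..." silently turns into a tab followed by "est...". The paths below are shortened versions of Anton's example.

```php
<?php
// Double quotes: "\t" becomes a tab character, while "\s" is left as it is.
$double = "C:\sample\test file.doc";

// Single quotes: both backslashes are kept literally.
$single = 'C:\sample\test file.doc';

// Forward slashes: nothing to escape in either quoting style.
$slash = "C:/sample/test file.doc";

// The double-quoted version is one character shorter, because the two
// characters "\" and "t" collapsed into a single tab.
var_dump(strlen($single) - strlen($double));   // int(1)
```

Single quotes or doubled backslashes ("C:\\sample\\test file.doc") are the other two ways to keep such a path intact.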
From Felix
Commented November 17th, 2011 3:17 pm
can someone post a working code here?
From bapu
Commented January 21st, 2012 12:36 am
this code ok but iam getting one error is that Uncaught exception 'com_exception' with message 'Failed to create COM object `word.application': Invalid syntax '
From bapu
Commented January 21st, 2012 12:38 am
$word = new COM("word.application") or die("Unable to instantiate Word");
$word->Documents->Open($filename);
$new_filename = substr($filename,0,-4) . ".txt";
// the '2' parameter specifies saving in txt format
$word->Documents[1]->SaveAs($new_filename,2);
$word->Documents[1]->Close(false);
$word->Quit();
$word->Release();
$word = NULL;
unset($word);
$fh = fopen($new_filename, 'r');
// this is where we exit Hell
$contents = fread($fh, filesize($new_filename));
fclose($fh);
unlink($new_filename);
it ok but iam getting error is that Uncaught exception 'com_exception' with message 'Failed to create COM object `word.application': Invalid syntax '
From bapu
Commented January 21st, 2012 12:38 am
please help me.............
From Niva Hada
Commented January 23rd, 2012 9:07 pm
how to give a exact path in webserver while opening a file in word document. in local it works while giving path like "C:/xampp/htdocs/cms/filename.txt" but how to give path to webserver.Please Help me
From Manish
Commented March 13th, 2012 10:26 pm
hey ..
$word->Documents->Open($filename)or die("Cannot find file to convert");
its dying every time please help me out...
From ?Dariush
Commented May 11th, 2012 9:42 am
I want to create a word file and need your help.guide me please