Reading from a Word Document with COM in PHP

I love PHP. I love MySQL. They are powerful. They are easy to use. They are well documented.

I have no particular aversion to Microsoft Word. As a word processor, and more, it has served me well over the years. It has produced for me innumerable essays, reports, resumes, Engineering Department notices, and letters to Santa. I had never before had the pleasure of working with Word as a programmer.

A client wished to perform full-text searches on documents uploaded to her website. As you might expect, the Microsoft Word file format prevents one from simply reading in the text. Still, "No problem," we said. "I think we've heard of some COM platform that will let PHP talk to Word. We can definitely do this." You will notice that this is the moment at which Brooke and I took our first step into Hell.

You see, COM (Microsoft's Component Object Model) allows a programming language to interact directly with a Microsoft application, such as Internet Explorer, the shell, or Excel. In PHP, we should be able to run Word, open a document, and read from that document.

So, we started poking around online, looking for COM documentation and examples of similar implementations. The examples were there, albeit sparsely, but the documentation was mostly lacking. When I instantiate a COM handle to Word, what methods are at my disposal? No one would tell me. Furthermore, no one presented examples of opening a document, reading the entire contents, and closing it. It seems simple and universally useful, but it isn't there. Go look: I dare you to try.

I could write a new document or modify an existing one. I could read the first character or special 'bookmarked' characters. I could not just read the entire file. Just give me the text!

And then, I found this PHP class and I experienced an epiphany, a ray of sweet, warm sunlight shining on my cold, bare ass. I could open the Word document with COM in PHP, and then, without reading it, save it as a text file. AND THEN I COULD READ THE TEXT FILE.

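// $filename should be the full path to the .doc file; with a bare file name
// Word reports "file could not be found" (see the comments below)
$word = new COM("word.application") or die("Unable to instantiate Word");
$word->Documents->Open($filename);
// assumes a three-letter extension such as ".doc"
$new_filename = substr($filename,0,-4) . ".txt";
// the '2' parameter specifies saving in txt format (wdFormatText)
$word->Documents[1]->SaveAs($new_filename,2);
$word->Documents[1]->Close(false);
$word->Quit();
// Release() is a PHP 4 COM method; on PHP 5 it fails with "Member not found",
// so comment this line out there
$word->Release();
$word = NULL;
unset($word);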

$fh = fopen($new_filename, 'r');
// this is where we exit Hell
$contents = fread($fh, filesize($new_filename));
fclose($fh);
unlink($new_filename);

This method works! It actually works! I can actually have the contents of the Word document! Huzzah.
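
For anyone on a newer PHP, here is a minimal sketch of the same save-as-text trick wrapped in a function. It folds in two fixes reported in the comments below: it resolves the full path before handing it to Word, and it drops the Release() call. The function names word_to_text and text_path_for are hypothetical, and the sketch assumes PHP 5.5 or later (for finally), the COM extension, and Word installed on the same Windows machine.

```php
<?php
// Hypothetical helper: builds the temporary .txt path next to the document,
// whatever the length of the original extension (.doc, .docx, ...).
function text_path_for($docPath)
{
    $info = pathinfo($docPath);
    return $info['dirname'] . DIRECTORY_SEPARATOR . $info['filename'] . '.txt';
}

// Hypothetical wrapper around the save-as-text approach shown above.
function word_to_text($docPath)
{
    // Word needs the full path, so resolve it first.
    $fullPath = realpath($docPath);
    if ($fullPath === false) {
        throw new RuntimeException("File not found: $docPath");
    }
    $txtPath = text_path_for($fullPath);

    // On PHP 5 and later this throws a com_exception if Word cannot be started.
    $word = new COM('word.application');
    try {
        $doc = $word->Documents->Open($fullPath);
        $doc->SaveAs($txtPath, 2); // 2 = wdFormatText
        $doc->Close(false);        // false = do not save changes to the .doc
    } finally {
        // Always shut Word down, even if Open or SaveAs failed.
        $word->Quit();
        $word = null;
    }

    $contents = file_get_contents($txtPath);
    unlink($txtPath);
    return $contents;
}
```

Calling word_to_text('C:/docs/report.doc') would then return the document text as a string; the path here is only an illustration.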

I am posting this here, with attribution to the aforementioned PHP class for inspiration and for the format parameter to the SaveAs function, in the hope that some other hapless fool, attempting to complete the same task, will find solace in these lines. Feel free to contact me with any questions: I am more than happy to help you defeat the COM demon.

As a closing note, the second half of the task, intelligent full-text search, was rendered trivial, laughably easy, by the MySQL built-in full-text search functions. Thank you, Open Source. You win again.
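
To give an idea of how little code the search half takes, here is a rough sketch using PDO and MySQL's MATCH ... AGAINST. The table and column names (documents, filename, body) and the function name search_documents are assumptions made up for illustration; this is not the schema from the actual project.

```php
<?php
// Assumed, hypothetical schema:
//   CREATE TABLE documents (
//       id INT AUTO_INCREMENT PRIMARY KEY,
//       filename VARCHAR(255) NOT NULL,
//       body TEXT NOT NULL,
//       FULLTEXT KEY ft_body (body)
//   );

// Returns matching rows, best match first.
function search_documents(PDO $db, $terms)
{
    // MATCH ... AGAINST appears twice: once to compute the relevance score
    // and once to filter out rows that do not match at all.
    $sql = 'SELECT id, filename, MATCH(body) AGAINST (?) AS score
            FROM documents
            WHERE MATCH(body) AGAINST (?)
            ORDER BY score DESC';
    $stmt = $db->prepare($sql);
    $stmt->execute(array($terms, $terms));
    return $stmt->fetchAll(PDO::FETCH_ASSOC);
}
```

The text extracted from each uploaded Word document would be stored in the body column at upload time, so Word only has to run once per file.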


60 Comments

  1. From Brooke R

    Commented January 25th, 2007 11:52 pm

    Ha, so true. I think really, there was just one "example" that everyone re-posted.

    I still can't believe that 2 (two) hours of trolling through technet and msdn articles couldn't turn up a single list of available COM commands.

    Once again, good work piecing together what was available to make a workable solution.

    And let me second the "thank you" to OSS! MySQL FTW!

  2. From jayson

    Commented June 29th, 2007 10:54 pm

    It is great code you have there! :)
    I've already used this code, but there's a fatal error on the line "$word->Release();". Any ideas? Thanks. By the way, I want to learn how to use COM in PHP. Please recommend some books for reference. Thank you so much. I've been looking for this kind of code for almost a year.

  3. From drew

    Commented July 6th, 2007 9:52 am

    From an e-mail to Jayson:
    Hello Jayson,

    I am happy to hear that my code was helpful to you. I hope I can be of further service. I have done very little work with COM in PHP---this small project was in fact the first and last time. So, I cannot suggest any reference books to you on the subject. It is particularly difficult to find a book on the subject of accessing a Microsoft service (COM) with an open source programming language (PHP)---the majority of books available are for .NET or the like.

    Regarding your fatal error on $word->Release(), I would suggest commenting out that line and seeing if the code still works. That command is one of several that ensure the COM object is released and deleted and will not live on in memory. The command $word = NULL should accomplish this goal even without the Release() statement.

    Let me know if you have further questions or problems.

    Peace,
    Drew/Carlos d'Avis

  4. From Jakub Mroz

    Commented September 22nd, 2007 1:04 pm

    Wow... wondering if it's possible to exec other Windows applications through COM?

  5. From Ali

    Commented September 27th, 2007 9:06 pm

    Hi,

    I tried the above code but I am getting an error at the following line:

    $word->Documents->Open("myfile.doc");

    The error is:

    Warning: (null)(): Invoke() failed: Exception occurred. Source: Microsoft Word Description: The document name or path is not valid. Try one or more of the following: * Check the path to make sure it was typed correctly. * On the File menu, click Open. Search for the file using this dialog box. (myfile.doc)

    The file does exist, though.

    I am using PHP 4 on Windows XP with Apache.

  6. From Simon Huntley

    Commented October 9th, 2007 8:05 am

    I might try this code on the company intranet. Thank you for publishing this. I'll let you know how it goes.

    -Simon.

  7. From james clavel

    Commented October 23rd, 2007 7:51 pm

    yah i have the same problem with ali... hope you could help. thanks

  8. From drew

    Commented October 30th, 2007 11:22 am

    I will be revisiting this topic very soon with new blog posts and pages detailing the capabilities of PHP in dealing with Microsoft Office documents.

  9. From rick

    Commented November 9th, 2007 4:22 am

    I read this COM object only works on a windows webserver..

    I'm using a Linux webserver with php5/apache and I really NEED this functionality

    any hope?

    thnx

  10. From Sam

    Commented January 11th, 2008 3:50 am

    Hey -- Is there any way to specify the desired encoding when saving to txt? I believe the default is ISO-*, but I need UTF-8 (not UTF-16).

    If you know off the top of your head let me know.

    Thanks
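
One way to get UTF-8 without touching SaveAs's longer parameter list is to save as "Unicode text" (format 7, wdFormatUnicodeText) and convert the result in PHP. The sketch below assumes Word writes that format as UTF-16LE with a leading byte-order mark and that the mbstring extension is available; the function name utf16le_to_utf8 is hypothetical.

```php
<?php
// Converts the contents of a file saved by Word as "Unicode text" to UTF-8.
// Assumption: the input is UTF-16LE, optionally starting with a BOM (FF FE).
function utf16le_to_utf8($raw)
{
    // Drop the byte-order mark if present so it does not end up in the output.
    if (substr($raw, 0, 2) === "\xFF\xFE") {
        $raw = substr($raw, 2);
    }
    return mb_convert_encoding($raw, 'UTF-8', 'UTF-16LE');
}
```

With the code from the post, that means calling SaveAs($new_filename, 7) instead of SaveAs($new_filename, 2) and passing the file contents through utf16le_to_utf8() after reading them.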

  11. From Rajapriya

    Commented February 13th, 2008 4:38 am

    Thank you very much. I had this problem in my code. Unbelievable! It works nicely.

  12. From Jacka

    Commented February 18th, 2008 5:17 pm

    Thanks a lot for this example!
    You are so right... I was looking for this too, but I only found how to create or alter a document (though a spell check (!) is possible...).
    Curious.. ;o)

  13. From Draicone

    Commented March 15th, 2008 5:22 am

    This is still a bit of a hack though - you shouldn't need to save it as a new text document. The ActiveDocument property of a word.application instance is an instance of Document which has a Content property serving the same purpose. Try this blog post:

    http://www.developertutorials.com/blog/php/extracting-text-from-word-documents-via-php-and-com-81/

    Working with MS Word documents and COM makes much more sense if you've used VB in the past.

    Importing the MS Word library into a VB project gives you access to the object model via the object browser, and from there on it's smooth sailing given the level of detail in MSDN.

    ... of course, none of that will make sense unless you've developed for Windows before. Long story short, Word exposes itself like the DOM exposes itself in JS, and it makes sense to VB/VC# developers. Ask on a VB board about Word COM if you need help with a COM question; translating VB code into PHP COM code is really easy.
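
A minimal sketch of the Content approach described in this comment, skipping the temporary text file. It assumes PHP 5.5 or later, the COM extension, and a full path to the document; the function names are hypothetical. Word separates paragraphs in a range's text with a bare carriage return, so the sketch converts those to newlines.

```php
<?php
// Hypothetical helper: Word uses "\r" between paragraphs; turn it into "\n".
function normalize_word_newlines($text)
{
    return str_replace("\r", "\n", $text);
}

// Hypothetical function: reads the whole document body through
// Document.Content.Text instead of saving a .txt copy.
function word_content_text($fullPath)
{
    $word = new COM('word.application');
    try {
        $doc = $word->Documents->Open($fullPath);
        $text = (string) $doc->Content->Text;
        $doc->Close(false); // do not save changes
    } finally {
        $word->Quit();
        $word = null;
    }
    return normalize_word_newlines($text);
}
```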

  14. From osama

    Commented March 16th, 2008 2:56 am

    hi all...
    I have a problem when I try to use this code. The problem is this error:

    Fatal error: Uncaught exception 'com_exception' with message 'Source: Microsoft WordDescription: This file could not be found. Try one or more of the following: * Check the spelling of the name of the document. * Try a different file name. (document.doc)' in C:\Program Files\Apache Software Foundation\Apache2.2\htdocs\test\osama\word.php:133 Stack trace: #0 C:\Program Files\Apache Software Foundation\Apache2.2\htdocs\test\osama\word.php(133): variant->Open('document.doc') #1 {main} thrown in C:\Program Files\Apache Software Foundation\Apache2.2\htdocs\test\osama\word.php on line 133

    I need to read document files like MS Word, and when I read one the way I read a text file, the result looks corrupted and encrypted.
    Please, can anyone help me?

  15. From Esfandiar

    Commented March 19th, 2008 12:11 am

    I would love to know how to open an MS Word file safely, that is, making sure that viruses and worms don't infect your computer. Turning off macros and running virus protection is one way. Anything else?

    Also, how about opening the file with MS Word 2007 and parsing the XML?

    Any help would be appreciated.
    Thanks, contact: e.bandari@gmail.com.

  16. From Achmad

    Commented April 17th, 2008 11:19 pm

    hi i have problem same as osama.. how do i fix that??

  17. From Lupus

    Commented April 20th, 2008 10:03 am

    Any way to save a doc file to HTML? I would like to keep the tables, images, and anything else from the doc file.

    Thx
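
The second argument to SaveAs is a value from Word's WdSaveFormat list, so the same trick can target formats other than plain text. Below is a hedged sketch with a small lookup table; the function names are hypothetical, PHP 5.5 or later is assumed, and the PDF entry only applies to Word versions that can save PDF themselves (Word 2007 with PDF support, or later). When saving as HTML, Word normally writes images to a companion folder next to the output file.

```php
<?php
// Hypothetical helper: maps a target extension to a WdSaveFormat value.
function word_save_format($extension)
{
    $formats = array(
        'txt'  => 2,  // wdFormatText
        'rtf'  => 6,  // wdFormatRTF
        'html' => 8,  // wdFormatHTML
        'pdf'  => 17, // wdFormatPDF
    );
    $key = strtolower($extension);
    if (!isset($formats[$key])) {
        throw new InvalidArgumentException("Unsupported target format: $extension");
    }
    return $formats[$key];
}

// Hypothetical function: saves a copy of the document in the requested
// format next to the original and returns the new file's path.
function convert_word_document($fullPath, $extension)
{
    $format = word_save_format($extension); // fail early, before starting Word
    $info = pathinfo($fullPath);
    $target = $info['dirname'] . DIRECTORY_SEPARATOR
            . $info['filename'] . '.' . strtolower($extension);

    $word = new COM('word.application');
    try {
        $doc = $word->Documents->Open($fullPath);
        $doc->SaveAs($target, $format);
        $doc->Close(false);
    } finally {
        $word->Quit();
        $word = null;
    }
    return $target;
}
```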

  18. From drew

    Commented April 22nd, 2008 9:32 am

    This post has proven to be my most popular without doubt. I will very soon be revisiting this programming hurdle and attempting to answer some of your questions.

    Thank you for your patience and interest!

  19. From Esfandiar

    Commented April 29th, 2008 2:18 pm

    Question to all:

    Is there a way to display MS Word 2007 OpenXML documents in a browser? Thanks in advance, Esfandiar
    --
    Esfandiar Bandari, PhD, MBA
    e.bandari@cantab.net, e.bandari@gmail.com
    skype: ebbandari & gtalk: e.bandari
    H. (650) 964-4154 Cell: (650) 862-8351
    http://www.linkedin.com/in/ebandari

  20. From eric

    Commented May 6th, 2008 5:30 am

    how do i download and install the COM component with php

  21. From swathi

    Commented May 29th, 2008 12:06 am

    i have same problem as ali

  22. From eSolutions

    Commented June 7th, 2008 10:33 am

    Is there any way to read a Word (.doc) file and create a PDF file online?

    If it is possible, then please tell me.
    I would be very, very thankful to you.
    Please...

  23. From anjali

    Commented June 16th, 2008 12:05 am

    Thanks,
    It's very helpful for reading a Microsoft Word file in PHP.
    But I am trying to make a script which can read all emails from Outlook using COM.
    Do you have any idea?
    Can you please help me?

  24. From Meenu

    Commented July 8th, 2008 12:07 am

    Where is the COM class file

  25. From kashif

    Commented July 8th, 2008 3:52 am

    I want to read tables, images, etc. from the Word file. For simple text this is the best script, but I am searching for a script that can read the images.

    If anybody knows, kindly help me with this.
    Thanks

    Kashif
    Kashifyh@gmail.com

  26. From sunitha

    Commented July 28th, 2008 4:31 am

    Hi, I have a problem on my site, which I am building in PHP.
    I am not able to open the content of .doc files; if I open a .doc file, it displays encoded data. Please help me, it's very urgent.

  27. From sunitha

    Commented July 28th, 2008 4:35 am

    how do i download and install the COM component with php

  28. From Arivusudar

    Commented July 30th, 2008 5:04 am

    How can i solve this:
    Fatal error: Uncaught exception 'com_exception' with message 'Failed to create COM object `word.application': Server execution failed ' in C:\wamp\www\check\index.php:3 Stack trace: #0 C:\wamp\www\check\index.php(3): com->com('word.applicatio...') #1 {main} thrown in C:\wamp\www\check\index.php on line 3

  29. From Arivusudar

    Commented July 30th, 2008 9:29 pm

    Your code is very nice. It is working on localhost but it is not working on my web server.
    It is showing an error. Can you tell me what I can do to clear the error?

  30. From UTKARSH DIXIT

    Commented August 4th, 2008 9:57 pm

    I have written the following code

    $word->Documents->Open($filename);
    $newfilename = substr($filename,0,-4) . ".txt";

    // the '2' parameter specifies saving in txt format

    $word->Documents[1]->SaveAs($newfilename,2);
    $word->Documents[1]->Close(false);
    $word->Quit();
    $word->Release();
    $word = NULL;
    unset($word);

    $fh = fopen($newfilename, 'r');
    // this is where we exit Hell

    $contents = fread($fh, filesize($newfilename));
    fclose($fh); unlink($newfilename);

    ?>

    but I was getting the following error:
    Fatal error: Uncaught exception 'com_exception' with message 'Source: Microsoft WordDescription: This file could not be found. Try one or more of the following: * Check the spelling of the name of the document. * Try a different file name. (demo.doc)' in C:\Program Files\EasyPHP 2.0b1\www\docfile.php:12 Stack trace: #0 C:\Program Files\EasyPHP 2.0b1\www\docfile.php(12): variant->Open('demo.doc') #1 {main} thrown in C:\Program Files\EasyPHP 2.0b1\www\docfile.php on line 12
    Help me to fix this problem.

  31. From Kevin

    Commented August 12th, 2008 8:42 pm

    For all the people who are having the "file could not be found" problem, try using the full path, eg "c:\my folder\my subfolder\myworddoc.doc".

    My problem is, that Word 2000 won't accept the parameter 2. I'm assuming that the above COM functions work in Word 2003 and 2007?

  32. From Marcos

    Commented August 29th, 2008 10:54 am

    For those of you needing to read Word documents on a Linux box, there is antiword.

    Take a look here to see how it works:

    http://www.linux.com/articles/52385

    Maybe it's ok for those needing to parse a word file
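
A minimal sketch of calling antiword from PHP, assuming the antiword binary is installed and on the PATH of the user the web server runs as. The -m UTF-8.txt option selects antiword's UTF-8 character mapping; the function names are hypothetical.

```php
<?php
// Hypothetical helper: builds the shell command, quoting the path safely.
function antiword_command($docPath)
{
    return 'antiword -m UTF-8.txt ' . escapeshellarg($docPath);
}

// Hypothetical function: returns the text antiword extracts from a .doc file.
function antiword_text($docPath)
{
    $output = shell_exec(antiword_command($docPath));
    // shell_exec gives back null (or false) when nothing could be read.
    if ($output === null || $output === false) {
        throw new RuntimeException("antiword produced no output for $docPath");
    }
    return $output;
}
```

antiword only understands the older binary .doc format, so this route does not cover .docx files.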

  33. From kazey

    Commented November 16th, 2008 10:53 am

    Does this method support .docx file conversion?
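
A note on this question: as written, the code in the post strips the last four characters of the file name, which assumes a three-letter extension, so a .docx name would need different handling. A .docx file is also just a ZIP archive with the body text in word/document.xml, so the text can be pulled out without Word or COM at all. Here is a rough sketch, assuming PHP 5.4 or later with the zip extension; the function names are hypothetical, and the tag stripping is deliberately crude (it ignores headers, footnotes, tabs, and manual line breaks).

```php
<?php
// Hypothetical helper: turns the XML of word/document.xml into plain text.
function docx_xml_to_text($xml)
{
    // Each </w:p> closes a paragraph; mark it with a newline before
    // removing all tags.
    $xml = str_replace('</w:p>', "\n", $xml);
    return html_entity_decode(strip_tags($xml), ENT_QUOTES | ENT_XML1, 'UTF-8');
}

// Hypothetical function: extracts the body text of a .docx file.
function docx_text($path)
{
    $zip = new ZipArchive();
    if ($zip->open($path) !== true) {
        throw new RuntimeException("Cannot open $path as a ZIP archive");
    }
    $xml = $zip->getFromName('word/document.xml');
    $zip->close();
    if ($xml === false) {
        throw new RuntimeException("No word/document.xml found in $path");
    }
    return docx_xml_to_text($xml);
}
```

Because this needs no Windows machine, it also works on a Linux host.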

  34. From LongEric

    Commented December 29th, 2008 5:52 pm

    Great!!! It took some reading through all the useful comments here, but now I have the script working (see below), and it outputs the Word file in the browser (HTML :-)
    Below you can see how to build the full path, in case the script gives an error about not being able to open the Word file.

    <?php

    if( isset($_REQUEST['filename'] ) )
    $filename= $_SERVER['DOCUMENT_ROOT'].'/'.dirname($_SERVER['PHP_SELF']).'/'.$_REQUEST['filename'];
    else
    die( "use: word2txt?filename=path");

    if( file_exists( $filename))
    echo "Opening document: $filename";
    else
    die( "File not found: '$filename'");

    $word = new COM("word.application") or die("Unable to instantiate Word");
    $word->Documents->Open($filename);
    $new_filename = substr($filename,0,-4) . ".txt";
    // the '2' parameter specifies saving in txt format
    $word->Documents[1]->SaveAs($new_filename,2);
    $word->Documents[1]->Close(false);
    $word->Quit();
    //$word->Release();
    $word = NULL;
    unset($word);

    $array= file( $new_filename);

    foreach( $array as $line)
    echo $line.'';
    ?>

  35. From Mike

    Commented January 15th, 2009 10:59 am

    How is the identification of the file that will be counted fit in the script? in another words where do you make the instance for the file?

    Would be cool to Upload the .doc file and count it

  36. From Bharanikumar

    Commented February 12th, 2009 12:23 am

    hi all

    First, I have read a lot of your comments about COM in PHP.

    Many people have the same problem: they know the code for reading the Word document, but they get an error, always or some of the time.

    These are the errors:

    loaded , word version12.0
    Fatal error: Uncaught exception 'com_exception' with message 'Error [0x80020003] Member not found. ' in E:\WorkingProjects\ameexImapAjax\AmeexImapTested\source\createworddocumet.php:15 Stack trace: #0 E:\WorkingProjects\ameexImapAjax\AmeexImapTested\source\createworddocumet.php(15): com->Release() #1 {main} thrown in E:\WorkingProjects\ameexImapAjax\AmeexImapTested\source\createworddocumet.php on line 15

    And

    Fatal error: Uncaught exception 'com_exception' with message 'Unable to lookup `Content': Call was rejected by callee. ' in E:\phpprojects\source\insertWord.php:5 Stack trace: #0 E:\phpprojects\source\insertWord.php(5): unknown() #1 {main}

    These are some of the errors.

    I know the COM object is trying to read the Word document but, for some reason, it fails.

    We want to know that reason and how to fix this sort of exception.

    thanks

    expecting reply from all

    Thanks

  37. From Aira Pratama

    Commented February 17th, 2009 9:53 pm

    Hi all, I'm a beginner in PHP programming. Would you help me by showing how to implement that PHP script, wordconvert.php?
    I would like to build internal documentation for my office so people can open their historical Word documents. I'm using Joomla as the front end for user interaction.
    So when they search for a keyword in Word documents, the result should show every Word document file containing that keyword.

    Would you help me? I'm really confused about how to solve this problem.

  38. From Baz

    Commented August 25th, 2009 5:19 am

    hi there guys

    I'm currently in the third year of a computer science degree and therefore working on my dissertation, in which I have to deal with Word files and invoices. Regarding some of the above comments, Programming PHP by O'Reilly seems to cover ALL Microsoft Office packages. How up to date it is, I do not know.

    However, it may be a good start for some people here who say that they are new to PHP and wish to have a reference book.

  39. From Brian

    Commented October 19th, 2009 11:16 am

    My script just hangs on

    $word->Documents->Open($filename);

    Has anyone else experienced this and found a solution? This isn't just this script. Once I try to OPEN the document, any script just hangs.

    Thanks!

  40. From joy

    Commented November 1st, 2009 10:01 am

    Hi.
    Your topic is very helpful.
    One more question. If I want to preserve the formatting, what needs to be done? Say, for example, somebody uploads a CV in Word format and I want to convert it into an HTML file with the same formatting.
    Please advise.

  41. From sajad aziz

    Commented December 23rd, 2009 1:43 am

    Thanks for your support. I have used this piece of code:

    $filename= "c:/fo.doc";
    //$content = shell_exec('C:/antiword/Docs/work2.doc');
    //print $content."who";

    $word = new COM("word.application") or die("Unable to instantiate Word");
    $word->Documents->Open($filename);
    $new_filename = substr($filename,0,-4) . ".txt";

    // the '2' parameter specifies saving in txt format

    //$word->Documents[1]->SaveAs($new_filename,2);
    $word->Documents[1]->Close(false);
    $word->Quit();
    $word->Release();
    $word = NULL;
    unset($word);

    $fh = fopen($new_filename, 'r');

    // this is where we exit Hell

    $contents = fread($fh, filesize($new_filename));
    fclose($fh);
    unlink($new_filename);

    This code hangs the browser and shows nothing.

    Please help me in this regard.

  42. From kapil

    Commented January 9th, 2010 1:00 am

    I have a .doc file stored in my MySQL database. Now I want to read that file from there. I tried the code you gave above, but it does not work.
    The problem is that it gives an error saying the file name is incorrect, and then it hangs.
    Please tell me the solution.

  43. From ROopesh

    Commented January 19th, 2010 2:34 am

    Hi,

    While using your code, or any code I got from forums, to read a document file,

    $word->Documents->Open($filename);

    it gets stuck on this line. The page keeps loading... loading... with no result.
    I'm testing on a local system running Windows XP with Office 2007 installed.

    Can you please guide me to overcome this?

    thanks
    roopesh

  44. From Jason

    Commented January 29th, 2010 10:22 am

    Uncaught exception 'com_exception' with message 'Failed to create COM object `word.application': The server process could not be started because the configured identity is incorrect. Check the username and password. ' in path\InsertWordFields.php:3 Stack trace: #0 path\InsertWordFields.php(3): com->com('word.applicatio...') #1 {main} thrown in path\InsertWordFields.php on line 3

    Line 3 is
    $word = new COM("word.application") or die("Unable to instantiate Word");

    I'm having trouble setting up permissions for COM on IIS7 for word. Any suggestions?

  45. From Ian

    Commented March 11th, 2010 5:39 pm

    For all those whose scripts get stuck at
    $word->Documents->Open($filename);

    It's because the Word app on your server has a copy open already. You need to quit winword.exe with the Task Manager and delete the temp file Word uses to lock the file.

    I'm at the next hurdle, reading the bookmarks, getting the following error
    'Unable to lookup `Bookmarks'

    Any answers!

    Cheers
    Ian

  46. From Mathew Anderson

    Commented March 14th, 2010 10:37 pm

    Can this be done using php on linux host, without having to use COM ?

  47. From vishnu

    Commented March 29th, 2010 3:21 am

    Hi, I need to read doc or docx files with their exact formatting using PHP. Can anybody help me?
    Thank you

  48. From Adrian

    Commented June 24th, 2010 4:44 am

    > When I instantiate a COM handle to Word,
    > what methods are at my disposal?

    http://msdn.microsoft.com/en-us/library/microsoft.office.interop.word.application_members.aspx

    Also look at;

    http://msdn.microsoft.com/en-us/library/kw65a0we%28VS.80%29.aspx
    http://msdn.microsoft.com/en-us/library/78whx7s6%28v=VS.80%29.aspx

    Also, if you use proper development tools that understand COM then the tools will tell you what methods and properties are available on a COM object without you having to consult the extensive documentation. However I won't stop a lack of proper dev environment and lack of ability to use google from letting a bad workman blame his tools.

    For the chap with Outlook requirements;
    http://msdn.microsoft.com/en-us/library/ms268893%28VS.80%29.aspx
    http://msdn.microsoft.com/en-us/library/ms268731%28v=VS.80%29.aspx

    Thankfully Outlook automation is far easier than Word given the fixed nature of the data you're dealing with, and there are loads of very good websites that deal with this stuff if you have a google. The automation principles are the same as those with Word. Which brings me onto...

    To everyone else having problems, you have to ensure that the Word application is installed on the server for this code to work. However above all else I'd actually tell you to just abandon this method; it's not supported and not recommended in a web environment. There are a few different products that let you read Word docs that are written to work in a web environment, so you should really be using those.

    http://support.microsoft.com/kb/257757

    The thing to remember above all is that it is MS's fault that your dev tools don't use COM properly, it is MS's fault that you can't find the ample documentation on the subject, and it is MS's fault that a desktop application shouldn't be used in code running on a web server as every other desktop application can be automated and interacted with perfectly well from web server code, and it's probably MS's fault that you can't just write a simple, professional, factual blog to help people. Still, I suppose that's what "open source" is all about.

  49. From Adrian

    Commented June 24th, 2010 4:49 am

    > Can this be done using php on linux host, without having to use COM ?

    Not using the above technique, no. Word can only be automated on a Windows platform (or I suppose any other platform that supports COM and Windows binaries) via a programming language that can interact with COM objects.

    You'll have to use one of the other products mentioned like antiword or something that does the same job and is suitable for execution within a web environment on your chosen platform.

  50. From Jaffar

    Commented July 4th, 2010 9:24 pm

    I want to read a .doc file which is located in my XAMPP directory. While trying this I got a COM exception.
    It works fine when the .doc file is located on the Desktop.

    Please figure it out.

  51. From mediavince

    Commented October 21st, 2010 1:12 pm

    try and use openoffice with odt files...
    (to go from doc to odt if needed on linux cli: unoconv or jodconverter)

    http://www.phpclasses.org/package/2586-PHP-Convert-OpenOffice-Writer-documents-to-HTML.html

  52. From Anton

    Commented May 25th, 2011 9:05 pm

    Hi All, just sharing

    I had a problem opening a document in a folder when using
    "C:\sample\test file to be read by php.doc" as the filename.

    Then I changed the backslashes ( \ ) into forward slashes ( / ), and it works fine. So the filename is now
    "C:/sample/test file to be read by php.doc".

  53. From Anonymous

    Commented November 3rd, 2011 6:12 am

    [...] [...]

  54. From Felix

    Commented November 17th, 2011 3:17 pm

    Can someone post working code here?

  55. From bapu

    Commented January 21st, 2012 12:36 am

    This code is OK, but I am getting one error: Uncaught exception 'com_exception' with message 'Failed to create COM object `word.application': Invalid syntax '

  56. From bapu

    Commented January 21st, 2012 12:38 am

    $word = new COM("word.application") or die("Unable to instantiate Word");
    $word->Documents->Open($filename);
    $new_filename = substr($filename,0,-4) . ".txt";
    // the '2' parameter specifies saving in txt format

    $word->Documents[1]->SaveAs($new_filename,2);
    $word->Documents[1]->Close(false);
    $word->Quit();
    $word->Release();
    $word = NULL;
    unset($word);
    $fh = fopen($new_filename, 'r');
    // this is where we exit Hell

    $contents = fread($fh, filesize($new_filename));
    fclose($fh);
    unlink($new_filename);

    It is OK, but I am getting this error: Uncaught exception 'com_exception' with message 'Failed to create COM object `word.application': Invalid syntax '

  57. From bapu

    Commented January 21st, 2012 12:38 am

    please help me.............

  58. From Niva Hada

    Commented January 23rd, 2012 9:07 pm

    How do I give an exact path on the web server when opening a file in Word? Locally it works when giving a path like "C:/xampp/htdocs/cms/filename.txt", but how do I give a path on the web server? Please help me.

  59. From Manish

    Commented March 13th, 2012 10:26 pm

    Hey...
    $word->Documents->Open($filename)or die("Cannot find file to convert");
    It's dying every time. Please help me out...

  60. From ?Dariush

    Commented May 11th, 2012 9:42 am

    I want to create a Word file and need your help. Please guide me.
