Softpanorama

May the source be with you, but remember the KISS principle ;-)
(slightly skeptical) Educational society promoting "Back to basics" movement against IT overcomplexity and  bastardization of classic Unix

Softpanorama MS Office Bulletin,2004

[Oct 7, 2004] Converting from Microsoft Word to HTML: Demoroniser is actually pretty good. See "demoroniser - correct moronic and gratuitously incompatible Microsoft HTML".

For an amusing account of the non-standard HTML code produced by the "Save As...HTML" feature, and a method to correct some of the more egregious mistakes, check out: http://www.fourmilab.ch/webtools/demoronizer

-- Andreas Yankopolus, December 31, 1998

mswordview is a Unix application which converts Word 8 documents to HTML. You can set netscape up to use mswordview to display .doc attachments.

http://www.csn.ul.ie/~caolan/docs/MSWordView.html

-- Andrew Morton, January 4, 1999

Star Office, which is free from Star Division, is an office suite (word processor, etc.). I have been successful using it to read MS Word files and to generate HTML versions of those files. I have not tried this with tables, however.

Found at http://www.stardivision.com, it installs easily on Solaris, Linux, etc. I think Windows NT/95 as well.


-- Patrick Logan, February 14, 1999

I finally gave up fighting with the staff over using Word, but I was able to convince them to save their work for the Web in RTF format (nearly a "standard"). We use RTF2HTML which results in doing a minimum of hand-coding. Although you don't get the source, you do get the scripts (not Perl) which are fully customizable. It doesn't have the overhead of running a CGI script and is available for most common platforms.

As always, YMMV (your mileage may vary).

Craig

-- Craig Burgess, April 29, 1999

The correct URL for the demoroniser is:

http://www.fourmilab.ch/webtools/demoroniser/

-- Frances Prevas, September 23, 1999

Every time I see a virii alert about MSWord macros I just chuckle to myself. Why in the world anyone would allow themself to be abused by that monstrosity (of MSWord) is beyond me.
For the best HTML developer/editor ever made for the Windoze environment set your browser to
NoteTabPro where for $19.95 you get an editor that I believe rivals EMACS and has a great feature known as clipbooks, which allow for customized development of anything an interactive developer would want (even does BINARIES!)
Available FREE Clipbooks:
HTML, CSS, Java/Script, SGML, and on and on... if the clipbook hasn't been developed, the open environment of NoteTabPro allows you to write your own clipbooks.
I have no financial interest in NoteTabPro but simply believe it is the best thing to come along since Gates invented the internet!
Check it out, trash MSWord, regain your HD space, your memory, and your upgrade-itis.

-- Mark Comerford, October 6, 1999

Give Word2000 a try

Sorry for the crass commercialism, but I noticed that you had not updated the information on this page to reflect the latest version of Microsoft Office. Unlike the previous two (I must admit, half-hearted) attempts at saving documents in HTML format, the (teeming millions of) Office developers went all out to make HTML a full-fledged document format and not just a poor cousin to the old binary formats.

 

If you just want really simple HTML, you might be better using FrontPage or Visual Notepad, but if you have existing documents in Word, Excel or PowerPoint, I think you'll find surprisingly good fidelity to the look of the original document and a much better use of HTML tags to represent that look.

 

By the way, the HTML for this comment was created by Word2000. 



-- Mike Koss, October 28, 1999

Every time I create an HTML file using the built-in HTML "converters" in MS Office97, I need to open the .htm file (another annoyance courtesy of Microsoft) in NoteTab Light and strip out all of the extra junk in the file (e.g. STYLE="vnd.ms-excel.numberformat:$#,##0") for every special number/date format produced by Excel or Access, not to mention all of the Font Face tags. By manually stripping the extra labels, I am able to get rid of several thousand bytes of extra ASCII garbage that does not add anything to my HTML. It's an extra step, but since I know that most people who look at my files are using dial-up connections from home, it is the least that I can do.

-- John Fracisco, October 29, 1999

Regarding the "Try Office 2000" comment above: take a look at the HTML source this thing produced.

First, an embedded stylesheet about 30 lines long, completely specifying margins, fonts, etc. for each paragraph class. Each style used several non-standard attributes and values. Then, for each paragraph, it added a <span style=''> wrapper to override the styles in the class. Finally, it added some scripting, apparently for the hell of it.

So the end answer is no, the output of Office 2000 is no better than any previous MS effort, and is probably worse. (The explanation is that they're using HTML as an actual complete file format -- a replacement for .doc. They're using stylesheets as the equivalent of templates. Because there are lots of things in MS formatting that HTML doesn't support, they have to use lots of non-standard extensions. As usual, MS has totally missed the point: HTML is *not* a formatting language, it's a semantic markup language.)

-- Steve Greenland, January 24, 2000

For mass find-and-replace features, I enjoy the Allaire products HomeSite and ColdFusion Studio. They let you run find and replace on whole folders. Makes changing MS HTML into legible/legal HTML a little easier. I've noticed MS HTML does sneaky stuff like incorrect nesting that works with IE, but breaks Netscape (subtle browser war tactic???). The Allaire products also have a "code sweeper" function which you can set to do things like "strip the font tag" or "strip ending P tags". You can customize the usage of all tags with code sweeper. Also there is a validate function that will point out the nesting errors. It's handy and so far my favorite editor. It has a WYSIWYG thing too, but it's not that great and I never use it. The 4.0 version of these programs works great on Windows98/NT but I would be careful with CF Studio 4.5. I had strange memory problems with it.

-- Phillip Harrington, January 25, 2000

I have recently started converting books for the Web, and luckily these books were produced in pagemaker 6.5. I had never used the HTML export from pm, because generally I start a project in dreamweaver. The HTML export from pagemaker essentially takes a pm style, and you map that style to H1, P, etc. There are some funky problems with font colors, but a search and replace in dreamweaver is fairly quick. {shameless plug} you can see how it finally turned out at my Georgia Coast book {/shameless plug} If you have any specific questions about how to use pagemaker you can mail me directly.

-- John Lenz, April 25, 2000

I would just like to add to the Word 2000 "thread". I am maintaining a site where the principal content producer uses Word. I get emailed the docs, then have to convert them to HTML. At first, I was just cutting and pasting into GoLive 5, and manually editing for lists and breaks etc. I was getting tired of this, and thought I would try the save as HTML option. I was completely amazed at the amount of rubbish in the resulting file, xml this, namespace that. I've gone back to cut'n'paste!

-- Mark Horrocks, November 23, 2000

MSWordView is now wvWare.

Having tried Office 2000 and Office 2001 (Mac) conversion to HTML and seen the awful results, the choice of a un*x (including MacOSX) converter is really cool.



-- Bob Kerstetter, January 11, 2001

Hi, regarding Microsoft 2000 and its inability to understand what the words "tidy code" mean. I used MS Word 2000 when our site needed to create an intranet and it was a total nightmare. It kept creating all these files and folders on the server that took up space and loading time. It created a file folder for every single page of HTML, which was just plain ignorant if you ask me. I persisted with this until my manager let me buy FrontPage 2000, which is still a little bit messy but better value. I feel that Word is okay for quick, simple pages that aren't going to need much maintenance. FrontPage is very well integrated with all the other MS packages but does tend to spit out some garbage in the form of FPDB includes etc. and doesn't tend to like files created in other Web design applications, having its own tilted view on the world. If you want a dynamic database-driven website, then FrontPage is great for the novice who hasn't got time to learn ASP, JavaScript etc. It's a good training ground to pick all that up. Thanks

-- Hazera Bibi, January 18, 2001

Try downloading this utility from the microsoft web site. It's saved me hours of reinputting line breaks after I've copied and pasted Word text into Dreamweaver. It's a real life-saver for webmasters.

Yes - it really does work!!

Word 2000 'crud' HTML filter: http://office.microsoft.com/downloads/2000/Msohtmf2.aspx

ruth arnold
www.spacehoppa.com



-- Ruth Arnold, May 30, 2001

I can't believe no one mentioned the fact that Dreamweaver (4.0 at least) has a function that will import Word-generated HTML and clean it up for you. It does a fabulous job and allows you to pick and choose how severe you want the clean-up to be.

-- kim simms, June 21, 2001

Another useful tool is Dave Raggett's Tidy program. It can be found at http://www.w3.org/People/Raggett/tidy/. It will clean up your HTML, and has numerous options so you can customize how it formats (or cleans) the HTML. It's been ported to most OSs, and the source code is available if you want to modify how it works. Since it's a command-line program, it can be hooked into any decent editor -- that is, ones that allow you to run programs and capture the output.



-- David Wall, August 15, 2001

Well, I was searching the web for conversion projects from MS-generated HTML files to pure HTML-tagged files... and I landed on this page and found a lot of useful info.

I have developed a Java/JSP/JavaScript/HTML-based web application that converts .txt files to .htm files. It gives the end user a choice of operations paragraph by paragraph, and the JSP then writes the processed paragraphs, with tags, to the .htm file. I found the speed of conversion to be about 45 to 50 files an hour! For this you have to save every MS-HTML file as .txt and then give it to my program as input. I recently converted about 1200 content files for a German website.

Anyone interested in offloading projects or additional info? Please do write to me at [email protected]

-- Benjamin Christopher, November 5, 2001

Yes, Dreamweaver has a nice utility for cleaning the HTML code generated by Microsoft Word. It also lets you choose how strong the cleaning should be, and it works satisfactorily for the Word 97 HTML code. Unfortunately, even with the strongest cleaning, it is not able to get rid of the <span style = ...> definition which Word 2000 puts at the beginning of every sentence. If you import a Word 2000 HTML file, you will not be able to change the font of the document except by editing the HTML source, line by line... I'll give the Office 2000 HTML filter 2.0 (by Microsoft) a try, hoping it works!

-- Luca Bonci, February 21, 2002

I have found that eWebEditPro from Ektron (www.ektron.com) does a pretty good job of cleaning up Word 2000. It will produce xhtml output. It's not perfect, though - maybe about the same as Dreamweaver 4 but I haven't tested the difference. I'd like to find something that strips off all the font styles and leaves layout structure in place.

-- Andy Harrison, May 15, 2002

After struggling with the crud you get out of Word, even if you copy and paste into an HTML-friendly editor, I came up with this approach using Ant 1.5's very nice ReplaceRegExp task (sorry about the formatting loss here, but I'm too tired to reformat this nicely for text right now):

<target name="strip-test">
  <replaceregexp flags="g" match="&lt;/FONT&gt;" replace="">
    <fileset dir="${publish.dir}"/>
  </replaceregexp>
  <replaceregexp flags="g" match="&lt;FONT(.*)&gt;" replace="">
    <fileset dir="${publish.dir}"/>
  </replaceregexp>
  <replaceregexp flags="g" match="&lt;P class=(.*)&gt;" replace="&lt;P&gt;">
    <fileset dir="${publish.dir}"/>
  </replaceregexp>
</target>

This effectively strips out the offensive font, style and non-standard <o:p> tag. There may be a way to optimize this by combining the replace expressions into a set of nested expressions, and it could easily be extended to strip out other junk. It's nicely speedy and easy to add to an Ant script for processing directories recursively leveraging Ant's <fileset> tag.

Ant is really wonderful if you're not familiar with it. Hope this is helpful to someone out there trying to clean their MS Word junk.

-- Daniel Seltzer, July 23, 2002
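
For readers without Ant, the same three replacements from Daniel Seltzer's target can be sketched in Python. This is only an illustration, not his method: the function name `strip_word_markup` and the sample string are made up, the `<o:p>` rule is an extra that the Ant target does not contain, and the patterns use `[^>]*` where the Ant target uses `(.*)`, because a greedy `(.*)` can run on to the last `>` on the line and swallow real content.

```python
import re

# Patterns mirror the three Ant replacements, but use [^>]* instead of (.*)
# so that a match stops at the end of the tag it started in.
FONT_CLOSE = re.compile(r"</FONT>", re.IGNORECASE)
FONT_OPEN = re.compile(r"<FONT[^>]*>", re.IGNORECASE)
P_CLASS = re.compile(r"<P\s+class=[^>]*>", re.IGNORECASE)
# Extra rule (an assumption, not in the Ant target): drop <o:p> and </o:p>.
O_P = re.compile(r"</?o:p[^>]*>", re.IGNORECASE)


def strip_word_markup(html: str) -> str:
    """Remove FONT tags, P class attributes and o:p tags from a string."""
    html = FONT_CLOSE.sub("", html)
    html = FONT_OPEN.sub("", html)
    html = P_CLASS.sub("<P>", html)
    return O_P.sub("", html)


# Made-up sample in the style of Word's HTML export.
sample = '<P class=MsoNormal><FONT face="Arial">Hello</FONT> <B>world</B><o:p></o:p></P>'
print(strip_word_markup(sample))  # <P>Hello <B>world</B></P>
```

To process a directory the way the `<fileset>` does, the function could be applied to each file's text in a loop; that part is left out here to keep the sketch short.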

Lots of wonderful information here guys, thanks.

Another utility to try is:

http://www.textism.com/resources/cleanwordhtml/

Using the *Office 2000 HTML Filter 2.0* from Microsoft and the page above does a good job of cleaning out Microsoft's, um, inaccuracies.

Now, if I can only teach my users not to use all caps . . .

-- Grey Gremlin, July 25, 2002

I have a vaguely db-backed personal web site on which I had to append some MSWord documents. The easiest solution I found is to open the Word file in OpenOffice Writer, (http://www.openoffice.org) save as HTML then edit the file in Emacs.

OpenOffice does generate a load of crap in the html file, but it's nowhere as bad as Word 2000. Most of my work is done by a rather ugly Emacs Macro (which should really be a Lisp procedure, but that will have to wait until I actually learn Lisp) to replace-regexp a couple of tags, namely :

- delete SPAN ("</?SPAN[^>]*>" -> "")

- remove attributes from p and h tags ("<H1[^>]*>" -> "<h1>")

The macro also adds calls to my header and footer scripts and edits the header to use external CSS stylesheets.

Overall it works rather well and I can get .doc files up really fast. I still have to correct a few things by hand but with something more involved than my macro (by a better programmer) I think that wouldn't even be necessary.

By the way OpenOffice (free version of StarOffice 6) is quite good. It does mostly everything MsOffice does and it's free. And the equation editor is *way* better.

-- Serge Boucher, December 30, 2002
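
The two replace-regexp rules Serge Boucher lists can also be tried outside Emacs. The sketch below is a rough Python equivalent under a few assumptions: the function name `simplify_tags` and the sample markup are invented for illustration, the second rule is generalized from H1 to P and H1 through H6, and closing tags are lowercased as well so the result stays consistent.

```python
import re


def simplify_tags(html: str) -> str:
    """Apply the two clean-up rules to a string of exported HTML."""
    # Rule 1: delete opening and closing SPAN tags, keeping their content.
    html = re.sub(r"</?SPAN[^>]*>", "", html, flags=re.IGNORECASE)
    # Rule 2: drop attributes from P and H1..H6 tags and lowercase the name.
    # The \b keeps the pattern from matching longer names such as <PRE>.
    return re.sub(
        r"<(/?)(P|H[1-6])\b[^>]*>",
        lambda m: "<" + m.group(1) + m.group(2).lower() + ">",
        html,
        flags=re.IGNORECASE,
    )


# Made-up sample; real exports will carry different attributes.
sample = (
    '<H1 CLASS="western" STYLE="margin-top: 0in">Title</H1>'
    '<P ALIGN=LEFT><SPAN LANG="en-US">Text</SPAN></P>'
)
print(simplify_tags(sample))  # <h1>Title</h1><p>Text</p>
```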

I am a farmer, not a computer expert, but here we are in 2003 and it seems even farms need web pages. I can accept that. I traded some beets for some web development work, and even got a crash course on using Dreamweaver on a Mac. Wow, could it be, a computer that actually works! Looks complicated though. Next I needed to change something and thought it would be a simple matter to open an HTML document in MS Word (the latest and greatest, in a university PC lab), make my changes and save as HTML. Word changed everything around so the images wouldn't display correctly on a Mac, and now I am wasting my time wading through code I don't understand. Sorry Mr. Gates, it looks like you still don't get it.

-- mac burgess, February 6, 2003

I'm sorry if someone has said this but there were a lot of responses and someone may have missed it. UK legislation to be introduced in about a year's time will be very harsh on company websites that do not offer adequate accessibility options - i.e. allowing users to change font or font size and bg colour - to help people with learning or reading difficulties. When Word creates HTML it seems to put so many tags in that a lot of these facilities will not work. This is perhaps a consideration if you are using Word for a vaguely commercial site, as the UK gov have claimed they will aggressively enforce this law.

-- Nathan Mcilree, July 1, 2003

We have found that the HTML exported from an MS Word document by OpenOffice 1.1 is much cleaner than the Microsoft version. In particular it uses relative font sizes rather than the idiotic point-sized fonts, so the user's screen preferences are honoured. The output file is also significantly smaller than the equivalent Word export (or indeed the original Word file).

-- Andrew Macpherson, March 18, 2004

Also there is a useful plain-vanilla utility called antiword (which does a nice job of just grabbing the text), useful for creating indexes and the like (I personally use it for Plone, an open source CMS)

The other utility around is wvWare, but this has a lot of cascaded dependencies, so it is difficult to compile (build from scratch, as there is no off-the-shelf version) on some systems.

Good luck, and I wish you well extracting your intellectual property from M$ proprietary format !

-- stu hannay, August 9, 2004

 

[Sept 9, 2004]  WindowsDevCenter.com Lightweight XML Editing in Word 2003 by Evan Lenz, coauthor of Office 2003 XML

Did you know that Word documents can be saved in XML format? As of Microsoft Office 2003, the second option in Word's Save As dialog--right under "Word Document (*.doc)"--is "XML Document (*.xml)". This format is Microsoft's own XML vocabulary for Word documents, called WordprocessingML (or sometimes just WordML).

 

The ability to save Word documents as XML is arguably the most important XML-related feature introduced in Word 2003. But you wouldn't know it from all the hype surrounding Word's new support for custom schemas. When Microsoft announced that Word would let you edit XML documents that conform to your own schema (not just the WordprocessingML one), we were rightly intrigued and even excited. The promise of using the world's most popular word processor to edit, say, Docbook documents was nothing less than astounding, and it caused quite a stir in the XML community.

Hope Deferred

Now that the dust has settled and Office 2003 has been available for almost a year, we've got a clearer picture of reality. While the XML features in Word, Excel, Access, and the new InfoPath application are truly impressive and useful, it's clear that Word 2003 doesn't support arbitrary XML editing. At least it doesn't line up with the picture Microsoft painted originally. For one, the custom schema functionality is available only with Office Professional or the stand-alone Word 2003. More importantly, the features don't live up to the hype. While, strictly speaking, you can edit custom XML in Word, you're limited to using schemas that have a very static, fill-in-the-blanks structure. That means no optional or repeating elements and certainly no mixed content--that is, if you want a minimally user-friendly experience.

Or you could force your users to apply XML elements to portions of their document manually, using the new XML Structure task pane with Show XML Tags turned on. In that case, yes, they could edit arbitrary XML documents, even those with mixed content. And yes, Word will let them know if they've done something invalid (though it won't stop them from doing it). But since the user has to do all the work, and since XML elements cannot be associated with style information, the experience is not close to being user friendly (let alone WYSIWYG).

Or you could try to script in all the user friendliness by hand through the new Document Actions task pane. Of course, you should plan on joining a monastery to learn Smart Document programming and the attendant asceticism you'll need in order to appreciate the usability (or lack thereof) of your efforts' final results. (Tell me again, why are we using Word?)

Or (finally) you could come to terms with the fact that the most important (and robust) XML feature that was introduced in Word 2003 is its capability to save documents in a lossless, well-formed, open XML format called WordprocessingML. Ways to use it for generating, transforming, converting, querying, and otherwise processing Word documents are only starting to be realized. Editing custom XML may not be WordprocessingML's killer app, but it does raise some interesting possibilities that we'll explore here.

A Lightweight XSLT-Based Approach

This article presents a lightweight approach to XML editing in Word. It's "lightweight" in that it ignores all of Word's built-in custom schema functionality. A nice side effect of this approach is that it works in all editions of Word 2003. All you need outside of Word is an XSLT processor. (If you do happen to have the advanced XML functionality, you can make use of Word's bundled XSLT processor, but that's not required.)

This approach to editing will work only when your XML format is isomorphic to the structure and styles of your Word documents. The document's markup will only be as rich as the styles that are applied to it, so this rules out full-on Docbook editing. Word doesn't work well for editing recursive markup structures in general, because it doesn't support recursive styles. Each paragraph has exactly one paragraph style, and each character is associated with exactly one character style. (Word does, however, provide a convenient representation of heading levels as hierarchical subsections, using the <wx:sub-section> element, which we'll see referenced in our example below.)

You can make a complete XML editing solution for Word by writing two XSLT style sheets:

  1. A style sheet to transform from your custom XML to WordprocessingML, and
  2. A style sheet to transform from WordprocessingML back to your custom XML.

The basic scenario goes like this: to edit a custom XML document, it must get transformed by XSLT (No. 1) into WordprocessingML so that a user can edit it in Word. After the user is finished editing the document, the resulting WordprocessingML must be transformed again (No. 2), back to the custom XML format.

Note: This article does not introduce WordprocessingML except by example. For more thorough coverage, refer to the Office 2003 XML sample chapter available online, called "The WordprocessingML Vocabulary" (PDF).

An Example

Before we look at the XSLT, here's a document that conforms to a dead-simple, Docbook-esque format that we'll be editing:

<?xml version="1.0"?>
<?mso-application progid="Word.Document"?>
<?xml-stylesheet type="text/xsl" href="article2wordml.xsl"?>
<article>
  <title>This is the article title</title>
  <section>
    <title>First section</title>
    <para>This is the <emphasis>first</emphasis> paragraph.</para>
    <para>This is the <strong>second</strong> paragraph.</para>
  </section>
  <section>
    <title>Second section</title>
    <para>This section will have some sub-sections.</para>
    <section>
      <title>First sub-section</title>
      <para>This is the paragraph text of the first sub-section.</para>
    </section>
    <section>
      <title>Second sub-section</title>
      <para>This is the paragraph text of the second sub-section.</para>
      <para>And here is another paragraph, just for the fun of it--with a
<a href="http://www.xmlportfolio.com/">hyperlink</a> to boot!</para>
    </section>
  </section>
</article>

Here is what we want this document to look like while it's being edited in Word:

As you can see, the XML has a few examples of mixed content, which are rendered in Word using character styles (italic, bold, and blue/underlined). The hierarchical sections of the XML document are rendered using a heading for each title (Heading 1 for the article title, and Heading 2, Heading 3, and so on for successively deep section titles).

The Code

Assuming we have this XML document lying around already and we want to let people edit it, we'll need an XSLT style sheet to transform it to WordprocessingML (style sheet No. 1 in the list above). This file is called article2wordml.xsl. It contains various template rules that map elements from the custom XML to elements and styles defined in WordprocessingML. For example, to turn <emphasis> elements into character runs with the Emphasis character style, we use the following template rule:

<!-- For text in <emphasis>, apply the "Emphasis" character style -->
  <xsl:template match="emphasis/text()">
    <w:r>
      <w:rPr>
        <w:rStyle w:val="Emphasis"/>
      </w:rPr>
      <w:t>
        <xsl:value-of select="."/>
      </w:t>
    </w:r>
  </xsl:template>

To turn section titles into hierarchical headings, we use this template rule:

<!-- Convert section titles to "Heading X" paragraphs -->
  <xsl:template match="section/title">
    <w:p>
      <w:pPr>
        <w:pStyle w:val="Heading{count(ancestor::section)+1}"/>
      </w:pPr>
      <xsl:apply-templates/>
    </w:p>
  </xsl:template>
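
The attribute value template `Heading{count(ancestor::section)+1}` may be easier to follow with the numbers worked out. The Python sketch below is an illustration only, not part of the style sheet: it walks a trimmed copy of the example article and reports which style name each title would receive. The function name `heading_styles` is made up, and the article title is included only because the same depth-plus-one arithmetic happens to give Heading1 for it; in the style sheet that title is handled by a separate rule.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the example article: titles only.
ARTICLE = """<article>
  <title>This is the article title</title>
  <section>
    <title>First section</title>
  </section>
  <section>
    <title>Second section</title>
    <section>
      <title>First sub-section</title>
    </section>
  </section>
</article>"""


def heading_styles(node, depth=0):
    """Yield (style name, title text) pairs in document order.

    depth is the number of <section> elements enclosing the title,
    i.e. the value of count(ancestor::section) in the XSLT rule.
    """
    for child in node:
        if child.tag == "title":
            yield f"Heading{depth + 1}", child.text
        elif child.tag == "section":
            yield from heading_styles(child, depth + 1)


for style, text in heading_styles(ET.fromstring(ARTICLE)):
    print(style, "-", text)
# Heading1 - This is the article title
# Heading2 - First section
# Heading2 - Second section
# Heading3 - First sub-section
```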

Once a user has made changes to the document from within Word, a new WordprocessingML document is saved and must be translated back to the custom XML format using style sheet No. 2 mentioned above. This style sheet, called wordml2article.xsl, has similar rules, except that they reflect the reverse mapping--from WordprocessingML to our custom XML format. For example, here's the rule that turns text in the Emphasis style into an <emphasis> element:

<!-- turn a run with the "Emphasis" character style into <emphasis> -->
  <xsl:template match="w:r[w:rPr/w:rStyle/@w:val='Emphasis']"
                mode="para-content">
    <emphasis>
      <xsl:copy-of select="w:t/text()"/>
    </emphasis>
  </xsl:template>

Here are the rules that convert the Heading paragraphs back to sections with titles:

<!-- Convert <wx:sub-section> elements to <section> elements -->
  <xsl:template match="wx:sub-section">
    <section>
      <xsl:apply-templates/>
    </section>
  </xsl:template>

<!-- Convert <w:p> paragraphs to <para> paragraphs -->
  <xsl:template match="w:p">
    <para>
      <xsl:apply-templates mode="para-content"/>
    </para>
  </xsl:template>

<!-- ...except for the first paragraph in a sub-section (Heading 1,2,3,...);
       the heading will be the <title> of the section -->
  <xsl:template match="wx:sub-section/w:p[1]">
    <title>
      <xsl:apply-templates mode="para-content"/>
    </title>
  </xsl:template>

For a complete investigation of the style sheets (including descriptive comments), see the full text of these files:

 

Formatting Restrictions

Word 2003 also quietly introduces a new feature called formatting restrictions. When you have formatting restrictions enabled, users are restricted to using the set of styles that you specify. They can't modify the styles, nor can they apply direct formatting (such as bold or italic) to their document. While not specifically an XML feature, this enables a sort of document validation that makes particular sense when you are using the lightweight XML editing approach described above. It lets you restrict the range of formatting constructs that your conversion XSLT will have to handle. Rather than writing a generic WordprocessingML transformation, your style sheet will have to handle only those Word documents that are restricted to a particular Word template and its styles. This is a global restriction--a set of allowed styles, as opposed to a content model schema. You can't, for example, enforce that the Emphasis character style be used only in Normal paragraphs. Nevertheless, it is a profoundly useful feature for XML editing applications in Word.

If you look back in article2wordml.xsl, you'll see that formatting restrictions are enabled as a document setting:

  <w:docPr>
    ...
    <w:documentProtection w:formatting="on" w:enforcement="on"/>
  </w:docPr>

The particular styles that are locked or unlocked are indicated as such in the WordprocessingML's global <w:styles> element. In this case, we restrict the users to only the styles they see in the "Styles and Formatting" task pane:

 

The other built-in styles normally available to Word users appear as if they don't even exist anymore.

Using Word's XSLT Processor

This editing "solution" will work regardless of the edition of Word 2003 you have, provided that you have an external XSLT processor to do the transformations between edits. But if you have Office Professional or the stand-alone Word 2003, then you don't need another XSLT processor; you can use the bundled XSLT processor that comes with those editions of Word. Looking back at our article XML example, we see two processing instructions:

<?mso-application progid="Word.Document"?>
<?xml-stylesheet type="text/xsl" href="article2wordml.xsl"?>

The mso-application processing instruction (PI) associates the XML file with the Word application, so that when a user double-clicks the file, Word opens the XML file, overriding whatever the default XML viewer is on their system. The second PI is useful only if you've got the advanced XML features. Upon opening the file, the user is presented with an option to apply article2wordml.xsl to the document, yielding the editing view we saw above. This is called an onload transformation.

Our other style sheet, wordml2article.xsl, is called an onsave style sheet, as it is applied to the WordprocessingML representation of the edited Word document when the user saves the document after making changes. How does Word know to use this style sheet, you ask? It is referenced inside the WordprocessingML result of the onload transformation. If you look inside article2wordml.xsl, you'll see the relevant document properties being set like so:

      <w:docPr>
        <!-- This only works if you're using Word 2003 standalone or
             Office 2003 Professional -->
        <w:useXSLTWhenSaving/>
        <w:saveThroughXSLT w:xslt="wordml2article.xsl"/>
        ...
      </w:docPr>

The end result is that end users can open, edit, and save the custom XML file without having to invoke any external IT processes. Word handles both XSLT transformations to and from WordprocessingML.

Some Benefits

This approach treats XML editing as essentially a conversion problem. While the activity of conversion isn't the same as that of editing, they're related. If you can create a reasonably reliable transformation from a legacy document format to a desired XML format, then it stands to reason that you could use the same transformation for new Word documents that users create.

A few things can make this easier for the scenario in which authors are creating new documents, as opposed to you converting legacy documents. Before users start authoring documents, you have the freedom to decide what Word template to use, along with the appropriate styles--whereas you don't have that option when converting legacy documents that already exist.

Another advantage of this approach is that it doesn't force the Word user to adopt a new model or way of thinking or editing (which is decidedly not the case if you make them use Word's built-in custom XML features). The savvy Word author doesn't have to know that the document will be converted to XML later on. They just know that using styles is good practice. But even if they don't know that, we can force them (through formatting restrictions) to use the correct styles to get the formatting they want.

Some Limitations

One of the things I like about this "lightweight" approach is that, beyond creating a Word template, the only code you have to write is two XSLT style sheets. It sounds deceptively simple. The problem is that the more complicated your XML formats become, the more difficult it will be to define round-trip mappings between them and WordprocessingML. In the real world, we usually want to support at least some forms of recursive markup. For example, we should be able to specify that some text is "strong" and "emphasized" by using markup like this:

  <strong>This is bold <emphasis>and italic</emphasis>.</strong>

But since Word doesn't support such combinations, you have to merge these into a single style definition, called something like StrongAndEmphasis. And you'll want to also account for the scenario in which a <strong> element appears inside an <emphasis> element, not just the other way around. So we would need to add a rule to our onload style sheet that looks something like this:

  <xsl:template match="strong/emphasis/text() | emphasis/strong/text()"
                priority="1">
    <w:r>
      <w:rPr>
        <w:rStyle w:val="StrongAndEmphasis"/>
      </w:rPr>
      <w:t>
        <xsl:value-of select="."/>
      </w:t>
    </w:r>
  </xsl:template>
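
For readers who think more easily in a general-purpose language, the effect of this rule can be sketched in Python. It is only an illustration, not part of the style sheets: the runs function and the Strong and Emphasis style names are assumptions (only StrongAndEmphasis is named in the text), and the sketch handles just these two inline elements.

```python
import xml.etree.ElementTree as ET

# Inline elements this sketch knows about (assumption: only these two).
INLINE = {"strong", "emphasis"}

# Hypothetical Word character style names; only StrongAndEmphasis
# is taken from the article.
STYLE_FOR = {
    frozenset(): None,
    frozenset({"strong"}): "Strong",
    frozenset({"emphasis"}): "Emphasis",
    frozenset({"strong", "emphasis"}): "StrongAndEmphasis",
}

def runs(elem, active=frozenset()):
    """Yield (style_name, text) for each piece of text under elem.

    `active` is the set of inline elements currently open, so the
    nesting order of <strong> and <emphasis> does not matter.
    """
    if elem.text:
        yield STYLE_FOR[active], elem.text
    for child in elem:
        yield from runs(child, active | ({child.tag} & INLINE))
        if child.tail:
            # Text after a child belongs to the enclosing element.
            yield STYLE_FOR[active], child.tail

para = ET.fromstring(
    "<para><strong>This is bold <emphasis>and italic</emphasis>.</strong></para>"
)
for style, text in runs(para):
    print(style, repr(text))
```

Because the open elements are tracked as a set, <emphasis> inside <strong> and <strong> inside <emphasis> map to the same style, which is what the two alternatives in the match pattern achieve.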

The transformation back to the custom XML format is even trickier if we want to avoid flattened markup that looks like this in the result:

  <strong>This is bold </strong>
  <strong><emphasis>and italic</emphasis></strong>
  <strong>.</strong>

That's not to say that your average XSLT wizard won't be able to figure out a solution--maybe even a generic solution. (I can imagine using a two-stage transformation that would allow you to reintroduce a normalized hierarchy into the markup, but that's getting out of scope here.) It's just that it won't be terribly straightforward. Even so, I like the challenge.
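
To give a feel for what a second, normalizing stage might do, here is a minimal Python sketch that merges runs of adjacent sibling elements with the same name. It is an assumption about one possible approach, not the author's solution and not XSLT; the merge_adjacent function is hypothetical, and it only looks at one level of the tree.

```python
import xml.etree.ElementTree as ET

def merge_adjacent(parent, tag):
    """Merge runs of adjacent <tag> children of parent into one element.

    Two siblings count as adjacent when nothing but whitespace
    separates them.
    """
    kept = []  # children that survive, in document order
    for child in list(parent):
        prev = kept[-1] if kept else None
        adjacent = (
            prev is not None
            and prev.tag == tag
            and child.tag == tag
            and not (prev.tail or "").strip()
        )
        if not adjacent:
            kept.append(child)
            continue
        # Append the child's leading text to the end of prev's content.
        if len(prev):
            prev[-1].tail = (prev[-1].tail or "") + (child.text or "")
        else:
            prev.text = (prev.text or "") + (child.text or "")
        prev.extend(list(child))   # move the child's sub-elements
        prev.tail = child.tail     # keep whatever followed the child
    for child in list(parent):
        parent.remove(child)
    parent.extend(kept)
    return parent

flat = ET.fromstring(
    "<p><strong>This is bold </strong>"
    "<strong><emphasis>and italic</emphasis></strong>"
    "<strong>.</strong></p>"
)
merge_adjacent(flat, "strong")
print(ET.tostring(flat, encoding="unicode"))
```

A full solution would also have to apply this recursively and decide which element ends up outermost when the nesting could go either way.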

Conclusion

The takeaway from this article should not be that Word's custom XML schema features are completely useless. No, they have their uses, particularly if you've got more data-oriented, business-template document formats. The thing to keep in mind is that this is essentially version 1.0 technology. It is exciting, even if it's not ready for prime time in terms of general XML editing. It will definitely be interesting to see what the next version of Word will add in terms of XML support. Until then, you might still be able to employ Word in a robust and usable way for your document-oriented XML applications with a little bit of creativity and XSLT trickery.

Evan Lenz is an XML developer specializing in XSLT.


O'Reilly Media, Inc., recently released (June 2004) Office 2003 XML (http://www.oreilly.com/catalog/officexml).

[Sept 8, 2004] OOo Off the Wall: The Outlining and the Ecstasy

The Rookery: OOo Off the Wall: The Outlining and the Ecstasy
Posted on Wednesday, September 08, 2004 by Bruce Byfield

With a bit of practice and some of these tips, you can become an outlining pro, even if you haven't done an outline since Freshman Comp.

Outlining is the arrangement of sections within documents. The process of outlining includes re-positioning paragraphs and making decisions about what level in the hierarchy a heading should be.

Outlining is not writing, but it is a core part of the writing process. Despite this fact, many people begin to write with almost no outlining. Perhaps they are in a hurry to get started. Perhaps, if they attended high school in North America, they think of an outline as something they cobbled together after they finished writing the paper simply to satisfy a teacher's arbitrary demands. Whatever the reason, many people plunge into a document and discover its structure as they write. This practice usually is inefficient, because they are trying to do two things at the same time, write and organize. They don't know where they are going, which makes writing a prolonged and painful process.

It is true that a handful of professional writers never outline or outline only long and complex documents. Far more professionals, however, use some sort of outlining technique. For some, the physical act of writing accounts for as little as 10 to 20% of the time spent on a document. The rest is spent outlining and editing.

Judging from the habits of most professionals, then, outlining benefits most writers. However, each writer needs to discover how much outlining he or she needs to do and what form that outlining should take. Sometimes, an outline can be a simple scribbled list or a brainstorming session on a white board. At other times, it's a formal document with headings and subheadings.

For those who prefer the formal approach, OpenOffice.org's Writer program offers a de-centralized set of tools. Writer uses:

Users of MS Word often leap to the conclusion that Writer has no outlining tools. In fact, Writer does have such tools, but they are arranged and function differently. Out-of-the-box (or out of the tar file), the tools are less functional than MS Word's, but with a little ingenuity, you can wrench almost the same functionality out of them.

The Role of Heading Paragraph Styles

The purpose of outlining is to structure your document. That means your document's format also needs to be structured if you are going to outline in Writer. And that, in turn, means using heading paragraph styles. If you format manually, there simply isn't enough consistency for Writer's outlining tools to work with.

Heading styles, numbered 1-10, are intended to indicate levels of organization. In other words, a heading at a higher level should contain the subject matter of a heading at a lower level. To give a simple example, if a document discusses the solar system, then the second-level headings might name individual planetary systems. Below the headings for planetary systems, the next level of headings might be each planet's moons.

Writer recognizes all other paragraphs as belonging to the same topic until the next heading at the same level appears. If the heading is moved during outlining, so are the other paragraphs, including any subheadings.
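
The idea of a heading "owning" everything beneath it can be sketched in a few lines of Python. This is a toy model, not Writer's implementation: the document is assumed to be a list of (level, text) pairs with None as the level of body text, the function names are hypothetical, and a section is assumed to end at the next heading of the same or a higher level.

```python
def section_end(paras, start):
    """Return the index one past the section that begins at paras[start]."""
    level = paras[start][0]
    for i in range(start + 1, len(paras)):
        lvl = paras[i][0]
        if lvl is not None and lvl <= level:
            return i
    return len(paras)

def move_section_up(paras, start):
    """Swap the section at `start` with the previous section of the same level."""
    level = paras[start][0]
    end = section_end(paras, start)
    prev = None
    for i in range(start - 1, -1, -1):
        lvl = paras[i][0]
        if lvl is None:
            continue
        if lvl == level:
            prev = i
            break
        if lvl < level:
            break  # reached the parent heading: nothing to swap with
    if prev is None:
        return list(paras)
    return paras[:prev] + paras[start:end] + paras[prev:start] + paras[end:]

doc = [
    (1, "Solar System"),
    (2, "Earth"),
    (None, "Third planet."),
    (3, "Moon"),
    (2, "Mars"),
    (None, "Fourth planet."),
    (3, "Phobos"),
    (3, "Deimos"),
]
for level, text in move_section_up(doc, 4):
    print(level, text)
```

Moving "Mars" up carries its body text and both moons with it, which is the behavior described above.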

By contrast, the most you can do in a manually formatted document is single-style outline numbering (see "It's Numbering, But Not As We Know It"). Single-style outlining is useful in the early stages of planning, when you have no content. But, if you find formal outlines useful in the first place, single-style outlining probably is too limited for you. Although you can promote or demote paragraphs easily enough by using Tab and Shift+Tab, moving sections of text requires you to copy and paste. Although you can get by using these methods, you'll probably find that copying and pasting distracts you from thinking about the structure of your document. In addition, a single-style outline ordinarily is not visible in the Navigator.

The Role of Outline Numbering

Here's where it gets confusing. If you use styles in Writer, you probably know that numbering styles can be applied to paragraph styles. Yet, in addition to numbering styles, Writer has a second system for numbering paragraph styles, located in Tools > Outline Numbering. I call this system multi-style outlining, as opposed to single-style outlining. Both are called outline numbering, yet the two systems are completely independent of each other.

Figure 1. Despite its name, Tools > Outline Numbering is as much about managing how other Writer tools use styles as it is about outlining.

Then, to make matters worse, multi-style outlining uses paragraph styles that it describes as levels. By default, these levels correspond to the paragraph styles Heading 1-10--but they don't have to. Moreover, if any of the paragraph styles used in multi-style outlining are formatted using numbering styles, or even if a manually formatted list uses the headings, Tools > Outlining is overridden and has no effect whatsoever.

Why does Writer work this way? Why does the software encourage the use of paragraph styles in every other way and then muddy the waters with Tools > Outlining? The answer is simple:

Nobody knows.

My theory is Tools > Outlining was added by a programmer ignorant of styles back in the Jurassic Age when OpenOffice.org was StarOffice and owned by StarDivision. That is only a guess, but what else explains the duplication?

It may help if you think of Tools > Outlining as a means of managing how styles are used by other tools throughout Writer rather than as a means of setting style characteristics. Multi-style outlining sets the styles used:

By default, the numbers assigned to each level's style are formatted the same as the rest of the paragraph. If you choose, though, you can use the Character Style on the Numbering tab of Tools > Outlining to give them a different format. You also can insert a separator automatically, such as a period or a parenthesis before or after the number, and you can set the numbering system, the starting number, and the position and spacing for the number.

Perhaps the most important setting is the paragraph style. Because Tools > Outlining has ten levels and uses Headings 1-10 by default, you can be lulled into thinking no other arrangement is possible. The truth is, you can assign any paragraph style to any level. Because you rarely need more than four levels of headings, you can assign the main body text to one level and have it displayed in the Navigator. You can't read all of the body paragraph, though, because the Navigator uses a single line for each level. Most headings are short, so that's all that normally is needed. But by dragging the Navigator window wider, you should be able to see enough that you can work with the body text. Be sure, however, that the level to which the body text style is assigned isn't included when you set up a table of contents.
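
As a rough model of this arrangement, the Python sketch below assigns four heading styles and one body style to outline levels, then derives a Navigator-like listing and a table of contents from the same data. Everything in it is an assumption made for illustration: the level assignment, the function names, and the fixed line width stand in for settings you would actually make in Writer's dialogs.

```python
# Hypothetical assignment: levels 1-4 are headings, level 5 is body text.
LEVEL_STYLE = {
    1: "Heading 1",
    2: "Heading 2",
    3: "Heading 3",
    4: "Heading 4",
    5: "Text body",
}
STYLE_LEVEL = {style: level for level, style in LEVEL_STYLE.items()}

def navigator_lines(paras, width=40):
    """One indented line per paragraph whose style has an outline level.

    Lines are cut to `width` characters, mimicking the single line
    the Navigator gives each entry.
    """
    lines = []
    for style, text in paras:
        level = STYLE_LEVEL.get(style)
        if level is None:
            continue  # styles without a level do not appear at all
        lines.append(("  " * (level - 1) + text)[:width])
    return lines

def toc_entries(paras, max_level=4):
    """Entries for a table of contents that stops at max_level."""
    return [
        text
        for style, text in paras
        if STYLE_LEVEL.get(style, max_level + 1) <= max_level
    ]

paras = [
    ("Heading 1", "Planets"),
    ("Text body", "The solar system has eight planets and many smaller bodies."),
    ("Heading 2", "Earth"),
    ("Caption", "Figure 1"),
]
print("\n".join(navigator_lines(paras, width=20)))
print(toc_entries(paras))
```

Widening the line limit corresponds to dragging the Navigator window wider, and keeping max_level below the body-text level corresponds to leaving that level out of the table of contents.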

If outlining features in your work methods, create a template in which multi-style outline numbering is set up. However, be careful to include text that uses each of the outline levels you have configured. Through some oversight, multi-style outline settings are not preserved in a template unless they actually are used.

The Navigator's Role

The Navigator lists all the elements of your document. In Writer, over a dozen types of objects, including headings, graphics, tables, cross-references and draw objects are listed in the Navigator. You can click on any instance of one of these objects to jump to it. This ability especially is useful if you give each instance a meaningful name instead of using the default names, such as Graphic1 or Table1.

Figure 2. The Navigator, set to the content view and ready to start outlining

Yet, as useful as this feature is, the Navigator really comes into its own in outlining. To use the Navigator in outlining, press the F5 key to open its floating window. Of all the objects listed in the Navigator, headings are the only ones you need, so select Headings in the Navigator's list, then select the Contents View button in the Navigator tool bar second from the top. This selection displays only the currently selected type of object, giving you more window space in which to work. When you are finished, you can press the Contents View button again to display the complete list of objects.

If you have never done much with the Navigator, you also should drag on a corner of the window to make it bigger. The default size of the Navigator usually is too small to be used conveniently for outlining.

Each heading level is indented further than the one above it. You can change how many heading levels are visible by selecting the Heading Levels Shown button. The button is third from the right on the Navigator's second tool bar.

Unfortunately, Navigator offers no provision for hiding a single paragraph. You can, however, select Insert > Fields > Other > Functions > Hidden Text to hide a paragraph. Because an open Field window does not keep you from using the main editor window, this is a workable kludge, but only so long as you have a large enough screen for all the open windows.

Figure 3. The Navigator's Outlining Buttons. Promote Chapter and Demote Chapter are on the top right; Promote and Demote Level on the bottom right. On the bottom left is Heading Levels Shown.

Around the Heading Levels Shown button are the other tools you need for outlining:

Contrary to many users' first inclination, trying to drag headings around with the mouse doesn't work. You must use these buttons instead.

 

Figure 4. The Navigator also can be used for outlining in master documents. Only the buttons have been changed, to protect the inconsistent.

In a master document, the Navigator works much the same way. Acting like a table of contents in a floating window, it resembles a book file in FrameMaker. In a master document, however, the Promote and Demote Chapter buttons are replaced--for no good reason except inconsistency--with the Move Up and Move Down buttons.

Conclusion

Serious outliners have complained that OpenOffice.org's outline tools are basic. They have a point. Although workarounds exist for the most important deficiencies, they require more than a beginner's knowledge of Writer to set up. Still, even when used with default settings, Writer's outlining tools are preferable to repeated cut and pastes.

If you're not in the habit of outlining, give it a try--single and multiple style outlining both. Once outlining becomes part of your routine, you'll probably find that you spend more time preparing for but less total time on each document. Who knows? It even may give you enough confidence that you look forward to writing instead of avoiding it.

Bruce Byfield was product manager at Stormix Technologies and marketing and communications director at Progeny Linux Systems. He also was a contributing editor at Maximum Linux and the original writer of the Desktop Debian manual. Away from his computer, he listens to punk-folk music, raises parrots and runs long, painful distances of his own free will.

Microsoft Works Suite 2004

I do not understand why the reviewer complains about MS Word 2002 ;-). For all but the most advanced users heavily involved in macro programming, the difference between Word 2003 and Word 2002 is unnoticeable. If you are heavily involved in macro programming, MS Office 2003 is the only way to go, and you need to shell out the money to Microsoft (in the USA, via partner programs, you can buy MS Office Professional and Windows Server 2003 for $300 or so).

3 out of 5 stars: Unjustified for Some, October 22, 2003

Reviewer: Andre Da Costa (Top 1000 Reviewer), from Jamaica W.I.

This version of Works Suite 2004 offers almost exactly the same features as version 2003. The only difference is Money and Encarta: version 2004 of each is included instead. For users who have stuck with version 2000, 2001 or 2002 of Works Suite, this will suit you more. The collection of software included is amazing for the value at which it is being offered. From powerful and easy-to-use word processing in Microsoft Word to quality photo editing with Picture It!, this is a recommended set of programs for home users and some businesses who need only the essentials.

The package includes Works 7.0 which was also in Works Suite 2003, with basic spreadsheet and database applications.

You also get a great money management solution: Money 2004 makes it a cinch to keep your budget on track and intact. With its online integration you can pay bills online or get further information to better manage your financial data. The Works applications, though, I still consider not as strong as Microsoft Excel and Access. The Works database's approach to creating queries is limited and unreliable. The spreadsheet does not offer calculation tools as powerful as Excel's, and integration between the two applications is very limited. But if you need only essential tools for managing the essential aspects of your life, whether at home or school, Works Suite 2004 is truly a bargain.

Remember if you are a user of Works Suite 2003, this update might be unnecessary to you because the majority of the products in version 2003 are included in version 2004. If you want the updated products which are Money 2004 Standard and Encarta Standard 2004 you can purchase them separately.
 



The Last but not Least Technology is dominated by two types of people: those who understand what they do not manage and those who manage what they do not understand ~Archibald Putt. Ph.D


Copyright © 1996-2021 by Softpanorama Society. www.softpanorama.org was initially created as a service to the (now defunct) UN Sustainable Development Networking Programme (SDNP) without any remuneration. This document is an industrial compilation designed and created exclusively for educational use and is distributed under the Softpanorama Content License. Original materials copyright belong to respective owners. Quotes are made for educational purposes only in compliance with the fair use doctrine.

FAIR USE NOTICE This site contains copyrighted material the use of which has not always been specifically authorized by the copyright owner. We are making such material available to advance understanding of computer science, IT technology, economic, scientific, and social issues. We believe this constitutes a 'fair use' of any such copyrighted material as provided by section 107 of the US Copyright Law according to which such material can be distributed without profit exclusively for research and educational purposes.

This is a Spartan WHYFF (We Help You For Free) site written by people for whom English is not a native language. Grammar and spelling errors should be expected. The site contains some broken links as it develops like a living tree...


Disclaimer:

The statements, views and opinions presented on this web page are those of the author (or referenced source) and are not endorsed by, nor do they necessarily reflect, the opinions of the Softpanorama society. We do not warrant the correctness of the information provided or its fitness for any purpose. The site uses AdSense so you need to be aware of Google privacy policy. If you do not want to be tracked by Google please disable Javascript for this site. This site is perfectly usable without Javascript.

Last modified: March 12, 2019