Copy-pasting from Word

I've been working with web content management systems for almost fifteen years now. And exasperatingly, I still see the same project problems recur constantly. Some of this is because of a lack of education -- it seems the field has grown a lot quicker than the general level of knowledge about the basics of content management. But a lot of it is just the same old technical problems.

Exhibit A: copy/pasting from Microsoft Word.

Where does content commonly come from when it's repurposed for the Web? Microsoft Office, which is pretty much the standard for office productivity applications. In fact, it's quite usual for editors to send in their content as Word documents -- with webmasters or web managers diligently copying all the text, and pasting it into a rich text editor within a CMS.

Or rather, pasting it into Notepad first, and then pasting it into the editor. Because what Word leaves on the clipboard is Microsoft's interpretation of what HTML should look like -- and that's quite a mess. Redmond's proprietary tags routinely break pages and standard layouts. And then there's the separate problem of content encoding -- Word's "smart" quotes often don't translate too well. In short, Word doesn't really separate content and design -- one of the basic tenets of content management.
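To make the encoding half of the problem concrete, here is a minimal sketch of the kind of character normalization a paste filter might apply. The mapping and the function name `normalize_smart_chars` are my own illustrative assumptions, not taken from any particular CMS or editor, and swapping typographic characters for ASCII is only one possible policy (a site that serves UTF-8 correctly may prefer to keep them).

```python
# Hypothetical mapping (an assumption for illustration): typographic
# characters Word commonly inserts, mapped to plain ASCII stand-ins.
SMART_CHARS = {
    "\u2018": "'",    # left single quotation mark
    "\u2019": "'",    # right single quotation mark / apostrophe
    "\u201c": '"',    # left double quotation mark
    "\u201d": '"',    # right double quotation mark
    "\u2013": "-",    # en dash
    "\u2026": "...",  # horizontal ellipsis
    "\u00a0": " ",    # non-breaking space
}

# str.maketrans accepts a dict of single characters to replacement strings.
_TABLE = str.maketrans(SMART_CHARS)


def normalize_smart_chars(text: str) -> str:
    """Replace common Word typographic characters with ASCII equivalents."""
    return text.translate(_TABLE)
```

Run over pasted text, this turns curly quotes and apostrophes into their straight counterparts and leaves everything else untouched.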

Most systems nowadays have some sort of solution to this. Popular rich text editors like CKEditor and TinyMCE have buttons to either paste plain text only (the equivalent of the Notepad intermediary) or "clean" the Word content. Alternatively, your CMS may offer filters that will try to scrub the HTML after it is saved.

Cleaning, however, never quite works. Either too much gets stripped, so tables or more complex document structures don't make it across; or too little, leaving us with a bunch of tags with unpredictable results. All of this is difficult to get right. (I know this all too well, having once tried my hand at writing an XSLT filter for the purpose. The horror!) Unrealistic expectations here can lead to many help-desk calls -- "the CMS screwed up my document" -- and the like.
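The trade-off is easy to see in a sketch of an allow-list filter. This is not how CKEditor, TinyMCE, or any specific CMS implements cleaning; it is a simplified illustration built on Python's standard `html.parser`, and the names `ALLOWED_TAGS`, `ALLOWED_ATTRS`, `DROP_CONTENT`, `WordPasteCleaner`, and `clean_word_html` are all assumptions of mine.

```python
from html import escape
from html.parser import HTMLParser

# Assumption: a deliberately small allow-list. Any tag not listed here
# (tables, spans, Word's o:p tags) loses its markup but keeps its text.
ALLOWED_TAGS = {"p", "strong", "em", "ul", "ol", "li", "h2", "h3", "a"}
ALLOWED_ATTRS = {"a": {"href"}}
# Elements whose content is discarded entirely, not just unwrapped.
DROP_CONTENT = {"style", "script", "xml"}


class WordPasteCleaner(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0  # > 0 while inside a DROP_CONTENT element

    def handle_starttag(self, tag, attrs):
        if tag in DROP_CONTENT:
            self.skip_depth += 1
            return
        if self.skip_depth or tag not in ALLOWED_TAGS:
            return
        keep = ALLOWED_ATTRS.get(tag, set())
        # Attributes such as class="MsoNormal" or style="..." are dropped.
        attr_text = "".join(
            f' {name}="{escape(value or "", quote=True)}"'
            for name, value in attrs
            if name in keep
        )
        self.out.append(f"<{tag}{attr_text}>")

    def handle_endtag(self, tag):
        if tag in DROP_CONTENT:
            self.skip_depth = max(0, self.skip_depth - 1)
            return
        if not self.skip_depth and tag in ALLOWED_TAGS:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skip_depth:
            self.out.append(escape(data, quote=False))

    # Comments (including Word's conditional comments) are ignored by
    # HTMLParser's default handle_comment, so they never reach the output.


def clean_word_html(html: str) -> str:
    """Return the input with only allow-listed tags and attributes kept."""
    cleaner = WordPasteCleaner()
    cleaner.feed(html)
    cleaner.close()
    return "".join(cleaner.out)
```

With this allow-list, a pasted paragraph keeps its `p` and `strong` tags while losing the `class` and `style` attributes, but a table collapses into a run of cell text. Add `table`, `tr`, and `td` to the list and tables survive, along with a new set of decisions about which of their attributes to keep. That is the "too much or too little" problem in miniature.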

The reality is that the only reliable way to get text from Office to the web editor is "text only" -- forget any formatting. That's what the Notepad route does; and it's what Google's Chrome browser now does with CTRL + SHIFT + V.

It's fair to say only Microsoft could really fix this. How hard would it be to just paste minimal markup, instead of proprietary lingo? This isn't exactly rocket science, cold fusion, or teleportation. So, I asked the company.

The problem for Microsoft, of course, is that while pasting into web applications is common, pasting from one Office document to another is much, much more common. In those cases, you'll often want to preserve formatting, and according to Redmond, "the HTML clipboard format in Word is optimized for those scenarios." What's more, there are now the Office Web Apps -- so Microsoft enables pasting into those web versions of the Office suite with all formatting intact, too.

That's all fair, but what about the web editor and her tedious clean-up process? Well, according to Microsoft, "[Y]ou can save your documents as 'Web Page, Filtered' where the extra markup will be removed and you will be left with a simpler set of HTML markup." Alas, even filtered HTML is not entirely MS-free. 

So, there's a glimmer of hope, yet we remain pretty much where we've been for the past decade on this problem. There is no single answer to something as simple as copying text from an Office document and pasting it into your CMS. Microsoft's solution is a bit cumbersome and incomplete, and Google's rips out tables and other content you may like to keep.

However, instead of blaming Microsoft for this, consider it a reminder. The trenches aren't glamorous, but it's where you're most likely to encounter hurdles. There are plenty more day-to-day obstacles to getting it right. And nobody's going to magically fix this for you any time soon.

