The Journey Begins

How my journey to writing CAT tool related scripts all began, written down here for posterity (i.e., a memory aid for my future self...)

First Steps toward Useful Scripts

The catalyst for my scripting journey was a request from my boss/colleague some years back.

At the time, he had a large job that required revising hundreds of Excel files whose translation had been outsourced. Those files contained the original text in one column, and the translation in the next column (along with several more columns not relevant to the translation itself). The files potentially contained numerous repetitions, both within a single file and across multiple files.

Faced with the prospect of doing so manually, and knowing about my interest in programming, he asked me if it would be possible to create a translation memory from the files sent by the outsourcer so he could do the revision in OmegaT.

I managed to cobble a combination of scripts to do so.

The first was an Excel VBA macro that produced files containing only the columns with the original and translated text, and then put the revised translated text back into the original file.

The second was a LibreOffice Basic script that loaded the files produced by the VBA script and saved them into the translation memory exchange (TMX) format used by CAT tools. (The actual conversion to TMX format was actually performed by TMX filters modified for our language pair. I never managed to find an equivalent for Excel.)

The pair of scripts worked, insofar as they produced the desired results, but the VBA macro, in particular, was cumbersome and slow. (The LibreOffice script simply consisted of loading the Excel file and saving it to a different folder as a TMX file)

Nevertheless, they did work! That prompted me to look at other tasks in translation jobs through the lens of potential automation, and to take a closer look at Python.

The job that prompted my boss's request was actually a recurring quarterly job at the time, and following the success of creating a translation memory and handling the job in OmegaT, he decided to do the subsequent batches himself rather than oustsource them again. This led to having to change the VBA macro from producing a file with the original and the translation to one that produced a file containing only the original text for translation.

It also involved creating additional macro designed to help find and remove extraneous line breaks. This "find and remove" line breaks macro worked by identifying cells containing line breaks and presenting the user with a side-by-side comparison of that cell's text with and without any line breaks. The user would then make any necessary adjustments and click a button to save the final file.

Python to the Rescue

The trials and tribulations involved in working with VBA led me to turn to Python. With the help of search engines and Stack Overflow, I found libraries to handle Excel files (in both the older binary and the newer XML-based formats) and managed to rewrite the VBA script that extracted the files to translate in Python.

The macro designed to help remove extraneous line breaks had to remain in VBA because of its required manual intervention.

I did, however, use python to write a script that produced a slightly more useful side-by-side comparison by being more selective about removing line breaks. Instead of removing everything, it used regex to identify line breaks that should kept based on the preceding and following characters, and removed the remainder. (Manual intervention was still required since it was nowhere near perfect or foolproof, but it still managed to make manual intervention a little less troublesome than the version that removed all line breaks.)

Although the actual resulting scripts were extremely specific to that particular set of files, writing them taught me a lot about generally applicable coding principles, and also helped me understand Python and build a foundation for subsequent dabbling.

The Journey Begins

First Steps toward Useful Scripts

Python to the Rescue

Wiki Contents