However, although corpus analysis tools have been used extensively for research purposes, their systematic use as actual translation aids seems, at least in translator education in Finland, to have been rather neglected until now. Electronic corpora do not appear to be widely used by practising translators either, probably because these translators were not exposed to the potential of corpus analysis tools during their own education and because ready-made special-field corpora are unavailable. Thus Jääskeläinen and Mauranen (2004, p 53) propose that courses on how to compile and use corpora should not only be integrated into translator training at the undergraduate level but also be offered as continuing education to practising translators.
With this in mind, I began compiling a corpus of English-language tourism brochures in spring 2004. The aim was to use it to teach students how the competent use of electronic text corpora in conjunction with corpus analysis tools can help both the trainee translator and the professional translator to become better language service providers by enhancing both the quality of their work and their productivity, particularly when translating special-field texts into a foreign language. (Many translators of non-literary texts in Finland frequently translate into their L2.)
3 The Tourism Corpus
There were a number of reasons for deciding to compile a TL-corpus of tourist brochures. Firstly, there is a high demand in Finland for tourism texts to be translated from Finnish into English, not only for various kinds of brochures but also for websites. Secondly, I myself have extensive experience in this field, having done a large amount of language checking for various professional translators as well as a certain amount of translating of tourism texts from Finnish into English. Thirdly, many printed tourist brochures also appear in PDF format on their owners' websites, and thus are relatively easy to convert into the plain text format required by many corpus analysis tools. And last but certainly not least, students seem to be attracted to this field--perhaps because there is a certain amount of glamour attached to travel and tourism, and perhaps also because the concepts are relatively easy for even the non-expert to understand compared with many other special fields.
Nevertheless, translating tourist brochures only appears easy at first sight. Capturing the right style, conforming to the conventions of the target language and culture, and finding a consistent and logical strategy for translating names of places, resorts and establishments as well as culture-specific terms are just a few of the difficulties that face the translator. In Finland, another problem is that although the source texts of some brochures are written with a foreign audience in mind, more often than not they are written first for the Finnish audience, and it is this text that serves as the basis for the foreign language versions. The content is not necessarily geared towards a foreign audience, and thus there are, for example, frequent allusions to information that will be implicitly understood by the Finnish audience but not by the foreign audience.
The texts of the Tourism Corpus were mainly derived from tourist brochures that appear on the Internet in PDF format. Converting these into plain text format was often quite straightforward, though in most cases careful post-editing was needed, since headings, and sometimes even complete paragraphs, tended to switch positions in the conversion process. Usually, the more sophisticated and attractive the brochure, the trickier it was to convert into text format.
By September 2004, with the help of a student assistant, I had compiled a corpus amounting to 670,000 words. There are various types of corpora and various ways of classifying them. The Tourism Corpus could be described as an untagged monolingual target-language corpus. It contains mainly texts from brochures from the British Isles and from North America, especially Canada. A major reason for including Canadian brochures was that they contain descriptions of activities that are often featured in Finnish source texts--e.g. snowshoe treks, skiing, snowmobile trips, wilderness adventures--which are rarely mentioned in British brochures.
The file names have been labelled with one of the following codes: BI, CA, US, so that the user can immediately identify whether a concordance line is from the British Isles, Canada, or the United States, as illustrated in Figure 1.
4 Exploiting the Tourism Corpus
During the 2004-2005 academic year, I integrated corpus exploitation into my translation courses. Students received instruction in using the corpus analysis package WordSmith Tools (Scott, 2004), were taught various strategies for exploiting corpora when translating, and were given tourist brochure texts as translation assignments from Finnish into English. Examples are given below illustrating ways in which students have been able to exploit the Tourism Corpus in order to improve the quality of their translations.
The corpus has proved very useful for finding information about collocates, especially adjectives that collocate with nouns. For example, when translating sentences containing the noun rapids, the KWIC display provides a rich menu of adjectives to choose from, as illustrated in Figure 2.
Figure 2: Display of some of the concordance lines generated by WordSmith Tools for the search word rapids
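The KWIC (keyword-in-context) display described above can be approximated in a few lines of code. The following is only a minimal sketch, not the way WordSmith Tools works internally: the function name kwic, the width parameter, and the sample sentences are all assumptions made up for illustration and are not taken from the Tourism Corpus.

```python
import re

def kwic(text, keyword, width=30):
    """Return one keyword-in-context line per whole-word match of keyword."""
    lines = []
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    for m in pattern.finditer(text):
        # Take up to `width` characters of context on each side of the hit.
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        # Right-align the left context so the search word lines up in a column.
        lines.append(f"{left:>{width}} [{m.group(0)}] {right:<{width}}")
    return lines

# Invented toy text, NOT an extract from the Tourism Corpus.
sample = (
    "Paddle through foaming rapids on a guided trip. "
    "The river offers gentle rapids for beginners and roaring rapids for experts."
)

for line in kwic(sample, "rapids", width=25):
    print(line)
```

Aligning the search word in a single column is what lets the eye scan the word immediately to its left, which is where adjectival collocates of a noun such as rapids tend to appear.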
Searching for collocates often leads to somewhat unexpected discoveries. For example, when looking for translation equivalents for hoidettu or kunnostettu with reference to cross-country ski trails, traditional resources suggest conditioned, maintained, restored and reconditioned as possible translation candidates. However, of the 1000-plus concordance lines generated by the search word trails, none has any of these adjectives immediately to the left of the search word, whereas there are over 40 occurrences of the adjective groomed. Native speakers, especially North Americans, will probably be familiar with this term, but most novice translators, and even professional translators who have little experience in translating tourism texts, usually are not. A new concordance with groomed as the search word generates 128 hits, and provides evidence of, for example, groomed bicycle and walking trails, groomed classic and skating trails, groomed cross-country ski trails, groomed fairways, groomed off-road trails, groomed runs, groomed slopes, and groomed wilderness trails, as illustrated in Figure 3.
Figure 3: Display of some of the concordance lines generated by WordSmith Tools for the search word groomed
However, even the seasoned concordance user may "miss" the 40-plus occurrences of groomed when scrolling through the 1000-plus hits for trails. Therefore, when a search word generates a large number of concordance lines, students are taught to turn to the collocates display and the clusters display. For example, Figure 4 shows the words that occur most frequently within a span of five words to the left of trails, while Figure 5 shows the most common 3-word clusters containing trails. Each of these displays helps to highlight the frequent co-occurrence of groomed and trails.
Figure 4: Fifteen most frequent collocates occurring to the left of trails
Figure 5: Fifteen most frequent 3-word clusters
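The two displays just described, collocates within a span of five words to the left of the search word and 3-word clusters containing it, can be sketched as simple frequency counts. This is an illustrative approximation under stated assumptions rather than a description of WordSmith Tools: the helper names tokenize, left_collocates and clusters are hypothetical, the tokenizer is deliberately crude, sentence boundaries are ignored, and the sample text is invented.

```python
import re
from collections import Counter

def tokenize(text):
    """Lower-case the text and keep alphabetic words (hyphens and apostrophes allowed inside)."""
    return re.findall(r"[a-z]+(?:[-'][a-z]+)*", text.lower())

def left_collocates(tokens, node, span=5):
    """Count every word found within `span` tokens to the left of each occurrence of `node`."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            # Windows of neighbouring hits may overlap; each window is counted separately.
            counts.update(tokens[max(0, i - span):i])
    return counts

def clusters(tokens, node, n=3):
    """Count every n-word sequence that contains `node`."""
    counts = Counter()
    for i in range(len(tokens) - n + 1):
        gram = tokens[i:i + n]
        if node in gram:
            counts[" ".join(gram)] += 1
    return counts

# Invented toy text, NOT an extract from the Tourism Corpus.
sample = (
    "Explore 40 km of groomed trails. Our groomed trails suit every skier. "
    "Hiking trails and groomed trails start at the lodge."
)
tokens = tokenize(sample)

print(left_collocates(tokens, "trails").most_common(3))
print(clusters(tokens, "trails").most_common(5))
```

Because both functions rank words by frequency instead of listing every hit, a recurrent pairing such as groomed with trails rises to the top even when it would be easy to overlook while scrolling through a long concordance.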
4.2 Finding and choosing between terms
When deciding on a translation equivalent for a specific term or phrase, the corpus has been of great help in verifying or rejecting decisions based on other tools such as dictionaries and the Internet. An example of this is the Finnish term koiravaljakkoajelu. After hunting through traditional translation aids, student translators came up with the terms dog sled, dog sledge and dog sleigh, each of which is also often written with a hyphen or as one word. The corpus helps in deciding which of these alternatives to use. Figure 6 illustrates some of the concordance lines generated for the search pattern dog*. The original KWIC display contained 22 hits for dog sled, 27 hits for dogsled, and 6 hits for dog-sled, with no hits at all for dog sledge or dog sleigh or variations thereof. Moreover, there were 68 hits for dogsledding, which is also often written as two words. The display also shows that adventure, excursion, ride, trip, and tour are amongst the nouns that collocate with dog sled.
Figure 6: Display of some of the concordance lines generated by WordSmith Tools for the search word dog*
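The comparison of competing spellings can also be expressed as a small counting routine. The sketch below assumes a regular expression in place of the dog* wildcard pattern used in WordSmith Tools; the names VARIANT and count_variants are hypothetical, and the sample sentence is invented, so the counts it produces say nothing about the Tourism Corpus itself.

```python
import re
from collections import Counter

# Matches "dog" + optional space(s) or hyphen + sled / sledge / sleigh as a whole word.
# The final \b keeps longer forms such as "dogsledding" out of the count.
VARIANT = re.compile(r"\bdog(?:\s+|-)?(?:sled|sledge|sleigh)\b", re.IGNORECASE)

def count_variants(text):
    """Count each spelling variant, ignoring case and collapsing runs of whitespace."""
    counts = Counter()
    for m in VARIANT.finditer(text):
        form = re.sub(r"\s+", " ", m.group(0).lower())
        counts[form] += 1
    return counts

# Invented toy text, NOT an extract from the Tourism Corpus.
sample = (
    "Try a dog sled ride, a dogsled tour or a dog-sled excursion. "
    "Dogsledding is popular; book a dogsled adventure."
)

for form, freq in count_variants(sample).most_common():
    print(form, freq)
```

A variant that never occurs simply receives no entry in the tally, which mirrors the way the absence of dog sledge and dog sleigh from the concordance helps the translator reject those candidates.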
Researchers such as Bernardini (2000, 2001) and Varantola (2003) have pointed out that corpora allow unpredictable, incidental learning: the user may notice and explore unknown or unfamiliar uses in a concordance and go off at a tangent to follow them up. Bowker & Pearson (2002, pp 200-202) show how creative search techniques, for example concordancing with contextually-relevant search words, can increase the likelihood of "accidentally" finding relevant information.
As shown earlier, a search of the Tourism Corpus for trails led to the serendipitous discovery of the adjective groomed. The KWIC display in Figure 6 provides further examples of the kind of previously "unknown" information the translator might acquire when browsing through a KWIC display. This information may be relevant to the translation assignment at hand, or may come in handy for future assignments. Lines 1, 2 and 14 contain references to dog musher and dog mushing that may warrant further consideration; lines 6, 17 and 21 refer to ice-fishing, while line 14 encourages the tourist to fish through a hole in the ice--two possible translations for the Finnish term pilkkiminen; lines 10 and 11 mention ATV tours, lines 18 and 24 aurora viewing, line 21 snowshoeing, and line 22 illuminated skating loop, all of which may lead to further exploration by viewing in fuller context or by entering new search patterns. For example, a search for ATV will quickly reveal that this is a widely used abbreviation for All Terrain Vehicle--a possible translation candidate for mönkijä, a Finnish term for which it is difficult to find an equivalent using traditional resources.
4.4 Language chunks
Gavioli & Zanettin (1997) point out that a corpus acts as a continual source of additional raw material and consider that the greatest benefit of using TL corpora is that they can suggest multi-word "chunks" that students are able to use to produce texts that sound more natural in the target language. According to Gavioli & Zanettin, achieving such "naturalness" is probably the greatest benefit of using corpora in translation, particularly into the foreign language, where naturalness is more difficult to achieve.
Finnish tourist brochures often contain references to ruska-aika, the period in autumn when the leaves change colour, producing breathtakingly beautiful landscapes. The translator may decide that the concept of ruska contains implicit information that needs to be expressed more explicitly for a foreign audience, and thus that some sort of description is necessary. Figure 7 shows some of the concordance lines produced by a search for autumn. Words and phrases could be extracted from them and incorporated into the translator's own description.
Figure 7: Display of some of the concordance lines generated by WordSmith Tools for the search word autumn
If one had searched for fall, the American synonym for autumn, one would also have found references to the fall foliage season, brilliant foliage in fall and stunning fall foliage.
5 Words of Warning
Some researchers, e.g. Ball (1997), have warned that the use of electronic text may tempt the analyst to seek only that which is easy to find--you notice only what you get back; you will not notice what you did not find. However, my experience of integrating corpus analysis into translation courses suggests that creative searching is likely to result in a wealth of discoveries and answers to questions that the translator did not even think of asking in the first place.
There have also been some concerns that corpora may reinforce the tendency of translated texts towards "normalisation" (i.e. making texts more standardised and conventional):