The INTERSECT corpus at the University of Brighton is a translation corpus (also called a “parallel corpus”), consisting of texts in English and their translations in French or German, stored in electronic form. The texts are varied, including fiction, journalism, business reports, UN and EU documents, science and technology texts, tourist brochures, and other genres. The corpus contains about 1.5 million words in French and English, and about 800,000 words in German and English.
A corpus of authentic language provides a sample of language in use in real contexts. Surprising patterns often emerge which are not apparent in isolated sentences. A translation corpus adds a further layer of value, because each text in the original has been analysed by a skilled bilingual translator and then resynthesised in the target language. The kind of pattern found here often goes beyond the simple correspondences found in dictionaries. One example among many is the French word préciser, which according to the dictionaries usually translates as “specify, make precise”. In the INTERSECT corpus, though, we found over twenty different strategies used to render this word into English, some of them illustrated here:
Les biens des organisations visées sont transférés dans le domaine de l'État, et le Ministère de l'Intérieur doit par décret suprême [préciser] quels sont les biens en question
The assets of the organisations concerned are transferred into the State domain, and the Interior Ministry is called upon to specify by supreme decree the assets in question.
Cette dame, rencontrée dans le centre de Moscou, déclare, dans un premier temps, gagner 8,000 roubles par mois. Elle [précisera], dans un deuxième temps, que son mari gagne aussi à peu près la même somme...
A woman I met in central Moscow at first said her income was 8,000 roubles a month. Subsequently, she added that her husband earned about the same amount...
Le comité prie le gouvernement de [préciser] si les trois travailleurs de l'entreprise Vianini Entrecanales mentionnés par les plaignants ont été licenciés
The Committee requests the Government to indicate whether the three workers of the Vianini Entrecanales company mentioned by the complainant were dismissed.
La combinaison no 32 peut être utilisée dans certaines séquences de signalisation de commutation; ces usages sont [précisés] dans les Recommandations U.11 et S.4
Combination No. 32 can be used in certain sequences of switching signals; these uses are set out in Recommendations U.11 and S.4.
Un autre acte pourtant, à vos yeux ridicule peut-être, mais que je redirai, car il [précise] en sa puérilité le besoin qui me tourmentait ...
I will tell you, however, about one other action of mine, though perhaps you will consider it ridiculous, for its very childishness marks the need that then tormented me...
Investigating real translations can bring to light significant differences in the way meanings are expressed in different languages. The research literature tends, however, to discuss carefully selected, isolated instances. A computer corpus enables researchers to find many examples of a word or phrase, together with its translations, quickly and accurately. Crucially, a computer can easily count occurrences of items in the corpus and thereby help identify recurrent patterns. Isolated instances can be distinguished from those which are frequent and genuinely representative. In this way, the corpus captures the expertise of skilled translators, and enables us to compare the expressive resources of different languages in ways that previously were not feasible.
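The kind of counting described above can be illustrated with a minimal sketch in Python. It is not part of the INTERSECT tools: the aligned pairs are abridged from the préciser examples on this page, and the candidate list of English renderings (CANDIDATES) and the function name count_renderings are hypothetical, chosen by hand for illustration only.

```python
import re
from collections import Counter

# Aligned (French, English) segments, abridged from the examples on this page.
PAIRS = [
    ("le Ministère de l'Intérieur doit par décret suprême préciser quels sont les biens en question",
     "the Interior Ministry is called upon to specify by supreme decree the assets in question"),
    ("Elle précisera, dans un deuxième temps, que son mari gagne aussi à peu près la même somme",
     "Subsequently, she added that her husband earned about the same amount"),
    ("Le comité prie le gouvernement de préciser si les trois travailleurs ont été licenciés",
     "The Committee requests the Government to indicate whether the three workers were dismissed"),
    ("ces usages sont précisés dans les Recommandations U.11 et S.4",
     "these uses are set out in Recommendations U.11 and S.4"),
    ("car il précise en sa puérilité le besoin qui me tourmentait",
     "for its very childishness marks the need that then tormented me"),
]

# Crude pattern for inflected forms of "préciser". It would also match
# related words such as "précision", so real work needs a stricter filter.
SOURCE_FORMS = re.compile(r"\bprécis\w*", re.IGNORECASE)

# Hypothetical, hand-picked English renderings to look for (an assumption).
CANDIDATES = {
    "specify": r"\bspecif\w*",
    "add": r"\badd\w*",
    "indicate": r"\bindicat\w*",
    "set out": r"\bset out\b",
    "mark": r"\bmark\w*",
}


def count_renderings(pairs, source_re, candidates):
    """Count which candidate renderings occur in the English segment
    whenever the French segment contains the source word."""
    counts = Counter()
    for french, english in pairs:
        if not source_re.search(french):
            continue  # the source word does not occur in this segment
        found = [label for label, pattern in candidates.items()
                 if re.search(pattern, english, re.IGNORECASE)]
        if found:
            counts.update(found)
        else:
            counts["(other)"] += 1  # a rendering outside the candidate list
    return counts


if __name__ == "__main__":
    for rendering, n in count_renderings(PAIRS, SOURCE_FORMS, CANDIDATES).most_common():
        print(rendering, n)
```

On a full corpus the same tally separates frequent renderings from one-off solutions. The "(other)" bucket flags aligned segments that still need to be read by hand, which is where unexpected strategies tend to be found.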
Translation corpora are a central part of the scholarly infrastructure in several fields, notably Translation Studies, Contrastive Linguistics, Language Engineering, Bilingual Lexicography and Language Teaching. INTERSECT has been used by researchers in several countries (e.g. St. John 2001, Celle 2005).
The corpus texts are plain text files. The texts are aligned for use with a parallel concordancer called ParaConc, written by Michael Barlow of the University of Auckland. Efforts to add further annotation are continuing.
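For readers without access to ParaConc, a simple parallel search over aligned plain text can be sketched in Python. The layout assumed here is hypothetical and is not a description of the INTERSECT or ParaConc file format: two UTF-8 files with one segment per line, where line n of the source file corresponds to line n of the translation. The function names read_segments and parallel_concordance are likewise our own.

```python
import re
from pathlib import Path


def read_segments(path, encoding="utf-8"):
    """Read a plain text file and return its lines as a list of segments.

    Assumes one aligned segment per line (a hypothetical layout).
    """
    return Path(path).read_text(encoding=encoding).splitlines()


def parallel_concordance(source_segments, target_segments, pattern):
    """Return (line number, source segment, target segment) for every
    source segment that matches the regular expression `pattern`."""
    if len(source_segments) != len(target_segments):
        raise ValueError("source and target are not aligned line for line")
    query = re.compile(pattern, re.IGNORECASE)
    return [
        (number, source, target)
        for number, (source, target)
        in enumerate(zip(source_segments, target_segments), start=1)
        if query.search(source)
    ]
```

A call such as parallel_concordance(read_segments("fr.txt"), read_segments("en.txt"), r"\bprécis\w*") would list each matching French segment beside its English counterpart; the two file names are placeholders. Corpora aligned at paragraph level, or with one-to-many sentence alignments, would need a richer representation than paired lines.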
The French-English corpus includes:
The German-English corpus includes:
We are unable to make the corpus freely downloadable because of copyright issues. Efforts to solve these are also continuing. We are happy to distribute the corpus to researchers who contact us.
For information about ParaConc, contact Michael Barlow. Email: firstname.lastname@example.org or email@example.com. See below for his web site.
Other publications that use the INTERSECT corpus
School of Humanities
University of Brighton
Falmer, Brighton, BN1 9PH
Phone: (+44) 01273 643335
Fax: (+44) 01273 690710