[Mt-list] Call for Participation: WMT 2021 Multilingual Low-Resource Translation for Indo-European Languages

Cristina España-Bonet

2021-05-20 16:01:24 UTC

[Apologies for multiple posting]

Dear all,

The period for submitting resources for the shared task on "Multilingual
Low-Resource Translation for Indo-European Languages" ended yesterday.
Please visit the webpage if you are interested in participating and
training your multilingual MT system!

http://www.statmt.org/wmt21/multilingualHeritage-translation-task.html

Best,
Cristina

November 10-11, 2021
Punta Cana, Dominican Republic (ONLINE)

Shared Task: Multilingual Low-Resource Translation for Indo-European Languages

TASK DESCRIPTION

Massively multilingual machine translation has shown impressive
capabilities, including zero- and few-shot translation of low-resource
languages. However, these models are often evaluated from or into English,
where the most data is available, on the assumption that the results
generalise to other pairs and low-resource languages.

In the first edition of the Multilingual Low-Resource Translation shared
task, we focus on multilinguality in the cultural heritage domain for two
Indo-European language families: North-Germanic and Romance. We want to
explore how information in one language can be transferred to other related
languages and, for this, we evaluate translation quality in low-resource
language pairs, but explicitly encourage the use of data from the
high-resource language pairs in the same family. We would like to answer
the question: do we need English and/or Spanish for high-quality
translation of related languages? If so, which is the best way to combine
the data: pivot, cascade, multilingual, via pre-training? To explore the
topic, we present two tasks with different characteristics, one per
language family (see below). Although data in the highest-resourced
languages is allowed and encouraged for training, translation quality will
only be evaluated on the other pairs. Participants are also encouraged to
make available to the community any additional resources useful for the
task.

Subtasks

- *Task 1. Europeana thesis abstracts and descriptions*. North-Germanic
languages: *from/to Icelandic, Norwegian Bokmål and Swedish*. Danish,
German and English data is allowed for training but translation is not
evaluated.
- *Task 2. Wikipedia cultural heritage articles*. Romance languages: *from
Catalan to Occitan, Romanian and Italian*. Spanish, French and
Portuguese data (+ English!) is allowed for training but translation is not
evaluated.

EVALUATION

*Task 1. Europeana thesis abstracts and descriptions.* We will provide a
test set with a similar number of abstracts in each of the 3 languages,
each to be translated into the other 2. The data will be in xml format,
annotated with the source language and document boundaries. The
validation set has exactly the same structure. All data has been translated
by professional translators, with the source language being the original
language.

*Task 2. Wikipedia cultural heritage articles.* We will provide a test set
with articles in Catalan that need to be translated into the other 3
languages. As in Task 1, the data will be in xml format, annotated with
the source language and document boundaries. The validation set has
exactly the same structure. All data has been translated by professional
translators, with the source language being the original language.
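
For quick reference, the translation directions implied by the two test-set
descriptions above can be enumerated with a short sketch. The two-letter
language codes and the function names are assumptions made for this
illustration; the announcement does not say which codes the xml files use.

```python
from itertools import permutations

# Language codes are ISO 639-1 labels chosen for this sketch (an assumption;
# the task's own files may label languages differently).
TASK1_LANGS = ["is", "nb", "sv"]    # Icelandic, Norwegian Bokmål, Swedish
TASK2_SOURCE = "ca"                 # Catalan
TASK2_TARGETS = ["oc", "ro", "it"]  # Occitan, Romanian, Italian


def task1_directions():
    """Every ordered (source, target) pair among the Task 1 languages."""
    return list(permutations(TASK1_LANGS, 2))


def task2_directions():
    """Catalan into each of the three Task 2 target languages."""
    return [(TASK2_SOURCE, target) for target in TASK2_TARGETS]


if __name__ == "__main__":
    for src, tgt in task1_directions() + task2_directions():
        print(f"{src} -> {tgt}")
```

Under these assumptions, Task 1 covers six evaluated directions and Task 2
covers three.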

*Automatic evaluation* results will be reported per language pair and on
average. The final ranking will be done according to the average
translation quality per subtask, that is, per family and not per pair. When
possible, we will also perform *human evaluation at document level* on a
subset of paragraphs/sentences. If you plan to participate, please contact
us so that we can plan the human evaluation.

Since the official evaluation will be done per family, participants may
take part in only one of the tasks if preferred, but they are expected to
submit translations for all the languages involved in that task.
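
The ranking rule described above (scores reported per language pair, ranking
based on the average per subtask) might be sketched as follows. This is only
an illustration: the announcement does not name the automatic metric or say
whether the average is weighted, so the sketch assumes a single
higher-is-better score per pair and an unweighted mean. The function names
and the system names in the usage example are hypothetical.

```python
from statistics import mean


def subtask_score(pair_scores):
    """Unweighted mean of one system's per-pair scores within a subtask.

    Assumption: one higher-is-better score per language pair; the actual
    metric and weighting are not specified in the announcement.
    """
    if not pair_scores:
        raise ValueError("no language-pair scores given")
    return mean(pair_scores.values())


def rank_systems(systems):
    """Order system names by their subtask average, best first."""
    return sorted(systems, key=lambda name: subtask_score(systems[name]),
                  reverse=True)


if __name__ == "__main__":
    # Made-up numbers, for illustration only.
    systems = {
        "system_a": {"ca-oc": 10.0, "ca-ro": 20.0, "ca-it": 30.0},
        "system_b": {"ca-oc": 30.0, "ca-ro": 30.0, "ca-it": 30.0},
    }
    print(rank_systems(systems))
```

Because the ranking uses the family-level average, a system that skips one
pair cannot be compared directly with the others, which is consistent with
the request to submit translations for every language in the task.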

IMPORTANT DATES

Release of initial training data     April 5, 2021
Additional training data deadline    May 19, 2021
Release of test data                 *June 29*, 2021
Results submission deadline          July 6, 2021
Paper submission deadline            August 5, 2021
Paper notification                   September 5, 2021
Camera-ready version due             September 15, 2021
Conference at EMNLP                  November 10-11, 2021

ORGANISERS

Anastasija Amann (DFKI GmbH, Germany)
Kwabena Amponsah-Kaakyire (DFKI GmbH, Germany)
Cristina España-Bonet (DFKI GmbH, Germany)
Josef van Genabith (DFKI GmbH, Germany)
Leonie Harter (DFKI GmbH, Germany)

Contact

Interested in the task? Join the WMT Google group for questions or
comments and/or email cristinae aatt dfki.de.