
QA@CLEF-2005

Resources

CLEF QA resources: multilingual collections of questions and answers, and language-related resources.



QA @ CLEF 2005 Workshop


Judged Submissions of the CLEF-2005 QA Track

Only registered participants are allowed to access these data.




Check input utilities

Before submitting their results, participants should run this checking routine to detect format inconsistencies (invalid document numbers, missing data, etc.) in their runs. Submissions that do not comply with the required format will not be assessed.
Download the checking routine for the QA track at:

  CLEF-2003.
  CLEF-2004.
  CLEF-2005.



Test Sets at CLEF-2004

You can access them here.



Judged Submissions of the CLEF-2004 QA Track

Eighteen groups participated in the CLEF-2004 QA evaluation exercise, submitting 48 runs in 19 different tasks.
Submissions have been judged by human assessors and grouped according to the target language of the tasks. Here you can download them (zip file).



Test Sets at CLEF-2003

Three monolingual tasks (with Dutch, Italian and Spanish questions) and five bilingual tasks (where Dutch, French, German, Italian and Spanish queries searched for an answer in an English target corpus) were proposed at CLEF-2003.
Here are the original test sets that were distributed to participants. Each test collection is a plain text file. Please visit this web page for further information about the format.
Correct answers were manually retrieved and are included in the "DISEQuA" and "Multisix" corpora (see below).

  Monolingual tasks: Dutch, Italian, Spanish.
  Cross-language tasks (all against English): Dutch, French, German, Italian, Spanish.
  English version of the cross-language test sets (with manually retrieved [answer, docid] pairs).



Multieight-04 Corpus

The Multieight-04 Corpus is a collection of 700 questions in eight languages and their manually retrieved answers. It was created for the QA@CLEF-2004 track and represents a replicable gold standard that can be used for training.
Here you can access the README file and download the first version of the corpus.



DISEQuA corpus

The Dutch, Italian, Spanish and English collection of Questions and Answers was developed by three research groups: ITC-irst (Centro per la Ricerca Scientifica e Tecnologica, Trento - Italy), UNED (Spanish Distance Learning University, Madrid - Spain) and ILLC (Language and Inference Technology Group, University of Amsterdam - The Netherlands).
It is composed of 450 questions formulated in four languages. The answers have been manually searched in three document collections, which makes it possible to test and train cross-language QA systems in twelve different combinations (four question languages by three document collections). The corpora in which the answers were retrieved are those licensed by the CLEF consortium in 2002: La Stampa and SDA newspaper/wire articles (year 1994) for Italian, EFE (year 1994) for Spanish and Algemeen Dagblad and NRC Handelsblad (years 1994 and 1995) for Dutch. The questions also appear in English, but they were not verified in an English document collection.
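The twelve combinations follow from pairing each of the four question languages with each of the three document collections. A minimal Python sketch of that enumeration is given here; the variable names are illustrative and are not part of the corpus distribution.

```python
from itertools import product

# Languages in which the DISEQuA questions are formulated.
question_languages = ["Dutch", "Italian", "Spanish", "English"]

# Languages of the document collections in which answers were searched.
collection_languages = ["Dutch", "Italian", "Spanish"]

# Each (question language, collection language) pair is one test/train setting.
combinations = list(product(question_languages, collection_languages))

for source, target in combinations:
    kind = "monolingual" if source == target else "cross-language"
    print(f"{source} questions -> {target} documents ({kind})")

print(len(combinations))
```

Three of the pairs are monolingual settings; the remaining nine are cross-language settings, three of which use the English questions.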
Reference publication (to be acknowledged whenever you use DISEQuA) is B. Magnini, S. Romagnoli, A. Vallin, J. Herrera, A. Peñas, V. Peinado, F. Verdejo, M. de Rijke, Creating the DISEQuA Corpus: a Test Set for Multilingual Question Answering, in Carol Peters, editor, Working Notes for the CLEF 2003 Workshop, 21-22 August, Trondheim, Norway, 2003.

For further information, read a short description of the corpus.



Multisix corpus

The test sets we used for the cross-language tasks at CLEF QA-2003 are collected in the Multisix corpus, a collection of 200 English questions whose answers have been manually searched in the Los Angeles Times corpus (year 1994) licensed last year by CLEF. Each question has been translated into five languages (Dutch, French, German, Italian and Spanish), but no manual processing was conducted in other document collections.
Some typos were recently found and corrected in the German questions, so some entries in the "Multisix corpus" differ slightly from those in the original test sets (which can be downloaded above).
Reference publication (to be acknowledged whenever you use Multisix) is B. Magnini, S. Romagnoli, A. Vallin, J. Herrera, A. Peñas, V. Peinado, F. Verdejo, M. de Rijke, The Multiple Language Question Answering Track at CLEF 2003 (see chapter "Gold Standard for the Cross-Language Tasks"), in Carol Peters, editor, Working Notes for the CLEF 2003 Workshop, 21-22 August, Trondheim, Norway, 2003.

For further information, read a short description.
Here you can download the revised version (v2) of the Multisix corpus (zip file).



Italian Translation of the TREC Questions

ITC-irst has translated into Italian 1000 questions released for the QA track at TREC-2002 and 2003. They can be used for training.
As with the DISEQuA corpus (see above), the translation of the two TREC question sets is given in two XML files, where queries are numbered and described according to the category they belong to (either FACTOID, LIST or DEFINITION) and their answer type, i.e. the instance they refer to.
Several kinds of answer types have been taken into account: LOCATION (a place), PERSON (someone's name or role), TIME (the date of an event), MEASURE (the amount of something), MATERIAL (a particular substance), HOW (questions like "How did something happen?"), TITLE (the title of a song, movie, book, etc.), ACRONYM (the meaning of an abbreviation) and OTHER (plants, animals, inanimate objects, etc.). In most cases, the right answer is provided.
This translation represents a growing resource, and you are all encouraged to add other languages and other useful tags.
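As a rough illustration of how such files might be processed, the Python sketch below counts questions per answer type and lists those without an answer. The element and attribute names (`questions`, `q`, `id`, `category`, `answer_type`, `text`, `answer`) and the two sample questions are assumptions made up for this example; they do not reproduce the schema or content of the distributed files, so check the actual XML before adapting it.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Invented sample: tag names, attribute names and questions are assumptions,
# not the real schema or content of the ITC-irst files.
SAMPLE = """
<questions>
  <q id="1" category="FACTOID" answer_type="LOCATION">
    <text lang="it">Dove si trova il Colosseo?</text>
    <answer>Roma</answer>
  </q>
  <q id="2" category="DEFINITION" answer_type="OTHER">
    <text lang="it">Che cosa significa ONU?</text>
  </q>
</questions>
"""

def summarize(xml_text):
    """Return (questions per answer type, ids of questions lacking an answer)."""
    root = ET.fromstring(xml_text)
    questions = list(root.iter("q"))
    by_type = Counter(q.get("answer_type") for q in questions)
    unanswered = [q.get("id") for q in questions if q.find("answer") is None]
    return by_type, unanswered

by_type, unanswered = summarize(SAMPLE)
print(dict(by_type))
print(unanswered)
```

Listing unanswered questions reflects the note above that the right answer is provided in most cases, not all.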

Translation of the TREC-2002 questions. (zip file)
Translation of the TREC-2003 questions. (zip file)



Test Set for Italian Named-Entities Recognition

Annotated text is another useful resource you may use to test and improve your system. ITC-irst provides the transcribed text of Italian broadcasts, in which the entities LOCATION, PERSON and ORGANIZATION have been marked with tags, according to the NIST guidelines.
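A minimal Python sketch of reading such annotations follows. It assumes inline tags of the form `<ENAMEX TYPE="...">...</ENAMEX>`, a common convention for named-entity markup; the markup actually used in the test set is not described here and may differ, and the sample sentence is invented.

```python
import re

# Assumed markup: <ENAMEX TYPE="PERSON">...</ENAMEX>. Verify against the
# real files; the tag syntax here is an assumption, not the documented format.
ENTITY = re.compile(
    r'<ENAMEX TYPE="(LOCATION|PERSON|ORGANIZATION)">(.*?)</ENAMEX>',
    re.DOTALL,
)

# Invented sample sentence, for illustration only.
sample = (
    '<ENAMEX TYPE="PERSON">Mario Rossi</ENAMEX> ha parlato a '
    '<ENAMEX TYPE="LOCATION">Trento</ENAMEX>.'
)

def extract_entities(text):
    """Return (entity type, surface string) pairs in order of appearance."""
    return [(m.group(1), m.group(2)) for m in ENTITY.finditer(text)]

entities = extract_entities(sample)
for entity_type, surface in entities:
    print(entity_type, surface)
```

The extracted pairs can then be compared with a system's own output to compute precision and recall per entity type.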

Download the test set. (tar.gz file)



French Translation of the TREC Questions

The RALI group (Laboratoire de Recherche Appliquée en Linguistique Informatique) at the University of Montreal, Canada, has translated into French 1893 questions drawn from the TREC QA evaluation exercises.

The file is available at the RALI website.



Spanish Resources

QA resources for Spanish (including the translation of the TREC questions) are available on the website of the NLP and IR Group at UNED (Madrid, Spain).

URL: http://terral.lsi.uned.es/QA/resources/



Finnish Resources

The DOREMI research group at the University of Helsinki has posted some QA resources for Finnish, including translations of the CLEF 2003 and 2004 test sets.

URL: http://www.cs.helsinki.fi/research/doremi/interests/QAResources.shtml