Nordic Treebank Network Meeting
Tartu, 8-10 September 2004
Day 1: Working Group Sessions and Discussion
Tools and Resources (TIGER-XML)
Coordinator: Matthias Trautner Kromann
In the TIGER-XML session at the Tartu meeting, the network members
voted to accept proposals 1.1 on character encoding, 1.2 on
intersegmental links, and 1.3 on glosses (cf.
http://www.id.cbs.dk/~mtk/ntn/tiger-xml.html) as recommendations in
their current form. For each of the other proposals on the ballot, the
network members voted to set up a working group, as follows:
- 1.4. Dependency graphs: Joakim Nivre, Heli Uibo, Matthias Kromann,
Søren Harder.
- 1.5. External sources (text, speech, video): Koenraad de Smedt,
Janne Bondi Johannessen, Matthias Kromann, Manne Miettinen.
- 1.6. Argument structure: Joakim Nivre, Eckhard Bick, Søren Harder.
- 2.1. Segments: Nobody volunteered. Unless somebody volunteers, the
network will not make a recommendation within this area.
- 2.2. Alignment: Martin Volk, Yvonne Samuelsson, Matthias Kromann,
Søren Harder, Eckhard Bick, Lars Nygaard.
The members of the working groups are responsible for producing a
proposal for a recommendation before November 1 (December 1 for 1.6,
where the work starts from scratch). The network decided that the working
groups should remain open to everybody, and that all discussion within the
working groups should be carried out on the nordic-treebank list so that
all members of the network can participate in the discussion.
Parallel Treebank
Coordinator: Martin Volk
Martin Volk presented two projects about parallel treebanks that were
done by PhD students as part of the Treebank Course. Both were on the
topic of transferring information from a treebank in one language (e.g.
English) to a parallel language (e.g. Amharic).
Atelach Alemu did a project on "Projecting Dependency Parses -
English to Amharic". She parsed English Sofie sentences and wrote a
program to transfer the information to Amharic. Her conclusions were
rather negative. Svetoslav Marinov (Skövde), by contrast, did a project
on "(Semi-)Automatic transfer of syntactic information" from Swedish to
Bulgarian. He transferred the dependency information computed for Swedish
by the Växjö group to Bulgarian, and his recall and precision values
were encouraging.
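The general idea behind both projects can be illustrated with a minimal sketch. It is not the code written by either student: it assumes a one-to-one word alignment is already available and simply copies each source dependency onto the aligned target tokens, then scores the result against a gold annotation. The function names, the toy indices and the one-to-one assumption are all hypothetical.

```python
def project_dependencies(source_heads, alignment):
    """Copy source dependencies onto aligned target tokens.

    source_heads: dict mapping source token index -> head index (0 = root)
    alignment:    dict mapping source token index -> target token index
                  (assumed one-to-one; real alignments are messier)
    Returns a dict mapping target token index -> target head index.
    """
    projected = {}
    for dep, head in source_heads.items():
        if dep not in alignment:
            continue  # unaligned source token: nothing to project
        if head == 0:
            projected[alignment[dep]] = 0  # the root stays the root
        elif head in alignment:
            projected[alignment[dep]] = alignment[head]
    return projected


def precision_recall(projected, gold):
    """Score projected head attachments against gold head attachments."""
    correct = sum(1 for dep, head in projected.items() if gold.get(dep) == head)
    precision = correct / len(projected) if projected else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


# Toy example with token indices only (no real language data).
source_heads = {1: 2, 2: 0, 3: 2}
alignment = {1: 1, 2: 3, 3: 2}
gold_target = {1: 3, 2: 3, 3: 0, 4: 3}  # target token 4 has no aligned source

projected = project_dependencies(source_heads, alignment)
print(projected)                                 # {1: 3, 3: 0, 2: 3}
print(precision_recall(projected, gold_target))  # (1.0, 0.75)
```

In the toy example every projected attachment is correct, but the unaligned target token lowers recall, which is one reason projection quality depends heavily on alignment coverage.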
In this session, three other projects done by PhD students were also
presented, although they were not related to parallel treebanks:
- Johan Hall and Jens Nilsson (Växjö University) worked on "Converting
dependency treebanks to MALT-XML"
- Kaarel Kaljurand (Tartu University) worked on "Checking treebank
consistency"
- Henrik Oxhammar and Hans Hjelm (Stockholm University) worked on
"Guidelines for Named Entity Markup in ANNOTATE"
Then Martin Volk and Yvonne Samuelsson presented their work on a
Swedish-German parallel treebank:
- They have annotated the first chapter of Sofie's World both for German
and Swedish with flat constituent structure trees according to the
NEGRA/TIGER guidelines.
- They used the German chunker to annotate the Swedish sentences, which
saved more than 50% of the annotation time (compared to completely
manual annotation of the Swedish sentences).
- They have developed a program to "deepen" the flat tree structures by
automatically adding new nodes. This deepening makes the trees more
consistent, linguistically more plausible, and enables finer alignment
between the trees.
- They have written a program to exploit automatic word alignment
information (as obtained by Jörg Tiedemann) for phrase alignment and
were quite pleased with the evaluation results.
- They have written a program to visualize node alignments based on the
SVG trees exported from TIGER-Search and the alignment information in an
XML file.
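As an illustration of how word alignment can drive phrase alignment, here is a minimal sketch. It is not the program described above: it assumes the common consistency criterion, under which two tree nodes are aligned when at least one word link joins their spans and no word link leaves one span without entering the other. The node ids, spans and links are hypothetical toy data.

```python
def consistent(src_span, tgt_span, links):
    """True if some link joins the two spans and no link crosses their boundary."""
    inside = 0
    for s, t in links:
        s_in, t_in = s in src_span, t in tgt_span
        if s_in != t_in:
            return False  # link starts inside one span and ends outside the other
        if s_in:
            inside += 1
    return inside > 0


def align_nodes(src_nodes, tgt_nodes, links):
    """Return all consistent (source node, target node) pairs.

    src_nodes / tgt_nodes: dict mapping node id -> set of token positions
    links: iterable of (source position, target position) word links
    """
    return sorted(
        (s_id, t_id)
        for s_id, s_span in src_nodes.items()
        for t_id, t_span in tgt_nodes.items()
        if consistent(s_span, t_span, links)
    )


# Toy example: three tokens per sentence, with two words in swapped order.
links = [(1, 1), (2, 3), (3, 2)]
src_nodes = {"NP1": {1}, "VP1": {2, 3}, "S1": {1, 2, 3}}
tgt_nodes = {"NP2": {1}, "VP2": {2, 3}, "S2": {1, 2, 3}}

print(align_nodes(src_nodes, tgt_nodes, links))
# [('NP1', 'NP2'), ('S1', 'S2'), ('VP1', 'VP2')]
```

Deeper trees offer more nodes whose spans can satisfy the criterion, which fits the observation above that deepening enables finer alignment.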
Matthias Trautner Kromann presented his pseudo-automatic word alignment
program within his DTAG treebank tool.
Janne Bondi Johannessen compiled a status list of the annotated
Sofie sentences in the various languages. She will follow up on this and
see that the various groups submit their annotations to the online
database in Oslo. We also asked that the problem with the display of
crossing branches be solved. One option would be to use the SVG
trees from TIGER-Search (instead of a local tree display).
Various other research topics with respect to Parallel Treebanks were
discussed, but no actions were taken:
- A search tool over parallel treebanks is needed.
- Alignment between different annotation formats should be explored.
- Alignment between differing languages should be explored (e.g. Swedish-Estonian
alignment).
- The interpretation of alignment information as transfer rules in
Machine Translation seems interesting.
Notational Harmonization (VISL)
Coordinator: Eckhard Bick
In the session on notational harmonization, Eckhard presented the VISL
category system providing definitions for the individual form and
function categories, with a special focus on co-ordination and the
stacking notation. Joakim presented a VISL-transformation of Swedish
dependency treebank edge labels.
The following actions were agreed upon:
- Eckhard will produce a full list of VISL categories for the Sophie web
site.
- Everybody will provide a VISL style version of their Sophie data,
with the possibility of publishing both VISL and native versions
in parallel at the Sophie website.
- On top of the existing f (free) and b (valency bound), the following
lower case prefix letters for VISL categories were suggested, though not
formally decided:
- t = topic, e.g. tS (topic subject: *Peter*, han er kedelig.)
- s = secondary, e.g. sOd (secondary edge object relation)
- g = gapping, e.g. gS (gapping subject, without a verb: Peter bor i Rom,
*Hans* i London)
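The prefix convention can be made concrete with a small sketch that splits a label into its prefix and base category. It assumes that a prefix is exactly one of the lower case letters listed above and is followed by a category starting with an upper case letter; that reading, and the function name, are assumptions for illustration and not part of the VISL definition.

```python
# Prefix letters named in the minutes: f and b exist, t, s and g were suggested.
PREFIXES = {
    "f": "free",
    "b": "valency bound",
    "t": "topic",
    "s": "secondary",
    "g": "gapping",
}


def split_label(label):
    """Split a label into (prefix meaning or None, base category).

    Assumes one lower case prefix letter followed by an upper case initial.
    """
    if len(label) > 1 and label[0] in PREFIXES and label[1].isupper():
        return PREFIXES[label[0]], label[1:]
    return None, label


for label in ["tS", "sOd", "gS", "S"]:
    print(label, split_label(label))
# tS ('topic', 'S')
# sOd ('secondary', 'Od')
# gS ('gapping', 'S')
# S (None, 'S')
```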
Spoken Language and Discourse
Coordinator: Jens Allwood (absent)
The network decided to set up a working group consisting of Janne
Bondi Johannessen and Matthias Kromann for planning NTN's work on
spoken language treebanks. The primary tasks for this working group are:
- To plan a spoken language task where NTN members are asked to
create a treebank for a small sample of spoken language
transcriptions. The purpose of this task is to start a
discussion on spoken language treebanks within the NTN.
- To contact the organizers of NODALIDA about the possibility of
arranging a treebank workshop in connection with NODALIDA.
The workshop should have a public session on open issues in
spoken language treebank creation, a session on other public
issues (e.g. parallel treebanks), and a (possibly private) NTN
session where we can plan the upcoming activities within the
network.
- To plan the NODALIDA workshop, including examining the
possibility of inviting 1-2 keynote speakers in the spoken
language treebank session. The plans must be circulated before
the TLT meeting in Tübingen.
Day 2: Planning
Six main topics were discussed in the final planning session:
- TIGER-XML: All the tasks, working groups
and deadlines proposed in the working group session (see above)
were agreed upon.
- Publication efforts:
- The network will produce a white paper on treebanking in the Nordic
countries for the yearbook of the Nordic Language Technology Program.
An editing committee consisting of Joakim,
Koenraad and Martin will circulate drafts during the month of October
and produce a final version before the end of October.
- The network will put together a book proposal focusing on the
Sophie parallel treebank. An editing committee consisting of Eckhard,
Heli and Martin will produce a first version by 1 December.
- Contributions to the yearbook of the Nordic Language Technology Program
should be sent to the local documentation centers before the end of October.
In addition, contributors should send their submissions to Joakim, who will
inform Henrik about what contributions to expect from the network.
Everyone who contributes a paper to the TLT workshop is encouraged to
publish the paper also in the yearbook.
- Parallel treebank and notational harmonization:
It was decided that everyone who contributes to the
Sophie parallel treebank should try to do the following before December 31:
- produce a version using the VISL categories,
- annotate the entire first two chapters (with alignment to the Norwegian original).
- Spoken language and discourse: The plans proposed in the working group
session (see above) were agreed upon.
- TLT 2004: Each network site should send a message to Joakim before 5 October
indicating how many participants they want to send to TLT 2004 in Tübingen.
It was agreed that priority should be given to
- sites contributing papers
to the workshop,
- sites that have been active in the network in general.
- Network activities after December 2004. Three decisions were reached:
- We will apply for an extension of the network's duration to 1 July 2005,
aiming to have a final meeting in conjunction with NoDaLiDa and the General
Seminar of the Nordic Language Technology Program in Joensuu in May 2005.
- We will try to have yearly meetings also after 2005. The next meeting
after Joensuu is tentatively planned to be in Bergen in 2006 with support
from the Trepil project.
- We will maintain the mailing list also after 1 July 2005.
Participants
| Site | Participants |
| Copenhagen Business School | Matthias Trautner Kromann |
| CSC Scientific Computing | Manne Miettinen |
| Stockholm University | Martin Volk, Yvonne Samuelsson |
| University of Bergen | Koenraad de Smedt |
| University of Helsinki | Kimmo Koskenniemi |
| University of Oslo | Janne Bondi Johannessen, Gunnar Hrafn Hrafnbjargarson |
| University of Southern Denmark | Eckhard Bick, Søren Harder |
| University of Tartu | Heli Uibo, Kadri Muischnek, Kaili Müürisep |
| Växjö University | Joakim Nivre |
| Nordic Language Technology Program | Henrik Holmboe |
Pictures
- Pictures (thanks to Kimmo Koskenniemi)
- More pictures (thanks to Gunnar Hrafn Hrafnbjargarson and Janne Bondi Johannessen)