Approaches to Corpus Searches

Culture and languages

Science and Information Technology

Welcome to the workshop Approaches to Corpus Searches.

Workshop

Date

23 Apr 2024

Time

10:15 - 12:00

Location

Room J233, Renströmsgatan 6, University of Gothenburg

Organizer

Språkbanken Text

Talks

Špela Arhar Holdt, University of Ljubljana, Slovenia

Title: A specialised concordancer for corpora with annotated language corrections

Abstract: In this presentation, we introduce a new specialised concordancer for corpora with annotated language corrections (e.g. learner and developmental corpora). Through a variety of search scenarios, we will show the tool's capabilities, emphasising how users can easily search and examine features of both the learner texts and the corrected texts. The concordancer complements the Svala annotation system and the newly proposed XML TEI guidelines for these corpora, bridging the gap between data annotation and analysis. You are warmly welcome!

Johannes Graën, University of Zurich, Switzerland

Title: LCP -- the LiRI Corpus Platform

Abstract: Over the past three years, we have been developing a novel technology for querying corpora of different kinds. We found that, although plenty of specialized tools are freely available (CWB, ANNIS, NoSketchEngine, etc.), none of them handles large corpora (> 1 billion tokens) or multimodal data (audio, video, images) effectively. Furthermore, the query languages supported by those tools vary in expressiveness.

Our corpus platform LCP is designed to cater to the diverse needs of linguists and researchers from related fields. Its modular structure enables the creation of customized user interfaces on top of a shared infrastructure, while allowing users to import corpora with tailored structures.

Yousuf Ali Mohammed, Språkbanken Text, Sweden

Title: Strix -- for wiser text visualisation

Abstract: Strix is a text visualisation tool currently being developed at Språkbanken Text. It lets researchers, teachers, students and others who work with text data visualise a whole document (or text) together with its annotations at text level, sentence level and word level. The current version of Strix has a simple search function (word or phrase) and a filtering option based on metadata attributes. Statistics at corpus level and at document level make it easy to analyse and understand the content of the data. Users can import and analyse their own data collections in Strix through Mink, and they can also retrieve a collection of similar documents. The long-term goal is to make all the open-access data currently available in Korp available in Strix as well.

Peter Ljunglöf, Språkbanken Text, Sweden

...in collaboration with Nick Smallbone, Språkbanken Text & Niklas Deworetzki, Chalmers University of Technology

Title: Towards corpus algebra with precise semantics

Abstract: Researchers in digital humanities routinely work with text corpora: annotated text collections of up to billions of words. To find patterns in these corpora, they need search tools that can handle complex queries and huge amounts of text. However, existing tools perform poorly on complex queries because of weak query optimisation. Query optimisation is hard because existing query languages are ad hoc and have no clear semantics.

We will create a principled foundation for corpus query languages: a well-behaved language, a precise semantics and clear algebraic properties that can be used for query optimisation. We will use this to develop practical algorithms for efficiently searching very large text corpora.

Inspired by relational algebra, we propose a corpus algebra with a precise semantics. Queries are compiled into corpus algebra, then transformed using algebraic laws into a more efficient form and executed.
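The compile, transform and execute pipeline can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not the project's actual design: the toy corpus, the operator names `Tok`, `Seq` and `Alt`, the span-set semantics, and the two rewrite laws (idempotence of alternation, and factoring a shared right-hand operand out of an alternation of sequences).

```python
from dataclasses import dataclass

# Toy corpus: one dict of annotations per token (hypothetical layout).
CORPUS = [
    {"word": "the", "pos": "DET"},
    {"word": "old", "pos": "ADJ"},
    {"word": "dog", "pos": "NOUN"},
    {"word": "saw", "pos": "VERB"},
    {"word": "the", "pos": "DET"},
    {"word": "cat", "pos": "NOUN"},
]

# Hypothetical algebra operators; each expression denotes a set of spans.
@dataclass(frozen=True)
class Tok:
    attr: str
    value: str

@dataclass(frozen=True)
class Seq:
    left: object
    right: object

@dataclass(frozen=True)
class Alt:
    left: object
    right: object

def evaluate(expr, corpus):
    """Return the set of (start, end) spans matched by expr, end exclusive."""
    if isinstance(expr, Tok):
        return {(i, i + 1) for i, tok in enumerate(corpus)
                if tok.get(expr.attr) == expr.value}
    if isinstance(expr, Alt):
        return evaluate(expr.left, corpus) | evaluate(expr.right, corpus)
    if isinstance(expr, Seq):
        left = evaluate(expr.left, corpus)
        # Group right-hand spans by start position so adjacent spans can be joined.
        by_start = {}
        for start, end in evaluate(expr.right, corpus):
            by_start.setdefault(start, []).append(end)
        return {(s, e2) for s, e in left for e2 in by_start.get(e, [])}
    raise TypeError(f"unknown expression: {expr!r}")

def simplify(expr):
    """Rewrite bottom-up with two illustrative laws:
    Alt(a, a) = a
    Alt(Seq(a, c), Seq(b, c)) = Seq(Alt(a, b), c)
    """
    if isinstance(expr, Seq):
        return Seq(simplify(expr.left), simplify(expr.right))
    if isinstance(expr, Alt):
        left, right = simplify(expr.left), simplify(expr.right)
        if left == right:
            return left
        if isinstance(left, Seq) and isinstance(right, Seq) and left.right == right.right:
            return Seq(simplify(Alt(left.left, right.left)), left.right)
        return Alt(left, right)
    return expr

# "DET NOUN or ADJ NOUN", written naively with the NOUN lookup duplicated.
noun = Tok("pos", "NOUN")
query = Alt(Seq(Tok("pos", "DET"), noun), Seq(Tok("pos", "ADJ"), noun))
optimised = simplify(query)
```

Because both laws preserve the set of matched spans under this semantics, `evaluate(query, CORPUS)` and `evaluate(optimised, CORPUS)` return the same result, while the optimised form looks up the NOUN tokens only once.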

In the project we will address the following research problems:

1. What is a suitable query language for corpus algebra?
2. Which query operators have a well-behaved semantics?
3. What laws can we use to optimise corpus algebra expressions?
4. What search indexes are useful, and how can they be incorporated into corpus algebra?

The algorithms we produce will find optimal query plans that use available search indexes well. Hence we expect that complex queries will run in seconds, compared to several minutes today. This will open up new research fronts in digital humanities.
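As a rough illustration of how a search index can influence a query plan, the sketch below builds a simple inverted index and answers a two-token sequence query by scanning the shorter postings list. The token layout, the function names `build_index` and `find_bigram`, and the plan-selection rule are all hypothetical and are not taken from the project.

```python
from collections import defaultdict

# Toy corpus as (word, part-of-speech) pairs (hypothetical layout).
TOKENS = [
    ("the", "DET"), ("old", "ADJ"), ("dog", "NOUN"),
    ("saw", "VERB"), ("the", "DET"), ("cat", "NOUN"),
]

def build_index(tokens):
    """Map each (attribute, value) pair to its sorted list of token positions."""
    index = defaultdict(list)
    for position, (word, tag) in enumerate(tokens):
        index[("word", word)].append(position)
        index[("pos", tag)].append(position)
    return index

def find_bigram(index, first, second):
    """Return start positions where `first` is immediately followed by `second`."""
    first_hits = index.get(first, [])
    second_hits = index.get(second, [])
    # Plan choice: scan the shorter postings list and probe the other as a set.
    if len(first_hits) <= len(second_hits):
        probe = set(second_hits)
        return [p for p in first_hits if p + 1 in probe]
    probe = set(first_hits)
    return [q - 1 for q in second_hits if q - 1 in probe]

index = build_index(TOKENS)
matches = find_bigram(index, ("pos", "DET"), ("word", "cat"))
```

Here the rarer term (`cat`) drives the scan, so the work is proportional to the shorter postings list rather than to the size of the corpus.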

Panel with the four speakers on approaches to corpus searches

Moderator: Elena Volodina

Last modified

15 April 2024