On-site registration will be available from 8.00 to 8.30 for an additional 15 EUR.
Aivars Glaznieks, Eurac Research, Italy
The best time to talk with speakers and attendees
Lieke Verheijen and Wilbert Spooren
Lisa Hilte, Reinhild Vandekerckhove and Walter Daelemans
Laura Herzberg and Angelika Storrer
Harald Lüngen, Michael Beißwenger, Laura Herzberg and Cathrin Pichler
We will start in the Eurac Research garden and take a group photo.
Stefania Spina
Taja Kuzman and Darja Fišer
Steven Coats
The best time to talk with speakers and attendees
Paola Leone
Julien Longhi, Claudia Marinica, Nader Hassine, Abdulhafiz Alkhouli and Boris Borzic
Social Get-Together with local dishes and drinks at Batznhäusl.
A. Seza Doğruöz, Independent Researcher
Michael Beißwenger, Ciara R. Wigham and colleagues
Darja Fišer, Director of User Involvement within CLARIN ERIC
The best time to talk with speakers and attendees
Márton Petykó
Damjan Popič and Darja Fišer
Muhammad Shakir and Dagmar Deuber
The tutorial gives a practical introduction to the annotation of language data from genres of computer-mediated communication (CMC) and social media using TEI.
Harald Lüngen, Institute for the German Language, Mannheim, Germany
Michael Beißwenger, University of Duisburg-Essen, Germany
Laura Herzberg, University of Mannheim, Germany
The tutorial gives a practical introduction to the annotation of language data from genres of computer-mediated communication (CMC) and social media using TEI.
Harald Lüngen, Institute for the German Language, Mannheim, Germany
Michael Beißwenger, University of Duisburg-Essen, Germany
The best time to talk with speakers and attendees
The tutorial gives a practical introduction to the annotation of language data from genres of computer-mediated communication (CMC) and social media using TEI.
Harald Lüngen, Institute for the German Language, Mannheim, Germany
Laura Herzberg, University of Mannheim, Germany
Features a series of invited speakers whose work has greatly contributed to the development of Learner Corpus Research. Read more about the event.
Link to the registration: €15 for the LCR 2017 Pre-Conference Workshop and Reception (without attending the LCR 2017 Conference).
Link to more information about LCR registration fees and policy.
Social Network Sites (SNS) claim that they are "on a mission to connect the world". They facilitate communication among people wherever they are located. Consequently, many users of SNS communicate with a broad and heterogeneous group of friends on different occasions and thereby express various aspects of their identities (such as gender, age, ethnic background etc.). One aspect may also be a local identity.
Users of SNS can show their local identity linguistically by using a regional variety. Sometimes, the use of single regionally marked words or sporadic regiolectal spellings is sufficient to identify the regional background of the writer; in other cases, entire text messages and conversations appear in dialectal spellings, meaning that the dialect serves as the main variety of the conversation. The extent of dialect use in computer-mediated communication (CMC) may depend on various factors such as the individual dialect skills, the vividness and prestige of the respective dialect in the community, emotional involvement in the given topic, age, gender, the intended recipient, and other factors probably interacting with each other.
The use of regional dialects in written CMC is one reason (amongst others) why language in CMC often differs from the respective standard languages. Since no orthographic rules are usually available for writing in dialect, it is up to the users to represent their dialect in a proper but readable and comprehensible way. Users have to construct their regiolectal language variety on the basis of the orthography of the respective standard language, which usually also allows for variation. One reason for this may be that there are various adequate possibilities to represent a dialect word within a given writing system (e.g. German). Another reason may be the (sometimes very slight) phonetic differences between regionally close dialects that writers want (or do not want) to show in the dialect respelling. Therefore, dialect respellings are not always coherent (neither with respect to a group of dialect speakers nor with respect to individual writers) but usually appear in various forms. However, unifications of respellings in CMC are described for pidgin languages and also occur in dialectal CMC.
Over the last decade, researchers have started to compile corpora containing different genres of CMC. Such CMC corpora enable a systematic analysis of the way dialect features are reflected in written communication. In my talk, I will focus on patterns of the regional dialect(s) in the DiDi Corpus, a collection of Facebook messages from around 100 South Tyrolean writers. I will provide examples of regional features, analyse the distribution of such features, and discuss challenges of identifying local writings on SNS.
Mobile communication tools and platforms provide various opportunities for users to interact over social media. With the recent developments in computational research and machine learning, it has become possible to analyze large amounts of language-related data automatically and quickly. However, such tools are not readily available for all languages, and handling social media data poses challenges of its own. Even when these issues are resolved, asking the right research question of the right set and amount of data becomes crucially important.
Both qualitative and quantitative methods have attracted respected researchers in language-related areas of research. When tackling similar research problems, there is a need for both top-down and bottom-up data-based approaches to reach a solution. Sometimes the solution lies hidden in an in-depth analysis of a small data set, and sometimes it is revealed only through analyzing and experimenting with large amounts of data. In most cases, however, the findings from small data sets need to be linked to the patterns in large sets to understand the bigger picture.
Having worked with both small and large language-related data in various forms, I will compare the pros and cons of working with both types of data across media and contexts and share my own experiences with highlights and lowlights.
The tutorial gives a practical introduction to the annotation of language data from genres of computer-mediated communication (CMC) and social media using TEI.
Harald Lüngen
Harald has been working on TEI-encoding of CMC data in the CLARIN-D curation project "ChatCorpus2CLARIN" and has been one of the main authors of the most recent TEI schema draft for CMC genres developed within the TEI-SIG "computer-mediated communication".
Michael Beißwenger
Michael is a member of the organizing team for the main conference. He has been working on CMC corpora and corpus encoding in several projects, and on TEI encoding for CMC in the TEI-SIG "computer-mediated communication".
Laura Herzberg
Laura is a researcher and lecturer at the Department of German Linguistics (Prof. Dr. Angelika Storrer), University of Mannheim, Germany.
Her research interests are corpus linguistics, linguistic aspects of computer-mediated communication, analysis of interaction signs in written corpora, cross-lingual analysis of Wikipedia talk pages, extraction and evaluation of CMC-language data, classifying and tagging of interaction signs, and editing of chat protocols in oXygen (XML editor).
Laura has been working on TEI-encoding of CMC data in the CLARIN-D curation project "ChatCorpus2CLARIN".
Stefania Spina: Learner Corpus Research and the acquisition of Italian as a second language: the case of the Longitudinal Corpus of Chinese Learners of Italian (LoCCLI).
Societal and demographic changes have contributed to increasing bi- and multilingualism in European countries in recent years, and communication on social media platforms such as Twitter reflects this linguistic diversity. While high rates of English use online have been attested for many European countries by survey research, relatively little work has quantified the extent to which English is used on social media in European contexts. In this study, English use and bilingualism with English in Europe are investigated on Twitter.
A large corpus of Twitter messages with geographical metadata was created by accessing the Twitter APIs. After language detection and filtering, linguistic profiles for European countries were determined and the behavior of bi- and multilingual users examined. The analysis supports some previous findings that suggest that a large-scale language shift towards English may be ongoing in Europe. Geographical differences shed light on the dynamics of this process.
Steven Coats
We present a series of experiments to fit a part-of-speech (PoS) tagger to the task of tagging extremely infrequent PoS tags for which we have only a limited amount of training data.
The objective is to implement a tagger that tags this phenomenon with a high degree of correctness in order to be able to use it as a corpus query tool on plain text corpora, so that new instances of this phenomenon can be easily found in plain text.
We focused on avoiding manual annotation as much as possible and experimented with altering the frequency weight of the PoS tag of interest in the small training data set we have.
This approach was compared to adding machine-tagged training data in which only the phenomenon of interest is manually corrected.
We find that adding more training data is unavoidable, but machine-tagging data and hand-correcting the tag of interest is sufficient.
Furthermore, the choice of the tagger plays an important role as some taggers are equipped to deal with rare phenomena more adequately than others.
The best trade-off between precision and recall of the phenomenon of interest was achieved by a separation of the tagging into two steps.
An evaluation of this phenomenon-fitted tagger on social media plain-text confirmed that the tagger serves as a useful corpus query tool that retrieves instances of the phenomenon including many unseen ones.
Tobias Horsmann, Michael Beißwenger and Torsten Zesch
Emoticons play an important role in digital written communication: they can serve as markers either of emotions or familiarity, and they can intensify or downgrade the pragmatic force of a text.
The aim of this study is to investigate the use of emoticons on Twitter by Italian users, and to verify, by relying on corpus data and on statistical methodologies, some of the prevailing opinions on the use of emoticons: that they are technically-driven resources, that they are mostly used by young people, and more often by females, and that they are superficial and easy ways of expressing emotions using images instead of words.
A mixed-effects model analysis has shown that the use of emoticons on Twitter is affected by a complex interaction of cultural, technological, situational and sociolinguistic variables.
Stefania Spina
This paper investigates the linguistically marked motives that participants attribute to those they call trolls in 991 comment threads of three British political blogs. The study is concerned with how these motives affect the discursive construction of trolling and trolls. Another goal of the paper is to examine whether the mainly emotional motives ascribed to trolls in the academic literature correspond with those that the participants attribute to the alleged trolls in the analysed threads. The paper identifies five broad motives ascribed to trolls: emotional/mental health-related/social reasons, financial gain, political beliefs, being employed by a political body, and unspecified political affiliation. It also points out that depending on these motives, trolling and trolls are constructed in various ways. Finally, the study argues that participants attribute motives to trolls not only to explain their behaviour but also to insult them.
Márton Petykó
The #Idéo2017 platform allows citizens to analyze the tweets of the 11 candidates in the 2017 French presidential election. #Idéo2017 processes the candidates' messages by creating a corpus in almost real time. Using tool-supported linguistic techniques, #Idéo2017 provides the main characteristics of the corpus and of the use of the political lexicon, and allows comparisons between the different candidates.
Julien Longhi, Claudia Marinica, Nader Hassine, Abdulhafiz Alkhouli and Boris Borzic
The present paper deals with Flemish adolescents' informal computer-mediated communication (CMC) in a large corpus (2.9 million tokens) of chat conversations. We analyze deviations from written standard Dutch and possible correlations with the teenagers' gender, age and educational track. The concept of non-standardness is operationalized by means of a wide range of features that serve different purposes, related to the chatspeak maxims of orality, brevity and expressiveness. It will be demonstrated how the different social variables impact on non-standard writing, and, more importantly, how they interact with each other. While the findings for age and education correspond to our expectations (more non-standard markers are used by younger adolescents and students in practice-oriented educational tracks), the results for gender (no significant difference between girls and boys) do not: they call for a more fine-grained analysis of non-standard writing, in which features relating to different chat principles are examined separately.
Lisa Hilte, Reinhild Vandekerckhove and Walter Daelemans
This paper reports on a corpus-based analysis of demonym mentions in the corpus of Slovene tweets. First, we analyze the frequency of mentions of the demonyms for the inhabitants of the European and G8 countries. Then, we focus on the representation of demonyms for residents of Slovenia’s neighboring countries: Austria, Italy, Hungary and Croatia. The main topic of the tweets mentioning Croatians, Austrians and Italians is sport, whereas Hungarians occur most often in relation to Eurovision. Some economic and political issues are also represented, such as the selling of Slovene companies to foreign firms, the refugee crisis and the arbitration procedure between Slovenia and Croatia. A collocation analysis revealed a highly stereotypical treatment of the neighboring nations and hostility of some Slovene Twitter users to inhabitants of Slovenia’s neighboring countries.
Taja Kuzman and Darja Fišer
The present study compares four computer-mediated conversational registers – comments, FB groups, FB status and tweets – and spoken conversations from Pakistani and US English using Biber's Multidimensional Analysis framework on three dimensions of variation, i.e. (i) Interactive versus Descriptive Explanatory Discourse, (ii) Expression of Stance, and (iii) Informational Focus versus 1st Person Narrative. Spoken conversations have a high score on dimension 2, while CM conversations show register and regional variation on dimensions 1 and 3. FB groups are significantly different in both regional varieties, followed by FB status, comments and tweets. Pakistani FB groups discuss self-help related topics, and appear to be slightly interactive and highly informational, while the US ones are interactive and narrative, discussing community and political issues. Pakistani FB status and tweets use English mainly for informational purposes, while the US counterparts have an interactive and personal orientation, indicating a wider functional role of English.
Muhammad Shakir and Dagmar Deuber
The current study addresses the definition of a protocol for collecting and storing data and for describing a databank (in a simple and generic way). In particular, the transparency of a form aimed at gathering information about the pedagogical context of oral telecollaboration for language learning named Teletandem (TT; Telles, 2006) will be tested before it is distributed more widely. To uncover problems in submitting information, the quality and reliability of the data-input triggers were tested by interviewing professors and language instructors who will be involved in a preliminary phase of the Teletandem corpus implementation. The general goals of the study are to enlarge the research group, to increase the amount of data and to improve efficiency in data collection.
Paola Leone
The paper presents results of a case study that compared the usage of OKAY across genre types (Wikipedia articles vs. talk pages), across media (spoken vs. written interaction), and across languages (German vs. French CMC data from Wikipedia talk pages). The cross-genre study builds on the results of Herzberg 2016, who compared the usage of OKAY in German Wikipedia articles with its usage in Wikipedia talk pages. These results also form the basis for comparing the CMC genre of Wikipedia talk pages with occurrences of OKAY in the German spoken language corpus FOLK. Finally, we compared the results on the usage of OKAY in German Wikipedia talk pages with the usage of OKAY in French Wikipedia talk pages. With our case study, we want to demonstrate that it is worthwhile to investigate interaction signs across genres and languages, and to compare the usage in written CMC with the usage in spoken interaction.
Laura Herzberg and Angelika Storrer
Today’s youths are continuously engaged with social media. The informal language they use in computer-mediated communication (CMC) often deviates from spelling and grammar rules of the standard language. Therefore, parents and teachers fear that social media have a negative impact on youths’ literacy skills. This paper examines whether such worries are justifiable. An experimental study was conducted with 500 Dutch youths of different educational levels and age groups, to find out if social media affect their productive or perceptive writing skills. We measured whether chatting via WhatsApp directly impacts the writing quality of Dutch youths’ narratives or their ability to detect ‘spelling errors’ (deviations from Standard Dutch) in grammaticality judgement tasks. The use of WhatsApp turned out to have no short-term effects on participants’ performances on either of the writing tasks. Thus, the present study gives no cause for great concern about any impact of WhatsApp on youths’ school writing.
Lieke Verheijen and Wilbert Spooren
As a consequence of a recent curation project, the Dortmund Chat Corpus is available in CLARIN-D research infrastructures for download and querying. A legal opinion had recommended that standard anonymisation measures be applied to the corpus before it could be republished. This paper reports on the anonymisation campaign that was conducted for the corpus. Anonymisation was realised as categorisation; the taxonomy of anonymisation categories is introduced, and the method of applying it to the TEI files is demonstrated. The results of the anonymisation campaign as well as issues of quality management are discussed. Finally, pseudonymisation is discussed as a general alternative to categorisation for anonymising CMC data, along with possibilities for (partially) automating the process.
Harald Lüngen, Michael Beißwenger, Laura Herzberg and Cathrin Pichler
The paper deals with the sociolinguistic concept of prestige imbued in the notion of standard language, and the social status connected to the inherent language skill (or lack thereof). To this end, we analyse Slovenian tweets pertaining to language use and the (in-)correctness of other users’ use of language, and propose a typology, focusing especially on cases where language use is invoked as an argument against someone’s qualifications or beliefs.
Damjan Popič and Darja Fišer
The paper reports on the results of a scientific colloquium dedicated to the creation of standards and best practices which are needed to facilitate the integration of language resources for CMC stemming from different origins and the linguistic analysis of CMC phenomena in different languages and genres. The key issue to be solved is that of interoperability – with respect to the structural representation of CMC genres, linguistic annotations, metadata, and anonymization/pseudonymization schemas. The objective of the paper is to convince more projects to partake in a discussion about standards for CMC corpora and for the creation of a CMC corpus infrastructure across languages and genres. In view of the broad range of corpus projects which are currently underway all over Europe, there is a great window of opportunity for the creation of standards in a bottom-up approach.
Michael Beißwenger, Ciara R. Wigham, Carole Etienne, Darja Fišer, Holger Grumt Suárez, Laura Herzberg, Erhard Hinrichs, Tobias Horsmann, Natali Karlova-Bourbonus, Lothar Lemnitzer, Julien Longhi, Harald Lüngen, Lydia-Mai Ho-Dac, Christophe Parisse, Céline Poudat, Thomas Schmidt, Egon Stemle, Angelika Storrer and Torsten Zesch
With the growing volume and importance of computer-mediated communication, the need to understand its linguistic and social dimensions, along with the need for CMC-robust language technologies, is on the rise as well. This is reflected in the increasing number of conferences, projects and positions involving analysis of CMC in a wide range of disciplines in Digital Humanities, Social Sciences and Computer Science. As a result, a number of valuable CMC corpora, datasets and tools are being developed but unfortunately, due to non-negligible technical, legal and ethical obstacles, not many are being shared and reused.
Since it is the mission of CLARIN to create and maintain an infrastructure to support the sharing, use and sustainability of language data and tools for researchers in Digital Humanities and Social Sciences, it is our goal to have a good overview of the available resources and tools, and to offer support both to their developers, so that they can overcome the technical, legal and ethical obstacles and deposit them in the CLARIN infrastructure, and to those interested in using them: researchers with diverse backgrounds, such as linguistics, media studies, psychology etc., but also interested parties from the educational, commercial, political, medical and legal sectors of society.
The first step in this direction was an interdisciplinary workshop on the creation and use of social media which was organized within the Horizon 2020 CLARIN-PLUS project on 18 and 19 May 2017 in Kaunas, Lithuania. The aims of the workshop were to demonstrate the possibilities of social media resources and natural language processing tools for researchers with a diverse research background and an interest in empirical research of language and social practices in computer-mediated communication, to promote interdisciplinary cooperation possibilities, and to initiate a discussion on the various approaches to social media data collection and processing.
The workshop also served as a platform to conduct a survey of corpora, datasets and tools of computer-mediated communication in the languages spoken in countries that are members and observers of CLARIN ERIC. Apart from identifying the existing resources and tools, our motivation was to establish to what extent they are accessible through the CLARIN infrastructure and how the information about them and their accessibility could be further optimized from a user perspective.
In this talk, we will give an overview of the identified corpora, the smaller, more focused datasets and tools that are tailored to processing computer-mediated communication. The focus of the talk will be on the comprehensiveness of the provided metadata, level of availability and accessibility of the identified resources and tools and the degree of their actual or potential inclusion in the CLARIN infrastructure. We will also discuss the simple and long-term possibilities of enriching the current state of the infrastructure and provide guidelines for creating and depositing CMC resources with a CLARIN center.
Darja Fišer
Darja Fišer is Assistant Professor at the Faculty of Arts, University of Ljubljana, currently active in the fields of computer-mediated communication and lexical semantics using corpus-linguistics methods and natural language processing. She is a member of the Management Committee of CLARIN.SI where she is in charge of user involvement at the national level and also a member of the CLARIN ERIC User Involvement Working Group.
This poster gives an overview of the Italian national CLARIN consortium and the status of CLARIN-IT in general. It thus discusses the current state of affairs of the consortium and provides information on the members, especially with regard to what they offer to CLARIN in terms of resources, services and expertise, and what CLARIN offers them to further their own research.
Alexander König and CLARIN-IT colleagues
The poster reports on intermediate results of MoCoDa 2, an ongoing project funded by the Ministry for Innovation, Science, Research and Technology of the German federal state North Rhine-Westphalia in which we are developing a database and web frontend for the repeated, donation-based collection of CMC interactions from smartphone messaging apps like WhatsApp. The database shall serve as a resource not only for quantitative but also for qualitative approaches in the analysis of CMC. MoCoDa 2 builds on experiences from the preceding project MoCoDa which has collected a (relatively small) set of 2,198 interactions with 19,161 user posts or ca. 193,000 tokens since 2012. For MoCoDa 2 the database and web frontend will be re-implemented from scratch and expanded with additional functions and features:
- A form for donating and editing the data, which involves the donors in the editing and anonymization process and assists them with capturing metadata on the context and topic of the donated sequences as well as on the interlocutors and their social relations. Anonymization will follow an anonymization guideline developed in the CLARIN-D curation project ChatCorpus2CLARIN.
- Part-of-speech annotations which comply with the extended ‘STTS 2.0’ tagset for German CMC and which will be created using a toolchain provided by the Language Technology Lab (LTL) at the University of Duisburg-Essen.
- A TEI export for the collected data on the basis of the ‘CLARIN-D TEI schema for CMC’.
Through adopting the STTS 2.0 tagset and a TEI-based export format the corpus data will be interoperable with corpora that are already part of the CLARIN-D corpus infrastructure at the Institute for the German Language (IDS) in Mannheim. To allow for comparative analyses of the MoCoDa 2 data with the discourse found in text corpora and in other CMC corpora, MoCoDa 2 will not only be made available as a standalone resource but also be integrated into the German Reference Corpus (DeReKo) at the IDS Mannheim.
Michael Beißwenger, Marcel Fladrich, Wolfgang Imo and Evelyn Ziegler
/l/-vocalization is a feature normally found in a rather clear-cut area in the western part of Switzerland. Its geographical boundaries are well documented, as are social influences on the realisation of the variants. This study, based on a corpus of authentic WhatsApp messages, takes another approach by documenting isolated forms of /l/-vocalization outside the area traditionally attributed to the feature.
Simone Ueberwasser
As social networking sites have become staples in everyday life, an increasing number of people worldwide use social media as a source of news. To reach these audiences, news organisations and public service broadcasters have ventured onto services such as Facebook, which in terms of news is by far the most important social networking site in many parts of Europe. This poster presentation explores the ways in which public service media from different European countries are delivering news on public Facebook Pages. The analysis is based on public data gathered from different Facebook pages operated by national broadcasting agencies. The data are extracted using the public Facebook Graph API. The corpus contains all the posts and comments of the Facebook Pages as well as related metadata. No personally-identifiable information is collected.
The social media data are explored using statistical research methods to identify and compare different usage patterns and to visualise the reactions of Facebook users. This provides an overview of the different forms of content (i.e. types of posts) and the basic communicative practices that can be observed in the context of the Facebook Pages (i.e. number of comments, shares, likes and Reaction types). To allow deeper insights, an exploratory case-study approach is used. Drawing upon media linguistic research, the focus is on the micro level of the media texts and their multimodal design. The in-depth analysis aims to characterise different forms of news reporting via Facebook and looks at the different usage of multimodal resources in the context of the Facebook posts and comments.
This combination of qualitative and quantitative methods should allow a better understanding of how Facebook is used as a means of news distribution by public service media providers on a large scale and how technical affordances shape the design of news content and follow-up interactions. This knowledge is critical for the discussion of the emerging role of social media in the context of public opinion and political decision-making. The poster presents the project as work in progress and shows preliminary findings.
Daniel Pfurtscheller