Offensive Language in the German-speaking Twittersphere


This is the first installment accompanying the launch of Hexis API, and it is intended to show how real data, rather than hand-picked examples, looks through the lens of our offensive language classifier (which is also available for English). We will use a few snippets of Python code to analyze the data and illustrate our findings.

About the data

The basis for this investigation is a sample of Twitter messages, or tweets, collected on a single day in early December. The sample consists of a thousand randomly selected tweets for each of the 93 metropolitan areas with more than 100,000 inhabitants across Germany, Austria, and Switzerland. Many messages will already have been caught by Twitter's moderation system before they make it into our dataset. Nevertheless, we can use automatic classification to explore the cases that are still out in the open.

Disclaimer: This is by no means a swipe at Twitter and its precautions against hate speech and offensive language. We love the openness of the platform, which is what made it possible to conduct this research quickly in the first place. We see all of this as part of a larger societal problem, to which we can offer only a piece of the solution.

Loading the data

We assume the data has already been gathered. Given its sensitive nature, we are looking into sharing the complete dataset (including classification scores) at a later time. In the meantime, you can easily recreate the following steps using any other Twitter dataset, or even other kinds of text. So, without further ado, let's start preparing the data.

We assume data.csv to have the following format:

tweet near latitude longitude
...
"Über Qualitätsjournalisten, Arsch und Grundeis Folge 2 URL …" Herne 51.5380394 7.219985
"So etwas lesen zu müssen, macht mich unendlich traurig. Ist #Deutschland wirklich wieder so weit dunkle Kapitel zu wiederholen?" Herne 51.5380394 7.219985
"Edi Glieder ein Schalke-Flop? Das ist eine bodenlose Frechheit!!!!!!!! URL …" Herne 51.5380394 7.219985
"Kommt in deine KGB Akte, du Loser! Und Milli bleibt Genosse, wie $adam Hü$$€¥N oder MATERNUS PILS!!!" Herne 51.5380394 7.219985
"Guten Morgen verehrte TL ☕️" Herne 51.5380394 7.219985
"Sicher. Aber bis so etwas möglich ist ... Hilfe, Aufarbeitung ist ein komplexer, vielschichtiger und langwieriger Prozess, der meist Psychotherapie benötigt. Oft dauert es Jahre. Kurze Empfehlungen gibt es da nicht sondern können irritieren, weil es auf das konkrete Kind ankommt" Herne 51.5380394 7.219985
"USER sind sie schon daaaa. Schon alles fertig Ganz nervös hier.😁😁😂🤪" Herne 51.5380394 7.219985
...


In order to use this data, we first load it into a spreadsheet-like data structure called a DataFrame.

import pandas as pd

# read the CSV into a DataFrame; the first row holds the column names
df = pd.read_csv('data.csv', header=0, engine='python')
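
A quick sanity check never hurts; with 93 metropolitan areas at a thousand tweets each, the sample should contain 93,000 rows:

print(len(df))   # expected: 93000 (93 areas x 1,000 tweets each)
print(df.head()) # the first few rows, in the format shown above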

Classification

Now that we have our data in place, it's time to get started with classification. This is a straightforward process: we will use the Hexis API endpoint (documentation), which expects a piece of JSON as input. The input should contain an element named text, whose value is the tweet's text content. The endpoint will in turn respond with an aggressiveness score for each text. You can get a score for pieces of text up to 100 words long. Of course, it can also make sense to analyze longer texts; just know that they will need to be split (or will get split automatically by the API).

Note: If you don't have an API key yet, head over to the main site and create one for free.
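
To get a feel for the request and response shapes before processing the whole dataset, here is a minimal single call (the example tweet is taken from our sample above, and the printed score is purely illustrative):

import requests

response = requests.post(
    'https://api.hexis.ai/off/de/v1',
    json={'text': 'Guten Morgen verehrte TL'},            # one harmless example tweet
    headers={'Authorization': 'Bearer your_api_key_here'}  # replace with your own key
)
print(response.json())  # e.g. {'scores': [0.003]} -- one score per text chunk

With that working, we can score every tweet in our DataFrame: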

import requests

URL = 'https://api.hexis.ai/off/de/v1'
API_KEY = 'your_api_key_here'

header = {'Authorization': 'Bearer ' + API_KEY}

for i, row in df.iterrows():
    # flatten newlines; requests takes care of JSON escaping for us
    text = row['tweet'].replace('\n', ' ')

    # send to API
    data = {'text': text}
    response = requests.post(URL, json=data, headers=header)

    # process response and store the score in a new column
    score = response.json()['scores'][0]
    df.loc[i, 'score'] = score
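
A note on the design: one request per tweet is the simplest thing that works and is perfectly fine at this scale. If you ever run this over a much larger corpus, you may want to reuse the HTTP connection and skip transient failures instead of crashing mid-run. A minimal sketch (the timeout value and the decision to simply skip failed rows are our own assumptions, not API requirements):

import requests

session = requests.Session()  # reuses the underlying connection across requests
session.headers.update({'Authorization': 'Bearer ' + API_KEY})

for i, row in df.iterrows():
    text = row['tweet'].replace('\n', ' ')
    try:
        response = session.post(URL, json={'text': text}, timeout=10)
        response.raise_for_status()
        df.loc[i, 'score'] = response.json()['scores'][0]
    except requests.RequestException:
        continue  # leave the score empty (NaN) for rows we could not classify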

The DataFrame df should now look like this:

tweet near latitude longitude score
...
"Über Qualitätsjournalisten, Arsch und Grundeis Folge 2 URL …" Herne 51.5380394 7.219985 0.9908769726753236
"So etwas lesen zu müssen, macht mich unendlich traurig. Ist #Deutschland wirklich wieder so weit dunkle Kapitel zu wiederholen?" Herne 51.5380394 7.219985 0.018021326512098312
"Edi Glieder ein Schalke-Flop? Das ist eine bodenlose Frechheit!!!!!!!! URL …" Herne 51.5380394 7.219985 0.9930542111396791
"Kommt in deine KGB Akte, du Loser! Und Milli bleibt Genosse, wie $adam Hü$$€¥N oder MATERNUS PILS!!!" Herne 51.5380394 7.219985 0.9801527857780457
"Guten Morgen verehrte TL ☕️" Herne 51.5380394 7.219985 0.0029989404138177638
"Sicher. Aber bis so etwas möglich ist ... Hilfe, Aufarbeitung ist ein komplexer, vielschichtiger und langwieriger Prozess, der meist Psychotherapie benötigt. Oft dauert es Jahre. Kurze Empfehlungen gibt es da nicht sondern können irritieren, weil es auf das konkrete Kind ankommt" Herne 51.5380394 7.219985 0.30031758546829224
"USER sind sie schon daaaa. Schon alles fertig Ganz nervös hier.😁😁😂🤪" Herne 51.5380394 7.219985 0.03731132298707962
...

Visualization

Next, we will employ the visualization framework Altair/Vega-Lite alongside freely available TopoJSON files to produce a geographical mapping of our data.

import altair as alt

TOPOJSON_URL = 'https://raw.githubusercontent.com/deldersveld/topojson/master/countries/germany/dach.json'
# average latitude, longitude, and score per area; keep 'near' as a
# regular column (not the index) so Altair can use it in the tooltips
map_data = df.groupby('near', as_index=False)[['latitude', 'longitude', 'score']].mean()
topo = alt.topo_feature(TOPOJSON_URL, feature='layer')

background = alt.Chart(topo).mark_geoshape(
    fill='lightgray',
    fillOpacity=1.0,
    stroke='white',
    strokeWidth=1
).properties(
    width=500,
    height=500
).project('mercator')

points = alt.Chart(map_data).mark_circle().encode(
    longitude='longitude:Q',
    latitude='latitude:Q',
    size=alt.Size('score:Q', title='score (mean)'),
    color='score:Q',
    tooltip=['near:N','score:Q']
).properties(title='Aggression in the German-speaking Twittersphere')

plot = background + points
plot.save('map.html')

This results in a nice interactive map in HTML format.

It's interesting to see the distribution of higher average scores across the historically more industrial areas. Nevertheless, averages don't give us the full picture. Let's use our data to produce a bar chart, displaying each metropolitan area and its individual scores, split into bins of 10% each.

df.columns = ['tweet', 'Area', 'latitude', 'longitude', 'Score']  # rename columns for plotting

plot = alt.Chart(df).mark_bar().encode(
    x='count(Score):Q',
    y=alt.Y(
        'Area:N',
        sort=alt.EncodingSortField(field='Score', op='sum', order='descending')
    ),
    color=alt.Color('Score:Q', bin=alt.Bin(maxbins=10))
).properties(title='Scores by metropolitan area')

plot.save('barchart.html')

This produces a rather large but quite informative view of our data, making it possible to see the full gradient of aggressiveness in our dataset. Roughly 4% of all tweets are problematic (corresponding to scores in the 0.9 to 1.0 bin) and should be looked at closely by human moderators in order to avoid any liabilities on the part of the platform.
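
The 4% figure is easy to verify directly from the DataFrame (0.9 is the lower edge of the top bin in the chart above):

share = (df['Score'] >= 0.9).mean()  # fraction of tweets scoring in the top bin
print(f'{share:.1%} of tweets fall into the 0.9-1.0 range')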

Wrapping up

We hope you enjoyed this small, ad hoc investigation into the underbelly of the German-speaking Twittersphere. As with any empirical approach aiming at valid conclusions, data collection and curation remains the hardest part. On the other hand, if you already have the data, chances are there is a lot of it, and a capable machine can help, so that you, the expert, only have to look at a fraction of it; given a system tuned to your target domain, that fraction amounts to the borderline cases only. We think that should be a breeze. That's why we've created Hexis API.

Stay tuned for our next investigation.