To execute the following code, we need to load three extra packages.
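The loading chunk itself is not reproduced here. Judging from the functions used below (str_detect, the %>% pipe with mutate/group_by/summarize, and ggscatter), the three packages are presumably the following (an assumption, not a confirmed list):
library(stringr)  # str_detect(), str_subset()
library(dplyr)    # %>%, mutate(), group_by(), summarize()
library(ggpubr)   # ggscatter()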
We first import a list of all German text files, giving us a list of all the files in the folder. This list is the object de_filelist that we created.
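A minimal sketch of such an import (the file-name pattern is an assumption; only the object name de_filelist is given by the surrounding text). The English list en_filelist is created analogously:
de_filelist <- list.files(pattern = "^DE.*\\.txt$")  # assumed naming pattern
length(de_filelist)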
## [1] 221
Then, we apply the function that imports the actual data sets to the entire list of German text files. This corresponds to the object de_files that we created, which is a list of all 221 data frames for German.
de_files <- lapply(de_filelist, function(x) {
read.delim(x, header = FALSE, sep = "\t", blank.lines.skip = FALSE)
})
length(de_files)
## [1] 221
In order to access an element from a list, we use square brackets, as in the following command, which outputs a single German text from our list of German texts.
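The command itself is not reproduced above; a minimal example of double-bracket indexing (picking the first text is an arbitrary choice):
de_files[[1]]  # the first German text as a data frame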
The next bit of code does the same for the English texts: it imports all the English texts and creates a list called en_files that contains them as data frames.
## [1] 92
en_files <- lapply(en_filelist, function(x) {
read.delim(x, header = FALSE, sep = "\t", blank.lines.skip = FALSE)
})
length(en_files)
## [1] 92
We now need to add the filenames as a column to be able to classify the text data frames as written/spoken and formal/informal and to be able to identify the four texts that belong to the same speaker.
This is what the next bit of code does: it goes through the lists of English and German texts, merges each text with its filename (creating an additional column), and converts the lists into two large data frames, one for German and one for English.
To those two data frames, we can then add column names to be able to access the important parts of the data set, but also to be able to subset by modality, by speaker, by sentence, etc.
for (i in 1:length(de_files)) {
  de_files[[i]] <- cbind(de_files[[i]], de_filelist[i])
}
de_files2 <- do.call("rbind", de_files)
colnames(de_files2) <- c("word_ID", "text", "lemma", "POS", "POS_2", "morphosyntax", "dependency", "syntax", "add_1", "add_2", "file")
for (i in 1:length(en_files)) {
  en_files[[i]] <- cbind(en_files[[i]], en_filelist[i])
}
en_files2 <- do.call("rbind", en_files)
colnames(en_files2) <- c("word_ID", "text", "lemma", "POS", "POS_2", "morphosyntax", "dependency", "syntax", "add_1", "add_2", "file")
The next lines of code add the column language, specifying the language for each line of the two data frames. Having an additional column that indicates the language, we can now bind the two data frames together, giving us one large data frame that contains all German and English texts.
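A sketch of those lines (only the object name all_files is given by the later chunks; assigning the language before binding is an assumption):
de_files2$language <- "German"
en_files2$language <- "English"
all_files <- rbind(de_files2, en_files2)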
In addition, we need two more columns that will come in handy when subsetting the data for modality and register:

- register, specifying the register for each line of the data frame (this is done by checking the file column: if it contains an f followed by two other letters and .txt, the register is formal; if not, it is informal)
- modality, specifying the modality for each line of the data frame (done as for the register column, mutatis mutandis, searching for an s followed by a letter and .txt)

all_files$register <- ifelse(str_detect(all_files$file, "f.{2}\\.txt"),
                             "formal",
                             "informal")
all_files$modality <- ifelse(str_detect(all_files$file, "s.\\.txt"),
                             "spoken",
                             "written")
We can now take a look at the first and last lines of our data frame to make sure that everything was added correctly.
In order to count words per sentence, we need to exclude those rows that contain punctuation, so that the number of rows per file corresponds to the number of words per text.
We also want to exclude those rows that contain extra-linguistic material or other languages.
To be able to better keep track of the sentences, let’s create an additional column with a sentence ID that gives each sentence an individual number, starting with 1, and incrementing by +1 for each new sentence.
To do so, we define a counter that starts at 1. We will use this counter in a loop later, adding +1 for each new sentence.
We also need to define a column that will contain our sentence ID. For now, we assign it the value 1 for all rows.
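In code, these two preparations can look like this (a minimal sketch that follows directly from the description; the names counter and sentence_id are the ones used by the loop below):
counter <- 1               # sentence counter, incremented at each boundary
all_files$sentence_id <- 1 # placeholder sentence ID for every row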
We now need a loop that goes through every row of the data set: every time it encounters an empty cell in the POS column, it adds +1 to the counter. In addition, in the loop we set the value of the sentence_id column to the value of the counter. Since the counter is incremented by +1 every time it sees a sentence boundary, the sentence_id column will in the end have a different number for each sentence.
fun_sent_helper <- function(x) {
  for (i in 1:nrow(x)) {
    # an empty POS cell marks a sentence boundary: increment the counter
    if (x[i, "POS"] == "") {
      counter <- counter + 1
    }
    # every row gets the current counter value as its sentence ID
    x[i, "sentence_id"] <- counter
  }
  return(x)
}
all_texts <- split(all_files, all_files$file)
all_texts <- lapply(all_texts, fun_sent_helper)
length(all_texts)
## [1] 313
Let’s see if the loop properly generated the sentence IDs in the sentence_id column. Look at the rightmost column (you may need to click on the right arrow to make it visible) of the informal-spoken text: it starts a new sentence_id with every new sentence.
Since we can now keep track of single sentences, we no longer need the otherwise empty rows. The following command deletes those additional rows.
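A sketch of this deletion (an assumption; only the two row counts below are given). The boundary rows are those with an empty POS cell, and we re-split by file so that all_texts stays in sync:
all_files <- do.call(rbind, all_texts)        # back to one large data frame
nrow(all_files)                               # rows including boundaries
all_files <- all_files[all_files$POS != "", ] # drop the empty boundary rows
nrow(all_files)                               # rows after the deletion
all_texts <- split(all_files, all_files$file)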
## [1] 52121
## [1] 47349
In order to compare the length of sentences across modalities and registers, we first need to calculate the length of sentences in words. We can then average across sentences per variety and compare those averages.
The first part splits the texts into a list of sentences, so that each sentence is its own data frame in the list. This way, we can perform the same operations on each single sentence.
The next chunk of code defines a function to count the words per sentence. Words are counted in numbers of rows, since the text contains one word per row.
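A minimal sketch of such a function (an assumption; the original definition may differ). For one text, it returns the number of rows, i.e. words, per sentence_id:
fun_sent_length <- function(x) {
  # one count per sentence: the rows sharing a sentence_id
  sapply(split(x, x$sentence_id), nrow)
}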
We now apply the word counting function to our list of sentences, add column names and a column with the information on the file name to be able to relate those counts to languages and varieties later.
length_sent <- lapply(all_texts, fun_sent_length) %>%
unlist() %>%
as.data.frame()
colnames(length_sent) <- c("sent_length")
length_sent$file <- rownames(length_sent)
The next chunk of code adds the register, modality, and language information to our data frame that contains the length of all sentences.
length_sent$register <- ifelse(str_detect(length_sent$file, "f.{2}\\.txt"),
"formal",
"informal")
length_sent$modality <- ifelse(str_detect(length_sent$file, "s.\\.txt"),
"spoken",
"written")
length_sent$language <- ifelse(str_detect(length_sent$file, "DE"),
"German",
"English")
To make the results of sentence lengths interpretable, we can summarize the sentence lengths for language, modality, and register groups. We do that by averaging over the lengths of all sentences that belong to the same language-modality-register combination.
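A sketch of that summarizing step (the object name sent_length_summary is a hypothetical choice):
sent_length_summary <- length_sent %>%
  group_by(language, modality, register) %>%
  summarize(mean_sent_length = round(mean(sent_length), 2)) %>%
  as.data.frame()
sent_length_summary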
The table below shows the length of sentences (in words) for each of the 8 combinations we are interested in.
We may also want to compare the average length of texts (in number of words) across the 8 varieties.
To do so, the next code chunk adds a column to the data set that counts the number of words per text.
fun_text_length <- function(x) {
x <- mutate(x, word_id = row_number())
x
}
text_length <- lapply(all_texts, fun_text_length)
all_files <- as.data.frame(do.call(rbind, text_length))
With this counter, we can calculate the average text lengths in number of words across the 8 varieties:
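A sketch of this calculation (the object name text_length_summary is hypothetical): the length of a text is the highest word_id it contains, and those lengths are then averaged per variety:
text_length_summary <- all_files %>%
  group_by(language, modality, register, file) %>%
  summarize(text_length = max(word_id)) %>% # words per text
  summarize(mean_text_length = round(mean(text_length), 2)) %>%
  as.data.frame()
text_length_summary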
The total number of sentences simply corresponds to the highest value of the sentence_id column.
Calculating the number of clauses, of specific clause types, and the ratio of main and subordinate clauses requires a number of different steps.
The relevant annotation is mostly contained in the syntax column. The tags used are those of the Universal Dependency treebanks. The UD treebanks are a large, crosslinguistic collection of corpora (texts) that are syntactically annotated.
What we need are the tags that they use for coordination and subordination. You can find the detailed documentation of the tag set as well as explanations here: https://universaldependencies.org/u/overview/complex-syntax.html#subordination
Those are the tags that we are interested in:
tag | explanation | example(s)
---|---|---
cc | conjoined clauses (technically only coordination, although also used for subordination) | to you and the car and that crazy guy
acl | adnominal clause modifier | in the end everybody’s ok which is really what matters
advcl | adverbial clause | the other day you called me when something crazy happened
csubj | clausal subject | hitting the dog had to break the first car; to get together these two cars were coming down the street
xcomp | complement clause with obligatory control | they seemed fine; I found that really interesting
ccomp | complement clause without control | I know that they’re fine; I see a minor accident happen
rel | relative clause pronoun | check your own damage which is what this driver did
Let’s take a look at some examples.
Relative clauses are less consistently annotated. We can combine the counts for the following tags to estimate the number of relative clauses:

- the POS_2 column contains a PRELS tag to signal relative markers
- the morphosyntax column contains information on the pronoun type; in most cases, relative pronouns are marked as such by PronType=Rel
- the syntax column mostly uses acl for relative clauses as well

all_files[all_files$POS_2=="PRELS" | all_files$syntax=="acl" | str_detect(all_files$morphosyntax, "PronType=Rel"), ] %>%
  sample_n(20)
The next chunk of code calculates the number of those clause types per text and saves the counts as a data frame.
clause_counts <- lapply(all_texts, function(x){
  # total number of sentences = highest sentence ID in the text
  n_sent <- max(x$sentence_id)
  # counts of the individual clause types, based on the syntax column
  n_cc <- nrow(x[x$syntax=="cc", ])
  n_advcl <- nrow(x[x$syntax=="advcl", ])
  n_csubj <- nrow(x[x$syntax=="csubj", ])
  n_comp <- nrow(x[x$syntax=="xcomp", ]) + nrow(x[x$syntax=="ccomp", ])
  # relative clauses: combine the three markers established above
  n_rel <- nrow(x[x$POS_2=="PRELS" | x$syntax=="acl" | str_detect(x$morphosyntax, "PronType=Rel"), ])
  file <- as.character(x$file[1])
  d <- cbind(n_sent, n_cc, n_advcl, n_csubj, n_comp, n_rel, file)
  d
})
clause_counts <- as.data.frame(do.call(rbind, clause_counts))
The head of this data frame looks like this:
Again, to be able to compare the number of different types of clauses across varieties, we need to add language, modality, and register information to the data frame containing the clause counts.
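The chunk doing so is analogous to the ones above; a sketch:
clause_counts$register <- ifelse(str_detect(clause_counts$file, "f.{2}\\.txt"),
                                 "formal",
                                 "informal")
clause_counts$modality <- ifelse(str_detect(clause_counts$file, "s.\\.txt"),
                                 "spoken",
                                 "written")
clause_counts$language <- ifelse(str_detect(clause_counts$file, "DE"),
                                 "German",
                                 "English")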
The next chunk of code first converts the counts from characters into numbers. With this, we can calculate the ratios of clause types per number of sentences that each text contains. Remember, this is important because the texts have different lengths.
To interpret the clause/sentence ratios for different types of clauses, we need to average and summarize across the 8 different varieties. This is done in the second part of the code chunk below.
clause_counts[, 1:6] <- lapply(clause_counts[, 1:6], as.character)
clause_counts[, 1:6] <- lapply(clause_counts[, 1:6], as.numeric)
clause_types_summary <- clause_counts %>%
group_by(language, modality, register) %>%
summarize(mean_cc = round(mean(n_cc/n_sent), 2),
mean_advcl = round(mean(n_advcl/n_sent), 2),
mean_comp = round(mean((n_comp)/n_sent), 2),
mean_csubj = round(mean(n_csubj/n_sent), 2),
mean_rel = round(mean(n_rel/n_sent), 2)
) %>%
as.data.frame
clause_counts_summary <- clause_counts %>%
group_by(language, modality, register) %>%
summarize(mean_subtotal = round(mean((n_advcl + n_csubj + n_comp + n_rel)/n_sent), 2)
) %>%
as.data.frame
The table below shows the summary of the various clause types in terms of how frequent they are in the 8 varieties we are comparing. Their frequency is represented relative to the number of sentences.
The next table shows the overall number of subordinate clauses in relation to the number of sentences across the 8 varieties.
CCONJ
There is another issue that comes with counting conjunctions, and one that we have ignored so far: many sentences start with and, but, or or. In the clause_counts_summary, those are included. This means we may have found a higher number of conjoined clauses than we actually have, since those clauses were counted as new sentences anyway. In order to exclude conjunctions that occur not within sentences but at the beginning of sentences from our clause counts, we can do the following.
We do not keep track of sentence-initial CCONJ only to exclude those from the clause counts: this also shows how often sentence-initial conjunctions occur across the 8 varieties that we want to compare.
We first split our data frame into a list of data frames such that each data frame corresponds to a single sentence and is a separate element in our list.
Now, we want to go through the list and check for each sentence separately whether its first element is CCONJ. This is what the next commands do. In addition, they create a data frame that relates each sentence to its file and keeps track of whether or not it is true for each sentence that it starts with CCONJ.
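A minimal sketch of those commands (the helper name all_sentences is an assumption; sent_init and its columns file and is_cconj are fixed by the summarizing code below):
all_sentences <- split(all_files,
                       list(all_files$file, all_files$sentence_id),
                       drop = TRUE)
sent_init <- lapply(all_sentences, function(x) {
  data.frame(file = as.character(x$file[1]),
             is_cconj = x$POS[1] == "CCONJ") # is the first word a CCONJ?
})
sent_init <- do.call(rbind, sent_init)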
Again, to make this interpretable, we group all sentences by their file and sum up the number of sentences that start with CCONJ.
sent_init_summary <-
sent_init %>%
group_by(file) %>%
summarize(sum(is_cconj==TRUE)) %>%
as.data.frame()
colnames(sent_init_summary) <- c("file", "sent_init_cconj")
The table below shows the number of sentence-initial CCONJ for all files in the data set. Some numbers are pretty high compared to others, suggesting that, indeed, it is important to look at their distribution across the different text varieties.
Again, to examine the number of sentence-initial conjunctions across the 8 varieties, we need to add register, modality, and language information to our summary data set.
sent_init_summary$register <- ifelse(str_detect(sent_init_summary$file, "f.{2}\\.txt"),
"formal",
"informal")
sent_init_summary$modality <- ifelse(str_detect(sent_init_summary$file, "s.\\.txt"),
"spoken",
"written")
sent_init_summary$language <- ifelse(str_detect(sent_init_summary$file, "DE"),
"German",
"English")
The next code chunk summarizes the counts of sentence-initial CCONJ across the 8 varieties.
sent_init_summary[, 2] <- as.character(sent_init_summary[, 2])
sent_init_summary[, 2] <- as.numeric(sent_init_summary[, 2])
sent_init_summary2 <- sent_init_summary %>%
group_by(language, modality, register) %>%
summarize(mean_sent_init_conj = round(mean(sent_init_cconj), 2)
) %>%
as.data.frame
sent_init_summary2
Let’s look at the types of conjunctions and subjunctions used in the 8 varieties in more detail.
The following code chunk selects all rows that are tagged as either CCONJ or SCONJ in POS, i.e. the rows that contain conjunctions and subjunctions. In addition, the German corpus uses KOUS in the POS_2 column to mark subjunctions.
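A sketch of this selection (an assumption; only the object names conj and subj and the two row counts below are given):
conj <- all_files[all_files$POS == "CCONJ", ]
subj <- all_files[all_files$POS == "SCONJ" | all_files$POS_2 == "KOUS", ]
nrow(conj)
nrow(subj)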
## [1] 2492
## [1] 646
We then split both the conj and subj data sets into 8 data sets according to language, modality, and register.
conj <- with(conj, split(conj, list(conj$language, conj$modality, conj$register)))
subj <- with(subj, split(subj, list(subj$language, subj$modality, subj$register)))
Now we can group conjunctions and subjunctions by lemma and rank the lemmas according to their frequency. The next two code chunks do that for the 8 varieties for both conjunctions and subjunctions.
conj_lemmas <- lapply(conj, function(x) {
x %>%
group_by(lemma) %>%
summarize(n_lemmas = n()) %>%
as.data.frame()
})
subj_lemmas <- lapply(subj, function(x) {
x %>%
group_by(lemma) %>%
summarize(n_lemmas = n()) %>%
as.data.frame()
})
The only thing left now is to rank the lemmas according to their frequency. This is what the next code chunk does, again for both conjunctions and subjunctions.
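The ranking chunk is analogous to the one used for the combined counts further below; a sketch:
conj_lemmas_list <- lapply(conj_lemmas, function(x) {
  helper <- x[order(-x$n_lemmas), ] # sort by descending frequency
  helper$rank <- c(1:nrow(x))       # rank 1 = most frequent lemma
  helper
})
subj_lemmas_list <- lapply(subj_lemmas, function(x) {
  helper <- x[order(-x$n_lemmas), ]
  helper$rank <- c(1:nrow(x))
  helper
})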
The 8 data frames below rank the conjunctions used in the texts. Additionally, they show the proportion for each lemma.
conj_lemmas_list <- lapply(conj_lemmas_list, function(x) {
mutate(x, prop = n_lemmas / sum(n_lemmas))
})
conj_lemmas_list
## $English.spoken.formal
## lemma n_lemmas rank prop
## 1 and 201 1 0.893333333
## 2 but 15 2 0.066666667
## 3 or 8 3 0.035555556
## 4 either 1 4 0.004444444
##
## $German.spoken.formal
## lemma n_lemmas rank prop
## 1 und 546 1 0.833587786
## 2 also 56 2 0.085496183
## 3 aber 21 3 0.032061069
## 4 oder 14 4 0.021374046
## 5 genau 4 5 0.006106870
## 6 sondern 3 6 0.004580153
## 7 jedoch 2 7 0.003053435
## 8 als 1 8 0.001526718
## 9 weil 1 9 0.001526718
## 10 beziehungsweise 1 10 0.001526718
## 11 ja 1 11 0.001526718
## 12 nur 1 12 0.001526718
## 13 währenddessen 1 13 0.001526718
## 14 denn 1 14 0.001526718
## 15 entweder 1 15 0.001526718
## 16 beziehungweise 1 16 0.001526718
##
## $English.written.formal
## lemma n_lemmas rank prop
## 1 and 112 1 0.9032258
## 2 but 6 2 0.0483871
## 3 or 6 3 0.0483871
##
## $German.written.formal
## lemma n_lemmas rank prop
## 1 und 285 1 0.907643312
## 2 als 10 2 0.031847134
## 3 oder 4 3 0.012738854
## 4 also 4 4 0.012738854
## 5 aber 2 5 0.006369427
## 6 bzw. 2 6 0.006369427
## 7 beziehungsweise 1 7 0.003184713
## 8 sondern 1 8 0.003184713
## 9 weder 1 9 0.003184713
## 10 jedoch 1 10 0.003184713
## 11 sowie 1 11 0.003184713
## 12 doch 1 12 0.003184713
## 13 Währendessen 1 13 0.003184713
##
## $English.spoken.informal
## lemma n_lemmas rank prop
## 1 and 234 1 0.850909091
## 2 but 29 2 0.105454545
## 3 or 11 3 0.040000000
## 4 because 1 4 0.003636364
##
## $German.spoken.informal
## lemma n_lemmas rank prop
## 1 und 464 1 0.768211921
## 2 also 49 2 0.081125828
## 3 aber 48 3 0.079470199
## 4 oder 28 4 0.046357616
## 5 weil 6 5 0.009933775
## 6 sondern 4 6 0.006622517
## 7 als 3 7 0.004966887
## 8 beziehungsweise 1 8 0.001655629
## 9 weder 1 9 0.001655629
##
## $English.written.informal
## lemma n_lemmas rank prop
## 1 and 82 1 0.88172043
## 2 but 10 2 0.10752688
## 3 or 1 3 0.01075269
##
## $German.written.informal
## lemma n_lemmas rank prop
## 1 und 156 1 0.772277228
## 2 aber 22 2 0.108910891
## 3 als 8 3 0.039603960
## 4 also 6 4 0.029702970
## 5 oder 3 5 0.014851485
## 6 doch 2 6 0.009900990
## 7 Also 2 7 0.009900990
## 8 Und 1 8 0.004950495
## 9 denn 1 9 0.004950495
## 10 Als 1 10 0.004950495
The 8 data frames below rank the subjunctions used in the texts. Additionally, they show the proportion for each lemma.
subj_lemmas_list <- lapply(subj_lemmas_list, function(x) {
mutate(x, prop = n_lemmas / sum(n_lemmas))
})
subj_lemmas_list
## $English.spoken.formal
## lemma n_lemmas rank prop
## 1 that 21 1 0.37500000
## 2 as 10 2 0.17857143
## 3 because 6 3 0.10714286
## 4 so 5 4 0.08928571
## 5 while 5 5 0.08928571
## 6 when 3 6 0.05357143
## 7 which 2 7 0.03571429
## 8 if 2 8 0.03571429
## 9 like 1 9 0.01785714
## 10 than 1 10 0.01785714
##
## $German.spoken.formal
## lemma n_lemmas rank prop
## 1 dass 45 1 0.288461538
## 2 als 33 2 0.211538462
## 3 wie 15 3 0.096153846
## 4 weil 14 4 0.089743590
## 5 um 7 5 0.044871795
## 6 da 5 6 0.032051282
## 7 ob 5 7 0.032051282
## 8 wo 5 8 0.032051282
## 9 sodass 5 9 0.032051282
## 10 soweit 4 10 0.025641026
## 11 falls 4 11 0.025641026
## 12 nachdem 4 12 0.025641026
## 13 wenn 3 13 0.019230769
## 14 damit 2 14 0.012820513
## 15 was 1 15 0.006410256
## 16 sobald 1 16 0.006410256
## 17 indem 1 17 0.006410256
## 18 bevor 1 18 0.006410256
## 19 während 1 19 0.006410256
##
## $English.written.formal
## lemma n_lemmas rank prop
## 1 as 20 1 0.32786885
## 2 while 12 2 0.19672131
## 3 that 10 3 0.16393443
## 4 when 10 4 0.16393443
## 5 if 3 5 0.04918033
## 6 so 1 6 0.01639344
## 7 because 1 7 0.01639344
## 8 like 1 8 0.01639344
## 9 after 1 9 0.01639344
## 10 until 1 10 0.01639344
## 11 whilst 1 11 0.01639344
##
## $German.written.formal
## lemma n_lemmas rank prop
## 1 als 39 1 0.375000000
## 2 dass 12 2 0.115384615
## 3 da 10 3 0.096153846
## 4 wie 9 4 0.086538462
## 5 sodass 9 5 0.086538462
## 6 um 4 6 0.038461538
## 7 während 3 7 0.028846154
## 8 ob 2 8 0.019230769
## 9 jedoch 2 9 0.019230769
## 10 bevor 2 10 0.019230769
## 11 denn 2 11 0.019230769
## 12 weil 1 12 0.009615385
## 13 beziehungsweise 1 13 0.009615385
## 14 die 1 14 0.009615385
## 15 wo 1 15 0.009615385
## 16 sobald 1 16 0.009615385
## 17 falls 1 17 0.009615385
## 18 sowie 1 18 0.009615385
## 19 indem 1 19 0.009615385
## 20 nachdem 1 20 0.009615385
## 21 als 1 21 0.009615385
##
## $English.spoken.informal
## lemma n_lemmas rank prop
## 1 so 28 1 0.30769231
## 2 that 24 2 0.26373626
## 3 because 16 3 0.17582418
## 4 as 10 4 0.10989011
## 5 when 7 5 0.07692308
## 6 like 2 6 0.02197802
## 7 though 2 7 0.02197802
## 8 cause 1 8 0.01098901
## 9 until 1 9 0.01098901
##
## $German.spoken.informal
## lemma n_lemmas rank prop
## 1 weil 30 1 0.288461538
## 2 dass 23 2 0.221153846
## 3 als 17 3 0.163461538
## 4 wie 7 4 0.067307692
## 5 wenn 6 5 0.057692308
## 6 ob 5 6 0.048076923
## 7 sodass 2 7 0.019230769
## 8 damit 2 8 0.019230769
## 9 obwohl 2 9 0.019230769
## 10 um 1 10 0.009615385
## 11 da 1 11 0.009615385
## 12 also 1 12 0.009615385
## 13 soweit 1 13 0.009615385
## 14 wo 1 14 0.009615385
## 15 sobald 1 15 0.009615385
## 16 bis 1 16 0.009615385
## 17 bevor 1 17 0.009615385
## 18 während 1 18 0.009615385
## 19 nachdem 1 19 0.009615385
##
## $English.written.informal
## lemma n_lemmas rank prop
## 1 so 8 1 0.23529412
## 2 as 7 2 0.20588235
## 3 when 5 3 0.14705882
## 4 because 5 4 0.14705882
## 5 that 3 5 0.08823529
## 6 while 2 6 0.05882353
## 7 though 2 7 0.05882353
## 8 if 1 8 0.02941176
## 9 since 1 9 0.02941176
##
## $German.written.informal
## lemma n_lemmas rank prop
## 1 weil 11 1 0.275
## 2 als 9 2 0.225
## 3 dass 3 3 0.075
## 4 sodass 3 4 0.075
## 5 nachdem 3 5 0.075
## 6 wie 2 6 0.050
## 7 da 2 7 0.050
## 8 wo 2 8 0.050
## 9 wenn 2 9 0.050
## 10 um 1 10 0.025
## 11 damit 1 11 0.025
## 12 während 1 12 0.025
The code below does the same for the combination of conjunctions and subjunctions. This way, we can compare the proportions of coordinating and subordinating con/subjunctions.
## [1] 2781
cc <- with(cc, split(cc, list(cc$language, cc$modality, cc$register)))
cc_lemmas <- lapply(cc, function(x) {
x %>%
group_by(lemma) %>%
summarize(n_lemmas = n()) %>%
as.data.frame()
})
cc_lemmas_list <- lapply(cc_lemmas, function(x) {
helper <- x[order(-x$n_lemmas), ]
helper$rank <- c(1:nrow(x))
helper
})
cc_lemmas_list <- lapply(cc_lemmas_list, function(x) {
mutate(x, prop = n_lemmas / sum(n_lemmas))
})
The tables below show the proportion of coordinating and subordinating con/subjunctions.
## $English.spoken.formal
## lemma n_lemmas rank prop
## 1 and 201 1 0.89732143
## 2 but 15 2 0.06696429
## 3 or 8 3 0.03571429
##
## $German.spoken.formal
## lemma n_lemmas rank prop
## 1 und 566 1 0.752659574
## 2 also 42 2 0.055851064
## 3 dass 26 3 0.034574468
## 4 aber 20 4 0.026595745
## 5 als 16 5 0.021276596
## 6 oder 14 6 0.018617021
## 7 weil 10 7 0.013297872
## 8 äh 5 8 0.006648936
## 9 ja 5 9 0.006648936
## 10 ob 5 10 0.006648936
## 11 wie 4 11 0.005319149
## 12 da 4 12 0.005319149
## 13 soweit 4 13 0.005319149
## 14 falls 4 14 0.005319149
## 15 sodass 3 15 0.003989362
## 16 sondern 3 16 0.003989362
## 17 nachdem 3 17 0.003989362
## 18 wo 2 18 0.002659574
## 19 wenn 2 19 0.002659574
## 20 damit 2 20 0.002659574
## 21 genau 1 21 0.001329787
## 22 um 1 22 0.001329787
## 23 beziehungsweise 1 23 0.001329787
## 24 nur 1 24 0.001329787
## 25 sobald 1 25 0.001329787
## 26 währenddessen 1 26 0.001329787
## 27 jedoch 1 27 0.001329787
## 28 indem 1 28 0.001329787
## 29 während 1 29 0.001329787
## 30 denn 1 30 0.001329787
## 31 entweder 1 31 0.001329787
## 32 beziehungweise 1 32 0.001329787
##
## $English.written.formal
## lemma n_lemmas rank prop
## 1 and 112 1 0.9032258
## 2 but 6 2 0.0483871
## 3 or 6 3 0.0483871
##
## $German.written.formal
## lemma n_lemmas rank prop
## 1 und 295 1 0.776315789
## 2 als 31 2 0.081578947
## 3 dass 10 3 0.026315789
## 4 da 8 4 0.021052632
## 5 sodass 7 5 0.018421053
## 6 oder 4 6 0.010526316
## 7 jedoch 3 7 0.007894737
## 8 aber 2 8 0.005263158
## 9 also 2 9 0.005263158
## 10 ob 2 10 0.005263158
## 11 sowie 2 11 0.005263158
## 12 denn 2 12 0.005263158
## 13 wie 1 13 0.002631579
## 14 beziehungsweise 1 14 0.002631579
## 15 die 1 15 0.002631579
## 16 bzw. 1 16 0.002631579
## 17 sobald 1 17 0.002631579
## 18 sondern 1 18 0.002631579
## 19 weder 1 19 0.002631579
## 20 falls 1 20 0.002631579
## 21 indem 1 21 0.002631579
## 22 doch 1 22 0.002631579
## 23 während 1 23 0.002631579
## 24 Währendessen 1 24 0.002631579
##
## $English.spoken.informal
## lemma n_lemmas rank prop
## 1 and 236 1 0.851985560
## 2 but 29 2 0.104693141
## 3 or 11 3 0.039711191
## 4 because 1 4 0.003610108
##
## $German.spoken.informal
## lemma n_lemmas rank prop
## 1 und 499 1 0.715925395
## 2 aber 52 2 0.074605452
## 3 also 37 3 0.053084648
## 4 oder 30 4 0.043041607
## 5 weil 25 5 0.035868006
## 6 dass 14 6 0.020086083
## 7 als 11 7 0.015781923
## 8 ob 4 8 0.005738881
## 9 sondern 4 9 0.005738881
## 10 wenn 4 10 0.005738881
## 11 wie 2 11 0.002869440
## 12 ja 2 12 0.002869440
## 13 damit 2 13 0.002869440
## 14 Hund 1 14 0.001434720
## 15 da 1 15 0.001434720
## 16 beziehungsweise 1 16 0.001434720
## 17 soweit 1 17 0.001434720
## 18 sobald 1 18 0.001434720
## 19 sodass 1 19 0.001434720
## 20 weder 1 20 0.001434720
## 21 bis 1 21 0.001434720
## 22 bevor 1 22 0.001434720
## 23 nachdem 1 23 0.001434720
## 24 obwohl 1 24 0.001434720
##
## $English.written.informal
## lemma n_lemmas rank prop
## 1 and 82 1 0.8723404
## 2 but 11 2 0.1170213
## 3 or 1 3 0.0106383
##
## $German.written.informal
## lemma n_lemmas rank prop
## 1 und 165 1 0.708154506
## 2 aber 23 2 0.098712446
## 3 als 12 3 0.051502146
## 4 weil 7 4 0.030042918
## 5 also 5 5 0.021459227
## 6 dass 3 6 0.012875536
## 7 oder 3 7 0.012875536
## 8 sodass 2 8 0.008583691
## 9 wenn 2 9 0.008583691
## 10 doch 2 10 0.008583691
## 11 nachdem 2 11 0.008583691
## 12 Also 2 12 0.008583691
## 13 Auto 1 13 0.004291845
## 14 wie 1 14 0.004291845
## 15 Und 1 15 0.004291845
## 16 denn 1 16 0.004291845
## 17 Als 1 17 0.004291845
Two other measures that have been shown to vary across different types of text are lexical diversity and lexical density.
Lexical diversity
Lexical density
Instead of relying on a single measure, we can compare various ratios.
The Part of Speech (POS) tags used can be found here: https://universaldependencies.org/u/pos/
The most relevant lexical POSs are:
- NOUN: noun
- VERB: verb
- ADJ: adjective
- ADV: adverb

The most relevant grammatical POSs are:

- ADP: adposition
- PRON: pronoun
- AUX: auxiliary
- CCONJ: conjunction
- DET: determiner

Another relevant POS is:

- INTJ: interjection

The following code chunk calculates the ratio between lexical POS types and the total number of words for all text files separately.
POSlex_overview <- lapply(all_texts, function(x){
n_NOUN <- nrow(x[x$POS=="NOUN", ]) / nrow(x)
n_VERB <- nrow(x[x$POS=="VERB", ]) / nrow(x)
n_ADJ <- nrow(x[x$POS=="ADJ", ]) / nrow(x)
n_ADV <- nrow(x[x$POS=="ADV", ]) / nrow(x)
file <- as.character(x$file[1])
d <- cbind(
n_NOUN,
n_VERB,
n_ADJ,
n_ADV,
file)
d
})
POSlex_overview <- as.data.frame(do.call(rbind, POSlex_overview))
POSlex_overview[, 1:4] <- lapply(POSlex_overview[, 1:4], as.character)
POSlex_overview[, 1:4] <- lapply(POSlex_overview[, 1:4], as.numeric)
To summarize across the 8 varieties, we add columns for language, register, and modality.
POSlex_overview$language <- ifelse(str_detect(POSlex_overview$file, "DE"),
"German",
"English")
POSlex_overview$register <- ifelse(str_detect(POSlex_overview$file, "f.{2}\\.txt"),
"formal",
"informal")
POSlex_overview$modality <- ifelse(str_detect(POSlex_overview$file, "s.\\.txt"),
"spoken",
"written")
The code below summarizes the POS ratios for lexical parts of speech across the 8 varieties:
summary_POSlex <- POSlex_overview %>%
group_by(language, modality, register) %>%
summarize(
prop_NOUN = round(mean(n_NOUN), 2),
prop_VERB = round(mean(n_VERB), 2),
prop_ADJ = round(mean(n_ADJ), 2),
prop_ADV = round(mean(n_ADV), 2)
) %>%
as.data.frame()
summary_POSlex
The following code chunk calculates the ratio between grammatical and other POS types and the total number of words for all texts. Again, we then add language, register, and modality columns in order to average across varieties.
POSgram_overview <- lapply(all_texts, function(x){
n_PRON <- nrow(x[x$POS=="PRON", ]) / nrow(x)
n_DET <- nrow(x[x$POS=="DET", ]) / nrow(x)
n_ADP <- nrow(x[x$POS=="ADP", ]) / nrow(x)
n_AUX <- nrow(x[x$POS=="AUX", ]) / nrow(x)
n_CCONJ <- nrow(x[x$POS=="CCONJ", ]) / nrow(x)
n_INTJ <- nrow(x[x$POS=="INTJ", ]) / nrow(x)
file <- as.character(x$file[1])
d <- cbind(
n_PRON,
n_DET,
n_ADP,
n_AUX,
n_CCONJ,
n_INTJ,
file)
d
})
POSgram_overview <- as.data.frame(do.call(rbind, POSgram_overview))
POSgram_overview[, 1:6] <- lapply(POSgram_overview[, 1:6], as.character)
POSgram_overview[, 1:6] <- lapply(POSgram_overview[, 1:6], as.numeric)
POSgram_overview$language <- ifelse(str_detect(POSgram_overview$file, "DE"),
"German",
"English")
POSgram_overview$register <- ifelse(str_detect(POSgram_overview$file, "f.{2}\\.txt"),
"formal",
"informal")
POSgram_overview$modality <- ifelse(str_detect(POSgram_overview$file, "s.\\.txt"),
"spoken",
"written")
The code below summarizes the grammatical POS ratios across the 8 varieties:
summary_POSgram <- POSgram_overview %>%
group_by(language, modality, register) %>%
summarize(
prop_PRON = round(mean(n_PRON), 2),
prop_DET = round(mean(n_DET), 2),
prop_ADP = round(mean(n_ADP), 2),
prop_AUX = round(mean(n_AUX), 2),
prop_CCONJ = round(mean(n_CCONJ), 2)
) %>%
as.data.frame()
summary_POSgram
The next table shows the ratio between interjections and number of words per text across the 8 varieties.
summary_intj <- POSgram_overview %>%
group_by(language, modality, register) %>%
summarize(prop_INTJ = round(mean(n_INTJ), 2)) %>%
as.data.frame()
summary_intj
Another way of examining the proportion between lexical and other words in order to compare texts is the following: for each text, we can rank the lemmas (word stems) according to their frequency. For our purposes, we want to compare the most frequent words in the 8 different varieties.
Comparing the 10, 20, … most frequent words across the 4 modality-register combinations in both English and German can also reveal differences across registers and modalities.
To do so, we first need to create a list of 8 data frames, each containing a single text variety. This is what the next code chunk does.
all_varieties <- with(all_files, split(all_files, list(all_files$language, all_files$modality, all_files$register)))
The next step is to count the lemmas, rank them by frequency, and keep the most frequent lemmas for each variety.
The output below shows the top 30 lemmas for each of the 8 varieties.
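A sketch of that counting and ranking step (the object name lemma_counts is hypothetical; the columns lemma, counts, and rank match the output below):
lemma_counts <- lapply(all_varieties, function(x) {
  counts <- as.data.frame(table(x$lemma), stringsAsFactors = FALSE)
  colnames(counts) <- c("lemma", "counts")
  counts <- counts[order(-counts$counts), ] # most frequent lemmas first
  counts$rank <- 1:nrow(counts)
  head(counts, 30)                          # keep the top of the list
})
lemma_counts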
## $English.spoken.formal
## lemma counts rank
## 96 the 436 1
## 25 and 201 2
## 32 be 188 3
## 20 a 158 4
## 37 car 123 5
## 101 to 104 6
## 73 of 91 7
## 18 ball 72 8
## 3 in 51 9
## 43 dog 51 10
## 9 man 45 11
## 109 woman 45 12
## 74 on 42 13
## 92 street 41 14
## 57 have 40 15
## 63 it 38 16
## 98 they 38 17
## 107 with 38 18
## 193 then 38 19
## 59 i 33 20
## 97 there 31 21
## 114 he 31 22
## 55 grocery 30 23
## 68 lot 30 24
## 78 parking 30 25
## 103 two 30 26
## 105 walk 30 27
## 152 I 30 28
## 14 her 29 29
## 95 that 29 30
##
## $German.spoken.formal
## lemma counts rank
## 14 d 1241 1
## 3 äh 698 2
## 47 und 566 3
## 42 sein 508 4
## 18 ein 375 5
## 27 haben 336 6
## 8 auf 227 7
## 33 ich 227 8
## 10 Auto 212 9
## 11 Ball 176 10
## 16 dann 171 11
## 37 mit 140 12
## 34 in 127 13
## 32 Hund 120 14
## 68 Frau 118 15
## 43 Straße 108 16
## 36 Mann 103 17
## 66 es 95 18
## 73 ihr 92 19
## 122 auch 91 20
## 94 zu 90 21
## 64 er 88 22
## 91 von 80 23
## 117 ja 78 24
## 5 an 69 25
## 175 also 64 26
## 277 Parkplatz 64 27
## 131 diese 63 28
## 29 halt 57 29
## 110 so 54 30
##
## $English.written.formal
## lemma counts rank
## 85 the 452 1
## 30 be 140 2
## 21 a 135 3
## 26 and 112 4
## 35 car 101 5
## 89 to 92 6
## 64 of 84 7
## 19 ball 72 8
## 39 dog 51 9
## 12 man 50 10
## 5 in 45 11
## 97 woman 43 12
## 48 grocery 35 13
## 82 street 35 14
## 60 lot 34 15
## 95 with 34 16
## 16 her 33 17
## 65 on 33 18
## 55 it 31 19
## 27 as 30 20
## 54 into 28 21
## 93 walk 28 22
## 167 driver 28 23
## 32 blue 27 24
## 69 parking 26 25
## 44 first 25 26
## 23 accident 23 27
## 74 she 23 28
## 103 he 23 29
## 91 two 22 30
##
## $German.written.formal
## lemma counts rank
## 13 d 1274 1
## 17 ein 310 2
## 43 und 295 3
## 38 sein 241 4
## 7 auf 223 5
## 9 Auto 210 6
## 10 Ball 195 7
## 33 mit 184 8
## 24 haben 151 9
## 30 in 151 10
## 28 Hund 125 11
## 75 ihr 122 12
## 32 Mann 121 13
## 69 Frau 115 14
## 39 Straße 111 15
## 101 zu 101 16
## 29 ich 97 17
## 65 er 83 18
## 238 Parkplatz 76 19
## 12 blau 75 20
## 4 an 65 21
## 172 Einkauf 61 22
## 103 aus 60 23
## 174 fahren 56 24
## 3 als 55 25
## 182 rollen 52 26
## 142 Kinderwagen 49 27
## 26 Hand 48 28
## 127 bremsen 45 29
## 141 hinter 45 30
##
## $English.spoken.informal
## lemma counts rank
## 90 the 360 1
## 32 and 236 2
## 36 be 180 3
## 27 a 119 4
## 41 car 99 5
## 71 of 90 6
## 94 to 78 7
## 63 it 72 8
## 6 so 60 9
## 23 ball 55 10
## 45 dog 49 11
## 107 he 49 12
## 152 I 49 13
## 3 in 45 14
## 91 there 44 15
## 73 one 43 16
## 154 like 43 17
## 210 then 43 18
## 92 they 39 19
## 72 on 38 20
## 80 she 38 21
## 59 i 36 22
## 18 her 33 23
## 89 that 33 24
## 190 guy 33 25
## 87 street 32 26
## 133 not 32 27
## 22 just 29 28
## 40 but 29 29
## 100 with 29 30
##
## $German.spoken.informal
## lemma counts rank
## 14 d 912 1
## 41 und 499 2
## 37 sein 493 3
## 3 äh 362 4
## 18 ein 357 5
## 23 haben 314 6
## 29 ich 229 7
## 16 dann 216 8
## 108 so 195 9
## 8 auf 185 10
## 11 Ball 169 11
## 10 Auto 153 12
## 28 Hund 112 13
## 33 mit 109 14
## 55 da 104 15
## 38 Straße 100 16
## 25 halt 98 17
## 59 er 97 18
## 64 gerade 92 19
## 30 in 90 20
## 118 ja 86 21
## 123 auch 81 22
## 63 Frau 76 23
## 86 zu 75 24
## 32 Mann 72 25
## 87 aber 70 26
## 163 also 66 27
## 74 nicht 65 28
## 61 es 60 29
## 67 ihr 59 30
##
## $English.written.informal
## lemma counts rank
## 83 the 196 1
## 31 be 97 2
## 23 a 96 3
## 27 and 84 4
## 36 car 67 5
## 19 ball 45 6
## 87 to 38 7
## 56 it 34 8
## 40 dog 33 9
## 65 of 31 10
## 3 in 27 11
## 175 see 25 12
## 99 he 22 13
## 93 with 21 14
## 80 street 20 15
## 53 i 19 16
## 168 guy 19 17
## 66 on 18 18
## 106 run 18 19
## 67 one 17 20
## 85 they 17 21
## 91 walk 17 22
## 6 so 16 23
## 18 just 16 24
## 46 first 16 25
## 50 grocery 16 26
## 101 his 16 27
## 55 into 15 28
## 61 lot 15 29
## 139 I 15 30
##
## $German.written.informal
## lemma counts rank
## 12 d 524 1
## 15 ein 261 2
## 34 sein 248 3
## 38 und 167 4
## 20 haben 143 5
## 6 auf 122 6
## 8 Auto 115 7
## 9 Ball 109 8
## 25 Hund 86 9
## 26 ich 82 10
## 30 mit 69 11
## 35 Straße 68 12
## 13 dann 66 13
## 63 gerade 64 14
## 58 er 61 15
## 62 Frau 61 16
## 27 in 49 17
## 99 so 49 18
## 29 Mann 48 19
## 82 zu 43 20
## 65 ihr 37 21
## 54 da 36 22
## 61 fallen 33 23
## 83 aber 31 24
## 152 fahren 31 25
## 39 Unfall 30 26
## 44 wollen 30 27
## 214 Parkplatz 30 28
## 28 kommen 28 29
## 96 passieren 28 30
The other measures established above rather play into lexical density, since they looked at the ratios of certain POS types.
As mentioned before, lexical diversity is another useful measure to compare texts. Lexical diversity can be measured simply by dividing the number of distinct lemmas by the number of words, giving us a ratio of different lemmas per word, i.e. how diverse a text is.
However, this measure can only be used to compare texts of the same length, since the number of distinct lemmas does not grow at the same rate as the number of words: longer texts automatically show lower ratios.
Is this a problem for this study? The texts differ in length, but they refer to the same event after all…
We can assess lexical diversity by comparing two measures:

- ttr_all: type-token-ratio for the entire texts (may be biased by text lengths and not comparable in the strict sense)
- ttr_100: type-token-ratio for the first 100 words of each text

The following code chunk calculates those two ttr measures for each text file.
ttr_overview_all <- lapply(all_texts, function(x) {
  # distinct lemmas divided by the total number of words in the text
  length(unique(x$lemma)) / nrow(x)
})
ttr_overview_100 <- lapply(all_texts, function(x) {
  # the same ratio, restricted to the first 100 words
  h <- head(x, 100)
  length(unique(h$lemma)) / nrow(h)
})
The code below formats the output and adds language, modality, and register information.
ttr_overview_all <- as.data.frame(do.call(rbind, ttr_overview_all))
ttr_overview_all$file <- rownames(ttr_overview_all)
ttr_overview_100 <- as.data.frame(do.call(rbind, ttr_overview_100))
ttr_overview_100$file <- rownames(ttr_overview_100)
ttr_overview_all$language <- ifelse(str_detect(ttr_overview_all$file, "DE"),
"German",
"English")
ttr_overview_all$register <- ifelse(str_detect(ttr_overview_all$file, "f.{2}\\.txt"),
"formal",
"informal")
ttr_overview_all$modality <- ifelse(str_detect(ttr_overview_all$file, "s.\\.txt"),
"spoken",
"written")
ttr_overview_100$language <- ifelse(str_detect(ttr_overview_100$file, "DE"),
"German",
"English")
ttr_overview_100$register <- ifelse(str_detect(ttr_overview_100$file, "f.{2}\\.txt"),
"formal",
"informal")
ttr_overview_100$modality <- ifelse(str_detect(ttr_overview_100$file, "s.\\.txt"),
"spoken",
"written")
The next code chunk summarizes the two ttr measures across the 8 varieties.
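A sketch of that chunk. The names ttr_summary_all and ttr_summary_100 come from the text below; V1 is the default column name the ttr values received when the lists were bound into data frames (see the str() output further below):
ttr_summary_all <- ttr_overview_all %>%
  group_by(language, modality, register) %>%
  summarize(mean_ttr_all = round(mean(V1), 2)) %>%
  as.data.frame()
ttr_summary_100 <- ttr_overview_100 %>%
  group_by(language, modality, register) %>%
  summarize(mean_ttr_100 = round(mean(V1), 2)) %>%
  as.data.frame()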
The two tables below summarize the ttr measures across the 8 varieties. While ttr_summary_all considers the entire texts and is less suited for comparison across texts of different lengths, ttr_summary_100 only takes the first 100 words of each text and is thus a more comparable measure.
Another relevant measure is the proportion of 3rd person pronouns vs. 1st and 2nd person pronouns.
The underlying assumption is that in written and formal texts the addressee is much more distant, and both the speaker and the addressee are less involved. Thus, we would expect to see a higher proportion of 1st and 2nd person pronouns in the spoken and informal varieties compared with the written and formal ones.
The code chunk below calculates three ratios: the proportions of 1st, 2nd, and 3rd person forms out of all person-marked forms.
pro_overview <- lapply(all_texts, function(x){
n_3 <- length(str_subset(x$morphosyntax, "Person=3")) / length(str_subset(x$morphosyntax, "Person="))
n_2 <- length(str_subset(x$morphosyntax, "Person=2")) / length(str_subset(x$morphosyntax, "Person="))
n_1 <- length(str_subset(x$morphosyntax, "Person=1")) / length(str_subset(x$morphosyntax, "Person="))
file <- as.character(x$file[1])
d <- cbind(n_3,
n_2,
n_1,
file)
d
})
pro_overview <- as.data.frame(do.call(rbind, pro_overview))
The next code chunk formats those person ratios, adding language, modality, and register columns.
pro_overview[, 1:3] <- lapply(pro_overview[, 1:3], as.character)
pro_overview[, 1:3] <- lapply(pro_overview[, 1:3], as.numeric)
pro_overview$language <- ifelse(str_detect(pro_overview$file, "DE"),
"German",
"English")
pro_overview$register <- ifelse(str_detect(pro_overview$file, "f.{2}\\.txt"),
"formal",
"informal")
pro_overview$modality <- ifelse(str_detect(pro_overview$file, "s.\\.txt"),
"spoken",
"written")
The table below shows the ratios of 3rd vs. 1st and 2nd person pronouns across the 8 varieties.
summary_pronoun <- pro_overview %>%
group_by(language, modality, register) %>%
summarize(prop_3 = round(mean(n_3, na.rm=TRUE), 2),
prop_1_2 = round(mean(n_2 + n_1, na.rm=TRUE), 2),
) %>%
as.data.frame()
summary_pronoun
Another interesting perspective concerns not only the differences between the register-modality varieties but also includes the individual speakers.
The next code chunk merges the different overview data frames containing the different syntactic and lexical measures for each file. We also add a column combining register and modality information, i.e. a variable with the following four values: formal.spoken, formal.written, informal.spoken, and informal.written.
## 'data.frame': 313 obs. of 10 variables:
## $ n_sent : num 9 12 22 10 18 17 17 12 45 13 ...
## $ n_cc : num 10 9 7 6 12 9 11 6 41 7 ...
## $ n_advcl : num 0 1 1 1 1 2 1 0 4 1 ...
## $ n_csubj : num 0 0 0 0 0 0 0 0 0 0 ...
## $ n_comp : num 1 2 1 3 1 0 1 0 7 3 ...
## $ n_rel : num 0 7 0 3 5 7 6 4 23 14 ...
## $ file : Factor w/ 313 levels "DEbi01FT_fsD.txt",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ register: chr "formal" "formal" "informal" "informal" ...
## $ modality: chr "spoken" "written" "spoken" "written" ...
## $ language: chr "German" "German" "German" "German" ...
## 'data.frame': 313 obs. of 5 variables:
## $ sent_length: num 13.56 12.25 6.18 8.1 11.17 ...
## $ file : chr "DEbi01FT_fsD.txt" "DEbi01FT_fwD.txt" "DEbi01FT_isD.txt" "DEbi01FT_iwD.txt" ...
## $ register : chr "formal" "formal" "informal" "informal" ...
## $ modality : chr "spoken" "written" "spoken" "written" ...
## $ language : chr "German" "German" "German" "German" ...
## 'data.frame': 313 obs. of 5 variables:
## $ file : chr "DEbi01FT_fsD.txt" "DEbi01FT_fwD.txt" "DEbi01FT_isD.txt" "DEbi01FT_iwD.txt" ...
## $ sent_init_cconj: num 6 5 6 5 9 8 8 3 27 1 ...
## $ register : chr "formal" "formal" "informal" "informal" ...
## $ modality : chr "spoken" "written" "spoken" "written" ...
## $ language : chr "German" "German" "German" "German" ...
## 'data.frame': 313 obs. of 8 variables:
## $ n_NOUN : num 0.164 0.211 0.14 0.198 0.179 ...
## $ n_VERB : num 0.082 0.102 0.0735 0.1358 0.0846 ...
## $ n_ADJ : num 0.0246 0.0748 0.0515 0.0123 0.0746 ...
## $ n_ADV : num 0.0902 0.0476 0.2353 0.1235 0.0647 ...
## $ file : Factor w/ 313 levels "DEbi01FT_fsD.txt",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ language: chr "German" "German" "German" "German" ...
## $ register: chr "formal" "formal" "informal" "informal" ...
## $ modality: chr "spoken" "written" "spoken" "written" ...
## 'data.frame': 313 obs. of 10 variables:
## $ n_PRON : num 0.041 0.0748 0.0956 0.0494 0.0896 ...
## $ n_DET : num 0.18 0.17 0.118 0.198 0.169 ...
## $ n_ADP : num 0.1066 0.1361 0.0662 0.0741 0.0995 ...
## $ n_AUX : num 0.1066 0.0816 0.1029 0.0988 0.0796 ...
## $ n_CCONJ : num 0.082 0.0612 0.0515 0.0741 0.0597 ...
## $ n_INTJ : num 0.0738 0.0068 0.0515 0.0123 0.0846 ...
## $ file : Factor w/ 313 levels "DEbi01FT_fsD.txt",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ language: chr "German" "German" "German" "German" ...
## $ register: chr "formal" "formal" "informal" "informal" ...
## $ modality: chr "spoken" "written" "spoken" "written" ...
## 'data.frame': 313 obs. of 7 variables:
## $ n_3 : num 0.737 0.826 0.68 1 0.789 ...
## $ n_2 : num 0 0.087 0 0 0 ...
## $ n_1 : num 0.263 0.087 0.32 0 0.211 ...
## $ file : Factor w/ 313 levels "DEbi01FT_fsD.txt",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ language: chr "German" "German" "German" "German" ...
## $ register: chr "formal" "formal" "informal" "informal" ...
## $ modality: chr "spoken" "written" "spoken" "written" ...
## 'data.frame': 313 obs. of 5 variables:
## $ V1 : num 0.45 0.51 0.56 0.543 0.56 ...
## $ file : chr "DEbi01FT_fsD.txt" "DEbi01FT_fwD.txt" "DEbi01FT_isD.txt" "DEbi01FT_iwD.txt" ...
## $ language: chr "German" "German" "German" "German" ...
## $ register: chr "formal" "formal" "informal" "informal" ...
## $ modality: chr "spoken" "written" "spoken" "written" ...
## merge all overview dfs
all_overview <- merge(clause_counts, sent_init_summary)
all_overview <- merge(all_overview, length_sent)
all_overview <- merge(all_overview, POSlex_overview)
all_overview <- merge(all_overview, POSgram_overview)
all_overview <- merge(all_overview, pro_overview)
all_overview <- merge(all_overview, ttr_overview_100)
str(all_overview)
## 'data.frame': 313 obs. of 26 variables:
## $ file : Factor w/ 313 levels "DEbi01FT_fsD.txt",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ register : chr "formal" "formal" "informal" "informal" ...
## $ modality : chr "spoken" "written" "spoken" "written" ...
## $ language : chr "German" "German" "German" "German" ...
## $ n_sent : num 9 12 22 10 18 17 17 12 45 13 ...
## $ n_cc : num 10 9 7 6 12 9 11 6 41 7 ...
## $ n_advcl : num 0 1 1 1 1 2 1 0 4 1 ...
## $ n_csubj : num 0 0 0 0 0 0 0 0 0 0 ...
## $ n_comp : num 1 2 1 3 1 0 1 0 7 3 ...
## $ n_rel : num 0 7 0 3 5 7 6 4 23 14 ...
## $ sent_init_cconj: num 6 5 6 5 9 8 8 3 27 1 ...
## $ sent_length : num 13.56 12.25 6.18 8.1 11.17 ...
## $ n_NOUN : num 0.164 0.211 0.14 0.198 0.179 ...
## $ n_VERB : num 0.082 0.102 0.0735 0.1358 0.0846 ...
## $ n_ADJ : num 0.0246 0.0748 0.0515 0.0123 0.0746 ...
## $ n_ADV : num 0.0902 0.0476 0.2353 0.1235 0.0647 ...
## $ n_PRON : num 0.041 0.0748 0.0956 0.0494 0.0896 ...
## $ n_DET : num 0.18 0.17 0.118 0.198 0.169 ...
## $ n_ADP : num 0.1066 0.1361 0.0662 0.0741 0.0995 ...
## $ n_AUX : num 0.1066 0.0816 0.1029 0.0988 0.0796 ...
## $ n_CCONJ : num 0.082 0.0612 0.0515 0.0741 0.0597 ...
## $ n_INTJ : num 0.0738 0.0068 0.0515 0.0123 0.0846 ...
## $ n_3 : num 0.737 0.826 0.68 1 0.789 ...
## $ n_2 : num 0 0.087 0 0 0 ...
## $ n_1 : num 0.263 0.087 0.32 0 0.211 ...
## $ V1 : num 0.45 0.51 0.56 0.543 0.56 ...
## add the variety column
all_overview$type <- paste(all_overview$register, all_overview$modality, sep = ".")
Then, it is useful to split the data set into an English and a German data set, since we can only compare the measures for files of the same language.
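A sketch of the split (the names overview_en and overview_de are the ones used by the MDS chunks below):
overview_en <- all_overview[all_overview$language == "English", ]
overview_de <- all_overview[all_overview$language == "German", ]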
First, we are going to examine the similarity of the text files in terms of the syntactic properties we examined. The measures are listed again here:

- n_sent: number of sentences per text
- sent_length: average number of words per sentence
- n_cc: number of clause-combining conjunctions
- n_advcl: number of adverbial clauses
- n_comp: number of complement clauses
- n_rel: number of relative clauses
- n_csubj: number of clausal subjects
- sent_init_cconj: number of sentence-initial conjunctions

To compare texts on these measures, we need to extract the columns containing the syntactic information for both languages.
In addition, we need to normalize the values of the different syntactic measures, because the raw numbers vary considerably across measures.
This means that for each column (i.e. each single syntactic measure), all values are centered around 0 and scaled. While we can no longer interpret the values as such, we can now compare them quantitatively across different syntactic measures.
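A sketch of the extraction and normalization, assuming base R's scale(), which centers each column at 0 and divides it by its standard deviation (the object names and the order of the two head() calls are assumptions):
syntax_cols <- c("n_sent", "n_cc", "n_advcl", "n_csubj", "n_comp", "n_rel",
                 "sent_init_cconj", "sent_length")
syntax_de <- scale(overview_de[, syntax_cols]) # standardized matrix, German
syntax_en <- scale(overview_en[, syntax_cols]) # standardized matrix, English
head(syntax_de)
head(syntax_en)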
## n_sent n_cc n_advcl n_csubj n_comp n_rel
## [1,] 0.3936443 0.15619438 -0.5370272 -0.352854 0.4047278 0.01067641
## [2,] -1.0341980 -0.23930699 -0.8497265 2.352360 0.1406501 1.32031621
## [3,] -0.2553749 0.15619438 0.7137703 -0.352854 0.4047278 0.01067641
## [4,] -1.0341980 -0.50297457 -0.8497265 -0.352854 -0.1234276 -0.64414349
## [5,] 0.2638404 0.02436059 -0.2243278 2.352360 -0.3875053 -0.31673354
## [6,] -0.7745903 -0.37114078 -0.8497265 -0.352854 -0.1234276 0.33808636
## sent_init_cconj sent_length
## [1,] -0.10697305 0.4596646
## [2,] -0.46484653 4.4294174
## [3,] 0.07196369 0.2471800
## [4,] -0.64378328 0.5116053
## [5,] 0.25090043 -0.5568889
## [6,] -0.64378328 2.0125853
## n_sent n_cc n_advcl n_csubj n_comp n_rel
## [1,] -1.0295501 0.09276751 -0.8505749 -0.09534724 -0.09024014 -0.9160959
## [2,] -0.5980762 -0.04575695 0.3467312 -0.09534724 0.59745194 0.7590818
## [3,] 0.8401701 -0.32280586 0.3467312 -0.09534724 -0.09024014 -0.9160959
## [4,] -0.8857254 -0.46133032 0.3467312 -0.09534724 1.28514402 -0.1981626
## [5,] 0.2648716 0.36981642 0.3467312 -0.09534724 -0.09024014 0.2804596
## [6,] 0.1210470 -0.04575695 1.5440372 -0.09534724 -0.77793222 0.7590818
## sent_init_cconj sent_length
## [1,] -0.007134418 1.52024729
## [2,] -0.204222707 1.03115996
## [3,] -0.007134418 -1.24210279
## [4,] -0.204222707 -0.52351337
## [5,] 0.584130451 0.62532154
## [6,] 0.387042162 -0.01006349
In order to quantify the similarity between the individual texts, we can calculate a so-called distance matrix on the basis of all 8 syntactic measures.
The distance matrix gives us a value (i.e. distance) for each pair of texts. This means that we can now say, in quantitative terms, how similar or distant a pair of texts is based on their syntactic properties.
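Assuming the standardized matrices from above and dist() with its default Euclidean method (the names dist_syntax_en and dist_syntax_de are required by the MDS chunk below):
dist_syntax_en <- dist(syntax_en)
dist_syntax_de <- dist(syntax_de)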
Of course, with 313 texts in total, a matrix with distances for all pairs of texts in the 2 languages is not very useful.
In other words, we cannot interpret the distance matrix as such. Instead, we can use it as the basis to visually represent the distances/similarities between single texts in order to see if the texts cluster according to single speakers or modality/register.
There are many different techniques to calculate and represent such clusters.
We are going to use Multi-dimensional scaling (MDS) here.
Roughly speaking, for our purposes, we can say that MDS can help to condense complex information into two dimensions along which the texts we are comparing differ. We can actually choose the number of dimensions that we want MDS to output; we could also choose a higher number. Two dimensions, however, can easily be represented in a coordinate system, like a map, which makes it useful for visual interpretation.
Thus, we end up with a coordinate system with dimension 1 on the x-axis and dimension 2 on the y-axis. Dimension 1 contains the largest portion of the distance between texts, dimension 2 the second largest.
While the method behind MDS is much more complex than is mentioned here, the interpretation of a 2-dimensional MDS plot is very simple:
The texts that appear closer to each other are more similar to each other, the texts that are further away from each other differ more.
The next code chunk computes the MDS for both the English and the German distance matrix.
mds_syntax_en <- cmdscale(dist_syntax_en, k = 2) %>%
as.data.frame()
colnames(mds_syntax_en) <- c("dim1", "dim2")
mds_syntax_en$type <- overview_en$type
mds_syntax_en$file <- overview_en$file
mds_syntax_de <- cmdscale(dist_syntax_de, k = 2) %>%
as.data.frame()
colnames(mds_syntax_de) <- c("dim1", "dim2")
mds_syntax_de$type <- overview_de$type
mds_syntax_de$file <- overview_de$file
Finally, the plots below show the MDS of the texts for English and German separately. In the plots, the modality/register combinations are shown in different colors.
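The plotting chunk for the syntactic MDS is analogous to the lexical one shown at the end of this section; a sketch for the English plot (the title string is an assumption):
plot_syntax_en <- ggscatter(mds_syntax_en, x = "dim1", y = "dim2",
                            label = mds_syntax_en$file,
                            color = "type",
                            repel = FALSE) +
  theme_gray() +
  xlab("dimension 1") +
  ylab("dimension 2") +
  theme(legend.position = "none") +
  ggtitle("English: Similarity of individual texts (syntax)")
plot_syntax_en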
We can now do the same for the lexical measures, i.e. calculate a distance matrix and use MDS to visualize the similarity between texts.
The lexical measures that we used are listed again here:
- n_NOUN: proportion of nouns
- n_VERB: proportion of verbs
- n_ADJ: proportion of adjectives
- n_ADV: proportion of adverbs
- n_PRON: proportion of pronouns
- n_DET: proportion of determiners
- n_ADP: proportion of adpositions
- n_AUX: proportion of auxiliaries
- n_CCONJ: proportion of conjunctions
- n_INTJ: proportion of interjections
- ttr_100: type-token-ratio for the first 100 words of each text
- n_3: proportion of 3rd person pronouns out of all pronouns
- n_2: proportion of 2nd person pronouns out of all pronouns
- n_1: proportion of 1st person pronouns out of all pronouns
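As for the syntactic measures, we standardize the lexical columns and compute the distance matrices; a sketch (ttr_100 is stored in the column V1, as the str() output above shows; the object names and print order are assumptions):
lexicon_cols <- c("n_NOUN", "n_VERB", "n_ADJ", "n_ADV", "n_PRON", "n_DET",
                  "n_ADP", "n_AUX", "n_CCONJ", "n_INTJ", "n_3", "n_2", "n_1",
                  "V1")
lexicon_en <- scale(overview_en[, lexicon_cols])
lexicon_de <- scale(overview_de[, lexicon_cols])
head(lexicon_en)
head(lexicon_de)
dist_lexicon_en <- dist(lexicon_en)
dist_lexicon_de <- dist(lexicon_de)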
## n_NOUN n_VERB n_ADJ n_ADV n_PRON n_DET
## [1,] 0.1111359 -0.4336216 0.6791604 -0.01046832 0.8847536 0.3500977
## [2,] 0.4743459 -0.2982926 1.1331359 -0.88217187 1.2134097 0.4561988
## [3,] -0.3148356 1.1251699 -0.8184441 0.37095112 2.7278652 -0.6085615
## [4,] 0.8580521 -0.2372738 -0.5168242 -0.09721476 0.4655737 0.3535738
## [5,] -0.2352600 -0.9845336 0.9720414 -0.20925299 0.3519474 -0.4656259
## [6,] 0.6683780 -1.6651148 0.6662840 -0.58785260 -0.2996314 0.6428683
## n_ADP n_AUX n_CCONJ n_INTJ n_3 n_2
## [1,] 0.86731957 -0.3377273 -0.2581483 -0.267615058 1.05556731 -0.3726356
## [2,] 0.86267010 0.4709188 -0.1889002 -0.834501252 1.44571047 -0.3726356
## [3,] -0.10161617 0.1363206 0.7873688 -0.834501252 -2.99216796 1.2131215
## [4,] -0.02516109 0.1949957 0.5188312 -0.834501252 -1.09022006 -0.3726356
## [5,] -0.26708419 1.4946337 0.3227420 -0.004711316 0.27528099 -0.3726356
## [6,] -0.43341668 2.2242133 -0.4553403 -0.834501252 0.09321418 -0.3726356
## n_1 V1
## [1,] -1.06243657 0.2399312
## [2,] -1.51964780 -0.4579833
## [3,] 2.93816169 -0.3416642
## [4,] 1.45222519 1.2722631
## [5,] -0.14801411 -0.6906215
## [6,] 0.06535113 -0.5743024
## n_NOUN n_VERB n_ADJ n_ADV n_PRON n_DET
## [1,] -0.18942507 -0.7020980 -0.6641138 -0.3237806 -0.94253076 0.5162903
## [2,] 0.57874545 -0.1816690 1.0634017 -1.0500005 0.03248249 0.3340076
## [3,] -0.58584002 -0.9208568 0.2601808 2.1535190 0.63046826 -0.5973358
## [4,] 0.36026254 0.6936369 -1.0851456 0.2445118 -0.70057708 0.8219293
## [5,] 0.05877949 -0.6344336 1.0564192 -0.7588364 0.45658892 0.3177725
## [6,] 1.19332947 0.3934428 0.1989353 -1.1206817 -0.33388951 0.9540789
## n_ADP n_AUX n_CCONJ n_INTJ n_3 n_2
## [1,] 0.8025942 0.723808047 0.86166554 0.9771003 -0.76308215 -0.6358419
## [2,] 1.6782085 0.058497067 0.07717710 -0.8403808 -0.02233588 2.4917413
## [3,] -0.3961055 0.627281483 -0.29171482 0.3718890 -1.23488055 -0.6358419
## [4,] -0.1616667 0.515819123 0.56314754 -0.6899469 1.42116968 -0.6358419
## [5,] 0.5931711 0.004292961 0.01957745 1.2703881 -0.32623179 -0.6358419
## [6,] 1.5113905 -0.959950962 -0.12417126 -0.8564350 0.42515084 -0.6358419
## n_1 V1
## [1,] 0.9570048 -1.6901138
## [2,] -0.5813496 -0.9654577
## [3,] 1.4532739 -0.3615776
## [4,] -1.3405375 -0.5643620
## [5,] 0.4974964 -0.3615776
## [6,] -0.2928582 -0.3615776
mds_lexicon_en <- cmdscale(dist_lexicon_en, k = 2) %>%
as.data.frame()
colnames(mds_lexicon_en) <- c("dim1", "dim2")
mds_lexicon_en$type <- overview_en$type
mds_lexicon_en$file <- overview_en$file
mds_lexicon_de <- cmdscale(dist_lexicon_de, k = 2) %>%
as.data.frame()
colnames(mds_lexicon_de) <- c("dim1", "dim2")
mds_lexicon_de$type <- overview_de$type
mds_lexicon_de$file <- overview_de$file
The plots below show the MDS of the texts for English and German separately. This time, we see the distances/similarities between texts according to their lexical properties.
In the plots, the modality/register combinations are shown in different colors.
plot_lexicon_en <- ggscatter(mds_lexicon_en, x = "dim1", y = "dim2",
label = mds_lexicon_en$file,
color = "type",
repel = FALSE) +
theme_gray() +
xlab("dimemsion 1") +
ylab("dimemsion 2") +
theme(legend.position = "none") +
ggtitle("English: Similarity of individual texts (lexicon)")
plot_lexicon_en