<- "This car, is my car.") (sentence
[1] "This car, is my car."
Conducting mixed-methods research is challenging for everyone. It requires an understanding of different methods and data types and, in particular, knowledge of how to combine methods so that the insights improve on what a single-method approach could offer. This chapter looks at one possibility of conducting mixed-methods research in R. It is likely the most evident use of computational software for qualitative data.
While I consider myself comfortable in both qualitative and quantitative research paradigms, this chapter could be somewhat uncomfortable if you are used to only one or the other approach, i.e. only quantitative or only qualitative research. However, with advancements in big data science, it is impossible to code, for example, two million tweets qualitatively. Thus, the presented methodologies should not be considered in isolation from other forms of analysis. For some, they might serve as a screening tool to sift through large amounts of data and find those nuggets of most significant interest that deserve more in-depth analysis. For others, these tools constitute the primary research design, allowing them to generalise to a larger population. In short: the purpose of your research will dictate the approach.
Analysing text quantitatively, though, is not new. The field of Corpus Analysis has been doing this for many years. If you happen to be a Corpus Analyst, then this section might be of particular interest.
This chapter will cover two analytical approaches for textual data:
Word frequencies, and
Word networks with n-grams.
Before we dive head-first into this exciting facet of research, we first need to understand how we can represent and work with qualitative data in R. Similar to working with tidy quantitative data, we also want to work with tidy qualitative data. This book adopts the notion of tidy text data from Silge and Robinson (2017), which follows the terminology used in Corpus Linguistics:
We (…) define the tidy text format as being a table with one-token-per-row. A token is a meaningful unit of text, such as a word, that we are interested in using for analysis, and tokenization is the process of splitting text into tokens.
In other words, what used to be an ‘observation’ is now called a ‘token’. Therefore, a token is the smallest chosen unit of analysis. The term ‘word token’ can easily be confused with ‘word type’. The first one usually represents the instance of a ‘type’. Thus, the frequency of a word type is determined by the number of its tokens. For example, consider the following sentence, which consists of five tokens, but only four types:
<- "This car, is my car.") (sentence
[1] "This car, is my car."
Because the word car
appears twice, we have two tokens of the word type car
in our dataset. However, how would we represent this in a rectangular dataset? If each row represents one token, we would expect to have five rows in our data frame. As mentioned above, the process of converting a text into individual tokens is called ‘tokenization’. To turn our text into a tokenized data frame, we have to perform two steps:
Turn the text object into a data frame, and
tokenize the text.
While the first part can be achieved using tibble(), we need a new function to perform tokenisation. The package tidytext will be our primary tool of choice, and it comes with the function unnest_tokens(). Among many other valuable applications, unnest_tokens() can tokenize the text for us.
# Convert text object into a data frame
(df_text <- tibble(text = sentence))
# A tibble: 1 × 1
text
<chr>
1 This car, is my car.
# Tokenization
library(tidytext)
(df_text <- df_text |> unnest_tokens(output = word,
                                     input = text))
# A tibble: 5 × 1
word
<chr>
1 this
2 car
3 is
4 my
5 car
If you paid careful attention, two elements got lost in our data during the process of tokenization: the full stop (.) and the comma (,)
are no longer included. In most cases, commas and full stops are of less interest because they tend to carry no particular meaning when performing such an analysis. The function unnest_tokens()
conveniently removes them for us and also converts all letters to lower-case.
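If keeping capitalisation or punctuation matters for your research question, this behaviour can be switched off. Here is a minimal sketch, reusing the sentence object from above; to_lower is an argument of unnest_tokens(), while strip_punct is passed on to the underlying word tokenizer:
# Keep capitalisation and punctuation during tokenization
tibble(text = sentence) |>
  unnest_tokens(output = word,
                input = text,
                to_lower = FALSE,
                strip_punct = FALSE)
For the analyses in this chapter, though, the defaults are exactly what we want.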
From here onwards we find ourselves in more familiar territory because the variable word
looks like any other character variable we encountered before. For example, we can count()
the number of occurrences for each word.
df_text |> count(word)
# A tibble: 4 × 2
word n
<chr> <int>
1 car 2
2 is 1
3 my 1
4 this 1
Since we have our data nicely summarised, we can also easily visualise it using ggplot()
.
df_text |>
  count(word) |>
  ggplot(aes(x = word,
             y = n)) +
  geom_col()
Once we have converted our data into a data frame, all the techniques we covered for non-quantitative variables can be applied. It only requires a single function to turn ourselves into novice corpus analysts. If this sounds too easy, then you are right. Hardly ever will we retrieve a tidy dataset that allows us to work with it in the way we just did. Data cleaning and wrangling still need to be performed but in a slightly different way. Besides, there are many other ways to tokenize text than using individual words, some of which we cover in this chapter.
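To give you a taste of one alternative, the token argument of unnest_tokens() also accepts other units, such as sentences. A minimal sketch, reusing the sentence object from above (it returns a single row here because our toy text contains only one sentence):
# Tokenize into sentences rather than words
tibble(text = sentence) |>
  unnest_tokens(output = sentence_text,
                input = text,
                token = "sentences")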
In the previous example, we already performed an important data wrangling process, i.e. tokenization. However, besides changing the text format, we also need to take care of other components in our data that are usually not important, for example, removing ‘stop words’. To showcase this step (and a little more), I will draw on the comments
dataset, which holds data about students’ study experiences.
In my industry, i.e. Higher Education, understanding students’ study experiences is crucial for improving academic programmes and support services. The insights gathered from student feedback can significantly influence how a university enhances its offerings. With the help of the comments
dataset, we can empirically investigate what themes emerge from students' comments about their social inclusion experiences. However, we first have to tidy the data and then clean it. An essential step in this process is the removal of ‘stop words’.
‘Stop words’ are words that we want to exclude from our analysis, because they carry no particular meaning. The tidytext
package comes with a data frame that contains common stop words in English.
stop_words
# A tibble: 1,149 × 2
word lexicon
<chr> <chr>
1 a SMART
2 a's SMART
3 able SMART
4 about SMART
5 above SMART
6 according SMART
7 accordingly SMART
8 across SMART
9 actually SMART
10 after SMART
# ℹ 1,139 more rows
It is important to acknowledge that there is no unified standard of what constitutes a stop word. The data frame stop_words
covers many words, but whether you want to remove all of them depends on your research context. If, as here, we aim to identify topics in a text, stop words are equivalent to white noise: they are of no help in identifying relevant topics and would instead confound our analysis. For obvious reasons, stop words also depend heavily on the language of your data. What works for English will not work for German, Russian or Chinese, so you will need a separate set of stop words for each language.
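If the default list does not quite fit your project, you can also extend it yourself. Here is a minimal sketch that appends two hypothetical domain-specific words to stop_words using bind_rows() from dplyr; for other languages, dedicated packages such as stopwords provide ready-made lists.
# Append custom (hypothetical) stop words to the existing list
library(dplyr)   # usually already loaded via the tidyverse
my_stop_words <-
  bind_rows(stop_words,
            tibble(word = c("university", "semester"),   # example words only
                   lexicon = "custom"))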
The removal of stop words can be achieved with a function we already know: anti_join()
. In other words, we want to subtract all words from stop_words
from a given text. Let’s begin with the tokenization of our comments.
comments_token <-
  comments |>
  unnest_tokens(word, comment) |>
  select(word, social_inclusion)

comments_token
# A tibble: 3,299 × 2
word social_inclusion
<chr> <dbl>
1 joining 8
2 the 8
3 international 8
4 club 8
5 was 8
6 a 8
7 game 8
8 changer 8
9 i 8
10 made 8
# ℹ 3,289 more rows
Our new data frame has considerably more rows, i.e. 3299
, which hints at around 16 words per comment on average. If you are the curious type, like me, we can already peek at the frequency of words in our dataset.
comments_token |>
  mutate(word2 = fct_lump_n(word, 5,
                            other_level = "other words")) |>
  count(word2) |>
  ggplot(aes(x = reorder(word2, n),
             y = n)) +
  geom_col() +
  coord_flip()
The result is somewhat disappointing. None of these words carries any particular meaning because they are all stop words, except for group
. Thus, we must clean our data before conducting such an analysis.
I also sneaked in a new function called fct_lump_n()
from the forcats
package. It creates a new factor level called other words
and ‘lumps’ together all the other factor levels. You have likely seen plots before which show a category called ‘Other’. We usually apply this approach if we have many factor levels with very low frequencies. It would not be meaningful to plot a bar for each of 50 words that occurred only once in our data; they are less important. Thus, it is sometimes meaningful to pool factor levels together. The function fct_lump_n(word, 5)
returns the five most frequently occurring words and pools the other words together into a new category. There are many different ways to ‘lump’ factors. For a detailed explanation and showcase of all available alternatives, have a look at the forcats website.
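To give you a flavour of the alternatives, here is a minimal sketch of two related helpers from forcats: fct_lump_min(), which lumps levels that occur fewer than a minimum number of times, and fct_lump_prop(), which lumps levels that fall below a given proportion of all tokens.
# Lump words that appear fewer than 5 times
comments_token |>
  mutate(word2 = fct_lump_min(word, min = 5,
                              other_level = "other words"))

# Lump words that make up less than 1% of all tokens
comments_token |>
  mutate(word2 = fct_lump_prop(word, prop = 0.01,
                               other_level = "other words"))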
In our next step, we have to remove stop words using the stop_words
data frame and apply the function anti_join()
.
comments_no_sw <- anti_join(comments_token, stop_words,
                            by = "word")

comments_no_sw
# A tibble: 1,510 × 2
word social_inclusion
<chr> <dbl>
1 joining 8
2 international 8
3 club 8
4 game 8
5 changer 8
6 friends 8
7 world 8
8 studying 8
9 campus 10
10 diversity 10
# ℹ 1,500 more rows
The dataset shrank from 3299
rows to 1510
rows. Thus, about 54% of our data was actually noise. With this cleaner dataset, we can now look at the frequency count again.
comments_no_sw |>
  count(word) |>
  filter(n > 10) |>
  ggplot(aes(x = reorder(word, n),
             y = n)) +
  geom_col() +
  coord_flip()
To keep the number of bars somewhat manageable, I removed those words with 10 or fewer occurrences in the dataset. Some of the most frequently occurring words include, for example, feel
, cultural
, friends
, campus
, events
, projects
. There is also a reference to the word i've
which refers to i have
and, therefore, constitutes a stop word. However, it seems our list of stop words did not pick this one up. It is always good to remember that our tools for analysis are never perfect, and some manual data cleaning is often necessary. It is very easy to remove this word using our trusty filter()
function.
comments_no_sw <-
  comments_no_sw |>
  filter(word != "i’ve")

comments_no_sw |>
  count(word) |>
  filter(n > 10) |>
  ggplot(aes(x = reorder(word, n),
             y = n)) +
  geom_col() +
  coord_flip()
Another commonly used method for visualising word frequencies is the word cloud. You have likely seen them before since they are very popular, especially on websites and poster presentations. Word clouds use the frequency of words to determine the font size of each word and arrange the words so that they resemble the shape of a cloud. The package wordcloud
and its function wordcloud()
make this very easy. You can modify a word cloud in several ways. Here are some settings I frequently change for my data visualisations:
random.order = FALSE
: This ensures that the most important words appear in the middle of the plot.
scale = c()
: This attribute determines the size difference between the most and least frequent words in the dataset.
min.freq
: This setting determines which words are included based on their frequency, which is helpful if you have many words to plot.
max.words
: This attribute limits the number of words that should be plotted and is also useful in cutting down on large datasets.
colors
: This setting allows you to provide custom colours for different frequencies of words. This substantially improves the readability of your data visualisation.
To create a word cloud you have to provide at least a column with words and a column with their frequencies. All other settings are entirely customisable and optional. Here is an example based on the data we just used for our bar plot.
# Create a dataframe with the word count
word_frequencies <-
  comments_no_sw |>
  count(word)

# Plot word cloud
wordcloud::wordcloud(words = word_frequencies$word,
                     freq = word_frequencies$n,
                     random.order = FALSE,
                     scale = c(2, 0.5),
                     min.freq = 1,
                     max.words = 100,
                     colors = c("#6FA8F5",
                                "#FF4D45",
                                "#FFC85E")
                     )
Our word cloud nicely summarises keywords that were particularly important, assuming we consider the frequency of words an indicator of importance. Students seem to talk about their feelings, culture and friends/friendships. While this constitutes a good starting point for further analysis, we would require more context about what was said, for example what kind of feelings were expressed. This is something we will explore shortly. Before we advance to this stage, though, we can investigate which words were most associated with each social_inclusion
score, i.e. what words were most frequently used by students who scored a 1
, 2
, 3
, etc. In other words, each social_inclusion
score will become its own little group/sub-sample that we look at.
We can still use count()
but we have to approach this slightly differently:
We create the frequencies for each word with count()
and include both variables, social_inclusion
and word
. This way, count() will look for all combinations of the social inclusion scores and the words.
Then, we group our data based on social_inclusion
.
We select the top 5 most frequently occurring words for each group.
comments_grouped <-
  comments_no_sw |>
  count(social_inclusion, word) |>
  group_by(social_inclusion) |>
  slice_max(n, n = 5) |>
  ungroup() |>
  arrange(desc(social_inclusion))

comments_grouped
# A tibble: 58 × 3
social_inclusion word n
<dbl> <chr> <int>
1 10 friendships 7
2 10 feel 5
3 10 friends 5
4 10 diversity 3
5 10 times 3
6 8 friends 14
7 8 friendships 14
8 8 study 6
9 8 classmates 5
10 8 love 5
# ℹ 48 more rows
When you first look at this, you might be slightly confused about what it actually shows. It is best to think of social_inclusion
as a grouping variable and not as a number anymore. For example, in group 10
, which refers to comments that scored very high on social_inclusion
, friendships
are mentioned 7
times, while feel
and friends
are mentioned 5
times. Similarly, in group 8
, friends
and friendships
are each mentioned 14
times. The frequency is very much affected by the size of each group, i.e. how many students scored a 10
or an 8
. Thus, comparing results across groups based on n
is not wise. However, looking at words within each group we can get a sense of what made students score very high or low.
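If you do want to compare groups on a more equal footing, one option is to work with within-group proportions rather than raw counts. A minimal sketch, reusing comments_no_sw from above; prop is simply a new column created for this illustration:
# Express word counts as proportions within each social_inclusion group
comments_no_sw |>
  count(social_inclusion, word) |>
  group_by(social_inclusion) |>
  mutate(prop = n / sum(n)) |>
  slice_max(prop, n = 5) |>
  ungroup() |>
  arrange(desc(social_inclusion))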
We can visualise this data, which will make the value of this analysis even clearer. However, first we have to do some housekeeping, since the meaning of our variables has changed. Our variable social_inclusion
should now be a factor
. Let’s transform it in the same way as we did at the very beginning of the book - going full circle.
comments_grouped <-
  comments_grouped |>
  mutate(social_inclusion = as_factor(social_inclusion))

glimpse(comments_grouped)
Rows: 58
Columns: 3
$ social_inclusion <fct> 10, 10, 10, 10, 10, 8, 8, 8, 8, 8, 7, 7, 7, 7, 7, 7, …
$ word <chr> "friendships", "feel", "friends", "diversity", "times…
$ n <int> 7, 5, 5, 3, 3, 14, 14, 6, 5, 5, 6, 4, 4, 4, 3, 3, 3, …
With this in place, we can go ahead and create a plot for each group by using facet_wrap().
comments_grouped |>
  ggplot(aes(x = word,
             y = n)
         ) +
  geom_col() +
  coord_flip() +
  facet_wrap(~ social_inclusion)
One need not be a graphic designer to understand that this is not a well-designed visualisation and that it is very hard to read. Let's improve it by letting each plot have its own y-axis. In this case, this is perfectly fine, since different words are ranked in each facet. We can achieve this by inserting scales = "free_y"
into the facet_wrap()
function.
comments_grouped |>
  ggplot(aes(x = word,
             y = n)
         ) +
  geom_col() +
  coord_flip() +
  facet_wrap(~ social_inclusion,
             scales = "free_y")
This looks a lot better and we start to see which words appeared most frequently in comments for each level of social_inclusion
. There is, however, one more step we should take to make this visualisation even more meaningful. Ideally, we want the top word to show up at the top of each plot. We can see that, currently, each bar seems to be in a rather random order. This problem might sound familiar, since we learnt how to reorder()
barplots very early on in this book. So, let’s apply our knowledge to this plot.
comments_grouped |>
  ggplot(aes(x = reorder(word, n),
             y = n)
         ) +
  geom_col() +
  coord_flip() +
  facet_wrap(~ social_inclusion,
             scales = "free_y")
Well, this brought us a little closer, but not quite where we intended to be. Some plots are in the correct order, but others are not. The reason for this odd phenomenon is that we reordered our variable word
for the entire dataset and not for each facet in our visualisation. Unfortunately, neither reorder()
nor fct_reorder()
can help us in this case. Instead, we need to use two new functions from the tidytext
package to help us perfect our plot: reorder_within()
and scale_x_reordered()
. The reorder_within()
function allows us to order a variable based on another quantitative variable (n
) for each level of a factor (social_inclusion
). The scale_x_reordered()
function cleans up the labels of our x-axis which were changed in the process of using reorder_within()
. If you are curious how the plot would look without using scale_x_reordered()
you can simply remove it.
comments_grouped |>

  # Reorder our data for each group level in social_inclusion
  mutate(word = reorder_within(x = word,
                               by = n,
                               within = social_inclusion)) |>

  # Create the plot
  ggplot(aes(x = word,
             y = n,
             fill = social_inclusion)) +
  geom_col(show.legend = FALSE) +
  coord_flip() +

  # Adjust the x-axis to obtain the correct labels
  scale_x_reordered() +
  facet_wrap(~ social_inclusion,
             scales = "free_y")
The final visualisation, with a splash of colour added, reveals interesting clusters of words that are directly linked to the social_inclusion
scores. For example, it seems students who score 5
or lower on social_inclusion
never refer to friends
or friendships
. Most importantly, we can very quickly understand how individuals with very low scores felt: lonely
, isolated
, anxiety
. This is clearly something a department would aim to improve, perhaps by learning from students who managed to build flourishing relationships with their peers.
While these might all be fascinating insights, we have not exhausted our options and can expand on what we did so far by looking at more than just single, isolated words.
Besides looking at words in isolation, it is often more interesting to understand combinations of words to provide much-needed context. For example, the difference between ‘like’ and ‘not like’ can be crucial when trying to understand sentiments in data. The co-occurrence of words follows the same idea as correlations, i.e. how often one word appears together with another. If the frequency of word pairs is high, the relationship between these two words is strong. The technical term for tokens that represent pairs of words is ‘bigram’. If we look at more than two words, we speak of an ‘n-gram’, where the ‘n’ stands for the number of words.
Creating a bigram is relatively simple in R and follows similar steps as counting word frequencies:
Convert the text into a data frame,
tokenize the text into bigrams using unnest_tokens(),
split each bigram into two columns, word1 and word2, using separate(),
remove stop words with filter() for word1 and word2,
combine word1 and word2 into one column again, i.e. unite() them (optional), and
count the frequency of bigrams using count().
There are a lot of new functions covered in this part. However, by now, you likely understand how they function already. Let’s proceed step-by-step. For this final example about mixed-methods research, we will look at the comment
variable in the comments
dataset. Since we already have the text as a data frame, we can skip step one. Next, we need to engage in tokenization. While we use the same function as before, we need to provide different arguments to retrieve bigrams. As before, we need to define an output
column and an input
column. In addition, we also have to provide the correct type of tokenization, i.e. determine token
, which we need to set to "ngrams"
. We also need to define the n
in ‘ngrams’, which will be 2
for bigrams. After this, we apply count()
to our variable bigram
, and we achieved our task. If we turn this into R code, we get the following (with considerably less characters than its explanation):
# Create bigrams from the variable comment
comments_bigram <-
  comments |>
  unnest_tokens(bigram, comment, token = "ngrams", n = 2)
comments_bigram
# A tibble: 3,090 × 2
social_inclusion bigram
<dbl> <chr>
1 8 joining the
2 8 the international
3 8 international club
4 8 club was
5 8 was a
6 8 a game
7 8 game changer
8 8 changer i
9 8 i made
10 8 made friends
# ℹ 3,080 more rows
# Inspect frequency of bigrams
comments_bigram |> count(bigram, sort = TRUE)
# A tibble: 2,098 × 2
bigram n
<chr> <int>
1 can be 19
2 i appreciate 18
3 i often 18
4 group work 16
5 sense of 16
6 group projects 14
7 i enjoy 14
8 but i 13
9 i love 13
10 i sometimes 13
# ℹ 2,088 more rows
As before, the most frequent bigrams are those that contain stop words, so we need to clean our data and remove them. This time, though, we face the challenge that the variable bigram holds two words and not one. Thus, we cannot simply use anti_join()
because the dataset stop_words
only contains individual words, not pairs. To remove stop words successfully, we have to separate()
the variable into two variables so that each word has its own column. The package tidyr
makes this an effortless task.
bigram_split <-
  comments_bigram |>
  separate(col = bigram,
           into = c("word1", "word2"),
           sep = " ")

bigram_split
# A tibble: 3,090 × 3
social_inclusion word1 word2
<dbl> <chr> <chr>
1 8 joining the
2 8 the international
3 8 international club
4 8 club was
5 8 was a
6 8 a game
7 8 game changer
8 8 changer i
9 8 i made
10 8 made friends
# ℹ 3,080 more rows
With this new data frame, we can remove stop words in each variable. Of course, we could use anti_join()
as before and perform the step twice, but there is a much more elegant solution. Another method to compare values across vectors/columns is the %in%
operator. It is very intuitive to use because it tests whether values in the variable on the left of %in%
exist in the variable to the right. Let’s look at a simple example. Assume we have two objects that contain the names of people related to family
and work
. We want to know whether people in our family
are also found in the list of work
. If you remember Table 5.1, which lists all logical operators, you might be tempted to try family == work
. However, this would assess the objects in their entirety and not tell us which values, i.e. which names, can be found in both lists.
# Define the two
family <- tibble(name = c("Fiona", "Ida", "Lukas", "Daniel"))
work   <- tibble(name = c("Fiona", "Katharina", "Daniel"))

# Compare the two objects using '=='
family$name == work$name
Warning in family$name == work$name: longer object length is not a multiple of
shorter object length
[1] TRUE FALSE FALSE FALSE
The output suggests that all but one value in family
and work
are not equal, i.e. they are FALSE
. This is because ==
compares the values in order. If we changed the order, some of the FALSE
values would turn TRUE
. In addition, we get a warning telling us that these two objects are not equally long because family
holds four names, while work
contains only three names. Lastly, the results are not what we expected because Fiona
and Daniel
appear in both objects. Therefore, we would expect two TRUE
values. In short, ==
is not the right choice to compare these two objects.
If we use %in%
instead, we can test whether each name appears in both objects, irrespective of their length and the order of the values.
# Compare the two objects using '%in%'
family$name %in% work$name
[1] TRUE FALSE FALSE TRUE
If we want R to return the values that are the same across both objects, we can ask: which()
names are the same?
# Compare the two objects using '%in%'
(same_names <- which(family$name %in% work$name))
[1] 1 4
The numbers returned by which()
refer to the position of the value in our object. In our case, the first value in family
is Fiona
, and the fourth value is Daniel
. It is TRUE
that these two names appear in both objects. If you have many values that overlap in a large dataset, you might not want to know the row number but retrieve the actual values. This can be achieved by slice()
ing our dataset, i.e. filtering our data frame by providing row numbers.
# Retrieve the values which exist in both objects
family |>
  slice(same_names)
# A tibble: 2 × 1
name
<chr>
1 Fiona
2 Daniel
While this might be a nice little exercise, it is important to understand how %in%
can help us remove stop words. Technically, we try to achieve the opposite, i.e. we want to keep the values in word1
and word2
that are not a word in stop_words
. If we use the language of dplyr
, we filter()
based on whether a word in bigram_split
is not %in%
the data frame stop_words
. Be aware that %in%
requires a variable and not an entire data frame to work as we intend.
bigram_cleaned <-
  bigram_split |>
  filter(!word1 %in% stop_words$word) |>
  filter(!word2 %in% stop_words$word)

bigram_cleaned <- bigram_cleaned |>
  count(word1, word2, sort = TRUE)

bigram_cleaned
# A tibble: 420 × 3
word1 word2 n
<chr> <chr> <int>
1 cultural events 8
2 study sessions 8
3 lasting friendships 7
4 late night 5
5 night study 5
6 campus life 4
7 leave feeling 4
8 campus events 3
9 deeper connections 3
10 feel grateful 3
# ℹ 410 more rows
The result is a clean dataset, which reveals that cultural
and events
are the most common bigram, together with study
and sessions
, in the comments
dataset. Clearly, students spend a lot of time at social events or in study groups. When they talk about their experiences, these appear to be the common settings in which those experiences take place.
Sometimes we might wish to re-unite()
the two columns that we separated. For example, when plotting the results into a ggplot()
, we need both words in one column1 to use the actual bigrams as labels.
bigram_cleaned |>
  unite(col = bigram,
        word1, word2,
        sep = " ")
# A tibble: 420 × 2
bigram n
<chr> <int>
1 cultural events 8
2 study sessions 8
3 lasting friendships 7
4 late night 5
5 night study 5
6 campus life 4
7 leave feeling 4
8 campus events 3
9 deeper connections 3
10 feel grateful 3
# ℹ 410 more rows
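To illustrate why this matters for plotting, here is a minimal sketch that pipes the united bigrams straight into a bar chart of the ten most frequent pairs, reusing only functions we already know:
# Plot the ten most frequent bigrams
bigram_cleaned |>
  unite(col = bigram,
        word1, word2,
        sep = " ") |>
  slice_max(n, n = 10) |>
  ggplot(aes(x = reorder(bigram, n),
             y = n)) +
  geom_col() +
  coord_flip()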
So far, we primarily looked at frequencies as numbers in tables or as bar plots. However, it is possible to create network plots of words with bigrams by drawing a line between word1
and word2
. Since there will be overlaps across bigrams, they would mutually connect and create a network of linked words.
I am sure you have seen visualisations of networks before but might not have engaged with them on a more analytical level. A network plot consists of ‘nodes’, which represent observations in our data, and ‘edges’, which represent the link between nodes. We need to define both to create a network.
Unfortunately, ggplot2
does not enable us to easily create network plots, but the package ggraph
offers such features using the familiar ggplot2
syntax. Network plots are usually made using specific algorithms to arrange values in a network efficiently. Thus, if you want your plots to be reproducible, you have to set.seed()
in advance. This way, random aspects of some algorithms are set constant.
There is one more complication, but with a simple solution. The function ggraph()
, which is an equivalent to ggplot()
, requires us to create a graph
object that can be used to plot networks. The package igraph
has a convenient function that produces a graph_from_data_frame()
. As a final step we need to choose a layout for the network. This requires some experimentation. Details about different layouts and more for the ggraph
package can be found on its website.
library(ggraph)
# Make plot reproducible
set.seed(1234)
# Create the special igraph object
graph <- igraph::graph_from_data_frame(bigram_cleaned)

# Plot the network graph
graph |>
  ggraph(layout = "kk") +   # Choose a layout, e.g. "kk"
  geom_edge_link() +        # Draw lines between nodes
  geom_node_point()         # Add node points
The result looks like a piece of modern art. Still, it is not easy to understand what the plot shows us. Thus, we need to remove some bigrams that are not so frequent. For example, we could remove those that only appear once.
# Filter bigrams
network_plot <- bigram_cleaned |>
  filter(n > 1) |>
  igraph::graph_from_data_frame()

# Plot network plot
network_plot |>
  ggraph(layout = "kk") +
  geom_edge_link() +
  geom_node_point()
It seems that many bigrams were removed. Thus, the network plot looks much smaller and less busy. Nevertheless, we cannot fully understand what this visualisation shows because there are no labels for each node. We can add them using geom_node_text()
. I also recommend offsetting the labels with vjust
(vertical adjustment) and hjust
(horizontal adjustment), making them easier to read. To enhance the network visualisation further, we could colour the edges based on the frequency of each token.
# Plot network plot
network_plot |>
  ggraph(layout = "kk") +
  geom_edge_link(aes(col = factor(n))) +
  geom_node_point() +
  geom_node_text(aes(label = name),
                 vjust = 1,
                 hjust = 1,
                 size = 3)
This visualisation looks considerably more informative than any of the previous attempts. Still, it seems the chosen layout is not ideal for identifying themes/topics in our data. Thus, it is worth experimenting with another layout, for example graphopt
.
set.seed(1234)
# Plot network plot but using a different layout
network_plot |>
  ggraph(layout = "graphopt") +
  geom_edge_link(aes(col = factor(n))) +
  geom_node_point() +
  geom_node_text(aes(label = name),
                 vjust = 0.5,
                 hjust = -0.25,
                 size = 3)
The final plot shows us how themes like events
, friendships
, and feel
(i.e. ‘feelings’) connect to other words. They constitute important topics that students talked about when describing their study experiences. While this type of analysis cannot fully replace manual analysis, it does offer unique insights into our data from a macro perspective without spending days analysing these comments.
Our example is a simple analysis based on a relatively small ‘corpus’, i.e. a small dataset. However, the application of n-grams in larger datasets can reveal topic areas and links between them even more effectively. There are further approaches to exploring topics in large datasets, for example ‘topic modelling’, which is far more complex but offers more sophisticated analytical insights.
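To give you a very rough idea of what such an approach involves, here is a minimal sketch of a topic model using the topicmodels package mentioned below. It reuses comments_no_sw from above, treats each social_inclusion score as a 'document', and the choice of k = 2 topics is arbitrary and for illustration only:
library(topicmodels)

# Cast the word counts into a document-term matrix
comments_dtm <- comments_no_sw |>
  count(social_inclusion, word) |>
  cast_dtm(document = social_inclusion,
           term = word,
           value = n)

# Fit a Latent Dirichlet Allocation (LDA) model with two topics
comments_lda <- LDA(comments_dtm, k = 2, control = list(seed = 1234))

# Inspect the per-topic word probabilities in a tidy format
tidy(comments_lda, matrix = "beta")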
This chapter can only be considered a teaser for mixed-methods research in R. There is much more to know and much more to learn. If this type of work is of interest, especially when working with social media data or historical text documents, there are several R packages I can recommend:
tm
: This text mining package includes many functions that help to explore documents systematically.
quanteda
: This package offers natural language processing features and incorporates functions that are commonly seen in corpus analysis. It also provides tools to visualise frequencies of text, for example, wordclouds.
topicmodels
: To analyse topics in large datasets, it is necessary to use special Natural Language Processing techniques. This package offers ways to perform Latent Dirichlet Allocation (LDA) and Correlated Topic Models (CTM). Both are exciting and promising pathways to large-scale text analysis.
The tidytext
package we used above also has a lot more to offer than what we covered, for example, performing sentiment analyses.
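To whet your appetite, here is a minimal sketch of such a sentiment analysis, reusing comments_no_sw from above together with the 'bing' lexicon from get_sentiments(); depending on your setup, you may be asked to download the lexicon via the textdata package first:
# Label each remaining word as positive or negative and count them
comments_no_sw |>
  inner_join(get_sentiments("bing"), by = "word") |>
  count(sentiment, word, sort = TRUE)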
If you want to explore the world of textual analysis in greater depth, there are two fantastic resources I can wholeheartedly recommend:
Text mining with tidy data principles: This free online course introduces additional use cases of the tidytext package by letting you work with data interactively right in your browser. Julia Silge developed these materials.
Supervised Machine Learning for Text Analysis in R: This book is the ideal follow-up resource if you worked your way through Silge and Robinson (2017). It introduces the foundations of natural language analysis and covers advanced text analysis methods such as regression, classification, and deep learning.
I hope I managed to showcase the versatility of R to some degree in this chapter. Needless to say, there is also a lot I still have to learn, which makes using R exciting and, more than once, has inspired me to approach my data in unconventional ways.
With glue::glue()
you can also combine values from two different columns directly in ggplot()
, but it seems simpler to just use unite()
. Remember, there is always more than one way of doing things in R.↩︎