14  Mixed-methods research: Analysing qualitative data in R

Conducting mixed-methods research is challenging for everyone. It requires an understanding of different methods and data types, as well as knowledge of how to combine methods so that they yield richer insights than a single-method approach. This chapter looks at one possibility for conducting mixed-methods research in R: analysing textual data quantitatively, which is likely the most evident use of computational software for qualitative data.

While I consider myself comfortable in both qualitative and quantitative research paradigms, this chapter could feel somewhat uncomfortable if you are used to only one or the other approach, i.e. only quantitative or only qualitative research. However, with advancements in big data science, it is impossible to code, for example, two million tweets qualitatively. Thus, the presented methodologies should not be considered in isolation from other forms of analysis. For some, they might serve as a screening tool to sift through large amounts of data and find those nuggets of greatest interest that deserve more in-depth analysis. For others, these tools constitute the primary research design, allowing them to generalise findings to a larger population. In short: the purpose of your research will dictate the approach.

Analysing text quantitatively, though, is not new. The field of Corpus Analysis has been doing this for many years. If you happen to be a Corpus Analyst, then this section might be of particular interest.

This chapter will cover two analytical approaches for textual data: word frequencies and n-grams, i.e. the co-occurrence of words.

14.1 The tidy process of working with textual data

Before we dive head-first into this exciting facet of research, we first need to understand how we can represent and work with qualitative data in R. Similar to working with tidy quantitative data, we also want to work with tidy qualitative data. This book adopts the notion of tidy text data from Silge and Robinson (2017), which follows the terminology used in Corpus Linguistics:

We (…) define the tidy text format as being a table with one-token-per-row. A token is a meaningful unit of text, such as a word, that we are interested in using for analysis, and tokenization is the process of splitting text into tokens.

In other words, what used to be an ‘observation’ is now called a ‘token’. Therefore, a token is the smallest chosen unit of analysis. The term ‘word token’ can easily be confused with ‘word type’: a token is an instance of a type. Thus, the frequency of a word type is determined by the number of its tokens. For example, consider the following sentence, which consists of five tokens but only four types:

(sentence <- "This car, is my car.")
[1] "This car, is my car."

Because the word car appears twice, we have two tokens of the word type car in our dataset. However, how would we represent this in a rectangular dataset? If each row represents one token, we would expect to have five rows in our data frame. As mentioned above, the process of converting a text into individual tokens is called ‘tokenization’. To turn our text into a tokenized data frame, we have to perform two steps:

  1. Convert our text object into a data frame, and
  2. Split this text into individual words, i.e. tokenization.

While the first part can be achieved using tibble(), we need a new function to perform tokenization. The package tidytext will be our primary tool of choice, and it comes with the function unnest_tokens(). Among many other valuable applications, unnest_tokens() can tokenize the text for us.

# Convert text object into a data frame
(df_text <- tibble(text = sentence))
# A tibble: 1 × 1
  text                
  <chr>               
1 This car, is my car.
# Tokenization
library(tidytext)
(df_text <- df_text |> unnest_tokens(output = word,
                                      input = text))
# A tibble: 5 × 1
  word 
  <chr>
1 this 
2 car  
3 is   
4 my   
5 car  

If you paid careful attention, two elements got lost in our data during the process of tokenization: the . and , are no longer included. In most cases, commas and full stops are of less interest because they tend to carry no particular meaning when performing such an analysis. The function unnest_tokens() conveniently removes them for us and also converts all letters to lower-case.
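
If you ever need to keep the punctuation or the original capitalisation, unnest_tokens() lets you change this behaviour. Below is a small sketch: to_lower is a documented argument of unnest_tokens(), while strip_punct is passed on to the underlying tokenizers package, so treat the latter as an assumption worth checking against your installed version.

# Keep capitalisation and punctuation during tokenization (sketch)
tibble(text = sentence) |>
  unnest_tokens(output = word,
                input = text,
                to_lower = FALSE,     # keep the original capitalisation
                strip_punct = FALSE)  # keep "," and "." as separate tokens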

From here onwards we find ourselves in more familiar territory because the variable word looks like any other character variable we encountered before. For example, we can count() the number of occurrences for each word.

df_text |> count(word)
# A tibble: 4 × 2
  word      n
  <chr> <int>
1 car       2
2 is        1
3 my        1
4 this      1

Since we have our data nicely summarised, we can also easily visualise it using ggplot().

df_text |>
  count(word) |>
  ggplot(aes(x = word,
             y = n)) +
  geom_col()

Once we have converted our data into a data frame, all the techniques we covered for non-quantitative variables can be applied. A single function is enough to turn us into novice corpus analysts. If this sounds too easy, then you are right. Hardly ever will we retrieve a tidy dataset that allows us to work with it in the way we just did. Data cleaning and wrangling still need to be performed, but in a slightly different way. Besides, there are many other ways to tokenize text than splitting it into individual words, some of which we cover in this chapter.
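
As a brief illustration of such an alternative, the following sketch splits a short made-up text into sentences rather than words, simply by changing the token argument.

# Tokenize into sentences instead of words (sketch)
tibble(text = "This car, is my car. I drive it every day.") |>
  unnest_tokens(output = sentence,
                input = text,
                token = "sentences")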

14.2 Stop words: Removing noise in the data

In the previous example, we already performed an important data wrangling step, i.e. tokenization. However, besides reshaping the text, we also need to deal with components of the data that are usually unimportant, for example by removing ‘stop words’. To showcase this step (and a little more), I will draw on the comments dataset, which holds data about students’ study experiences.

In my industry, i.e. Higher Education, understanding students’ study experiences is crucial for improving academic programmes and support services. The insights gathered from student feedback can significantly influence how a university enhances its offerings. With the help of the comments dataset, we can empirically investigate which themes emerge from students’ comments about their social inclusion experiences. However, we first have to tidy the data and then clean it. An essential step in this process is the removal of ‘stop words’.

‘Stop words’ are words that we want to exclude from our analysis, because they carry no particular meaning. The tidytext package comes with a data frame that contains common stop words in English.

stop_words
# A tibble: 1,149 × 2
   word        lexicon
   <chr>       <chr>  
 1 a           SMART  
 2 a's         SMART  
 3 able        SMART  
 4 about       SMART  
 5 above       SMART  
 6 according   SMART  
 7 accordingly SMART  
 8 across      SMART  
 9 actually    SMART  
10 after       SMART  
# ℹ 1,139 more rows

It is important to acknowledge that there is no unified standard of what constitutes a stop word. The data frame stop_words covers many words, but in your research context, you might not even want to remove them. For example, if we aim to identify topics in a text, stop words would be equivalent to white noise, i.e. many of them are of no help to identify relevant topics. Instead, stop words would confound our analysis. For obvious reasons, stop words highly depend on the language of your data. Thus, what works for English will not work for German, Russian or Chinese. You will need a separate set of stop words for each language.
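
If your data is not in English, the tidytext function get_stopwords(), which draws on the stopwords package, can supply stop word lists for several languages and lexicons. A small sketch, assuming the stopwords package is installed:

# Stop word lists for other languages (requires the stopwords package)
get_stopwords(language = "de", source = "snowball")  # German stop words
get_stopwords(language = "en", source = "smart")     # the SMART lexicon shown above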

The removal of stop words can be achieved with a function we already know: anti_join(). In other words, we want to subtract all words that appear in stop_words from a given text. A quick sketch using the toy df_text data frame from the previous section illustrates the idea.
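
# Remove stop words from the toy data frame of Section 14.1 (sketch)
df_text |> anti_join(stop_words, by = "word")

Since this, is and my are all common English stop words, only the two tokens of car should survive this step. With this idea in mind, let’s move on to the tokenization of our comments.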

comments_token <-
  comments |>
  unnest_tokens(word, comment) |>
  select(word, social_inclusion)

comments_token
# A tibble: 3,299 × 2
   word          social_inclusion
   <chr>                    <dbl>
 1 joining                      8
 2 the                          8
 3 international                8
 4 club                         8
 5 was                          8
 6 a                            8
 7 game                         8
 8 changer                      8
 9 i                            8
10 made                         8
# ℹ 3,289 more rows

Our new data frame has considerably more rows, i.e. 3299, which hints at around 16 words per comment on average. If you are the curious type, like me, we already can peek at the frequency of words in our dataset.

comments_token |>
  mutate(word2 = fct_lump_n(word, 5,
                            other_level = "other words")) |>
  count(word2) |>
  ggplot(aes(x = reorder(word2, n),
             y = n)) +
  geom_col() +
  coord_flip()

The result is somewhat disappointing. None of these words carries any particular meaning, because all of them, except for group, are stop words. Thus, we must clean our data before conducting such an analysis.

I also sneaked in a new function called fct_lump_n() from the forcats package. It creates a new factor level called other words and ‘lumps’ all remaining factor levels together. You have likely seen plots before which show a category called ‘Other’. We usually apply this approach if we have many factor levels with very low frequencies. It would not be meaningful to plot a bar for each of 50 words that occurred only once in our data; they are simply less important. Thus, it is sometimes meaningful to pool factor levels together. The function fct_lump_n(word, 5) keeps the five most frequently occurring words and pools all other words into a new category. There are many different ways to ‘lump’ factors. For a detailed explanation and showcase of all available alternatives, have a look at the forcats website.
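
To see fct_lump_n() in isolation, here is a toy sketch with a made-up vector of fruit names: only the single most frequent value keeps its own level, and everything else is lumped together.

# A toy example of fct_lump_n(); the fruit vector is made up for illustration
library(forcats)

fruit <- c("apple", "apple", "apple", "pear", "pear", "plum", "kiwi")

tibble(fruit = fct_lump_n(fruit, n = 1, other_level = "other fruit")) |>
  count(fruit)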

In our next step, we have to remove stop words using the stop_words data frame and apply the function anti_join().

comments_no_sw <- anti_join(comments_token, stop_words,
                          by = "word")

comments_no_sw
# A tibble: 1,510 × 2
   word          social_inclusion
   <chr>                    <dbl>
 1 joining                      8
 2 international                8
 3 club                         8
 4 game                         8
 5 changer                      8
 6 friends                      8
 7 world                        8
 8 studying                     8
 9 campus                      10
10 diversity                   10
# ℹ 1,500 more rows

The dataset shrank from 3299 rows to 1510 rows. Thus, about 54% of our data was actually noise. With this cleaner dataset, we can now look at the frequency count again.
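
If you would rather compute this proportion than eyeball it, simple arithmetic on the two data frames does the job.

# Share of tokens removed as stop words
1 - nrow(comments_no_sw) / nrow(comments_token)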

comments_no_sw |>
  count(word) |>
  filter(n > 10) |>
  ggplot(aes(x = reorder(word, n),
             y = n)) +
  geom_col() +
  coord_flip()

To keep the number of bars somewhat manageable, I removed those words with ten or fewer occurrences in the dataset. Some of the most frequently occurring words include, for example, feel, cultural, friends, campus, events, projects. There is also a reference to the word i've, which stands for i have and, therefore, constitutes a stop word. However, it seems our list of stop words did not pick this one up. It is always good to remember that our tools for analysis are never perfect, and some manual data cleaning is often necessary. It is very easy to remove this word using our trusty filter() function.

comments_no_sw <- 
  comments_no_sw |>
  filter(word != "i’ve")
  
comments_no_sw |>
  count(word) |>
  filter(n > 10) |>
  ggplot(aes(x = reorder(word, n),
             y = n)) +
  geom_col() +
  coord_flip()

Another commonly used method for visualising word frequencies is the word cloud. You have likely seen them before since they are very popular, especially on websites and poster presentations. Word clouds use the frequency of words to determine the font size of each word and arrange the words so that they resemble the shape of a cloud. The package wordcloud and its function wordcloud() make this very easy. You can modify a word cloud in several ways. Here are some settings I frequently change for my data visualisations:

  • random.order = FALSE: This ensures that the most important words appear in the middle of the plot.

  • scale: This attribute expects a vector of two values, e.g. c(2, 0.5), which determines the size difference between the most and least frequent words in the dataset.

  • min.freq: This setting determines which words are included based on their frequency, which is helpful if you have many words to plot.

  • max.words: This attribute limits the number of words that should be plotted and is also useful in cutting down on large datasets.

  • colors: This setting allows you to provide custom colours for different frequencies of words. This substantially improves the readability of your data visualisation.

To create a word cloud, you have to provide at least a vector of words and a vector with their frequencies. All other settings are entirely customisable and optional. Here is an example based on the data we just used for our bar plot.

# Create a dataframe with the word count
word_frequencies <-
  comments_no_sw |>
  count(word)

# Plot word cloud
wordcloud::wordcloud(words = word_frequencies$word,
                     freq = word_frequencies$n,
                     random.order = FALSE,
                     scale = c(2, 0.5),
                     min.freq = 1,
                     max.words = 100,
                     colors = c("#6FA8F5",
                                "#FF4D45",
                                "#FFC85E")
                     )

Our word cloud nicely summarises keywords that were particularly important, assuming we consider the frequency of words an indicator of importance. Students seem to talk about their feelings, culture and friends/friendships. While this constitutes a good starting point for further analysis, we would require more context about what was said, for example what kind of feelings were expressed. This is something we will explore more shortly. Before we advance to this stage, though, we can investigate which words were most strongly associated with each social_inclusion score, i.e. which words were used most frequently by students who scored a 1, 2, 3, etc. In other words, each social_inclusion score will become its own little group/sub-sample that we look at.

We can still use count(), but we have to approach this slightly differently:

  1. We create the frequencies for each word with count() and include both variables, social_inclusion and word. This way, count() counts all combinations of social inclusion score and word.

  2. Then, we group our data based on social_inclusion.

  3. We select the top 5 most frequently occurring words for each group.

comments_grouped <-
  comments_no_sw |>
  count(social_inclusion, word) |>
  group_by(social_inclusion) |>
  slice_max(n, n = 5) |>
  ungroup() |>
  arrange(desc(social_inclusion))

comments_grouped
# A tibble: 58 × 3
   social_inclusion word            n
              <dbl> <chr>       <int>
 1               10 friendships     7
 2               10 feel            5
 3               10 friends         5
 4               10 diversity       3
 5               10 times           3
 6                8 friends        14
 7                8 friendships    14
 8                8 study           6
 9                8 classmates      5
10                8 love            5
# ℹ 48 more rows

When you first look at this, you might be slightly confused about what it actually shows. It is best to think of social_inclusion as a grouping variable and not as a number anymore. For example, in group 10, which refers to comments that scored very high on social_inclusion, friendships is mentioned 7 times, while feel and friends are mentioned 5 times each. Similarly, in group 8, friends and friendships are each mentioned 14 times. The frequency is very much affected by the size of each group, i.e. how many students scored a 10 or an 8. Thus, comparing results across groups based on n is not wise. However, looking at words within each group, we can get a sense of what made students score very high or low.
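
If you do want to compare groups, one option is to work with within-group proportions instead of raw counts. The following sketch computes, for each social_inclusion score, the share of each word among all remaining (non-stop) words in that group; the column name prop is my own choice, not part of the dataset.

# Within-group proportions instead of raw counts (sketch)
comments_no_sw |>
  count(social_inclusion, word) |>
  group_by(social_inclusion) |>
  mutate(prop = n / sum(n)) |>
  slice_max(prop, n = 5) |>
  ungroup() |>
  arrange(desc(social_inclusion))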

We can visualise comments_grouped, which will make the value of this analysis even clearer. However, first we have to do some housekeeping, since the meaning of our variables has changed. Our variable social_inclusion should now be a factor. Let’s transform it in the same way as we did at the very beginning of the book - going full circle.

comments_grouped <-
  comments_grouped |>
  mutate(social_inclusion = as_factor(social_inclusion))

glimpse(comments_grouped)
Rows: 58
Columns: 3
$ social_inclusion <fct> 10, 10, 10, 10, 10, 8, 8, 8, 8, 8, 7, 7, 7, 7, 7, 7, …
$ word             <chr> "friendships", "feel", "friends", "diversity", "times…
$ n                <int> 7, 5, 5, 3, 3, 14, 14, 6, 5, 5, 6, 4, 4, 4, 3, 3, 3, …

With this in place we can go ahead and create a plot for each group by using facet_wrap().

comments_grouped |>
  ggplot(aes(x = word,
             y = n)
         ) +
  geom_col() +
  coord_flip() +
  facet_wrap(~ social_inclusion)

One need not be a graphic designer to understand that this is not a well-designed visualisation and that it is very hard to read. Let’s improve it by letting each plot have its own y-axis. In this case, this is perfectly fine, since different words are ranked in each facet plot. We can achieve this by inserting scales = "free_y" into the facet_wrap() function.

comments_grouped |>
  ggplot(aes(x = word,
             y = n)
         ) +
  geom_col() +
  coord_flip() +
  facet_wrap(~ social_inclusion,
             scales = "free_y")

This looks a lot better, and we start to see which words appeared most frequently in comments for each level of social_inclusion. There is, however, one more step we should take to make this visualisation even more meaningful. Ideally, we want the top word to show up at the top of each plot. We can see that, currently, each bar appears in a rather random order. This problem might sound familiar, since we learnt how to reorder() barplots very early on in this book. So, let’s apply our knowledge to this plot.

comments_grouped |>
  ggplot(aes(x = reorder(word, n),
             y = n)
         ) +
  geom_col() +
  coord_flip() +
  facet_wrap(~ social_inclusion,
             scales = "free_y")

Well, this brought us a little closer, but not quite where we intended to be. Some plots are in the correct order, but others are not. The reason for this odd phenomenon is that we reordered our variable word for the entire dataset and not for each facet in our visualisation. Unfortunately, neither reorder() nor fct_reorder() can help us in this case. Instead, we need two new functions from the tidytext package to perfect our plot: reorder_within() and scale_x_reordered(). The reorder_within() function allows us to order a variable based on another quantitative variable (n) for each level of a factor (social_inclusion). The scale_x_reordered() function cleans up the labels of our x-axis, which were changed in the process of using reorder_within(). If you are curious how the plot would look without scale_x_reordered(), you can simply remove it.

comments_grouped |>

  # Reorder our data for each group level in social_inclusion
  mutate(word = reorder_within(x = word,
                               by = n,
                               within = social_inclusion)) |>

  # Create the plot
  ggplot(aes(x = word,
             y = n,
             fill = social_inclusion)) +
  geom_col(show.legend = FALSE) +
  coord_flip() +

  # Adjust the x-axis to obtain the correct labels
  scale_x_reordered() +

  facet_wrap(~ social_inclusion,
             scales = "free_y")

The final visualisation, with a splash of colour added, reveals interesting clusters of words that are directly linked to the social_inclusion scores. For example, it seems students who score 5 or lower on social_inclusion never refer to friends or friendships. Most importantly, we can very quickly understand how individuals with very low scores felt: lonely, isolated, anxiety. This is clearly something a department would aim to improve, maybe by learning from students who managed to build flourishing relationships with their peers.

While these might all be fascinating insights, we have not exhausted our options and can further expand on what we did so far by looking at more than just single, isolated, words.

14.3 N-grams: Exploring correlations of words

Besides looking at words in isolation, it is often more interesting to understand combinations of words to provide much-needed context. For example, the difference between ‘like’ and ‘not like’ can be crucial when trying to understand sentiments in data. The co-occurrence of words follows the same idea as correlations, i.e. how often one word appears together with another. If the frequency of a word pair is high, the relationship between these two words is strong. The technical term for tokens that represent pairs of words is ‘bigram’. If we look at more than two words, we speak of an ‘n-gram’, where ‘n’ stands for the number of words.
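
Before we apply this to the comments, a quick illustration with the toy sentence from the beginning of this chapter shows what bigram tokens look like: each token consists of two consecutive words.

# Bigrams from the toy sentence (quick illustration)
tibble(text = sentence) |>
  unnest_tokens(output = bigram,
                input = text,
                token = "ngrams",
                n = 2)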

Creating a bigram is relatively simple in R and follows similar steps as counting word frequencies:

  1. Turn text into a data frame.
  2. Tokenize the variable that holds the text, i.e. unnest_tokens().
  3. Split the two words into separate variables, e.g. word1 and word2, using separate().
  4. Remove stop words from both variables listwise, i.e. use filter() for word1 and word2.
  5. Merge columns word1 and word2 into one column again, i.e. unite() them. (optional)
  6. Count the frequency of bigrams, i.e. count().

There are a lot of new functions covered in this part. However, by now, you likely understand how they work already. Let’s proceed step-by-step. For this final example about mixed-methods research, we will look at the comment variable in the comments dataset. Since we already have the text as a data frame, we can skip step one. Next, we need to engage in tokenization. While we use the same function as before, we need to provide different arguments to retrieve bigrams. As before, we need to define an output column and an input column. In addition, we also have to specify the correct type of tokenization, i.e. set the argument token to "ngrams". We also need to define the n in ‘ngrams’, which will be 2 for bigrams. After this, we apply count() to our variable bigram, and we have achieved our task. If we turn this into R code, we get the following (with considerably fewer characters than its explanation):

# Create bigrams from variable comment
comments_bigram <-
  comments |>
  unnest_tokens(bigram, comment, token = "ngrams", n = 2)

comments_bigram
# A tibble: 3,090 × 2
   social_inclusion bigram            
              <dbl> <chr>             
 1                8 joining the       
 2                8 the international 
 3                8 international club
 4                8 club was          
 5                8 was a             
 6                8 a game            
 7                8 game changer      
 8                8 changer i         
 9                8 i made            
10                8 made friends      
# ℹ 3,080 more rows
# Inspect frequency of bigrams
comments_bigram |> count(bigram, sort = TRUE)
# A tibble: 2,098 × 2
   bigram             n
   <chr>          <int>
 1 can be            19
 2 i appreciate      18
 3 i often           18
 4 group work        16
 5 sense of          16
 6 group projects    14
 7 i enjoy           14
 8 but i             13
 9 i love            13
10 i sometimes       13
# ℹ 2,088 more rows

As before, the most frequent bigrams are those that contain stop words. Thus, we need to clean our data and remove them. This time, though, we face the challenge that the variable bigram contains two words and not one. Thus, we cannot simply use anti_join(), because the dataset stop_words only contains individual words, not pairs. To remove stop words successfully, we have to separate() the variable into two variables so that each word has its own column. The package tidyr makes this an effortless task.

bigram_split <-
  comments_bigram |>
  separate(col = bigram,
           into = c("word1", "word2"),
           sep = " ")

bigram_split
# A tibble: 3,090 × 3
   social_inclusion word1         word2        
              <dbl> <chr>         <chr>        
 1                8 joining       the          
 2                8 the           international
 3                8 international club         
 4                8 club          was          
 5                8 was           a            
 6                8 a             game         
 7                8 game          changer      
 8                8 changer       i            
 9                8 i             made         
10                8 made          friends      
# ℹ 3,080 more rows

With this new data frame, we can remove stop words in each variable. Of course, we could use anti_join() as before and perform the step twice, but there is a much more elegant solution. Another method to compare values across vectors/columns is the %in% operator. It is very intuitive to use because it tests whether values in the variable on the left of %in% exist in the variable on the right. Let’s look at a simple example. Assume we have two objects that contain the names of people we know from our family and from work. We want to know whether people in our family are also found in the list of work contacts. If you remember Table 5.1, which lists all logical operators, you might be tempted to try family == work. However, this would assess the objects in their entirety and not tell us which values, i.e. which names, can be found in both lists.

# Define the two objects
family <- tibble(name = c("Fiona", "Ida", "Lukas", "Daniel"))
work <- tibble(name = c("Fiona", "Katharina", "Daniel"))

# Compare the two objects using '=='
family$name == work$name
Warning in family$name == work$name: longer object length is not a multiple of
shorter object length
[1]  TRUE FALSE FALSE FALSE

The output suggests that all but one value in family and work are not equal, i.e. they are FALSE. This is because == compares the values position by position, recycling the shorter vector. If we changed the order, some of the FALSE values would turn TRUE. In addition, we get a warning telling us that these two objects are not equally long, because family holds four names, while work contains only three. Lastly, the results are not what we expected: Fiona and Daniel appear in both objects, so we would expect two TRUE values. In short, == is not the right choice to compare these two objects.

If we use %in% instead, we can test whether each name appears in both objects, irrespective of their length and the order of the values.

# Compare the two objects using '%in%'
family$name %in% work$name
[1]  TRUE FALSE FALSE  TRUE

If we want R to return the values that are the same across both objects, we can ask: which() names are the same?

# Find the positions of the names that appear in both objects
(same_names <- which(family$name %in% work$name))
[1] 1 4

The numbers returned by which() refer to the position of the value in our object. In our case, the first value in family is Fiona, and the fourth value is Daniel. It is TRUE that these two names appear in both objects. If you have many values that overlap in a large dataset, you might not want to know the row number but retrieve the actual values. This can be achieved by slice()ing our dataset, i.e. filtering our data frame by providing row numbers.

# Retrieve the values which exist in both objects
family |>
  slice(same_names)
# A tibble: 2 × 1
  name  
  <chr> 
1 Fiona 
2 Daniel

While this might be a nice little exercise, it is important to understand how %in% can help us remove stop words. Technically, we try to achieve the opposite, i.e. we want to keep the values in word1 and word2 that are not a word in stop_words. If we use the language of dplyr, we filter() based on whether a word in bigram_split is not %in% the data frame stop_words. Be aware that %in% requires a variable and not an entire data frame to work as we intend.

bigram_cleaned <-
  bigram_split |>
  filter(!word1 %in% stop_words$word) |>
  filter(!word2 %in% stop_words$word)

bigram_cleaned <- bigram_cleaned |>
  count(word1, word2, sort = TRUE)

bigram_cleaned
# A tibble: 420 × 3
   word1    word2           n
   <chr>    <chr>       <int>
 1 cultural events          8
 2 study    sessions        8
 3 lasting  friendships     7
 4 late     night           5
 5 night    study           5
 6 campus   life            4
 7 leave    feeling         4
 8 campus   events          3
 9 deeper   connections     3
10 feel     grateful        3
# ℹ 410 more rows

The result is a clean dataset, which reveals that cultural events and study sessions are the most common bigrams in the comments dataset. Clearly, students spend a lot of time at social events or in study groups. Thus, when they talk about their study experiences, these appear to be common settings in which those experiences take place.

Sometimes we might wish to re-unite() the two columns that we separated. For example, when plotting the results with ggplot(), we need both words in one column¹ to use the actual bigrams as labels.

bigram_cleaned |>
  unite(col = bigram,
        word1, word2,
        sep = " ")
# A tibble: 420 × 2
   bigram                  n
   <chr>               <int>
 1 cultural events         8
 2 study sessions          8
 3 lasting friendships     7
 4 late night              5
 5 night study             5
 6 campus life             4
 7 leave feeling           4
 8 campus events           3
 9 deeper connections      3
10 feel grateful           3
# ℹ 410 more rows

So far, we primarily looked at frequencies as numbers in tables or as bar plots. However, it is possible to create network plots of words with bigrams by drawing a line between word1 and word2. Since there will be overlaps across bigrams, they would mutually connect and create a network of linked words.

I am sure you have seen visualisations of networks before but might not have engaged with them on a more analytical level. A network plot consists of ‘nodes’, which represent observations in our data, and ‘edges’, which represent the link between nodes. We need to define both to create a network.

Unfortunately, ggplot2 does not enable us to easily create network plots, but the package ggraph offers such features using the familiar ggplot2 syntax. Network plots are usually created using specific algorithms that arrange values in a network efficiently. Thus, if you want your plots to be reproducible, you have to call set.seed() in advance. This way, the random aspects of some algorithms are held constant.

There is one more complication, but it has a simple solution. The function ggraph(), which is the equivalent of ggplot(), requires a graph object that can be used to plot networks. The package igraph has a convenient function, graph_from_data_frame(), which produces such an object. As a final step, we need to choose a layout for the network. This requires some experimentation. Details about different layouts and more for the ggraph package can be found on its website.

library(ggraph)

# Make plot reproducible
set.seed(1234)

# Create the special igraph object
graph <- igraph::graph_from_data_frame(bigram_cleaned)

# Plot the network graph
graph |>
  ggraph(layout = "kk") +  # Choose a layout, e.g. "kk"
  geom_edge_link() +       # Draw lines between nodes
  geom_node_point()        # Add node points

The result looks like a piece of modern art. Still, it is not easy to understand what the plot shows us. Thus, we need to remove some bigrams that are not so frequent. For example, we could remove those that only appear once.

# Filter bigrams
network_plot <- bigram_cleaned |>
  filter(n > 1) |>
  igraph::graph_from_data_frame()

# Plot network plot
network_plot |>
  ggraph(layout = "kk") +
  geom_edge_link() +
  geom_node_point()

It seems that many bigrams were removed. Thus, the network plot looks much smaller and less busy. Nevertheless, we cannot fully understand what this visualisation shows because there are no labels for each node. We can add them using geom_node_text(). I also recommend offsetting the labels with vjust (vertical adjustment) and hjust (horizontal adjustment), making them easier to read. To enhance the network visualisation further, we could colour the edges based on the frequency of each bigram.

# Plot network plot
network_plot |>
  ggraph(layout = "kk") +
  geom_edge_link(aes(col = factor(n))) +
  geom_node_point() +
  geom_node_text(aes(label = name),
                 vjust = 1,
                 hjust = 1,
                 size = 3)

This visualisation looks considerably more informative than any of the previous attempts. Still, it seems the chosen layout is not ideal to identify themes/topics in our data. Thus, it is worth experimenting with another layout, for example graphopt.

set.seed(1234)

# Plot network plot but using a different layout
network_plot |>
  ggraph(layout = "graphopt") +
  geom_edge_link(aes(col = factor(n))) +
  geom_node_point() +
  geom_node_text(aes(label = name),
                 vjust = 0.5,
                 hjust = -0.25,
                 size = 3)

The final plot shows us how themes like events, friendships, and feel (i.e. ‘feelings’) connect to other words. They constitute important topics that students talked about when describing their study experiences. While this type of analysis cannot fully replace manual analysis, it does offer unique insights into our data from a macro perspective without spending days analysing these comments.

Our example is a simple analysis based on a relatively small ‘corpus’, i.e. a small dataset. However, the application of n-grams to larger datasets can reveal topic areas and the links between them even more effectively. There are further approaches to exploring topics in large datasets, for example ‘topic modelling’, which are far more complex but offer more sophisticated analytical insights.

14.4 Exploring more mixed-methods research approaches in R

This chapter can only be considered a teaser for mixed-methods research in R. There is much more to know and much more to learn. If this type of work is of interest, especially when working with social media data or historical text documents, there are several R packages I can recommend:

  • tm: This text mining package includes many functions that help to explore documents systematically.

  • quanteda: This package offers natural language processing features and incorporates functions that are commonly seen in corpus analysis. It also provides tools to visualise frequencies of text, for example, wordclouds.

  • topicmodels: To analyse topics in large datasets, it is necessary to use special Natural Language Processing techniques. This package offers ways to perform Latent Dirichlet Allocation (LDA) and Correlated Topic Models (CTM). Both are exciting and promising pathways to large-scale text analysis.

  • The tidytext package we used above also has a lot more to offer than what we covered, for example, performing sentiment analyses (a short sketch follows after this list).
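
As a flavour of what such a sentiment analysis could look like, the sketch below matches our cleaned comment tokens against the Bing sentiment lexicon via get_sentiments() and counts positive versus negative words. I assume here that the Bing lexicon ships with tidytext; other lexicons (e.g. "afinn" or "nrc") may require the textdata package.

# A minimal sentiment analysis sketch using the Bing lexicon
comments_no_sw |>
  inner_join(get_sentiments("bing"), by = "word") |>
  count(sentiment, sort = TRUE)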

If you want to explore the world of textual analysis in greater depth, there are two fantastic resources I can wholeheartedly recommend:

  • Text mining with tidy data principles: This free online course introduces additional use cases of the tidytext package by letting you work with data interactively right in your browser. Julia Silge developed these materials.

  • Supervised Machine Learning for Text Analysis in R: This book is the ideal follow-up resource if you worked your way through Silge and Robinson (2017). It introduces the foundations of natural language analysis and covers advanced text analysis methods such as regression, classification, and deep learning.

I hope I managed to showcase the versatility of R to some degree in this chapter. Needless to say, there is also a lot I still have to learn, which makes using R exciting and, more than once, has inspired me to approach my data in unconventional ways.


  1. With glue::glue() you can also combine values from two different columns directly in ggplot(), but it seems simpler to just use unite(). Remember, there is always more than one way of doing things in R.