In-class_Ex05

pacman::p_load(tidyverse, readtext,
               quanteda, tidytext)
Put all the data into one tibble data frame.
Text sensing to extract text.

data_folder <- "data/MC1/articles"

text_data <- readtext(paste0("data/MC1/articles",
                             "/*"))

OR

text_data <- readtext("data/MC1/articles")

readtext object consisting of 3261 documents and 0 docvars.
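To confirm the import, the first few doc_id values (taken from the file names) can be checked. A minimal sketch, assuming the folder structure above:

head(text_data$doc_id, 3)   # readtext stores each file name as a doc_id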
Basic tokenisation

usenet_words <- text_data %>%
  unnest_tokens(word, text) %>%           # tokenise the text column into single words
  filter(str_detect(word, "[a-z']$"),     # keep tokens ending in a letter or apostrophe
         !word %in% stop_words$word)      # remove stop words

usenet_words %>%
  count(word, sort = TRUE)
   word             n
   <chr>        <int>
 1 fishing       2177
 2 sustainable   1525
 3 company       1036
 4 practices      838
 5 industry       715
 6 transactions   696
# ℹ 3,255 more rows
Observations: the most common words are fishing, sustainable, and company.
Creating a table to observe the word counts.

temp_table <- usenet_words %>%
  count(word, sort = TRUE)
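A bar chart makes these counts easier to compare than the printed table. A minimal sketch using ggplot2 (loaded with tidyverse); the cut-off of ten words is an arbitrary choice:

temp_table %>%
  slice_max(n, n = 10) %>%                     # keep the ten most frequent words
  ggplot(aes(x = n, y = reorder(word, n))) +   # order bars by frequency
  geom_col() +
  labs(x = "Count", y = NULL)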
corpus_text <- corpus(text_data)
summary(corpus_text, 5)

Corpus consisting of 338 documents, showing 5 documents:
Text Types Tokens Sentences
Alvarez PLC__0__0__Haacklee Herald.txt 206 433 18
Alvarez PLC__0__0__Lomark Daily.txt 102 170 12
Alvarez PLC__0__0__The News Buoy.txt 90 200 9
Alvarez PLC__0__1__Haacklee Herald.txt 96 187 8
Alvarez PLC__0__1__Lomark Daily.txt 241 504 21
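The Types and Tokens columns of this summary can also be computed directly with quanteda's ntoken() and ntype() on a tokenised corpus. A small sketch (counts may differ slightly from summary() depending on tokeniser options):

corpus_tokens <- tokens(corpus_text)   # tokenise the whole corpus
head(ntoken(corpus_tokens), 5)         # token count per document
head(ntype(corpus_tokens), 5)          # distinct word types per document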
Next, separate doc_id into two columns, X and Y, using "__0__" as the delimiter. Some file names contain a "1" in that position instead, so the split does not occur for those rows; too_few = "align_end" keeps the result aligned instead of throwing an error.
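To illustrate that edge case before running the real split, here is the same call on two made-up doc_id values (the second file name is hypothetical and contains no "__0__"); the same call is then applied to the full text_data below:

demo <- tibble(doc_id = c("Alvarez PLC__0__0__Haacklee Herald.txt",  # splits on "__0__"
                          "Barnes Group__1__1__The News Buoy.txt"))  # hypothetical: no "__0__"
demo %>%
  separate_wider_delim(doc_id,
                       delim = "__0__",
                       names = c("X", "Y"),
                       too_few = "align_end")   # non-matching row: X is NA, Y keeps the full doc_id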
text_data_splitted <- text_data %>%
  separate_wider_delim("doc_id",
                       delim = "__0__",
                       names = c("X", "Y"),
                       too_few = "align_end")

pacman::p_load(jsonlite, tidyverse)

## pacman::p_load(jsonlite, tidygraph,
##                ggraph, tidyverse, readtext,
##                quanteda, tidytext)

In the code chunk below, fromJSON() of the jsonlite package is used to import mc1.json into the R environment.
mc1_data <- fromJSON("data/MC1/mc1.json")
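A quick structural check shows what top-level elements the imported list contains. A minimal sketch (the actual field names in mc1.json are not reproduced here):

str(mc1_data, max.level = 1)   # list the top-level components only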