Text Mining - Error creating the Document Term Matrix ("Extended")

I'm trying to build a basic text analysis in R with the tm package.

Input file: a CSV file containing reviews of several hotels.

I imported it and ran some data-cleaning tasks using the transformations provided by the tm package.

Then, when I create the Document Term Matrix with the following script:

DocumentTermMatrix(tm_map(reviewc, PlainTextDocument))

what I get is a matrix without words, only meaningless symbols:

inspect(try[1:5, 200:500]) 
<<DocumentTermMatrix (documents: 5, terms: 301)>> 
Non-/sparse entries: 0/1505 
Sparsity   : 100% 
Maximal term length: 25 
Weighting   : term frequency (tf) 

       Terms 
Docs    â€œextensiveâ€\u009d â€œextraâ€\u009d â€œfinest â€œfreeâ€\u009d â€œfromâ€\u009d â€œfunkyâ€\u009d â€œgoodâ€\u009d â€œhalf 
    character(0)    0   0   0   0   0   0   0  0 
    character(0)    0   0   0   0   0   0   0  0 
    character(0)    0   0   0   0   0   0   0  0 
    character(0)    0   0   0   0   0   0   0  0 
    character(0)    0   0   0   0   0   0   0  0

Does anyone know what I should do to avoid this error?

Thanks in advance for your help

Cheers!

2016-06-27 Francesco Pastore

Could you post a link to your csv file? – DemetriusRPaula

@DemetriusRPaula https://drive.google.com/file/d/0B9HzLOkZVFz5WUhOcHRFeWdqUjg/view?usp=sharing –

This looks like an encoding problem; R isn't reading the curly quotes correctly. Try playing with the 'fileEncoding' parameter when reading the file: [docs](https://stat.ethz.ch/R-manual/R-devel/library/utils/html/read.table.html), setting it to "UTF-8" or whatever encoding your input data is in. See the "Encoding" section [here](https://stat.ethz.ch/R-manual/R-devel/library/base/html/connections.html) – patrick
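A minimal sketch of that suggestion, assuming the file is saved as UTF-8 (the `â€œ` sequences in the output above are what UTF-8 curly quotes typically look like when decoded as Windows-1252). The temporary file and its contents are made up for illustration; only `read.csv` and its `fileEncoding` argument come from the comment.

```r
# Hypothetical sample data: a small semicolon-separated file saved as UTF-8.
tmp <- tempfile(fileext = ".csv")
sample_lines <- c("id;text", "1;\u201cfree\u201d breakfast")
writeLines(enc2utf8(sample_lines), tmp, useBytes = TRUE)

# Declare the file's encoding explicitly instead of relying on the
# platform default (assumption: the real file is UTF-8 as well).
reviews <- read.csv(tmp, sep = ";", fileEncoding = "UTF-8",
                    stringsAsFactors = FALSE)

print(reviews$text)
```

If the real file turns out to be in another encoding, the same call should work with that encoding's name in place of "UTF-8".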

library(tm) 
library(SnowballC) 
library(ggplot2) 
library(FactoMineR) 
library(RColorBrewer) 
library(ape) 
library(wordcloud) 
library(stringr) 

beijing_review <- read.csv("~/Downloads/beijing_review.csv", sep=";", comment.char="#") 

# Words to remove
cleanwords = c("germany","alemania","bravcger", "\U0001f604\U0001f60a\U0001f44d\U0001f44d")

tryTolower = function(x)
{
    # Lower-case x; return NA instead of failing on invalid input
    tryCatch(tolower(x), error = function(e) NA)
}

clean.text = function(x)
{
    # lower-case
    x = tryTolower(x)
    # remove the retweet marker "rt" (whole word only)
    x = gsub("\\brt\\b", "", x)
    # remove @mentions
    x = gsub("@\\w+", "", x)
    # remove links (before punctuation is stripped, so the whole URL goes)
    x = gsub("http\\S+", "", x)
    # remove punctuation
    x = gsub("[[:punct:]]", "", x)
    # remove numbers
    x = gsub("[[:digit:]]", "", x)
    # replace any remaining non-alphanumeric character with a space
    x = str_replace_all(x, "[^[:alnum:]]", " ")
    # collapse runs of spaces and tabs into a single space
    x = gsub("[ \t]{2,}", " ", x)
    # trim leading and trailing spaces
    x = gsub("^ +| +$", "", x)
    return(x)
}


texto_c = clean.text(beijing_review$text) # Get column text 
texto_ac= paste(texto_c, collapse=" ") 

rmNonAlphabet <- function(str) { 
    words <- unlist(strsplit(str, " ")) 
    in.alphabet <- grep(words, pattern = "[a-z]", ignore.case = T) 
    nice.str <- paste(words[in.alphabet], collapse = " ") 
    nice.str 
} 

texto_ac = rmNonAlphabet(texto_ac) 

busca_corpus = Corpus(VectorSource(texto_ac)) 

tdm = TermDocumentMatrix(busca_corpus,
         control = list(removePunctuation = TRUE,
             stopwords = c(cleanwords, stopwords("english"), stopwords("spanish"), stopwords("portuguese")),
             removeNumbers = TRUE, tolower = TRUE))

m = as.matrix(tdm) 

palavras_freqs = sort(rowSums(m), decreasing=TRUE) # Count the words and sort by frequency

dm= data.frame(word=names(palavras_freqs), freq=palavras_freqs) 

dtm = DocumentTermMatrix(busca_corpus) 

dtm_matrix = as.matrix(dtm) 

top_palavras = head(palavras_freqs, 30) # the 30 most frequent words

barplot(top_palavras, border=NA, las=1, main="30 Top Words", xlab="# of Rep", cex.main=1, horiz=TRUE, cex.names=0.65, axis.lty=1) 

# Plot WordCloud - Max word =100 and Freq >= 50 
wordcloud(dm$word, dm$freq, random.order=FALSE, min.freq=50,colors=brewer.pal(8, "Dark2"), max.words = 100)
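Whichever cleaning pipeline is used, it may help to confirm that the terms no longer contain the garbled quote sequences from the question before plotting. A rough sketch in base R; `has_mojibake` is a hypothetical helper, not part of tm, and the sample terms are made up.

```r
# Hypothetical helper: TRUE where a string contains "â€", the first two
# characters produced when a UTF-8 curly quote is decoded as Windows-1252.
has_mojibake <- function(x) {
  grepl("\u00e2\u20ac", x, fixed = TRUE)
}

# Made-up examples: one garbled term, one clean term.
terms <- c("\u00e2\u20ac\u0153free", "free")
print(has_mojibake(terms))
```

In the script above, this could be applied to `names(palavras_freqs)`; any hit suggests the file was read with the wrong encoding rather than cleaned incorrectly.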

2016-06-28 13:56:00 DemetriusRPaula
