Я пытаюсь оптимизировать свой R-код с помощью parSapply. У меня есть xmlfile и X как глобальные переменные.parSapply проблемы в параллельном коде R
Когда я не использовал clusterExport (cl, «X») и clusterExport (cl, «xmlfile»), я получил «объект xmlfile не найден».
Когда я использовал эти два clusterExport У меня ошибка: «объект типа« externalptr »не является подмножеством».
С обычным sapply он работает нормально.
Может кто-нибудь увидеть проблему?
У меня есть этот код: R
require("XML")
library(parallel)
setwd("C:/PcapParser")
# A helper function that enables the dynamic additon of new rows and unseen variables to a data.frame
# field is an R XML leaf-node (capturing a field of a protocol)
# X is the current data.frame to which the feature in field should be added
# rowNum is the row (packet) to which the feature should be added. [must be that rowNum <= dim(X)[1]+1]
addFeature <- function(field, X, rowNum)
{
# extract xml name and value
featureName = xmlAttrs(field)['name']
if (featureName == "")
featureName = xmlAttrs(field)['show']
value = xmlAttrs(field)['value']
if (is.na(value) | value=="")
value = xmlAttrs(field)['show']
# attempt to add feature (add rows/cols if neccessary)
if (!(featureName %in% colnames(X))) #we are adding a new feature
{
#Special cases
#Bad column names: anything that has the prefix...
badCols = list("<","Content-encoded entity body"," ","\\?")
for(prefix in badCols)
if(grepl(paste("^",prefix,sep=""),featureName))
return(X) #don't include this new feature
X[[featureName]]=array(dim=dim(X)[1]) #add this new feature column with NAs
}
if (rowNum > dim(X)[1]) #we are trying to add a new row
{X = rbind(X,array(dim=dim(X)[2]))} #add row of NA
X[[featureName]][rowNum] = value
return(X)
}
firstLoop<-function(x)
{
packet = xmlfile[[x]]
# Iterate over all protocols in this packet
for (prot in 1:xmlSize(packet))
{
protocol = packet[[prot]]
numFields = xmlSize(protocol)
# Iterate over all fields in this protocol (recursion is not used since the passed dataset is large)
if(numFields>0)
for (f in 1:numFields)
{
field = protocol[[f]]
if (xmlSize(field) == 0) # leaf
X<<-addFeature(field,X,x)
else #not leaf xml element (assumption: there are at most three more steps down)
{
# Iterate over all sub-fields in this field
for (ff in 1:xmlSize(field))
{ #extract sub-field data for this packet
subField = field[[ff]]
if (xmlSize(subField) == 0) # leaf
X<<-addFeature(subField,X,x)
else #not leaf xml element (assumption: there are at most two more steps down)
{
# Iterate over all subsub-fields in this field
for (fff in 1:xmlSize(subField))
{ #extract sub-field data for this packet
subsubField = subField[[fff]]
if (xmlSize(subsubField) == 0) # leaf
X<<-addFeature(subsubField,X,x)
else #not leaf xml element (assumption: there is at most one more step down)
{
# Iterate over all subsubsub-fields in this field
for (ffff in 1:xmlSize(subsubField))
{ #extract sub-field data for this packet
subsubsubField = subsubField[[ffff]]
X<<-addFeature(subsubsubField,X,x) #must be leaf
}
}
}
}
}
}
}
}
}
# Given the path to a pcap file, this function returns a dataframe 'X'
# with m rows that contain data fields extractable from each of the m packets in XMLcap.
# Wireshark must be intalled to work
raw_feature_extractor <- function(pcapPath){
## Step 1: convert pcap into PDML XML file with wireshark
#to run this line, wireshark must be installed in the location referenced in the pdmlconv.bat file
print("Converting pcap file with Wireshark.")
system(paste("pdmlconv",pcapPath,"tmp.xml"))
## Step 2: load XML file into R
print("Parsing XML.")
xmlfile<<-xmlRoot(xmlParse("tmp.xml"))
## Step 3: Extract all feature into data.frame
print("Extracting raw features.")
X <<- data.frame(num=NA) #first feature is packet number
# Iterate over all packets
# Calculate the number of cores
no_cores <- detectCores() - 1
# Initiate cluster
cl <- makeCluster(3)
parSapply (cl,seq(from=1,to=xmlSize(xmlfile),by=1),firstLoop)
print("Done.")
return(X)
}
Что мне делать неправильно с parSapply? (Возможно, принимая во внимание глобальные переменные)
Спасибо
спасибо, я добавил кластерExport для глобальных переменных, но все еще получаю эту ошибку с помощью externalptr, любые идеи? – tomiliJons
Вы не можете вызвать глобальные переменные. Вы * должны * либо принуждать их, определять их внутри параллельной функции (для функций), либо вызывать их явно. –