R: Extracting time from an srt (subtitle) file

I need to calculate the speech rate for each subtitle line. The content of the SRT (subtitle) file looks like this:

1
00:00:19,000 --> 00:00:21,989
I'm Annita McVeigh and welcome to Election Today where we'll bring you

2
00:00:22,000 --> 00:00:23,989
the latest from the campaign trail, plus debate and analysis.

3
00:00:24,000 --> 00:00:28,989
The Liberal Democrats promise to protect the pay of millions

For example, it takes 4 seconds 989 milliseconds to say the 10 words "The Liberal Democrats promise to protect the pay of millions". The average speech rate for these 10 words is 498.9 milliseconds per word.
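The arithmetic in that example can be checked directly in R. This is only a minimal sketch; the variable names `durationMs` and `nWords` are made up for illustration.

```r
# Duration of the third subtitle: 00:00:24,000 --> 00:00:28,989
durationMs <- 4 * 1000 + 989   # 4 seconds 989 milliseconds
nWords <- 10                   # words in the subtitle line

# Average time per word, in milliseconds
durationMs / nWords
```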

How can I read the SRT file so that I get a data frame with startTime, endTime, textString and wordCount as columns and the subtitle lines as rows, as below?

startTime <- c("00:00:19,000", "00:00:22,000", "00:00:24,000")
endTime <- c("00:00:21,989", "00:00:23,989", "00:00:28,989")
textString <- c("I'm Annita McVeigh and welcome to Election Today where we'll bring you",
                "the latest from the campaign trail, plus debate and analysis.",
                "The Liberal Democrats promise to protect the pay of millions")
wordCount <- c(12, 10, 10)
rate.df <- data.frame(startTime, endTime, textString, wordCount)

How do I subtract startTime from endTime in R when the time is given as hours:minutes:seconds,milliseconds?


2016-04-10 Ninjacat

I managed to do this with MS Excel, but I have too much data to use Excel for this task. – Ninjacat

Here is a possible solution (the code is self-explanatory):

text=" 

1 
00:00:19,000 --> 00:00:21,989 
I'm Annita McVeigh and welcome to Election Today where we'll bring you 

2 
00:00:22,000 --> 00:00:23,989 
the latest from the campaign trail, 
plus debate 
and analysis. 



3 
00:00:24,000 --> 00:00:28,989 
The Liberal Democrats promise to protect 
the pay of millions" 

con <- textConnection(text)
lines <- readLines(con)
close(con)

# the previous lines of code just replicate your case; in the real case
# they should be replaced by the following single line
# lines <- readLines(srtFileName)

# split the lines into blocks at blank lines and build one row per block
listOfEntries <-
  lapply(split(seq_along(lines), cumsum(grepl("^\\s*$", lines))), function(blockIdx) {
    block <- lines[blockIdx]
    block <- block[!grepl("^\\s*$", block)]
    if (length(block) == 0) {
      return(NULL)
    }
    if (length(block) < 3) {
      warning("a block not respecting srt standards has been found")
    }
    return(data.frame(id = block[1],
                      times = block[2],
                      textString = paste0(block[3:length(block)], collapse = "\n"),
                      stringsAsFactors = FALSE))
  })
m <- do.call(rbind, listOfEntries)


# split start and end times
tmp <- do.call(rbind, strsplit(m[, 'times'], ' --> '))
m$startTime <- tmp[, 1]
m$endTime <- tmp[, 2]

# parse times: hours, minutes, seconds, milliseconds -> seconds
tmp <- do.call(rbind, lapply(strsplit(m$startTime, ':|,'), as.numeric))
m$fromSeconds <- tmp %*% c(60*60, 60, 1, 1/1000)

tmp <- do.call(rbind, lapply(strsplit(m$endTime, ':|,'), as.numeric))
m$toSeconds <- tmp %*% c(60*60, 60, 1, 1/1000)

# compute time difference in seconds
m$timeDiffInSecs <- m$toSeconds - m$fromSeconds

# word count: number of non-word runs plus one; note that "I'm" counts as
# two words and a trailing "." adds one to the count
m$wordCount <- vapply(gregexpr("\\W+", m$textString), length, 0) + 1

# or if you consider "I'm" a single word you can remove the occurrences of ', e.g.:
#m$wordCount <- vapply(gregexpr("\\W+", gsub("'", "", m$textString)), length, 0) + 1

m$millisecsPerWord <- m$timeDiffInSecs * 1000 / m$wordCount

Result:

> m
  id                         times                                                             textString
2  1 00:00:19,000 --> 00:00:21,989 I'm Annita McVeigh and welcome to Election Today where we'll bring you
3  2 00:00:22,000 --> 00:00:23,989      the latest from the campaign trail, \nplus debate \nand analysis.
6  3 00:00:24,000 --> 00:00:28,989         The Liberal Democrats promise to protect \nthe pay of millions
     startTime      endTime fromSeconds toSeconds timeDiffInSecs wordCount millisecsPerWord
2 00:00:19,000 00:00:21,989          19    21.989          2.989        14         213.5000
3 00:00:22,000 00:00:23,989          22    23.989          1.989        11         180.8182
6 00:00:24,000 00:00:28,989          24    28.989          4.989        10         498.9000
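
The word counts above come out as 14 and 11 for the first two lines because apostrophes and the final period are treated as separators. As a rough sketch of an alternative (not part of the original answer), the time parsing and the word count could be wrapped in two small helpers. The names `srt_to_seconds` and `count_words` are hypothetical, and counting whitespace-separated tokens is just one possible definition of a "word".

```r
# Hypothetical helper: convert SRT timestamps "HH:MM:SS,mmm" to seconds
srt_to_seconds <- function(x) {
  parts <- do.call(rbind, lapply(strsplit(x, "[:,]"), as.numeric))
  as.vector(parts %*% c(60 * 60, 60, 1, 1 / 1000))
}

# Hypothetical helper: count whitespace-separated tokens, so "I'm" is one
# word and trailing punctuation does not add an extra word
count_words <- function(x) {
  lengths(regmatches(x, gregexpr("\\S+", x)))
}

startTime <- c("00:00:19,000", "00:00:22,000", "00:00:24,000")
endTime <- c("00:00:21,989", "00:00:23,989", "00:00:28,989")
textString <- c("I'm Annita McVeigh and welcome to Election Today where we'll bring you",
                "the latest from the campaign trail,\nplus debate\nand analysis.",
                "The Liberal Democrats promise to protect\nthe pay of millions")

duration <- srt_to_seconds(endTime) - srt_to_seconds(startTime)
words <- count_words(textString)

data.frame(startTime, endTime, duration, words,
           millisecsPerWord = duration * 1000 / words)
```

With this definition the counts are 12, 10 and 10, which matches the wordCount column given in the question.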


2016-04-10 16:36:00 digEmAll

Oh, this is wonderful! Thank you so much, digEmAll! The code is simply beautiful! – Ninjacat

Thanks a lot, @digemall – Ninjacat
