2017-02-06 38 views

Extracting data from text XML

I have a lot of XML data that looks like this:

<contextfile concordance=brown> 
<context filename=br-a02 paras=yes> 
<p pnum=1> 
<s snum=1> 
<wf cmd=done pos=NN lemma=committee wnsn=1 lexsn=1:14:00::>Committee</wf> 
<wf cmd=done pos=NN lemma=approval wnsn=1 lexsn=1:04:02::>approval</wf> 
<wf cmd=ignore pos=IN>of</wf> 
<wf cmd=done rdf=person pos=NNP lemma=person wnsn=1 lexsn=1:03:00:: pn=person>Gov._Price_Daniel</wf> 
<wf cmd=ignore pos=POS>'s</wf> 
<punc>``</punc> 
<wf cmd=done pos=JJ lemma=abandoned wnsn=1 lexsn=5:00:00:uninhabited:00>abandoned</wf> 
<wf cmd=done pos=NN lemma=property wnsn=1 lexsn=1:21:00::>property</wf> 
<punc>''</punc> 
<wf cmd=done pos=NN lemma=act wnsn=1 lexsn=1:10:01::>act</wf> 
<wf cmd=done pos=VB lemma=seem wnsn=1 lexsn=2:39:00::>seemed</wf> 
<wf cmd=done pos=JJ lemma=certain wnsn=4 lexsn=3:00:03::>certain</wf> 
<wf cmd=done pos=NN lemma=thursday wnsn=1 lexsn=1:28:00::>Thursday</wf> 
<wf cmd=ignore pos=IN>despite</wf> 
<wf cmd=ignore pos=DT>the</wf> 
<wf cmd=done pos=JJ lemma=adamant wnsn=1 lexsn=5:00:00:inflexible:02>adamant</wf> 
<wf cmd=done pos=NN lemma=protest wnsn=1 lexsn=1:10:00::>protests</wf> 
<wf cmd=ignore pos=IN>of</wf> 
<wf cmd=done pos=NN lemma=texas wnsn=1 lexsn=1:15:00::>Texas</wf> 
<wf cmd=done pos=NN lemma=banker wnsn=1 lexsn=1:18:00::>bankers</wf> 
<punc>.</punc> 
</s> 
</p> 

From this I need to extract the words that appear just before each </wf>, to get this output:

Committee approval of Gov. Price Daniel's `` abandoned property '' act seemed certain Thursday despite the adamant protests of Texas bankers .

I have never worked with XML text before, so I am new to this.

I tried to extract the words using some sample XML code I found online, but got this error: Open quote is expected for attribute "concordance" associated with an element type "contextfile". All the files I want to parse start with:

<contextfile concordance=brown> 
<context filename=br-a02 paras=yes> 

But the data that follows in the file starts with:

<p pnum=2> 
<s snum=2> 

........ 
</s> 
</p> 
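
The parser error quoted above comes from the unquoted attribute values (for example `concordance=brown`), which XML does not allow. One possible workaround, sketched here in Python as an assumption rather than anything taken from the original post, is to quote the values with a regular expression before handing the text to a parser. The helper name `quote_attributes`, the shortened sample, and the closing `</context>` and `</contextfile>` tags are illustrative assumptions.

```python
import re
import xml.etree.ElementTree as ET

# Matches name=value where the value is not already quoted.
UNQUOTED_ATTR = re.compile(r'(\s[\w:.-]+)=([^\s"\'>]+)')


def quote_attributes(markup: str) -> str:
    """Wrap bare attribute values in double quotes, one start tag at a time."""
    def fix_tag(match):
        return UNQUOTED_ATTR.sub(r'\1="\2"', match.group(0))
    # Only rewrite text inside start tags so element content is left alone.
    return re.sub(r'<[A-Za-z][^<>]*>', fix_tag, markup)


# Shortened sample; the two closing tags at the end are assumed to exist
# in the real files.
sample = """<contextfile concordance=brown>
<context filename=br-a02 paras=yes>
<p pnum=1>
<s snum=1>
<wf cmd=done pos=NN lemma=committee wnsn=1 lexsn=1:14:00::>Committee</wf>
<wf cmd=ignore pos=IN>of</wf>
<punc>.</punc>
</s>
</p>
</context>
</contextfile>"""

fixed = quote_attributes(sample)
root = ET.fromstring(fixed)  # parses once the values are quoted
print(root.find("context").get("filename"))
```

This is a text-level patch, not a real parser: it would misfire on a tag that already contains a quoted value with `name=value` inside it, so it only suits files that look like the sample.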
Perhaps this answer helps: http://stackoverflow.com/questions/10890323/using-sax-with-jaxbcontext

Possible duplicate of [How to read XML using XPath in Java] (http://stackoverflow.com/questions/2811001/how-to-read-xml-using-xpath-in-java)

@TimothyTruckle The questions may be similar, but without having a tag, how do I extract the words? – serendipity

Answer

This seems to work:

//s[@snum]/string-join(wf | punc, " ") 

I tested it at http://xpather.com/Va3jRPr4 (an online XPath tester), using the content of the "p" tag from your example twice. You may need to drag the horizontal scroll bar a little to see the whole result.
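
Note that `string-join` is an XPath 2.0 function, so engines limited to XPath 1.0 will reject the expression. As a rough equivalent, here is a minimal Python sketch using the standard library's `xml.etree.ElementTree`. It assumes the attribute values have already been quoted so that the input is well-formed; the function name `sentences` and the shortened sample are illustrative, not part of the original answer.

```python
import xml.etree.ElementTree as ET


def sentences(root):
    """Yield one space-joined string per <s> element that has a snum attribute."""
    for s in root.iter("s"):
        if s.get("snum") is None:
            continue
        # Keep only <wf> and <punc> children, in document order.
        tokens = [child.text or "" for child in s if child.tag in ("wf", "punc")]
        yield " ".join(tokens)


# Shortened, already well-formed sample (attribute values quoted).
sample = """<context filename="br-a02" paras="yes">
<p pnum="1">
<s snum="1">
<wf cmd="done" pos="NN" lemma="committee" wnsn="1" lexsn="1:14:00::">Committee</wf>
<wf cmd="done" pos="NN" lemma="approval" wnsn="1" lexsn="1:04:02::">approval</wf>
<wf cmd="ignore" pos="IN">of</wf>
<wf cmd="done" rdf="person" pos="NNP" lemma="person" wnsn="1" lexsn="1:03:00::" pn="person">Gov._Price_Daniel</wf>
<wf cmd="ignore" pos="POS">'s</wf>
<punc>.</punc>
</s>
</p>
</context>"""

result = list(sentences(ET.fromstring(sample)))
print(result)
```

Like the XPath expression, this joins the tokens exactly as they appear, so `Gov._Price_Daniel` keeps its underscores and `'s` stays a separate token; matching the desired output exactly would need an extra clean-up step.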