Python: нужно очистить данные в таблице, которая будет расширяться с помощью lxml

Я хочу получить результат квартала Прибыль Attb. для SH ('000) от http://klse.i3investor.com/servlets/stk/fin/8982.jsp Python: нужно очистить данные в таблице, которая будет расширяться с помощью lxml

<b>Quarter Result:</b><br/> <table cellpadding="0" cellspacing="0" border="0" class="nc" width="100%"> <tr> <th class="left">F.Y.</th> <th class="left">Quarter</th> <th class="right">Revenue ('000)</th> <th class="right">Profit before Tax ('000)</th> <th class="right">Profit ('000)</th> <th class="right">Profit Attb. to SH ('000)</th> <th class="right">EPS (Cent)</th> <th class="right">DPS (Cent)</th> <th class="right">NAPS</th> <th class="center" width="33"></th> </tr> <tr> <td class="left" valign="top" nowrap="nowrap"> 2016-12-31 </td> <td class="left" valign="top" nowrap="nowrap"> 2016-09-30 </td> <td class="right" valign="top" nowrap="nowrap"> 79,082 </td> <td class="right" valign="top" nowrap="nowrap"> 14,376 </td> <td class="right" valign="top" nowrap="nowrap"> 10,692 </td> <td class="right" valign="top" nowrap="nowrap"> 10,398 </td> <td class="right" valign="top" nowrap="nowrap"> 3.37 </td> <td class="right" valign="top" nowrap="nowrap"> 0.00 </td> <td class="right" valign="top" nowrap="nowrap"> 1.5100 </td> <td class="center" valign="top" nowrap="nowrap"> <a href="" onclick="viewFinancialSource('62459');return false;" title="View Source"> <img class="sp view16" src="http://cdn1.i3investor.com/cm/icon/trans16.gif" width="16px;" alt="View Source"/> </a> 
<span class="hide" id="financialSourceTitle62459"> Quarter: 2016-09-30 </span> <span class="hide" id="financialSourceDetail62459"> <p> <a target="_blank" href ="/servlets/staticfile/290836.jsp"> <img src="http://cdn1.i3investor.com/cm/icon/file-download-small.png" width="16px" height="16px" alt="3rd Q2016_CGB.PDF"/> 3rd Q2016_CGB.PDF </a> </p> </span> </td> </tr> <tr> <td class="left" valign="top" nowrap="nowrap"> 2016-12-31 </td> <td class="left" valign="top" nowrap="nowrap"> 2016-06-30 </td> <td class="right" valign="top" nowrap="nowrap"> 51,277 </td> <td class="right" valign="top" nowrap="nowrap"> 7,050 </td> <td class="right" valign="top" nowrap="nowrap"> 5,364 </td> <td class="right" valign="top" nowrap="nowrap"> 5,068 </td> <td class="right" valign="top" nowrap="nowrap"> 1.64 </td> <td class="right" valign="top" nowrap="nowrap"> 0.00 </td> <td class="right" valign="top" nowrap="nowrap"> 1.4800 </td> <td class="center" valign="top" nowrap="nowrap"> <a href="" onclick="viewFinancialSource('56288');return false;" title="View Source"> <img class="sp view16" src="http://cdn1.i3investor.com/cm/icon/trans16.gif" width="16px;" alt="View Source"/> </a>

примеру, есть годовой таблица результатов и результат четверть таблицы, я хочу только данные за последние два квартала на прибыль Attb. до SH ('000), который составляет 10 398 и 5 068

Однако таблица расширяется с каждым кварталом. Я хотел бы иметь надежный способ извлечения данных с помощью lxml с помощью xpath или cssselect. Так что, когда ближайшие данные в следующем квартале, мой код все еще работает.

from lxml import html 
import requests 

page = requests.get('http://klse.i3investor.com/servlets/stk/fin/8982.jsp') 
tree = html.fromstring(page.content) 
output = tree.xpath('//table[contains(@class,"nc")]/text()')

, но возвращают заготовки [ '', '', '', '']

источник

2016-11-14 vindex

Что вы пытались до сих пор? Отправьте свой код. – James

@James, добавьте код, который не сможет вернуть результат – vindex

Попробуйте использовать индексы и положение():

import requests 
from lxml import html 

response = requests.get("http://klse.i3investor.com/servlets/stk/fin/8982.jsp") 
tree = html.fromstring(response.content) 

profit_attb_to_sh = [x.strip() for x in tree.xpath('//table[2]/tr[position()>1 and position()<=3]/td[@class="right"][4]/text()')] 

print '\n'.join(profit_attb_to_sh)

таблицы [2] означает: вы получите вторую таблицу из DOM.

тр [положение()> 1 и положение() < = 3] означает: вы будете получать каждый тр в этом диапазоне индекса (2 и 3). Исключая первый, поскольку он содержит заголовки.

тд [@ класс = «правый»] [4] означает: вы получите четвертый тд от каждого из этих т.р., которые соответствовали выше правилу.

Пожалуйста, помните, что XPath индексы начинаются с 1 (а не от 0)

Результат:

10,398 
5,068

PS:

Вы можете изменить:

profit_attb_to_sh = [x.strip() for x in tree.xpath('//table[2]/tr[position()>1 and position()<=3]/td[@class="right"][4]/text()')]

для:

profit_attb_to_sh = map(lambda x: x.strip(), tree.xpath('//table[2]/tr[position()>1 and position()<=3]/td[@class="right"][4]/text()'))

источник

2016-11-15 20:28:26

Python: нужно очистить данные в таблице, которая будет расширяться с помощью lxml

ответ

Смежные вопросы