2016-08-19 6 views
Parsing data-bind tags in HTML with Beautiful Soup

I'm having trouble selecting this div element in Beautiful Soup and then parsing the data inside it.

First I need to decode the HTML entities, the way the tool on this site does (https://mothereff.in/html-entities).

What steps should I take to programmatically select, for example,

(extraLarge:'/s3/fhphotos/CIRD-72K6-H9_SID_1.jpg?width=1000&height=1000&mode=max')

from the code below

<div data-bind="component: { name: &#39;product-detail&#39;, params: {hasVariants:true,name:&#39;BROOKS LOUNGE CHAIR&#39;,hasCategory:true,superCategoryName:&#39;Furniture&#39;,categoryDisplayName:&#39;Living Room&#39;,categorySlug:&#39;living-room&#39;,subcategoryDisplayName:&#39;Chairs&#39;,subcategorySlug:&#39;chairs&#39;,collection:{id:1529,name:&#39;Irondale&#39;,description:&#39;Each piece is a striking conversation-starter. Tables are made from reclaimed doors paired with salvaged architecture or old machine parts. Storage solutions are inspired by libraries of the 1940’s. Cast iron beds with linen panels as well as seating in linen, lush velvet and top-grain leather offer a distinctive found feel.&#39;,isFeatured:true,isNew:false,image:&#39;/FourHandsMarketplace/media/General/Featured%20Collections/IRONDALE.jpg?width=500&#39;,shortDescription:&#39;Moving from Parisian flea market to modern to industrial, understated elegance is a common theme. Waxed leathers and distressed irons mix with fabrics for an intriguing style blend.\r\n&#39;,uri:&#39;/collections/irondale&#39;},attributes:[{id:384,name:&#39;COVER&#39;,displayOrder:30,swatches:true,values:[{id:12710,name:&#39;EBONY&#39;,displayOrder:1,swatchUrl:&#39;/s3/fhphotos/Y C11458-G6_PRM_1.jpg?width=200&amp;height=200&amp;mode=crop&#39;},{id:12711,name:&#39;STONEWASH DARK GREEN&#39;,displayOrder:2,swatchUrl:&#39;/s3/fhphotos/Y C11458-H9_PRM_1.jpg?width=200&amp;height=200&amp;mode=crop&#39;}]},{id:385,name:&#39;FINISH&#39;,displayOrder:40,swatches:true,values:[{id:12712,name:&#39;BLACK WASH WEATHERED&#39;,displayOrder:1,swatchUrl:&#39;/s3/fhphotos/Y C11458-K5_PRM_1.jpg?width=200&amp;height=200&amp;mode=crop&#39;},{id:12713,name:&#39;DISTRESSED WASHED OLD OAK&#39;,displayOrder:2,swatchUrl:&#39;/s3/fhphotos/Y C11458-K6_PRM_1.jpg?width=200&amp;height=200&amp;mode=crop&#39;}]}],products:[{attributeValueIds:[12710,12712],description:&#39;Our take on the classic Adirondack emphasizes comfort with thick, top-grain 
leather cushioning. Wire-brushed oak is finished in black and hand-distressed for a naturally weathered patina.&#39;,dimensions:&#39;W: 27.75&quot; H: 29&quot; D: 34.75&quot;&#39;,availabilityDescription:&#39;&lt;strong>Quantity in Stock: &lt;/strong>&lt;span >88&lt;/span>&lt;br />&lt;strong>More on the Way: &lt;/strong>&lt;span >Yes&lt;/span>&lt;br />&lt;strong>Estimated Arrival Date: &lt;/strong>&lt;span >1 to 2 weeks&lt;/span>&#39;,colors:[&#39;Black Washed Weathered&#39;,&#39;Ebony&#39;],weightPounds:45.0,volumeCubicFeet:18.72,images:[{order:1,thumb:&#39;/s3/fhphotos/CIRD-72K5-G6H6_PRM_1.jpg?width=200&amp;height=200&amp;mode=crop&#39;,large:&#39;/s3/fhphotos/CIRD-72K5-G6H6_PRM_1.jpg?width=600&amp;height=600&amp;mode=max&#39;,extraLarge:&#39;/s3/fhphotos/CIRD-72K5-G6H6_PRM_1.jpg?width=1000&amp;height=1000&amp;mode=max&#39;,full:&#39;http://fhphotos.s3-website-us-east-1.amazonaws.com/CIRD-72K5-G6H6_PRM_1.jpg&#39;},{order:2,thumb:&#39;/s3/fhphotos/CIRD-72K5-G6H6_ROM_1.jpg?width=200&amp;height=200&amp;mode=crop&#39;,large:&#39;/s3/fhphotos/CIRD-72K5-G6H6_ROM_1.jpg?width=600&amp;height=600&amp;mode=max&#39;,extraLarge:&#39;/s3/fhphotos/CIRD-72K5-G6H6_ROM_1.jpg?width=1000&amp;height=1000&amp;mode=max&#39;,full:&#39;http://fhphotos.s3-website-us-east-1.amazonaws.com/CIRD-72K5-G6H6_ROM_1.jpg&#39;},{order:3,thumb:&#39;/s3/fhphotos/CIRD-72K5-G6H6_ROM_2.jpg?width=200&amp;height=200&amp;mode=crop&#39;,large:&#39;/s3/fhphotos/CIRD-72K5-G6H6_ROM_2.jpg?width=600&amp;height=600&amp;mode=max&#39;,extraLarge:&#39;/s3/fhphotos/CIRD-72K5-G6H6_ROM_2.jpg?width=1000&amp;height=1000&amp;mode=max&#39;,full:&#39;http://fhphotos.s3-website-us-east-1.amazonaws.com/CIRD-72K5-G6H6_ROM_2.jpg&#39;},{order:4,thumb:&#39;/s3/fhphotos/CIRD-72K5-G6H6_DET_1.jpg?width=200&amp;height=200&amp;mode=crop&#39;,large:&#39;/s3/fhphotos/CIRD-72K5-G6H6_DET_1.jpg?width=600&amp;height=600&amp;mode=max&#39;,extraLarge:&#39;/s3/fhphotos/CIRD-72K5-G6H6_DET_1.jpg?width=1000&amp;height=1000&amp;mode=max&#39;,full:&#
39;http://fhphotos.s3-website-us-east-1.amazonaws.com/CIRD-72K5-G6H6_DET_1.jpg&#39;},{order:5,thumb:&#39;/s3/fhphotos/CIRD-72K5-G6H6_DET_2.jpg?width=200&amp;height=200&amp;mode=crop&#39;,large:&#39;/s3/fhphotos/CIRD-72K5-G6H6_DET_2.jpg?width=600&amp;height=600&amp;mode=max&#39;,extraLarge:&#39;/s3/fhphotos/CIRD-72K5-G6H6_DET_2.jpg?width=1000&amp;height=1000&amp;mode=max&#39;,full:&#39;http://fhphotos.s3-website-us-east-1.amazonaws.com/CIRD-72K5-G6H6_DET_2.jpg&#39;},{order:6,thumb:&#39;/s3/fhphotos/CIRD-72K5-G6H6_BCK_1.jpg?width=200&amp;height=200&amp;mode=crop&#39;,large:&#39;/s3/fhphotos/CIRD-72K5-G6H6_BCK_1.jpg?width=600&amp;height=600&amp;mode=max&#39;,extraLarge:&#39;/s3/fhphotos/CIRD-72K5-G6H6_BCK_1.jpg?width=1000&amp;height=1000&amp;mode=max&#39;,full:&#39;http://fhphotos.s3-website-us-east-1.amazonaws.com/CIRD-72K5-G6H6_BCK_1.jpg&#39;},{order:7,thumb:&#39;/s3/fhphotos/CIRD-72K5-G6H6_FRT_1.jpg?width=200&amp;height=200&amp;mode=crop&#39;,large:&#39;/s3/fhphotos/CIRD-72K5-G6H6_FRT_1.jpg?width=600&amp;height=600&amp;mode=max&#39;,extraLarge:&#39;/s3/fhphotos/CIRD-72K5-G6H6_FRT_1.jpg?width=1000&amp;height=1000&amp;mode=max&#39;,full:&#39;http://fhphotos.s3-website-us-east-1.amazonaws.com/CIRD-72K5-G6H6_FRT_1.jpg&#39;},{order:8,thumb:&#39;/s3/fhphotos/CIRD-72K5-G6H6_SID_1.jpg?width=200&amp;height=200&amp;mode=crop&#39;,large:&#39;/s3/fhphotos/CIRD-72K5-G6H6_SID_1.jpg?width=600&amp;height=600&amp;mode=max&#39;,extraLarge:&#39;/s3/fhphotos/CIRD-72K5-G6H6_SID_1.jpg?width=1000&amp;height=1000&amp;mode=max&#39;,full:&#39;http://fhphotos.s3-website-us-east-1.amazonaws.com/CIRD-72K5-G6H6_SID_1.jpg&#39;},{order:9,thumb:&#39;/s3/fhphotos/CIRD-72K5-G6H6_ROM_3.jpg?width=200&amp;height=200&amp;mode=crop&#39;,large:&#39;/s3/fhphotos/CIRD-72K5-G6H6_ROM_3.jpg?width=600&amp;height=600&amp;mode=max&#39;,extraLarge:&#39;/s3/fhphotos/CIRD-72K5-G6H6_ROM_3.jpg?width=1000&amp;height=1000&amp;mode=max&#39;,full:&#39;http://fhphotos.s3-website-us-east-1.amazonaws.com/CIRD-72K5-G6H6_ROM_3.
jpg&#39;},{order:10,thumb:&#39;/s3/fhphotos/CIRD-72K5-G6H6_DET_3.jpg?width=200&amp;height=200&amp;mode=crop&#39;,large:&#39;/s3/fhphotos/CIRD-72K5-G6H6_DET_3.jpg?width=600&amp;height=600&amp;mode=max&#39;,extraLarge:&#39;/s3/fhphotos/CIRD-72K5-G6H6_DET_3.jpg?width=1000&amp;height=1000&amp;mode=max&#39;,full:&#39;http://fhphotos.s3-website-us-east-1.amazonaws.com/CIRD-72K5-G6H6_DET_3.jpg&#39;},{order:11,thumb:&#39;/s3/fhphotos/CIRD-72K5-G6H6_ROM_4.jpg?width=200&amp;height=200&amp;mode=crop&#39;,large:&#39;/s3/fhphotos/CIRD-72K5-G6H6_ROM_4.jpg?width=600&amp;height=600&amp;mode=max&#39;,extraLarge:&#39;/s3/fhphotos/CIRD-72K5-G6H6_ROM_4.jpg?width=1000&amp;height=1000&amp;mode=max&#39;,full:&#39;http://fhphotos.s3-website-us-east-1.amazonaws.com/CIRD-72K5-G6H6_ROM_4.jpg&#39;}],priceHtml:&#39;$520.00&#39;,itemNumber:&#39;CIRD-72K5-G6H6&#39;,name:&#39;Brooks Lounge Chair-Ebony, Blk Wsh Weath&#39;,availableForImmediateShipment:true,isNew:false,isCloseout:false},{attributeValueIds:[12711,12713],description:&#39;Our take on the classic Adirondack emphasizes comfort with green, stonewashed cotton canvas cushioning. 
Wire-brushed oak is hand-distressed for a naturally weathered patina.&#39;,dimensions:&#39;W: 27.75&quot; H: 29&quot; D: 34.5&quot;&#39;,availabilityDescription:&#39;&lt;strong>Quantity in Stock: &lt;/strong>&lt;span >147&lt;/span>&lt;br />&lt;strong>More on the Way: &lt;/strong>&lt;span >Yes&lt;/span>&lt;br />&lt;strong>Estimated Arrival Date: &lt;/strong>&lt;span >1 to 2 weeks&lt;/span>&#39;,colors:[&#39;Distressed Washed Old Oak&#39;,&#39;Stonewash Dark Green&#39;],weightPounds:45.0,volumeCubicFeet:18.72,images:[{order:1,thumb:&#39;/s3/fhphotos/CIRD-72K6-H9_PRM_1.jpg?width=200&amp;height=200&amp;mode=crop&#39;,large:&#39;/s3/fhphotos/CIRD-72K6-H9_PRM_1.jpg?width=600&amp;height=600&amp;mode=max&#39;,extraLarge:&#39;/s3/fhphotos/CIRD-72K6-H9_PRM_1.jpg?width=1000&amp;height=1000&amp;mode=max&#39;,full:&#39;http://fhphotos.s3-website-us-east-1.amazonaws.com/CIRD-72K6-H9_PRM_1.jpg&#39;},{order:2,thumb:&#39;/s3/fhphotos/CIRD-72K6-H9_ROM_1.jpg?width=200&amp;height=200&amp;mode=crop&#39;,large:&#39;/s3/fhphotos/CIRD-72K6-H9_ROM_1.jpg?width=600&amp;height=600&amp;mode=max&#39;,extraLarge:&#39;/s3/fhphotos/CIRD-72K6-H9_ROM_1.jpg?width=1000&amp;height=1000&amp;mode=max&#39;,full:&#39;http://fhphotos.s3-website-us-east-1.amazonaws.com/CIRD-72K6-H9_ROM_1.jpg&#39;},{order:3,thumb:&#39;/s3/fhphotos/CIRD-72K6-H9_ROM_2.jpg?width=200&amp;height=200&amp;mode=crop&#39;,large:&#39;/s3/fhphotos/CIRD-72K6-H9_ROM_2.jpg?width=600&amp;height=600&amp;mode=max&#39;,extraLarge:&#39;/s3/fhphotos/CIRD-72K6-H9_ROM_2.jpg?width=1000&amp;height=1000&amp;mode=max&#39;,full:&#39;http://fhphotos.s3-website-us-east-1.amazonaws.com/CIRD-72K6-H9_ROM_2.jpg&#39;},{order:4,thumb:&#39;/s3/fhphotos/CIRD-72K6-H9_DET_1.jpg?width=200&amp;height=200&amp;mode=crop&#39;,large:&#39;/s3/fhphotos/CIRD-72K6-H9_DET_1.jpg?width=600&amp;height=600&amp;mode=max&#39;,extraLarge:&#39;/s3/fhphotos/CIRD-72K6-H9_DET_1.jpg?width=1000&amp;height=1000&amp;mode=max&#39;,full:&#39;http://fhphotos.s3-website-us-east-1.amazonaws.com/
CIRD-72K6-H9_DET_1.jpg&#39;},{order:5,thumb:&#39;/s3/fhphotos/CIRD-72K6-H9_DET_2.jpg?width=200&amp;height=200&amp;mode=crop&#39;,large:&#39;/s3/fhphotos/CIRD-72K6-H9_DET_2.jpg?width=600&amp;height=600&amp;mode=max&#39;,extraLarge:&#39;/s3/fhphotos/CIRD-72K6-H9_DET_2.jpg?width=1000&amp;height=1000&amp;mode=max&#39;,full:&#39;http://fhphotos.s3-website-us-east-1.amazonaws.com/CIRD-72K6-H9_DET_2.jpg&#39;},{order:6,thumb:&#39;/s3/fhphotos/CIRD-72K6-H9_BCK_1.jpg?width=200&amp;height=200&amp;mode=crop&#39;,large:&#39;/s3/fhphotos/CIRD-72K6-H9_BCK_1.jpg?width=600&amp;height=600&amp;mode=max&#39;,extraLarge:&#39;/s3/fhphotos/CIRD-72K6-H9_BCK_1.jpg?width=1000&amp;height=1000&amp;mode=max&#39;,full:&#39;http://fhphotos.s3-website-us-east-1.amazonaws.com/CIRD-72K6-H9_BCK_1.jpg&#39;},{order:7,thumb:&#39;/s3/fhphotos/CIRD-72K6-H9_FRT_1.jpg?width=200&amp;height=200&amp;mode=crop&#39;,large:&#39;/s3/fhphotos/CIRD-72K6-H9_FRT_1.jpg?width=600&amp;height=600&amp;mode=max&#39;,extraLarge:&#39;/s3/fhphotos/CIRD-72K6-H9_FRT_1.jpg?width=1000&amp;height=1000&amp;mode=max&#39;,full:&#39;http://fhphotos.s3-website-us-east-1.amazonaws.com/CIRD-72K6-H9_FRT_1.jpg&#39;},{order:8,thumb:&#39;/s3/fhphotos/CIRD-72K6-H9_SID_1.jpg?width=200&amp;height=200&amp;mode=crop&#39;,large:&#39;/s3/fhphotos/CIRD-72K6-H9_SID_1.jpg?width=600&amp;height=600&amp;mode=max&#39;,extraLarge:&#39;/s3/fhphotos/CIRD-72K6-H9_SID_1.jpg?width=1000&amp;height=1000&amp;mode=max&#39;,full:&#39;http://fhphotos.s3-website-us-east-1.amazonaws.com/CIRD-72K6-H9_SID_1.jpg&#39;},{order:9,thumb:&#39;/s3/fhphotos/CIRD-72K6-H9_ROM_3.jpg?width=200&amp;height=200&amp;mode=crop&#39;,large:&#39;/s3/fhphotos/CIRD-72K6-H9_ROM_3.jpg?width=600&amp;height=600&amp;mode=max&#39;,extraLarge:&#39;/s3/fhphotos/CIRD-72K6-H9_ROM_3.jpg?width=1000&amp;height=1000&amp;mode=max&#39;,full:&#39;http://fhphotos.s3-website-us-east-1.amazonaws.com/CIRD-72K6-H9_ROM_3.jpg&#39;},{order:10,thumb:&#39;/s3/fhphotos/CIRD-72K6-H9_DET_3.jpg?width=200&amp;height=200&amp;
mode=crop&#39;,large:&#39;/s3/fhphotos/CIRD-72K6-H9_DET_3.jpg?width=600&amp;height=600&amp;mode=max&#39;,extraLarge:&#39;/s3/fhphotos/CIRD-72K6-H9_DET_3.jpg?width=1000&amp;height=1000&amp;mode=max&#39;,full:&#39;http://fhphotos.s3-website-us-east-1.amazonaws.com/CIRD-72K6-H9_DET_3.jpg&#39;}],priceHtml:&#39;$290.00&#39;,itemNumber:&#39;CIRD-72K6-H9&#39;,name:&#39;Brooks Lounge Chair-Stonewsh Drk Green&#39;,availableForImmediateShipment:true,isNew:false,isCloseout:false}],activeItemNumber:&#39;CIRD-72K5-G6H6&#39;,priceDescription:&#39;Wholesale Price&#39;} }"></div> 

?

Answer

It's not entirely clear where this HTML string comes from or what exactly you want to extract, but for the Beautiful Soup part you just need:

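from bs4 import BeautifulSoup
soup = BeautifulSoup(s, "html.parser")  # name the parser explicitly
text = soup.div['data-bind']  # value of the data-bind attribute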

where s is the string from your question. This first gets the div tag and then reads its data-bind attribute.
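On the decoding step from the question: the parser behind Beautiful Soup already converts entities such as &#39; and &amp; inside attribute values, so text should come back decoded. If you only have the raw attribute text, the standard library's html.unescape does the same kind of conversion. A minimal sketch, where raw is a shortened fragment copied from the attribute in the question and the variable names are my own:

```python
import html

# Shortened fragment of the raw data-bind attribute from the question.
raw = (
    "extraLarge:&#39;/s3/fhphotos/CIRD-72K6-H9_SID_1.jpg"
    "?width=1000&amp;height=1000&amp;mode=max&#39;"
)

# html.unescape replaces character references with the characters they stand for.
decoded = html.unescape(raw)
print(decoded)
# prints: extraLarge:'/s3/fhphotos/CIRD-72K6-H9_SID_1.jpg?width=1000&height=1000&mode=max'
```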

The format puzzles me: it looks like JSON and like a Python dictionary, but neither of those parsers accepted the input (the keys are unquoted and the strings are single-quoted). I think it is JavaScript? I wrote a quick and dirty brace-counting loop inspired by this question:

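nest_lvl = 0
lvl_string = ['']  # text collected so far at each nesting depth
for char in text:
    if char == '{':
        nest_lvl += 1  # entering a nested object

    try:
        lvl_string[nest_lvl] += char
    except IndexError:  # first character seen at this depth
        lvl_string.append(char)

    if char == '}':
        # the object at this depth is complete: print it, reset it, step out
        print(nest_lvl, lvl_string[nest_lvl])
        lvl_string[nest_lvl] = ''
        nest_lvl -= 1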

That should hopefully get you started. Again, the parsing part really depends on how general the parser needs to be and what exactly you want to extract.
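If all you need is the extraLarge URL mentioned in the question, a regular expression over the decoded attribute may be enough, without parsing the whole structure. This is only a sketch under assumptions: text here is a shortened stand-in for the decoded data-bind string, values are single-quoted with no escaped quotes inside, and the helper name extra_large_urls is hypothetical.

```python
import re

# Shortened stand-in for the decoded data-bind attribute.
text = (
    "images:[{order:8,"
    "thumb:'/s3/fhphotos/CIRD-72K6-H9_SID_1.jpg?width=200&height=200&mode=crop',"
    "large:'/s3/fhphotos/CIRD-72K6-H9_SID_1.jpg?width=600&height=600&mode=max',"
    "extraLarge:'/s3/fhphotos/CIRD-72K6-H9_SID_1.jpg?width=1000&height=1000&mode=max',"
    "full:'http://fhphotos.s3-website-us-east-1.amazonaws.com/CIRD-72K6-H9_SID_1.jpg'}]"
)


def extra_large_urls(decoded, item_number=None):
    """Return every extraLarge URL, optionally only those containing item_number."""
    # Capture whatever sits between the single quotes after "extraLarge:".
    urls = re.findall(r"extraLarge:'([^']*)'", decoded)
    if item_number is not None:
        urls = [url for url in urls if item_number in url]
    return urls


print(extra_large_urls(text, "CIRD-72K6-H9_SID_1"))
```

Filtering on a substring such as the item number keeps the pattern simple. If the values could contain escaped quotes, a real JavaScript-aware parser would be the safer choice.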