How Do I Extract A Dictionary From A CDATA Section Embedded In HTML?
I used Python to scrape an HTML file, but the data I really need is embedded in a CDATA section. My code: import requests from bs4 import BeautifulSoup url='https://www.website.com' p
Solution 1:
This example prints the string inside the <script> tag, then parses the data with the re and json modules:
import re
import json
from bs4 import BeautifulSoup
txt = '''<div class="react-container" id="react-container">
<script type="text/javascript">
//<![CDATA[
window.REACT_OPTS = {"components":[{"component_name":"","props":{},"router":true,"redux":true,"selector":"#react-container","ignoreMissingSelector":false}]}
// ]]>
</script>
</div>
'''
soup = BeautifulSoup(txt, 'html.parser')
# select desired <script> tag
script_tag = soup.select_one('#react-container script')
# print contents of the <script> tag:
print(script_tag.string)
# parse the json data inside <script> tag to variable 'data'
data = json.loads(re.search(r'window\.REACT_OPTS = ({.*})', script_tag.string).group(1))
# print data to screen:
print(json.dumps(data, indent=4))
Prints:
//<![CDATA[
window.REACT_OPTS = {"components":[{"component_name":"","props":{},"router":true,"redux":true,"selector":"#react-container","ignoreMissingSelector":false}]}
// ]]>
{
    "components": [
        {
            "component_name": "",
            "props": {},
            "router": true,
            "redux": true,
            "selector": "#react-container",
            "ignoreMissingSelector": false
        }
    ]
}
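The greedy pattern ({.*}) works for the snippet above, but it captures up to the last } on the line, so it can break if the script puts a second assignment on the same line. As a rough sketch of one alternative (not part of the original answer), you can locate the first { after the variable name and let json.JSONDecoder.raw_decode read exactly one JSON value from there. The helper name extract_js_object, its parameters, and the window.OTHER assignment in the sample are made up for illustration:

```python
import json
from bs4 import BeautifulSoup


def extract_js_object(html, selector, marker):
    """Return the JSON object assigned after `marker` in the selected <script>, or None.

    Hypothetical helper for illustration; not from the original answer.
    """
    soup = BeautifulSoup(html, "html.parser")
    script_tag = soup.select_one(selector)
    if script_tag is None or script_tag.string is None:
        return None  # tag not found, or it has no single text child
    text = script_tag.string
    start = text.find(marker)
    if start == -1:
        return None  # variable name not present in the script
    brace = text.find("{", start + len(marker))
    if brace == -1:
        return None  # no object literal after the variable name
    # raw_decode parses one JSON value starting at index `brace`
    # and ignores whatever text follows it.
    obj, _end = json.JSONDecoder().raw_decode(text, brace)
    return obj


# Sample with a second (invented) assignment on the same line.
html = '''<div class="react-container" id="react-container">
<script type="text/javascript">
//<![CDATA[
window.REACT_OPTS = {"components":[{"selector":"#react-container","router":true}]}; window.OTHER = {"x":1};
// ]]>
</script>
</div>'''

data = extract_js_object(html, "#react-container script", "window.REACT_OPTS")
print(data["components"][0]["selector"])  # prints: #react-container
```

This only works when the embedded object is valid JSON (double-quoted keys, no trailing commas, no JavaScript expressions). If it is not, raw_decode raises json.JSONDecodeError and you would need a JavaScript-aware parser instead.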