DOMDocument::loadHTML

(PHP 5, PHP 7)

DOMDocument::loadHTML - Load HTML from a string

Description

public bool DOMDocument::loadHTML ( string $source [, int $options = 0 ] )

The function parses the HTML contained in the string source. Unlike loading XML, HTML does not have to be well-formed to load. This function may also be called statically to load and create a DOMDocument object. The static invocation may be used when no DOMDocument properties need to be set prior to loading.

Parameters

source

The HTML string.

options

Since PHP 5.4.0 and Libxml 2.6.0, you may also use the options parameter to specify additional Libxml parameters.
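As a minimal sketch of the options parameter, the snippet below passes two libxml constants, LIBXML_HTML_NOIMPLIED and LIBXML_HTML_NODEFDTD, so that a fragment is loaded without an implied html/body wrapper or a default doctype. It assumes a libxml build recent enough to define both constants; the variable names are illustrative only.

```php
<?php
// Sketch: load a fragment without the implied <html>/<body> wrapper
// and without the default doctype (assumes both constants are available).
$doc = new DOMDocument();
$doc->loadHTML('<p>Hello</p>', LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

// Serialize the document again; only the fragment itself should remain.
$fragment = $doc->saveHTML();
echo $fragment;
```

Several options can be combined with the bitwise OR operator, as shown.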

Return Values

Returns TRUE on success or FALSE on failure. If called statically, returns a DOMDocument or FALSE on failure.

Errors/Exceptions

If an empty string is passed as the source, a warning will be generated. This warning is not generated by libxml and cannot be handled using libxml's error handling functions.

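This method may be called statically, but doing so issues an E_STRICT error.

While malformed HTML should load successfully, this function may generate E_WARNING errors when it encounters bad markup. libxml's error handling functions may be used to handle these errors.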
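A minimal sketch of handling these parser warnings with libxml's error functions follows. The malformed markup is a made-up sample, and which messages are reported (if any) depends on the libxml version in use.

```php
<?php
// Collect libxml errors internally instead of emitting E_WARNING.
$previous = libxml_use_internal_errors(true);

$doc = new DOMDocument();
// Hypothetical sample input with an unclosed <b> and a stray end tag.
$loaded = $doc->loadHTML('<html><body><p>Unclosed <b>bold</p></div></body></html>');

// Fetch whatever the parser reported, then reset the error buffer.
$errors = libxml_get_errors();
libxml_clear_errors();

// Restore the previous error-handling mode.
libxml_use_internal_errors($previous);

foreach ($errors as $error) {
    printf("line %d: %s\n", $error->line, trim($error->message));
}
```

Restoring the previous mode keeps the change local to this one load, which matters when other code relies on the default behaviour.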

Examples

Example #1 Creating a Document

<?php
$doc = new DOMDocument();
$doc->loadHTML("<html><body>Test<br></body></html>");
echo $doc->saveHTML();
?>

Changelog

Version Description
5.4.0 Added options parameter.

See Also

User Contributed Notes

kerim-yagmurcu at gmx dot de 25-Dec-2016 12:13
For those of you who want to get elements by class from an external URL, I have 2 useful functions. In this example we get the '<h3 class="r">' elements back (search result headers) from a Google search:

1. Check the URL (if it is reachable, existing)
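<?php
# URL check: true when the first response header reports a 2xx status
function url_check($url) {
    $headers = @get_headers($url);
    // Accept both "HTTP/1.1 200 OK" and "HTTP/2 200" style status lines
    return is_array($headers)
        && preg_match('/^HTTP\/\d+(\.\d+)?\s+2\d\d\b/', $headers[0]) === 1;
}
?>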

2. Clean the element you want to get (remove all tags, tabs, new-lines etc.)
<?php
# Function to clean a string
function clean($text) {
    // Strip tags, collapse whitespace and trim
    $clean = trim(preg_replace('/\s+/S', ' ', strip_tags($text)));
    // Decode entities first: replacing ';' beforehand would break them
    $clean = str_replace(';', '-', html_entity_decode($clean));
    return $clean;
}
?>

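After doing that, we can output the search result headers with the following method:
<?php
$searchstring = 'djceejay';
// A "#q=" fragment is never sent to the server, so use a query string
$url = 'http://www.google.de/search?q=' . urlencode($searchstring);
if (url_check($url)) {
    $doc = new DOMDocument();
    // Keep parser warnings for real-world HTML out of the output
    libxml_use_internal_errors(true);
    $doc->loadHTML(file_get_contents($url));
    // DOMDocument has no getElementByClass(); select by class with XPath
    $xpath = new DOMXPath($doc);
    $query = '//h3[contains(concat(" ", normalize-space(@class), " "), " r ")]';
    foreach ($xpath->query($query) as $node) {
        echo clean($node->textContent) . '<br>';
    }
} else {
    echo 'URL not reachable!'; // shown when the URL cannot be fetched
}
?>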
fr at felix-riesterer dot de 13-Feb-2016 10:43
Remember: If you use an HTML5 doctype and a meta element like so

<meta charset="utf-8">

your HTML code will get interpreted as ISO-8859-something and non-ASCII chars will get converted into HTML entities. However, the HTML4-like version will work (as was pointed out 10 years ago by "bigtree at 29a"):

<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
finkenb2 at mail dot lib dot msu dot edu 06-Oct-2015 01:03
Warning:  This does not function well with HTML5 elements such as SVG.  Most of the advice on the Web is to turn off errors in order to have it work with HTML5.
cake at brothercake dot com 17-Dec-2012 02:57
Be aware that this function doesn't actually understand HTML -- it fixes tag-soup input using the general rules of SGML, so it creates well-formed markup, but has no idea which element contexts are allowed.

For example, with input like this where the first element isn't closed:

    <span>hello <div>world</div>

loadHTML will change it to this, which is well-formed but invalid:

    <span>hello <div>world</div></span>
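The behaviour described above can be checked with a short sketch like the following; it walks the parsed tree rather than assuming an exact serialized string, and the variable names are illustrative.

```php
<?php
// Load the unclosed <span> from the note above.
$doc = new DOMDocument();
$doc->loadHTML('<span>hello <div>world</div>');

// The parser closes the span at the end of the input, so the div
// ends up nested inside it even though HTML forbids that nesting.
$span = $doc->getElementsByTagName('span')->item(0);
$nestedDivs = $span->getElementsByTagName('div')->length;

echo "div elements inside span: ", $nestedDivs, "\n";
echo $doc->saveHTML();
```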
Alex 10-Apr-2010 05:45
Beware of the "gotcha" (works as designed but not as expected): if you use loadHTML, you cannot validate the document. Validation is only for XML. Details here: http://bugs.php.net/bug.php?id=43771&edit=1
Shane Harter 04-Jan-2010 05:42
DOMDocument is very good at dealing with imperfect markup, but it throws warnings all over the place when it does.

This isn't well documented here. The solution is to set up a separate apparatus for dealing with just these errors.

Set libxml_use_internal_errors(true) before calling loadHTML. This will prevent errors from bubbling up to your default error handler. And you can then get at them (if you desire) using other libxml error functions.

You can find more info here http://www.php.net/manual/en/ref.libxml.php
mdmitry at gmail dot com 21-Dec-2009 06:02
You can also load HTML as UTF-8 using this simple hack:

<?php

$doc = new DOMDocument();
$doc->loadHTML('<?xml encoding="UTF-8">' . $html);

// dirty fix
foreach ($doc->childNodes as $item)
    if ($item->nodeType == XML_PI_NODE)
        $doc->removeChild($item); // remove hack
$doc->encoding = 'UTF-8'; // insert proper

?>
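A self-contained sketch of the same hack, with a made-up sample string in place of the undefined $html, might look like this. It only checks that the text comes back as the same UTF-8 string; the "\u{...}" escapes assume PHP 7 or later.

```php
<?php
// Sample UTF-8 input ("Tiếng Việt"), written with escapes so the
// snippet does not depend on the encoding of the source file.
$html = "<p>Ti\u{1EBF}ng Vi\u{1EC7}t</p>";

$doc = new DOMDocument();
// Prefix the markup with an XML declaration-like hint so the parser
// treats the bytes as UTF-8.
$doc->loadHTML('<?xml encoding="UTF-8">' . $html);

// DOM methods return UTF-8, so this should match the input text.
$text = $doc->getElementsByTagName('p')->item(0)->textContent;
echo $text, "\n";
```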
piopier 14-Jun-2009 05:29
Here is a function I wrote to capitalize on the previous remarks about charset problems (UTF-8...) when using loadHTML and then DOM functions.
It adds the charset meta tag just after <head> to improve automatic encoding detection and converts every non-ASCII character to an HTML entity, so that PHP DOM functions/attributes return correct values.

<?php
mb_detect_order("ASCII,UTF-8,ISO-8859-1,windows-1252,iso-8859-15");
function loadNprepare($url, $encod = '') {
    $content = file_get_contents($url);
    if (!empty($content)) {
        if (empty($encod))
            $encod = mb_detect_encoding($content);
        $headpos = mb_strpos($content, '<head>');
        if (FALSE === $headpos)
            $headpos = mb_strpos($content, '<HEAD>');
        if (FALSE !== $headpos) {
            $headpos += 6;
            $content = mb_substr($content, 0, $headpos) . '<meta http-equiv="Content-Type" content="text/html; charset=' . $encod . '">' . mb_substr($content, $headpos);
        }
        $content = mb_convert_encoding($content, 'HTML-ENTITIES', $encod);
    }
    $dom = new DomDocument;
    $res = $dom->loadHTML($content);
    if (!$res) return FALSE;
    return $dom;
}
?>

NB: it uses mb_strpos/mb_substr instead of mb_ereg_replace because that seemed more efficient with huge html pages.
Errol 11-Feb-2009 05:05
It should be noted that when any text is provided within the body tag
outside of a containing element, the DOMDocument will encapsulate that
text into a paragraph tag (<p>).

For example:
<?php
$doc = new DOMDocument();
$doc->loadHTML("<html><body>Test<br><div>Text</div></body></html>");
echo $doc->saveHTML();
?>

will yield:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"
"http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body>
<p>Test<br></p>
<div>Text</div>
</body></html>

while:
<?php
$doc = new DOMDocument();
$doc->loadHTML(
    "<html><body><i>Test</i><br><div>Text</div></body></html>");
echo $doc->saveHTML();
?>

will yield:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"
"http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body>
<i>Test</i><br><div>Text</div>
</body></html>
jamesedwardcooke+php at gmail dot com 20-Oct-2008 08:37
Using loadHTML() automagically sets the doctype property of your DOMDocument instance (to the doctype in the HTML, or to 4.0 Transitional by default). If you set the doctype with DOMImplementation, it will be overridden.

I assumed it was possible to set it and then load HTML with the doctype I defined (in order to decide the doctype at runtime), and ran into a huge headache trying to find out where my doctype was going. Hopefully this helps someone else.
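To see which doctype loadHTML() ended up with, a sketch like the one below inspects the doctype property after loading. The two sample documents are made up, and the fallback value is guarded with a null check because it depends on the libxml build and options.

```php
<?php
// Document that declares its own doctype.
$withDoctype = new DOMDocument();
$withDoctype->loadHTML('<!DOCTYPE html><html><body><p>Hi</p></body></html>');
$declaredName = $withDoctype->doctype->name;
echo "declared doctype name: ", $declaredName, "\n";

// Document without a doctype: the parser normally supplies a default.
$withoutDoctype = new DOMDocument();
$withoutDoctype->loadHTML('<html><body><p>Hi</p></body></html>');
if ($withoutDoctype->doctype !== null) {
    echo "default public id: ", $withoutDoctype->doctype->publicId, "\n";
}
```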
xuanbn at yahoo dot com 04-Oct-2007 10:38
If you use loadHTML() to process a UTF-8 HTML string (e.g. in Vietnamese), you may get garbage text for some files while others are OK, even if your HTML already has a meta charset like

  <meta http-equiv="content-type" content="text/html; charset=utf-8">

I have discovered that, to help loadHTML() process a UTF-8 file correctly, the meta tag should come first, before any UTF-8 string appears. For example, this HTML file

<html>
 <head>
    <meta http-equiv="content-type" content="text/html; charset=utf-8">
    <title> Vietnamese - Tiếng Việt</title>
  </head>
<body></body>
</html>

will be OK with loadHTML() because the <meta> tag appears before the <title> tag.

But the file below will not be recognized by loadHTML(), because the <title> tag, which contains a UTF-8 string, appears before the <meta> tag.

<html>
 <head>
    <title> Vietnamese - Tiếng Việt</title>
    <meta http-equiv="content-type" content="text/html; charset=utf-8">
  </head>
<body></body>
</html>
hanhvansu at yahoo dot com 27-Apr-2007 05:50
When using loadHTML() to process UTF-8 pages, you may find that the output of the DOM functions does not match the input. For example, if you want to get "Cạnh tranh", you will receive "Cáº¡nh tranh". I suggest using mb_convert_encoding before loading a UTF-8 page:
<?php
$pageDom = new DomDocument();
$searchPage = mb_convert_encoding($htmlUTF8Page, 'HTML-ENTITIES', "UTF-8");
@$pageDom->loadHTML($searchPage);
?>
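The 'HTML-ENTITIES' pseudo-encoding is deprecated as of PHP 8.2. One possible alternative, sketched here under the assumption that the mbstring extension is available, is mb_encode_numericentity(); the helper name utf8ToNumericEntities is hypothetical.

```php
<?php
// Hypothetical helper: turn every non-ASCII character into a numeric
// entity so the parser no longer has to guess the input encoding.
function utf8ToNumericEntities(string $html): string
{
    // Conversion map: code points U+0080..U+10FFFF, no offset, full mask.
    return mb_encode_numericentity($html, [0x80, 0x10FFFF, 0, 0x1FFFFF], 'UTF-8');
}

// Sample UTF-8 input ("Cạnh tranh") written with an escape sequence.
$html = "<p>C\u{1EA1}nh tranh</p>";

$doc = new DOMDocument();
$doc->loadHTML(utf8ToNumericEntities($html));

// DOM methods return UTF-8, so the original text should come back.
$text = $doc->getElementsByTagName('p')->item(0)->textContent;
echo $text, "\n";
```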
romain dot lalaut at laposte dot net 15-Feb-2007 05:31
Note that the elements of such a document will have no namespace, even with <html xmlns="http://www.w3.org/1999/xhtml">
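A short sketch of what this means in practice: namespaceURI is empty on the loaded elements, so XPath queries use plain, unprefixed element names. The sample markup and variable names are illustrative.

```php
<?php
$doc = new DOMDocument();
$doc->loadHTML('<html xmlns="http://www.w3.org/1999/xhtml"><body><p>Hi</p></body></html>');

// The xmlns attribute is not treated as a namespace declaration.
$p = $doc->getElementsByTagName('p')->item(0);
var_dump($p->namespaceURI);

// No prefix registration is needed for XPath queries.
$xpath = new DOMXPath($doc);
$count = $xpath->query('//p')->length;
echo "paragraphs found: ", $count, "\n";
```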
bigtree at DONTSPAM dot 29a dot nl 26-Apr-2005 11:15
Pay attention when loading html that has a different charset than iso-8859-1. Since this method does not actively try to figure out what the html you are trying to load is encoded in (like most browsers do), you have to specify it in the html head. If, for instance, your html is in utf-8, make sure you have a meta tag in the html's head section:

<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>
</head>

If you do not specify the charset like this, all high-ascii bytes will be html-encoded. It is not enough to set the dom document you are loading the html in to UTF-8.