Sunday, July 10, 2016

Using Microsoft.mshtml with the WebBrowser control

When using a WebBrowser control to extract information from a web page, the classes most commonly used to traverse its DOM come from the System.Windows.Forms namespace, such as HtmlDocument and HtmlElementCollection.
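For comparison, here is a minimal sketch of that managed approach. The class and method names (ManagedDomSketch, CollectVidsData) are hypothetical, and the "data-vids" attribute is only an example. The sketch assumes the page has already finished loading, and it treats an empty string as "attribute not present", which is how HtmlElement.GetAttribute reports a missing attribute as far as I know.

```csharp
using System.Collections.Generic;
using System.Windows.Forms;

static class ManagedDomSketch
{
    // Hypothetical helper: collects the "data-vids" attribute of every anchor
    // using only the managed System.Windows.Forms wrappers.
    public static List<string> CollectVidsData(WebBrowser browser)
    {
        var results = new List<string>();

        // Document is null until a page has been loaded.
        HtmlDocument document = browser.Document;
        if (document == null)
        {
            return results;
        }

        HtmlElementCollection anchors = document.GetElementsByTagName("a");
        foreach (HtmlElement anchor in anchors)
        {
            // Assumption: a missing attribute comes back as an empty string,
            // so null and empty are both skipped.
            string value = anchor.GetAttribute("data-vids");
            if (!string.IsNullOrEmpty(value))
            {
                results.Add(value);
            }
        }

        return results;
    }
}
```

A natural place to call a helper like this is the WebBrowser's DocumentCompleted event handler, since the document is fully parsed by then.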
Another way to traverse the DOM of the WebBrowser's document is the mshtml namespace, which contains interfaces for the rendering engine of Internet Explorer. To access those interfaces in your application, add a reference to the Microsoft.mshtml assembly.

The code sample below sets an alias for the mshtml namespace. Qualify each class or interface with the MSHTML alias, since System.Windows.Forms has similarly named types (HtmlDocument versus HTMLDocument, for example) and the prefix makes it clear which one is meant. One thing to remember: empty results from mshtml do not come back as null. Instead, they come back as a DBNull value.
using MSHTML = mshtml; // namespace alias, declared once at the top of the form's source file
....
....
List<string> myVidsData = new List<string>();

// DomDocument exposes the underlying COM document, which can be cast to the mshtml type
MSHTML.IHTMLElementCollection anchorElements =
        ((MSHTML.HTMLDocument)WebBrowser1.Document.DomDocument).getElementsByTagName("a");

if (anchorElements != null && anchorElements.length > 0)
{
    foreach (MSHTML.HTMLAnchorElement element in anchorElements)
    {
        // getAttribute returns object, so read it once and convert it when adding
        object attribute = element.getAttribute("data-vids");

        // empty attributes return DBNull rather than null
        if (attribute != null && !System.DBNull.Value.Equals(attribute))
        {
            myVidsData.Add(attribute.ToString());
        }
    }
}
