5.2.10.1. HTML Extraction – A Simple Example

<< Click to Display Table of Contents >>

Navigation:  5. Detailed description of the Actions > 5.2. Input Actions > 5.2.10. XML/HTML File Reader >

5.2.10.1. HTML Extraction – A Simple Example

 

Let’s assume that we have the following HTML file:

 

ANATEL~2_img255

 

 
When displayed inside a browser, this HTML file looks like this:

 

Name

Age

Frank

30

Sabrina

25

David

26

 

 

We want to import the above table inside Anatella. First, we need to find the XPATH that gives the location of the table inside the HTML document. To do so, open the HTML file inside a Browser (e.g. inside “Chrome”), right-click on the first cell of the table and select “Inspect Element”: See the screenshot below:

 

ANATEL~2_img256

 

 

You should now see the following window:
 

ANATEL~2_img257

 

 
Right-click the required XML tag (i.e. the tag that contains “Frank”) and select “Copy XPath” in the cotext menu. The clipboard now contains the following XPath:
 

/html/body/div/table/tbody/tr[2]/td[1]

 

Looking at the above XPath, you can see that the XPath that represents the start of the table is:
 

/html/body/div/table/tbody

 

Open Anatella, add a ANATEL~2_img251 ReadXML Action inside the graph, open the “Properties window” of this ReadXML Action, paste the XPath there, and click the “Auto Fill-In” button: You get:

 

ANATEL~2_img259

 

 

The above error message is normal because there is no “tbody” tags inside the HTML file (although a “tbody” tag appears inside your XPath expression). All the browsers always add “tbody” tags inside XPath expressions for compatibility reasons. To get around this annoying behavior from the browsers, open the “Advanced Parameters” panel and:

 

Enable the check box “HTML file(s)”: In addition to properly handle “tbody” tags, Anatella now also properly handles “br”, “img”, “link”,… tags that all have a special behavior in HTML.
 

Optional: Enable the check box “Ignore tags attributes when using the “Auto Fill-In” button.

 

 
You should have:
 

ANATEL~2_img260

 

 

Go the “Standard Parameters” panel and   click again the “Auto Fill-In” button: You now get:

 

ANATEL~2_img261

 

 

Discard the last 2 rows (i.e: the last 2 extractions - You don’t need them) and rename “by hand” the first 2 rows: You now have:

 

ANATEL~2_img262

 
 

Run the extraction (i.e. click the output pin of the ReadXML action): You get:

 

ANATEL~2_img264

 

You should still add a simple ANATEL~2_img40 FilterRow Action to   remove the “empty rows” (e.g. use the following expression: “strlen(Name)>0”). You finally get:
 

ANATEL~2_img263

 
 

 

ANATEL~2_img8

Inside Anatella, by default, the XPath indexes are zero-based (everything is zero-based in Anatella). This means that, inside Anatella, “td[0]” is equivalent to “td”. Unfortunately, HTML browsers are using indexes inside XPath expression that are “one-based” (i.e. for HTML browsers, “td[1]” is equivalent to “td”). Thus, to be able to directly copy/paste XPath expressions from your browser into Anatella, please verify that you changed the Anatella’s setting to use “one-based” XPath expression: Click here:

 

ANATEL~2_img267