5.2.10.2. HTML Extraction – A More Complex Example

<< Click to Display Table of Contents >>

Navigation:  5. Detailed description of the Actions > 5.2. Input Actions > 5.2.10. XML/HTML File Reader >

5.2.10.2. HTML Extraction – A More Complex Example

 

Let’s assume that we have the following HTML file:

 

ANATEL~2_img268

 

 

When displayed inside a browser, this HTML file looks like this:

 

 

ANATEL~2_img269

 

 

There are two “iteration levels” inside this HTML file:
 

The first iteration is about the Clients
 

The second level of iteration is about the transactions that each client commited. The second level is “embedded” inside the first level.

 

 
To extract this HTML, you have 2 solutions.

 

Here are the parameters for the first solution:

 

ANATEL~2_img270

 

 

ANATEL~2_img8

The parameter “Read only the subtags named:” is optional.number1

If left blank, Anatella simply iterates on all the subtags found here number2.

E.g. for the above example, you can let this parameter blank: It does not change anything.

 

 
 
The output of the first solution is (after removing the empy row):

 

ANATEL~2_img272

 

 

This first solution is not very good because the number of “Purchased items” extracted is limited to 3.
 

 
Here are the parameters for the second solution: Please note that we are now defining 2 levels:

 

ANATEL~2_img273

 

 

The output of this second solution is (after removing the empy row):

 

ANATEL~2_img275

 

 
This second solution is better than the first one because it does not impose any restriction on the number of transactions that each client commited (and, also, it looks more like the original HTML file).

Anatella allows you to define as many levels as you like, so that you can easily extract any data from any HTML file, whatever the size and structure. Furthermore, it’s very easy to find the right extraction parameters for the ANATEL~2_img251 ReadXML Action (i.e. the right XPATHs) because you can directly use the XPath expressions generated by Chrome, Firefox or IE (i.e. nearly all other HTML parsers that are based on XPath expressions can’t use XPath expressions generated by Chrome, Firefox or IE because they have problems with <tbody> tags).