<< Click to Display Table of Contents >> Navigation: 5. Detailed description of the Actions > 5.2. Input Actions > 5.2.10. XML/HTML File Reader |
Icon:
Function: readXML
Property window:
Short description:
Reads a (compressed) XML/HTML file.
Long Description:
See section 5.1.1 to have more information on how to specify the filename of the XML/HTML file (i.e. You can use relative path, wildcards, and Javascript to specify your filename).
You can drag&drop a .XML file, a .HTML file or a .HTM file from a MS-File-Explorer-Window into an Anatella-Graph-Window: This will directly create the corresponding ReadXML Action inside the Anatella graph.
The XML/HTML reader included inside Anatella is stream-oriented. This means that Anatella reads the XML/HTML file “chunk-by-chunk” (when all data from one chunk has been extracted, Anatella loads the next chunk). This is in opposition to almost all other XML/HTML extraction engine (Almost all engines require to load the whole XML file in RAM memory before starting data extraction). This means that, in opposition to other XML/HTML extraction engine:
•There are no limits to the size of the XML/HTML file that Anatella can read&parse.
•With Anatella, Data Extraction from XML/HTML only requires a small (and constant) amount of RAM.
•With Anatella, data Extraction from XML/HTML is very fast.
If the extension of the XML file is RAR, ZIP, GZ or LZO then Anatella will transparently decompress the XML file in RAM memory. Anatella chooses the (de)compression technique to use based on the filename extension. When Anatella uses compressed file formats, it does NOT decompress the files on the Hard Drive: Anatella decompresses the data “on-the-fly” in central core RAM memory, thus reducing:
•the load on the hard drive
•the hard-drive space consumption required to do the analysis.
Usually, for classical “real world” XML/HTML files, the compression/ratio is above 90-95%. Thus, it makes a lot of sense to compress all your XML/HTML files.
Let’s assume that you have the following XML file:
<?xml version="1.0" encoding="UTF-8"?>
<ConfidentialData>
<Revision>3</Revision>
<Diary>
<Contact>
<Name>Frank</Name>
<Address street="villers" town="ath" zip="7812"/>
<Notes><Age>38</Age></Notes>
<Skill>Coding</Skill>
<Skill>Datamining</Skill>
</Contact>
<Contact>
<Name>Sabrina</Name>
<Address street="jeanne" town="ixelles" zip="1050"/>
<Notes><Age>36</Age></Notes>
<Skill>Coding</Skill>
</Contact>
</Diary>
</ConfidentialData>
We want to extract the name and the age of each of our contact.
We’ll use the following settings:
We’ll get as output the following table:
Name |
Age |
Frank |
38 |
Sabrina |
36 |
It can sometime be difficult to manually write the different XPATHs required for extraction. This is why Anatella has an “Auto Fill-In” button: After you finished entering the “Iterate on all subtags located at” parameter, you can click the “Auto Fill-In” button. Anatella will analyzes the first 100 <Contact> tags (You can change this number using the “Number of Rows to Analyze for Auto Fill-In” parameter inside the “Advanced Parameters” tab) and extract all the different XPATH to all the different data contained inside these first 100 tags. For the above example, Anatella will find the following XPATHs:
Name |
Address/street |
Address/town |
Address/zip |
Notes/Age |
Skill[0] |
Skill[1] |
By default, the character encoding used by Anatella to decipher the content of the XML file is found automatically based on the BOM (Byte Order Mark) of the .XML file or based on the declaration on the first row of the XML file (e.g. “<?xml version="1.0" encoding="UTF-8"?>” is for UTF-8 character encoding). It can happen that the BOM is missing and the XML declaration is incorrect (The XML file is using another encoding than the one specified on the first row of the XML file). In such (common) situation, you can manually specify (using the “Encoding Name” parameter inside the “Advanced Parameters” tab) which character encoding Anatella must use to decipher the XML file.