For further information on either part of this blog please get in touch through our contact page or attend our conference on 27th October where we will have Diver experts talking about these techniques.
In Part 2 here my colleague YY Tsang from DMBI explains his ideas on extracting and incorporating non-tabular web data.
In Part 1 my colleague Paul Duggan at Dynamic Business Informatics has shared his ideas on how to interface with Sharepoint and incorporate some useful tabular web data into a Diveport embedded within Sharepoint.
PART 2 – INCORPORATING WEB DATA THAT ARE NOT IN TABULAR FORMAT
In the case described in part 1 of this blog, we were incorporating web data that was already in tabular format and accessible through a single web page – so it was fairly straightforward to integrate using the Diver Integrator.
In this new example we were working on a dashboard for the pharma sector. As the project involved Brand/Generic drug mapping we wanted to add more value to the dashboard reports by incorporating information about what ailments the drugs were actually used for. This would allow us to carry out some further analysis which could be of use to the pharmacists.
This information was available on a variety of websites but not in tabular format. We wanted to incorporate the Classifications and all of the sub classes of the active ingredients using a list of unique ATC codes to map them.
We had a drugs list file which contained these 7 character codes which are common to the industry. We could see the descriptions of what they were used for at each breakdown level of this code on various websites – but each classification was hidden amongst a lot of other information, and they were all on different pages.
Extraction Software
Using some freely available extraction software, such as the one we used http://www.hexcat.com/deepvacuum/, we were able to set up an iterative process that extracted the information we needed from every page of the website.
This enabled us to create a file for each page of the website as shown below.
The next but one image shows what the contents of just one of those files looks like after gathering it using the software – the page for ATC N02AB.
HTML Content
As you can see, the html code below has a lot of information but it’s not in the format that could be used as a lookup table.
Using Diver BI’s Visual Integrator
Using Diver BI’s Visual Integrator we created a script that would loop around and collect the information in each of these files, and strip out the important classifications to create tables for each level.
This is the same file when viewed through the Visual Integrator’s test feature.
Reformat Routines to Extract Classifications
In the reformat routines we looped through the html files and extracted the relevant classifications by breaking down the ATC code into its 3 component levels. The highest level was the description for the first 4 characters.
The next level down was added by including the 5th character of the ATC code.
The third level included all 7 characters.
Creation of Lookup Table
From these iterative loops we were able to derive a complete hierarchy and convert it to a lookup table for the drugs in our BI model. This extra information allows the pharma user to investigate, for example, which drugs are being prescribed together for particular diseases or ailments and whether other drugs are also being prescribed for the side effects.
These techniques are only shown as examples – you should be able to apply them to most widely available public data sets and incorporate them into your Diver models to provide important insights that do not reside in your own data.