Visits: 13
Scraping data from the web page through Web-scraping tools or through Programming languages like R and Python was always a hectic job. Fortunately, Power BI has a built-in feature to pull data from the web. The best part is that this connector can help you fetch raw data by converting data into Tabular form as well. In this blog we will be looking in detail at Web Scraping in Power BI.
The “From Web” connector is under the ‘Get Data’ drop menu. in the home tab. It can help us scrape a web page in two ways.
- Import data that is already in Tabular form in a web page.
- Convert a non tabular data into a tabular form and import into Power BI.
In this blog we will scrape data from a web page using both method. The web page we will be using is Wikipedia page List of highest-grossing films.
Scraping a Tabular data from web page
1. From the above web page we want the list Highest Grossing Films[12]. We only want the column ‘‘Title’, ‘Worldwide gross’ and ‘Year’.
2. Open Power BI and go to the “Home” tab, click o the “Get data” drop menu and select “Web”
3. Copy the web page’s URL and paste in the pop up window and press ‘OK’, as shown in the picture below.
4. As you press “OK”, the below Navigator window will pop-up.
5. At the top of the HTML tables, you will see the tables that already exists in the HTML. If the data is in a tabular form we can easily find it here under the titles given in the web page. In our first method we can easily find our table in the HTML Tables[88] folder titled “Highest-grossing films[12]”. click the checkbox to the left of the table and go to Transform. This will take you to the Power Query Editor page and there you can remove the irrelevant columns and provide the right data formatting. Apply Changes and Load to Power BI.
Scraping Non-tabular data from web page
For the sake of better understanding we will use the same URL and the same table used in the first method and ignore the tables suggested in the HTML Tables[88] and assume the data is not in a tabular form. This time we will browse the web and create our own table. So the first three steps will be the same.
1. From the above web page we want the list titled Highest Grossing Films[12]. We only want the column ‘Title’, ‘Worldwide gross’ and ‘Year’
2. Open Power BI and go to the “Home” tab, click o the “Get data” drop menu and select “Web”
3. Copy the web page’s URL and paste in the pop up window and press ‘OK’, as shown in the picture below.
4. As you press “OK”, the below Navigator window will pop-up.
This time instead of choosing between the HTML tables, we will click on the “Add Table Using Example”.
5. In the new window, there are two panes, the pane on top browses the web page, and the pane at bottom provides a tool similar to “New Column from Example”. At the top row we will give title to our first header and then right below the header in the first row we will provide the first entry under the header in the website, as you type the first entry a drop menu will appear and suggest what you can enter, select the right entry and press enter twice. after putting values in first few rows the AI will pick the pattern you are choosing for the row data and a suggestion will appear in all the below rows. If the suggestion looks accurate to you then leave the suggested data as it is and move to next column by click on the ‘+’ sign and repeat the process for second, third and forth column. Press “OK”.
6. Review the table and load it to Power BI.