
#MicrosoftFabric Road Map and Web Scraping with Fabric

Updated: Nov 14

Like many others, I have been playing with Fabric, and many of my clients are also excited about it and want to know more. As a solution architect in the consulting world, one of the most common questions I get asked is: "When will certain features be available? Where are they on the roadmap?" That's what sparked the idea of scraping the Microsoft Fabric Roadmap and creating this Power BI report. It's based on a Direct Lake connection, so it has been a bit temperamental.

How did I do it? If you aren't interested in the whole story, here is the Python code you can run to get the road map. If you're interested in my process, carry on reading 🙂

Data Integration

I'm not proficient in Python, but I have done web scraping with Power BI many times, so I know my way around web scraping a bit. I started with ChatGPT, which gave me a rough skeleton in Python.

url = "https://learn.microsoft.com/en-us/fabric/release-plan"page = requests.get(url)soup = BeautifulSoup(page.content, 'html.parser')results = soup.find(id='main')# print(results.prettify())

The above script gives me the entire page, but I'm obviously not interested in all of it. I went to the browser developer tools and figured out which snippets I was interested in.

Screenshot 1 of 4: Microsoft Fabric road map - Data Integration

It looked like I needed all the hrefs, which I could then filter down to just the ones I was interested in.

import pandas as pd

# get all hrefs
hrefs = results.find_all('a', href=True)
# keep all href name and link in a list
href_list = []
for href in hrefs:
    href_list.append([href.text, href['href']])
# add a column name to the list with full url
href_list = pd.DataFrame(href_list, columns=['name', 'url'])
# remove rows containing https
href_list = href_list[~href_list['url'].str.contains("https")]
# add a new column with full url
href_list['full_url'] = 'https://learn.microsoft.com/en-us/fabric/release-plan/' + href_list['url']
# Power BI has its own release plan page
href_list.loc[href_list['name'] == 'Power BI', 'full_url'] = 'https://learn.microsoft.com/en-us/power-platform/release-plan/2023wave2/power-bi/planned-features'

I was able to extract the hrefs from the web page using the above code snippet. It was a bit messy, but I managed to clean it up and get the list I wanted. Next, I had to iterate over the links and scrape the data from each page. The pages have different formats and layouts, so I needed to adjust my code accordingly. I asked ChatGPT for some help, and it suggested using pd.read_html to read the tables from the HTML pages.

table_list = []
# loop through href_list, check if it has tables and if it does, append to table_list
for index, row in href_list.iterrows():
    table = pd.read_html(row['full_url'], extract_links="all")
    table[0]['product'] = row['name']
    table[0]['FullURL'] = row['full_url']
    table_list.append(table[0])

Pandas_df = pd.concat(table_list, ignore_index=True)
Pandas_df.columns = ['Feature', 'Estimated_release_timeline', 'Product', 'FullURL', 'Enabled_for', 'Public_Preview', 'GA']
Pandas_df = Pandas_df.astype({'Feature': 'string', 'Estimated_release_timeline': 'string', 'Product': 'string', 'Enabled_for': 'string', 'Public_Preview': 'string', 'GA': 'string'})
spark_df = spark.createDataFrame(Pandas_df)
# display(spark_df)
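One caveat if you try this yourself: the loop above assumes every page has a table, and pd.read_html raises a ValueError when it can't find one. A slightly more defensive version of the loop could look something like this (a rough sketch, not the exact code I ran):

# sketch: skip pages where pandas finds no table
table_list = []
for index, row in href_list.iterrows():
    try:
        tables = pd.read_html(row['full_url'], extract_links="all")
    except ValueError:
        # pd.read_html raises ValueError when a page has no <table> elements
        continue
    tables[0]['product'] = row['name']
    tables[0]['FullURL'] = row['full_url']
    table_list.append(tables[0])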

I decided to make some improvements to the data. I added some extra columns to identify the products more easily, I converted the data types to string for consistency, and I gave meaningful names to the columns, avoiding any spaces that could cause problems later.

Next, I applied some more transformations with Spark. I learned that Spark DataFrames are much faster than pandas, so I switched to a Spark DataFrame.

Script for all transformations

from pyspark.sql.functions import split, col, regexp_replace, when, concat, lit

spark_df = spark_df.withColumn('Feature_Name', split(col('Feature'), '\',')[0])
spark_df = spark_df.withColumn('Feature_URL', split(col('Feature'), '\',')[1])
# display(spark_df)
spark_df = spark_df.withColumn('Feature_Name', regexp_replace('Feature_Name', '\(\'', ''))
spark_df = spark_df.withColumn('Feature_Name', regexp_replace('Feature_Name', '\'', ''))
spark_df = spark_df.withColumn('Feature_URL', regexp_replace('Feature_URL', '\'\)', ''))
spark_df = spark_df.withColumn('Feature_URL', regexp_replace('Feature_URL', '\'', ''))
spark_df = spark_df.withColumn('Feature_URL', regexp_replace('Feature_URL', ' ', ''))
# display(spark_df)
# spark_df = spark_df.withColumn('Feature_URL', concat(col('FullURL'), col('Feature_URL')))
spark_df = spark_df.withColumn('Feature_URL',
    when(col('Product') == 'Power BI',
         concat(lit('https://learn.microsoft.com/en-us/power-platform/release-plan/2023wave2/power-bi/'), col('Feature_URL')))
    .otherwise(concat(col('FullURL'), col('Feature_URL'))))
spark_df = spark_df.withColumn('Estimated_release_timeline', split(col('Estimated_release_timeline'), ',')[0])
spark_df = spark_df.withColumn('Estimated_release_timeline', regexp_replace('Estimated_release_timeline', '\(\'', ''))
spark_df = spark_df.withColumn('Estimated_release_timeline', regexp_replace('Estimated_release_timeline', '\'', ''))
spark_df = spark_df.drop('Feature')
# display(spark_df)
spark_df = spark_df.withColumn('Enabled_for', split(col('Enabled_for'), '\',')[0])
spark_df = spark_df.withColumn('Enabled_for', regexp_replace('Enabled_for', '\(\'', ''))
spark_df = spark_df.withColumn('Enabled_for', regexp_replace('Enabled_for', '\'', ''))
spark_df = spark_df.withColumn('Enabled_for', regexp_replace('Enabled_for', '\)', ''))
# split GA column into two columns
spark_df = spark_df.withColumn('GA', split(col('GA'), ',')[0])
spark_df = spark_df.withColumn('GA', regexp_replace('GA', '\(\'', ''))
spark_df = spark_df.withColumn('GA', regexp_replace('GA', '\'', ''))
# split Public_Preview column into two columns
spark_df = spark_df.withColumn('Public_Preview', split(col('Public_Preview'), ',')[0])
spark_df = spark_df.withColumn('Public_Preview', regexp_replace('Public_Preview', '\(\'', ''))
spark_df = spark_df.withColumn('Public_Preview', regexp_replace('Public_Preview', '\'', ''))
# spark_df.show()
# display(spark_df)
Table_Name = 'Fabric_Release_Plan'
# print(Table_Name)
spark_df.write.format("delta").option("mergeSchema", "true").mode("overwrite").saveAsTable(Table_Name)
print('Table ' + Table_Name + ' has been created successfully')

Finally, I inserted all the data into a Spark Delta table.

Table_Name = 'Fabric_Release_Plan'
# print(Table_Name)
spark_df.write.format("delta").option("mergeSchema", "true").mode("overwrite").saveAsTable(Table_Name)
print('Table ' + Table_Name + ' has been created successfully')
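If you want to sanity-check the table at this point, you can query it straight from the notebook. For example, counting the scraped features per product (just an illustrative query, not part of the original notebook):

# quick sanity check: how many features were scraped per product
spark.sql("SELECT Product, COUNT(*) AS Features FROM Fabric_Release_Plan GROUP BY Product ORDER BY Features DESC").show()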

Then I went and scraped the description of each item as well, to display in the tooltips of my report. This was similar to the rest of the code, but more fun and challenging.

Script to get Descriptions

url = "https://learn.microsoft.com/en-us/fabric/release-plan"page = requests.get(url)soup = BeautifulSoup(page.content, 'html.parser')results = soup.find(id='main')# get all hrefshrefs = results.find_all('a', href=True)# keep all href name and link in a listhref_list = []for href in hrefs:    href_list.append([href.text, href['href']])# add a column name to the list with full urlhref_list = pd.DataFrame(href_list, columns=['name', 'url'])# remove rows containing https  href_list = href_list[~href_list['url'].str.contains("https")]# add a new column with full url    href_list['full_url'] = 'https://learn.microsoft.com/en-us/fabric/release-plan/' + href_list['url']href_list.loc[href_list['name'] == 'Power BI', 'full_url'] = 'https://learn.microsoft.com/en-us/power-platform/release-plan/2023wave2/power-bi/planned-features'Desc_List=[]table_list = []# loop through href_list, check if it has tables and if it does, append to table_listfor index, row in href_list.iterrows():        # also get all descriptions    url = row['full_url']    page = requests.get(url)    soup = BeautifulSoup(page.content, 'html.parser')    results = soup.find(id='main')    h3 = results.find_all('h3')    h3_ids = []    for h in h3:        h3_ids.append(h.get('id'))        h3_ids.append(h.get_text())        h3_ids.append(h.find_next('p').get_text())        h3_ids.append(h.find_next('p').find_next('p').get_text())        # print(h3_ids)        # transpose list    h3_ids = [h3_ids[i:i+4] for i in range(0, len(h3_ids), 4)]    h3_ids = pd.DataFrame(h3_ids, columns=['id', 'name', 'timeline', 'full_description'])    h3_ids['url'] = url    # add product name    h3_ids['product'] = row['name']    # add full url with concat of url and id    h3_ids['full_url'] = url + '#' + h3_ids['id']    Desc_List.append(h3_ids)# concat all desc_list dataframesDesc_List = pd.concat(Desc_List)# display(Desc_List)spark_df = spark.createDataFrame(Desc_List)# spark_df.show()Table_Name = 'Fabric_Release_Plan_with_desc'# print(Table_Name)spark_df.write.format("delta").option("mergeSchema", "true").mode("overwrite").saveAsTable(Table_Name)print('Table ' + Table_Name + ' has been created successfully')

Next, I ran a simple script to capture the time the notebook refreshed, so I could display the last refresh date and time on the report.

# get notebook run time and save in a table
import time
from datetime import datetime
from pyspark.sql.functions import lit
import pandas as pd

now = datetime.now()
current_time = now.strftime("%Y-%m-%d %H:%M:%S")
df = pd.DataFrame([current_time], columns=['Time'])
spark_df = spark.createDataFrame(df)
spark_df.write.format("delta").option("mergeSchema", "true").mode("overwrite").saveAsTable("Notebook_runtime")
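To double-check that the timestamp landed, you can read the table back from the Lakehouse (a quick illustrative check, not part of the original notebook):

# read the refresh timestamp back from the Notebook_runtime table
spark.read.table("Notebook_runtime").show()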

Modelling

Modelling was pretty easy. I created relationships using the SQL endpoint's relationships view.

Screenshot 2 of 4: Microsoft Fabric road map - Modelling

Then I chose which tables to show in my default semantic model.

Screenshot 3 of 4: Microsoft Fabric road map - Modelling

This lineage view explains the flow.

Screenshot 4 of 4: Microsoft Fabric road map - Modelling

Reporting

I created a simple Power BI report on the default semantic model using the Power BI service. I published it to the web, and you can view the report here – https://app.powerbi.com/view?r=eyJrIjoiOTYxMmM3NjYtMGIxZS00MDFmLTljOWMtYTYyMzBmYjA3ZGJjIiwidCI6IjM4YmU3MjU5LTMyMGQtNDZkYy04YzNiLWNkZDA5OTM1NGExMyIsImMiOjh9

Until next time, Prathy 🙂


