Skip to content

Transform

Yogita edited this page Jun 28, 2021 · 6 revisions

Occupancy Transformations

Historic dataset (2012-2017)

  • Use StructField/StructType to convert columns to specific data types
  • Convert 'occupancydatetime' to specific dateformat using pyspark date format
  • Generate date dimensions day_of_week, month
  • Save the final occpancyfact and date_dim dataframe into parquet files
  • Record the processing status in a Postgres table for one- time processing of historic dataset for each year. Logic is in place to check the job status of historic dataprocessing and accordingly process the dataset.

Historic and Delta dataset (2018-2021)

  • Use StructField/StructType to convert columns to specific data types
  • Convert 'occupancydatetime' to specific dateformat using pyspark date format
  • Remove parenthesis from the Location column and split the column into Latitude and Longitude column
  • Remove non-word-charactor like comma from the station_id column and convert the column into IntegerType
  • For the current year, apply a check to load only delta records based on last processed paid occupancy date.
  • Generate date dimensions day_of_week, month
  • Record the processing status in a Postgres table for one- time processing of historic dataset for each year from 2018-2020. Logic is in place to check the job status of historic dataprocessing and accordingly process the dataset. For the current year 2021, store the status and the last date processed.

Blockface Transformations

Blockface dataset

  • Use StructField/StructType to convert columns to specific data types
  • Use UDF to convert the wkd_start1,2,3 & wkd_end1,2,3 & sat_start1,2,3 & sat_end1,2,3 from minutes to time format like '08:00:00'

The parquet files in Data Lake are loaded into SQL Data Warehouse using Data Factory.

  1. dbo.paid_occupancy: Fact table for holding the paid parking records from year 2012 to 2020

  2. dbo.paid_occupancy_test1: Table for holding the paid parking records for the year 2021

  3. dbo.date_dim: Dimension table for holding the date dimension created from the fact table occupancydate timestamp.

  4. dbo.blockface: Table for holding the blockface records

Paid Occupancy Fact Table

CREATE TABLE dbo.paid_occupancy
 	(
 	 [occupancydatetime] datetime not null,
         [paidoccupancy] int,
         [blockfacename] varchar(1000),
         [sideofstreet] varchar(2),
         [parkingtimelimitcategory] int,
         [available_spots] int,
         [paidparkingarea] varchar(100),
         [paidparkingsubarea] varchar(100),
         [paidparkingrate] float,
         [parkingcategory] varchar(50),
         [latitude] decimal(10,6),
         [longitude] decimal(10,6),
         [station_id] int,
        CONSTRAINT PK_Occupancy PRIMARY KEY NONCLUSTERED (occupancydatetime,latitude,longitude) NOT ENFORCED
    	)
WITH
(   CLUSTERED COLUMNSTORE INDEX
,  DISTRIBUTION = HASH([station_id])
)
 GO

Date Dimension Table


CREATE TABLE dbo.date_dim
 	(
 	 [occupancydatetime] datetime not null,
 	 [day_of_week] varchar(20),
	 [monthname] varchar(20),
	 CONSTRAINT PK_OccupancyDateTime PRIMARY KEY NONCLUSTERED (occupancydatetime) NOT ENFORCED
 		)
 WITH
 	(
 	DISTRIBUTION = REPLICATE,
 	 CLUSTERED COLUMNSTORE INDEX
 	 -- HEAP
 	)

Blockface Table


CREATE TABLE dbo.Blockface
 	(
 	 [station_id] int,
 	 [station_address] varchar(1000),
 	 [side] varchar(2),
 	 [block_nbr] int,
 	 [parking_category] varchar(200),
 	 [wkd_rate1] float,
 	 [wkd_start1] varchar(100),
 	 [wkd_end1] varchar(100),
 	 [wkd_rate2] float,
 	 [wkd_start2] varchar(100),
 	 [wkd_end2] varchar(100),
     [wkd_rate3] float,
 	 [wkd_start3] varchar(100),
 	 [wkd_end3] varchar(100),
     [sat_rate1] float,
 	 [sat_start1] varchar(100),
 	 [sat_end1] varchar(100),
 	 [sat_rate2] float,
 	 [sat_start2] varchar(100),
 	 [sat_end2] varchar(100),
     [sat_rate3] float,
 	 [sat_start3] varchar(100),
 	 [sat_end3] varchar(100),
     [parking_time_limit] int,
     [subarea] varchar(100),
     CONSTRAINT PK_Station_ID PRIMARY KEY NONCLUSTERED (station_id) NOT ENFORCED
 	)
 WITH
 	(
 	DISTRIBUTION = REPLICATE,
 	CLUSTERED COLUMNSTORE INDEX
 	 -- HEAP
 	)
 GO

Clone this wiki locally