-
Notifications
You must be signed in to change notification settings - Fork 0
Transform
Yogita edited this page Jun 28, 2021
·
6 revisions
- Use StructField/StructType to convert columns to specific data types
- Convert 'occupancydatetime' to specific dateformat using pyspark date format
- Generate date dimensions day_of_week, month
- Save the final occpancyfact and date_dim dataframe into parquet files
- Record the processing status in a Postgres table for one- time processing of historic dataset for each year. Logic is in place to check the job status of historic dataprocessing and accordingly process the dataset.
- Use StructField/StructType to convert columns to specific data types
- Convert 'occupancydatetime' to specific dateformat using pyspark date format
- Remove parenthesis from the Location column and split the column into Latitude and Longitude column
- Remove non-word-charactor like comma from the station_id column and convert the column into IntegerType
- For the current year, apply a check to load only delta records based on last processed paid occupancy date.
- Generate date dimensions day_of_week, month
- Record the processing status in a Postgres table for one- time processing of historic dataset for each year from 2018-2020. Logic is in place to check the job status of historic dataprocessing and accordingly process the dataset. For the current year 2021, store the status and the last date processed.
- Use StructField/StructType to convert columns to specific data types
- Use UDF to convert the wkd_start1,2,3 & wkd_end1,2,3 & sat_start1,2,3 & sat_end1,2,3 from minutes to time format like '08:00:00'
The parquet files in Data Lake are loaded into SQL Data Warehouse using Data Factory.
-
dbo.paid_occupancy: Fact table for holding the paid parking records from year 2012 to 2020
-
dbo.paid_occupancy_test1: Table for holding the paid parking records for the year 2021
-
dbo.date_dim: Dimension table for holding the date dimension created from the fact table occupancydate timestamp.
-
dbo.blockface: Table for holding the blockface records
CREATE TABLE dbo.paid_occupancy
(
[occupancydatetime] datetime not null,
[paidoccupancy] int,
[blockfacename] varchar(1000),
[sideofstreet] varchar(2),
[parkingtimelimitcategory] int,
[available_spots] int,
[paidparkingarea] varchar(100),
[paidparkingsubarea] varchar(100),
[paidparkingrate] float,
[parkingcategory] varchar(50),
[latitude] decimal(10,6),
[longitude] decimal(10,6),
[station_id] int,
CONSTRAINT PK_Occupancy PRIMARY KEY NONCLUSTERED (occupancydatetime,latitude,longitude) NOT ENFORCED
)
WITH
( CLUSTERED COLUMNSTORE INDEX
, DISTRIBUTION = HASH([station_id])
)
GO
CREATE TABLE dbo.date_dim
(
[occupancydatetime] datetime not null,
[day_of_week] varchar(20),
[monthname] varchar(20),
CONSTRAINT PK_OccupancyDateTime PRIMARY KEY NONCLUSTERED (occupancydatetime) NOT ENFORCED
)
WITH
(
DISTRIBUTION = REPLICATE,
CLUSTERED COLUMNSTORE INDEX
-- HEAP
)
CREATE TABLE dbo.Blockface
(
[station_id] int,
[station_address] varchar(1000),
[side] varchar(2),
[block_nbr] int,
[parking_category] varchar(200),
[wkd_rate1] float,
[wkd_start1] varchar(100),
[wkd_end1] varchar(100),
[wkd_rate2] float,
[wkd_start2] varchar(100),
[wkd_end2] varchar(100),
[wkd_rate3] float,
[wkd_start3] varchar(100),
[wkd_end3] varchar(100),
[sat_rate1] float,
[sat_start1] varchar(100),
[sat_end1] varchar(100),
[sat_rate2] float,
[sat_start2] varchar(100),
[sat_end2] varchar(100),
[sat_rate3] float,
[sat_start3] varchar(100),
[sat_end3] varchar(100),
[parking_time_limit] int,
[subarea] varchar(100),
CONSTRAINT PK_Station_ID PRIMARY KEY NONCLUSTERED (station_id) NOT ENFORCED
)
WITH
(
DISTRIBUTION = REPLICATE,
CLUSTERED COLUMNSTORE INDEX
-- HEAP
)
GO