
[sql_server] Add mapping from SQL Server data types to Materialize data types #32040

Merged

merged 3 commits into MaterializeInc:main on Mar 28, 2025

Conversation

ParkMyCar (Member)

This PR adds types to describe tables and columns from MS SQL Server and how data will map into Materialize. On each new type I tried to document how it fits into the flow of a theoretical SQL Server source, but reviewers please feel free to push back if something isn't obvious. Also, the module comment in sql-server-util/src/desc.rs should (hopefully) add some helpful context.

The specific mappings from SQL Server data types into Materialize are described here; this PR follows that spec.
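To give a flavor of what such a mapping looks like, here is a simplified sketch in the spirit of this PR's `parse_data_type`. The function name and the string return values are illustrative only, not Materialize's actual API:

```rust
// Hypothetical, simplified sketch of a SQL Server -> Materialize type
// mapping; the names here are illustrative, not the PR's actual code.
fn map_sql_server_type(data_type: &str) -> Option<&'static str> {
    match data_type {
        "bit" => Some("boolean"),
        "tinyint" | "smallint" => Some("smallint"),
        "int" => Some("integer"),
        "bigint" => Some("bigint"),
        "real" => Some("real"),
        "char" | "nchar" | "varchar" | "nvarchar" | "sysname" => Some("text"),
        // Types without a supported mapping are rejected up front.
        _ => None,
    }
}
```

The real implementation also carries a decode strategy alongside each scalar type, as the snippets later in this thread show.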

For testing, this PR relies on some test helpers I added in MaterializeInc/tiberius#1.

Throughout the code you'll see a few comments like TODO(sql_serverX). The idea is that X identifies the milestone by which each item needs to be solved, roughly:

TODO(sql_server1): Before private preview
TODO(sql_server2): Before GA
TODO(sql_server3): Backlog

Motivation

Progress towards https://github.com/MaterializeInc/database-issues/issues/8762

Checklist

  • This PR has adequate test coverage / QA involvement has been duly considered. (trigger-ci for additional test/nightly runs)
  • This PR has an associated up-to-date design doc, is a design doc (template), or is sufficiently small to not require a design.
  • If this PR evolves an existing $T ⇔ Proto$T mapping (possibly in a backwards-incompatible way), then it is tagged with a T-proto label.
  • If this PR will require changes to cloud orchestration or tests, there is a companion cloud PR to account for those changes that is tagged with the release-blocker label (example).
  • If this PR includes major user-facing behavior changes, I have pinged the relevant PM to schedule a changelog post.

@ParkMyCar (Member, Author)

@martykulma and @DAlperin if either of y'all don't feel comfortable reviewing this, feel free to let me know! Nothing here is really Materialize-specific; it's mostly just massaging data from one format to another.

@DAlperin (Contributor) left a comment

This looks sane to me. One comment about something I don't understand. I assume this will be followed by subsequent PRs?

"real" => (ScalarType::Float32, SqlServerColumnDecodeType::F32),
"double" => (ScalarType::Float64, SqlServerColumnDecodeType::F64),
"char" | "nchar" | "varchar" | "nvarchar" | "text" | "sysname" => {
// When the `max_length` is -1 SQL Server will not present us with the "before" value
Contributor:

What is the "before" value in this context?

Member (Author):

Updated the comment, hopefully it makes more sense now!

Basically, for sources, when an UPDATE or DELETE occurs in the upstream DB we need to get the old value of the row as it existed before it was updated or deleted, so we can emit (old_val, -1) into the dataflow. In SQL Server these data types don't always include the old value, so for the moment we can't support them.
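The (old_val, -1) convention described above can be sketched as differential updates. The types and function names below are hypothetical, for illustration; Materialize's real dataflow types differ:

```rust
// Sketch of update/delete handling as (row, diff) pairs. The `Diff`
// struct and these helpers are hypothetical, not Materialize's API.
#[derive(Debug, PartialEq)]
struct Diff<R> {
    row: R,
    diff: i64,
}

/// An upstream UPDATE becomes a retraction of the old row plus an
/// insertion of the new one.
fn update_to_diffs<R>(old: R, new: R) -> Vec<Diff<R>> {
    vec![Diff { row: old, diff: -1 }, Diff { row: new, diff: 1 }]
}

/// A DELETE becomes a retraction of the old row alone; this is why the
/// decoder needs the "before" value at all.
fn delete_to_diffs<R>(old: R) -> Vec<Diff<R>> {
    vec![Diff { row: old, diff: -1 }]
}
```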

@ParkMyCar (Member, Author)

TFTR!

Comment on lines 295 to 299
/// Maximum length (in bytes) of the column.
///
/// For `varchar(max)`, `nvarchar(max)`, `varbinary(max)`, or `xml` this will be `-1`. For
/// `text`, `ntext`, and `image` columns this will be 16.
pub max_length: i16,
Contributor:

Does raw.max_length == -1 imply unlimited length (e.g. ln 235 above) for all data types, or just for some?

Does json also return -1?

Member (Author):

It turns out json is a brand-new experimental type, only supported in Azure SQL as of 2024, so docs on it are sparse.

Thank you for the comment though! I went back and double-checked these values and realized parse_data_type(...) wasn't totally up to date. I also linked the SQL Server docs that call out what these values will be.

column: &'a ColumnType,
) -> Result<Datum<'a>, SqlServerError> {
let maybe_datum = match (self, &column.scalar_type) {
(SqlServerColumnDecodeType::Bool, ScalarType::Bool) => data
Contributor:

super nit: 😂 the tuples being backwards from parse_data_types may cause some discomfort for certain individuals

Member (Author):

Haha totally fair! I flipped them around!

Comment on lines 416 to 423
(SqlServerColumnDecodeType::DateTime, ScalarType::Interval) => data
.try_get(name)
.context("interval")?
.map(|val: chrono::DateTime<chrono::Utc>| {
let ts = val.try_into().context("parse timestamptz")?;
Ok::<_, SqlServerError>(Datum::TimestampTz(ts))
})
.transpose()?,
Contributor:

My understanding is that this decode is called on each row based on SqlServerTableDesc, which gets its column types from parse_data_type.

I wasn't seeing where ScalarType::Interval is set - is this supposed to be ScalarType::TimestampTz to match datetimeoffset?

Member (Author):

Great catch! This was stale from when I initially thought datetimeoffset was an interval-like type; it's actually a timestamptz type, and you're seeing that I didn't complete the refactor 😂

* flip order of arguments so it's consistent in the entire file
* update comment to include a link to the docs
* fix stale reference to Interval
@ParkMyCar ParkMyCar enabled auto-merge (squash) March 28, 2025 19:59
@ParkMyCar ParkMyCar merged commit d38c022 into MaterializeInc:main Mar 28, 2025
82 checks passed