Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PT-2377 - fixed pt-table-sync for JSON utf8 strings #861

Open
wants to merge 1 commit into
base: 3.x
Choose a base branch
from

Conversation

hpoettker
Copy link
Contributor

The PR is intended to resolve this issue: https://perconadev.atlassian.net/browse/PT-2377

The tests added with pt-2377.t both fail with the current code base as they do not generate the expected DML statements with non-ASCII characters correctly. With the proposed change, they run successfully. They test that both REPLACE and UPDATE statements are generated correctly by pt-table-sync when applied to tables with JSON columns that contain non-ASCII characters.

The code is my own creation and it can be distributed under the GPL2 licence.

  • The contributed code is licensed under GPL v2.0
  • Contributor Licence Agreement (CLA) is signed
  • util/update-modules has been ran
  • Documentation updated
  • Test suite update

@hpoettker hpoettker force-pushed the PT-2377_table_sync_with_utf8_json branch from 67e432c to e5f0690 Compare September 15, 2024 20:04
@hpoettker
Copy link
Contributor Author

I updated the PR such that the change is also included in lib/ChangeHandler.pm. pt-table-sync is the only affected tool by this change.

@hpoettker hpoettker force-pushed the PT-2377_table_sync_with_utf8_json branch 2 times, most recently from 4acec07 to fc9987d Compare September 16, 2024 00:13
@hpoettker hpoettker force-pushed the PT-2377_table_sync_with_utf8_json branch from fc9987d to 263404d Compare October 11, 2024 12:35
@hpoettker
Copy link
Contributor Author

I adjusted the added code to also handle NULL values in the JSON columns correctly. Previously, they led to errors due to the use of an uninitialized values.

The unit test for ChangeHandler.t now also tests the correct behavior for NULL values.

@hpoettker hpoettker force-pushed the PT-2377_table_sync_with_utf8_json branch from 263404d to ec7705d Compare November 14, 2024 23:29
@hpoettker
Copy link
Contributor Author

Thanks so much for the massive effort to adjust the code base to MySQL 8.4!

I've rebased the changes on the latest commit in 3.x.

Copy link
Collaborator

@svetasmirnova svetasmirnova left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Test t/pt-table-sync/pt-2377.t does not pass with 5.7. Please add skip call into this test that will skip test if run on 5.7 or earlier version.

There is also a note at https://perldoc.perl.org/utf8:

$flag = utf8::is_utf8($string)

(Since Perl 5.8.1) Test whether $string is marked internally as encoded in UTF-8. Functionally the same as Encode::is_utf8($string).

Typically only necessary for debugging and testing, if you need to dump the internals of an SV, Devel::Peek's Dump() provides more detail in a compact form.

If you still think you need this outside of debugging, testing or dealing with filenames, you should probably read perlunitut and "What is "the UTF8 flag"?" in perlunifaq.

Don't use this flag as a marker to distinguish character and binary data: that should be decided for each variable when you write your code.

To force unicode semantics in code portable to perl 5.8 and 5.10, call utf8::upgrade($string) unconditionally.

Is there any other way to perform the same check?

The MySQL driver DBD::mysql does not decode JSON values as utf8
although MySQL uses utf8mb4 for all JSON strings.

This change decodes JSON values as utf8 (when not already done)
such that SQL statements are generated correctly.
@hpoettker hpoettker force-pushed the PT-2377_table_sync_with_utf8_json branch from ec7705d to 16b06dc Compare December 27, 2024 12:48
@hpoettker
Copy link
Contributor Author

I've removed the check, added a skip for versions below 8.0 in pt-2377.t, and rebased on the latest commit of 3.x.

If I understand correctly, the check is_utf8 shouldn't be necessary. I only added it out of caution to guard against unforeseen regressions. I developed the change against 8.0, and I'm now somewhat puzzled why the change doesn't seem to work equally against 5.7.

There certainly is a bug in pt-table-sync as the tests and the manual reproducer in the Jira issue show. While the fix works as is for my personal use case, I'm really not a Perl programmer and at the moment a little uncertain whether the proposed change is the correct solution for all use cases. Please feel free to close the PR if there are also doubts about it on your side.

@svetasmirnova svetasmirnova self-requested a review December 27, 2024 15:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants