Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Add JsonParser component to declarative framework #166

Merged
merged 27 commits into from
Jan 16, 2025

Conversation

pnilan
Copy link
Contributor

@pnilan pnilan commented Dec 11, 2024

What

Recommended Reviewing Order:

  1. declarative_component_schema.yaml
  2. composite_raw_decoder.py
  3. model_to_component_factory.py
  4. test_composite_decoder.py

Summary by CodeRabbit

Release Notes

  • New Features

    • Added JSON parsing capabilities to the declarative components framework.
    • Introduced JsonParser for handling JSON data with flexible encoding support.
  • Improvements

    • Enhanced data handling with new parsing methods, including support for various data types and error handling.
    • Improved readability and consistency of documentation.
  • Testing

    • Added comprehensive unit tests for new JSON parsing functionality.
    • Expanded test coverage for error handling in JSON parsing.

@github-actions github-actions bot added the enhancement New feature or request label Dec 11, 2024

This comment was marked as off-topic.

@pnilan

This comment was marked as off-topic.

This comment was marked as outdated.

@pnilan

This comment was marked as outdated.

This comment was marked as off-topic.

coderabbitai[bot]

This comment was marked as outdated.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (1)
airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (1)

1754-1757: Consider adding parameter validation and documentation, wdyt?

The implementation looks good, but could benefit from:

  1. Docstring explaining the purpose and parameters
  2. Validation for supported encoding values

Here's a suggested improvement:

    @staticmethod
    def create_json_parser(model: JsonParserModel, config: Config, **kwargs: Any) -> JsonParser:
+       """Creates a JsonParser instance that parses JSON data with the specified encoding.
+       
+       Args:
+           model: The JsonParser model containing configuration
+           config: The connector configuration
+           **kwargs: Additional keyword arguments
+           
+       Returns:
+           JsonParser: A configured JSON parser instance
+       """
+       if model.encoding and not model.encoding.lower() in ['utf-8', 'utf-16', 'ascii']:
+           raise ValueError(f"Unsupported encoding: {model.encoding}")
        encoding = model.encoding or "utf-8"
        return JsonParser(encoding=encoding)
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 51118f1 and 34a710d.

📒 Files selected for processing (2)
  • airbyte_cdk/sources/declarative/decoders/composite_raw_decoder.py (1 hunks)
  • airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (4 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • airbyte_cdk/sources/declarative/decoders/composite_raw_decoder.py
⏰ Context from checks skipped due to timeout of 90000ms (8)
  • GitHub Check: Check: 'source-pokeapi' (skip=false)
  • GitHub Check: Check: 'source-the-guardian-api' (skip=false)
  • GitHub Check: Check: 'source-shopify' (skip=false)
  • GitHub Check: Check: 'source-hardcoded-records' (skip=false)
  • GitHub Check: Pytest (All, Python 3.11, Ubuntu)
  • GitHub Check: Pytest (Fast)
  • GitHub Check: Pytest (All, Python 3.10, Ubuntu)
  • GitHub Check: Analyze (python)
🔇 Additional comments (3)
airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (3)

75-75: LGTM!

The import of JsonParser from the composite_raw_decoder module is correctly placed with other decoder imports.


251-253: LGTM!

The import of JsonParserModel follows the established pattern of importing models from the declarative_component_schema module.


529-529: LGTM!

The JsonParserModel is correctly mapped to its factory method in the PYDANTIC_MODEL_TO_CONSTRUCTOR dictionary.

@pnilan pnilan requested review from maxi297 and artem1205 January 10, 2025 22:56
@pnilan pnilan changed the title feat: Adds Parser interface and JsonParser component to declarative framework feat: Add JsonParser component to declarative framework Jan 10, 2025
Copy link
Contributor

@maxi297 maxi297 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My understanding is that from there, we will start parsing the json twice if there is pagination defined: once during dpath extraction and once during pagination.

I don't think we need to fix now that but I'd like to:

  • Make sure we use an efficient JSON parser (orjson)
  • Poke our master performance on this ticket to make sure he knows @artem1205

I think the longer term solution would be to have a response object so that we can keep the result of the parsing in-memory but it will not be easy to do since our interfaces depends on the requests.Response.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (1)
unit_tests/sources/declarative/decoders/test_composite_decoder.py (1)

136-147: Consider adding more edge cases, wdyt?

The test implementation looks solid! A few suggestions to make it even more robust:

  1. Add edge cases like empty objects {} and nested structures
  2. Include error cases (invalid JSON, encoding errors)
  3. Validate that the encoding parameter is being respected

Here's a possible enhancement to the test parameters:

     "data",
     [
         ({"data-type": "string"}),
         ([{"id": 1}, {"id": 2}]),
         ({"id": 170_141_183_460_469_231_731_687_303_715_884_105_727}),
+        ({}),  # empty object
+        ({"nested": {"foo": {"bar": "baz"}}}),  # nested structure
     ],
     ids=[
         "test_with_buffered_io_base_data_containing_string",
         "test_with_buffered_io_base_data_containing_list",
         "test_with_buffered_io_base_data_containing_int128",
+        "test_with_empty_object",
+        "test_with_nested_structure",
     ],

And maybe add a separate test for error cases:

def test_json_parser_with_invalid_data():
    parser = JsonParser(encoding="utf-8")
    with pytest.raises(json.JSONDecodeError):
        next(parser.parse(BufferedReader(BytesIO(b"invalid json"))))
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 34a710d and 060178a.

📒 Files selected for processing (2)
  • airbyte_cdk/sources/declarative/decoders/composite_raw_decoder.py (2 hunks)
  • unit_tests/sources/declarative/decoders/test_composite_decoder.py (3 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • airbyte_cdk/sources/declarative/decoders/composite_raw_decoder.py
⏰ Context from checks skipped due to timeout of 90000ms (1)
  • GitHub Check: Analyze (python)
🔇 Additional comments (2)
unit_tests/sources/declarative/decoders/test_composite_decoder.py (2)

7-7: LGTM! Clean import additions.

The new imports are well-organized and necessary for the JsonParser test implementation.

Also applies to: 17-17


123-135: Great test parameterization! 🎯

The test cases cover a good variety of data types and the test IDs are very descriptive.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (1)
airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (1)

1756-1760: Consider enhancing error handling for edge cases.

The implementation is clean and follows the pattern of other parser creation methods. However, we could make it more robust by validating the encoding value. What do you think about adding validation to ensure the encoding is supported by Python's codec registry? Here's a suggestion:

 @staticmethod
 def create_json_parser(model: JsonParserModel, config: Config, **kwargs: Any) -> JsonParser:
     encoding = model.encoding or "utf-8"
+    try:
+        "test".encode(encoding)
+    except LookupError as e:
+        raise ValueError(f"Unsupported encoding: {encoding}") from e
     return JsonParser(encoding=encoding)
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 060178a and bf8dd26.

📒 Files selected for processing (3)
  • airbyte_cdk/sources/declarative/declarative_component_schema.yaml (2 hunks)
  • airbyte_cdk/sources/declarative/models/declarative_component_schema.py (12 hunks)
  • airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (4 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • airbyte_cdk/sources/declarative/declarative_component_schema.yaml
  • airbyte_cdk/sources/declarative/models/declarative_component_schema.py
⏰ Context from checks skipped due to timeout of 90000ms (7)
  • GitHub Check: Check: 'source-pokeapi' (skip=false)
  • GitHub Check: Check: 'source-the-guardian-api' (skip=false)
  • GitHub Check: Check: 'source-shopify' (skip=false)
  • GitHub Check: Check: 'source-hardcoded-records' (skip=false)
  • GitHub Check: Pytest (All, Python 3.11, Ubuntu)
  • GitHub Check: Pytest (All, Python 3.10, Ubuntu)
  • GitHub Check: Pytest (Fast)
🔇 Additional comments (3)
airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (3)

75-75: LGTM!

The import is correctly placed in the decoders section.


251-253: LGTM!

The model import follows the established pattern.


529-529: LGTM!

The constructor mapping is added in alphabetical order.

Copy link
Contributor

@artem1205 artem1205 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM overall, approving to unblock. please take a look at my latest comment

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9ce2c28 and 7e7b2c4.

📒 Files selected for processing (3)
  • airbyte_cdk/sources/declarative/declarative_component_schema.yaml (3 hunks)
  • airbyte_cdk/sources/declarative/decoders/composite_raw_decoder.py (2 hunks)
  • airbyte_cdk/sources/declarative/models/declarative_component_schema.py (4 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • airbyte_cdk/sources/declarative/declarative_component_schema.yaml
⏰ Context from checks skipped due to timeout of 90000ms (8)
  • GitHub Check: Check: 'source-pokeapi' (skip=false)
  • GitHub Check: Check: 'source-the-guardian-api' (skip=false)
  • GitHub Check: Check: 'source-shopify' (skip=false)
  • GitHub Check: Check: 'source-hardcoded-records' (skip=false)
  • GitHub Check: Pytest (All, Python 3.11, Ubuntu)
  • GitHub Check: Pytest (All, Python 3.10, Ubuntu)
  • GitHub Check: Pytest (Fast)
  • GitHub Check: Analyze (python)
🔇 Additional comments (5)
airbyte_cdk/sources/declarative/decoders/composite_raw_decoder.py (2)

10-10: LGTM! The new imports look good.

The added imports support the new JsonParser implementation with error handling capabilities.

Also applies to: 13-15


Line range hint 1582-1582: LGTM! The GzipParser integration is correct.

The JsonParser is properly added as a valid inner_parser type for GzipParser.

airbyte_cdk/sources/declarative/models/declarative_component_schema.py (3)

740-748: LGTM! The FlattenFields implementation looks clean.

The class follows the established pattern with proper type hints and default values.


1176-1182: LGTM! The JsonParser model looks good.

The implementation is consistent with other parser models in the file.


1617-1617: LGTM! The component integrations are properly implemented.

JsonParser is correctly added as a valid parser type in the Union types.

coderabbitai[bot]

This comment was marked as outdated.

coderabbitai[bot]

This comment was marked as outdated.

coderabbitai[bot]

This comment was marked as outdated.

coderabbitai[bot]

This comment was marked as outdated.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (4)
airbyte_cdk/sources/declarative/declarative_component_schema.yaml (1)

2907-2920: Consider enhancing the JsonParser component definition.

The implementation looks good, but we could make it more robust. What do you think about:

  1. Adding validation for the encoding field to restrict it to common encodings (utf-8, ascii, etc.)?
  2. Including an optional allow_invalid_json boolean property to handle malformed JSON gracefully?
  3. Adding an example showing how to handle JSON with different encodings?

Here's a suggested enhancement:

  JsonParser:
    title: JsonParser
    description: Parser used for parsing str, bytes, or bytearray data and returning data in a dictionary format.
    type: object
    additionalProperties: true
    required:
      - type
    properties:
      type:
        type: string
        enum: [JsonParser]
      encoding:
        type: string
        default: utf-8
+       enum: [utf-8, ascii, utf-16, utf-32]
+       description: The character encoding to use when decoding the input data.
+     allow_invalid_json:
+       type: boolean
+       default: false
+       description: When set to true, attempts to handle malformed JSON by skipping invalid records instead of failing.
+     examples:
+       - type: JsonParser
+         encoding: utf-16
+       - type: JsonParser
+         encoding: utf-8
+         allow_invalid_json: true

What do you think about these additions? They would make the component more flexible and easier to use correctly.

airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (3)

1756-1759: Consider adding a docstring for better documentation?

The implementation looks good, but would you consider adding a docstring to document the parameters and return type? This would help maintain consistency with other factory methods, wdyt?

Here's a suggested addition:

    @staticmethod
    def create_json_parser(model: JsonParserModel, config: Config, **kwargs: Any) -> JsonParser:
+        """Creates a JsonParser instance.
+
+        Args:
+            model: The JsonParser model containing the configuration
+            config: The connector configuration
+            **kwargs: Additional keyword arguments
+
+        Returns:
+            JsonParser: A configured JsonParser instance
+        """
        encoding = model.encoding if model.encoding else "utf-8"
        return JsonParser(encoding=encoding)

2569-2589: Consider making the error message more concise?

The validation logic looks solid, but the error message is quite lengthy. Would you consider a more concise version that maintains clarity? For example:

-    _UNSUPPORTED_DECODER_ERROR = (
-        "Specified decoder of {decoder_type} is not supported for pagination."
-        "Please set as `JsonDecoder`, `XmlDecoder`, or a `CompositeRawDecoder` with an inner_parser of `JsonParser` or `GzipParser` instead."
-        "If using `GzipParser`, please ensure that the lowest level inner_parser is a `JsonParser`."
-    )
+    _UNSUPPORTED_DECODER_ERROR = (
+        "Decoder {decoder_type} is not supported for pagination. Use `JsonDecoder`, `XmlDecoder`, "
+        "or `CompositeRawDecoder` with `JsonParser` or `GzipParser(JsonParser)` as inner_parser."
+    )

1041-1044: Consider extracting common decoder handling logic?

I notice similar code patterns for handling decoders in both create_cursor_pagination and create_offset_increment. Would you consider extracting this into a helper method to reduce duplication? Something like:

def _prepare_pagination_decoder(self, decoder: Decoder) -> Decoder:
    """Prepares a decoder for pagination by wrapping it in PaginationDecoderDecorator if needed."""
    if isinstance(decoder, PaginationDecoderDecorator):
        return decoder
    return PaginationDecoderDecorator(decoder=decoder)

This could simplify both methods:

decoder = self._prepare_pagination_decoder(decoder)
inner_decoder = decoder.decoder if isinstance(decoder, PaginationDecoderDecorator) else decoder

Also applies to: 1954-1961

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 3f550f2 and 27bf5a7.

📒 Files selected for processing (3)
  • airbyte_cdk/sources/declarative/declarative_component_schema.yaml (3 hunks)
  • airbyte_cdk/sources/declarative/models/declarative_component_schema.py (3 hunks)
  • airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (8 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • airbyte_cdk/sources/declarative/models/declarative_component_schema.py
⏰ Context from checks skipped due to timeout of 90000ms (8)
  • GitHub Check: Check: 'source-pokeapi' (skip=false)
  • GitHub Check: Check: 'source-the-guardian-api' (skip=false)
  • GitHub Check: Check: 'source-shopify' (skip=false)
  • GitHub Check: Check: 'source-hardcoded-records' (skip=false)
  • GitHub Check: Pytest (All, Python 3.11, Ubuntu)
  • GitHub Check: Pytest (Fast)
  • GitHub Check: Pytest (All, Python 3.10, Ubuntu)
  • GitHub Check: Analyze (python)
🔇 Additional comments (5)
airbyte_cdk/sources/declarative/declarative_component_schema.yaml (2)

681-681: LGTM! Minor formatting improvement.

The enum formatting has been cleaned up by removing unnecessary whitespace.


Line range hint 2889-2906: LGTM! Consistent component reference.

The JsonParser has been correctly added to the list of allowed parsers in the CompositeRawDecoder component.

airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (3)

75-76: LGTM! Clean import additions.

The new imports for JsonParser and Parser are well-organized within the existing decoders import block.


252-254: LGTM! Consistent model import.

The JsonParserModel import follows the established pattern for model imports.


530-530: LGTM! Clean mapping addition.

The JsonParserModel to create_json_parser mapping is correctly added to the constructor dictionary.

Copy link
Contributor

@natikgadzhi natikgadzhi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A couple nits, perhaps nuke the allowAdditionalFields? Otherwise looks good, shipit.

@pnilan pnilan merged commit 40a9f1e into main Jan 16, 2025
19 checks passed
@pnilan pnilan deleted the pnilan/declarative/parsers branch January 16, 2025 05:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

feat(JsonParser) - Create new JsonParser component
4 participants