Skip to content

Msgspec Doesn't Decode Json Correctly when SQLAlchemy.Mapped is used. #820

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
Vizonex opened this issue Mar 3, 2025 · 6 comments
Open

Comments

@Vizonex
Copy link

Vizonex commented Mar 3, 2025

Description

I am having a problem where when I want to write a Json Decoder for SQLAlchemy but then I bump into an issue where msgspec doesn't know how to handle these types.

from sqlalchemy.orm import Mapped, MappedAsDataclass, DeclarativeBaseNoMeta, mapped_column, declared_attr
from msgspec.json import Decoder, Encoder


class SQLTable(MappedAsDataclass, DeclarativeBaseNoMeta):
    pass


class User(SQLTable):
    __tablename__ = "user"
    # Mapped is used to register columns in the sql table without 
    # Having to write mapped_column out many times.
    name:Mapped[str]
    id:Mapped[int] = mapped_column(primary_key=True, default=None)

encoder = Encoder()
# Encoder works just fine...
print(encoder.encode(User("Vizonex", 10)))


# But the decoder on the other-hand is broken and I expect to decode the User json 
# To a User Object...
# msgspec.ValidationError: Expected `Mapped`, got `str` - at `$.name`
user = Decoder(User).decode(b'{"name":"Vizonex", "id": 10}')

I am completely stumped on how to fix this problem and I have gone through all hoops and hurtles to try and make the decoder for my upcoming python library SQLTable to decode json correctly.

Other Things I've Tried

I have tried hacking the Mapped[] Object itself into an annotated object but then SQLAlchemy fails to find it...

from sqlalchemy.orm import Mapped, MappedAsDataclass, DeclarativeBaseNoMeta, mapped_column, declared_attr
from typing import Annotated, TypeVar
from msgspec.json import Decoder, Encoder

T = TypeVar("T")

# this was an idea I came up with to allow msgspec to act friendly with 
# Sqlalchemy Mapped variables but sqlalchmey doesn't act friendly with 
# finding Mapped attributes...
HackedMap = Annotated[T, Mapped[T]]

class SQLTable(MappedAsDataclass, DeclarativeBaseNoMeta):
    pass

class User(SQLTable):
    __tablename__ = "user"
    # We can try to hack it to how we please so that msgspec's decoder works correctly 
    # but then a new problem arises     
    name:HackedMap[str]
    id:HackedMap[int] = mapped_column(primary_key=True, default=None)



# It only shows {'id': <ColumnProperty at 0x20d09b14e40; id>} Because we set it with mapped_column
print(User.__mapper__.attrs._data)

I am almost completely stumped on how this will get solved. But there are 2 fixes that I could think of that could be approched to solve this once and for all.

  1. SQLAlchemy adds some new logic for finding mapped variables hidden inside Annotated types.
  2. The way we could fix _core.c involves fixing the function typenode_collect_type function where we could check if the class has a __table__ or another obscure type belonging to sqlalchemy and then try to filter out Mapped[T] types through that kind of Boolean/logic.
@timinou
Copy link

timinou commented Apr 3, 2025

Having the same issue here. It's basically forced me not to use msgspec.convert throughout my code base... :/

msgspec.inspect.type_info(User), in this case, would give a field id of type CustomType, not int, in my case. That makes convert not work ever...

@Vizonex
Copy link
Author

Vizonex commented Apr 6, 2025

@timinou I have also tried other alternatives and I am willing to name them all including providing other things that I've tried.

  • SQLModel which I originally moved away from after learning that it's maintainability as well as pull requests are almost an entire gamble with a handful of features from SQLAlchemy that are missing

    • Not to mentaion that Pydantic v2 is slow and I have trouble liking it mainly because I found it's validation methods to be a handful and let's not forget it's metaclass is written in python but I do have some hope that pydantic v3 or 4 will finally get a proper metaclass implementation in rust in the near future. but until then me wanting to use it again seems grim as I am not happy with it's overall slowness and missing features.
    • SQLModel's inner workings is hard to learn from even with my 5 years of experience of coding with Python and it has a lot of spaghetti code in it's backend.
  • I have tried making my own dataclass library including proposing some ways to implement metaclasses that work in this way into Cython but it was currently turned down as the making of this long comment

  • I have a fork of msgspec's predesessor Quickle to help me research how it's CPython Metaclasses work since most implementations are very rare and difficult to come across

  • I have a library called SQLTable which I am programming as a competitor to SQLModel so that maybe it would provide those developers with some encouragement to optimize their own libraries but currently I am stuck with this part, hence why I threw this issue up to begin with because I got stuck with decoding Json Variables and the current __init__() function has problems with json decoding this is because __init__ if I remeber correctly (correct me if I am wrong) is not called and msgspec steals the references to the SQLAlchemy columns during that decoding process.

    • However you can hack the __annotations__ to remove the Mapped[...] variables temporarily which me and one of the SQLAlchemy Maintainers are trying to theorize ways to make everything work.
    • My Original goal was to bind Struct and One of SQLAlchemy's metaclasses together but the StructMeta metaclass is not importable and would require me to fork this branch and make a pull request with those changes and the columns do not initialize correctly.

@timinou
Copy link

timinou commented Apr 6, 2025

Wow yeah @Vizonex you definitely thought about this. Thanks for the resources and the reflections!

I looked at your SQLTable code and it led me to remove the Mapped from the dataclass entirely after its instanciation. I reasoned that

  1. SQLAlchemy did what it needed to do with that model, so we might as well assume only msgspec will reach its dataclass dunders after the fact
  2. SQLAlchemy is a bit at fault here, creating Mapped as a generic instead of an annotation. It's bound not to play well with other libraries.

This is the code, I think mostly from SQLTable:

def remove_mapping_annotations(cls):
    """Temporarly alters annotations so msgspec can understand
    how to decode variables that are associated to `Mapped`"""
    __previous_annotations__ = inspect.get_annotations(
        cls, eval_str=True
    )

    for k, v in __previous_annotations__.items():
        # using type(v) would just give me _GenericAlias this bypasses that...
        orig = typing.get_origin(v)

        if orig is Mapped:
            cls.__annotations__[k] = typing.get_args(v)[0]

I tried putting the remove_mapping_annotations in an __init_subclass__ method but it didn't work; maybe I'll try in __new__. Right now I call this function below the class definition, which is a bit ugly...

Generally when I've wanted to do something to msgspec, I've ended up finding an elegant way to do something with it (thanks @jcrist for the exquisite library design!). So my thinking around this would be to have something like:

from sqlalchemy.org import Mapped as _Mapped
from typing import Annotated
from msgspec import Meta

type Mapped[T] = Annotated[_Mapped[T], Meta(extra={"__table_field__": True})

If I implement it I'll let you all know.

It doesn't solve the other issue msgspec has with encoding fields, where I have to deal with session persistence and relationship fields having to be awaited to get their values (with .awaitable_attrs). My solution has been to recursively iterate on the dataclass instance's fields and output a dict that only contains the values I want (in effect, re-implementing msgspec.convert but for sqlalchemy models). For that issue, I skip all the fields that have init=False, or await them if I decided to add them to the final record. I think it's related to the __init__ behaviour you're talking about, I get a lot of session issues around that.

The lack of being able to omit fields that have init=False from being encoded is actually tricky for me, because I deal with embeddings and I don't want them in the final record. So for that one I have a del MyModel.__dataclass_fields__["embedding"] after its instanciation.

Another avenue I'm considering is SQLActive. It's got a serialisation mixin here with a to_dict method that seems to do the trick. Would still need to deal with the Mapped thing though.

Hope it helps!

@Vizonex
Copy link
Author

Vizonex commented Apr 6, 2025

@timinou The problem with using python-level things (At least for me) is that it kills the performance of many things. This is why I specifically tried researching into making a metaclass in Cython or Rust mainly so that maintenance would be a bit easier and so that the speed could be retained, it takes a lot longer to put together a CPython module than a Cython or Rust Module does.

However you were right about everything

  1. SQLAlchemy did what it needed to do with that model, so we might as well assume only msgspec will reach its dataclass dunders after the fact
  2. SQLAlchemy is a bit at fault here, creating Mapped as a generic instead of an annotation. It's bound not to play well with other libraries.

I'm already trying to lend the hand with finding the best solution. One Idea that I did not mention or forgot to was using cwpack in Cython and adding it in as either a wrapper or Mixin then handle all the __init__ attributes first then throw in the other data second where init was flagged as false. My only gripe with the approach is the time it would take for me to code it all.

  • Another Theory I had was subclassing Field so that the SQLAlchemy Columns could be theoretically retained but until Struct and one of SQLAlchemy's metaclasses can work together there is no hope.
  • I already had tried using the Meta Annotation to try and create the SQLAlchemy Columns but one of the SQLAlchemy Maintainers told me that it would not be an Ideal approach and we would still be stealing references to the columns if it was hacked in from __init_subclass__

I do understand the C Implementation of this library but the hard part (at least for me that is) is figuring out where it could be fixed and then going in, recompiling it and testing it. There's a lot of ways to solve the problem but as it stands the only success I've had is sterilization. Let's not forget that its redundant to write a class twice which was what lead me down this large rabbit hole to start.

@Vizonex
Copy link
Author

Vizonex commented Apr 6, 2025

The Reason why the json decoder and others are broken is because it calls for __new__ and as soon as that happens and starts setting in the attributes and since it takes those references to the fields away, the columns are stolen and cannot be referenced again. The SQLAlchemy stuff including mimicing the __init__ function is difficult because they make it hard and difficult for anybody using their frontend to reach it. My Final idea was coding a new class to the library which I call a InitDecoder that runs the __init__ function instead of __new__ so that SQLAlchemy's columns can be saved and retained, I'll see about reforking this repo again and try implementing it sometime later today or this week although this adds a whole other Layer to the library it's the only plausible solution I could think of where at least nothing has to be completely changed.

This is not a bad thing, the only other thing unrelated to this topic that I could really suggest is that more of the C Code could be separated into different C and Header files to spare any innocent developer or new maintainer trying to navigate through it's code in _core.c , I understand how to compile multiple C files together via setuptools and I could understand why this may have been roadblock for them if it was one.

@Vizonex
Copy link
Author

Vizonex commented Apr 15, 2025

@timinou Unique update I may have figured out a way to fix Colums to prevent member stealing.
This may require a bit of explanation but basically how the descriptor protocol works is that it prevents field information from being stolen on __init__'s end but not __new__'s end thus leading to member overriding. This can be remedied by making the __get__ method of the descriptor return itself if the private member is not found which prevents __new__ from even having the chance of overriding that variable...



class DescriptorExample:
    def __set_name__(self, owner, name):
        # Creating an inner variable can also help prevent overriding
        # Msgspec doesn't need to these arributes so we're safe to use 
        # these...
        self.name = name
        self.private_name = "__" + name

    def __get__(self, instance, owner):
        """Obtains the member if it is avalible otherwise 
        it goes out and gets the descriptor itself which prevents 
        descriptors from being stolen via `__new__`
        """
        return getattr(instance, self.private_name, self)
  
    def __set__(self, obj, value):
        setattr(obj, self.private_name, value)
        
    

class Example:
    descriptor_example = DescriptorExample()
    def __init__(self, example:str):
        self.descriptor_example = example

original_desc = Example.descriptor_example
e = Example.__new__(Example)
e.descriptor_example = "OVERRIDE"
assert Example.descriptor_example == original_desc, "OH NO! MEMBER WAS STOLEN!!!"

The Good News

Msgspec is not at fault and SQLAlchemy is and that means that this bug has a better chance of being remedied as shown in my example. There might have been some black-magic involved with SQLAlchemy's code that makes that happen and thus subclassing it and experimenting with a Json Decoder with Hacked annotations has a chance of working as intended.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants