Hi! I'm Reinaldo Chaves, a data journalist from Brazil.
I'm building an open-source tool to extract structured information from Brazilian investment fund regulation PDFs using LangExtract + Gemini. These are standardized documents (CVM Resolution 175/2022) with ~100 pages / 230K characters each, containing 22 entity types (fund name, CNPJ, administrator, manager, fees, duration, risk factors, liquidation events, legal forum, etc.).
Project repo: https://github.com/reichaves/langextract-fundos
The goal is to enable investigative journalists to systematically analyze thousands of fund regulations for transparency and accountability purposes.
Problems encountered
1. JSON truncation with many entity types (ResolverParsingError: Unterminated string)
Related: #127, #287
When extracting 22 entity types in a single lx.extract() call, the model generates JSON responses exceeding ~22K characters. This causes the output to be truncated mid-JSON, resulting in:
langextract.resolver.ResolverParsingError: Failed to parse JSON content:
Unterminated string starting at: line 1125 column 7 (char 22564)
Workaround attempted: I split the 22 entity types into 2-3 separate extraction groups (7-12 types each) and merge results. This reduces per-call JSON size but multiplies API calls (from ~16 to ~48).
Suggestion: Could LangExtract implement a max_output_tokens parameter, or automatically split entity types into groups when the expected JSON output might exceed the model's output limit? A warning when many extraction classes are defined would also help.
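For reference, the splitting half of my workaround is just list chunking; a minimal sketch (the commented-out merge loop assumes each lx.extract() call returns an object exposing an extractions list, and examples_for() is a hypothetical helper, since running it needs an API key):

```python
# Workaround sketch: split the 22 entity types into small groups so each
# lx.extract() call produces a JSON response under the output limit.
ENTITY_TYPES = [
    "nome_fundo", "cnpj_fundo", "tipo_fundo", "administrador",
    "gestor", "custodiante", "auditor", "taxa_administracao",
    "taxa_gestao", "taxa_performance", "taxa_custodia", "prazo_duracao",
    "regime_condominial", "publico_alvo", "classe_cotas", "ativo_alvo",
    "limite_concentracao", "fator_risco", "evento_avaliacao",
    "evento_liquidacao", "aplicacao_minima", "foro",
]

def split_into_groups(types, max_per_group=8):
    """Split entity types into groups small enough that each group's
    JSON output stays well below the model's output-token limit."""
    return [types[i:i + max_per_group]
            for i in range(0, len(types), max_per_group)]

groups = split_into_groups(ENTITY_TYPES)  # 3 groups: 8 + 8 + 6 types

# Each group then becomes one extraction pass, merged afterwards:
# all_extractions = []
# for group in groups:
#     result = lx.extract(
#         text_or_documents=large_pdf_text,
#         prompt_description="Extract: " + ", ".join(group),
#         examples=examples_for(group),  # hypothetical helper
#         model_id="gemini-2.0-flash",
#     )
#     all_extractions.extend(result.extractions)
```

The downside, as noted above, is that the API call count multiplies by the number of groups.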
2. No retry/backoff for rate limit errors (429 RESOURCE_EXHAUSTED)
Related: #240
With 48 API calls (3 groups × 16 chunks) on Gemini free tier (15 RPM), I consistently hit:
Parallel inference error: Gemini API error: 429 RESOURCE_EXHAUSTED
When this happens, LangExtract raises an exception and all progress is lost — including groups that already completed successfully.
Suggestion:
- Built-in exponential backoff for 429/503 errors (currently missing for Gemini provider)
- Save partial results when a later group fails, so completed work isn't lost
- A configurable max_rpm parameter to self-throttle API calls
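Until something like this lands in the library, a user-side stopgap is possible. This is a generic retry sketch, not LangExtract API: call would wrap one grouped lx.extract() invocation, and keeping a list of completed group results is what prevents earlier groups from being lost on a later 429:

```python
import random
import time

def with_backoff(call, max_retries=5, base_delay=2.0):
    """Retry call() with exponential backoff plus jitter when the raised
    error looks like a rate limit (429) or transient overload (503)."""
    for attempt in range(max_retries):
        try:
            return call()
        except Exception as exc:
            retryable = "429" in str(exc) or "503" in str(exc)
            if not retryable or attempt == max_retries - 1:
                raise
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 1))

# Usage sketch: keep results for groups that already succeeded, so a
# 429 on group 3 doesn't discard groups 1 and 2.
# completed = []
# for group in groups:
#     completed.append(with_backoff(lambda: lx.extract(...)))
```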
3. Performance challenges with large documents (230K chars)
Related: #178, #188
Processing a 230K-character document requires significant pre-filtering to stay within reasonable API call counts. Without filtering, a single document needs ~75 API calls (with 3 groups), which takes 25+ minutes and exhausts the free-tier quota.
I implemented a custom text filter that identifies relevant sections by clause headers ("7. TAXA DE ADMINISTRAÇÃO", "26. EVENTOS DE LIQUIDAÇÃO", etc.) and reduces the text from 230K to ~50K characters. However, this domain-specific filtering shouldn't be necessary — ideally LangExtract could handle long documents more efficiently.
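For context, the filter is roughly the following (a simplified sketch with a hypothetical keyword list; the actual implementation in the repo matches its own set of clause headers):

```python
import re

# Hypothetical keywords for illustration; the real tool uses the full
# set of CVM 175-style clause headers it cares about.
KEYWORDS = ("TAXA", "LIQUIDAÇÃO", "PRAZO", "FORO")

def filter_sections(text, keywords=KEYWORDS):
    """Keep only numbered clauses (e.g. '7. TAXA DE ADMINISTRAÇÃO')
    whose header line contains one of the target keywords."""
    # Split at line starts that look like clause headers: '7. TAXA ...'
    parts = [p for p in re.split(r"(?m)^(?=\d{1,2}\.\s+[A-ZÀ-Ü])", text)
             if p.strip()]
    return "".join(
        p for p in parts
        if any(k in p.splitlines()[0].upper() for k in keywords)
    )
```

On my documents this kind of header-driven filtering cuts 230K characters down to roughly 50K, but it is entirely domain-specific, which is why a general solution inside LangExtract would be preferable.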
Suggestion: Consider implementing a relevance-aware chunking strategy that uses the prompt_description to prioritize sections likely to contain target entities, rather than processing the entire document uniformly.
Environment
- langextract 1.0.9
- Python 3.11
- Model: gemini-2.0-flash (via free-tier API key)
- OS: macOS
- Document language: Portuguese (Brazilian)
Minimal reproduction
import langextract as lx
import textwrap
# 22 entity types for a Brazilian fund regulation
prompt = textwrap.dedent("""
Extract: nome_fundo, cnpj_fundo, tipo_fundo, administrador,
gestor, custodiante, auditor, taxa_administracao, taxa_gestao,
taxa_performance, taxa_custodia, prazo_duracao, regime_condominial,
publico_alvo, classe_cotas, ativo_alvo, limite_concentracao,
fator_risco, evento_avaliacao, evento_liquidacao,
aplicacao_minima, foro
""")
# With a 230K char document, this will:
# 1. Generate JSON too large → truncation → ResolverParsingError
# 2. Make ~75 API calls → 429 errors on free tier
# 3. Lose all progress when error occurs
result = lx.extract(
    text_or_documents=large_pdf_text,  # 230K chars
    prompt_description=prompt,
    examples=examples,
    model_id="gemini-2.0-flash",
    max_char_buffer=3000,
)
Summary
LangExtract is an excellent library and I'd love to use it more effectively for journalism applications. The main pain points for my use case (large regulatory documents + many entity types) are:
- JSON output truncation when many entity types generate large responses
- No retry logic for rate limit errors, losing completed work
- No way to prioritize relevant sections in very large documents
I'm happy to contribute test cases with Brazilian Portuguese documents if that would be helpful.
Thank you for building this tool!