Commit 6decfb7

Support gbk and other encoding with chardet
1 parent a1ccd04 commit 6decfb7

7 files changed, +53 -18 lines changed

demos/saul.webp

-75.8 KB
Binary file not shown.

demos/yinxiang_gbk.html

Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
+<html><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/><meta name="exporter-version" content="Evernote Mac 9.6.7 (470829)"/><meta name="keywords" content="AI, chatgpt, openai"/><meta name="altitude" content="0"/><meta name="author" content="Selfboot"/><meta name="created" content="2023-04-20 02:11:27 +0000"/><meta name="latitude" content="23.10881037823841"/><meta name="longitude" content="113.3146889887526"/><meta name="source" content="desktop.mac"/><meta name="source-url" content="https://platform.openai.com/docs/introduction/key-concepts"/><meta name="updated" content="2023-04-22 04:47:50 +0000"/><title>Test Case A</title></head><body style="font-size: 18px;"><div><br/></div><div><span style="font-weight: bold; text-decoration: underline;">Overview</span></div><div><br/></div><div>The <span style="font-weight: bold; color: rgb(255, 38, 0);">OpenAI API</span> can be applied to virtually any task that involves understanding or generating natural language, code, or images. We offer a spectrum of models with different levels of power suitable for different tasks, as well as the ability to fine-tune your own custom models. These models can be used for everything from content generation to semantic search and classification.</div><div><br/></div><div><span style="font-weight: bold; text-decoration: underline;">Tokens</span></div><div><br/></div><div style="box-sizing: border-box; padding: 8px; font-family: Monaco, Menlo, Consolas, &quot;Courier New&quot;, monospace; font-size: 12px; color: rgb(51, 51, 51); border-top-left-radius: 4px; border-top-right-radius: 4px; border-bottom-right-radius: 4px; border-bottom-left-radius: 4px; background-color: rgb(251, 250, 248); border: 1px solid rgba(0, 0, 0, 0.15);-en-codeblock:true;"><div>Our models understand and process text by breaking it down into tokens. 
<font color="#ff9300">Tokens can be words or just chunks of characters.</font> For example, the word “hamburger” gets broken up into the tokens “ham”, “bur” and “ger”, while a short and common word like “pear” is a single token. Many tokens start with a whitespace, for example “ hello” and “ bye”.</div><div><br/></div><div>The number of tokens processed in a given API request depends on the length of both your inputs and outputs. As a rough rule of thumb, 1 token is approximately 4 characters or 0.75 words for English text. One limitation to keep in mind is that your text prompt and generated completion combined must be no more than the model's maximum context length (for most models this is 2048 tokens, or about 1500 words). Check out our <a href="https://platform.openai.com/tokenizer">tokenizer tool</a> to learn more about how text translates to tokens.</div></div><div><br/></div><div>Models</div><div><br/></div><div>The API is powered by a set of models with different capabilities and price points. GPT-4 is our latest and most powerful model. GPT-3.5-Turbo is the model that powers ChatGPT and is optimized for conversational formats. To learn more about these models and what else we offer, visit our <a href="https://platform.openai.com/docs/models">models documentation</a>.</div><div><br/></div><div>Next steps</div><div><br/></div><ul><li><div>Keep our usage policies in mind as you start building your application.</div></li><li><div>Explore our examples library for inspiration.</div></li><li><div>Jump into one of our guides to start building.</div></li></ul><div><br/></div></body></html>

demos/yinxiang_md.html

Lines changed: 35 additions & 0 deletions
Large diffs are not rendered by default.

html2notion/translate/html2json.py

Lines changed: 12 additions & 3 deletions
@@ -1,5 +1,6 @@
-import sys
 import json
+import chardet
+import time
 from functools import singledispatch
 from pathlib import Path
 from bs4 import BeautifulSoup, Tag
@@ -86,8 +87,16 @@ def _(html_file: Path, import_stat):
         print(f"Load file: {html_file.resolve()} failed")
         raise FileNotFoundError
 
-    with open(html_file, "r") as file:
-        html_content = file.read()
+    html_content = ""
+    with html_file.open('rb') as f:
+        data = f.read()
+        result = chardet.detect(data)
+        encoding = result['encoding'] if result['encoding'] else 'utf-8'
+        html_content = data.decode(encoding)
+
+    if html_content == "main_hold": # just for local debug
+        time.sleep(1)
+        return "main_hold"
 
     converter = _get_converter(html_content, import_stat)
     result = converter.process()
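
Review note: the change to html2json.py passes the detected encoding straight to `data.decode`, which can still raise `UnicodeDecodeError` when the guess is wrong. Below is a minimal sketch of one way to harden that step, not the commit's method. The names `decode_html_bytes`, `detect`, and `fallbacks` are hypothetical; `detect` is assumed to be a callable shaped like `chardet.detect`, returning a dict with an `'encoding'` key that may be `None`.

```python
from typing import Callable, Iterable, List, Optional


def decode_html_bytes(
    data: bytes,
    detect: Optional[Callable[[bytes], dict]] = None,
    fallbacks: Iterable[str] = ("utf-8", "gbk"),
) -> str:
    """Decode raw bytes, trying a detector's guess first, then fixed fallbacks.

    Hypothetical helper for illustration; it is not part of html2notion.
    """
    candidates: List[str] = []
    if detect is not None:
        guess = detect(data).get("encoding")
        if guess:  # the detector may report None for short or ambiguous input
            candidates.append(guess)
    candidates.extend(fallbacks)

    for name in candidates:
        try:
            return data.decode(name)
        except (UnicodeDecodeError, LookupError):
            # Wrong guess, or a codec name Python does not know: try the next one.
            continue

    # Last resort: never raise, replace undecodable bytes instead.
    return data.decode("utf-8", errors="replace")
```

With chardet installed, a call would look like `decode_html_bytes(data, detect=chardet.detect)`. The fallback order is a design choice: strict UTF-8 rejects most GBK byte sequences, so trying it before GBK rarely produces a silent mis-decode.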

html2notion/translate/notion_import.py

Lines changed: 0 additions & 14 deletions
@@ -18,20 +18,6 @@ def __init__(self, session: ClientSession, notion_client):
 
     async def process_file(self, file_path: Path):
         self.import_stats.set_filename(file_path)
-
-        if not file_path.is_file():
-            self.import_stats.set_exception(Exception(f"{file_path} is not file"))
-            logger.error(f"{file_path} is not a file.")
-            return "fail"
-
-        with file_path.open() as f:
-            content = f.read()
-
-        logger.info(f"Process file {file_path}")
-        if content == "main_hold": # local debug
-            await asyncio.sleep(1)
-            return "main_hold"
-
         try:
             notion_data, html_type = html2json_process(file_path, self.import_stats)
         except Exception as e:
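
Review note: this hunk removes the `await asyncio.sleep(1)` debug hold from the coroutine, while the html2json.py change adds a `time.sleep(1)` hold in synchronous code. If that synchronous path is reached directly from `process_file`, the sleep would block the event loop rather than yield to other tasks. The following is a minimal sketch, under that assumption, of offloading a blocking call with `asyncio.to_thread` (Python 3.9+). `blocking_convert` is a hypothetical stand-in, and this `process_file` is a simplified illustration, not the project's method.

```python
import asyncio
import time


def blocking_convert(path: str) -> str:
    """Hypothetical stand-in for a synchronous conversion step."""
    time.sleep(0.05)  # simulates blocking work such as the debug hold
    return f"converted:{path}"


async def process_file(path: str) -> str:
    # Run the blocking function in a worker thread so other tasks keep running.
    return await asyncio.to_thread(blocking_convert, path)


async def main() -> list:
    # gather returns results in the order the awaitables were passed in.
    return await asyncio.gather(*(process_file(p) for p in ("a.html", "b.html")))


results = asyncio.run(main())
```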

requirements.txt

Lines changed: 1 addition & 0 deletions
@@ -9,3 +9,4 @@ cos-python-sdk-v5>=1.9.23
 tenacity>=8.2.2
 rich>=13.3.4
 aiolimiter>=1.0.0
+chardet>=5.1.0

setup.cfg

Lines changed: 2 additions & 1 deletion
@@ -23,10 +23,11 @@ install_requires =
     PyYAML>=6.0
     aiohttp>=3.8.4
     anyio>=3.6.2
-    rich>=13.3.4
     cos-python-sdk-v5>=1.9.23
     tenacity>=8.2.2
+    rich>=13.3.4
     aiolimiter>=1.0.0
+    chardet>=5.1.0
 
 [options.entry_points]
 console_scripts =
