Support GBK and other encodings with chardet

selfboot · selfboot · commit 6decfb710134 · 2023-05-12T10:10:20.000+08:00
diff --git a/demos/saul.webp b/demos/saul.webp
diff --git a/demos/yinxiang_gbk.html b/demos/yinxiang_gbk.html
@@ -0,0 +1,3 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
+<html><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/><meta name="exporter-version" content="Evernote Mac 9.6.7 (470829)"/><meta name="keywords" content="AI, chatgpt, openai"/><meta name="altitude" content="0"/><meta name="author" content="Selfboot"/><meta name="created" content="2023-04-20 02:11:27 +0000"/><meta name="latitude" content="23.10881037823841"/><meta name="longitude" content="113.3146889887526"/><meta name="source" content="desktop.mac"/><meta name="source-url" content="https://platform.openai.com/docs/introduction/key-concepts"/><meta name="updated" content="2023-04-22 04:47:50 +0000"/><title>Test Case A</title></head><body style="font-size: 18px;"><div><br/></div><div><span style="font-weight: bold; text-decoration: underline;">Overview</span></div><div><br/></div><div>The <span style="font-weight: bold; color: rgb(255, 38, 0);">OpenAI API</span> can be applied to virtually any task that involves understanding or generating natural language, code, or images. We offer a spectrum of models with different levels of power suitable for different tasks, as well as the ability to fine-tune your own custom models. These models can be used for everything from content generation to semantic search and classification.</div><div><br/></div><div><span style="font-weight: bold; text-decoration: underline;">Tokens</span></div><div><br/></div><div style="box-sizing: border-box; padding: 8px; font-family: Monaco, Menlo, Consolas, &quot;Courier New&quot;, monospace; font-size: 12px; color: rgb(51, 51, 51); border-top-left-radius: 4px; border-top-right-radius: 4px; border-bottom-right-radius: 4px; border-bottom-left-radius: 4px; background-color: rgb(251, 250, 248); border: 1px solid rgba(0, 0, 0, 0.15);-en-codeblock:true;"><div>Our models understand and process text by breaking it down into tokens. 
<font color="#ff9300">Tokens can be words or just chunks of characters.</font> For example, the word “hamburger” gets broken up into the tokens “ham”, “bur” and “ger”, while a short and common word like “pear” is a single token. Many tokens start with a whitespace, for example “ hello” and “ bye”.</div><div><br/></div><div>The number of tokens processed in a given API request depends on the length of both your inputs and outputs. As a rough rule of thumb, 1 token is approximately 4 characters or 0.75 words for English text. One limitation to keep in mind is that your text prompt and generated completion combined must be no more than the model's maximum context length (for most models this is 2048 tokens, or about 1500 words). Check out our <a href="https://platform.openai.com/tokenizer">tokenizer tool</a> to learn more about how text translates to tokens.</div></div><div><br/></div><div>Models</div><div><br/></div><div>The API is powered by a set of models with different capabilities and price points. GPT-4 is our latest and most powerful model. GPT-3.5-Turbo is the model that powers ChatGPT and is optimized for conversational formats. To learn more about these models and what else we offer, visit our <a href="https://platform.openai.com/docs/models">models documentation</a>.</div><div><br/></div><div>Next steps</div><div><br/></div><ul><li><div>Keep our usage policies in mind as you start building your application.</div></li><li><div>Explore our examples library for inspiration.</div></li><li><div>Jump into one of our guides to start building.</div></li></ul><div><br/></div></body></html>
diff --git a/demos/yinxiang_md.html b/demos/yinxiang_md.html
diff --git a/html2notion/translate/html2json.py b/html2notion/translate/html2json.py
@@ -1,5 +1,6 @@
-import sys
 import json
+import chardet
+import time
 from functools import singledispatch
 from pathlib import Path
 from bs4 import BeautifulSoup, Tag
@@ -86,8 +87,16 @@ def _(html_file: Path, import_stat):
         print(f"Load file: {html_file.resolve()} failed")
         raise FileNotFoundError
 
-    with open(html_file, "r") as file:
-        html_content = file.read()
+    html_content = ""
+    with html_file.open('rb') as f:
+        data = f.read()
+        result = chardet.detect(data)
+        encoding = result['encoding'] if result['encoding'] else 'utf-8'
+        html_content = data.decode(encoding)
+
+        if html_content == "main_hold":                  # just for local debug
+            time.sleep(1)
+            return "main_hold"
 
     converter = _get_converter(html_content, import_stat)
     result = converter.process()
diff --git a/html2notion/translate/notion_import.py b/html2notion/translate/notion_import.py
@@ -18,20 +18,6 @@ def __init__(self, session: ClientSession, notion_client):
 
     async def process_file(self, file_path: Path):
         self.import_stats.set_filename(file_path)
-
-        if not file_path.is_file():
-            self.import_stats.set_exception(Exception(f"{file_path} is not file"))
-            logger.error(f"{file_path} is not a file.")
-            return "fail"
-
-        with file_path.open() as f:
-            content = f.read()
-
-        logger.info(f"Process file {file_path}")
-        if content == "main_hold":                  # local debug
-            await asyncio.sleep(1)
-            return "main_hold"
-
         try:
             notion_data, html_type = html2json_process(file_path, self.import_stats)
         except Exception as e:
diff --git a/requirements.txt b/requirements.txt
@@ -9,3 +9,4 @@ cos-python-sdk-v5>=1.9.23
 tenacity>=8.2.2
 rich>=13.3.4
 aiolimiter>=1.0.0
+chardet>=5.1.0
diff --git a/setup.cfg b/setup.cfg
@@ -23,10 +23,11 @@ install_requires =
     PyYAML>=6.0
     aiohttp>=3.8.4
     anyio>=3.6.2
-    rich>=13.3.4
     cos-python-sdk-v5>=1.9.23
     tenacity>=8.2.2
+    rich>=13.3.4
     aiolimiter>=1.0.0
+    chardet>=5.1.0
 
 [options.entry_points]
 console_scripts =
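
Note on the decoding step in html2notion/translate/html2json.py: the hunk in this commit reads the file as bytes, asks chardet.detect for an encoding, and falls back to UTF-8 when the detector returns None. The sketch below illustrates that flow as a standalone function. It is not the commit's code: the name decode_html_bytes, the injectable detect parameter, and the errors="replace" fallback after a failed decode are assumptions added for illustration (the commit calls data.decode(encoding) directly, so a wrong or unknown guess can raise there).

```python
from typing import Callable, Optional

try:
    import chardet  # third-party; this commit adds chardet>=5.1.0 to requirements
except ImportError:  # keep the sketch importable when chardet is not installed
    chardet = None


def decode_html_bytes(
    data: bytes,
    detect: Optional[Callable[[bytes], dict]] = None,  # hypothetical hook, not in the commit
    fallback: str = "utf-8",
) -> str:
    """Decode raw HTML bytes using a detected encoding, with a UTF-8 fallback."""
    if detect is None:
        # Use chardet when available; otherwise behave as if nothing was detected.
        detect = chardet.detect if chardet is not None else (lambda _: {"encoding": None})

    # Same rule as the commit: a missing/None guess means "use UTF-8".
    guess = detect(data).get("encoding") or fallback

    try:
        return data.decode(guess)
    except (UnicodeDecodeError, LookupError):
        # Assumption for this sketch: prefer a lossy decode over an exception
        # when the guess is wrong or names a codec Python does not know.
        return data.decode(fallback, errors="replace")


if __name__ == "__main__":
    gbk_bytes = "“hamburger”".encode("gbk")
    # A stub detector stands in for chardet so the demo is deterministic.
    print(decode_html_bytes(gbk_bytes, detect=lambda _: {"encoding": "gbk"}))
```

Passing a stub detector, as in the demo, keeps tests deterministic; detection on very short inputs can be unreliable, so real fixtures such as demos/yinxiang_gbk.html are a better check of chardet itself.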
