Commit d9eb275

Basically supports the markdown conversion, #1
1 parent 6decfb7 commit d9eb275

10 files changed: +535 -80 lines changed
demos/yinxiang_markdown.html

Lines changed: 54 additions & 0 deletions
Large diffs are not rendered by default.

demos/yinxiang_md.html

Lines changed: 0 additions & 35 deletions
This file was deleted.

examples/process_md.ipynb

Lines changed: 184 additions & 0 deletions
@@ -0,0 +1,184 @@
+{
+ "cells": [
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "md_text = \"\"\"\n",
+    "# Header\n",
+    "\n",
+    "**bold**, _ite_, ~~other~~, more...\n",
+    "`inline code` here.\n",
+    "\n",
+    "```python\n",
+    "import os\n",
+    "os.print('hello')\n",
+    "```\n",
+    "\n",
+    "> Please work through this document in its entirety to better understand how OpenAI’s rate limit system works. We include code examples and possible solutions to handle common issues. It is recommended to **follow** this guidance before filling out the [Rate Limit Increase Request form](https://docs.google.com/forms/d/e/1FAIpQLSc6gSL3zfHFlL6gNIyUcjkEv29jModHGxg5_XGyr-PrE2LaHw/viewform) with details regarding how to fill it out in the last section.\n",
+    "\n",
+    "divider\n",
+    "* * *\n",
+    "\n",
+    "### image\n",
+    "local images:\n",
+    "\n",
+    "![846f62a6516227df1b4370aea3f63143.png](evernotecid://A2B91148-7880-4D85-A7CC-3A794B21D0F8/appyinxiangcom/186128/ENResource/p3511)\n",
+    "\n",
+    "web image:\n",
+    "![pic](https://raw.githubusercontent.com/selfboot/html2notion/master/demos/notion_templage.png)\n",
+    "\n",
+    "[link](https://docs.microsoft.com/zh-tw/previous-versions/visualstudio/design-tools/expression-studio-2/cc294571(v=expression.10))\n",
+    "\n",
+    "### Table\n",
+    "\n",
+    "|header| column1 | column 2\n",
+    "|-|-|-\n",
+    "|row 1| row 1_1 | row 1_2\n",
+    "|row 2| row 2_2 **bold**, _ite_, ~~other~~, more... | row 2_3\n",
+    "\n",
+    "### list\n",
+    "\n",
+    "[Why do we have rate limits?](https://platform.openai.com/docs/guides/rate-limits/overview)\n",
+    "Rate limits are a common practice for APIs, and they're put in place for a few different reasons:\n",
+    "\n",
+    "- They help protect against abuse or misuse of the API. For example, a malicious actor could flood the API with requests in an attempt to overload it or cause disruptions in service. By setting rate limits, `OpenAI` can prevent this kind of activity.\n",
+    "- Rate limits help ensure that everyone has fair access to the API. If one person or organization makes an excessive number of requests, it could bog down the API for everyone else. By throttling the number of requests that a single user can make, OpenAI ensures that the most number of people have an opportunity to use the API without experiencing slowdowns.\n",
+    "- Rate limits can help OpenAI manage the aggregate load on its infrastructure. If requests to the API increase dramatically, it could tax the servers and cause performance issues. By setting rate limits, OpenAI can help maintain a smooth and consistent experience for all users.\n",
+    "\n",
+    "number list\n",
+    "\n",
+    "1. number list1\n",
+    "2. numner list2\n",
+    "\n",
+    "## checkbox\n",
+    "\n",
+    "Three frogs\n",
+    "* [x] The first frog\n",
+    "* [ ] The second frog\n",
+    "* [ ] The third frog\n",
+    "\n",
+    "# math and grapth\n",
+    "\n",
+    "Here is math\n",
+    "```math\n",
+    "e^{i\\pi} + 1 = 0\n",
+    "```\n",
+    "\n",
+    "mermaid grapth:\n",
+    "\n",
+    "```mermaid\n",
+    "graph TD\n",
+    "A[Module A] -->|A1| B( Module B)\n",
+    "B --> C{Confidition C}\n",
+    "C -->|condition C1| D[Module D]\n",
+    "C -->|condition C2| E[Module E]\n",
+    "C -->|condition C3| F[Module F]\n",
+    "```\n",
+    "\n",
+    "sequenceDiagram\n",
+    "\n",
+    "```mermaid\n",
+    "sequenceDiagram\n",
+    "A->>B: Have you received a message?\n",
+    "B-->>A: Message received\n",
+    "```\n",
+    "\n",
+    "gantt\n",
+    "\n",
+    "```mermaid\n",
+    "gantt\n",
+    "title Gantt chart\n",
+    "dateFormat YYYY-MM-DD\n",
+    "section Proj A\n",
+    "Task 1 :a1, 2018-06-06, 30d\n",
+    "Task 2 :after a1 , 20d\n",
+    "section Proj B\n",
+    "Task 3 :2018-06-12 , 12d\n",
+    "Task 4 : 24d\n",
+    "```\n",
+    "\n",
+    "### chart\n",
+    "\n",
+    "```chart\n",
+    ", budget, income, expenses, debt\n",
+    "June,5000,8000,4000,6000\n",
+    "July,3000,1000,4000,3000\n",
+    "Aug,5000,7000,6000,3000\n",
+    "Sep,7000,2000,3000,1000\n",
+    "Oct,6000,5000,4000,2000\n",
+    "Nov,4000,3000,5000,\n",
+    "\n",
+    "type: pie\n",
+    "title: 每月收益\n",
+    "x.title: Amount\n",
+    "y.title: Month\n",
+    "y.suffix: $\n",
+    "```\n",
+    "\n",
+    "```chart\n",
+    ",Budget,Income,Expenses,Debt\n",
+    "June,5000,8000,4000,6000\n",
+    "July,3000,1000,4000,3000\n",
+    "Aug,5000,7000,6000,3000\n",
+    "Sep,7000,2000,3000,1000\n",
+    "Oct,6000,5000,4000,2000\n",
+    "Nov,4000,3000,5000,\n",
+    "\n",
+    "type: line\n",
+    "title: Monthly Revenue\n",
+    "x.title: Amount\n",
+    "y.title: Month\n",
+    "y.suffix: $\n",
+    "```\n",
+    "\"\"\""
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import re\n",
+    "\n",
+    "def extract_code_blocks(md_text):\n",
+    "    code_pattern = re.compile(r'```(\\w+)?\\n(.*?)```', re.DOTALL)\n",
+    "    matches = code_pattern.findall(md_text)\n",
+    "    code_blocks = [{'language': match[0], 'code': match[1]} for match in matches]\n",
+    "    return code_blocks\n",
+    "\n",
+    "\n",
+    "code_blocks = extract_code_blocks(md_text)\n",
+    "\n",
+    "for block in code_blocks:\n",
+    "    print(f\"Language: {block['language']}\")\n",
+    "    print(f\"Code: {block['code']}\\n\")\n"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "notion",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.11.2"
+  },
+  "orig_nbformat": 4
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}
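
Note on the notebook's second cell: the sketch below re-runs the same `extract_code_blocks` pattern on two small inputs to show what it returns. The `FENCE`, `sample`, and `tricky` names are illustrative and not part of the commit. It suggests two behaviors worth knowing: an untagged fence yields an empty `language` string, and a tag that `\w+` cannot fully match (such as `c++`) appears to make the pattern pair the wrong fences.

```python
import re

FENCE = "`" * 3  # built at runtime so this example contains no literal fence


def extract_code_blocks(md_text):
    # Same pattern as the notebook cell: optional \w+ tag, newline, lazy body.
    code_pattern = re.compile(r'```(\w+)?\n(.*?)```', re.DOTALL)
    matches = code_pattern.findall(md_text)
    return [{'language': lang, 'code': code} for lang, code in matches]


# One tagged block and one untagged block.
sample = (
    f"intro\n{FENCE}python\nprint('hi')\n{FENCE}\n"
    f"text\n{FENCE}\nplain\n{FENCE}\n"
)
blocks = extract_code_blocks(sample)
for block in blocks:
    print(repr(block['language']), repr(block['code']))

# A tag containing non-word characters: the opening fence does not match,
# so the closing fence is treated as an opener and prose is captured.
tricky = (
    f"{FENCE}c++\nint x;\n{FENCE}\n"
    f"after\n{FENCE}python\npass\n{FENCE}\n"
)
mispaired = extract_code_blocks(tricky)
print(mispaired)
```

If tags such as `c++` or `c#` need to be recognized, one option is to capture the tag with `[^\n]*` instead of `(\w+)?`; that is a suggestion, not something this commit does.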

html2notion/translate/html2json.py

Lines changed: 30 additions & 21 deletions
@@ -9,6 +9,7 @@
 from ..translate.html2json_default import Default_Type
 from ..translate.html2json_yinxiang import YinXiang_Type
 from ..translate.html2json_clipper import YinXiangClipper_Type
+from ..translate.html2json_markdown import YinXiangMarkdown_Type
 
 
 """
@@ -17,14 +18,8 @@
 <meta name="source" content="mobile.android"/>
 """
 def _is_yinxiang_export_html(html_soup):
-    exporter_version_meta = html_soup.select_one('html > head > meta[name="exporter-version"]')
     meta_source = html_soup.select_one('html > head > meta[name="source"]')
-    exporter_version_content = exporter_version_meta.get( 'content', "") if isinstance(exporter_version_meta, Tag) else ""
-
     meta_source_content = meta_source.get('content', "") if isinstance(meta_source, Tag) else ""
-    if isinstance(exporter_version_content, str) and not exporter_version_content.startswith("Evernote"):
-        return False
-
     yinxiang_source_content = ["yinxiang", "desktop", "mobile"]
     for prefix in yinxiang_source_content:
         if isinstance(meta_source_content, str) and meta_source_content.startswith(prefix):
@@ -37,28 +32,42 @@ def _is_yinxiang_export_html(html_soup):
 <meta name="source-application" content="微信" />
 """
 def _is_yinxiang_clipper_html(html_soup):
-    exporter_version_meta = html_soup.select_one('html > head > meta[name="exporter-version"]')
-    exporter_version_content = exporter_version_meta.get(
-        'content', "") if isinstance(
-        exporter_version_meta, Tag) else ""
-
-    if isinstance(exporter_version_content, str) and not exporter_version_content.startswith("Evernote"):
-        return False
-    clipper_source_meta = html_soup.select_one('html > head > meta[name="source-application"]')
-    clipper_source_content = clipper_source_meta.get('content', "") if isinstance(clipper_source_meta, Tag) else ""
-    if isinstance(clipper_source_content, str) and clipper_source_content.endswith("evernote"):
+    meta_source_application = html_soup.select_one('html > head > meta[name="source-application"]')
+    source_application = meta_source_application.get('content', "") if isinstance(meta_source_application, Tag) else ""
+    if isinstance(source_application, str) and source_application.endswith("evernote"):
         return True
-    if isinstance(clipper_source_content, str) and clipper_source_content in ["微信",]:
+    if isinstance(source_application, str) and source_application in ["微信",]:
+        return True
+    return False
+
+
+"""
+<meta name="content-class" content="yinxiang.markdown" />
+"""
+def _is_yinxiang_markdown_html(html_soup):
+    meta_content_class = html_soup.select_one('html > head > meta[name="content-class"]')
+    content_class = meta_content_class.get('content', "") if isinstance(meta_content_class, Tag) else ""
+    if isinstance(content_class, str) and content_class.endswith("markdown"):
         return True
     return False
 
 
 def _infer_input_type(html_content):
     soup = BeautifulSoup(html_content, 'html.parser')
-    if _is_yinxiang_clipper_html(soup):
-        return YinXiangClipper_Type
-    elif _is_yinxiang_export_html(soup):
-        return YinXiang_Type
+    exporter_version_meta = soup.select_one('html > head > meta[name="exporter-version"]')
+    exporter_version_content = exporter_version_meta.get(
+        'content', "") if isinstance(
+        exporter_version_meta, Tag) else ""
+
+    # yinxiang export
+    if isinstance(exporter_version_content, str) and exporter_version_content.startswith("Evernote"):
+        if _is_yinxiang_markdown_html(soup):
+            return YinXiangMarkdown_Type
+        if _is_yinxiang_clipper_html(soup):
+            return YinXiangClipper_Type
+        elif _is_yinxiang_export_html(soup):
+            return YinXiang_Type
+
     return Default_Type
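
Note on the new detection order: the commit hoists the `exporter-version` check into `_infer_input_type` and then tests markdown, clipper, and plain export in that order. The sketch below mirrors that order using only the standard library's `html.parser` instead of BeautifulSoup, so it ignores the `html > head` nesting the real selectors require. `MetaCollector`, `infer_input_type`, and the returned strings other than `"clipper.yinxiang"` are placeholders invented for illustration. The final `return True` of `_is_yinxiang_export_html` is not visible in the hunk and is assumed here.

```python
from html.parser import HTMLParser


class MetaCollector(HTMLParser):
    """Hypothetical helper: collect <meta name=... content=...> pairs."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        # Self-closing <meta ... /> tags are also routed here by HTMLParser.
        if tag == "meta":
            attr = dict(attrs)
            name = attr.get("name")
            if name:
                self.meta[name] = attr.get("content") or ""


def infer_input_type(html_content):
    collector = MetaCollector()
    collector.feed(html_content)
    meta = collector.meta

    # Gate: only Evernote/Yinxiang exports reach the specific checks.
    if not meta.get("exporter-version", "").startswith("Evernote"):
        return "default"  # placeholder for Default_Type
    # 1. Markdown notes are checked first.
    if meta.get("content-class", "").endswith("markdown"):
        return "markdown.yinxiang"  # placeholder for YinXiangMarkdown_Type
    # 2. Clipper notes.
    source_app = meta.get("source-application", "")
    if source_app.endswith("evernote") or source_app in ["微信"]:
        return "clipper.yinxiang"
    # 3. Plain export (assumes the truncated function returns True on a prefix hit).
    if meta.get("source", "").startswith(("yinxiang", "desktop", "mobile")):
        return "yinxiang"  # placeholder for YinXiang_Type
    return "default"


EXPORTER = '<meta name="exporter-version" content="Evernote Mac 9.6"/>'
md_html = f'<html><head>{EXPORTER}<meta name="content-class" content="yinxiang.markdown"/></head></html>'
clip_html = f'<html><head>{EXPORTER}<meta name="source-application" content="微信"/></head></html>'
export_html = f'<html><head>{EXPORTER}<meta name="source" content="mobile.android"/></head></html>'
bare_html = '<html><head><meta name="content-class" content="yinxiang.markdown"/></head></html>'

for doc in (md_html, clip_html, export_html, bare_html):
    print(infer_input_type(doc))
```

The last input shows the effect of the hoisted gate: a `content-class` of `yinxiang.markdown` without an Evernote `exporter-version` falls through to the default type.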
 
 
html2notion/translate/html2json_base.py

Lines changed: 34 additions & 5 deletions
@@ -16,6 +16,7 @@ class Block(Enum):
     DIVIDER = "divider"
     TABLE = "table"
     TO_DO = "to_do"
+    EQUATION = "equation"
 
 class Html2JsonBase:
     _registry = {}
@@ -28,6 +29,17 @@ class Html2JsonBase:
         "color": str,
     }
 
+    _language = {"abap", "agda", "arduino",
+                 "assembly", "bash", "basic", "bnf", "c", "c#", "c++", "clojure", "coffeescript", "coq", "css",
+                 "dart", "dhall", "diff", "docker", "ebnf", "elixir", "elm", "erlang", "f#", "flow", "fortran",
+                 "gherkin", "glsl", "go", "graphql", "groovy", "haskell", "html", "idris", "java", "javascript",
+                 "json", "julia", "kotlin", "latex", "less", "lisp", "livescript", "llvm ir", "lua", "makefile",
+                 "markdown", "markup", "matlab", "mathematica", "mermaid", "nix", "objective-c", "ocaml", "pascal",
+                 "perl", "php", "plain text", "powershell", "prolog", "protobuf", "purescript", "python", "r",
+                 "racket", "reason", "ruby", "rust", "sass", "scala", "scheme", "scss", "shell", "solidity", "sql",
+                 "swift", "toml", "typescript", "vb.net", "verilog", "vhdl", "visual basic", "webassembly", "xml",
+                 "yaml", "java/c/c++/c#"}
+
     _color_tuple = namedtuple("Color", "name r g b")
     _notion_color = [
         _color_tuple("default", 0, 0, 0),
@@ -92,11 +104,7 @@ def extract_text_and_parents(tag: PageElement, parents=[]):
     @staticmethod
     def parse_one_style(tag_soup: Tag, text_params: dict):
         tag_name = tag_soup.name.lower()
-        style = tag_soup.get('style', "")
-        styles = {}
-        if str and isinstance(style, str):
-            styles = {rule.split(':')[0].strip(): rule.split(':')[1].strip() for rule in style.split(';') if rule}
-
+        styles = Html2JsonBase.get_tag_style(tag_soup)
         if Html2JsonBase.is_bold(tag_name, styles):
             text_params["bold"] = True
         if Html2JsonBase.is_italic(tag_name, styles):
@@ -456,6 +464,27 @@ def convert_table(self, soup):
         }
         return table_obj
 
+    # Only if there is no ";" in the value of the attribute, you can use this method to get all attributes.
+    # Can't use this way like: background-image: url('data:image/png;base64...')
+    @staticmethod
+    def get_tag_style(tag_soup):
+        style = tag_soup.get('style', "")
+        styles = {}
+        if str and isinstance(style, str):
+            # style = ''.join(style.split())
+            styles = {
+                rule.split(':')[0].strip(): rule.split(':')[1].strip().lower()
+                for rule in style.split(';')
+                if rule and len(rule.split(':')) > 1
+            }
+        return styles
+
+    @staticmethod
+    def get_valid_language(language):
+        if language in Html2JsonBase._language:
+            return language
+        return "plain text"
+
     @classmethod
     def register(cls, input_type, subclass):
         cls._registry[input_type] = subclass
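
Note on `get_tag_style` and `get_valid_language`: the sketch below restates both helpers as standalone functions that take plain strings, so their behavior can be checked without BeautifulSoup. `parse_style` and `NOTION_LANGUAGES` are illustrative names, and the language set is a small subset of `_language` chosen for brevity. It shows the new guard against rules without a colon, the lowercasing of values, and the limitation the code comment warns about. In the committed code the condition `if str and isinstance(style, str)` tests the built-in `str` type, which is always truthy, so only the `isinstance` check has any effect; the sketch keeps just that check.

```python
def parse_style(style):
    """Standalone variant of get_tag_style taking the raw style string."""
    if not isinstance(style, str):
        return {}
    return {
        rule.split(':')[0].strip(): rule.split(':')[1].strip().lower()
        for rule in style.split(';')
        if rule and len(rule.split(':')) > 1  # skip rules without a colon
    }


# Assumption: a small subset of the _language set from the diff.
NOTION_LANGUAGES = {"python", "c++", "mermaid", "plain text"}


def get_valid_language(language):
    # Exact, case-sensitive membership test, as in the committed helper.
    return language if language in NOTION_LANGUAGES else "plain text"


normal = parse_style("font-weight: BOLD; color: #FF0000;")
malformed = parse_style("bold; color:red")
data_url = parse_style("background-image: url('data:image/png;base64,AAA')")

print(normal)      # values are lowercased
print(malformed)   # the colon-less rule is dropped instead of raising IndexError
print(data_url)    # value is truncated at the first ':' inside url(...)
print(get_valid_language("mermaid"), "|", get_valid_language("math"))
```

Values containing `:` are cut at the second colon as well as at `;`, so URLs in style values are not preserved by this approach. Language tags such as `math` or `chart` from the demo note are not in the set and fall back to `plain text`.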

html2notion/translate/html2json_clipper.py

Lines changed: 2 additions & 2 deletions
@@ -6,7 +6,7 @@
 YinXiangClipper_Type = "clipper.yinxiang"
 
 
-class Html2JsonYinXiang(Html2JsonBase):
+class Html2JsonClipper(Html2JsonBase):
     input_type = YinXiangClipper_Type
 
     def __init__(self, html_content, import_stat):
@@ -158,4 +158,4 @@ def _check_is_block(self, element):
         return False
 
 
-Html2JsonBase.register(YinXiangClipper_Type, Html2JsonYinXiang)
+Html2JsonBase.register(YinXiangClipper_Type, Html2JsonClipper)
