ferventdesert · sravya1994 · Jul 7, 2016 · Jul 7, 2016 · Jul 8, 2016 · Jul 8, 2016
diff --git a/.gitignore b/.gitignore
@@ -1,2 +1,18 @@
 *.pyc
-.idea
+.idea
+*.ipynb
+test
+ipynb
+etlpy.egg-info
+dist
+data
+EGG-INFO
+.vscode
+.ipynb_checkpoints
+etlpy.egg-info
+.DS_Store
+__pycache__/
+etlpy/__pycache__/
+etlpy/pinhole.log
+insurance.json
+pinhole.log
diff --git a/README.md b/README.md
@@ -1,32 +1,31 @@
-# etlpy
-##designed by desert
-a smart stream-like crawler & etl python library
 
-##1. Introduction
-etlpy is a configuration-file-driven tool for data collection and cleaning.  
+# etlpy: a streaming crawler system written in Python
 
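-Writing crawler and data-cleaning code is always tedious, so that code should be generated by a tool. etlpy was created to solve exactly this problem.  
+## Introduction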
 
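-A visual, graphical design tool quickly generates the crawler and data-cleaning workflow and saves it as an XML file; the etlpy engine then parses that file to produce the final data.
+etlpy is a pure-Python library that implements a streaming DSL (domain-specific language). A single line can handle crawling, file processing, and data cleaning, and it integrates fully with libraries such as pandas.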
 
-##2. Usage
-It is very simple to use:
+It closely resembles the bash pipeline in Linux, Linq in C#, and Hawk, which was developed by the same author.
+
+The following one-line pipeline fetches the HTML of pages 1 to 10 of cnblogs (博客园):
 ```
-from etl import ETLTool
-tool = ETLTool();
-tool.LoadProject('project.xml', '数据清洗ETL-大众点评');
-datas = tool.RefreshDatas();
-for r in datas:
-  print(r)
+from etlpy.etlpy import *
+t= task().p.create(range(1,11)).cp('p:html').format('http://www.cnblogs.com/p{_}').get()
+# t.to_df() builds a DataFrame
+for data in t:
+    print(data)
+
 ```
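-RefreshDatas returns a generator; iterating over it with a for loop reads all the data automatically.
+Replace t above with the statement below, and the automatic detection algorithm analyzes the page structure and generates a parsing script: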
+
+`t=task().create().url.set('http://www.cnblogs.com').get().tree().detect()`
 
-##3. How it works
-Modules are divided into generate, filter, sort, transform, and execute types.  
 
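-Python generators chain the modules into a pipeline, along which the data (Python dicts) is processed and consumed.  
+The one-line pipeline above generates the numbers 1 to 10 in column p, copies column p to column html, formats column html into a URL, sends the web request, and stores the resulting HTML body in column html.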
 
-The graphical tool is written in C# and uses Linq, which resembles Python generators. The original idea comes from Lisp s-expressions.
+etlpy features:
 
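-##4. Use cases
-Crawling, computation, cleaning: any data task that fits a certain computational paradigm can be done with it.
+- Supports both Python 2 and Python 3
+- Built-in proxy support and HTTP GET/POST requests, with syntax very similar to the requests library
+- Built-in data-cleaning functions such as regex parsing, HTML escaping, and JSON conversion, with direct output
+- Tasks can easily be parallelized with coroutines, threads, processes, or multi-machine distribution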
diff --git a/batch.sh b/batch.sh
@@ -0,0 +1,4 @@
+for((i=0; i<$1; ++i))  
+do
+ nohup  python src/distributed.py client $2 &
+done  
diff --git a/distributed.py b/distributed.py
diff --git a/docs/1.0综述.md b/docs/1.0综述.md
@@ -0,0 +1,65 @@
+# etlpy: A streaming DSL in Python
+
+## Intro
+
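+etlpy is a function library written in pure Python that implements a streaming DSL. A single line of code can handle complex web crawling, file processing, and data filtering, and it integrates with libraries such as pandas and requests. Operations are purely chained, so the code stays minimal.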
+
+etlpy是纯Python开发的函数库，实现流式DSL(领域特定语言)，能一行内完成爬虫，文件处理和数据清洗等。能和pandas等类库充分集成。纯链式操作，代码极简。
+
+The design philosophy comes from:
+- bash pipelines in Linux
+- Linq in C#
+- the filter system in jinja2 (a template engine)
+- Flink and Blink
+- Hawk, by the same author
+
+它和linux的bash pipeline,C#的Linq, jinja2的过滤器(filter)以及作者本人开发的Hawk有高度的相似性。
+
+The following code fetches the HTML of pages 1 to 10 of the cnblogs website:
+
+下面一行代码实现了获取博客园第1到10页的所有html:
+```
+from etlpy import *
+t= task().p.create(range(1,11)).cp('p:html').format('http://www.cnblogs.com/p{}').get()
+
+for data in t:
+    print(data)
+
+```
+
+It generates the numbers 1 to 10 in column p, copies column p to column html, formats column html into a URL, sends a web request to that URL, and stores the returned HTML in column html.
+
+Finally, iterate over t to read all the data.
+
+意思是指：在p列生成从1到10的数，拷贝p列到html列，将html列合并为url,并发送web请求，最后的html正文保存在html列
+
+etlpy supports:
+- Python 2 and 3
+- HTTP proxies and GET/POST requests, with syntax very similar to the well-known Python requests library
+- regex parsing, filtering, and HTML formatting and cleaning
+- running in parallel mode without modifying the code
+
+etlpy的特性有：
+
+- 同时支持python2和python3
+- 内置方便的代理，http get/post请求，写法与requests库非常相似
+- 内置正则解析，html转义，json转换等数据清洗功能，直接输出
+- 能方便地将任务按照协程，线程，进程，和多机分布式的方式进行任务并行
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
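Note on the pipeline described in docs/1.0综述.md: the column steps (generate column p, copy it to html, format html into a URL) can be illustrated with plain Python generators. This is only a sketch of the idea, not etlpy's implementation; the helper names `create`, `copy_column`, `format_column`, and `build_pipeline` are hypothetical, and the final web request step is deliberately left out so the example runs offline.

```python
def create(values, column):
    """Yield one dict per value, stored under `column` (hypothetical helper)."""
    for value in values:
        yield {column: value}


def copy_column(rows, source, target):
    """Copy the `source` column into `target` for every row."""
    for row in rows:
        new_row = dict(row)
        new_row[target] = row[source]
        yield new_row


def format_column(rows, column, template):
    """Replace `column` with `template` formatted by its current value."""
    for row in rows:
        new_row = dict(row)
        new_row[column] = template.format(row[column])
        yield new_row


def build_pipeline():
    """Chain the stages lazily; nothing runs until the result is iterated."""
    rows = create(range(1, 11), "p")  # pages 1 to 10 inclusive
    rows = copy_column(rows, "p", "html")
    rows = format_column(rows, "html", "http://www.cnblogs.com/p{}")
    return rows


rows = list(build_pipeline())
```

Each stage consumes and yields dicts, so a request stage could be appended in the same style without changing the earlier ones.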