【Old Blog Links】

  CSDN blog homepage
  【Pure technical exchange】Batch-downloading WeChat Official Account history articles with Python (Part 1)
  (Some time after it was published it vanished for no clear reason, so it is re-archived here.)
  Batch-downloading WeChat Official Account history articles with Python (Part 2)

【Foreword】

A few days ago a colleague and I were discussing what makes someone a real talent. Skipping the details, it came down to one sentence: a talent is someone who can solve problems. I enjoy making friends, many of whom have helped me when I was in trouble, and I hope to be able to help them in return and be the kind of person who can solve their problems. So I never turn down a friend's request, and whatever is within my power I do my best to get done.

That attitude has dug me quite a few holes. The Xuexi Qiangguo auto-study tool I wrote a while ago has reached a first rough version that can read articles, watch videos, and share automatically, but it cannot answer the quizzes yet; my friend says it is already good enough, though I still hope to finish that part when I find the time. The friend who wanted to read Huang Yi's wuxia novels has been taken care of too; the formatting of some downloaded chapters is a bit messy, but it meets his needs, and good enough is good enough. Learning Python has gone the same way: I have never studied it systematically for lack of time and simply learn by solving whatever problem comes up, which has been good enough so far. Many of my posts carry part numbers because they were planned as multi-part series; I hope that one day I will have filled in every one of those holes.

【Background】

Yesterday a friend and I went out fishing, and on the way we got to talking about stocks. His return so far this year is over 40%. When I asked how he trades, he said he is a data guy who spends a lot of time on financial statements, and he recommended a few WeChat Official Accounts. He also came with a request: some of those accounts have so many historical articles that every time he wants to re-read an old one he has to scroll through the whole list from the top again. Could I download the articles to his computer so he could read them whenever he liked? It didn't sound hard, so I agreed to give it a try.

【Goal】

  The first version only aims to download the history articles of the account my friend specified and convert them into PDF for comfortable reading. Quite a bit of manual work is still involved, so call it a semi-automatic version.
  Later versions will gradually remove the manual steps, store the articles in a database, support searching, and so on.

【Approach】

  Unlike scraping a novel from an ordinary website, Official Account content lives inside the WeChat app, so you cannot simply open the page in Chrome and read the source. A third-party packet-capture tool is needed; the two mentioned most often online are Fiddler and AnyProxy, and since Fiddler is the simpler one, that is what I chose. The plan is as follows (a code preview appears right after this list):
  1. Capture the target account's traffic with Fiddler (a manual step);
  2. Parse the captures with Python to obtain the JSON holding the history-article list;
  3. Parse that JSON with Python to extract every article's information, including title and URL;
  4. Loop over the articles in Python and save each one locally as HTML;
  5. Save every image referenced in an article to disk as well;
  6. Rewrite the image links inside the local HTML files so they point at the locally stored images (saving the articles alone is not enough: you will find that none of the images in the saved HTML display);
  7. Call the wkhtmltopdf tool from Python to convert the HTML into PDF.
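
  In code, the whole flow boils down to three calls. This is only a preview that strings together the functions defined in the 【Coding】 section below; the real entry point, with command-line dispatch, appears in step 13.

# Preview: the pipeline in three calls, using the functions defined later in this post
cfg = get_config()                            # step 11: read the folder paths from config/wechat.cfg
down_html(cfg['jsonDir'], cfg['htmlDir'])     # steps 2-6: parse the captured JSON, save HTML and images locally
conv_html_pdf(cfg['htmlDir'], cfg['pdfDir'])  # step 7: convert the saved HTML files to PDF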

【Tools】

  1. Packet-capture tool: Fiddler;
  2. HTML-to-PDF tool: wkhtmltopdf;
  3. Python 3.7 and the required packages;
  4. WeChat for Windows, an emulator, or simply a phone (with a phone or an emulator, its proxy has to be pointed at Fiddler).

【Preparation: Packet Capture】

  Some accounts' traffic is a little unusual and needs special handling. The walkthrough below uses the account my friend asked for (I won't name it, to avoid looking like an advertisement).
  1. Download and install Fiddler. Because everything in an Official Account is served over HTTPS, Fiddler has to be configured to decrypt HTTPS traffic; several dialogs will pop up along the way, and confirming all of them is fine.

  2. Open WeChat for Windows, search for the target account, and open its history-article list. Look at the requests Fiddler captures and note what they have in common.

  3. To keep unrelated traffic out of the way, configure filter rules that drop the URLs you don't need.

  4. Clear Fiddler's capture log, open the history list again, and keep scrolling down until every article has been loaded.

  5. Save all of the filtered sessions to disk in raw form.
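
  Saving the sessions as raw files yields a folder of response bodies; the parser in step 8 below assumes each usable file is plain JSON containing a general_msg_list field. A quick sanity check of the export (a sketch of my own, using the dump path from the configuration file shown later) might look like this:

import os, json

def check_dump(dump_dir):
    """Report which exported Fiddler files look like history-list responses."""
    for name in os.listdir(dump_dir):
        path = os.path.join(dump_dir, name)
        try:
            with open(path, "r", encoding="utf-8") as f:
                ok = "general_msg_list" in json.loads(f.read())
        except (ValueError, UnicodeDecodeError):
            ok = False  # not valid JSON, or not a text file at all
        print(("OK  " if ok else "skip") + " " + name)

check_dump("./download/wechat/fiddler-raw/Dump-0902-14-43-38/")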

【Coding】

  1. Standard file header and the required packages

# _*_ coding:utf-8 _*_
import os,sys
import requests
import json
import subprocess
import re
import random
import time
from bs4 import BeautifulSoup
from datetime import datetime,timedelta
from time import sleep

  2. Define an article class to hold each article's basic information

class ArticleInfo():
    def __init__(self, url, title, idx_num, atc_datetime):  # idx_num makes it easy to name the saved images
        self.url = url
        self.title = title
        self.idx_num = idx_num
        self.atc_datetime = atc_datetime  # publish time, used for sorting and file naming

  3. Define a method that reads a file's contents (general-purpose)

def read_file(file_path):
    with open(file_path, "r", encoding="utf-8") as f:
        file_content = f.read()
    return file_content

  4. Define a method that writes content to a file (general-purpose)

def save_file(file_path, file_content):
    with open(file_path, "w", encoding="utf-8") as f:
        f.write(file_content)

  5. Define a method that downloads a page's HTML (general-purpose)

def get_html(url):
    headers = {
        "Accept": "*/*",
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.116 Safari/537.36 QBCore/4.0.1219.400 QQBrowser/9.0.2524.400 Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36 MicroMessenger/6.5.2.501 NetType/WIFI WindowsWechat",
        "Accept-Encoding": "gzip, deflate",
        "Accept-Language": "zh-CN,zh;q=0.8,en-US;q=0.6,en;q=0.5;q=0.4",
        "Connection": "keep-alive"
    }
    response = requests.get(url, headers=headers, proxies=None)
    if response.status_code == 200:
        htmltxt = response.text  # the page body that was returned
        return htmltxt
    else:
        return None
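
  One caveat: get_html sets no timeout, so a single stalled connection can hang the whole batch. A hedged variant (my own sketch, not part of the original script; get_html_with_retry is a made-up name) bounds each request and retries a couple of times:

def get_html_with_retry(url, retries=3, timeout=15):
    """Like get_html, but each request is bounded by a timeout and retried on failure."""
    headers = {"Accept": "*/*", "User-Agent": "Mozilla/5.0"}  # trimmed; reuse the full header dict from get_html in practice
    for attempt in range(retries):
        try:
            response = requests.get(url, headers=headers, proxies=None, timeout=timeout)
            if response.status_code == 200:
                return response.text
        except requests.RequestException as exc:
            print("attempt", attempt + 1, "failed:", exc)
        sleep(round(random.uniform(1, 3), 2))
    return None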

  6. Define a method that downloads and saves a web image (general-purpose)

def get_save_image(url, img_file_path):
    headers = {
        "Accept": "*/*",
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.116 Safari/537.36 QBCore/4.0.1219.400 QQBrowser/9.0.2524.400 Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36 MicroMessenger/6.5.2.501 NetType/WIFI WindowsWechat",
        "Accept-Encoding": "gzip, deflate",
        "Accept-Language": "zh-CN,zh;q=0.8,en-US;q=0.6,en;q=0.5;q=0.4",
        "Connection": "keep-alive"
    }
    response = requests.get(url, headers=headers, proxies=None)
    with open(img_file_path, "wb") as f:
        f.write(response.content)

  7. Define the driver method that downloads the HTML and images (the main program calls this to start the download)

def down_html(json_path, html_path):
    if not os.path.exists(html_path):
        os.makedirs(html_path)  # create the folder for the saved HTML files
    local_img_path = os.path.join(html_path, "images")
    if not os.path.lexists(local_img_path):
        os.makedirs(local_img_path)  # create the folder for the locally saved images
    article_list = get_article_list(json_path)
    article_list.sort(key=lambda x: x.atc_datetime, reverse=True)  # newest articles first
    tot_article = len(article_list)  # total number of articles
    i = 0  # progress counter
    for atc in article_list:
        i += 1
        atc_unique_name = str(atc.atc_datetime) + "_" + str(atc.idx_num)  # time + index uniquely identifies articles published at the same moment
        html_name = atc_unique_name + ".html"
        html_file_path = os.path.join(html_path, html_name)
        print(i, "of", tot_article, atc_unique_name, atc.title)
        if os.path.exists(html_file_path):  # allows an interrupted download to be resumed
            print("{} existed already!".format(html_file_path))
            continue
        org_atc_html = get_html(atc.url)
        new_atc_html = rep_image(org_atc_html, local_img_path, html_name)
        save_file(html_file_path, new_atc_html)
        sleep(round(random.uniform(1, 3), 2))
        """for test
        if i > 0:
            break
        """

  8. Inspect the structure of the JSON files saved by Fiddler and define the method that parses them into the list of articles

def get_article_list(json_path):
    """
    Build a list with every article's information from the captured JSON files.
    """
    file_list = os.listdir(json_path)  # json_path is the folder exported from Fiddler
    article_list = []  # holds every article found
    for file in file_list:
        file_path = os.path.join(json_path, file)
        file_cont = read_file(file_path)
        json_cont = json.loads(file_cont)
        general_msg_list = json_cont['general_msg_list']
        json_list = json.loads(general_msg_list)
        #print(json_list['list'][0]['comm_msg_info']['datetime'])
        for lst in json_list['list']:
            atc_idx = 0  # several articles can share one publish time; the index keeps image names unique
            seconds_datetime = lst['comm_msg_info']['datetime']
            atc_datetime = seconds_to_time(seconds_datetime)
            if lst['comm_msg_info']['type'] == 49:  # type 49 is an ordinary image-and-text post
                atc_idx += 1
                url = lst['app_msg_ext_info']['content_url']
                title = lst['app_msg_ext_info']['title']
                atc_info = ArticleInfo(url, title, atc_idx, atc_datetime)
                article_list.append(atc_info)
                if 1 == lst['app_msg_ext_info']['is_multi']:  # several articles published in one batch
                    multi_app_msg_item_list = lst['app_msg_ext_info']['multi_app_msg_item_list']
                    for multi in multi_app_msg_item_list:
                        atc_idx += 1
                        url = multi['content_url']
                        title = multi['title']
                        mul_act_info = ArticleInfo(url, title, atc_idx, atc_datetime)
                        article_list.append(mul_act_info)
    return article_list
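
  For reference, the structure this parser relies on looks roughly like the skeleton below once the general_msg_list string has been decoded a second time (that inner decode is why json.loads is called twice). The skeleton only shows the keys used above; real captures carry many more fields.

# Decoded content of general_msg_list (skeleton reconstructed from the keys used above)
general_msg_list_decoded = {
    "list": [
        {
            "comm_msg_info": {
                "datetime": 1567404218,   # publish time, in seconds since 1970-01-01 00:00:00
                "type": 49                # 49 = an ordinary image-and-text post
            },
            "app_msg_ext_info": {
                "title": "...",
                "content_url": "...",
                "is_multi": 1,            # 1 when several articles were published in one batch
                "multi_app_msg_item_list": [
                    {"title": "...", "content_url": "..."}
                ]
            }
        }
    ]
}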

  9. Define the method that rewrites each image's src in the HTML to point at the local copy (without this, the images in the saved HTML will not display)

def chg_img_link(bs_html):
    link_list = bs_html.findAll("link")
    for link in link_list:
        href = link.attrs["href"]
        if href.startswith("//"):
            new_href = "http:" + href
            link.attrs["href"] = new_href

def rep_image(org_html, local_img_path, html_name):
    bs_html = BeautifulSoup(org_html, "lxml")
    img_list = bs_html.findAll("img")
    img_idx = 0  # counter, also used for naming
    for img in img_list:
        img_idx += 1
        org_url = ""  # the image's real address
        if "data-src" in img.attrs:  # <img data-src="..."
            org_url = img.attrs['data-src']
        elif "src" in img.attrs:  # <img src="..."
            org_url = img.attrs['src']
        if org_url.startswith("//"):
            org_url = "http:" + org_url
        if len(org_url) > 0:
            print("download image ", img_idx)
            if "data-type" in img.attrs:
                img_type = img.attrs["data-type"]
            else:
                img_type = "png"
            img_name = html_name + "_" + str(img_idx) + "." + img_type
            img_file_path = os.path.join(local_img_path, img_name)
            get_save_image(org_url, img_file_path)  # download and save the image
            img.attrs["src"] = "images/" + img_name
        else:
            img.attrs["src"] = ""
    chg_img_link(bs_html)
    return str(bs_html)

  10. Define the methods that convert the HTML files to PDF

def conv_html_pdf(html_path, pdf_path):
    if not os.path.exists(pdf_path):
        os.makedirs(pdf_path)
    f_list = os.listdir(html_path)
    for f in f_list:
        if (not f[-5:] == ".html") or ("tmp" in f):  # skip non-html files and the temporary files
            continue
        html_file_path = os.path.join(html_path, f)
        html_tmp_file = html_file_path[:-5] + "_tmp.html"  # temporary file used for the PDF conversion
        html_str = read_file(html_file_path)
        bs_html = BeautifulSoup(html_str, "lxml")
        pdf_title = ""
        title_tag = bs_html.find(id="activity-name")
        if title_tag is not None:
            pdf_title = "_" + title_tag.get_text().replace(" ", "").replace(" ", "").replace("\n", "")
        print(pdf_title)
        r_idx = html_file_path.rindex("/") + 1
        pdf_name = html_file_path[r_idx:-5] + pdf_title
        pdf_file_path = os.path.join(pdf_path, pdf_name + ".pdf")
        """
        Strip elements the PDF does not need, which speeds up the conversion
        """
        [s.extract() for s in bs_html(["script", "iframe", "link"])]
        save_file(html_tmp_file, str(bs_html))
        call_wkhtmltopdf(html_tmp_file, pdf_file_path)

def call_wkhtmltopdf(html_file_path, pdf_file_path, skipExists=True, removehtml=True):
    if skipExists and os.path.exists(pdf_file_path):
        print("pdf_file_path already existed!")
        if removehtml:
            os.remove(html_file_path)
        return
    exe_path = cfg['wkhtmltopdf']  # path to wkhtmltopdf.exe
    cmd_list = []
    cmd_list.append(" --load-error-handling ignore ")
    cmd_list.append(" " + html_file_path + " ")
    cmd_list.append(" " + pdf_file_path + " ")
    cmd_str = exe_path + "".join(cmd_list)
    print(cmd_str)
    subprocess.check_call(cmd_str, shell=False)
    if removehtml:
        os.remove(html_file_path)
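
  A side note on the subprocess call: passing the whole command as one string with shell=False works on Windows, where this script drives wkhtmltopdf.exe, but on Linux or macOS subprocess expects an argument list in that case. A more portable variant (a sketch of my own, not the original code) would be:

def call_wkhtmltopdf_portable(html_file_path, pdf_file_path):
    """Cross-platform variant: pass the arguments as a list, so no shell parsing is involved."""
    exe_path = cfg['wkhtmltopdf']  # path to the wkhtmltopdf executable, taken from the config file
    cmd = [exe_path, "--load-error-handling", "ignore", html_file_path, pdf_file_path]
    print(" ".join(cmd))
    subprocess.check_call(cmd)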

  11. Define a method that reads the configuration file holding the various folder paths

def get_config():
    cfg_file = read_file("config/wechat.cfg")
    cfg_file = cfg_file.replace("\\\\", "/").replace("\\", "/")  # turn Windows backslashes in the paths into forward slashes so json.loads and the path handling do not choke
    cfg_json = json.loads(cfg_file)
    return cfg_json

  【Important】 The configuration file sits in the config subdirectory of the working directory, is named wechat.cfg, and contains the following:

{
    "jsonDir": "./download/wechat/fiddler-raw/Dump-0902-14-43-38/",
    "htmlDir": "./download/wechat/html/",
    "pdfDir": "./download/wechat/pdf/",
    "wkhtmltopdf": "./wkhtmltopdf.exe"
}

  12. Define a method that turns a seconds count into a timestamp (the Official Account's datetime field stores the number of seconds from 1970-01-01 00:00:00 to the moment of publication)

def seconds_to_time(seconds):
    time_array = time.localtime(seconds)  # seconds from 1970-01-01 00:00:00 to the publish time
    other_style_time = time.strftime("%Y-%m-%d %H:%M:%S", time_array)
    date_time = datetime.strptime(other_style_time, "%Y-%m-%d %H:%M:%S")
    return str(date_time).replace("-", "").replace(":", "").replace(" ", "")
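
  The same formatting can be done in one step with datetime.fromtimestamp and a compact format string; the sketch below is an equivalent shortcut, not the code the script actually uses.

def seconds_to_time_short(seconds):
    """Equivalent shortcut: epoch seconds -> local time formatted as YYYYMMDDHHMMSS."""
    return datetime.fromtimestamp(seconds).strftime("%Y%m%d%H%M%S")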

  13. Define the main entry point: with no argument, or the argument html, it downloads; with the argument pdf it converts the saved HTML to PDF

cfg = get_config()  # global variable holding the configuration file contents
#get_article_list("./tmp/") # for test
#down_html("./tmp/","./html/")# for test

if __name__ == "__main__":

    if len(sys.argv) == 1:
        arg = None
    else:
        arg = sys.argv[1]
    if arg is None or arg == "html":
        down_html(cfg['jsonDir'], cfg['htmlDir'])
    elif arg == "pdf":
        conv_html_pdf(cfg['htmlDir'], cfg['pdfDir'])
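
  Assuming the script is saved as, say, wechat_down.py (the file name is my placeholder, not from the original post), a full run is two passes: first "python wechat_down.py" (or "python wechat_down.py html") to parse the captures and download the HTML and images, then "python wechat_down.py pdf" to convert everything that was saved into PDF files.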

【Result】


【Complete Code】

# _*_ coding:utf-8 _*_
import os,sys
import requests
import json
import subprocess
import re
import random
import time
from bs4 import BeautifulSoup
from datetime import datetime,timedelta
from time import sleep

class ArticleInfo():
    def __init__(self, url, title, idx_num, atc_datetime):  # idx_num makes it easy to name the saved images
        self.url = url
        self.title = title
        self.idx_num = idx_num
        self.atc_datetime = atc_datetime  # publish time, used for sorting and file naming

def read_file(file_path):
    with open(file_path, "r", encoding="utf-8") as f:
        file_content = f.read()
    return file_content

def save_file(file_path, file_content):
    with open(file_path, "w", encoding="utf-8") as f:
        f.write(file_content)

def get_html(url):
    headers = {
        "Accept": "*/*",
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.116 Safari/537.36 QBCore/4.0.1219.400 QQBrowser/9.0.2524.400 Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36 MicroMessenger/6.5.2.501 NetType/WIFI WindowsWechat",
        "Accept-Encoding": "gzip, deflate",
        "Accept-Language": "zh-CN,zh;q=0.8,en-US;q=0.6,en;q=0.5;q=0.4",
        "Connection": "keep-alive"
    }
    response = requests.get(url, headers=headers, proxies=None)
    if response.status_code == 200:
        htmltxt = response.text  # the page body that was returned
        return htmltxt
    else:
        return None

def get_save_image(url, img_file_path):
    headers = {
        "Accept": "*/*",
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.116 Safari/537.36 QBCore/4.0.1219.400 QQBrowser/9.0.2524.400 Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36 MicroMessenger/6.5.2.501 NetType/WIFI WindowsWechat",
        "Accept-Encoding": "gzip, deflate",
        "Accept-Language": "zh-CN,zh;q=0.8,en-US;q=0.6,en;q=0.5;q=0.4",
        "Connection": "keep-alive"
    }
    response = requests.get(url, headers=headers, proxies=None)
    with open(img_file_path, "wb") as f:
        f.write(response.content)

def get_article_list(json_path):
    """
    Build a list with every article's information from the captured JSON files.
    """
    file_list = os.listdir(json_path)  # json_path is the folder exported from Fiddler
    article_list = []  # holds every article found
    for file in file_list:
        file_path = os.path.join(json_path, file)
        file_cont = read_file(file_path)
        json_cont = json.loads(file_cont)
        general_msg_list = json_cont['general_msg_list']
        json_list = json.loads(general_msg_list)
        #print(json_list['list'][0]['comm_msg_info']['datetime'])
        for lst in json_list['list']:
            atc_idx = 0  # several articles can share one publish time; the index keeps image names unique
            seconds_datetime = lst['comm_msg_info']['datetime']
            atc_datetime = seconds_to_time(seconds_datetime)
            if lst['comm_msg_info']['type'] == 49:  # type 49 is an ordinary image-and-text post
                atc_idx += 1
                url = lst['app_msg_ext_info']['content_url']
                title = lst['app_msg_ext_info']['title']
                atc_info = ArticleInfo(url, title, atc_idx, atc_datetime)
                article_list.append(atc_info)
                if 1 == lst['app_msg_ext_info']['is_multi']:  # several articles published in one batch
                    multi_app_msg_item_list = lst['app_msg_ext_info']['multi_app_msg_item_list']
                    for multi in multi_app_msg_item_list:
                        atc_idx += 1
                        url = multi['content_url']
                        title = multi['title']
                        mul_act_info = ArticleInfo(url, title, atc_idx, atc_datetime)
                        article_list.append(mul_act_info)
    return article_list

def chg_img_link(bs_html):
    link_list = bs_html.findAll("link")
    for link in link_list:
        href = link.attrs["href"]
        if href.startswith("//"):
            new_href = "http:" + href
            link.attrs["href"] = new_href

def rep_image(org_html, local_img_path, html_name):
    bs_html = BeautifulSoup(org_html, "lxml")
    img_list = bs_html.findAll("img")
    img_idx = 0  # counter, also used for naming
    for img in img_list:
        img_idx += 1
        org_url = ""  # the image's real address
        if "data-src" in img.attrs:  # <img data-src="..."
            org_url = img.attrs['data-src']
        elif "src" in img.attrs:  # <img src="..."
            org_url = img.attrs['src']
        if org_url.startswith("//"):
            org_url = "http:" + org_url
        if len(org_url) > 0:
            print("download image ", img_idx)
            if "data-type" in img.attrs:
                img_type = img.attrs["data-type"]
            else:
                img_type = "png"
            img_name = html_name + "_" + str(img_idx) + "." + img_type
            img_file_path = os.path.join(local_img_path, img_name)
            get_save_image(org_url, img_file_path)  # download and save the image
            img.attrs["src"] = "images/" + img_name
        else:
            img.attrs["src"] = ""
    chg_img_link(bs_html)
    return str(bs_html)


def down_html(json_path, html_path):
    if not os.path.exists(html_path):
        os.makedirs(html_path)  # create the folder for the saved HTML files
    local_img_path = os.path.join(html_path, "images")
    if not os.path.lexists(local_img_path):
        os.makedirs(local_img_path)  # create the folder for the locally saved images
    article_list = get_article_list(json_path)
    article_list.sort(key=lambda x: x.atc_datetime, reverse=True)  # newest articles first
    tot_article = len(article_list)  # total number of articles
    i = 0  # progress counter
    for atc in article_list:
        i += 1
        atc_unique_name = str(atc.atc_datetime) + "_" + str(atc.idx_num)  # time + index uniquely identifies articles published at the same moment
        html_name = atc_unique_name + ".html"
        html_file_path = os.path.join(html_path, html_name)
        print(i, "of", tot_article, atc_unique_name, atc.title)
        if os.path.exists(html_file_path):  # allows an interrupted download to be resumed
            print("{} existed already!".format(html_file_path))
            continue
        org_atc_html = get_html(atc.url)
        new_atc_html = rep_image(org_atc_html, local_img_path, html_name)
        save_file(html_file_path, new_atc_html)
        sleep(round(random.uniform(1, 3), 2))
        """for test
        if i > 0:
            break
        """

def conv_html_pdf(html_path, pdf_path):
    if not os.path.exists(pdf_path):
        os.makedirs(pdf_path)
    f_list = os.listdir(html_path)
    for f in f_list:
        if (not f[-5:] == ".html") or ("tmp" in f):  # skip non-html files and the temporary files
            continue
        html_file_path = os.path.join(html_path, f)
        html_tmp_file = html_file_path[:-5] + "_tmp.html"  # temporary file used for the PDF conversion
        html_str = read_file(html_file_path)
        bs_html = BeautifulSoup(html_str, "lxml")
        pdf_title = ""
        title_tag = bs_html.find(id="activity-name")
        if title_tag is not None:
            pdf_title = "_" + title_tag.get_text().replace(" ", "").replace(" ", "").replace("\n", "")
        print(pdf_title)
        r_idx = html_file_path.rindex("/") + 1
        pdf_name = html_file_path[r_idx:-5] + pdf_title
        pdf_file_path = os.path.join(pdf_path, pdf_name + ".pdf")
        """
        Strip elements the PDF does not need, which speeds up the conversion
        """
        [s.extract() for s in bs_html(["script", "iframe", "link"])]
        save_file(html_tmp_file, str(bs_html))
        call_wkhtmltopdf(html_tmp_file, pdf_file_path)

def call_wkhtmltopdf(html_file_path, pdf_file_path, skipExists=True, removehtml=True):
    if skipExists and os.path.exists(pdf_file_path):
        print("pdf_file_path already existed!")
        if removehtml:
            os.remove(html_file_path)
        return
    exe_path = cfg['wkhtmltopdf']  # path to wkhtmltopdf.exe
    cmd_list = []
    cmd_list.append(" --load-error-handling ignore ")
    cmd_list.append(" " + html_file_path + " ")
    cmd_list.append(" " + pdf_file_path + " ")
    cmd_str = exe_path + "".join(cmd_list)
    print(cmd_str)
    subprocess.check_call(cmd_str, shell=False)
    if removehtml:
        os.remove(html_file_path)

def get_config():
    cfg_file = read_file("config/wechat.cfg")
    cfg_file = cfg_file.replace("\\\\", "/").replace("\\", "/")  # turn Windows backslashes in the paths into forward slashes so json.loads and the path handling do not choke
    cfg_json = json.loads(cfg_file)
    return cfg_json

def seconds_to_time(seconds):
    time_array = time.localtime(seconds)  # seconds from 1970-01-01 00:00:00 to the publish time
    other_style_time = time.strftime("%Y-%m-%d %H:%M:%S", time_array)
    date_time = datetime.strptime(other_style_time, "%Y-%m-%d %H:%M:%S")
    return str(date_time).replace("-", "").replace(":", "").replace(" ", "")


cfg = get_config()  # global variable holding the configuration file contents
#get_article_list("./tmp/") # for test
#down_html("./tmp/","./html/")# for test

if __name__ == "__main__":

    if len(sys.argv) == 1:
        arg = None
    else:
        arg = sys.argv[1]
    if arg is None or arg == "html":
        down_html(cfg['jsonDir'], cfg['htmlDir'])
    elif arg == "pdf":
        conv_html_pdf(cfg['htmlDir'], cfg['pdfDir'])