【老博客地址】

  【CSDN】博客主页
  【量化投资】00.资源篇——持续更新
  【量化投资】01.基本概念
  【量化投资】02.数据接口初探——excel读取新浪股票api数据
    【量化投资】03.量化工程abu学习之量化基础(1/3)

【背景】

之前的文章里推荐了好几个量化投资程序获取数据的方法,但是后来发现大多都有了限制,包括某share都得用积分才能使用,积分的获取非常麻烦。求人不如靠己!所以就开始自己一点点的自己想办法获取数据,并存储到本地。

本次的目标是从东方财富的官网页面上抓取股票的基本数据,内容如下:

  • 本期
    • 获取A股所有的股票列表和一些基本信息,并保存到cvs文件
    • 获取各股票的一些详细信息,并保存到csv文件
  • 计划
    • 后续将数据进行结构化保存到数据库
    • 后续改成多线程获取,加速爬取数据

【最终效果】


【代码】

获取所有A股股票清单

☆完整代码
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
import urllib
import requests
import re
import json

def get_html(url):
with open('html.txt','w',encoding='utf-8') as f:
f.write(requests.get(url).text)

# 股票信息主页
# http://quote.eastmoney.com/center/gridlist.html#hs_a_board
# 沪深A股列表真实链接
url = "http://11.push2.eastmoney.com/api/qt/clist/get?cb=jQuery112405084455909086425_1610764452571&pn=1&pz=4400&po=1&np=1&ut=bd1d9ddb04089700cf9c27f6f7426281&fltt=2&invt=2&fid=f3&fs=m:0+t:6,m:0+t:13,m:0+t:80,m:1+t:2,m:1+t:23&fields=f1,f2,f3,f4,f5,f6,f7,f8,f9,f10,f12,f13,f14,f15,f16,f17,f18,f20,f21,f23,f24,f25,f22,f11,f62,f128,f136,f115,f152&_=1610764452635"

def read_file():
with open('html.txt','r',encoding='utf-8') as f:
all_contents = f.read()

# 正则表达式,获取股票列表信息json串
re_cont = re.findall('\\((.*?)\\);',all_contents,re.S)
js_cont = json.loads(re_cont[0])
stock_list=[]
for i in js_cont['data']['diff']:
#_stock={}
#_stock['code']=i['f12']
#_stock['name']=i['f14']
#_stock['new_price']=i['f2']
conten=[] # 存放单只股票信息
conten.append(i['f12']) # 股票代码
conten.append(i['f14']) # 股票名称
conten.append(i['f2']) # 最新价
conten.append(i['f3']) # 涨跌幅
conten.append(i['f4']) # 涨跌额
conten.append(i['f5']) # 成交量(手)
conten.append(i['f6']) # 成交额
conten.append(i['f7']) # 振幅
conten.append(i['f15']) # 最高
conten.append(i['f16']) # 最低
conten.append(i['f17']) # 今开
conten.append(i['f18']) # 昨收
conten.append(i['f10']) # 量比
conten.append(i['f8']) # 换手率
conten.append(i['f9']) # 市盈率(动态)
conten.append(i['f23']) # 市净率
stock_list.append(conten)

with open('stock.csv','w+',encoding='utf-8') as f:
# 写入表头
n=0
single=(str(n)+','+'股票代码'+','+'股票名称'+','+'最新价'+','+'涨跌幅'+','
+'涨跌额'+','+'成交量(手)'+','+'成交额(亿)'+','+'振幅'+','+'最高'+','
+'最低'+','+'今开'+','+'昨收'+','+'量比'+','+'换手率'+','
+'市盈率(动态)'+','+'市净率'+'\n')
f.write(single)
# 写入股票数据
for _ in stock_list:
n+=1
single = str(n)
for val in _ :
single=single+','+str(val)
single=single+'\n'
f.write(single)

get_html(url)
read_file()
☆代码解读
  1. 获取返回股票列表信息的真实连接
  • 打开东方财富的 主页 ,获取返回值的html,查看源码,发现其页面里面没有股票信息,说明是通过生成页面框架,然后动态加载的。
  • 使用fiddler或者chrome自带功能(可以参考我之前的相关文章),点击“沪深A股”或者点击“翻页”,从fiddler捕获的连接或chrome的开发者工具(F12)的Network中找到最新的几个连接,找到真实的连接.
  • 分析请求request的header请求,发现关键信息,pn=3,pz=20,猜想是第3页和每页20个的请求参数,其他f1,f2…应该是需要请求哪些数据的参数。那么我们把参数改成pn=1&pz=5000,也就是一次性将A股的4000多只股票全部获取到。
    1
    2
    3
    4
    5
    Request URL: http://56.push2.eastmoney.com/api/qt/clist/get?cb=jQuery1124034105331658665383_1612251280345&pn=3&pz=20&po=1&np=1&ut=bd1d9ddb04089700cf9c27f6f7426281&fltt=2&invt=2&fid=f3&fs=m:0+t:6,m:0+t:13,m:0+t:80,m:1+t:2,m:1+t:23&fields=f1,f2,f3,f4,f5,f6,f7,f8,f9,f10,f12,f13,f14,f15,f16,f17,f18,f20,f21,f23,f24,f25,f22,f11,f62,f128,f136,f115,f152&_=1612251280452
    Request Method: GET
    Status Code: 200 OK
    Remote Address: 101.89.224.78:80
    Referrer Policy: strict-origin-when-cross-origin
  1. 访问真实连接,并对返回值进行解析
  • 分析返回reponse,发现其是一个非常规整json字符串,外面包了一个小括弧,那就非常好办了,通过正则表达式把json字符串取出来,然后使用json包进行解析即可。
  • 通过页面对比,发现每个返回字段代表的中文含义,例如:f12是股票代码

获取每只股票信息信息

☆完整代码
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
import urllib
import requests
import re
import os
import csv
import json

#读取文件,拼接地址
f = open('info_fin.csv','w+',encoding='utf-8')
f.write('股票代码'+","+"股票名称"+","+"最新价"+","+"最高价"+","+"最低价"+","+"今开"+","+"总手"+","+"金额"+","+"外盘"+","+"量比"
+","+"涨停"+","+"跌停"+","+"收益(新)"+","+"昨收"+","+"总股本"+","+"流通股"
+","+"净利润"+","+"每股净资产"+","+"总值"+","+"流值"+","+"板块"+","+"地区"+","+"pe"
+","+"ROE"+","+"总营收"+","+"总营收同比"+","+"净利润同比"+","+"净利率"+","+"负债率"+","+"上市时间"
+","+"每股未分配利润"+"\n")

file = open('stock.csv',encoding='utf-8')
fileReader = csv.reader(file)
filedata = list(fileReader)

for _,i in enumerate(filedata):
if _==0: # 跳过文件头
continue
if i[1].startswith('6'):
addr = "http://push2.eastmoney.com/api/qt/stock/get?ut=fa5fd1943c7b386f172d6893dbfba10b&invt=2&fltt=2&fields=f43,f57,f58,f169,f170,f46,f44,f51,f168,f47,f164,f163,f116,f60,f45,f52,f50,f48,f167,f117,f71,f161,f49,f530,f135,f136,f137,f138,f139,f141,f142,f144,f145,f147,f148,f140,f143,f146,f149,f55,f62,f162,f92,f173,f104,f105,f84,f85,f183,f184,f185,f186,f187,f188,f189,f190,f191,f192,f107,f111,f86,f177,f78,f110,f262,f263,f264,f267,f268,f250,f251,f252,f253,f254,f255,f256,f257,f258,f266,f269,f270,f271,f273,f274,f275,f127,f199,f128,f193,f196,f194,f195,f197,f80,f280,f281,f282,f284,f285,f286,f287,f292&secid=1." + i[1] + "&cb=jQuery11240660569033297794_1610790254977&_=1610790254978"
else:
addr = "http://push2.eastmoney.com/api/qt/stock/get?ut=fa5fd1943c7b386f172d6893dbfba10b&invt=2&fltt=2&fields=f43,f57,f58,f169,f170,f46,f44,f51,f168,f47,f164,f163,f116,f60,f45,f52,f50,f48,f167,f117,f71,f161,f49,f530,f135,f136,f137,f138,f139,f141,f142,f144,f145,f147,f148,f140,f143,f146,f149,f55,f62,f162,f92,f173,f104,f105,f84,f85,f183,f184,f185,f186,f187,f188,f189,f190,f191,f192,f107,f111,f86,f177,f78,f110,f262,f263,f264,f267,f268,f250,f251,f252,f253,f254,f255,f256,f257,f258,f266,f269,f270,f271,f273,f274,f275,f127,f199,f128,f193,f196,f194,f195,f197,f80,f280,f281,f282,f284,f285,f286,f287,f292&secid=0." + i[1] + "&cb=jQuery11240660569033297794_1610790254977&_=1610790254978"


re_cont = re.findall("\\((.*?)\\);",requests.get(addr).text,re.S)
je_cont = json.loads(re_cont[0])
#fin = {}
#fin['code'] = je_cont['data']['f57']
#fin['name'] = je_cont['data']['f58']
work = ''
#print(str(je_cont['data']['f43']))
#print('12')
work = (str(je_cont['data']['f57'])+ ","+ str(je_cont['data']['f58'])+"," + str(je_cont['data']['f43'])+ ","+ str(je_cont['data']['f44'])
+ ","+ str(je_cont['data']['f45'])+ ","+ str(je_cont['data']['f46'])+ ","+ str(je_cont['data']['f47'])
+ ","+ str(je_cont['data']['f48'])+ ","+ str(je_cont['data']['f49'])+ ","+ str(je_cont['data']['f50'])+ ","+ str(je_cont['data']['f51'])
+ ","+ str(je_cont['data']['f52'])+ ","+ str(je_cont['data']['f55'])
+ ","+ str(je_cont['data']['f60'])+ ","+ str(je_cont['data']['f84'])+","+ str(je_cont['data']['f85'])+","+str(je_cont['data']['f105'])+","+str(je_cont['data']['f92'])
+ ","+ str(je_cont['data']['f116'])+","+ str(je_cont['data']['f117'])+","+str(je_cont['data']['f127'])+","+str(je_cont['data']['f128'])+","+str(je_cont['data']['f162'])+","+str(je_cont['data']['f173'])
+ ","+ str(je_cont['data']['f183'])+","+ str(je_cont['data']['f184'])+","+str(je_cont['data']['f185'])+","+str(je_cont['data']['f187'])+","+str(je_cont['data']['f188'])+","+str(je_cont['data']['f189'])
+ ","+ str(je_cont['data']['f190']))

work = work+'\n'
f.write(work)
☆代码解读
  1. 通过点击详情页面,找到每只股票详细信息访问的连接的规律
    1
    2
    http://quote.eastmoney.com/sz******.html
    http://quote.eastmoney.com/sh******.html
  2. 同样的方法,发现所有的股票详细数据,都是动态加载的。使用fiddler或者chrome自带功能,获取真实连接。
  3. 分析规律组装真实连接,“6”开头的请求参数为“1.股票代码”,其他是“0.股票代码”
    1
    2
    3
    4
    5
     if i[1].startswith('6'):
    addr = "http://push2.eastmoney.com/api/qt/stock/get?ut=fa5fd1943c7b386f172d6893dbfba10b&invt=2&fltt=2&fields=f43,f57,f58,f169,f170,f46,f44,f51,f168,f47,f164,f163,f116,f60,f45,f52,f50,f48,f167,f117,f71,f161,f49,f530,f135,f136,f137,f138,f139,f141,f142,f144,f145,f147,f148,f140,f143,f146,f149,f55,f62,f162,f92,f173,f104,f105,f84,f85,f183,f184,f185,f186,f187,f188,f189,f190,f191,f192,f107,f111,f86,f177,f78,f110,f262,f263,f264,f267,f268,f250,f251,f252,f253,f254,f255,f256,f257,f258,f266,f269,f270,f271,f273,f274,f275,f127,f199,f128,f193,f196,f194,f195,f197,f80,f280,f281,f282,f284,f285,f286,f287,f292&secid=1." + i[1] + "&cb=jQuery11240660569033297794_1610790254977&_=1610790254978"
    else:
    addr = "http://push2.eastmoney.com/api/qt/stock/get?ut=fa5fd1943c7b386f172d6893dbfba10b&invt=2&fltt=2&fields=f43,f57,f58,f169,f170,f46,f44,f51,f168,f47,f164,f163,f116,f60,f45,f52,f50,f48,f167,f117,f71,f161,f49,f530,f135,f136,f137,f138,f139,f141,f142,f144,f145,f147,f148,f140,f143,f146,f149,f55,f62,f162,f92,f173,f104,f105,f84,f85,f183,f184,f185,f186,f187,f188,f189,f190,f191,f192,f107,f111,f86,f177,f78,f110,f262,f263,f264,f267,f268,f250,f251,f252,f253,f254,f255,f256,f257,f258,f266,f269,f270,f271,f273,f274,f275,f127,f199,f128,f193,f196,f194,f195,f197,f80,f280,f281,f282,f284,f285,f286,f287,f292&secid=0." + i[1] + "&cb=jQuery11240660569033297794_1610790254977&_=1610790254978"

  4. 解析返回reponse,同样的正则表达式+json解析,然后通过页面对比,发现每个返回字段对比的中文含义,例如:f57代表股票代码
    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    11
    12
    13
    14
    15
    16
    17
    18
    19
    20
    21
    22
    23
    24
    25
    26
    27
    28
    29
    30
    31
    最新价:f43
    最高:f44
    最低:f45
    今开:f46
    总手:f47
    金额:f48
    外盘:f49
    量比:f50
    涨停:f51
    跌停:f52
    收益(三): f55
    股票代码:f57
    股票名称:f58
    昨收:f60
    总股本:f84
    流通股:f85
    净利润:f105
    每股净资产:f92
    总值:f116
    流值:f117
    板块:f127
    地区:f128
    pe(动):f162
    ROE:f173
    总营收:f183
    总营收同比:f184
    净利润同比:f185
    净利率:f187
    负债率:f188
    上市时间:f189
    每股未分配利润:f190