从东方财富抓取机构持股数据

在经典的动量选股策略中,有一个比较经典的特征,就是机构持股比例,一般来说机构持股比例高的股票跟基本面因子的相关度更高

在聚宽和优矿的数据中都没有机构持股比例这个数据,但是在东方财富网站上有这个数据,例如合兴包装的机构持股统计在这个网页可以看到

http://data.eastmoney.com/zlsj/detail/2018-09-30-0-002228.html

为了能够在量化策略中使用这个因子,决定手动写代码从网站抓取这个数据,

python抓这个数据使用requests_html的HTMLSession或者用urllib+ bs4都可以实现

使用urllib+ bs4的代码如下

[cc lang="python"]
import urllib
import bs4

#下载页面

def get_html(url):
    html = urllib.request.urlopen(url).read()
    html = html.decode('gbk')

return html

#解析页面抓取需要的数据

def get_page(url):
    html = get_html(url)
    soup = bs4.BeautifulSoup(html, "html.parser")
    table = soup.find('table', class_='tab1')
    holdElement = table.select('td')
    if (holdElement == None):
        print("hold empty");

#基金持股家数
try:
    fund_num = int(holdElement[2].text)
except:
    fund_num = 0

#基金持股股数
try:
    fund_hold_vol = float(holdElement[3].text)
except:
    fund_hold_vol = 0.0

#基金持股市值
try:
    fund_hold_value = float(holdElement[4].text)
except:
    fund_hold_value = 0.0

#基金持股占总市值比例
try:
    fund_ratio = float(holdElement[5].text)
except:
    fund_ratio = 0.0

#基金持股占流通市值比例
try:
    fund_ratio_in_circu = float(holdElement[6].text)
except:
    fund_ratio_in_circu = 0.0

#机构汇总总家数
try:
    total_num = int(holdElement[38].text)
except:
    total_num = 0

#机构汇总持股股数
try:
    total_hold_vol = float(holdElement[39].text)
except:
    total_hold_vol = 0.0

#机构汇总持股市值
try:
    total_hold_value = float(holdElement[40].text)
except: 
    total_hold_value = 0.0

#机构汇总占总股本比例
try:
    total_ratio = float(holdElement[41].text)
except:
    total_ratio = 0.0

#机构汇总占流通股本比例

try:
    total_ratio_in_circu = float(holdElement[42].text)
except:
    total_ratio_in_circu = 0.0
return (code, end_date, fund_num, fund_hold_vol, fund_hold_value, fund_ratio, fund_ratio_in_circu,
total_num, total_hold_vol,total_hold_value, total_ratio, total_ratio_in_circu)
[/cc]

拿到这些数据后,再写入mysql或者csv,就可以在策略中使用了

[cc lang="python"]
end_dates=('2017-03-31','2017-06-30','2017-09-30','2017-12-31','2018-03-31','2018-06-30')
#end_date='2018-09-30'
for end_date in end_dates:
    for code in df_stocks['code']:
        url = "http://data.eastmoney.com/zlsj/detail/%s-0-%s.html"
        args = get_page(url % (end_date,code[:-5]))
        #查询数据库判断是否存在记录
        sql_select = "SELECT * FROM institutional_hold WHERE code = '%s' AND end_date='%s' "
        cur = connect.cursor()
        record_count = cur.execute(sql_select % (code, end_date))
        
        if record_count > 0:
            continue

        column_str = """code, end_date, fund_num, fund_hold_vol, fund_hold_value, fund_ratio, 
        fund_ratio_in_circu,total_num, total_hold_vol,total_hold_value, total_ratio, 
total_ratio_in_circu"""
        insert_str = "'%s','%s',%d,%.2f,%.2f,%.2f,%.2f,%d,%.2f,%.2f,%.2f,%.2f"
        sql_insert = "INSERT INTO institutional_hold (%s) VALUES (%s)" % (column_str, insert_str)

        try:
            cur.execute(sql_insert % args)
            connect.commit()
        except:
            connect.rollback()
        print(str(code) + ' ' + str(end_date), end= '\r')
[/cc]

暂无评论

发表评论