Importing ttlsa Site Articles into Evernote

06/01/2014 18:32:37 | python

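I like to keep reference material in Evernote (印象笔记), so I wanted to import this site's articles into it. My first plan was to use IFTTT's RSS -> Evernote recipe, but IFTTT cannot import a post's tags, and it cannot import posts that were published earlier. In the end I wrote a Python script to do the job.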

Evernote for Windows provides a local API; for details see http://dev.yinxiang.com/doc/articles/enscript.php

Python's lxml module supports XPath, which makes parsing HTML straightforward. The key points of the script are explained below.

from lxml import html

# Fetch and parse the page source
lhtml=html.parse('https://www.ttlsa.com/page/1')

# Get the list of posts
posts=lhtml.xpath('//div[@id="main"]/div[@class="post"]')

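// selects matching nodes anywhere in the document, searching down from the root. div[@id="main"] matches the div element whose id attribute is main, and div[@class="post"] matches its child div elements whose class attribute is post. Each post element holds the summary of one article.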
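To make the selection concrete, here is a minimal sketch that runs the same XPath against an inline HTML string instead of the live site. It is written for Python 3 with lxml installed; the SAMPLE markup and the example.com links are made up for illustration and only imitate the structure described above.

```python
from lxml import html

# Hypothetical markup that imitates the layout described in the text.
SAMPLE = """
<html><body>
  <div id="main">
    <div class="post"><h2><a href="https://example.com/a.html">First</a></h2></div>
    <div class="post"><h2><a href="https://example.com/b.html">Second</a></h2></div>
    <div class="sidebar">not a post</div>
  </div>
  <div id="footer"><div class="post">outside main</div></div>
</body></html>
"""

doc = html.fromstring(SAMPLE)

# Only div.post elements that are direct children of div#main are selected.
posts = doc.xpath('//div[@id="main"]/div[@class="post"]')

# Relative paths are evaluated from each post element.
titles = [p.xpath('h2/a/text()')[0] for p in posts]
print(len(posts), titles)
```

The div inside the footer has the same class but is skipped, because the path requires the post to sit directly under the main div.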

# Get the link of one post

href=post.xpath('h2/a/@href[.]')[0]

a/@href selects the href attribute of the hyperlink element a, which is the post's URL. The trailing [.] is a predicate that is always true for a node that exists, so a/@href[.] returns the same result as a/@href.

Note: xpath() returns the matching nodes in a list, which is why [0] is needed to take the first one.
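The following sketch compares the two forms of the attribute path and shows what comes back when nothing matches. It assumes Python 3 with lxml; the fragment and its example.com URL are invented for the demonstration.

```python
from lxml import html

# Hypothetical post fragment; fromstring() returns the outer div element.
post = html.fromstring(
    '<div class="post"><h2><a href="https://example.com/a.html">First</a></h2></div>'
)

with_predicate = post.xpath('h2/a/@href[.]')   # form used in the script
without_predicate = post.xpath('h2/a/@href')   # plain attribute selection
missing = post.xpath('h3/a/@href')             # no h3 here, so no match

print(with_predicate == without_predicate)

# An empty result is an empty list, so indexing with [0] would raise
# IndexError. Guard it when the markup may vary.
first_link = missing[0] if missing else None
```

Since a miss yields an empty list rather than None, checking the list before indexing is a cheap way to keep one odd post from aborting a whole run.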

# Get the article content

contents=post.xpath('div[@class="content"]/*/text()')

* matches any element node.

text() selects the text nodes that are direct children of the matched element. Here that means only the text sitting directly inside each child of the content div; text inside deeper descendants is not returned.
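A small sketch of that depth limit, assuming Python 3 with lxml and a made-up fragment. It contrasts the path used in the script with a descendant search (//text()), which is one possible alternative and not what the original script does.

```python
from lxml import html

# Hypothetical post whose second paragraph contains nested markup.
post = html.fromstring(
    '<div class="post"><div class="content">'
    '<p>Plain paragraph.</p>'
    '<p>Text with <strong>nested</strong> markup.</p>'
    '</div></div>'
)

# Path from the script: text directly inside each child of div.content.
direct = post.xpath('div[@class="content"]/*/text()')

# Alternative: every text node at any depth below div.content.
everything = post.xpath('div[@class="content"]//text()')

print(direct)
print(everything)
```

The word inside the strong element appears only in the second list, so summaries that use inline formatting lose those words with the original path.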

contents holds the list of paragraphs. Evernote supports <br/> for line breaks, so all paragraphs are joined into a single string.

# Escape each paragraph, then join them with line breaks
l=[cgi.escape(t) for t in contents]
content='<br/>'.join(l)
# Cut the summary at the truncation marker, then append the marker once
content=content.split('[......]')[0]
content+='[......]'

cgi.escape() encodes the HTML special characters &, < and >. Evernote accepts only a subset of HTML, so leaving the text unescaped may cause the import to fail.
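For readers on Python 3, where cgi.escape was removed in version 3.8, the same step can be sketched with html.escape from the standard library. The function name build_summary and the sample paragraphs are illustrative and not part of the original script.

```python
from html import escape  # Python 3 replacement for cgi.escape

MARKER = '[......]'  # truncation marker used in the post summaries

def build_summary(paragraphs):
    """Escape each paragraph, join with <br/>, and end with a single marker."""
    # quote=False matches cgi.escape's default: only &, < and > are encoded.
    escaped = [escape(p, quote=False) for p in paragraphs]
    body = '<br/>'.join(escaped)
    # Drop anything after an existing marker, then append exactly one.
    body = body.split(MARKER)[0]
    return body + MARKER

summary = build_summary(['a < b', 'x & y [......] more'])
print(summary)
```

Escaping happens before the join, so the <br/> separators stay as real markup while the paragraph text is made safe.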

# Get the post's tags

tags=post.xpath('div[@class="under"]/span/a[@rel="tag"]/text()')

# Import into Evernote

subprocess.call('ENScript.exe importNotes /n www.ttlsa.com /s '+path)

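ENScript.exe is located in the Evernote installation directory. Before using it, add that directory to the PATH environment variable, or call it by its absolute path.

/n specifies the notebook

/s specifies the path of the Evernote file to import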
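One way to make the call more robust is to pass the command as an argument list, so that notebook names or paths containing spaces need no manual quoting. This is only a sketch: the helper names build_import_command and import_note are hypothetical, and the only ENScript options used are the importNotes command with /n and /s described above.

```python
import subprocess

def build_import_command(notebook, enex_path, enscript='ENScript.exe'):
    """Return the ENScript importNotes command as an argument list."""
    # enscript may be a bare name (found via PATH) or an absolute path.
    return [enscript, 'importNotes', '/n', notebook, '/s', enex_path]

def import_note(notebook, enex_path, enscript='ENScript.exe'):
    """Run the import; True means ENScript exited with status 0."""
    return subprocess.call(build_import_command(notebook, enex_path, enscript)) == 0

# Building the command does not require Evernote to be installed.
command = build_import_command('www.ttlsa.com', 'blog\\example.html')
print(command)
```

Calling import_note('www.ttlsa.com', path) would then stand in for the string-concatenated subprocess.call line, on a Windows machine where ENScript.exe is reachable.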

The complete source code follows.

#encoding:utf8
import subprocess
from lxml import html
import os
import cgi

enex='''<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE en-export SYSTEM "http://xml.evernote.com/pub/evernote-export2.dtd">
<en-export export-date="20131219T061541Z" application="Evernote/Windows" version="5.x">
<note><title>{{title}}</title><content><![CDATA[<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE en-note SYSTEM "http://xml.evernote.com/pub/enml2.dtd">

<en-note style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space;">
{{content}}
<div><br/></div><div><a href="{{href}}">{{hreftitle}}</a></div></en-note>]]></content>{{tag}}</note></en-export>
'''

def get_latest_blog(index):
    global num
    logfile='last.log'
    lasthref=''
    #if os.path.exists(logfile) and os.path.isfile(logfile):
    #    lasthref=open('last.log','r').read()

    print 'reading index',str(index)
    href='https://www.ttlsa.com/page/'+str(index)
    lhtml=html.parse(href)
    # get the list of posts on this page
    posts=lhtml.xpath('//div[@id="main"]/div[@class="post"]')
    if not posts:
        return 'exit'

    # read each post from ttlsa
    for post in posts:
        href=post.xpath('h2/a/@href[.]')[0]
        href=cgi.escape(href)
        #if lasthref==href:
        #    print 'this is the last href', href
        #    return posts[0].xpath('h2/a/@href[.]')[0] # return the newest href
        # extract the page data with xpath
        title=post.xpath('h2/a/text()')[0]
        title=cgi.escape(title)
        contents=post.xpath('div[@class="content"]/*/text()')
        l=[cgi.escape(t) for t in contents]
        content='<br/>'.join(l)
        # cut at the truncation marker, then append the marker once
        content=content.split('[......]')[0]
        content+='[......]'
        tags=post.xpath('div[@class="under"]/span/a[@rel="tag"]/text()')
        # fill in the enex template
        note=enex.replace('{{title}}',title.encode('utf8'))
        note=note.replace('{{hreftitle}}',href.encode('utf8'))
        note=note.replace('{{href}}',href.encode('utf8'))
        note=note.replace('{{content}}',content.encode('utf8'))
        enex_tag=''
        for tag in tags:
            # escape tag names too, so that & or < cannot break the XML
            enex_tag+='<tag>%s</tag>'%cgi.escape(tag)
        note=note.replace('{{tag}}',enex_tag.encode('utf8'))
        print title
        #print href
        #print 'tag:',tags
        # save to a local file (the blog directory must already exist)
        path='blog\\'+href.split('/')[-1]
        file1=open(path,'w')
        file1.write(note)
        file1.close()
        # import into Evernote
        ret=subprocess.call('ENScript.exe importNotes /n www.ttlsa.com_2 /s '+path)
        if ret ==0:
            num+=1
            print 'num:',num
        if ret != 0:
            # record the failed link so it can be retried later
            open('err.log','a').write(href.encode('utf8')+'\r\n')
        print '-----'

    return 'next'

#==========begin============#
num=0 # number of notes successfully written to Evernote
page_num=32 # number of pages on the ttlsa blog

page_num=page_num+1
for index in range(1,page_num):
    index=page_num-index # walk from the last page down to page 1
    res=get_latest_blog(index)
    if res=='next':
        continue
    elif res=='exit': # no posts were found on the page
        break
    else:
        open('last.log','w').write(res)
        break
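The template-filling step in the script above chains several replace() calls. As a possible variation, the sketch below fills all fields in one pass with str.format, so text inserted into one field cannot be mistaken for a later placeholder. It targets Python 3; NOTE_TEMPLATE is a trimmed stand-in for the full enex template, and build_note is a hypothetical helper name.

```python
from html import escape

# Trimmed stand-in for the enex template; the real one also carries the
# XML declaration, the DOCTYPE lines and the en-export wrapper.
NOTE_TEMPLATE = (
    '<note><title>{title}</title>'
    '<content><![CDATA[<en-note>{content}'
    '<div><a href="{href}">{href}</a></div></en-note>]]></content>'
    '{tags}</note>'
)

def build_note(title, content_html, href, tags):
    """Fill the template in a single pass, escaping every text field."""
    tag_xml = ''.join('<tag>%s</tag>' % escape(t, quote=False) for t in tags)
    return NOTE_TEMPLATE.format(
        title=escape(title, quote=False),
        content=content_html,           # assumed to be escaped already
        href=escape(href, quote=True),  # used inside an attribute value
        tags=tag_xml,
    )

note = build_note('A & B', 'body', 'https://example.com/?a=1&b=2', ['c&d', 'linux'])
print(note)
```

The content argument is expected to be the already-escaped summary joined with <br/>, as built earlier in the script.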
