Trying the Hatena Blog AtomPub API with Python 3.3 (continued)

In the previous post I used AtomPub to download every entry of the blog as XML files, so this time I'll extract the information I need from those XML files.

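Required modules: beautifulsoup4 (plus lxml, which its features='xml' parser depends on) and python-dateutil.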

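Looking into the XML structure

The XML fetched through AtomPub contains multiple <entry>...</entry> elements.
Each <entry>...</entry> holds the data for one post, and inside it are tags such as <title> and <content>. I sorted out which tag carries which piece of information.

But first, I'll define an Entry class (well, a namedtuple) to hold the information.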

import collections

Entry = collections.namedtuple('Entry',
	'id, link, edit_link, author, title, updated, published, edited, ' +
	'summary, content, formatted, categories, draft')

The members of Entry and the tags they correspond to are as follows.

Entry member	Type	Tag inside <entry>	Contents
id	str	<id>	Entry ID (?)
link	str	<link rel="alternate" href="～">	URI of the entry
edit_link	str	<link rel="edit" href="～">	Edit URI ※1
author	str	<author><name>	Author name
title	str	<title>	Title
updated	datetime	<updated>	Updated date ※2
published	datetime	<published>	Published date
edited	datetime	<app:edited>	Edited date ※2
summary	str	<summary>	Beginning of the body text
content	str	<content>	Body text ※3
formatted	str	<hatena:formatted-content>	Body converted to HTML
categories	list of str	<category term="～" />	Categories (may appear several times)
draft	bool	<app:control><app:draft>	yes for a draft, no once published

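※1 The edit URI is the URI used when editing an entry through AtomPub.
※2 The updated date and the edited date appear to hold the same value, so one of them may be enough.
※3 I write this blog in Hatena notation, so this is the Hatena-notation text.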
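
To make the table concrete, here is a minimal sketch that fills in one Entry by hand. Every value in it (the example.com URIs, the author name, the timestamps, the body text) is made up for illustration and is not taken from a real feed; only the field list comes from the definition above.

```python
import collections
from datetime import datetime, timedelta, timezone

Entry = collections.namedtuple('Entry',
    'id, link, edit_link, author, title, updated, published, edited, '
    'summary, content, formatted, categories, draft')

# Hypothetical timestamp in JST (UTC+9)
JST = timezone(timedelta(hours=9))
stamp = datetime(2014, 2, 23, 17, 23, 5, tzinfo=JST)

# Hypothetical values, one per row of the table
sample = Entry(
    id='tag:example.com,2014:entry-1',
    link='http://example.com/entry/1',
    edit_link='http://example.com/atom/entry/1',
    author='someone',
    title='Hello',
    updated=stamp, published=stamp, edited=stamp,
    summary='First lines...',
    content='Body text in Hatena notation',
    formatted='<p>Body text</p>',
    categories=['Python', 'AtomPub'],
    draft=False,
)

# Fields are read by name, like attributes
print(sample.title, sample.published.isoformat())
```

Because a namedtuple is immutable, a mistake such as assigning to sample.title fails immediately instead of silently changing the record.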

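Parsing the XML with BeautifulSoup

For XML parsing I use BeautifulSoup, the same as last time.
First, soup.find_all('entry') finds every <entry> tag in the XML, and then the required information is extracted from inside each <entry> tag.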

soup = BeautifulSoup(～, features='xml')

for entry in soup.find_all('entry'):
    pass

Getting a tag's value

<title>Title</title>

With this form, simply writing

s = entry.title.text

gives you 'Title'. If instead it looked like

<title><hoge>text</hoge></title>

then it becomes

s = entry.title.hoge.text

.

Getting an attribute value

<link rel="alternate" href="～"/>

To get the value of href= from this, write

s = entry.find('link', rel='alternate').get('href')

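. Note that find() returns None when no matching tag exists, so a missing <link> would make .get() fail.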
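
The lookups described so far (find_all, tag text, attribute value) can be tried together on a tiny feed. This is only a sketch: the two-entry XML below is a made-up stand-in trimmed to the tags being read, and features='xml' assumes lxml is installed.

```python
from bs4 import BeautifulSoup

# Hypothetical two-entry feed, trimmed to the tags used here
xml = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <title>First post</title>
    <link rel="edit" href="http://example.com/atom/entry/1"/>
    <link rel="alternate" href="http://example.com/entry/1"/>
  </entry>
  <entry>
    <title>Second post</title>
    <link rel="alternate" href="http://example.com/entry/2"/>
  </entry>
</feed>"""

soup = BeautifulSoup(xml, features='xml')
entries = soup.find_all('entry')

# Tag text: entry.title finds the first <title> child
titles = [entry.title.text for entry in entries]

# Attribute value: pick the <link> whose rel is "alternate", then read href
links = [entry.find('link', rel='alternate').get('href') for entry in entries]

for title, link in zip(titles, links):
    print(title, link)
```

Filtering on rel matters because an entry carries more than one <link>; entry.link alone would return whichever comes first.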

BeautifulSoup pitfalls

A few things gave me trouble.

Watch out for the <name> tag

<author><name>author name</name></author>

In this case,

s = entry.author.name.text

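does not work and raises an error. In BeautifulSoup, name is already used for something else (it returns the tag's own name), so entry.author.name is the string 'author', and a str has no .text attribute.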

s = entry.author.find('name').text

Using find() like this solves it.
There are many similar cases, such as attrs, contents, parent, and children, so be careful.
http://www.crummy.com/software/BeautifulSoup/bs4/doc/#attributes
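
A short sketch of the difference, run on a one-line made-up document (the author name "someone" is a placeholder, and features='xml' assumes lxml is installed):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<entry><author><name>someone</name></author></entry>',
    features='xml')
entry = soup.entry

# .name is the Tag attribute holding the tag's own name, not a child lookup
tag_name = entry.author.name

# find() always searches child tags, so it reaches the <name> element
author = entry.author.find('name').text

print(repr(tag_name), author)
```

Here tag_name is the plain string 'author', which is why chaining .text onto it fails with AttributeError.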

<formatted-content> (tag name containing "-")

<formatted-content>～</formatted-content>

When the tag name contains a "-" like this,

s = entry.formatted-content.text

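cannot be written, because Python reads the "-" as a subtraction (entry.formatted minus content.text). Instead, write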

s = entry.find('formatted-content').text

.
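
As a quick check of the find() form, here is a sketch on a made-up fragment. The escaped HTML inside the tag is an assumption about what such content might look like, not a copy of real feed data, and features='xml' assumes lxml is installed.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: the tag body is HTML escaped as text
soup = BeautifulSoup(
    '<entry><formatted-content>&lt;p&gt;Hello&lt;/p&gt;</formatted-content></entry>',
    features='xml')
entry = soup.entry

# The tag name is passed as a string, so the hyphen is never seen by Python's parser
formatted = entry.find('formatted-content').text

print(formatted)
```

The .text value comes back unescaped, so formatted holds the HTML markup itself.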

<app:control> (tag name containing ":")

<app:control>
  <app:draft>yes</app:draft>
</app:control>

When the tag name has a colon ":", specifying only the part after the ":" (control) seems to be enough.

s = entry.control.draft.text

That worked.
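
The part before the colon is an XML namespace prefix. As a cross-check that does not depend on how BeautifulSoup treats prefixes, the same value can be read with the standard library's xml.etree.ElementTree, which is a different technique from the one used in this post: there the namespace has to be spelled out explicitly. The fragment below is made up; the two namespace URIs are the standard Atom and AtomPub ones.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment with the Atom and AtomPub namespaces declared
xml = """<entry xmlns="http://www.w3.org/2005/Atom"
       xmlns:app="http://www.w3.org/2007/app">
  <app:control>
    <app:draft>yes</app:draft>
  </app:control>
</entry>"""

# Map the prefix used in the path below to its namespace URI
NS = {'app': 'http://www.w3.org/2007/app'}

entry = ET.fromstring(xml)
draft_text = entry.findtext('app:control/app:draft', namespaces=NS)
is_draft = draft_text == 'yes'

print(draft_text, is_draft)
```

findtext() returns None when the path is absent, so an entry without <app:control> would simply give is_draft == False here.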

Dates and times

'2014-02-23T17:23:05+09:00'

To convert a string in this format into a Python datetime, use the python-dateutil module.

import dateutil.parser

d = dateutil.parser.parse('2014-02-23T17:23:05+09:00')
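
On newer Python versions (3.7 and later, so not the 3.3 used in this post), the standard library can parse this particular format as well. The following is a sketch of that alternative, not a replacement for dateutil, which accepts many more formats.

```python
from datetime import datetime, timedelta, timezone

# Standard-library alternative to dateutil.parser.parse (Python 3.7+)
d = datetime.fromisoformat('2014-02-23T17:23:05+09:00')

# The result is timezone-aware: the +09:00 offset is kept
offset = d.utcoffset()

# Aware datetimes can be converted to another zone, for example UTC
d_utc = d.astimezone(timezone.utc)

print(offset, d_utc.isoformat())
```

Keeping the offset matters when sorting entries by date, since aware datetimes compare by the actual instant rather than by the wall-clock digits.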

Putting it all together

This enumerates Entry objects from the path of an XML file.

import dateutil.parser
from bs4 import BeautifulSoup
import collections

Entry = collections.namedtuple('Entry',
	'id, link, edit_link, author, title, updated, published, edited, ' +
	'summary, content, formatted, categories, draft')

def enum_entries(xml_path):
    """
    Enumerate the entries stored in an XML file.
    """
    def get_link(entry_tag, rel):
        t = entry_tag.find('link', rel=rel)
        return t.get('href') if t else None

    # Pull the required information out of the XML
    with open(xml_path, 'rb') as f:
        soup = BeautifulSoup(f.read(), features='xml')
    for entry in soup.find_all('entry'):
        yield Entry(
            id=entry.id.text,
            link=get_link(entry, 'alternate'),
            edit_link=get_link(entry, 'edit'),
            author=entry.author.find('name').text,
            title=entry.title.text,
            updated=dateutil.parser.parse(entry.updated.text),
            published=dateutil.parser.parse(entry.published.text),
            edited=dateutil.parser.parse(entry.edited.text),
            summary=entry.summary.text,
            content=entry.content.text,
            formatted=entry.find('formatted-content').text,
            # a list keeps the feed order and matches the documented "list of str"
            categories=[t.get('term') for t in entry.find_all('category') if t.get('term')],
            draft=entry.control.draft.text == 'yes',
        )

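Usage examples

Since I've come this far, let's actually use it.

Output links to all entries in Hatena notation

Pass the paths of all the XML files downloaded last time as the xml_files argument.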

def make_link_list(xml_files):
    entries = sorted([e for xml in xml_files for e in enum_entries(xml)], key=lambda e: e.published)
    return ''.join('-[{}:title={}]\n'.format(e.link, e.title) for e in entries)

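The table of contents of this blog was in fact made by generating a list this way and then fixing it up by hand.
(The manual fixes took far longer than I expected, though...)

Convert all entries to text files

I converted the entries into text files, one file per entry. (Not that I have any particular use for them.)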
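
To show what the generated list looks like, here is a sketch that applies the same sort and format string to two stand-in entries. MiniEntry and its values are hypothetical; it carries only the three fields that make_link_list() reads.

```python
import collections
from datetime import datetime

# Hypothetical stand-in with only the fields make_link_list() uses
MiniEntry = collections.namedtuple('MiniEntry', 'link, title, published')

entries = [
    MiniEntry('http://example.com/entry/2', 'Second post', datetime(2014, 2, 23)),
    MiniEntry('http://example.com/entry/1', 'First post', datetime(2014, 2, 1)),
]

# Oldest first, then one Hatena-notation list item per entry
entries.sort(key=lambda e: e.published)
listing = ''.join('-[{}:title={}]\n'.format(e.link, e.title) for e in entries)

print(listing, end='')
# -[http://example.com/entry/1:title=First post]
# -[http://example.com/entry/2:title=Second post]
```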

import os

def export_to_txt(xml_files, txt_dir):
    for xml_path in xml_files:
        for entry in enum_entries(xml_path):
            # Build the file path
            filename = '{0:%Y%m%d-%H%M%S}.txt'.format(entry.published)
            txt_path = os.path.join(txt_dir, filename)
            # Build the file contents
            s = 'URL: {url}\nDate: {date}\nCategory: {category}\nTitle: {title}\n{line}\n{content}'.format(
                title=entry.title, url=entry.link, date=entry.published, content=entry.content,
                category=', '.join(entry.categories), line='-' * 80)
            # Save to the file
            with open(txt_path, 'w', encoding='utf-8') as f:
                f.write(s)

The file name is simply "(publication date and time).txt".
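
The file name comes from using strftime codes directly inside str.format. A small sketch with a made-up publication time, where 'out' is a hypothetical output directory:

```python
import os
from datetime import datetime, timedelta, timezone

# Hypothetical publication time in JST (UTC+9)
published = datetime(2014, 2, 23, 17, 23, 5,
                     tzinfo=timezone(timedelta(hours=9)))

# A format spec after the colon is handed to datetime's own formatting
filename = '{0:%Y%m%d-%H%M%S}.txt'.format(published)
txt_path = os.path.join('out', filename)

print(filename)   # 20140223-172305.txt
```

One consequence of this naming: two entries published within the same second would map to the same file name, and the later one would overwrite the earlier.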

summer_tree_home

Mastering Python 3 with Check iO!