$ curl https://www.ibiblio.org/xml/examples/shakespeare/hamlet.xml > hamlet.xml
Hold my beer.
$ grep "your philosophy" hamlet.xml
# <LINE>Than are dreamt of in your philosophy. But come;</LINE>
$ python
“But come.”
Heh.
So, some kind soul transcribed Hamlet into XML. Let’s dig into it and see what we can find.
>>> import xml.etree.ElementTree as ET
>>> tree = ET.parse('hamlet.xml')
>>> root = tree.getroot()
>>> [l.text for l in root.iter("LINE")]
# ["Who's there?", 'Nay, answer me: stand, and unfold yourself.', 'Long live the king!', 'Bernardo?', 'He.' ...
So far so good. The XML API sucks but we should be able to get something useful out of it. Like… who said all that stuff?
Well, the XML input looks like this:
<SPEECH>
  <SPEAKER>KING CLAUDIUS</SPEAKER>
  <LINE>It likes us well;</LINE>
  <LINE>And at our more consider'd time well read,</LINE>
  <LINE>Answer, and think upon this business.</LINE>
  <LINE>Meantime we thank you for your well-took labour:</LINE>
  <LINE>Go to your rest; at night we'll feast together:</LINE>
  <LINE>Most welcome home!</LINE>
</SPEECH>
It’s less than nice, on the scale of shit to nice.
Right. Okay Sam, you got this.
>>> speeches = [{"speaker": s.find("SPEAKER").text,
...              "lines": [l.text for l in s.findall("LINE")]}
...             for s in root.iter("SPEECH")]
# [{'lines': ["Who's there?"],
#   'speaker': 'BERNARDO'},
#  {'lines': ['Nay, answer me: stand, and unfold yourself.'],
#   'speaker': 'FRANCISCO'},
#  {'lines': ['Long live the king!'],
#   'speaker': 'BERNARDO'},
#  {'lines': ['Bernardo?'],
#   'speaker': 'FRANCISCO'},
#  {'lines': ['He.'],
#   'speaker': 'BERNARDO'}
# ...
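A quick aside on why that comprehension mixes `iter` and `findall`: `iter` walks everything below an element, while `findall` with a bare tag name only looks at direct children. Here's a minimal sketch on a made-up snippet. The `PLAY`/`ACT` nesting is an assumption for illustration, not copied from the real file.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature of the play's structure, for illustration only.
SAMPLE = """
<PLAY>
  <ACT>
    <SPEECH>
      <SPEAKER>BERNARDO</SPEAKER>
      <LINE>Who's there?</LINE>
    </SPEECH>
  </ACT>
</PLAY>
"""

root = ET.fromstring(SAMPLE)

# iter() searches every descendant, so it finds the nested SPEECH.
speeches_via_iter = list(root.iter("SPEECH"))

# findall() with a plain tag name only matches direct children of root,
# and SPEECH is two levels down, so this comes back empty.
speeches_via_findall = root.findall("SPEECH")

# Inside a SPEECH, LINE is a direct child, so findall() is enough there.
lines = [l.text for l in speeches_via_iter[0].findall("LINE")]
print(len(speeches_via_iter), len(speeches_via_findall), lines)
```

So `root.iter("SPEECH")` doesn't care how deep the speeches are buried, and `s.findall("LINE")` keeps each speech's lines with their own speaker.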
Better! Let’s denormalise it so we get a speaker for each line.
>>> flattened_speeches = [{"speaker": speech["speaker"], "line": line}
...                       for speech in speeches
...                       for line in speech["lines"]]
# [{'line': "Who's there?", 'speaker': 'BERNARDO'},
#  {'line': 'Nay, answer me: stand, and unfold yourself.', 'speaker': 'FRANCISCO'},
#  {'line': 'Long live the king!', 'speaker': 'BERNARDO'},
#  {'line': 'Bernardo?', 'speaker': 'FRANCISCO'},
#  {'line': 'He.', 'speaker': 'BERNARDO'}
# ...
Right, that’s loads better! So, who has the happiest line?
Keanu does.
$ sudo pip install textblob
$ python -m textblob.download_corpora
TextBlob “is a Python (2 and 3) library for processing textual data”.
Sure it is.
>>> from textblob import TextBlob
>>> TextBlob("awesome!").sentiment
# Sentiment(polarity=1.0, subjectivity=1.0)
>>> [{"speaker": l["speaker"], "line": l["line"], "score": TextBlob(l["line"]).sentiment.polarity}
...  for l in flattened_speeches]
# Traceback (most recent call last):
#   File "<stdin>", line 1, in <module>
#   File "/Library/Python/2.7/site-packages/textblob/blob.py", line 361, in __init__
#     'must be a string, not {0}'.format(type(text)))
# TypeError: The `text` argument passed to `__init__(text)` must be a string, not <type 'NoneType'>
Shiiiiiiiiiiit.
OH! One of the lines is bollocksed up:
<SPEECH>
  <SPEAKER>ROSENCRANTZ</SPEAKER>
  <LINE>
    <STAGEDIR>To POLONIUS</STAGEDIR>
    God save you, sir!
  </LINE>
</SPEECH>
We should fix our parsing logic, right?
WRONG! Let’s bodge it:
[{"speaker": l["speaker"], "line": l["line"], "score": TextBlob(l["line"]).sentiment.polarity}
 for l in flattened_speeches
 if l["line"] is not None]
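For the record, the un-bodged version isn't much longer. `.text` only holds the text before an element's first child, so when a `<LINE>` opens straight into a `<STAGEDIR>` it comes back as `None`, and the spoken words end up in the stage direction's `.tail`. Here's a rough sketch of one way round it, assuming you want to drop the stage directions and keep the speech, and assuming there's no whitespace between `<LINE>` and `<STAGEDIR>` in the file (which is what the `None` suggests). The `spoken_text` helper is made up for this example.

```python
import xml.etree.ElementTree as ET


def spoken_text(line):
    """Return the words in a LINE element, skipping any STAGEDIR children."""
    parts = [line.text or ""]
    for child in line:
        if child.tag != "STAGEDIR":
            parts.append("".join(child.itertext()))
        # Text that follows a child element lives in that child's .tail.
        parts.append(child.tail or "")
    # Collapse any stray whitespace left over from the XML layout.
    return " ".join("".join(parts).split())


line = ET.fromstring(
    "<LINE><STAGEDIR>To POLONIUS</STAGEDIR>God save you, sir!</LINE>"
)
print(line.text)          # None
print(spoken_text(line))  # God save you, sir!
```

Swapping `l.text` for something like `spoken_text(l)` when building the speeches would keep Rosencrantz in the running. But where's the fun in that.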
Alright, let’s kill this thing:
>>> happy_lines = [{"speaker": l["speaker"], "line": l["line"], "score": TextBlob(l["line"]).sentiment.polarity}
...                for l in flattened_speeches
...                if l["line"] is not None]
>>> max(happy_lines, key=lambda x: x["score"])
# {'line': 'Which, happily, foreknowing may avoid, O, speak!', 'score': 1.0, 'speaker': 'HORATIO'}
OF COURSE IT’S FUCKING HORATIO