Exercises 13.16 Multiple Choice Questions
1.
Q-1: What protocol can be used to retrieve web pages using Python?
urllib
urllib is a Python library containing several modules for working with URLs; it is not a protocol.
bs4
bs4 is a Python library for pulling data out of HTML files.
HTTP
HTTP is a network protocol used to transmit documents such as HTML pages.
GET
GET is an HTTP request method that asks a server for a specified resource.
2.
Q-2: What provides two-way communication between two different programs in a network?
socket
A socket gives a program a two-way connection for sending and receiving data over a network.
port
A port is a numbered endpoint on a computer that other network nodes can connect to.
http
HTTP is a protocol used to transfer data from a web server.
protocol
A protocol is a set of rules that determines how data is transmitted over a network.
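The two-way nature of a socket can be seen without any network at all. The following is a minimal sketch, assuming a single process: `socket.socketpair()` creates two already-connected sockets that stand in for two separate programs, and the names `left` and `right` are illustrative only.

```python
import socket

# socketpair() returns two connected sockets in one process,
# standing in here for two programs on a network.
left, right = socket.socketpair()

left.sendall(b"ping")            # left -> right
request = right.recv(1024)

right.sendall(b"pong")           # right -> left, over the same connection
reply = left.recv(1024)

left.close()
right.close()

print(request, reply)
```

Each side both sends and receives on the same connection, which is what distinguishes a socket from a port or a protocol.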
3.
Q-3: What is a Python library that can be used to send and receive data over HTTP?
http
Here http refers to the protocol itself, not to a library.
urllib
urllib can send and receive data over HTTP from a program, instead of retrieving it manually with a web browser.
port
A port is an endpoint through which a device connects with other devices on a network to exchange a particular type of data.
header
A header is additional information sent and received along with the data.
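To see how urllib packages an HTTP request, including its headers, a request object can be built and inspected without sending anything. This is a small sketch, not part of the exercise; the `User-Agent` value is made up for illustration.

```python
import urllib.request

# Build (but do not send) a request; the User-Agent value is hypothetical.
req = urllib.request.Request(
    "http://data.pr4e.org/romeo.txt",
    headers={"User-Agent": "pr4e-quiz-example"},
)

method = req.get_method()            # 'GET' because no body data was supplied
host = req.host                      # host name parsed from the URL
headers = dict(req.header_items())   # urllib stores the name as 'User-agent'

print(method, host, headers)
```

Passing `req` to `urllib.request.urlopen()` would send it over the network.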
4.
Q-4: What is the process by which search engines retrieve webpages and build a search index called?
scrape
Scraping is the act of extracting data from webpages.
parse
Parsing is breaking down scraped webpages into useful data.
BeautifulSoup
BeautifulSoup is a Python library for parsing HTML documents and extracting data from them.
spider
A spider retrieves a webpage and then all the webpages linked from it to build a search index.
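The spidering idea (visit a page, then every page it links to, without visiting anything twice) can be sketched as a breadth-first walk. The example below assumes a hypothetical miniature "web" held in a dictionary rather than real pages; a real spider would fetch and parse each page where the comment indicates.

```python
from collections import deque

# Hypothetical miniature "web": each page maps to the pages it links to.
LINKS = {
    "a.html": ["b.html", "c.html"],
    "b.html": ["c.html", "a.html"],
    "c.html": ["d.html"],
    "d.html": [],
}

def spider(start, links):
    """Visit start and every page reachable from it, breadth first."""
    seen = {start}
    order = []
    queue = deque([start])
    while queue:
        page = queue.popleft()
        order.append(page)             # a real spider would fetch and index here
        for target in links.get(page, []):
            if target not in seen:     # never queue a page twice
                seen.add(target)
                queue.append(target)
    return order

index_order = spider("a.html", LINKS)
print(index_order)
```

The `seen` set is what keeps the walk from looping forever when pages link back to each other, as `a.html` and `b.html` do here.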
5.
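Q-5: What does the following code do?
import socket

# Open a TCP connection to the web server on port 80.
mysock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
mysock.connect(('data.pr4e.org', 80))
# Build the HTTP GET request and send it as bytes.
cmd = 'GET http://data.pr4e.org/romeo.txt HTTP/1.0\r\n\r\n'.encode()
mysock.send(cmd)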
It sends a request to retrieve 'romeo.txt' from 'data.pr4e.org'
This sends a GET request to the web server over port 80.
It sends the 'romeo.txt' file to 'data.pr4e.org'
This does not send a file to the web server.
It creates a file named 'romeo.txt'
This does not create a file.
It throws an error because a socket cannot use HTTP.
Sockets can be used to connect to different types of servers using different protocols.
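After a request like the one in this question is sent, the server answers with a status line, headers, a blank line, and then the document. The sketch below shows one way to split such a reply apart; the `raw` bytes are a hypothetical response written out by hand, not captured from data.pr4e.org.

```python
# Hypothetical raw bytes, shaped like what repeated mysock.recv() calls
# would return once joined together.
raw = (
    b"HTTP/1.1 200 OK\r\n"
    b"Content-Type: text/plain\r\n"
    b"\r\n"
    b"But soft what light..."
)

# The first blank line (\r\n\r\n) separates the headers from the body.
head, _, body = raw.partition(b"\r\n\r\n")
lines = head.decode().split("\r\n")

status_line = lines[0]
headers = dict(line.split(": ", 1) for line in lines[1:])

print(status_line)
print(headers)
print(body.decode())
```

This is the same blank-line convention the request uses: the `\r\n\r\n` at the end of `cmd` tells the server that the request headers are finished.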
6.
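Q-6: What does the following code do?
import urllib.request

# urlopen returns a file-like object that yields the response line by line as bytes.
fhand = urllib.request.urlopen('http://data.pr4e.org/romeo.txt')
for line in fhand:
    print(line.decode().strip())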
It creates a file named 'romeo.txt' in 'data.pr4e.org'
urllib.request cannot create files in a web server.
It finds the URLs linked from 'data.pr4e.org' and prints them.
urllib.request is not a spider.
It opens a file named 'http://data.pr4e.org/romeo.txt' in local storage
urllib.request does not handle files in local storage.
It prints the contents of 'romeo.txt' after retrieving it from 'data.pr4e.org'
urllib.request requests the file and reads the response, which the loop then prints line by line.
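The `decode()` and `strip()` calls in this question matter because the response arrives as bytes, one line at a time, each with its trailing newline. The sketch below shows that behaviour without a network connection, assuming an `io.BytesIO` object as a stand-in for the object `urlopen()` returns; the sample text is illustrative.

```python
import io

# Stand-in for the urlopen() result: a binary stream read line by line.
fake_response = io.BytesIO(b"But soft what light\nthrough yonder window breaks\n")

raw_lines = []
text_lines = []
for line in fake_response:
    raw_lines.append(line)                     # bytes, newline still attached
    text_lines.append(line.decode().strip())   # str, newline removed

print(raw_lines)
print(text_lines)
```

Without `decode()`, `print` would show the `b'...'` form of each line rather than its text.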
7.
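Q-7: What does the following code do?
import urllib.request, urllib.parse, urllib.error

# Read the whole image into memory as bytes.
img = urllib.request.urlopen('http://data.pr4e.org/cover3.jpg').read()
# Write the bytes to a local file opened in binary mode.
fhand = open('cover3.jpg', 'wb')
fhand.write(img)
fhand.close()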
It retrieves 'cover3.jpg' and saves it to your computer.
Running the code does not display any output because it saves the file to your computer.
It displays the image 'cover3.jpg'.
It does not output anything on the screen.
It retrieves the URL for downloading 'cover3.jpg'
urllib retrieves the image data itself, not just a URL, and the code writes that data to a file.
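The code in this question reads the entire image into memory before writing it. For large files, a common alternative is to copy the data a block at a time. The following is a sketch of that variant under stated assumptions: `save_in_chunks` is a hypothetical helper, an `io.BytesIO` object stands in for the network response, and the tiny `chunk_size` is chosen only to force several loop iterations.

```python
import io
import os
import tempfile

def save_in_chunks(source, path, chunk_size=4):
    """Copy a binary stream to path one chunk at a time; return bytes written."""
    total = 0
    with open(path, "wb") as fhand:
        while True:
            chunk = source.read(chunk_size)
            if not chunk:            # empty bytes means end of stream
                break
            fhand.write(chunk)
            total += len(chunk)
    return total

# Stand-in for urllib.request.urlopen(...): any object with a read() method.
fake_image = io.BytesIO(b"\xff\xd8\xff\xe0fake-jpeg-bytes")
target = os.path.join(tempfile.mkdtemp(), "cover3.jpg")
written = save_in_chunks(fake_image, target)

print(written, "bytes copied to", target)
```

Like the original, this prints nothing about the image itself; the result is a file on disk.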
8.
Q-8: What does the following regular expression match?
http[s]?://.+?
An exact match of the text 'http[s]?://.+?'
The regex uses special characters, so it does not match this text literally.
'http://' or 'http[s]://' followed by one or more characters
The square brackets denote a character class, and '[s]?' matches zero or one 's'; the brackets themselves are not matched.
'http://' or 'https://' followed by one or more characters.
'[s]?' means zero or one 's', and '.+?' means one or more characters, matched non-greedily.
'https://' followed by one or more characters.
The regex also accepts 'http://' because '[s]?' matches zero or one 's' after 'http'.
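The non-greedy `.+?` is easiest to understand by running the pattern. In the sketch below the sample `text` is made up for illustration. Used alone, `.+?` stops as soon as it has matched one character; followed by a closing quote, it expands only as far as the first `"`.

```python
import re

text = 'Visit <a href="http://data.pr4e.org/page1.htm">one</a> or https://www.py4e.com now'

# On its own, the lazy '.+?' is satisfied after a single character.
bare = re.findall(r'http[s]?://.+?', text)

# Anchored by a closing quote, '.+?' grows until the first '"'.
quoted = re.findall(r'href="(http[s]?://.+?)"', text)

print(bare)     # ['http://d', 'https://w']
print(quoted)   # ['http://data.pr4e.org/page1.htm']
```

Both 'http://' and 'https://' are accepted in each case, because `[s]?` allows the 's' to be absent.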
9.
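Q-9: What does the following code do?
import urllib.request
from bs4 import BeautifulSoup
# ctx is assumed to be an ssl.SSLContext created earlier in the program.
url = "https://www.nytimes.com"
html = urllib.request.urlopen(url, context=ctx).read()
soup = BeautifulSoup(html, 'html.parser')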
It retrieves and displays the webpage
This does not display the webpage. BeautifulSoup parses the webpage retrieved by urllib.request.
It parses the HTML content of the "https://www.nytimes.com" webpage.
This parses all the HTML tags and contents of the webpage.
It downloads the webpage
This does not save any files to the computer.
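The snippet in this question passes a `context=ctx` argument that it never defines. One plausible definition, and only an assumption about what the exercise intends, is an SSL context that skips certificate checks so that HTTPS pages load even when certificates cannot be verified. That is acceptable for classroom exercises but not for production code.

```python
import ssl

# Assumed definition of ctx: an SSL context that ignores certificate errors.
ctx = ssl.create_default_context()
ctx.check_hostname = False       # must be turned off before relaxing verify_mode
ctx.verify_mode = ssl.CERT_NONE

print(ctx.check_hostname, ctx.verify_mode)
```

The order of the two assignments matters: setting `verify_mode` to `ssl.CERT_NONE` while `check_hostname` is still enabled raises a `ValueError`.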
10.
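Q-10: What does the following code do?
import urllib.request
from bs4 import BeautifulSoup

url = "https://www.nytimes.com/"
html = urllib.request.urlopen(url).read()
soup = BeautifulSoup(html, 'html.parser')
# Calling the soup object with a tag name returns every matching tag.
tags = soup('img')
for tag in tags:
    print(tag.get('src', None))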
It retrieves and displays the webpage
urllib retrieves the webpage but does not display it.
It downloads the webpage
This does not save any files to the computer.
It prints the images from 'www.nytimes.com'
BeautifulSoup and html.parser cannot display images.
It prints the 'src' value of every 'img' tag on 'www.nytimes.com'
It prints the image sources listed in the 'src' attribute of the 'img' tags.
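The same idea can be tried on a small HTML string without installing anything or touching the network. The sketch below deliberately swaps in the standard library's `html.parser.HTMLParser` in place of BeautifulSoup; the class name `ImgSrcCollector` and the sample markup are hypothetical. As with `tag.get('src', None)`, an `img` tag with no `src` attribute yields `None`.

```python
from html.parser import HTMLParser

class ImgSrcCollector(HTMLParser):
    """Collect the src attribute of every img tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            # attrs is a list of (name, value) pairs; a missing src gives None.
            self.sources.append(dict(attrs).get("src"))

html = '<p><img src="/a.png"><img alt="no source"><img src="b.jpg"/></p>'
collector = ImgSrcCollector()
collector.feed(html)

print(collector.sources)   # ['/a.png', None, 'b.jpg']
```

The output is a list of text values only; nothing here fetches or displays the images themselves.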