CSIE1000 Assignment 1 - Web Crawler Solved


Requirements

Crawl the announcement page (https://www.csie.ntu.edu.tw/news/news.php?class=101) of the CSIE website within a specified range of dates. (Please use the request headers in the TA sample code.) The results should contain, but are not limited to, the following fields:

Post date
e.g. 2019-05-14

Content
recursively find all the text in <div class="editor content">
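A minimal sketch of the recursive text extraction described for the Content field, using only the standard library's html.parser. The class name EditorContentExtractor and the helper extract_content are hypothetical, and joining the text pieces with single spaces is an assumption; the TA sample code may use a different parser (the environment example mentions lxml).

```python
from html.parser import HTMLParser


class EditorContentExtractor(HTMLParser):
    """Collect every text node nested inside <div class="editor content">."""

    def __init__(self):
        super().__init__()
        self.depth = 0    # > 0 while inside the target div; counts nested divs
        self.chunks = []  # text pieces found inside the target div

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        if self.depth > 0:
            # A div nested inside the target div: track it so that its
            # closing tag is not mistaken for the end of the target div.
            self.depth += 1
        elif dict(attrs).get("class") == "editor content":
            self.depth = 1

    def handle_endtag(self, tag):
        if tag == "div" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        # Keep non-empty text from any descendant of the target div.
        if self.depth > 0 and data.strip():
            self.chunks.append(data.strip())


def extract_content(html):
    """Return all text under the target div, joined by single spaces."""
    parser = EditorContentExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)
```

Tracking the div depth is what makes the search recursive: text inside nested paragraphs, spans, or inner divs is collected, while text after the target div closes is ignored.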

Please save the results to a CSV file that can be opened by Excel using UTF-8. Please note that:

User should be able to specify the path to write the CSV file with the --output argument.
Formats

Each record is on one line.
Fields of a record are separated by a comma "," with no space or new line between.
Strings in the CSV file are enclosed by a pair of double quotation marks (e.g. "I'm string"). Any double quote within a string should be replaced by 2 double quotation marks. For instance, the string: Prof. Yuguang "Michael" Fang, University of Florida should be written as "Prof. Yuguang ""Michael"" Fang, University of Florida"

  • r08922053 Yu-Kai Huang
  • r08922042 Kuang-Yu Jeng
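The CSV format rules can be sketched with Python's csv module, which doubles embedded double quotes by default and can quote every field with csv.QUOTE_ALL. The function names to_csv_text and save_csv are hypothetical, and collapsing newlines inside the content into spaces is an assumption made here to keep each record on one line.

```python
import csv
import io


def to_csv_text(records):
    """Format (post_date, content) pairs as CSV text, one record per line."""
    buffer = io.StringIO()
    # QUOTE_ALL encloses every field in double quotes; the csv module
    # doubles any double quote found inside a field by default.
    writer = csv.writer(buffer, quoting=csv.QUOTE_ALL, lineterminator="\n")
    for post_date, content in records:
        # Assumption: collapse all whitespace runs (including newlines)
        # into single spaces so the record stays on one line.
        writer.writerow([post_date, " ".join(content.split())])
    return buffer.getvalue()


def save_csv(records, path):
    """Write the records to `path` encoded as UTF-8."""
    # newline="" stops Python from translating the line endings again.
    with open(path, "w", encoding="utf-8", newline="") as f:
        f.write(to_csv_text(records))
```

If Excel shows garbled characters, writing with encoding="utf-8-sig" (UTF-8 with a byte order mark) may help it detect the encoding; whether that is acceptable for grading is not stated in the assignment.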

What you should do

Create programming environment (Linux environment (https://docs.google.com/presentation/d/1O43qZ5th7l5kpojirpqSCzVXqZtvL_7WZ7Z05wSCWig/edit?usp=sharing))
Init your git repository with a README

If not, there will be no master branch. Reference this (https://docs.google.com/presentation/d/123JcZ-YwsCXcY6PYHk31_1wCss0ukQFGQ05SrOa6ZIg/edit#slide=id.g6c70dc8c07_0_0).

Clone the repository to local
(optional) Copy TA sample code (https://github.com/kaikai4n/ItC-python-hw-sample-code) to your local repo, push to origin, and star the TA repository
Start programming (Python Tutorial (https://docs.google.com/presentation/d/14pCla_krES-uVRrrv-aW1XtZNFV0ArhNn89WVedeV3Y/edit?usp=sharing))

After finishing the crawler, remember to write a README containing:

Team members’ names
School IDs
Brief introduction to what the project does
Environment
e.g. CSIE Workstation, Python 3.6.2, lxml==4.4.2, tqdm==4.28.1, …
Collaboration contribution (which programming parts you are responsible for)

Put your git URL in a file and upload it to ceiba. Only one person in the team should upload.

What TAs will run

python3 main.py --start-date [start date] --end-date [end date] --output [out filename]

--start-date and --end-date will be in the format of [Year]-[month]-[day]. For instance, 2019-12-09.
--output is the CSV filename to save. For instance, output.csv.
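The command line above could be parsed with argparse roughly as follows. This is only a sketch: the helper names parse_date and parse_args are hypothetical, and making all three arguments required is an assumption.

```python
import argparse
from datetime import datetime


def parse_date(text):
    """Convert a [Year]-[month]-[day] string into a datetime.date."""
    try:
        return datetime.strptime(text, "%Y-%m-%d").date()
    except ValueError:
        raise argparse.ArgumentTypeError(
            f"expected a date like 2019-12-09, got {text!r}"
        )


def parse_args(argv=None):
    """Parse --start-date, --end-date and --output from argv."""
    parser = argparse.ArgumentParser(
        description="Crawl CSIE announcements within a range of dates."
    )
    parser.add_argument("--start-date", type=parse_date, required=True)
    parser.add_argument("--end-date", type=parse_date, required=True)
    parser.add_argument("--output", required=True,
                        help="path of the CSV file to write")
    return parser.parse_args(argv)
```

Passing argv=None makes argparse read sys.argv, so main.py can simply call parse_args(); the parsed values are then available as args.start_date, args.end_date and args.output.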

Score

Hope you get

Python

Run without error
Correctly parse arguments
Output files to correct place and can be opened by Excel and pandas.read_csv

Sort by post date (newest to oldest)

Sleep seconds before every request. This rule is required. You will lose points if you violate the rule.

Contents are correct
If your Python code contains malicious code, all team members will fail.
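Two of the Python rules above, sleeping before every request and sorting by post date, can be sketched as below. The names make_polite_fetcher and select_and_sort are hypothetical, the 1-second default delay is an assumption (the assignment does not state the number of seconds here), and treating both ends of the date range as inclusive is also an assumption.

```python
import time


def make_polite_fetcher(fetch, delay_seconds=1.0, sleep=time.sleep):
    """Wrap `fetch` so that it always sleeps before making a request.

    `fetch` is any callable taking a URL and returning the page; the
    `sleep` parameter exists so the delay can be replaced when testing.
    """
    def polite_fetch(url):
        sleep(delay_seconds)  # wait before every single request
        return fetch(url)
    return polite_fetch


def select_and_sort(records, start_date, end_date):
    """Keep records inside the date range and sort them newest first.

    Each record is a tuple whose first item is the post date as a
    YYYY-MM-DD string, so plain string comparison orders the dates.
    """
    in_range = [r for r in records if start_date <= r[0] <= end_date]
    return sorted(in_range, key=lambda r: r[0], reverse=True)
```

Routing every download through one wrapped function makes it hard to forget the sleep on some code path, for example when following links from the list page to each announcement.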

GitHub

Protect master branch

Pull Request
Always have pull request when merging to master branch

Peer review with comment
(10 pts) Collaboration: people in the team commit codes

(5 pts) Neat and tidy network: Rebase is required when merging code with conflicts. Abnormal branch networks are considered not neat and tidy.

(5 pts) Branch

Merge or delete unnecessary branches after finishing the homework
Branch names should be meaningful
For example, the branch name kai is meaningless

Others

(-10 pts) README contains no team members’ names or school IDs or descriptions or environment
If no collaboration contribution is specified, TAs will assume team members contributed equally to the homework.

The README is generally written in markdown format, but using that format is optional.

If you are interested in how to use markdown, you can reference markdown tutorial (https://hackmd.io/s/features-tw)

Related Links

GitHub Tutorial (https://docs.google.com/presentation/d/123JcZ-YwsCXcY6PYHk31_1wCss0ukQFGQ05SrOa6ZIg/edit?usp=sharing)
TA Sample code (https://github.com/kaikai4n/ItC-python-hw-sample-code)
Python Introduction (https://docs.google.com/presentation/d/14pCla_krES-uVRrrv-aW1XtZNFV0ArhNn89WVedeV3Y/edit?usp=sharing)
Environment Setting (https://docs.google.com/presentation/d/1O43qZ5th7l5kpojirpqSCzVXqZtvL_7WZ7Z05wSCWig/edit?usp=sharing)
