Learning Outcomes
- extract information from a document using a parsing library
Resources
Lab
Get the number of computer labs from a Langara page.
Make a new project directory, langara-computer-labs.
Change to the project directory (cd langara-computer-labs).
Use npm init to initialize the project.
Install the Axios and htmlparser2 dependencies (npm install axios htmlparser2).
Create the following index.js file to get started:
const axios = require("axios");
const htmlparser2 = require("htmlparser2");

// State variable: true while the parser is inside the <title> element.
let inTitle = false;

const parser = new htmlparser2.Parser(
    {
        // Called for every opening tag.
        onopentag(name, attribs) {
            if (name === "title") {
                inTitle = true;
            }
        },
        // Called for text between tags.
        ontext(text) {
            if (inTitle) {
                console.log("Page title: " + text);
            }
            inTitle = false;
        },
        // Called for every closing tag.
        onclosetag(tagname) {
            if (tagname === "html") {
                console.log("That's it!");
            }
        }
    },
    { decodeEntities: true }
);

// The URL to fetch is the first command-line argument.
if (process.argv.length > 2) {
    axios.get(process.argv[2])
        .then(response => {
            // Feed the downloaded HTML to the parser.
            parser.write(response.data);
            parser.end();
        })
        .catch(error => {
            console.log("Could not fetch page.");
        });
} else {
    console.log("Missing URL argument");
}
Test the program as follows:
node index.js https://langara.ca/information-technology/it-services/technology-on-campus/a-building.html
You should see the title of the web page in the console.
We will now modify the program to count the number of computer labs in a building. The page above has that information in a table, so counting the rows in that table gives us the answer. We have to be careful to count the rows in the right table.
To do so, we will start by creating several state variables at the top of the program:
let inTitle = false;
let countRows = false;
let done = false;
let rowCount = 0;
We need to start counting rows when we are in the right table. Fortunately, the tables with the room information have the class attribute set to table-responsive table-basic hide-mobile, so we can key in on that in the onattribute parser event handler:
onattribute(name, value) {
    if (name === "class" && value === "table-responsive table-basic hide-mobile") {
        countRows = true;
    }
},
Once we are counting rows, we count every tr opening tag, so add this to the onopentag event handler:
if (name === "tr" && countRows) {
    rowCount++;
}
Finally, once we get to the table closing tag, we print out the result and set the done flag so we don’t repeat for other tables on the page. The result is rowCount-1, which discounts one row on the assumption that the first row of the table is a header. This can all be done by adding the following to the onclosetag event handler:
if (tagname === "table" && countRows && !done) {
    console.log("There are " + (rowCount - 1) + " computer labs.");
    done = true;
}
Convince yourself that the program works by testing it on the 5 building web pages:
node index.js https://langara.ca/information-technology/it-services/technology-on-campus/a-building.html
node index.js https://langara.ca/information-technology/it-services/technology-on-campus/b-building.html
node index.js https://langara.ca/information-technology/it-services/technology-on-campus/c-building.html
node index.js https://langara.ca/information-technology/it-services/technology-on-campus/g-and-l-building.html
node index.js https://langara.ca/information-technology/it-services/technology-on-campus/t-building.html
Assignment
- Write a program that will print out the city, date/time, and temperature from a weather.gc.ca page.
- Test with the following pages:
https://weather.gc.ca/city/pages/bc-74_metric_e.html
https://weather.gc.ca/city/pages/on-143_metric_e.html
https://weather.gc.ca/city/pages/nl-24_metric_e.html