Make an app, scrape the net

12 minute read

There are millions of apps, but an app that shows the information you want, the way you want it, is not always available. The only solution is to create one.

I figured out a fast and easy way to create PWAs (progressive web apps) using PHP and React. The overall idea is to create a scraper: take HTML from a website, convert it to JSON, and fetch the JSON with a React-based PWA.

In this blog I'll go through every step. If you think it reads like a recipe - a lot of steps with little explanation of why it works - then you're totally right. That's what I'm aiming for. If you want more details, the documentation can be found on github.com 🔗.

For this tutorial I've created a crude PWA that shows the current weather (temperature) for several cities. I've used the table displayed on the KNMI's website 🔗 as input for the app.

0. What do you need?

  • PHP (I'm using valet+)
  • Git
  • node.js v8.15.1 or above
  • npm
  • PHP Simple HTML DOM Parser 🔗
  • React Boilerplate 🔗
  • A website you want to scrape

Now, let's get to work.

1. Turn HTML into JSON

Create a project folder. I'm going for "scraper". Make sure your PHP scripts will be able to run from here.

Download PHP Simple HTML DOM Parser 🔗 and put the (unzipped) simplehtmldom_2_0-RC2 folder in the root of your project. Of course, you can use composer instead.

composer require simplehtmldom/simplehtmldom:2.0-RC2

Create an index.php in the root.

Include Simple HTML DOM like this. Then set the CORS headers and declare the response as JSON.

<?php


# With Composer
# include 'vendor/simplehtmldom/simplehtmldom/simple_html_dom.php';


# Or manual download
include 'simplehtmldom_2_0-RC2/simple_html_dom.php';


// Allowed origin for CORS: scheme and host (plus port), no path or trailing slash.
//$cors = 'https://yourdomainhere.ninja';
$cors = 'http://localhost:3000';


header('Content-Type: application/json; charset=UTF-8');
header('Access-Control-Allow-Origin: ' . $cors, FALSE);
header("Access-Control-Allow-Headers: Content-Type, Access-Control-Allow-Origin");

Now we need to scrape the table containing the data from the KNMI website. Simple HTML DOM will do the job.

As the Simple HTML DOM documentation puts it: "Find tags on an HTML page with selectors just like jQuery."

We'll use it to loop through the HTML structure to create a table array.

function raw_scrape($url) {
  $doc = file_get_html($url);
  $headers = [];
  $rows = [];


  foreach ($doc->find('table tr') as $tr) {
    foreach ($tr->find('th') as $element) {
      array_push($headers, strip_tags($element->plaintext));
    }


    $row = [];
    foreach ($tr->find('td') as $element) {
      array_push($row, $element->plaintext);
    }


    if (!empty($row)) {
      array_push($rows, $row);
    }
  }


  $table = $rows;
  array_unshift($table, $headers);


  return $table;
}

The result will be refined in the next step.

The array still needs to be converted into proper JSON. The following code does this and then echoes the JSON-encoded array.

function scrape ($url) {
  $tempArr = [];
  $table = raw_scrape($url);
  $headers = array_shift($table);


  foreach ($table as $indexR => $row) {
    $data = [];
    foreach ($row as $indexT => $tuple) {
      $data[$headers[$indexT]] = $tuple;
    }


    array_push($tempArr, $data);
  }


  return json_encode($tempArr);
}
echo scrape('https://www.knmi.nl/nederland-nu/weer/waarnemingen');
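
To make the pairing of headers and cells concrete, here is the same transformation sketched in JavaScript. The table is invented sample data (not actual KNMI output) and rowsToObjects is a hypothetical name used only for this illustration.

```javascript
// Sketch of what scrape() does in PHP: pair every cell with its column header.
// The table below is invented sample data, not actual KNMI output.
const table = [
  ['Station', 'Temp (°C)'],
  ['De Bilt', '12.3'],
  ['Maastricht', '13.1'],
];

function rowsToObjects(rows) {
  // The first row holds the headers, the rest are data rows.
  const [headers, ...body] = rows;
  return body.map((row) =>
    Object.fromEntries(row.map((cell, i) => [headers[i], cell])),
  );
}

// One object per table row, keyed by the header text.
console.log(JSON.stringify(rowsToObjects(table), null, 2));
```

Each object's keys are the literal header texts, which is why the React component later on reads the temperature with weather['Temp (°C)'].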

Check the output by visiting your local URL. For valet+ the default is scraper.test. In Firefox, it looks like this:

The "API" 😇
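
If you'd rather check the output with a script than by eye, something like the following could work. It's only a sketch: checkApiOutput is a hypothetical helper, and it assumes the endpoint should return a non-empty array of objects that each have a Station key, which is what the React component in step 2 relies on.

```javascript
// Hypothetical sanity check for the text returned by index.php.
// Returns a list of problems; an empty list means the shape looks right.
function checkApiOutput(text) {
  let data;
  try {
    data = JSON.parse(text);
  } catch (e) {
    return ['not valid JSON'];
  }
  if (!Array.isArray(data) || data.length === 0) {
    return ['expected a non-empty array'];
  }
  const problems = [];
  data.forEach((row, i) => {
    if (row === null || typeof row !== 'object' || Array.isArray(row)) {
      problems.push(`row ${i} is not an object`);
    } else if (!('Station' in row)) {
      problems.push(`row ${i} has no "Station" key`);
    }
  });
  return problems;
}

// Invented sample input, just to show the call.
console.log(checkApiOutput('[{"Station":"De Bilt","Temp (°C)":"12.3"}]'));
```

Paste the response body of your local URL into the call to see whether the scraper still produces the shape the front end expects.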

2. Turn JSON into a website

2.1 Setup React Boilerplate

React Boilerplate is a great starting point with redux, redux-saga, and PWA implementations. It's simple enough for small projects like this, but also suited to complex apps.

Use Git to download React Boilerplate 🔗 to a subfolder of the scraper project, e.g. "js".

git clone https://github.com/react-boilerplate/react-boilerplate.git js

Navigate to the new folder and run

npm run setup && npm run clean

Answer "no" if you don't want to create a repo.

Run npm run generate to create a new container.

npm run generate

Select container and answer the following questions:

  • What should it be called? weather
  • Do you want to wrap your component in React.memo? Yes
  • Do you want headers? No
  • Do you want an actions/constants/selectors/reducer tuple for this container? Yes
  • Do you want sagas for asynchronous flows? (e.g. fetching data) Yes
  • Do you want i18n messages (i.e. will this component use text)? No
  • Do you want to load resources asynchronously? Yes

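A new folder called "Weather" is now created in js/app/containers. It holds the files edited in the next section, such as index.js, Loadable.js, actions.js, constants.js, reducer.js, saga.js and selectors.js.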

Run npm start and visit http://localhost:3000. You'll see a white page.

npm start

2.2 Fetch the JSON

From here on, you'll need to copy-paste a lot of code.

First, create js/app/utils/request.js if it doesn't exist yet, and copy its contents from github.com 🔗.
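
For orientation, a fetch-based helper of this kind usually looks roughly like the sketch below. This is an assumption about its general shape, not a copy of the file from the repository, so use the real one in your project. The function names are illustrative.

```javascript
// Sketch of a small fetch wrapper; not the actual request.js from the repository.

// Pass successful responses through, turn HTTP errors into exceptions.
function checkStatus(response) {
  if (response.status >= 200 && response.status < 300) {
    return response;
  }
  const error = new Error(response.statusText);
  error.response = response;
  throw error;
}

// Responses without a body (204/205) yield null instead of parsed JSON.
function parseJSON(response) {
  if (response.status === 204 || response.status === 205) {
    return null;
  }
  return response.json();
}

// Resolves with the parsed JSON body, rejects on network or HTTP errors.
function request(url, options) {
  return fetch(url, options).then(checkStatus).then(parseJSON);
}
```

In the project, request would be the module's default export so that import request from 'utils/request' works in the saga.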

Let's make the Weather component the default homepage.

Go to js/app/containers/App/index.js and change all mentions of "HomePage" into "Weather". Change the import, and the Route. It will look like this:

import React from 'react';
import { Switch, Route } from 'react-router-dom';


import Weather from 'containers/Weather/Loadable';
import NotFoundPage from 'containers/NotFoundPage/Loadable';


import GlobalStyle from '../../global-styles';


export default function App() {
  return (
    <div>
      <Switch>
        <Route exact path="/" component={Weather} />
        <Route component={NotFoundPage} />
      </Switch>
      <GlobalStyle />
    </div>
  );
}

Your browser will now open the Weather component when visiting http://localhost:3000.

The next step is to fetch the JSON. This happens in js/app/containers/Weather/saga.js. Change it to make it look like this:

import { call, put, takeLatest } from 'redux-saga/effects';
import { LOAD_WEATHER } from './constants';
import { weatherLoaded } from './actions';
import request from 'utils/request';


export function* getWeather() {
  // Local valet+ URL during development, the deployed API in production.
  let requestURL = 'http://scraper.test';
  if (process.env.NODE_ENV === 'production') {
    requestURL = 'https://yoursite.ninja/whateversubfolder';
  }
  try {
    // Fetch the JSON and hand it to the reducer via weatherLoaded().
    const weather = yield call(request, requestURL);
    yield put(weatherLoaded(weather));
  } catch (e) {
    console.error(e);
  }
}


export default function* weatherData() {
  yield takeLatest(LOAD_WEATHER, getWeather);
}
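
Because a saga only yields descriptions of what should happen, you can step through one by hand. The sketch below uses plain-object stand-ins for call and put (they are not redux-saga's implementations) and a simplified getWeather without the try/catch or the environment check, just to show the order of events. The sample payload is invented.

```javascript
// Hypothetical stand-ins for redux-saga's effect creators: plain descriptions.
const call = (fn, ...args) => ({ type: 'CALL', fn, args });
const put = (action) => ({ type: 'PUT', action });

const weatherLoaded = (weather) => ({
  type: 'app/Weather/LOAD_WEATHER_SUCCESS',
  weather,
});

// Never executed here: the generator only describes the call.
const request = () => {
  throw new Error('not called in this sketch');
};

// Simplified version of the saga above.
function* getWeather() {
  const weather = yield call(request, 'http://scraper.test');
  yield put(weatherLoaded(weather));
}

const steps = getWeather();

// Step 1: the saga asks for request('http://scraper.test') to be called.
const first = steps.next().value;

// Step 2: we play the middleware and feed a result back in;
// the saga answers with a PUT carrying that result.
const sample = [{ Station: 'De Bilt', 'Temp (°C)': '12.3' }];
const second = steps.next(sample).value;

// Step 3: nothing left to do.
const finished = steps.next().done;

console.log(first.type, first.args, second.action.type, finished);
```

In the real app, redux-saga's middleware does the stepping: it performs the described request and dispatches the described action.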

Note the request import: this is the file you copied earlier. Now copy and paste the following files.

actions.js

import { LOAD_WEATHER, LOAD_WEATHER_SUCCESS } from './constants';


export function loadWeather(weather) {
  return {
    type: LOAD_WEATHER,
    weather
  };
} 


export function weatherLoaded(weather) {
  return {
    type: LOAD_WEATHER_SUCCESS,
    weather
  };
}

reducer.js

import produce from 'immer';
import { LOAD_WEATHER_SUCCESS } from './constants';


export const initialState = {};


const weatherReducer = (state = initialState, action) =>
  produce(state, ( draft ) => {
    switch (action.type) {
      case LOAD_WEATHER_SUCCESS:
        draft.weather = action.weather;
        break;
    }
  });


export default weatherReducer;

selectors.js

import { createSelector } from 'reselect';
import { initialState } from './reducer';


const selectWeatherDomain = state => state.weather || initialState;


const makeSelectWeather = () =>
  createSelector(
    selectWeatherDomain,
    substate => substate.weather,
  );


export default makeSelectWeather;
export { selectWeatherDomain };

constants.js

export const LOAD_WEATHER = 'app/Weather/LOAD_WEATHER';
export const LOAD_WEATHER_SUCCESS = 'app/Weather/LOAD_WEATHER_SUCCESS';

Last and largest, index.js

import React, { memo, useEffect } from 'react';
import PropTypes from 'prop-types';
import { connect } from 'react-redux';
import { createStructuredSelector } from 'reselect';
import { compose } from 'redux';


import { useInjectSaga } from 'utils/injectSaga';
import { useInjectReducer } from 'utils/injectReducer';
import { loadWeather } from './actions';


import makeSelectWeather from './selectors';
import reducer from './reducer';
import saga from './saga';


export function Weather(props) {
  useInjectReducer({ key: 'weather', reducer });
  useInjectSaga({ key: 'weather', saga });


  useEffect(() => {
    props.initWeather();
  }, []);


  if (props.weather) {
    return (
      <div>
        {props.weather.map(weather => (
          <div key={weather.Station}>
            {weather.Station} {weather['Temp (°C)']}
          </div>
        ))}
      </div>
    );
  }


  return 'No weather';
} 


Weather.propTypes = {
  dispatch: PropTypes.func.isRequired,
  initWeather: PropTypes.func,
  weather: PropTypes.array,
};


const mapStateToProps = createStructuredSelector({
  weather: makeSelectWeather(),
});


function mapDispatchToProps(dispatch) {
  return {
    initWeather: () => {
      dispatch(loadWeather());
    },
    dispatch,
  };
} 


const withConnect = connect(
  mapStateToProps,
  mapDispatchToProps,
);


export default compose(
  withConnect,
  memo,
)(Weather);

In short, what happens is this: index.js is loaded by the router and calls initWeather(). This dispatches loadWeather() from actions.js, which triggers the saga to fetch the JSON. Once the data is returned, it goes through weatherLoaded() and passes through the reducer, which sets the state. Finally, the selector hands the data to index.js, which puts it on screen. 🤯
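
The store side of that flow can be sketched in a few lines of plain JavaScript. This is an illustration only: it swaps immer and reselect for an object spread and a plain function, and the sample payload is invented.

```javascript
// Plain-JavaScript sketch of the store side of the flow (no immer, no reselect).
const LOAD_WEATHER_SUCCESS = 'app/Weather/LOAD_WEATHER_SUCCESS';
const initialState = {};

const weatherLoaded = (weather) => ({ type: LOAD_WEATHER_SUCCESS, weather });

// Same effect as the immer-based reducer: store the payload under `weather`.
function weatherReducer(state = initialState, action) {
  switch (action.type) {
    case LOAD_WEATHER_SUCCESS:
      return { ...state, weather: action.weather };
    default:
      return state;
  }
}

// Same lookup as the selector: root state -> `weather` slice -> `weather` list.
const selectWeather = (rootState) =>
  (rootState.weather || initialState).weather;

// Invented sample payload standing in for the fetched JSON.
const sample = [{ Station: 'De Bilt', 'Temp (°C)': '12.3' }];

const before = { weather: weatherReducer(undefined, { type: '@@INIT' }) };
const after = { weather: weatherReducer(before.weather, weatherLoaded(sample)) };

console.log(selectWeather(before)); // undefined, so the component shows 'No weather'
console.log(selectWeather(after)); // the sample list, so the component renders rows
```

Note the double "weather": the outer one is the key passed to useInjectReducer, the inner one is the field set by the reducer.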

Now let's check localhost:3000. There is the data!

3. Make the PWA app

To get everything running you'll need to upload the index.php and simplehtmldom_2_0-RC2 to a server.

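First check your index.php and set the $cors variable to the origin of your site. Use the scheme and host only, without the subfolder, because browsers match this header against the page's origin.

$cors = 'https://yoursite.ninja';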
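
A quick way to verify that a $cors value is a bare origin (scheme, host and optional port, with no path or trailing slash, since that is what the browser sends in its Origin header) is sketched below. isBareOrigin is a hypothetical helper, and the sketch assumes a runtime with a global URL class, such as a browser console or a recent Node.js.

```javascript
// Hypothetical helper: true only if `value` is a bare origin
// (scheme + host + optional non-default port, no path, no trailing slash).
// Assumes a global URL class (browsers, recent Node.js versions).
function isBareOrigin(value) {
  try {
    return new URL(value).origin === value;
  } catch (e) {
    return false; // not a parseable absolute URL
  }
}

console.log(isBareOrigin('http://localhost:3000'));
console.log(isBareOrigin('https://yoursite.ninja'));
console.log(isBareOrigin('https://yoursite.ninja/whateversubfolder'));
```

The first two calls report a usable value; the third does not, because it carries a path.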

Next, check js/app/containers/Weather/saga.js and make sure the production URL points to the folder that holds index.php. Choose a relative path if you prefer.

if (process.env.NODE_ENV === 'production') {
  requestURL = 'https://yoursite.ninja/whateversubfolder';
}

If you're releasing to a subfolder like in this tutorial, change publicPath in js/internals/webpack/webpack.base.babel.js:

publicPath: '/whateversubfolder/',

And change the route in js/app/containers/App/index.js to

<Route exact path="/whateversubfolder" component={Weather} />

This will also change your local route. The site will be available at http://localhost:3000/whateversubfolder.

Now build the project and copy all files from js/build, including .htaccess, to your server's subfolder.

npm run build

4. Tweaks

Go to js/internals/webpack/webpack.prod.babel.js and check the info under WebpackPwaManifest. Change to your liking.

      name: 'Weather',
      short_name: 'Weather app tutorial',
      description: 'Best app ever!',
      background_color: '#00ffff',
      theme_color: '#ffff00',

In js/app/images you can change the images for custom (fav)icons. Rebuild and upload to see the changes.

5. Use the PWA

Navigate to your site e.g. https://yoursite.ninja/whateversubfolder in your favorite browser. In my case it's emmanuelweethetwel.nl/weather in Chrome for Android.

  • Add to home screen
  • Adjust the name
  • Drag to the home screen
  • Here is the icon
  • Splash screen
  • App!

6. Afterthoughts

This is a simple, quick-and-dirty tutorial. It's a starting point to get acquainted with React, redux-saga and asynchronicity. To make it more amusing I've combined it with scraping. Scraping is fun but useless if you need a stable source of data, and it should be avoided in a professional setting.

This journey resulted in this web app. It displays all the info from De Bilt. See emmanuelweethetwel.nl/knmi/

Check out the tutorial code on github.com 🔗.

I hope this inspired you to create your own app!

Emmanuel Maltete

Backend & frontend developer, Drupal