John Pace
Data Scientist, husband, father of 3 great daughters, 5x Ironman triathlon finisher, just a normal guy who spent a lot of time in school.
Let’s explore data science, artificial intelligence, machine learning, and other topics together.

Is combining data from different sources really effective?

11/30/2017

Malaysia (From: https://www.malaysiaairlines.com/us/en/destinations.html)

As more and more data becomes available, companies and researchers can use it to attempt to solve business problems or unanswered scientific questions in ways that were not possible even a few years ago.  But is the use of these disparate data sources really valid?  In some cases, the answer is yes.  However, I believe that in many cases, the answer is significantly more complicated.

In a 2016 article in Proceedings of the National Academy of Sciences (PNAS), a top-rated scientific journal, Ethan Deyle and colleagues investigated the "Global environmental drivers of influenza." In the paper, they used statistical tests, including time series analysis, to look at the effects of temperature, absolute and relative humidity, and precipitation on influenza outbreaks in different countries.  I am not going to look at the technical details, which can be found in the paper, but rather at the data sources used.

Data for the work came from the following sources:
  • WHO's number of flu cases per week - FluNet. 

  • Weekly temperature and absolute humidity - taken from the National Oceanic and Atmospheric Administration (NOAA) Global Surface Summary of the Day. A single value was calculated for each country by taking a simple average over all available stations.

  • Precipitation - combined National Centers for Environmental Prediction Climate Forecast System (NCEPCFS)
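
To make the "simple average over all available stations" step concrete, here is a minimal sketch in Python. The station names and temperatures are invented for illustration and are not NOAA data, and the helper names are my own; the paper's actual processing may differ. The point is only that a single country-level number can hide a wide spread between stations.

```python
from statistics import mean, pstdev

# Hypothetical weekly temperatures (degrees C) for three made-up stations.
# These values are assumptions for illustration only, not real NOAA readings.
station_temps_c = {
    "coastal_station": 28.0,
    "highland_station": 18.0,
    "inland_station": 26.0,
}


def country_value(readings):
    """Collapse per-station readings into one value with a simple, unweighted mean."""
    return mean(readings.values())


def station_spread(readings):
    """Population standard deviation across stations: the variation the mean hides."""
    return pstdev(readings.values())


country_avg = country_value(station_temps_c)
spread = station_spread(station_temps_c)

print(f"Country-level value: {country_avg:.1f} C")
print(f"Spread across stations: {spread:.1f} C")
```

An unweighted mean also treats every station equally, regardless of how many people live near it, which matters when the other variable is a count of flu cases.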

One major difficulty of using this data is the lack of consistency of monitoring locations.  I will use Malaysia, one of the countries that was investigated, as an example.  Malaysia has a total area of 127,724 square miles and a population of over 31,000,000 per Wikipedia.

Here are some of the challenges the data present:
  • The number of flu cases is reported per country, so the data provide no geographic granularity.  From the WHO website, there were approximately 40 cases of influenza reported in 2013 (one of the years included in the study).  This is roughly one reported case per 3,193 square miles.  The location of these cases was not reported.
  • The weekly temperature and absolute humidity values were an average over all available stations.  In a country of over 127,000 square miles, there will be significant differences in relative and absolute humidity between cities and regions, potentially even within cities.  According to the NOAA website, there were 44 monitoring sites, or about one per 2,903 square miles.
  • The precipitation data, taken from NCEPCFS, assumed standard atmospheric pressure, an assumption that was likely violated at some monitoring stations.
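
The coverage figures in the list above are simple divisions, and a short sketch makes them easy to re-check. The inputs are the numbers quoted in this post (area per Wikipedia, approximate 2013 case count from the WHO website, station count from the NOAA website); the function and variable names are my own.

```python
# Figures quoted in the post; the case count is approximate.
AREA_SQ_MILES = 127_724
REPORTED_FLU_CASES_2013 = 40
NOAA_STATIONS = 44


def sq_miles_per_unit(area_sq_miles, count):
    """Average area, in square miles, represented by each case or station."""
    return area_sq_miles / count


area_per_case = sq_miles_per_unit(AREA_SQ_MILES, REPORTED_FLU_CASES_2013)
area_per_station = sq_miles_per_unit(AREA_SQ_MILES, NOAA_STATIONS)

print(f"Square miles per reported flu case: {area_per_case:,.1f}")
print(f"Square miles per monitoring station: {area_per_station:,.1f}")
```

Both numbers are averages over the whole country, so they say nothing about where the cases or stations actually cluster.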

This lack of granularity, together with the reliance upon datasets that record data in different locations, at different scales, and under assumptions that may or may not be correct, is the source of the difficulty.  Are the conclusions valid?  Would the same results be obtained if the data came from wunderground.com, which, for international data, uses automated weather stations located at airports and owned by government agencies and international airports, as well as over 8,000 personal weather stations?

This disparity between data sources can cause major difficulties in building predictive models or even in determining correlations between variables.  Thus, a company that builds a model to forecast sales using data from one source may build a completely different model using data from a second source.  While we can only work with the data we have available, it is critical to attempt to compile as much data as possible, particularly if multiple sources are available, in order to make predictions.  I think this is not only a major challenge for data scientists, but also a unique opportunity.

If you have questions and want to connect, you can message me on LinkedIn or Twitter. Also, follow me on Twitter @pacejohn and LinkedIn https://www.linkedin.com/in/john-pace-phd-20b87070/

#datascience #datasources #predictivemodels #pnas #flu #noaa #who #proceedingsofthenationalacademyofscience #malaysia

Deyle ER, Maher MC, Hernandez RD, Basu S, Sugihara G. Global environmental drivers of influenza. Proc Natl Acad Sci U S A. 2016;113(46):13081-13086. doi:10.1073/pnas.1607747113