As more and more data becomes available, companies and researchers can use it to tackle business problems and unanswered scientific questions in ways that were not possible even a few years ago. The question I will address in this post is whether the use of these disparate data sources is really valid. In some cases, the answer is yes. However, I believe that in many cases, the answer is significantly more complicated.
In a 2016 article in Proceedings of the National Academy of Sciences (PNAS), a top-rated scientific journal, Deyle et al. investigated the "Global environmental drivers of influenza" (https://www.ncbi.nlm.nih.gov/pubmed/27799563). In the paper, they used statistical methods, including time series analysis, to look at the effects of temperature, absolute and relative humidity, and precipitation on influenza outbreaks in different countries. I am not going to look at the technical details, which can be found in the paper, but rather at the data sources used.
Data for the work came from the following sources:
One major difficulty in using this data is the inconsistency of monitoring locations. I will use Malaysia, one of the countries that was investigated, as an example. According to Wikipedia, Malaysia has a total area of 127,724 square miles and a population of over 31,000,000.
The source of the difficulty is this lack of granularity, combined with a reliance on datasets that record data in different locations, at different scales, and under assumptions that may or may not be correct. Are the conclusions valid? Would the same results be obtained using data from wunderground.com, which, for international data, draws on automated weather stations located at airports and owned by government agencies and international airports, as well as over 8,000 personal weather stations? (https://www.wunderground.com/about/data)
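To make the concern concrete, here is a minimal sketch in Python of how the same underlying relationship can look quite different depending on which data source measures the driver. Everything in it is an assumption made for illustration: the numbers are synthetic, the names (simulate_two_sources, source_a, source_b) are hypothetical, and none of it comes from Deyle et al. or from any real weather or influenza dataset. The idea is simply that one source records temperature close to where the outcome occurs, while the other records it with much more error, for example because its station is far away.

```python
import random


def ols_slope(x, y):
    """Slope of the least-squares line y = a + b * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    return sxy / sxx


def simulate_two_sources(seed=0, n=500):
    """Fit the same outcome against two synthetic temperature sources.

    All values are made up for illustration; nothing here is real data.
    """
    rng = random.Random(seed)

    # Assumed "true" local temperature that actually drives the outcome.
    true_temp = [rng.gauss(27.0, 2.0) for _ in range(n)]

    # Assumed outcome: falls by 3 units per degree, plus a little noise.
    outcome = [100.0 - 3.0 * t + rng.gauss(0.0, 1.0) for t in true_temp]

    # Source A: small measurement error (station near the population).
    source_a = [t + rng.gauss(0.0, 0.5) for t in true_temp]

    # Source B: large measurement error (station far from the population).
    source_b = [t + rng.gauss(0.0, 3.0) for t in true_temp]

    return ols_slope(source_a, outcome), ols_slope(source_b, outcome)


slope_a, slope_b = simulate_two_sources()
print(f"slope estimated from source A: {slope_a:.2f}")
print(f"slope estimated from source B: {slope_b:.2f}")
```

In this toy setup the slope fitted from the noisier source should come out noticeably closer to zero than the slope fitted from the more accurate source, even though both describe the same underlying effect. Real data sources can differ in many more ways than measurement noise, so this is only a simplified picture of the problem.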
This disparity between data sources can cause major difficulties in building predictive models or even in determining correlations between variables. Thus, a company that builds a model to forecast sales using data from one source may build a completely different model using data from a second source. While we can only work with the data we have available, it is critical to compile as much data as possible before making predictions, particularly when multiple sources are available. I think this is not only a major challenge for data scientists, but also a unique opportunity.