"The study shows a positive correlation exists that is unlikely due to chance."
It doesn't, and cannot say that (not correctly, anyway). Pretending for the moment that the garbage statistical methods used were actually robust, appropriate, and altogether decent, and that the sampling was likewise unproblematic, the analyses rest on statistical significance as determined by p values.

Now, personally I (like many) think that the modern combination found in most textbooks, and used in this and many other studies, is abominable, and I find it deeply problematic that one of the most widely used frameworks for statistical inference is the only method I know of that has been wholly, severely, and soundly rejected as fundamentally flawed and unsound since before it existed. Normally this is not possible, because you have to wait for this or that method, tool, metric, or model to exist before it can be criticized. In this case, however, the use of p values, significance, and inference comes out of a merging of two warring factions in the earliest days of modern statistics: Ronald Fisher on the one hand, and (Egon) Pearson and Jerzy Neyman on the other. Fisher's work came earlier, and Pearson and Neyman introduced their approach partly in response to Fisher's and as a rejection of fundamental components of Fisher's framework. Fisher attacked their approach viciously. The "modern" union still taught today is mainly a combination of components of both approaches that would have disgusted followers of either camp (and the founders themselves).
That said, even if one believes that the use of p values and significance testing as employed in the study is not seriously flawed, and not a primary factor in the many "replication crises" and similar problems (flawed findings, failures to replicate, and so forth) across numerous fields, we are still left with the fact that the interpretation of p values does not allow the researchers to conclude that the results are "unlikely due to chance."
Rather, the ONLY conclusion one can draw from a "significant" p value is this: IF the null hypothesis is true AND the observations are i.i.d., THEN results like those found, or more extreme ones, occur by chance with the probability given by the p value.
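To make the conditional nature of that statement concrete, here is a minimal simulation; nothing in it is from the study, and the data are pure noise invented for illustration:

```python
# Minimal sketch of the conditional above: a world where the null is true
# and the observations really are i.i.d. None of this is the study's data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n_datasets, n_obs = 10_000, 100

significant = 0
for _ in range(n_datasets):
    # Null world: x and y are independent, each an i.i.d. normal sample.
    x = rng.normal(size=n_obs)
    y = rng.normal(size=n_obs)
    r, p = stats.pearsonr(x, y)
    if p < 0.05:
        significant += 1

print(significant / n_datasets)  # ~0.05
```

Roughly 5% of these pure-chance datasets produce a "significant" correlation at the 0.05 level, which is exactly what the conditional statement promises, and all it promises. In particular, it says nothing about the probability that the null is true given the data.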
Of course, the treatment of the variables, the correlation measures used, the assumptions about the distributions, the selection of years, the aggregation methods, etc., used in the study are deeply, deeply problematic. That's without getting into the exclusion of data from people who answered "unsure" about evolution, the differing and problematic methods used to determine belief in evolution, the fact that the measures of "better people" (e.g., measures of racism, attitudes towards LGBTQ people, etc.) are based on questions that might be suitable used singly rather than thrown into some GLM-type model you can crank out with a few lines and clicks in SAS, and the overall massaging of the data just to get outputs. All of it is so very, very problematic (and, unfortunately, typical).
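To illustrate just how little effort "a few lines and clicks" involves, here is a hypothetical sketch in Python with statsmodels rather than SAS; the variable names and data are invented, and this is not the study's actual model:

```python
# Hypothetical sketch only: synthetic data, invented variable names.
# The point is how few lines it takes to "crank out" p-values, not what
# the study actually fit.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 2_000
df = pd.DataFrame({
    "believes_evolution": rng.integers(0, 2, n),       # crude binary coding
    "racism_item": rng.integers(1, 6, n),              # a single Likert item
    "survey_year": rng.choice([2008, 2012, 2016], n),
})

# Dump the items into a GLM and read off p-values, with no attention to
# measurement quality, distributional assumptions, or whether anything
# here is even plausibly i.i.d.
model = smf.glm("racism_item ~ believes_evolution + C(survey_year)", data=df).fit()
print(model.summary())  # p-values pop out regardless of whether any of this is sound
```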
Mining large surveys conducted by other institutions or groups can be, and is, fraught with difficulties, not the least of which is defining just what constitutes the random variable one is assuming to be i.i.d., and with what justification. A lot of work has to be done to ensure robustness even with relatively few factors or parameters, including comparisons among different methods of determining the weights or coefficients. Little of this was done; instead, a large number of variables across survey years, and across surveys of very different types, were aggregated poorly and investigated with undergrad modeling methods to spit out p-values.
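A quick sketch of why this matters, again with purely synthetic data: mine enough noise variables against an outcome and "significant" p values appear on their own.

```python
# Synthetic illustration: 100 noise "survey items" tested against a noise
# "outcome"; count how many come out "significant" at the 0.05 level.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
n_obs, n_vars = 500, 100
outcome = rng.normal(size=n_obs)             # noise outcome
items = rng.normal(size=(n_vars, n_obs))     # 100 noise survey items

hits = [i for i in range(n_vars)
        if stats.pearsonr(outcome, items[i])[1] < 0.05]
print(f"{len(hits)} of {n_vars} noise items look 'significant'")  # expect ~5
```

Expect around five spurious "findings" from nothing at all; scale that up to dozens of variables across multiple survey waves and aggregation schemes, and the p-values alone tell you very little.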