Good use of Neo4j? River network database with linear regression on height gages

Howdy! I'm an MS Analytics student at Georgia Tech and a whitewater kayaker, and I have a project idea that I'm trying to flesh out. There's a discipline of whitewater kayaking called "creeking", which, as you might guess, is running creeks and smaller rivers. These are usually not dam-released; they run when nature decides they run. There's plenty of USGS data for many of these rivers, with flow and height gages. There's plenty of historical weather data. There's also watershed data, which is pretty easy to find. The website RainPursuit | Chase the Good Flow does a sweet job stitching this data together. What nobody is doing is predicting what will run, where, and for how long.

I'm wondering if this should be done in Neo4j because rivers and streams seem like an obvious example of a network graph to me. Every node is directly affected only by the nodes upstream from it, right? So, in theory, you build a network, then for each node you calculate the predicted weather over the desired interval at all points upstream, then you map that back to a live prediction for that point. I'm sure I could skip the network graph and just see how every point affects every other point, but it seems like that would introduce a lot of noise. What do you guys think?
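A minimal sketch of the "only the upstream nodes matter" idea, in plain Python rather than Neo4j. The node names and the edge list are hypothetical, made up for illustration; each edge points from an upstream node to the node it drains into.

```python
from collections import defaultdict, deque

# Hypothetical toy network: each pair is (upstream_node, downstream_node).
EDGES = [
    ("creek_a", "confluence_1"),
    ("creek_b", "confluence_1"),
    ("confluence_1", "gage_main"),
    ("creek_c", "gage_main"),
    ("gage_main", "outlet"),
]

def build_upstream_index(edges):
    """Map each node to the list of nodes that drain directly into it."""
    index = defaultdict(list)
    for upstream, downstream in edges:
        index[downstream].append(upstream)
    return index

def all_upstream(node, index):
    """Return every node whose water eventually reaches `node`."""
    seen = set()
    queue = deque([node])
    while queue:
        current = queue.popleft()
        for parent in index.get(current, []):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen

index = build_upstream_index(EDGES)
upstream_of_gage = all_upstream("gage_main", index)
print(sorted(upstream_of_gage))
```

Restricting the inputs for a gage to this upstream set is what would keep unrelated points out of the model.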

Honestly, this sounds like a perfect use case to me. However, you are going to run into one very big roadblock: geography. To make this kind of prediction accurate, you'll need models of all the watersheds, groundwater flow, predicted rainfall, and every little creek bed and stream in an area. That said, has it been your experience that you can combine information from RainPursuit with weather forecasts to gauge whether a specific creek or river will be running? If so, then you can definitely do what you're thinking, both more accurately and for all streams at once.

Where are you going to get all the raw data?

This is what everybody who comes at this from an engineering point of view tells me, but I don't think it's necessarily the case for what I'm trying to do. The groundwater flow, geography, and creek bed issues can all be modeled, but I don't think you need to. Their impacts live directly in the gage data. You won't be able to separate them out or model them from it, but if that's not the goal, why put in the extra work? If you look at how a gage reacts to rainfall in its watershed over ten years, you're going to see the geography reflected in the correlation. If you look at how that gage corresponds to one up the network, you're going to account for the geography and topography between them. And since live data is coming in constantly, you can improve the model over time.
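A minimal sketch of the kind of regression described above: fit a straight line relating an upstream gage height to a downstream gage height some fixed lag later. The numbers are made up for illustration and are not real USGS readings; the single predictor and the fixed lag are simplifying assumptions.

```python
def fit_line(x, y):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Made-up values for illustration only.
upstream_height_ft = [1.0, 1.5, 2.0, 2.5, 3.0]
# Downstream gage height observed a fixed lag later (assumed already aligned).
downstream_height_ft = [2.1, 2.9, 4.1, 4.9, 6.0]

slope, intercept = fit_line(upstream_height_ft, downstream_height_ft)

# Predict the downstream height for a new upstream reading of 2.2 ft.
predicted = slope * 2.2 + intercept
print(round(slope, 3), round(intercept, 3), round(predicted, 3))
```

The fitted slope and intercept absorb whatever the terrain between the two gages does to the water, which is the point being argued: the geography shows up in the coefficients without being modeled explicitly.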

USGS has a lot of good data. Weather data shouldn't be too hard to come by either. I'm just not sure what would actually help me build the graph.

Sounds like you've got something workable if you can get a CSV or JSON of the flow gage data and correlate that against a graph of the streams. Are you thinking you'd manually build the graph, then tie in the gage data somehow?

I wasn't able to find the raw gage data. Can you share a link or a sample here?

Here is some raw gage data. Switch to tabbed data and you can download 15-minute readings going back almost 20 years: USGS Current Conditions for USGS 02379500 CARTECAY RIVER NEAR ELLIJAY, GA
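A rough sketch of reading that kind of tab-separated download in Python. It assumes the usual USGS layout of `#` comment lines, a header row, and a column-format row before the data. The sample text, the `gage_height_ft` column name, and the readings are made up for illustration; a real download uses coded column names, so check the header of the actual file.

```python
import csv
import io
from datetime import datetime

# Made-up sample in the shape of a USGS tab-separated download.
SAMPLE_RDB = (
    "# Illustrative data only, not real readings\n"
    "agency_cd\tsite_no\tdatetime\ttz_cd\tgage_height_ft\n"
    "5s\t15s\t20d\t6s\t14n\n"
    "USGS\t02379500\t2020-01-01 00:00\tEST\t1.52\n"
    "USGS\t02379500\t2020-01-01 00:15\tEST\t1.55\n"
    "USGS\t02379500\t2020-01-01 00:30\tEST\t\n"
)

def parse_rdb(text, value_column):
    """Return (timestamp, value) pairs, skipping rows with no reading."""
    lines = [ln for ln in text.splitlines() if ln and not ln.startswith("#")]
    header = lines[0].split("\t")
    # lines[1] is the column-format row (e.g. "5s 15s ..."), so data starts at 2.
    reader = csv.DictReader(
        io.StringIO("\n".join(lines[2:])), fieldnames=header, delimiter="\t"
    )
    readings = []
    for row in reader:
        raw = row[value_column]
        if raw:  # empty or missing field means no reading for that interval
            stamp = datetime.strptime(row["datetime"], "%Y-%m-%d %H:%M")
            readings.append((stamp, float(raw)))
    return readings

readings = parse_rdb(SAMPLE_RDB, "gage_height_ft")
print(readings)
```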

Building the graph would be a nightmare, I think. What would be really nice is if I could isolate the rivers and streams on Google Maps and transform that into a graph, where river width and river distance were weights of the edges.

I did find these guys which could have some potential for building the network:

The water-flow data looks good. Most of the other data isn't very usable.
Seems doable.

Node for every stream intersection. Node for every water-flow data collection point. You got yourself a stream graph. You'll probably be building a good portion of that graph by hand, from looking at a map. Start with a small area?
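A small sketch of that layout, assuming you have typed in a handful of nodes and edges by hand from a map. The labels (`Confluence`, `Gage`), the `FLOWS_INTO` relationship, the `distance_km` property, and the distance value are all hypothetical choices, not a required Neo4j schema. The script only builds Cypher strings; it does not connect to a database.

```python
# Hand-entered toy data; the IDs, labels, and distance are made up.
NODES = [
    ("confluence_1", "Confluence"),  # a stream intersection
    ("gage_02379500", "Gage"),       # a water-flow data collection point
]
EDGES = [
    # (upstream_id, downstream_id, distance_km)
    ("confluence_1", "gage_02379500", 3.2),
]

def node_statement(node_id, label):
    """Cypher that creates the node if it does not already exist."""
    return f"MERGE (:{label} {{id: '{node_id}'}})"

def edge_statement(src, dst, km):
    """Cypher that links an upstream node to the node it flows into."""
    return (
        f"MATCH (a {{id: '{src}'}}), (b {{id: '{dst}'}}) "
        f"MERGE (a)-[:FLOWS_INTO {{distance_km: {km}}}]->(b)"
    )

statements = [node_statement(i, label) for i, label in NODES]
statements += [edge_statement(s, d, km) for s, d, km in EDGES]

for stmt in statements:
    print(stmt)
```

For anything beyond a toy, passing values as query parameters instead of formatting them into the string would be the safer choice.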