|
Fast Downloading of GRIB Files
|
|
Partial http transfers
|
| |
|
Introduction
|
|
Downloading meteorological data can be a pain. Servers are under-powered,
connections are slow and the "bean counters" figure that 80 GB will store
a trillion spreadsheets so who would want more disk space? Can't help with
the last problem but downloading data in GRIB files can be made
faster.
Often people only need a few fields from a GRIB file. For
example, the GFS forecasts contain 325 fields per forecast time.
Many people are only interested in a few fields such as
the precipitation or 500 mb heights. Assuming we only wanted
two fields, downloading the entire file (26 MB) to get 0.2 MB of
data is just silly.
|
|
If You are Lucky, it is Simple
|
|
Some datasets have pre-configured scripts to download the data.
See Part 2 for more information.
|
|
Details
|
|
The http protocol allows "random access" reading;
however, that means that we need an index file and an http
program that supports random access. For the index file, we
can modify a
wgrib inventory. For the random-access http program, we
can use cURL. Both are freely
available, widely used, work on many platforms and are easily
scripted/automated/put into a cronjob.
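A single partial transfer can be sketched with cURL's -r (byte range) option. This is only an illustration: the function name fetch_range and the CURL variable are assumptions and are not part of get_inv.pl or get_grib.pl.

```shell
#!/bin/sh
# Sketch: download bytes START..END (inclusive) of URL into OUTPUT.
# fetch_range and the CURL override are hypothetical names, not part of
# the get_inv.pl/get_grib.pl scripts.
fetch_range() {
    # usage: fetch_range URL START END OUTPUT
    # -f: fail on HTTP errors, -s: silent, -r: byte range, -o: output file
    ${CURL:-curl} -f -s -r "$2-$3" "$1" -o "$4"
}
```

For example, "fetch_range GRIB_URL 0 37161 first.grb" would request only the first 37162 bytes of the file (the byte offsets here are made up). Setting CURL lets you point at a curl binary that is not on your path.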
The basic format of the quick download is,
   get_inv.pl INV_URL | grep FIELDS | get_grib.pl GRIB_URL OUTPUT
  
   INV_URL is the URL of a wgrib inventory
       ex. http://nomad3.ncep.noaa.gov/pub/gfs/rotating/gblav.t00z.pgrbf12.inv
  
   FIELDS is a string that selects the desired fields (wgrib compatible)
       ex. ":HGT:500 mb:"
       see the
wgrib home page for more information and tricks on using grep and egrep
  
   GRIB_URL is the URL of the grib file
       ex. http://nomad3.ncep.noaa.gov/pub/gfs/rotating/gblav.t00z.pgrbf12
  
   OUTPUT is the name for the downloaded grib file
The "get_inv.pl INV_URL" downloads the wgrib inventory off the net and adds
a range field. The "grep FIELDS" uses the grep command to select desired
fields from the inventory. Use of the "grep FIELDS" is similar to the
procedure used when using wgrib to extract fields. The "get_grib.pl
GRIB_URL OUTPUT" uses the filtered inventory to select the fields
from GRIB_URL to download. The selected fields are saved in OUTPUT.
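The idea behind the range field can be sketched in a few lines of shell. This is not the code in get_inv.pl, and the exact format that get_inv.pl writes may differ; the sketch assumes a wgrib -s style inventory where field 1 is the record number and field 2 is the byte offset, and it uses made-up offsets. The names add_ranges and list_ranges are hypothetical.

```shell
#!/bin/sh
# Sketch: append ":range=START-END" to each inventory line.
# A record ends one byte before the next record starts; the last
# record gets an open-ended range ("START-").
add_ranges() {
    awk -F: '
        NR > 1 { print prev ":range=" start "-" ($2 - 1) }
        { prev = $0; start = $2 }
        END { if (NR > 0) print prev ":range=" start "-" }
    '
}

# Sketch: pull the byte ranges back out of a (filtered) inventory.
list_ranges() {
    sed -n 's/.*:range=\([0-9]*-[0-9]*\)$/\1/p'
}

# Made-up three-record inventory for illustration.
printf '%s\n' \
    '1:0:d=05012100:HGT:1000 mb:anl:NAve=0' \
    '2:37162:d=05012100:HGT:500 mb:anl:NAve=0' \
    '3:70212:d=05012100:TMP:1000 mb:anl:NAve=0' |
    add_ranges | grep ":HGT:500 mb:" | list_ranges
# prints 37162-70211
```

Each range that survives the grep is what a random-access http program would then request from GRIB_URL.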
|
|
Examples
|
get_inv.pl http://nomad3.ncep.noaa.gov/pub/gfs/rotating/gblav.t00z.pgrbf12.inv | \
grep ":HGT:500 mb:" | \
get_grib.pl http://nomad3.ncep.noaa.gov/pub/gfs/rotating/gblav.t00z.pgrbf12 out.grb
  
The above example can be written on one line without the back slashes. (Back slashes
are the unix convention indicating the line is continued on the next line.) The
example downloads the 500 mb height from the 12-hour forecast (f12) of the 00Z (t00z)
GFS run from the NCEP nomad3 server.
  
  
get_inv.pl http://nomad2.ncep.noaa.gov/pub/gfs/rotating/gblav.t00z.pgrbf12.inv | \
egrep "(:HGT:500 mb:|:TMP:1000 mb:)" | \
get_grib.pl http://nomad2.ncep.noaa.gov/pub/gfs/rotating/gblav.t00z.pgrbf12 out.grb
  
The above example is similar to the earlier example except it downloads both the
500 mb height and the 1000 mb temperature.
|
|
Sample Script
|
|
Here is an example of downloading a year of R2 data.
#!/bin/sh
# simple script to download 4x daily V winds at 10mb
# from the R2 archive
set -x
date=197901
enddate=197912
while [ $date -le $enddate ]
do
url="http://nomad3.ncep.noaa.gov/pub/reanalysis-2/6hr/pgb/pgb.$date"
get_inv.pl "${url}.inv" | grep ":VGRD:" | grep ":10 mb" | \
get_grib.pl "${url}" pgb.$date
date=$(($date + 1))
if [ $(($date % 100)) -eq 13 ] ; then
date=$(($date - 12 + 100));
fi
done
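The month arithmetic in the script is easy to get wrong, so here it is pulled out into a small function that can be checked on its own. The name next_month is just for illustration; the logic is the same as in the loop above.

```shell
#!/bin/sh
# Sketch: advance a YYYYMM value by one month (hypothetical helper).
next_month() {
    ym=$(($1 + 1))
    # Month 13 does not exist: roll over to January of the next year.
    if [ $((ym % 100)) -eq 13 ]; then
        ym=$((ym - 12 + 100))
    fi
    echo "$ym"
}

next_month 197911   # prints 197912
next_month 197912   # prints 198001
```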
|
|
Requirements
|
- perl
- grep
- cURL
- grib files and their wgrib inventory on an http server
- get_inv.pl
- get_grib.pl
|
|
Configuration (UNIX/Linux)
|
The first two lines of get_inv.pl and get_grib.pl need to be modified.
The first line should point to your perl interpreter. The
second line needs to point to the location of curl if it is not
on your path.
|
|
Usage: Windows
|
There have been some reports that the perl scripts didn't work on Windows machines.
The problem was solved by Alexander Ryan.
Hi Wesley,
thought this might be of some use to your win32 users.
I had the following problem when running the get_grib.pl file as per your instructions.
run this
grep ":UGRD:" < my_inv | get_grib.pl $URL ugrd.grb
and I would get the error No download! No matching grib fields. on further
investigation I found that it was just skipping the while STDIN part of the
code. a few google searches later and I found that for some strange reason in
the pipe I needed to specify the path or command for perl even though the file
associations for .pl are set up. (don't fiqure)
this works for me
grep ":UGRD:" < my_inv | PERL get_grib.pl $URL ugrd.grb
Regards and thanks for the fine service
Alexander Ryan
Another email from Alexander
Hi Wesley,
Further to my last email here are some details regarding the enviorment I run this all on for your referance.
My computer is P4 1.7GHz with 1Gb Ram running Windows 2000 service pack 4
Perl version :V5.6.1 provided by http://www.activestate.com
cUrl Version: 7.15.4 from http://curl.haxx.se/
grep & egrep: win32 versions of grep and egrep, I found both at
http://unxutils.sourceforge.net
who provide some useful ports of common GNU utilities to native Win32. (no cygwin required)
so far this is working fine
Regards Alexander
Apparently,
 
   get_inv.pl INV_URL | grep FIELDS | perl get_grib.pl URL OUTPUT
 
should work. Linux users probably will gravitate towards the cygwin system because it
includes bash, an X-server, compilers and the usual unix tools.
|
|
Tips
|
If you want to download multiple fields, for example, precipitation and 2 meter temperature, you
can type,
 
 
     URL="http://www.ftp.ncep.noaa.gov/data/nccf/com/gfs/prod/gfs.2006070312/gfs.t12z.pgrb2f00"
     get_inv.pl $URL.idx | egrep ':(PRATE|TMP:2 m above gnd):' | get_grib.pl $URL out
 
The above code will put the precipitation and 2-m temperature in the file out. Of course, egrep understands
regular expressions, which is a very powerful feature.
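Before starting a download, it can help to check what a pattern selects by running it against a few inventory lines. The lines below are made up for illustration (a real inventory would come from get_inv.pl), and sample_inventory is a hypothetical helper.

```shell
#!/bin/sh
# Made-up inventory lines; offsets and field names are illustrative only.
sample_inventory() {
    printf '%s\n' \
        '1:0:d=2006070312:HGT:500 mb:anl:' \
        '2:1000:d=2006070312:TMP:2 m above gnd:anl:' \
        '3:2000:d=2006070312:PRATE:sfc:0-6 hr ave:' \
        '4:3000:d=2006070312:TMP:500 mb:anl:'
}

# grep -E is the POSIX spelling of egrep.
# Selects records 2 and 3; the 500 mb TMP and HGT records are skipped.
sample_inventory | grep -E ':(PRATE|TMP:2 m above gnd):'
```

Including the colons in the pattern keeps ":TMP:2 m above gnd:" from matching temperatures at other levels.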
If you are doing multiple downloads from the same file, you can save time by keeping a local
copy of the inventory. For example,
 
     URL="http://www.ftp.ncep.noaa.gov/data/nccf/com/gfs/prod/gfs.2006070312/gfs.t12z.pgrb2f00"
     get_inv.pl $URL.idx > my_inv
     grep ":UGRD:" < my_inv | get_grib.pl $URL ugrd.grb
     grep ":VGRD:" < my_inv | get_grib.pl $URL vgrd.grb
     grep ":TMP:" < my_inv | get_grib.pl $URL tmp.grb
 
The above code saves two extra downloads of the inventory.
 
Some people have slow internet connections. A user was complaining about bad
downloads. Turns out that the user was using a modem and cURL
was "timing out". The user solved the problem by adding the following options to
the cURL command "-y 30 -Y 30" which are found within get_inv.pl and get_grib.pl.
The options tell curl to only "time out" when the download rate is less than 30 bytes
per second for 30 seconds. Glad I don't have to use a modem.
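To try the options outside the perl scripts, a small wrapper like the following may be enough. The wrapper name curl_slow_ok and the CURL variable are assumptions for illustration; -y is cURL's --speed-time and -Y is its --speed-limit.

```shell
#!/bin/sh
# Sketch: run curl so that it gives up only when the transfer stays
# below 30 bytes/second (-Y 30) for 30 seconds (-y 30).
# curl_slow_ok and CURL are hypothetical names.
curl_slow_ok() {
    ${CURL:-curl} -y 30 -Y 30 "$@"
}
```

For example, "curl_slow_ok -f -s -o my_inv INV_URL" would fetch an inventory with the relaxed limits.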
|
|
Notes for Data Providers
|
|
The grib data needs to be accessible on an http server. Often this is a
minor change in the httpd configuration.
The users will need a wgrib inventory (grib-1) or a wgrib2 inventory
(grib-2). It is convenient if the inventory is in the same directory as the data files
and uses the '.inv' suffix convention. The inventory can be created by,
 
     GRIB-1: wgrib -s grib_file > grib_file.inv
 
     GRIB-2: wgrib2 -s grib_file > grib_file.inv
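For a directory full of files, the inventories could be made with a loop such as the sketch below. It assumes that every file in the directory that does not end in .inv is a grib file; make_inventories and the WGRIB variable are hypothetical names.

```shell
#!/bin/sh
# Sketch: write FILE.inv next to every data file in a directory.
# Assumes all non-.inv files in DIR are grib files.
# Set WGRIB=wgrib2 for GRIB-2 data; the default is wgrib (GRIB-1).
make_inventories() {
    # usage: make_inventories DIR
    for f in "$1"/*; do
        case "$f" in
            *.inv) continue ;;   # skip inventories from an earlier run
        esac
        [ -f "$f" ] || continue  # skip subdirectories and empty globs
        ${WGRIB:-wgrib} -s "$f" > "$f.inv"
    done
}
```

Running "make_inventories /path/to/data" from a cronjob after new data arrives would keep the inventories up to date; the path is a placeholder.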
|
|
GRIB-2
|
|
5-2006: GRIB-2 support is now in get_inv.pl and get_grib.pl.
Data providers need to use wgrib2 with the -range option to create the
inventory files.
7-2006: GRIB-2 support has been improved. Users need a new version
of get_inv.pl (7/2006 release date). Data providers do not need to make inventories with the -range
option.
|
|
Notes
|
|
In theory, curl allows random access to FTP servers but in practice we
found this to be slow (each random access is its own FTP session).
Support for the FTP access was dropped 2/2005 because we want
data providers to use the faster http protocol.
| |
|
Regional Subsetting
|
|
Some users would like to download specific regions. That
would really reduce the bandwidth. There are two approaches,
server-side software and client-side software. With grib1, we
used server software (ftp2u/ftp4u) to unpack and create regional
subsets. The process was slow and we expected the grib2 version
of the software to be even slower because of the jpeg2000 compression.
Well, I was wrong: the grib2 version (g2sub) was faster than expected.
Apparently the jpeg2000 decompression time was offset by the time saved by
not using temporary files.
In theory, client software could download a part of a grib-1 file
to obtain a regional subset. We never pursued this because grib-1's
days were numbered. With jpeg2000-compressed grib-2 files, such
a technique will probably work.
My current opinion is that we need to go to "tiles". Instead of having
a big map, you break down the map into n x m subsections (tiles).
For example, suppose our grib file has an array of (1:100, 1:60). We can make
4 tiles by saving (1:50, 1:30), (51:100, 1:30), (1:50, 31:60) and (51:100, 31:60)
into 4 separate grib files. Instead of downloading the entire map, the users
could download the tiles that covered their region of interest.
After downloading, the user would run a program
such as copygb to combine the tiles into a single map. By using "partial http
transfers" and tiling, the user gets regional subsetting and the server
keeps the low overhead of the "partial http transfers".
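The tile boundaries in the example above can be computed with a little integer arithmetic. This is only a sketch of the index bookkeeping (it does not read or write grib files), and tile_bounds is a hypothetical name.

```shell
#!/bin/sh
# Sketch: print the (x-range, y-range) of each tile when an nx x ny
# grid is split into n tiles across and m tiles down.
tile_bounds() {
    # usage: tile_bounds NX NY N M
    nx=$1 ny=$2 n=$3 m=$4
    j=1
    while [ "$j" -le "$m" ]; do
        i=1
        while [ "$i" -le "$n" ]; do
            x0=$(( (i - 1) * nx / n + 1 ))
            x1=$(( i * nx / n ))
            y0=$(( (j - 1) * ny / m + 1 ))
            y1=$(( j * ny / m ))
            echo "($x0:$x1, $y0:$y1)"
            i=$((i + 1))
        done
        j=$((j + 1))
    done
}

tile_bounds 100 60 2 2
# prints
# (1:50, 1:30)
# (51:100, 1:30)
# (1:50, 31:60)
# (51:100, 31:60)
```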
Tiling is already being used with the NAM (NCEP regional model). Here, the tiles
are grib files with the same Lambert-Conformal projection as the "master" grib file.
The only difference is that the ix and iy have different ranges.
The GFS (NCEP global model) is often distributed in "octets": (0-90N or 90S-0) by
(x E to x+90 E). The 8 octets cover the globe. For many people, downloading one
or two of the octets could replace downloading the entire global field.
Comment (WNE 8/2006, updated 3/2008): Most of NCEP's grib-2 data will be encoded using JPEG2000
compression. My understanding of JPEG2000 is very limited but apparently JPEG2000
allows random access. Maybe someone smart can get regional subsetting working.
(We also discussed subdividing the grid into several smaller grids and encoding
each of them into its own jpeg. That would speed up processing with an MPP machine
and allow regional subsetting. However, that would require a new packing scheme to
be designed and approved.)
|
|
Created: 1/21/2005 last modified 10/2006
comments: Wesley.Ebisuzaki@noaa.gov
|