After downloading the single large file that contained all of the more than 4.4 million web pages, we used VB.NET (and lots and lots of regular expressions) to process the data. Everything went into MS SQL Server databases, including the metadata, a list of the links, and the full text of each page. We started with the same architecture and interface as the rest of the BYU corpora, but then modified them quite a bit to support virtual corpora.
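As a rough illustration only, the kind of processing described above (regular expressions pulling metadata, links, and text out of one large file and loading them into database tables) might look something like the sketch below. It is written in Python with SQLite rather than the VB.NET and MS SQL Server that were actually used, and the `<page id="..." title="...">` dump format, the table names, and the `load_dump` function are all hypothetical assumptions made for the example, not a description of the real pipeline.

```python
import re
import sqlite3

# ASSUMPTION: each page in the dump is wrapped in a <page id="..." title="..."> element.
# The real dump format is not described in the text.
PAGE_RE = re.compile(r'<page id="(\d+)" title="([^"]*)">(.*?)</page>', re.DOTALL)
HREF_RE = re.compile(r'href="([^"]+)"')   # link targets inside a page
TAG_RE = re.compile(r'<[^>]+>')           # any markup tag
WS_RE = re.compile(r'\s+')                # runs of whitespace


def load_dump(dump_text, conn):
    """Split the dump into pages and store metadata, links, and text (hypothetical schema)."""
    conn.executescript("""
        CREATE TABLE IF NOT EXISTS pages (page_id INTEGER PRIMARY KEY, title TEXT);
        CREATE TABLE IF NOT EXISTS links (page_id INTEGER, target TEXT);
        CREATE TABLE IF NOT EXISTS texts (page_id INTEGER PRIMARY KEY, body TEXT);
    """)
    count = 0
    for match in PAGE_RE.finditer(dump_text):
        page_id = int(match.group(1))
        title = match.group(2)
        html = match.group(3)

        # Metadata: one row per page.
        conn.execute("INSERT INTO pages VALUES (?, ?)", (page_id, title))

        # Links: one row per outgoing link found on the page.
        conn.executemany(
            "INSERT INTO links VALUES (?, ?)",
            [(page_id, target) for target in HREF_RE.findall(html)],
        )

        # Text: drop the tags, then collapse the leftover whitespace.
        body = WS_RE.sub(" ", TAG_RE.sub(" ", html)).strip()
        conn.execute("INSERT INTO texts VALUES (?, ?)", (page_id, body))
        count += 1
    conn.commit()
    return count
```

Calling `load_dump(dump_text, sqlite3.connect(":memory:"))` on a string in the assumed format returns the number of pages stored. Keeping the metadata, links, and text in separate tables mirrors the three kinds of data mentioned above; a real dump of this size would need to be read as a stream rather than loaded into memory as a single string.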