See this:
That’s right. The install worked. All I had to do was get a completely clean server set up. No domain controller. Now to get my learn on.
Microsoft has a web site with a number of introductory samples. I’ll start there and work through them. The very first example gets me set up with some data that it builds by running a PowerShell script, importdata.ps1. But I’m not going to just blindly follow along. I want to see what the heck is happening so I can start understanding this stuff. By the way, thank you Microsoft for making the samples in PowerShell and not forcing me to relearn Python or something else. That would have been frustrating.
The script is really simple. It has two scenarios you can pass it, w3c or RunTraffic. Each one just changes directory and runs another PowerShell script, import.ps1, so there are two copies of import.ps1 in two different directories. I’ll bet the scripts are different. I’m running the w3c scenario, so let’s see what that script is doing.
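To make that concrete, here is a minimal sketch of what a dispatcher like that could look like. This is my own reconstruction from the description above, not the actual contents of importdata.ps1. The function name Invoke-ScenarioImport, the -Root parameter, and the one-folder-per-scenario layout are all assumptions for illustration.

```powershell
# Hypothetical sketch of a scenario dispatcher; NOT the real importdata.ps1.
function Invoke-ScenarioImport {
    param(
        # Only the two scenarios mentioned in the sample are accepted.
        [Parameter(Mandatory = $true)]
        [ValidateSet('w3c', 'RunTraffic')]
        [string]$Scenario,

        # Assumed layout: $Root contains one subdirectory per scenario.
        [string]$Root = (Get-Location).Path
    )

    $scenarioDir = Join-Path $Root $Scenario
    $importScript = Join-Path $scenarioDir 'import.ps1'

    if (-not (Test-Path $importScript)) {
        throw "No import.ps1 found in $scenarioDir"
    }

    # Change into the scenario directory, run its import.ps1,
    # and always return to where we started.
    Push-Location $scenarioDir
    try {
        & $importScript
    }
    finally {
        Pop-Location
    }
}
```

Calling Invoke-ScenarioImport -Scenario w3c would then run the import.ps1 that lives in the w3c folder under the current directory.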
Ah, now things are getting interesting. There are two functions. One handles data generation, using an executable to make up test data. The other is a mechanism for calling Hadoop. Basically it uses two .NET classes, System.Diagnostics.ProcessStartInfo and System.Diagnostics.Process. ProcessStartInfo defines the startup information for a process, which you then launch through a Process object. In this case it’s setting the location of hadoop:
$pinfo.FileName = $env:HADOOP_HOME + "\bin\hadoop.cmd";
Then it sets up arguments, if any. The actual calls to this from the code use a command, dfs, with two different options, -mkdir and -copyFromLocal. From what I can tell, it’s creating a storage location within Hadoop and then copying the generated data over. I’m good with all the scripts I can see except knowing where this dfs thing comes from.
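Here is a rough sketch of the ProcessStartInfo/Process pattern described above, written the way I understand it rather than copied from the sample. The function name Invoke-HadoopCommand, the -HadoopCmd parameter, the choice to capture standard output, and the paths in the usage comments are all my assumptions. Only the FileName line comes from the actual script.

```powershell
# Hypothetical wrapper; a sketch of the pattern, not the sample's own function.
function Invoke-HadoopCommand {
    param(
        # Everything after "hadoop" on the command line, e.g. "dfs -mkdir <path>".
        [string]$Arguments = '',

        # Defaults to the same location the sample script uses.
        [string]$HadoopCmd = ($env:HADOOP_HOME + "\bin\hadoop.cmd")
    )

    # ProcessStartInfo describes how the process should be started.
    $pinfo = New-Object System.Diagnostics.ProcessStartInfo
    $pinfo.FileName = $HadoopCmd
    $pinfo.Arguments = $Arguments
    $pinfo.UseShellExecute = $false          # required in order to redirect output
    $pinfo.RedirectStandardOutput = $true    # assumption: capture stdout only

    # Process takes that description and actually launches it.
    $p = New-Object System.Diagnostics.Process
    $p.StartInfo = $pinfo
    [void]$p.Start()

    # Read the output before waiting so a full pipe cannot block the child.
    $stdout = $p.StandardOutput.ReadToEnd()
    $p.WaitForExit()

    # Hand back both the exit code and whatever the command printed.
    New-Object PSObject -Property @{
        ExitCode = $p.ExitCode
        Output   = $stdout
    }
}

# Usage (illustrative paths only, not the ones the sample uses):
# Invoke-HadoopCommand -Arguments 'dfs -mkdir /demo/input'
# Invoke-HadoopCommand -Arguments 'dfs -copyFromLocal C:\demo\data.txt /demo/input/data.txt'
```

Returning the exit code along with the output makes it easier to tell whether the -mkdir step worked before attempting the -copyFromLocal step.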
Data load ran just fine:
Data loaded, time to test out a Map/Reduce job. Again there’s a PowerShell script included for running a simple job, so I check it out. First run, fails. Great. More stuff to troubleshoot before I can see this work. This is not going to be easy.
Stepping through and running the scripts might not be the best way to learn this. So, I’m going to now start reading the Big Data Jumpstart Guide. I’ll post more as I learn it.
I just wanted to say I’m glad you’re posting about your experiences with this. It’s something that I’m also curious about but don’t have the resources currently to set it up. Thanks!
Not a problem. I’ll also be trying to get to the Hadoop on Azure piece of it as well. I suspect, but don’t know, that cloud based processing will become the more prevalent method.
If you have any specific questions, please pass ’em on. Thanks.
[…] HDInsight, Finally by @GFritchey (posted Dec. […]
Thanks for pointing to introductory samples!
Admiring the dedication you put into your site and detailed information you present. It’s good to come across a blog every once in a while that isn’t the same outdated rehashed material. Great read! I’ve saved your site and I’m including your RSS feeds to my Google account.