
  1. Get an AWS account
    You need this before you can do anything
     
    Go to http://aws.amazon.com
    You will need a credit card. They bill monthly, and charge by the hour.
  2. Once you have an account get your security credentials
    You need these for authentication.
     
    Go to aws.amazon.com and click on "Your Account" -> "Security Credentials"
    In various locations on that page are several account IDs. Write these down:
      Access Key ID
      Secret Access Key
      AWS Account ID
    We will refer to these as: ACCESS_KEY, SECRET_KEY, and ACCOUNT_ID
  3. Create X.509 certificates
    These will be used to sign VM images later.
     
    On the same page click on X.509 certificates and "Create a new Certificate"
    You should get two files: a certificate (cert-*.pem) and a private key (pk-*.pem). They will have long names.
    Save those files somewhere on your machine.
    We will refer to these files as CERT.PEM and KEY.PEM
  4. Install Condor on your submit host
    This is part 1 of 2 of setting up your submit host.
     
    You need to have a host outside the cloud running a Condor central
    manager. We will refer to this host as your "submit host". We
    assume that you already have a submit host if you are running
    Pegasus.
     
    First you need to install Condor. That is a bit involved, so we
    will skip most of it. Go to http://cs.wisc.edu/condor for
    more information about how to install a basic Condor manager.
     
    Once you have a basic manager working we need to modify the
    configuration a bit.
     
    Edit your condor_config file and append to HOSTALLOW_WRITE:
        HOSTALLOW_WRITE = <what it was before>, *.compute-1.amazonaws.com
     
    Now edit your condor_config.local and add:
        HIGHPORT = 41000
        LOWPORT = 40000
        UPDATE_COLLECTOR_WITH_TCP = True
        COLLECTOR_SOCKET_CACHE_SIZE = 1000
     
    Finally, restart Condor.
     
    VERY IMPORTANT: The firewall on the submit host should be configured
    so that anything from *.compute-1.amazonaws.com can connect to port
    9618, and to ports 40000-41000. These ports are used by Condor.
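    A quick way to confirm from another machine that the submit host's Condor ports are reachable is a small TCP probe. This is a generic sketch, not part of the tutorial's tooling; "host.example.com" and the ports are placeholders for your own setup.

    ```python
    import socket

    def can_connect(host, port, timeout=3.0):
        """Return True if a TCP connection to host:port succeeds within timeout."""
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            return False

    # Probe the Condor collector port (9618) and the edges of the
    # 40000-41000 range configured above.
    for port in (9618, 40000, 41000):
        print(port, "open" if can_connect("host.example.com", port) else "closed")
    ```

    Run it from a worker (or any outside host) after opening the firewall; every port should report "open".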
  5. Log into the Amazon Management Console
    This is a web application that lets you manage cloud resources.
    We will refer to this webapp as the "console", and we will refer to
    the links on the left side of the console as "areas".
     
    Go to: http://console.aws.amazon.com
    Click "Sign in to the AWS console"
    Change the region on the upper-left side of the console to "US East".
     
    IMPORTANT: When you use this make sure you stick to one region (US East
    or US West). Most things in Amazon don't work across regions.
     
    VERY IMPORTANT: For this tutorial please use region "US East", or you
    won't be able to find our public VM image.
  6. From the console create a keypair
    These are the credentials you use to log into worker nodes.
     
    Go to the "Key Pairs" area in the console
    Click "Create Key Pair"
    Call it "ec2-keypair" and click OK.
    It should pop up a download box. Save the file.
    We will refer to this file as KEYPAIR.
     
  7. From the console create a security group
    This is how you authorize machines outside the cloud to access your nodes.
     
    We will assume your submit host is "host.example.com", and that it has
    an IP of "192.168.1.1". The security group we create here will give
    "host.example.com" unrestricted access to your nodes.
     
    Go to the "Security Groups" area in the console
    Call your new group "host.example.com", add a description, and create the group
    Click on the group and add three entries:
     
        Method    Protocol    From Port    To Port    Source (IP or Group)
        All       tcp         1            65535      192.168.1.1/32
        All       udp         1            65535      192.168.1.1/32
        All       icmp        -1           -1         192.168.1.1/32
     
    Note that those are CIDR addresses, so don't forget the /32.
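    The /32 suffix matters: it restricts each rule to exactly one address. Python's ipaddress module (used here purely as an illustration, not part of the tutorial) makes the difference concrete:

    ```python
    import ipaddress

    # A /32 network contains exactly one host address: only your submit host.
    print(ipaddress.ip_network("192.168.1.1/32").num_addresses)  # 1

    # A wider prefix would open the group to many more addresses;
    # for example, a /24 covers 256 of them.
    print(ipaddress.ip_network("192.168.1.0/24").num_addresses)  # 256
    ```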
  8. Launch the Pegasus public image
    This is how you launch a virtual machine (or virtual cluster).
     
    We are going to use a pre-configured image developed specifically for
    Pegasus. It contains Pegasus, Condor, and Globus.
     
    Go to the "AMIs" area in the console.
     
    We are going to launch ami-06dd226f.
     
    Filter by "Public Images" and "CentOS" using the drop-downs, type
    "ami-06dd226f" into the text box and hit "Refresh". It may take a few
    seconds to give you a list.
     
    Select the one called "405596411149/centos-5.6-x86_64-cloud-tutorial" and click "Launch".
     
    A launch wizard will pop up.
     
    Select the number of instances (1 for now) and instance type (m1.large),
    then "Continue".
     
    On the "Advanced Instance Options" page add the following to "User Data"
    and hit "Continue" (note: host.example.com should be replaced with your submit host):
        CONDOR_HOST = host.example.com
     
    Next, on the tags page, enter a value, any value, for the Name tag and hit "Continue"
    Next, select the keypair you created earlier, and "Continue"
    Next, select the security group you created earlier and "Continue".
    On the last page click "Launch"
     
    VERY IMPORTANT: Select the security group and keypair you created
    earlier or else it won't work. Also, make sure you replace
    "host.example.com" in the User Data with your submit host.
     
    ALSO IMPORTANT: The "User Data" is how you tell the image what to
    do. It will be copied directly into the Condor configuration file. You can
    define any extra configuration values you like, but you must specify at least
    CONDOR_HOST.
        
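    As a sketch, a slightly larger User Data block might look like the following. Only the CONDOR_HOST line is required; the extra setting is purely illustrative and can be any valid Condor configuration macro:

    ```
    CONDOR_HOST = host.example.com
    # Any additional Condor configuration macros may follow.
    # Illustrative example only (raises the master daemon's log verbosity):
    MASTER_DEBUG = D_ALWAYS
    ```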
  9. Log into your node
    This is how you SSH to a node you launched.
     
    Go to the "Instances" area in the console.
     
    You should see the instance you just launched go from "pending"
    to "running". You may need to hit "Refresh" a couple of times.
     
    When it says "running", click on it and get the "Public DNS"
    (call it PUBLIC_DNS).
     
    From your submit host ssh to the worker:
        $ ssh -i KEYPAIR root@PUBLIC_DNS
     
    VERY IMPORTANT: Make sure you log in from your submit host, otherwise
    this won't work because the security group does not match.
  10. Check your submit host
    Make sure the workers showed up.
     
    On your submit host run:
        $ condor_status
     
    You should see something that looks like this:
     
        Name               OpSys      Arch   State     Activity LoadAv Mem   ActvtyTime
        slot1@ec2-204-236- LINUX      X86_64 Unclaimed Idle     0.080  3843  0+00:00:04
        slot2@ec2-204-236- LINUX      X86_64 Unclaimed Idle     0.000  3843  0+00:00:05
     
                     Total Owner Claimed Unclaimed Matched Preempting Backfill
        X86_64/LINUX     2     0       0         2       0          0        0
               Total     2     0       0         2       0          0        0
     
    If you don't see anything, then wait a few minutes. If you still don't
    see anything, then you need to debug Condor. Check the CollectorLog to
    see if the workers tried to connect. If it doesn't work contact:
    pegasus-support@isi.edu.
     
  11. Run a test job
    Make sure the workers are usable.
     
    Once the workers show up in condor_status you can test to make sure they
    will run jobs.
     
    Create a file called "vanilla.sub" on your submit host with this inside:
        universe = vanilla
        executable = /bin/hostname
        transfer_executable = false
        output = test_$(cluster).$(process).out
        error = test_$(cluster).$(process).err
        log = test_$(cluster).$(process).log
        requirements = (Arch == Arch) && (OpSys == OpSys) && (Disk != 0) && (Memory != 0)
        should_transfer_files = YES
        when_to_transfer_output = ON_EXIT
        copy_to_spool = false
        notification = NEVER
        queue 1
     
    Submit the test job:
        $ condor_submit vanilla.sub
     
    Check on the job:
        $ condor_q
     
    After a few minutes it should run. Then check the output:
        $ cat test_*.out
    You should see a hostname that looks like it came from Amazon.
     
  12. Modify the image and register a copy
    This is how you create your own custom image.
     
    At this point you can install whatever you want on the running worker
    node. You might want to install programs, libraries, and tools used
    by your workflow. If you don't want to install anything that's OK; you
    can complete this step without modifying the image.
     
    In the "Instances" area of the console click on the running instance
    and select "Instance Actions" -> "Create Image (EBS AMI)".
     
    Give it a name and a description and click "Create Image".
     
    In the "AMIs" area of the console clear all the filters (set to "Owned by
    Me", "All Platforms") and hit refresh. You should see a new image pop
    up. After some time the state should change from "pending" to
    "available". You may need to hit refresh a few times.
     
    IMPORTANT: The image could stay in "pending" status for a long time.
    However, if it is still pending after an hour something is wrong.
     
  13. Shut down your instance
    In the "Instances" area of the console click on the running instance
    and select "Instance Actions" -> "Terminate".
     
    VERY IMPORTANT: Amazon keeps charging until the status is "terminated".
  14. Configure Pegasus
    Add an ec2 site to your sites.xml:
        <site handle="ec2" sysinfo="INTEL64::LINUX">
            <profile namespace="env" key="PEGASUS_HOME">/usr/local/pegasus/default</profile>
            <profile namespace="env" key="GLOBUS_LOCATION">/usr/local/globus/default</profile>
            <profile namespace="env" key="LD_LIBRARY_PATH">/usr/local/globus/default/lib</profile>
            <profile namespace="pegasus" key="bundle.stagein">1</profile>
            <profile namespace="pegasus" key="bundle.stageout">1</profile>
            <profile namespace="pegasus" key="transfer.proxy">true</profile>
            <profile namespace="pegasus" key="style">glidein</profile>
            <profile namespace="condor" key="universe">vanilla</profile>
            <profile namespace="condor" key="requirements">(Arch==Arch)&&(Disk!=0)&&(Memory!=0)&&(OpSys==OpSys)&&(FileSystemDomain!="")</profile>
            <lrc url="rls://example.com"/>
            <gridftp url="gsiftp://" storage="" major="2" minor="4" patch="0"/>
            <jobmanager universe="vanilla" url="example.com/jobmanager-pbs" major="2" minor="4" patch="3"/>
            <jobmanager universe="transfer" url="example.com/jobmanager-fork" major="2" minor="4" patch="3"/>
            <workdirectory>/shared</workdirectory>
        </site>
     
    Add the path to your proxy to the "local" site in sites.xml:
        <!-- This is needed so Pegasus can transfer the proxy to EC2 for gridftp -->
        <profile namespace="env" key="X509_USER_PROXY">/tmp/x509up_u724</profile>
     
    In your pegasus.properties file, disable thirdparty transfer mode:
        # Comment-out the next line to run on site "ec2"
        #pegasus.transfer.*.thirdparty.sites=*
     
    If you installed your application code in the image, then modify your
    Transformation Catalog to include the new entries. (Tip: Make sure the
    sysinfo of your "ec2" site matches the new transformations you add to
    the TC.)
  15. Plan your workflow
    Prepare your workflow to run on EC2.
     
    We assume you know how to do this already. Use "ec2" as the target site.
    If you run into any problems debug them before moving on to the next step.
    If you have problems contact: pegasus-support@isi.edu
     
  16. Launch a larger virtual cluster
    You will do basically the same thing you did to launch the first worker.
     
    This time you will start a virtual cluster with 2 nodes. Instead of using
    the Pegasus image, use the new image you created earlier.
     
    In the "AMIs" area select your new image and click "Launch".
     
    Select 2 instances, m1.large.
     
    Set "User Data" to:
        CONDOR_HOST = host.example.com
    VERY IMPORTANT: Don't just copy-paste the above; you need to replace
    "host.example.com" with the actual DNS name of your submit host.
     
    Choose your keypair and security group as before and launch the cluster.
     
    Wait until you see the workers show up in condor_status before proceeding.
    You should see twice as many as you did last time. You may want to run
    your vanilla.sub test job again to make sure they work.
     
  17. Submit your workflow
    At this point you should submit your workflow.
    If you have problems contact: pegasus-support@isi.edu
     
    VERY IMPORTANT: You are virtually guaranteed to have problems at this
    point. Please contact us and we will help.
  18. Clean Up
    Hopefully your workflow will run to completion. When you are finished
    make sure you terminate any running instances in the "Instances" area
    of the console.