
Thursday, 7 July 2016

Useful Queries for Troubleshooting Amazon Redshift


Here are some of the queries I use for troubleshooting in Amazon Redshift. I have collected them from different sources.

TO CHECK LIST OF RUNNING QUERIES AND USERNAMES:

select a.userid, cast(u.usename as varchar(100)), a.query, a.label, a.pid, a.starttime, b.duration,
b.duration/1000000 as duration_sec, b.query as querytext
from stv_inflight a, stv_recents b, pg_user u
where a.pid = b.pid and a.userid = u.usesysid




select pid, trim(user_name), starttime, substring(query,1,20) from stv_recents where status='Running'

TO CANCEL A RUNNING QUERY:

cancel <pid>


You can get the pid from one of the queries above used to check running queries.
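For example (a minimal sketch; 12345 below is just a placeholder pid taken from the queries above), you can cancel the query and, if it does not respond, terminate the whole session:

cancel 12345;

-- if cancel does not work, terminate the session that is running the query
select pg_terminate_backend(12345);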


TO LOOK FOR ALERTS:

select * from STL_ALERT_EVENT_LOG
where query = 1011
order by event_time desc
limit 100;


TO CHECK TABLE SIZE:

select trim(pgdb.datname) as Database, trim(pgn.nspname) as Schema,
trim(a.name) as Table, b.mbytes, a.rows
from ( select db_id, id, name, sum(rows) as rows from stv_tbl_perm a group by db_id, id, name ) as a
join pg_class as pgc on pgc.oid = a.id
join pg_namespace as pgn on pgn.oid = pgc.relnamespace
join pg_database as pgdb on pgdb.oid = a.db_id
join (select tbl, count(*) as mbytes
from stv_blocklist group by tbl) b on a.id=b.tbl
order by b.mbytes desc, a.db_id, a.name;


TO CHECK FOR TABLE COMPRESSION:

analyze <tablename>;
analyze compression <tablename>;



TO ANALYZE ENCODING:

select "column", type, encoding
from pg_table_def where tablename = 'biglist';



TO CHECK LIST OF FILES COPIED:

select * from stl_load_errors

select * from stl_load_commits


select query, trim(filename) as file, curtime as updated, *
from stl_load_commits
where query = pg_last_copy_id();


TO CHECK LOAD ERRORS:

select d.query, substring(d.filename,14,20),
d.line_number as line,
substring(d.value,1,16) as value,
substring(le.err_reason,1,48) as err_reason
from stl_loaderror_detail d, stl_load_errors le
where d.query = le.query
and d.query = pg_last_copy_id();


TO CHECK FOR DISKSPACE USED IN REDSHIFT:

select owner as node, diskno, used, capacity
from stv_partitions
order by 1, 2, 3, 4;

TO CHECK THE LAST FEW QUERIES EXECUTED:

select query, trim(querytxt) as sqlquery
from stl_query
order by query desc limit 5;


SOME IMPORTANT AWS COMMANDS:

To resize the redshift cluster (node type and number of nodes always required):

aws redshift modify-cluster --cluster-identifier <cluster name> --node-type dw2.8xlarge --number-of-nodes 3

To get filelist on S3:

aws s3 ls $BUCKET/  > ./filecount.out

To get the status of the cluster and other cluster information in text format:

aws redshift describe-clusters --output text   


Tuesday, 10 November 2015

User management - creating groups, users, and owners in Amazon Redshift


Below are some sample commands for creating user groups, adding users, and managing permissions on tables in Amazon Redshift.

1) To create a user, use the command below in Amazon Redshift:

create user my_user password 'anaconda123'; 

In the above example, a user named my_user is created with the password anaconda123.

You can later change the password using the command:

alter user my_user password 'mynewpassword123'

2) To create a user group, use the command below in Amazon Redshift:


create group my_group with user my_user;

In the above command, a new group named my_group is created and the user my_user is added to that group.

3) To create a database schema, use the command below in Amazon Redshift:

 create schema myschema authorization my_user;

In the above example, the schema myschema is created with my_user as its owner, which gives my_user full permissions on the schema.

4) To grant a user group access to a schema, use the command below in Amazon Redshift:

 grant all on schema myschema to group my_group;



5) To add or drop users from a user group, or rename a group, use the commands below in Amazon Redshift:

ALTER GROUP group_name ADD USER username;

ALTER GROUP group_name DROP USER username;

ALTER GROUP group_name  RENAME TO new_name;


6) To check permissions and ownership of tables in Amazon Redshift:

select * from pg_tables  /* provides table owner info */
select * from pg_views   /* provides view owner info */
select * from pg_shadow  /* provides user info */
select * from pg_group   /* provides user group info */
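For example, the sketch below (my_user and myschema.mytable are placeholder names) checks whether a user can select from a specific table and lists the tables that user owns:

-- check whether a given user has select privilege on a given table
select has_table_privilege('my_user', 'myschema.mytable', 'select');

-- list the tables owned by a given user
select schemaname, tablename, tableowner
from pg_tables
where tableowner = 'my_user';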



7) To grant permissions on tables in Amazon Redshift:

Use the sample below to grant permissions on a table and change its owner in Redshift:

Create table MYTABLE (
    COLUMN1          CHAR(1)                       NOT NULL,
    COLUMN2     CHARACTER VARYING(100)        NOT NULL)
distkey (COLUMN1);

GRANT  ALL ON TABLE MYTABLE TO mstr_user;

ALTER TABLE  MYTABLE owner TO my_user;





Sunday, 14 June 2015

How to connect to Amazon Redshift using Informatica PowerCenter?



You can connect to Amazon Redshift using the drivers available on Informatica Cloud. However, if you want to connect Informatica PowerCenter 9.1 or prior versions to Redshift, using a PostgreSQL ODBC driver is an option. You can use the DataDirect 6.1 PostgreSQL Wire Protocol driver and later versions for this purpose.

Step 1: Install the PostgreSQL driver and create an entry in .odbc.ini.


Some of the parameters you might have to set in your ODBC (.odbc.ini) connection are shown below. Check your driver documentation for the complete settings:

[PostgreSQL]
Driver=/informatica/server91/ODBC6.1/lib/DWpsql25.so
Description=DataDirect 6.1 PostgreSQL Wire Protocol
Database=mydatabasename
HostName=myreshifthost.cqa0adasda-east-1.redshift.amazonaws.com
LogonID=myuser_name
Password=mypassword
PortNumber=yourportnumber
TrustStore=/export/redshift-ssl-ca-cert.pem
PacketSize=32


You can contact Informatica or read the driver documentation for the full settings.

Step 2: Create a relational ODBC connection from the Informatica Workflow Manager.



Friday, 16 January 2015

Easy steps to set encodings for table compression in Amazon Redshift




If you are loading data from S3 to Redshift, you can use the COPY command with the COMPUPDATE ON option to automatically set the compression on the target table based on the data being loaded. Redshift will analyze a sample of the data (around 200K rows) and set the compression/encoding that is best for the data being loaded.
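As a rough sketch (the table name, bucket, file prefix, and credentials below are placeholders), such a copy command could look like this:

copy myschema.mytable
from 's3://mybucket/data/part_'
credentials 'aws_access_key_id=<access-key-id>;aws_secret_access_key=<secret-access-key>'
delimiter '|' COMPUPDATE ON;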

If you are loading the data from other tables in Redshift and need to explicitly set the encodings, then the steps outlined below might help:

a) Load a sample set of records into the target table on Redshift that you want to set the encodings/compression on. Use the ANALYZE COMPRESSION <tablename> command to check the encodings that are suitable for this target table.

For example: We are loading data to the table employee below.

redshift> Analyze compression employee

table_name      column  Encoding
-----------------------------------
'Employee','employee_key','raw'
'Employee','employee_number','lzo'
'Employee','employee_type','runlength'
'Employee','employee_status','lzo'
'Employee','employee_source_code','lzo'
'Employee','employee_tier','lzo'
'Employee','enterprise_tier','lzo'
'Employee','serving_flag','runlength'
'Employee','employee_flag','lzo'
'Employee','account_update_date','delta32k'
'Employee','tenure_months','bytedict'
'Employee','creation_date','delta32k'
'Employee','category','bytedict'
'Employee','extended_flag','runlength'
'Employee','business_date','runlength'


b) Now recreate the table employee on Redshift with those encodings:



CREATE TABLE EMPLOYEE
(
    employee_key        INTEGER   NOT NULL encode raw,
    employee_num        CHAR(11)  NOT NULL encode lzo,
    employee_type       CHAR(1)   NOT NULL encode runlength,
    employee_status     CHAR(1)   NOT NULL encode lzo,
    employee_tier       CHAR(1)   NOT NULL encode lzo,
    serving_flag        CHAR(1)   NOT NULL encode runlength,
    employee_flag       CHAR(1)   NOT NULL encode lzo,
    account_update_date DATE      NOT NULL encode delta32k,
    tenure_months       INTEGER   NOT NULL encode bytedict,
    creation_date       DATE      NOT NULL encode delta32k,
    category            CHAR(5)   NOT NULL encode bytedict,
    extended_flag       CHAR(1)   NOT NULL encode runlength,
    business_date       DATE      NOT NULL encode delta32k
)
distkey (employee_key);


c) Now reload the data into the table and you should see the encodings applied to this table.
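One way to do the reload, assuming the data already sits in the old version of the table, is a deep copy; a minimal sketch (employee_old is just a placeholder name for the renamed original table, and the column lists are assumed to match):

alter table employee rename to employee_old;

-- recreate EMPLOYEE with the new encodings as in step b, then:

insert into employee
select * from employee_old;

drop table employee_old;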


redshift> Analyze compression employee

Monday, 12 January 2015

Amazon Redshift - date manipulation functions


--to select the current date and timestamp in redshift use:


select current_date
select current_timestamp

--to add months and convert date to integer use in redshift:

 select to_char(add_months(to_date(current_date, 'YYYY-MM-DD'), -1), 'YYYYMMDD')

Note: Do not use to_date(current_date, 'YYYYMMDD'), since Redshift has a bug with this function and handles some dates incorrectly. At least, this is what I found from my experience.

--to add months and get first of last month:

select to_char(add_months(to_date(current_date, 'YYYY-MM'), -1), 'YYYYMM') ||'01'

--to add months and convert date to integer use:

select to_number(to_char(add_months(current_date, -1),'YYYYMMDD'),'99999999') --to get date id

select to_number(to_char(add_months(current_date, -1),'YYYYMM'),'999999') --to get month id
select date(current_date -  cast('1 month' as interval))

--to extract year part from the current date

 select extract (year from current_date)
 select date_trunc('year', current_date)
 select to_number(to_char(current_date,'YYYY'),'9999')  --integer format

--to extract day and month from the current date

 select extract (day from current_date)
 select extract (month from current_date)

--to get first and last day of the year:


select to_date((extract (year from current_date)) || '0101','YYYYMMDD')
select to_date((extract (year from current_date)) || '1231','YYYYMMDD')

--to get first of the month given a date in redshift use:

 select date_trunc('month', current_timestamp)

--to get last month end date

 select date_trunc('month', current_timestamp)-1

--to add 7 days to current date


select date(current_date +  cast('7 days' as interval))

--to format a timestamp, or to convert a string to a timestamp

 select to_char( current_timestamp, 'YYYY-MM-DD HH24:MI:SS')
 select to_timestamp( '2014-12-01 23:12:12', 'YYYY-MM-DD HH24:MI:SS')

--to add hours to a date

  select current_date + cast('1 hour' as interval)

--to convert the date 02 Dec 2011 into the default date format:

select to_date ('02 Dec 2011', 'DD Mon YYYY');
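Redshift also provides the DATEADD and DATEDIFF functions, which cover several of the cases above more directly; a small sketch:

select dateadd(day, 7, current_date)              --7 days from today
select dateadd(month, -1, current_date)           --same day last month
select datediff(day, '2014-01-01', current_date)  --days elapsed since a given date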


Friday, 19 December 2014

How to select between Hadoop vs Netezza vs Redshift - a comparison


A common question that we hear is: among Hadoop, Netezza, and Redshift, which one is better and which one is going to replace everything else? All these technologies have their own pros and cons, and it is essential to know them before we try to find the ideal use cases for each and how to use them in your ecosystem.

Hadoop:


Pros:
  • As we know, Hadoop is mainly used for big data purposes, i.e. where huge volumes are involved, often petabyte scale or more.
  • The data can be unstructured, structured, or semi-structured.
  • The data in Hadoop can be stored on commodity hardware, and the cost of storing and processing is low considering the volume of data it can handle. Now let's say we are receiving a huge number of XML files and flat files from a huge number of sources (for example, sensors from airplanes, logs from firewalls/IDS systems, tweets from Twitter, etc.) that need to be loaded at a rapid pace and processed; then Hadoop is an ideal candidate.
  • Hadoop clusters are highly scalable and can be horizontally or vertically scaled by adding more commodity hardware and nodes, which are not expensive.
  • Hadoop clusters can also be made to handle streaming data: collect data from sensors, store the streaming data, play it back, etc.
  • Data redundancy is built into the cluster.

Cons:
  • Note that files stored on HDFS are in flat file format, and though you can use Hive or Impala to process these files as structured data, there are some limitations to it. Hive SQL does not offer the full functionality found in traditional databases.
  • Updating records is tough, and you will have to use a tool like HBase to get this update functionality.
  • Response times of Hadoop Hive queries are not going to be as fast as Netezza, because in the background these queries run MapReduce jobs that are not as fast as Netezza. Also, they are meant for huge volumes and used in places where response time is not that important, i.e. for batch jobs.
  • Administration of the cluster is not as simple as administration of Netezza.
Conclusion:

  • If you are doing ELT (Extract, Load, Transform) with huge data loads that can be stored as flat files on HDFS, then Hadoop is a good candidate. You can load all the files and then process or transform them.
  • Volume and scalability are also deciding factors, and Hadoop is ideal for big data purposes.

Netezza:


Pros:
  • It is a traditional data warehouse appliance that supports traditional SQL statements, i.e. all DML and DDL statements. Inserts and updates are all possible.
  • Very fast and made for gigabyte-scale data warehousing purposes. Depending on the number of SPUs, it can handle gigabytes of data and process it at super fast speed. The data is distributed across multiple SPUs, and they all work together to crunch the data.
  • Administration overhead is very low compared to Oracle, DB2, and other databases.
Cons:
  • Mainly meant for structured data warehouse applications.
  • Probably cannot handle unstructured data of petabyte volumes as effectively as Hadoop.

Conclusion:

If you are building a traditional data warehouse where you also need to do ETL (Extract, Transform, Load), then Netezza is better, since you can insert, update, aggregate, etc. with ease and write your SQL for reporting and building cubes.

Redshift:


Pros:
  • Petabyte-scale computing with an on-demand cost model.
  • Traditional data warehousing and SQL are supported.
  • Can handle volumes even greater than Netezza and is built on the same model as Netezza.
  • On the cloud, with minimal administration required.
  • Easily scalable: nodes (computing power) and space can be added with a few mouse clicks.
  • Easy to create data snapshots and bring the Redshift cluster up and down as and when required.

Cons:
  • On the cloud, so data has to be transported to the cloud Redshift cluster.
  • Reading huge volumes over an ODBC/JDBC connection is still a problem.
  • If there is a lot of back-and-forth movement between the cluster and the local network, it creates some latency.
  • You have to get security clearance to move data out of your network.
Conclusion:
  • Good for ELT purposes, where data is loaded to the Redshift cluster once, processing is done on the cluster, and aggregate/summarized data is read from the cluster, or where other downstream applications source data directly from the Redshift cluster.
  • Good for traditional data warehousing purposes.

Friday, 28 November 2014

How to connect to an Amazon Redshift cluster using psql in a unix script and execute a SQL statement?



To connect to an Amazon Redshift cluster using psql, you first need to define the following variables in your unix script and enter appropriate values, then follow them with the psql command and whatever SQL statement you need to run:

AWS_ACCESS_KEY=<ENTER ACCESS KEY>
SECRET_ACCESS_KEY=<ENTER SECRET ACCESS KEY>

db=default
AWS_SERVER_ADDRESS=mytest-cluster2.adsadsadh.us-east-1.redshift.amazonaws.com

username=admin

psql -h $AWS_SERVER_ADDRESS -p 8443 -d $db -U $username  -c  "TRUNCATE TABLE PUBLIC.TABLENAME"



Note: if you have AWS_ACCESS_KEY and SECRET_ACCESS_KEY defined as environment variables, then you can use the psql command as shown below, entering the cluster address directly instead of a variable, to connect to the Amazon Redshift cluster using psql.


localhost> export AWS_ACCESS_KEY=<ENTER ACCESS KEY>
localhost> export SECRET_ACCESS_KEY=<ENTER SECRET ACCESS KEY>
localhost> psql -h mytest-cluster2.adsadsadh.us-east-1.redshift.amazonaws.com  -p 8443 -d default -U $username  -c  "TRUNCATE TABLE PUBLIC.TABLENAME"

Monday, 17 November 2014

Performance tuning in Amazon Redshift - simple tricks (table/query design)



The performance of a query in Amazon Redshift, just like in any database, depends on how well the query is optimised, the design of the table (distribution key and sort key), the type of cluster (number of nodes, disk space, etc.), which is basically the supporting hardware of Redshift, concurrent queries, number of users, etc. However, the good news is that it is not as hard to tune as a database such as Oracle. Amazon Redshift is a petabyte-scale, massively parallel database, and very high performance can be achieved with simple configuration steps. In this blog, we explain in detail how to achieve high performance in Amazon Redshift.

INVESTIGATION:
To give an idea of the performance issues we were facing, have a look at the cost of the query and the disk usage that resulted from the queries we were running on Redshift. The EXPLAIN command can be used to get the cost of the query and the execution plan. The execution plan will show those parts of the query that are very costly and need to be tuned. In the query below, you can see just by looking at the numbers beside the cost that the first two outer joins are very expensive and the last inner join is not that costly. This is not only because of the outer joins, but because of the amount of data redistribution happening in the query.


explain select * from table1 a11 left outer join table2 a12 on (a11.EMPLOYEE_KEY = a12.EMPLOYEE_KEY) left outer join table3 a13 on (a11.name = a13.name) join table3 a14 on (a11.dept_key = a14.dept_key)


->  XN Hash Left Join DS_BCAST_INNER  (cost=683864529053.53..9912121211212999999999999996733616880411.00 rows=8226899613 width=1388)
      Hash Cond: ("outer".employee_key = "inner".employee_key)
      ->  XN Hash Left Join DS_DIST_BOTH  (cost=683646717104.93..999999121211121999999999967336168804116 rows=8226899613 width=1372)
            Hash Cond: ("outer".name = "inner".name)
            ->  XN Hash Left Join DS_DIST_NONE  (cost=346363.65 rows=822689 width=1372)
                  Hash Cond: ("outer".dept_key = "inner".dept_key)

 
Data is distributed among the various compute nodes in Amazon Redshift, and the amount of data movement between nodes plays a large part in query performance. If you see a lot of DS_DIST_NONE in your query plan, then you have the least amount of data redistribution; if you see other keywords like DS_DIST_BOTH or DS_BCAST_INNER, then more data redistribution is happening and performance can be bad.

The meaning of the explain plan attributes, as per the Amazon Redshift documentation, is below:

DS_BCAST_INNER - A copy of the entire inner table is broadcast to all compute nodes.
DS_DIST_ALL_NONE - No redistribution is required because the inner table was distributed to every node using DISTSTYLE ALL.
DS_DIST_NONE - No tables are redistributed: collocated joins are possible because corresponding slices are joined without moving data between nodes.
DS_DIST_INNER - The inner table is redistributed.
DS_DIST_ALL_INNER - The entire inner table is redistributed to a single slice because the outer table uses DISTSTYLE ALL.
DS_DIST_BOTH - Both tables are redistributed.


The disk usage resulting from the above query can be seen from the Amazon Redshift web console in the performance tab. In our case the disk space usage was reaching close to 100%, which is not good news at all, since that single query was consuming all the disk space. This is an indication of poor query performance. This query was run on a dw2.8xlarge cluster, which is a large node type with relatively high disk space.




Also, another way to check whether your query is performing poorly is to check whether there are a lot of writes to disk, using a query such as the one below (2044 is the query id of the query being investigated):

select query, step, rows, workmem, label, is_diskbased
from svl_query_summary
where query = 2044 order by workmem desc;



SOLUTION:

We had to make a lot of changes to fix the performance issues. All the factors that influence query performance, and the required changes, are discussed below:

1) Table design (data distribution and sorting) - In Redshift the data is distributed across all the nodes. Whenever a query is executed, Redshift has to bring the data from the different nodes and use it in joins, aggregation, sorting, group by, etc. If the amount of data that has to be grabbed from these nodes is high, it results in a lot of traffic and poor query performance. To reduce the amount of traffic between nodes, the data has to be distributed properly between the nodes. This can be achieved through proper table design, data compression, distribution keys, sort keys, and query joins. First, let's look at the table design. Below is the table create statement. The distribution style (DISTSTYLE), distribution key (DISTKEY), and sort keys are very important in defining how the data is distributed among the nodes.

CREATE [ [ LOCAL ] { TEMPORARY | TEMP } ] TABLE table_name
( column_name data_type [ ENCODE encoding ] [, ... ] )
[ DISTSTYLE { EVEN | ALL | KEY } ]
[ DISTKEY ( distkey_identifier ) ]
[ SORTKEY ( sortkey_identifier [, ... ] ) ]



DISTSTYLE ALL puts a copy of the data on all the nodes and makes it available for joins without having to move the data between nodes. This is ideal for dimension tables that do not have many records in them. DISTSTYLE KEY distributes data based on a defined key column and is better for tables that are huge, where you can use a column that is used in joins to distribute the data. DISTSTYLE EVEN is used mostly to evenly distribute the data, and when you can't decide between ALL and KEY.

*Trick*: Use DISTSTYLE ALL for all the dimension tables that are not big.

DISTKEY (column name) - The column name used here should be one that is used in joins, and a distribution key should be defined for all the large fact tables. The distribution key (DISTKEY) helps in distributing the data based on the distkey column, so that during joins only the required data is brought to the nodes. There can be only one distkey (distribution key) column defined per table.

*Trick*: Define Distkey for all the large fact tables and the dist key column should be the one used in the joins.

If the distribution key from the large fact table is joined to any column of a dimension table that is in DISTSTYLE ALL, then in your explain plan you will see the attribute DS_DIST_ALL_NONE, which means there is no data redistribution and the cost of the query is very low. This is the state you want most parts of the query to be in for best performance.

SORTKEY (column_name1, ...): There can be multiple sort keys, and the sort keys help a lot in storing the data on the nodes in sorted order, which helps with group by and order by operations.

*Trick*: If there are columns such as date or timestamp that you use in order by operations, then define those as sort keys.
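Putting these tricks together, a minimal sketch (employee_fact and dept_dim are hypothetical tables, not the ones from the examples above): the small dimension is DISTSTYLE ALL, and the large fact table is distributed on the join column and sorted on the date column used in order by.

-- small dimension: copy on every node, no redistribution during joins
create table dept_dim (
    dept_key   integer      not null,
    dept_name  varchar(100) not null
)
diststyle all;

-- large fact table: distributed on the join column, sorted on the date column
create table employee_fact (
    employee_key  integer not null,
    dept_key      integer not null,
    business_date date    not null
)
distkey (dept_key)
sortkey (business_date);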

2) Table design (compression) - In Amazon Redshift tables, the encoding used for compression can be defined along with the column name, in the format shown below. There are various encodings available.

CREATE TABLE table_name (column_name
            data_type ENCODE encoding-type)[, ...]

*Trick*: An easy way to decide on the encoding, if you are using the COPY command to load data from S3 to Redshift, is to use the COMPUPDATE ON option in the copy command as shown below. The copy command chooses the best compression for the columns it is loading data into.

copy <tablename> from 's3://mybucket/files.txt'
credentials 'aws_access_key_id=<access-key-id>;aws_secret_access_key=<secret-access-key>'
delimiter '|' COMPUPDATE ON

If you want to explicitly define the encoding, such as when you are inserting data from another table or set of tables, then load some 200K records into the table and use the command ANALYZE COMPRESSION <tablename> to make Redshift suggest the best compression for each of the columns. You can use those suggestions while recreating the table.


3) Table design (constraints) - Primary keys and foreign keys, though not enforced by Redshift, are apparently used by the query planner to execute queries more efficiently. Hence, define those keys in your create table statements for better performance.
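For instance, on the hypothetical tables sketched above, the informational constraints could be declared like this (a sketch only; Redshift does not enforce them):

-- not enforced by Redshift, but visible to the query planner
alter table dept_dim add primary key (dept_key);
alter table employee_fact add foreign key (dept_key) references dept_dim (dept_key);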


4) Query design -
Now that we have discussed table design, the actual query design should use all of those table design features to execute the query efficiently. As discussed before, the trick in designing good queries is to have proper joins, i.e. use the distribution key from the large fact table to join to any column of a dimension table that is in DISTSTYLE ALL. If the dimension table is not in DISTSTYLE ALL, then use the distkey (distribution key) from the dimension table in the joins.
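Continuing with the hypothetical tables above, a join that follows this rule needs no redistribution, which you can confirm in the explain plan:

explain
select d.dept_name, count(*)
from employee_fact f
join dept_dim d on f.dept_key = d.dept_key
group by d.dept_name;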

5) WLM queue setting:

The WLM queue setting in your cluster configuration determines the number of queries running on your cluster and is an essential part of performance tuning. Keeping it low means not many queries can run at the same time, which provides more memory for each query. Set the concurrency and memory settings for each user group in the cluster for better performance.
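For a single heavy query, you can also temporarily claim more slots from the queue for the current session; a small sketch:

set wlm_query_slot_count to 3;
-- run the heavy query here, then reset to the default
set wlm_query_slot_count to 1;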





6) Copy Command Performance tuning:
If you want your copy command to be faster, then split the data into multiple files so that they can be loaded in parallel into the Redshift database.


All the steps discussed above should help in running queries more efficiently. If you need more details, refer to the AWS documentation:
http://docs.aws.amazon.com/redshift/latest/dg/tutorial-tuning-tables.html


Also check:
http://dwbitechguru.blogspot.ca/2014/11/amazon-redshift-top-concerns.html 

Thursday, 13 November 2014

Troubleshooting Amazon Redshift connection issues



a) Could not connect or resolve host name


Resolution/Cause: This means the cluster is not up or your cluster address is wrong. Check the odbc/jdbc settings. Bring the cluster up.

b) ][ODBC PostgreSQL Wire Protocol driver]Seeding the PRNG failed, most likely because the system does not have /dev/random. Connection String.

][Database Classes][Error] Failed to connect to data source. Error message: Connect failed.

Cause: The cluster is down for maintenance or is not available.

c) [ODBC PostgresSQL Wire protocol driver] FATAL: no pg_hba.conf entry for host "10.10.140.1". user "something", database "something", SSL off


Resolution: Have the SSL mode on, i.e. in Allow mode. This is a client-side setting. If you are using AQT, Workbench, or any other client to connect, you need to have SSL mode on if you are getting the above error.

d) [ODBC PostgreSQL Wire Protocol driver]SSL I/O Error.


Resolution: This could be caused by the cluster not being up, the ODBC driver not being configured properly, or the connection being dropped due to a huge volume of data. The most likely cause is the connection being dropped because the ODBC driver is trying to read large chunks of data.

Tuesday, 11 November 2014

Amazon Redshift - Top concerns and issues


 

The Amazon Redshift cloud data warehouse is going to change the data warehousing landscape sooner or later. With more and more companies wanting to expand their data warehousing capabilities and wanting high-powered appliances that can churn huge loads, Amazon Redshift is a very attractive proposition. It offers a petabyte-scale data warehouse appliance which is very cheap, payable on demand on an hourly basis, requires little or no administration or maintenance, and delivers very high performance. All we have to do is find a secure, reliable way to upload data to the Amazon Redshift cloud, which is not hard by the way, and then go crazy doing any kind of data processing. It is a great solution for sure, but before you jump on the bandwagon, a few concerns should be noted. The top Amazon Redshift concerns are listed below:

1) How much data do you have to upload and download, and at what frequency? - If you are uploading gigabytes of data to Amazon S3 and then loading it into Amazon Redshift, there is some amount of time required to upload the data, which also consumes huge bandwidth. Again, not a problem if you are doing it once a week or month. If you are doing it daily or very frequently, then yes, this will add latency and network cost.

2) Integration with ETL/BI tools - If you are reading/writing huge volumes of data for ETL/BI processing directly over ODBC/JDBC, then most likely this will be challenging, since the network is going to slow it down and there are not many ODBC drivers (at least I haven't come across any so far) that can do bulk loads directly to Redshift. ODBC drivers do not compress data, and if you are reading millions of records then they are going to run out of buffer or the connection will fail at some point. The recommended solution for huge volumes is to upload/download data as flat files through S3 to and from Amazon Redshift.
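For the download direction, a minimal sketch of an UNLOAD to S3 (the table name, bucket, prefix, and credentials below are placeholders) looks like this:

-- unload query results as compressed, delimited flat files to S3
unload ('select * from myschema.mytable')
to 's3://mybucket/extracts/mytable_'
credentials 'aws_access_key_id=<access-key-id>;aws_secret_access_key=<secret-access-key>'
delimiter '|' gzip;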



3) Security clearance - Some companies, like insurance companies and banks, have policies against uploading data to the cloud. If you can get all the security clearances, then it is possible to upload data to the cloud in a secure and encrypted way and keep it encrypted on the cluster in such a way that even Amazon cannot see it. In any case, there are some hurdles to be crossed before companies get all the internal clearance to upload data to the Amazon Redshift cloud.

4) Special characters - Redshift stores data in the UTF-8 character set. It can copy data from files in UTF-8 or UTF-16 format, but it converts them all into the UTF-8 character set. That means some characters, like French characters with accents, may not get loaded properly. The solution Redshift offers is to replace all of them with a single UTF-8 character while copying. If anyone has figured out a way to load French characters into Amazon Redshift, please message me. At the moment this is a big concern for me.

Also, check out:
http://dwbitechguru.blogspot.ca/2014/11/performance-tuning-in-amazon-redshift.html